OCR Text Correction is a Good Project for Crowdsourcing

Correcting text produced by OCR (optical character recognition) is a great project for crowdsourcing because the work can be isolated and scaled. Essentially, the correction can be broken into small tasks, and the overall effort benefits from loads of small contributions made through that small-task interface. A great deal of digital library work can’t be sliced, scaled, or isolated like this, and with so much work to do, it’s always nice when a project can involve others for the benefit of everyone.

The National Library of Finland recently came out with new games-as-tools for correcting OCR text, and their website explains:

We need your help. Most of the information in the library’s newspaper archives has already been copied into computer databases using computerized text recognition. The problem is that computers fail to recognize all the words. Especially when the quality of the source material is poor, the results need to be fixed by hand. This requires a lot of manual work.

At the moment, when you play games in Digitalkoot you help correct words. Later this year you will also be able to help structure the documents and tag images.

I’d be interested to hear whether anyone has reports on the success of this method. Wrapping the work in a game-as-tool interface takes a higher level of investment than many other approaches, and I don’t know whether that leads to greater or lesser success. The National Library of Australia has been phenomenally successful by allowing people to simply contribute the corrected text through an easy, no-frills interface as seen here.

I’m partial to the National Library of Australia’s method because it requires less initial resource investment, it has proven to keep returning on the design investment over the long term, and it appeals to such a broad demographic that I would expect it to be the most successful model. Of course, I’m most partial to whatever works, so I’d love to know if folks have reports on the success rates of Finland’s games or other methods for crowdsourcing OCR correction.
