The Leddy Library is a lead participant in a project to make a large Indonesian newspaper collection available online.
The Optical Character Recognition (OCR) of newspaper pages collected for the Violent Conflict in Indonesia Study is carried out at night using grid processing techniques and library workstations.
The study was conducted by the World Bank Conflict and Development team and used local newspaper monitoring to track incidents of violence. More than 1,000,000 newspaper pages are undergoing OCR to make the text captured in the page images searchable and reusable.
“Until this point, it has seemed almost impossible to imagine that this collection could be indexed to the keyword level without requiring formidable financial resources,” says Samuel Clark, who spent five years working in the social development unit of the World Bank. He now coordinates the study’s images while pursuing a doctoral degree at Oxford University.
He credits the Project Conifer integrated library system with making the work possible.
“Research libraries typically invest $1 to $1.50 per page for newspaper digitization projects,” Clark says. “Project Conifer has achieved efficiencies through open collaboration that are impressive and unique. This collection could literally be the largest digital Indonesian newspaper archive on the planet and will benefit Indonesian researchers around the world.”
Art Rhyno, head of the Leddy’s systems department, led the development of the open source OCR grid processing suite in the Project Conifer systems group.
He notes: “Open Source Software is the software equivalent of Open Access. It is as fundamental to library values as sharing books and creating forums for community empowerment.”
Project Conifer is a provincial consortium started by the University of Windsor, Algoma University and Laurentian University to share an integrated library system. It now includes more than 20 organizations and stretches over 1,600 kilometres between participating sites in Ontario.