2023:Program/GLAM, Heritage, and Culture/WVZNL9-Helping Wikisource recognize handwritten documents

From Wikimania

Title: Helping Wikisource recognize handwritten documents


Satdeep Gill (WMF)

Senior Program Officer, Culture and Heritage, Wikimedia Foundation

Kinneret Gordon

My name is Kinneret and I am based in Tel Aviv, Israel. I work on the Strategic Partnerships team at the Wikimedia Foundation and I help lead the Transkribus partnership which aims to support Wikisource by providing handwriting recognition models.

Sara Mansutti

I work as Education Manager for READ-COOP, a European Cooperative Society responsible for maintaining and advancing the Transkribus platform. Our goal is to unlock our written past and make historical documents more accessible thanks to AI. Additionally, I am pursuing my PhD in Digital Humanities at University College Cork.

Pretalx link

Etherpad link

Room: Room 311

Start time: Wed, 16 Aug 2023 15:20:00 +0800

End time: Wed, 16 Aug 2023 15:50:00 +0800

Track: GLAM, Heritage, and Culture

Submission state: confirmed

Duration: 30 minutes

Do not record: false

Presentation language: en

Abstract & description

The Wikimedia Foundation has partnered with READ-COOP to bring Transkribus, an AI-driven handwriting recognition tool, to Wikisource. This session will introduce Transkribus to the Wikimedia community and share resources for creating new handwriting recognition models to support Wikisource.


The Wikimedia Foundation has partnered with READ-COOP to bring Transkribus to Wikisource in support of the Wikisource Loves Manuscripts project. The project aims to digitize and transcribe more than 20,000 pages of Indonesian manuscripts to foster their preservation and accessibility.

As the first phase of the project is focused on Indonesia, handwriting models for Indonesian languages are being created with the help of our partner IIIT Hyderabad. Google OCR and Tesseract, the two OCR engines already integrated into Wikisource, do not support Balinese and Javanese, which is why Transkribus was brought in. Transkribus is an AI-powered platform for text recognition, transcription and searching of historical documents. With adequate training material, OCR models can be trained for any language or script and produce automatic transcriptions with a typical accuracy of around 95%. In this session, members of READ-COOP and WMF will demonstrate how Transkribus works and share resources to help other Wikimedia communities create new OCR models. We hope that this collaboration will inspire other Wikimedia communities to engage contributors in preserving and transcribing manuscripts as well as unlocking their written past.

Other relevant tracks: GLAM, Heritage, and Culture

Further details

Qn. How does your session relate to the event themes: Diversity, Collaboration, Future?

This partnership is helping to improve the diversity of languages and document types on Wikimedia projects so that the future of free knowledge is rich with historical texts and documents.

Qn. What is the experience level needed for the audience for your session?

Everyone can participate in this session

Qn. What is the most appropriate format for this session?

  • [ ] Onsite in Singapore
  • [ ] Remote online participation, livestreamed
  • [ ] Remote from a satellite event
  • [x] Hybrid with some participants in Singapore and others dialing in remotely
  • [ ] Pre-recorded and available on demand