2019:Transcription/Building Locally Relevant Knowledge with Wikisource

Gbowee · Sunday 14:00 – 15:00

SDGs

Description

Wikisource is the free library of public domain and freely licensed texts. In many contexts, it is the only project that helps communities using non-Latin scripts build locally relevant knowledge on the web. Even developed communities find great value in Wikisource because it provides annotated texts that contain interwiki links to various topics on different projects.

This session aims to be a conversation with Wikisource organizers from various communities (such as the Indic and Armenian communities and minority-language communities in Italy and France). They will share how they organize their communities to support the use of and contribution to Wikisource in their different contexts: from digitizing old works for Wikisource and organizing proofreading contests to encouraging Wikisource contribution in order to advance mother-language learning.

We will discuss the following questions: What are the major obstacles and opportunities in supporting new communities that want to contribute to the projects? What kinds of support would help them organize more contributors to the Wikisource community? What are their biggest achievements so far? How should we encourage other communities to digitize their knowledge through Wikisource?

Relationship to the theme

This session will address the conference theme — Wikimedia, Free Knowledge and the Sustainable Development Goals — in the following manner:

This session relates to the fourth SDG, Quality Education, as Wikisource helps to bring freely licensed, locally relevant source texts online.

Session outcomes

At the end of the session, the following will have been achieved:

Demonstrate the value of Wikisource, especially for emerging communities
Initiate conversation about potential opportunities for organizing and collaborations to grow language representation in Wikisource

Session leader(s)

Session type

Each Space at Wikimania 2019 will have specific format requests. The program design prioritises submissions which are future-oriented and directly engage the audience. The format of this submission is a:

Panel with audience Question & Answer session

Requirements

The session will work best with these conditions:

Room:

A normal room with 5-6 chairs for the panel and a projector

Audience:

30-50 participants

Everyone is welcome, especially Wikisource enthusiasts and community members from emerging communities

Recording:

A single static camera will work just fine

SESSION SUMMARY

Copy of the notes from the Etherpad, taken by Ruthven, Björg, Peter Isotalo

Chair of session: Satdeep Gill (Punjabi Wikisource)

Wikiclubs (in Armenia): school groups active on Armenian Wikisource

A source for Wikipedia articles

Acquire technical skills (e.g. native keyboard)

Cooperate with Wikipedia of Armenia

Western Armenian and Eastern Armenian. Armenian is the official language of Armenia as well as of the de facto Republic of Artsakh.

Most Armenian Wikisource editors are quite young (maybe around 11-12 years old). The older Armenian editors work on Wikipedia.

Telugu: WhatsApp groups discuss different topics. Some of them are active and discuss difficulties and the language.

Some of the groups hold meetings, which are a good occasion to bring new users to Wikisource and an opportunity to defend the language

Groups select one theme, like a book, to discuss for a short period of time (1 month); this text can then be imported into WS

Wikiclub in a college of engineering. WS can be the focus project for 1 year.

Told about an example where Wikipedians who had for one reason or another failed on Wikipedia moved on to Wikisource and had more success there. This means they might be a possible source for recruitment.

Initially, OCR had to be done through Google, but a tool has now been developed. However, it requires constant internet access.

Noted that a lot of other languages of India still need better Wikisource technical support and infrastructure.

Telugu ranks fourth among the languages with the highest number of native speakers in India, and fifteenth in the Ethnologue list of most widely-spoken languages worldwide

Language supported by Google

Punjabi Wikisource community: more than 100M speakers, but few texts online. Copyright problems. Cooperated with a ruined library to obtain copyright-free works. Scanned 7000 pages in 2 months

Proofreading contests and campaigns to increase the community. Proofread in 40 days. The fastest-growing wiki project in the world.

Living authors releasing their works under CC BY

"Important" books, such as books by well-known authors (somewhat badly digitized) or encyclopedias

Issues in scanning Indic languages with OCR

Nicolas vest part of Breton Wikisource. Breton is spoken in Brittany and is the only Celtic language not mainly spoken in the British Isles. Because Brittany is part of France, the language is in danger of becoming extinct, having declined from more than 1,000,000 speakers around 1950 to about 200,000 in the first decade of the 21st century. Breton is classified as "severely endangered" by the UNESCO Atlas of the World's Languages in Danger. However, the number of children attending bilingual classes rose 33% between 2006 and 2012, to 14,709.

Around 3 active editors. It might be possible to expand the user base, since there are a lot of older speakers who are retired and have time, but who are also not very good at tech. One problem is that Breton orthography changed quite drastically around 70 years ago, which means that most works on Breton Wikisource can be difficult for modern speakers to read. A lot of the books are also in both Breton and French, so to some extent it's a two-project effort. But there's actually a realistic possibility of collecting all Breton literature, since the corpus is relatively small. The University of Toronto has digitized most of the literature.

Q & A

Q: Do you have any GLAM support?
A: Universities (Breton), Government (Telugu), National Library and Museums, sometimes Wikimedia Armenia (Armenian)

Q: How does it work in projects with so few users?
A: Keep the community intact, training to have more people, ...

Q: OCR tools are developed for Latin script. How is it with other alphabets?
A: Telugu works well with Google; other small languages not supported by Google face more challenges. Breton is a Latin-script language, but OCR is not good because there is no corpus.