2019:Transcription/Sharing best practices among Wikisource communities
This is an Accepted submission for the Transcription space at Wikimania 2019.
Description
This session gives Wikisource communities the opportunity to share their best practices with each other. As with other Wikimedia projects, the Wikisource projects in different languages are not equal in content, quality, editor base, or the problems they face. Some Wikisource communities are more experienced and mature than others. To attain equity, mature projects therefore have a responsibility to share their best practices, so that others do not have to reinvent the wheel again and again. This session will bring members of different Wikisource communities together to learn from each other. It may also help to develop a common space for organized future interaction that serves this purpose.
Relationship to the theme
This session will address the conference theme (Wikimedia, Free Knowledge and the Sustainable Development Goals) in the following manner:
- SDG 10: Reducing Inequality: by empowering Wikisource communities as a whole, with the goal of attaining equity.
Session outcomes
At the end of the session, the following will have been achieved:
- Tools, scripts, gadgets etc. used by some communities will have been adopted by a few of the other communities.
- Attendees will address problems common to almost all communities, and identify and implement likely solutions already practiced by others.
- A space for sharing best practices in an organized way in the future will be created, if possible.
Session leader(s)
- Bodhisattwa (talk) 18:03, 12 May 2019 (UTC) - Bengali Wikisource (I would expect one community member from each language Wikisource to attend as a fellow session leader)
- Ankry (talk) 18:07, 2 June 2019 (UTC) - Polish Wikisource
Session type
Each Space at Wikimania 2019 will have specific format requests. The program design prioritises submissions which are future-oriented and directly engage the audience. The format of this submission is a:
- Roundtable discussion forum
Requirements
The session will work best with these conditions:
- Room:
Round-table seating
- Audience:
Active Wikisource community members and leaders participating in Wikimania.
- Recording:
Multiple conversations will happen around the round table, so a single 360-degree video camera would be most suitable. Otherwise, a single movable video camera with an operator will be sufficient.
SESSION SUMMARY
- Copy of the notes from the Etherpad, taken by VIGNERON, Björg
PART 1 (Bengali Wikisource)
- Why this session:
- Small communities, where everything was built from scratch, keep reinventing the wheel everywhere and every time! Things are already happening in other languages.
- This is stressful and time-consuming.
- The goal is to build a common platform where we can share best practices with each other.
- Create a shared space on mul.source or on Meta.
- Bengali Wikisource: I will start with a few minutes and 5 examples:
- Scraping and upload script: a Python script to mass-download from the British Library, https://bn.wikisource.org/s/g9ay
- OCR: OCR4Wikisource (https://tools.wmflabs.org/ws-google-ocr/), fully integrated in Wikisource
- Catalogue system based on the Listeria generator, for authors and works; it works with data stored on Wikidata (https://tools.wmflabs.org/listeria/ https://www.wikidata.org/wiki/Wikidata:Listeria)
- Proofread contest: https://tools.wmflabs.org/wscontest
- QR code gadget (stored on the user page of Jayprakash12345)
Awareness! Last year, we asked for a rapid grant to create short videos for readers: https://www.youtube.com/watch?v=ffD2xOml8MQ
After that, readership increased by 10 (?) %.
PART 2 (Polish Wikisource)
Ankry for Polish Wikisource
OCR
- 3-level proofreading: red - yellow - green
Changed the name of the first ("red") quality level, removing the "not", which has a negative impact.
Pure OCR is marked blue ("problematic"), or simply not uploaded at all!
The yellow button is hidden by default for the page creator (done with AbuseFilter and JS).
IP users can change the level.
Newbies are welcome.
Anybody can respond.
Pages with only an image (no text at all) are "grey" and considered to be outside the proofreading process.
Headers are optional and don't influence the quality levels.
No use of noinclude (but a specific template has been created for text that appears "only in a page").
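The colour scheme described above can be sketched in Python as a small model of page quality levels. This is only an illustration, not code used on Polish Wikisource: the numbering 0 to 4 follows the ProofreadPage extension, while the names `Quality`, `index_progress` and `is_finished` and the "finished" rule are assumptions made for this example.

```python
from collections import Counter
from enum import IntEnum


class Quality(IntEnum):
    """Page quality levels as numbered by the ProofreadPage extension."""
    WITHOUT_TEXT = 0   # grey: image-only page, outside the proofreading process
    NOT_PROOFREAD = 1  # red: labelled "Przepisana" (rewritten) on Polish Wikisource
    PROBLEMATIC = 2    # blue: used for pure OCR
    PROOFREAD = 3      # yellow
    VALIDATED = 4      # green


def index_progress(levels):
    """Return the share of pages at each level, ignoring grey pages."""
    counted = [Quality(lv) for lv in levels if lv != Quality.WITHOUT_TEXT]
    if not counted:
        return {}
    tally = Counter(counted)
    return {q.name: tally[q] / len(counted) for q in Quality if q in tally}


def is_finished(levels):
    """Hypothetical rule: an index is finished when every text page is green."""
    return all(lv in (Quality.WITHOUT_TEXT, Quality.VALIDATED) for lv in levels)
```

Grey pages are left out of the denominator because the notes treat them as not being part of the proofreading process at all.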
Technical issues
- "Rewritten" ("Przepisana") instead of "not proofread" ("Nieskorygowana") as the label of the red category
- Creating a new index using an existing one as a template is available only to members of the editor group (editor = advanced user = 100 edits)
- List of unfinished indexes
- a bit hard to generate; one pixel is one page, so it is not really readable :(
- updated by bot
- Problem with the page size limit (2 MB, the wiki standard)
- pages with and without references are handled differently
- the limit reduces the browser workload
- hacky workaround using substitution (works fine, but it breaks the weekly update done by a bot; this method is used on ~30 pages) or client-side with JS templates
- the "worst-case scenario" is 17 MB: an almanac about the life of science, full of images and references; it is not presented to readers by default
- (Some) people want to download the whole book in one piece, not 50 chapters one by one
- Text collections, organized by topic, in a separate namespace
- Lua-based page splitting for dictionaries, not using the section marker (which can make the code less readable).
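To make the page size problem above concrete, here is a minimal Python sketch that groups consecutive chapters into parts that each stay within the limit. It is not the workaround used on Polish Wikisource (the notes describe substitution and client-side JS templates); the function name `split_into_parts` and the input format are assumptions for illustration. The 2048 KiB figure is the default of MediaWiki's `$wgMaxArticleSize` setting.

```python
MAX_ARTICLE_SIZE = 2048 * 1024  # bytes; default $wgMaxArticleSize is 2048 KiB


def split_into_parts(chapters, limit=MAX_ARTICLE_SIZE):
    """Greedily group consecutive chapters so each part stays within `limit`.

    `chapters` is a list of (title, size_in_bytes) pairs (hypothetical input).
    A chapter larger than the limit on its own gets a part to itself and is
    also reported in the second return value.
    """
    parts, oversized = [], []
    current, current_size = [], 0
    for title, size in chapters:
        if size > limit:
            oversized.append(title)
        # Close the current part if adding this chapter would exceed the limit.
        if current and current_size + size > limit:
            parts.append(current)
            current, current_size = [], 0
        current.append(title)
        current_size += size
    if current:
        parts.append(current)
    return parts, oversized
```

Sizes should be measured on the UTF-8 encoded wikitext. Transcluded content and references can push the expanded page well beyond the raw size, so a real check would need some margin.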
Useful gadgets
- "old" proofread toolbar
- interface changed
- the next scan is downloaded automatically and cached by the browser, from a DjVu or PDF file (can be imported as JS from the multilingual Wikisource, with a "preview" button; see https://wikisource.org/wiki/User:Alex_brollo/eis.js)