
2023:Program/GLAM, Heritage, and Culture/ZWCWTT-Archival Speech Data For The Win: How Archival Audio-Visual Recordings Could Help The Future Of The Wikimedia Movement

From Wikimania

Title: Archival Speech Data For The Win: How Archival Audio-Visual Recordings Could Help The Future Of The Wikimedia Movement


Subhashish Panigrahi

Subhashish Panigrahi is a public-interest archivist, researcher, non-fiction filmmaker, and civil society leader from India. A National Geographic Explorer, he has directed and produced nine critically acclaimed nonfiction films, including “MarginalizedAadhaar”; “Nani Ma”, the first documentary in the Baleswari-Odia dialect; and “Gyani Maiya”. His films often focus on conserving endangered languages, digital rights, and the open internet movement, particularly in South Asia. He has created a speech data repository in his native language, Odia, and its Baleswari-Odia dialect. With over 62,000 audio recordings, it is the largest in the language, and all recordings are released under the CC0 1.0 Universal Public Domain Dedication.


Room: Room 326

Start time: Sat, 19 Aug 2023 14:15:00 +0800

End time: Sat, 19 Aug 2023 14:35:00 +0800

Type: Lecture

Track: GLAM, Heritage, and Culture

Submission state: confirmed

Duration: 20 minutes

Do not record: false

Presentation language: en

Abstract & description

Abstract

Low-resource languages lack archival speech data, which hinders the development of Automatic Speech Recognition (ASR) technology for both broader access to knowledge and accessibility. The recent growth of corporate ownership of language data underlines the need for more community-owned ASR development. The workshop will demonstrate the process of archiving oral history through a filmmaking project and how audiovisual media was used to build high-quality speech data.

Description

Most low-resource languages have very little archival data, particularly speech data, i.e. audio and video recordings of people. As a result, their speaker communities miss out on research and development in speech recognition, an area that could support small and under-resourced communities, especially for accessibility and access to information. Automatic Speech Recognition (ASR) is an important area of Human-Computer Interaction (HCI): a technology that lets people interact with computers using their voices. It can make HCI more accessible for native speakers and help ensure that language is not a barrier to accessing information. On the other hand, large corporations or political parties could also use ASR-powered applications to exploit minoritized communities. Nevertheless, community-owned ASR development remains critical. Given the Wikimedia movement’s focus on GLAM, Wikimedians often have the opportunity to engage their own communities and even document cultural heritage as audio and video.

This workshop will delve into the process of archiving oral history during a filmmaking project and using audiovisual media to build high-quality speech data. The documentary footage comes from the 2022 openly licensed short “Nani Ma”, which was filmed in the rare 1900s register of the Baleswari dialect of India’s Odia language, a register that had never been recorded before. The demonstration will include a behind-the-scenes look at this process. As many as 300 words, with varied intonations based on different emotional settings, were extracted from an archival recording of about 6 minutes. As a bonus, the entire process was done using only Free and Open Source Software. In this case, the elder whose speech was recorded, and many of her generation, are no longer with us, which underlines how vital documenting oral history can be for society and how public, open, and annotated or transcribed media can help language communities.
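As a rough illustration of the word-extraction step, the sketch below cuts one short clip per annotated word out of a longer WAV recording. It is not the workflow used for “Nani Ma”; it assumes the audio has already been exported to an uncompressed WAV file and that word boundaries are available as (label, start, end) tuples in seconds, for example from manual annotation. The function name slice_words and the output file naming are hypothetical, and only Python's standard library is used.

```python
import wave
from pathlib import Path


def slice_words(source_wav, annotations, out_dir):
    """Write one WAV clip per (label, start_s, end_s) annotation.

    Hypothetical helper: labels are assumed to be safe to use in file names.
    Returns the list of clip paths that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with wave.open(str(source_wav), "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        total = src.getnframes()
        for index, (label, start_s, end_s) in enumerate(annotations):
            # Convert seconds to frame offsets and clamp to the recording.
            start = max(0, round(start_s * rate))
            end = min(total, round(end_s * rate))
            if end <= start:
                continue  # skip empty or inverted spans
            src.setpos(start)
            frames = src.readframes(end - start)
            out_path = out_dir / f"{index:04d}_{label}.wav"
            with wave.open(str(out_path), "wb") as dst:
                # Same channels, sample width, and rate as the source;
                # the frame count in the header is corrected on close.
                dst.setparams(params)
                dst.writeframes(frames)
            written.append(out_path)
    return written
```

Calling slice_words("recording.wav", [("word", 1.20, 1.85)], "clips") with a hypothetical recording would write a single clip covering that span. Keeping the label and index in the file name keeps each clip traceable to its transcript entry, which matters when the clips are later published with annotations.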

Further details

Qn. How does your session relate to the event themes: Diversity, Collaboration, Future?

Diversity is central to this workshop, as it focuses on low-resource languages: indigenous, endangered, and other languages that have fewer speakers, who also have less access to resources. Furthermore, it highlights the use of frugal and open source technology, collaboration within a community, and inter-community collaboration (through the use of community-led tech initiatives and platforms).

Qn. What is the experience level needed for the audience for your session?

Average knowledge about Wikimedia projects or activities

Qn. What is the most appropriate format for this session?

  • [ ] Onsite in Singapore
  • [x] Remote online participation, livestreamed
  • [ ] Remote from a satellite event
  • [ ] Hybrid with some participants in Singapore and others dialing in remotely
  • [ ] Pre-recorded and available on demand