Within a few years, Wikidata has developed into a central knowledge base for structured data through the collaborative efforts of its peer production community. One benefit of peer production is that knowledge is curated and maintained by a wide range of editors with different cultural, experiential and educational backgrounds, which ideally results in fewer biases and a more diverse knowledge base.
Ensuring data quality is thus of utmost importance: the goal of Wikidata is to “give more people more access to knowledge”, and the data therefore needs to be “fit for use by data consumers” (Wang et al., 1996). The Wikidata community has already developed methods and tools that monitor relative completeness (e.g., the Recoin gadget; Balaraman et al., 2018), encourage link validation and correction (e.g., Mix’N’Match), and help editors observe recent changes and identify vandalism. Moreover, the community started a global discussion about relevant dimensions of data quality in a recent RFC, which took a survey of Linked Data quality methods as its starting point in order to better describe and categorize quality issues and to add further quality aspects/dimensions, with the goal of developing a data quality framework for Wikidata. Despite this progress, recent research has shown the dominant role of a Western perspective in the represented languages (Kaffee and Simperl, 2018); more work is therefore needed to increase knowledge diversity. Supporting such knowledge diversity is thus a major data quality concern: Wikidata should cover a wide variety of topics, from various trustworthy sources, even where facts can be contradictory.
In this talk, we would like to present a classification of existing tools for data quality monitoring and data quality assurance in the context of Wikidata (extending previous work), drawing the Wikimedia community’s attention to gaps and opportunities for editors and developers to improve the collaborative data management cycle. Additionally, we will provide a comparison of data quality management strategies in Wikidata and Wikipedia, and present a summary of scientific findings relevant to the topic.
This session will address the conference theme — Wikimedia, Free Knowledge and the Sustainable Development Goals — in the following manner:
Data quality can be interpreted as a topic orthogonal to the SDGs listed in the conference’s theme. Data in any of the SDG domains needs to be of high quality in order to be consumed and used, whether the purpose is to analyse gender inequality via Wikidata item descriptions (SDG 5 Gender Equality), implement apps that help people gain knowledge in a specific domain (SDG 4 Education), or visualize a map showing the evolution of the climate disaster (SDG 13 Climate Action).
At the end of the session, the following will have been achieved:
Attendees will learn about the status quo of data quality in Wikidata, and together (ideally in a Q&A slot) we will define a roadmap of concrete edit types, MediaWiki features and external tools to be developed in the near future.
Cristina Sarasua (Username:Criscod), Universität Zürich email@example.com
Mariam Farda-Sarbas (Username:Mariamfs), Freie Universität Berlin firstname.lastname@example.org
Claudia Müller-Birn (Username:Claudiamuellerbirn), Freie Universität Berlin email@example.com
Lydia Pintscher (Username:Lydia_Pintscher_(WMDE)), Wikimedia Deutschland firstname.lastname@example.org
Each Space at Wikimania 2019 will have specific format requests. The program design prioritises submissions which are future-oriented and directly engage the audience. The format of this submission is a:
- Discussion-based training workshop
The session will work best with these conditions:
Small classroom / lecture hall or round-table seating (with projector).
30 - 50 people
A basic understanding of Wikidata’s data model and workflow would be desirable. However, we would like to be inclusive and invite people from other related Wikimedia projects. So, time permitting, we could provide a flash introduction to what people need to know before we dive into the details of data quality in Wikidata.
We agree to having this talk recorded and shared under a free license.