2019:Poster session/Towards more consistent import of external data sources to Wikidata
This is an Accepted submission for the Poster space at Wikimania 2019.
Description
This poster (and, if available, a demo) gives an overview of the vision and results of the Wikidata & ETL project. A brief description of the problem, taken from the project's page:
Currently, Wikidata, or any other Wikibase instance, is populated from external data sources mostly manually, by creating ad-hoc data transformation scripts. Usually, these scripts are run once and then abandoned. Given the heterogeneity of the source data and of the languages used to transform them, the scripts are hard or impossible to maintain and cannot be run periodically in an automated fashion to keep Wikidata up to date.
At the same time, more and more interesting open data sources emerge, ready to be ingested into Wikidata. Without a methodology, tooling support and a set of best practices and examples, they will be left unexploited, or they will be transformed in the same chaotic way as before.
Existing approaches to programmatically loading data into Wikidata, such as Pywikibot and Wikidata Integrator, require coding in Python. QuickStatements is not suited for automated bulk data loading, and OpenRefine is limited to tabular data and focuses more on manual tinkering with the data than on the bulk loading process. Wikibase_Universal_Bot is a Python library that automatically loads bulk data from a tabular format (CSV file) into a Wikibase instance (such as Wikidata), given a user-defined data model (in YAML format).
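To illustrate the kind of ad-hoc Python scripting these approaches involve, the sketch below uses Pywikibot to add a single statement to an item. The choice of property (P31, "instance of"), target class (Q5, "human") and the sandbox item are placeholder assumptions for the example, not part of the project; running it also requires a configured Pywikibot user account.

<syntaxhighlight lang="python">
# Minimal sketch of an ad-hoc Pywikibot load script (assumed example values:
# P31 "instance of", Q5 "human", Q4115189 Wikidata sandbox item).
import pywikibot


def add_instance_of(item_qid: str, class_qid: str = "Q5") -> None:
    """Add an 'instance of' (P31) claim to an item, if it is missing."""
    site = pywikibot.Site("wikidata", "wikidata")
    repo = site.data_repository()

    item = pywikibot.ItemPage(repo, item_qid)
    item.get()  # load the item's existing claims

    # Skip items that already carry the statement, so re-runs stay idempotent.
    existing = item.claims.get("P31", [])
    if any(claim.getTarget().id == class_qid for claim in existing):
        return

    claim = pywikibot.Claim(repo, "P31")
    claim.setTarget(pywikibot.ItemPage(repo, class_qid))
    item.addClaim(claim, summary="Adding P31 from an external data source")


if __name__ == "__main__":
    add_instance_of("Q4115189")  # Wikidata sandbox item
</syntaxhighlight>

Each external source tends to get its own variant of such a script, which is exactly the maintenance problem described above.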
In this poster, we present our approach based on LinkedPipes ETL and show how data from external sources can be loaded into a Wikibase instance in a repeatable and consistent manner.