2023:Program/Open Data/XKBU7Z-Duplicating Everywhere All at Once
Title: Duplicating Everywhere All at Once
Alex Lum / Canley
Alex Lum / Canley has been an editor on Wikipedia since 2005, an administrator on the English Wikipedia since 2008, and a prolific contributor to Wikidata and OpenStreetMap. He has a background in computer science and statistics, and is currently working in data science in the higher education sector. Alex was President of Wikimedia Australia from 2020 to 2022, and is currently on the WMAU committee as Secretary of the chapter.
Room: Room 309
Start time: Sat, 19 Aug 2023 10:00:00 +0800
End time: Sat, 19 Aug 2023 10:30:00 +0800
Type: No (pretalx) session type id specified
Track: Open Data
Submission state: confirmed
Duration: 30 minutes
Do not record: false
Presentation language: en
Abstract & description
Five years ago, bots created millions of articles on several Wikipedia language editions and corresponding Wikidata items, resulting in thousands of duplicated Wikidata items for geographic places. This session will cover how this happened, use data visualisation to show the scope of the issues, and suggest some novel ways of cleaning up Wikidata, Wikipedia and the original data sources.
Five years ago, Lsjbot created millions of articles on several Wikipedia language editions, for which other bots created corresponding Wikidata items. The result has been hundreds of thousands of duplicated items for geographic places on Wikidata.
This session will look at the history of how this happened, use data visualisation to show the scope and scale of the issue, and propose some ways of cleaning up Wikidata, Wikipedia and even the original data sources. It will concentrate primarily on geographic places in Aotearoa New Zealand and some parts of Australia, but will be relevant to other countries where the issue of bot-created duplicates of geographic entities is significant.
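As a rough illustration of what finding such duplicates can involve, the sketch below flags pairs of place records that share a label and lie within a small distance of each other. It is not the method presented in the session. The record layout, the `candidate_duplicates` helper, the 1 km threshold and the sample records are all assumptions made for this example, and the IDs are placeholders rather than real Wikidata items.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def candidate_duplicates(items, max_km=1.0):
    """Return (id_a, id_b, distance_km) for same-label items within max_km.

    Hypothetical helper: the threshold and the exact-label match are
    illustrative assumptions, not rules taken from the session.
    """
    pairs = []
    for a, b in combinations(items, 2):
        # Compare labels case-insensitively; skip pairs with different names.
        if a["label"].casefold() != b["label"].casefold():
            continue
        distance = haversine_km(a["lat"], a["lon"], b["lat"], b["lon"])
        if distance <= max_km:
            pairs.append((a["id"], b["id"], round(distance, 3)))
    return pairs


# Hypothetical records; the IDs and place names are placeholders.
items = [
    {"id": "item-A", "label": "Example Bay", "lat": -41.2900, "lon": 174.7800},
    {"id": "item-B", "label": "Example Bay", "lat": -41.2905, "lon": 174.7803},
    {"id": "item-C", "label": "Example Bay", "lat": -36.8500, "lon": 174.7600},
    {"id": "item-D", "label": "Other Hill", "lat": -41.2900, "lon": 174.7800},
]

pairs = candidate_duplicates(items)
print(pairs)
```

Here only the first two records are flagged: the third shares the label but is far away, and the fourth shares the location but not the label. Comparing every pair is quadratic in the number of records, so a real cleanup over a whole country would likely need spatial indexing or grouping by label first, and any flagged pair would still need human review before merging.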
Further details
Qn. How does your session relate to the event themes: Diversity, Collaboration, Future?
This issue requires a collaborative approach: it is too much for one person to tackle, even with automation, and it needs cooperation between the diverse editing communities of Wikidata and several Wikipedia language editions. I also hope to work with the data source (GeoNames) to “round-trip” the cleaned-up data from Wikidata to their site to help de-duplicate their data. I will be concentrating primarily on New Zealand and parts of Australia, but the aim is to share techniques and code to inspire other Wikidata and Wikipedia editors to tackle the cleanup tasks in their local region or language. This will hopefully result in a future where Wikidata is an accurate and mostly definitive geographic database of the whole world.
Qn. What is the experience level needed for the audience for your session?
Average knowledge about Wikimedia projects or activities
Qn. What is the most appropriate format for this session?