2023:Program/Open Data/XKBU7Z-Duplicating Everywhere All at Once

From Wikimania

Title: Duplicating Everywhere All at Once


Alex Lum / Canley

Alex Lum / Canley has been an editor on Wikipedia since 2005, an administrator on the English Wikipedia since 2008, and a prolific contributor to Wikidata and OpenStreetMap. He has a background in computer science and statistics, and is currently working in data science in the higher education sector. Alex was President of Wikimedia Australia from 2020 to 2022, and is currently on the WMAU committee as Secretary of the chapter.

Pretalx link

Etherpad link

Room: Room 309

Start time: Sat, 19 Aug 2023 10:00:00 +0800

End time: Sat, 19 Aug 2023 10:30:00 +0800

Type: No (pretalx) session type id specified

Track: Open Data

Submission state: confirmed

Duration: 30 minutes

Do not record: false

Presentation language: en

Abstract & description

Abstract

Five years ago, bots created millions of articles on several Wikipedia language editions and corresponding Wikidata items, resulting in thousands of duplicated Wikidata items for geographic places. This session will cover how this happened, use data visualisation to show the scope of the issues, and suggest some novel ways of cleaning up Wikidata, Wikipedia and the original data sources.

Description

Five years ago, Lsjbot created millions of articles on several Wikipedia language editions, for which other bots created corresponding Wikidata items. The result has been hundreds of thousands of duplicated items for geographic places on Wikidata.

This session will look at the history of how this happened, use data visualisation to show the scope and scale of the issue, and propose some ways of cleaning up Wikidata, Wikipedia and even the original data sources. It will concentrate primarily on geographic places in Aotearoa New Zealand and some parts of Australia, but will be relevant to other countries where the issue of bot-created duplicates of geographic entities is significant.
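As an illustration of the kind of check such a cleanup might begin with, here is a minimal sketch that flags candidate duplicate places: records sharing the same normalised label and lying within a small distance of each other. It is not the presenter's method. The record layout, the `candidate_duplicates` helper, the 1 km threshold and the sample data are all assumptions made for this example, and any flagged pair would still need human review before merging.

```python
import math
from itertools import combinations

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def candidate_duplicates(items, max_km=1.0):
    """Return ID pairs whose labels match (case-insensitively) within max_km.

    Hypothetical helper: the threshold and matching rule are assumptions.
    The pairwise loop is quadratic, so it only suits small batches.
    """
    pairs = []
    for a, b in combinations(items, 2):
        same_label = a["label"].strip().casefold() == b["label"].strip().casefold()
        if same_label and haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_km:
            pairs.append((a["id"], b["id"]))
    return pairs

# Made-up records for illustration only; these are not real Wikidata items.
items = [
    {"id": "Q-A", "label": "Example Bay", "lat": -41.2900, "lon": 174.7800},
    {"id": "Q-B", "label": "example bay", "lat": -41.2905, "lon": 174.7803},
    {"id": "Q-C", "label": "Example Bay", "lat": -36.8500, "lon": 174.7600},
]
print(candidate_duplicates(items))  # [('Q-A', 'Q-B')]
```

Matching on label plus proximity rather than label alone keeps distinct places that share a common name, such as the third record above, out of the candidate list.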

Further details

Qn. How does your session relate to the event themes: Diversity, Collaboration, Future?

This issue requires a collaborative approach: it is too much for one person to tackle, even with automation, and it needs cooperation between the diverse editing communities of Wikidata and several Wikipedia language editions. The cleanup effort also hopes to work with the data source (GeoNames) to “round-trip” the cleaned-up data from Wikidata to their site to help de-duplicate their data. I will concentrate primarily on New Zealand and parts of Australia, but the aim is to share techniques and code to inspire other Wikidata and Wikipedia editors to tackle the cleanup tasks in their local region or language. This will hopefully result in a future where Wikidata is an accurate and mostly definitive geographic database of the whole world.

Qn. What is the experience level needed for the audience for your session?

Average knowledge about Wikimedia projects or activities

Qn. What is the most appropriate format for this session?

  • [x] Onsite in Singapore
  • [ ] Remote online participation, livestreamed
  • [ ] Remote from a satellite event
  • [ ] Hybrid with some participants in Singapore and others dialing in remotely
  • [ ] Pre-recorded and available on demand