2025:Program/Behind the Edits: Exploring Human-Bot Collaboration

From Wikimania

Session title: Behind the Edits: Exploring Human-Bot Collaboration

Session type: Lightning talk
Track: Lightning Talk Showcase
Language: en

The aim of this research is to identify and characterize the different types of content creation on Wikipedia. The rise of LLMs has led organizations to ingest as much data as they can for training. Wikipedia's rich curation standards allow greater visibility into the kinds of content LLMs train on. Along the spectrum from purely human edits to purely automated edits, editors also use an array of automated tools, ranging from light automation (e.g. spell checking) to heavier automation (e.g. content translation between languages) to generative AI tools with light human review. We take advantage of a unique opportunity to identify tags that may help with this categorization, and then explore how degrees of automation affect content quality, breadth, and update frequency.

Description

Our session highlights the work my research team and I have done to identify a potential scale of automation using the special edit tags in Wikipedia.

Content in Wikipedia is added by both human volunteers and automated bots. Along the spectrum from purely human edits to purely automated edits, editors also use an array of automated tools, ranging from light automation (e.g. spell checking) to heavier automation (e.g. content translation between languages) to generative AI tools with light human review. Unlike many other corpora, Wikipedia keeps detailed metadata about the source of each edit (human or bot) and the tools used in the editing process. Such categorization empowers policymakers to make informed, data-driven decisions about the appropriate balance between bot and human involvement when using the resulting data for various purposes.

Wikipedia's software marks edits and logged actions with "tags." These tags may tell us which edits were machine generated, or that a variable mix of AI and human input was involved in an edit. We define an edit as any change to the text content of a page.

Our goal was to find which tags were related to machine generation or an interaction between human and machine. In the latter case, because we assume there are varying degrees of human and machine input, we also wanted to see if the interaction existed on a scale that measured differing ratios of machine to human generation. We wish to showcase how content quality, breadth, and update frequency may change across the above types of content generation.
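As a minimal sketch of the categorization described above, the snippet below maps change tags to a coarse automation scale. The tag names are real MediaWiki change tags, but the level assignments are our own hypothetical illustration, not an established standard:

```python
# Illustrative sketch only: map MediaWiki change tags onto a coarse
# automation scale (0 = fully manual ... 3 = fully automated).
# Tag names are real Wikipedia tags; the level assignments are
# hypothetical choices made for this example.
TAG_LEVELS = {
    "visualeditor": 0,        # human edit via the visual editor
    "wikieditor": 0,          # human edit via the source editor
    "contenttranslation": 2,  # machine-assisted translation between languages
}

def automation_level(tags, is_bot=False):
    """Return the highest automation level among an edit's tags.

    Bot-flagged edits are treated as fully automated (level 3);
    untagged or unknown-tag edits default to 0 (assumed manual).
    """
    if is_bot:
        return 3
    return max((TAG_LEVELS.get(t, 0) for t in tags), default=0)

# Example: a ContentTranslation edit made from a human account
print(automation_level(["contenttranslation"]))  # → 2
```

In practice, an edit's tags (and the bot flag) can be retrieved from the MediaWiki API, e.g. via `action=query&list=recentchanges&rcprop=tags|flags`, and then scored with a mapping like the one above.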

Analyzing Wikipedia’s tag data can also reveal trends and highlight areas where increased human intervention may be needed, fostering a more effective and collaborative editing ecosystem.

How does your session relate to the event theme, Wikimania@20 – Inclusivity. Impact. Sustainability?

At its heart, our research deals with the truth, falsity, and quality of information on Wikipedia. It rests on the assumption that not all information makes for quality training data, directly challenging the current trend of LLMs ingesting as much information as possible without regard to ethics or robustness.

What is the experience level needed for the audience for your session?

Everyone can participate in this session

Resources

Speakers

  • Max Wang
I'm a rising senior computer science student at Clarkson University who's interested in ethical computing. I'm ultimately interested in the intersection between social change and technological advancement. I am also a Resident Advisor at my school and have been on my school's swim team for three years. I'm just beginning to get involved in the Wikimedia movement, having joined my professor on Wikipedia metadata research a couple of months ago.