2019:Research/Discovering and Analyzing Wikimedia Images

From Wikimania


In this tutorial, "Discovering and Analyzing Wikimedia Images", we will learn how large-scale analysis techniques can help editors, developers and researchers discover, download, analyze and classify large numbers of images.

Overview

Good organization and categorization of Wikimedia Commons images is crucial to ensure high-quality visual enrichment of all Wikimedia projects. Downloading and manually categorizing individual images from Commons is a fairly simple task. However, manual curation does not scale well to repositories as large as Commons.

In this tutorial, we will first look at how to generate lists of images with specific characteristics, e.g. images belonging to specific categories, returned by specific search queries, or with specific roles in the Wikimedia spaces (e.g. Wikipedia page images, or Wikidata image statements). We will then learn how to download these images and their metadata at scale. Next, we will cover basic tools for content-based image analysis, such as color extraction and image tagging with existing computer vision models and APIs. Finally, we will learn how to design simple classifiers that can help automatically tag images from specific categories.
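As a rough illustration of the first step, the sketch below lists the files in a Commons category through the MediaWiki Action API (`action=query` with `list=categorymembers`). It is not taken from the tutorial notebooks: the helper names `category_params`, `fetch_json` and `list_category_files`, the User-Agent string, and the `max_files` cap are all assumptions made for this example.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://commons.wikimedia.org/w/api.php"


def category_params(category, limit=500, cont=None):
    """Build the query parameters for one page of file members of a category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,   # e.g. "Category:Some category" (example title)
        "cmtype": "file",      # only return files, not subcategories or pages
        "cmlimit": str(limit),
        "format": "json",
    }
    if cont:
        # The API returns a "continue" object; its keys are merged back
        # into the next request to fetch the following page of results.
        params.update(cont)
    return params


def fetch_json(params):
    """Send one GET request to the Commons API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Hypothetical User-Agent; replace with one that identifies your tool.
    req = urllib.request.Request(url, headers={"User-Agent": "image-tutorial-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def list_category_files(category, fetch=fetch_json, max_files=1000):
    """Collect file titles from a category, following API continuation."""
    titles = []
    cont = None
    while len(titles) < max_files:
        data = fetch(category_params(category, cont=cont))
        titles.extend(m["title"] for m in data["query"]["categorymembers"])
        cont = data.get("continue")
        if not cont:
            break
    return titles[:max_files]
```

Calling `list_category_files("Category:Some category")` would return titles of the form `File:...`, which can then be passed to a download step. The `fetch` argument is injectable so the paging logic can be tried out without network access.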

Detailed Agenda

  1. Introduction to Images and Wikimedia: quantifying visual knowledge gaps
  2. Discovering Images in Wikimedia spaces: retrieving lists of images from a specific category, a specific Wikipedia article across languages, or a specific Wikidata item and its subclasses
  3. Analyzing and Classifying Images in Wikimedia spaces: tagging images according to their color distribution, their quality, and their content. Tools, opportunities and risks of using image classifiers.
  4. Comparing images in Wikimedia spaces: computing visual similarity between images or groups of images, using distance metrics and clustering algorithms
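
To make agenda items 3 and 4 more concrete, here is a minimal sketch of describing an image by its color distribution and comparing two images with a distance metric. It assumes each image has already been decoded into a list of `(r, g, b)` tuples with values from 0 to 255, for example with an imaging library such as Pillow. The function names `color_histogram` and `cosine_distance` and the choice of 4 bins per channel are assumptions for illustration, not the method used in the tutorial notebooks.

```python
import math


def color_histogram(pixels, bins=4):
    """Return a normalized histogram over bins**3 quantized RGB colors."""
    hist = [0.0] * (bins ** 3)
    step = 256 / bins  # width of each bin along one color channel
    for r, g, b in pixels:
        # Map each channel to a bin index, then flatten to a single index.
        idx = (int(r // step) * bins + int(g // step)) * bins + int(b // step)
        hist[idx] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means identical color distributions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


# Toy "images": two reddish ones and a blue one.
red = [(255, 0, 0)] * 10
dark_red = [(200, 10, 10)] * 10
blue = [(0, 0, 255)] * 10

d_similar = cosine_distance(color_histogram(red), color_histogram(dark_red))
d_different = cosine_distance(color_histogram(red), color_histogram(blue))
```

With these toy inputs, the two reddish images fall into the same color bin and end up closer to each other than to the blue one. The same histogram vectors could be fed to a clustering algorithm to group images by dominant colors.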

Talk Material

Slides

The draft of the slides for the tutorial can be found here: slides.

Feel free to leave comments!

Code

Code for the tutorial can be found at this git repo. There are two main files, both Python notebooks, that we are going to use:

  • Image_Analysis_Wikimania_19.ipynb for the first part of the tutorial, on image analysis and classification
  • Image_Comparison_Wikimania_19.ipynb for the second part of the tutorial, on image comparison

Requirements to run this code can be found in the README.md file in the repo. If you have any questions, please reach out to Miriam (WMF).

Participants [subscribe here!]
  1. Miriam (WMF) (talk)
  2. Blue Rasberry (talk) 13:49, 1 August 2019 (UTC)
  3. Ranjithsiji (talk) 02:42, 3 August 2019 (UTC)
  4. Indrajitdas (talk) 16:54, 7 August 2019 (UTC)
  5. Lkjidm (talk) 13:08, 11 August 2019 (UTC)
  6. Abdeaitali (talk) 09:32, 16 August 2019 (UTC)
  7. Cmglee (talk) 09:37, 16 August 2019 (UTC)
  8. Jheald (talk) 10:56, 16 August 2019 (UTC)
  9. Mcoffsky