2019:GLAM/Structured Data on Wikimedia Commons for GLAM-Wiki/notes

From Wikimania

SESSION OVERVIEW

Structured Data on Wikimedia Commons for GLAM-Wiki
Day & time
Friday 16 August, 15:45
Session link
  • Sandra Fauconnier (WMF)
  • André Costa (WMSE), representing related work done in the FindingGLAMs project
  • Albin Larsson / Susanna Ånäs, Wikimedia Commons Data Roundtripping
  • Florence Devouard / Isla Haddow-Flood / Sean McBirnie, ISA tool
  • Satdeep Gill (WMF)
  • Depending on attendance, 3-4 other representatives of GLAM pilot projects that worked with Structured Data on Commons
Lucas Werkmeister, Lucy Patterson, M Björkroth

this session:

Working on Structured Data on Wikimedia Commons, specifically how it can be used by GLAMs.

Sandra Fauconnier

Updates: what has been developed, what’s coming?

For whom is SDoC new? [a few raised hands]

Wikimedia Commons is the media repository, not just images, also videos, sounds, data files

Has always been written as a wiki, in wikitext, not a classical database

Long-term request from the Commons community: to make it more structured, more machine-readable, much more multilingual.

If you want to search Commons in your own language, you will find much less than if you speak English. Big hurdle for findability and usability of files.

For the past 2.5-3 years we have been working on SDC, using structured data to help describe the files on Commons.

Produces much richer APIs that tool builders can use.

Search will become better – not implemented yet, but search will become multilingual.

Many exciting things can happen in terms of organizing files in automated way, doing all sorts of nice things with it.

Will show some glimpses of this in Pilot projects we've been working on.

[missing slides – see link below]

What we are trying to do: tagging things. What does this image show? There’s a beautiful image taken by a drone of a fisherman setting on [some shore?] with fishernets, sitting on a pirogue[?].

Now we can add structured data that says what is depicted in the image - you can switch languages to get those statements in any language you want.

What’s new on Commons?


  • You can add multilingual file captions – multilingual descriptions of files
  • depicts statements
  • other statements

Add these via the file page itself.

Simple example: very beautiful photo, close-up of sugar cubes. Featured picture. How do you describe this with structured data?

There is a new, subtle interface element with two tabs: file information and structured data.

First element you can add is CAPTIONS - short descriptions of what you see in the image. This is being fed to the search index – these short descriptions help to retrieve and find the file. What’s the difference with existing descriptions in templates? They’re not integrated in the same structured way, it’s more complicated to retrieve them, but also descriptions can have wikitext, links, markup – these captions are meant to be kept extremely simple, can be easily reused through the API. Purpose for now: mostly to make search better.
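A minimal sketch of how a tool might reuse captions, assuming they arrive as per-language labels on the file's entity (the usual Wikibase shape). The entity ID, the caption texts and the `pick_caption` helper are made up for illustration.

```python
from typing import List, Optional


def pick_caption(entity: dict, preferred: List[str]) -> Optional[str]:
    """Return the caption for the first preferred language that has one."""
    labels = entity.get("labels", {})
    for lang in preferred:
        if lang in labels:
            return labels[lang]["value"]
    return None  # no caption in any of the requested languages


# Hypothetical entity: captions are plain strings keyed by language code.
example_entity = {
    "id": "M12345",  # made-up identifier
    "labels": {
        "en": {"language": "en", "value": "Close-up of sugar cubes"},
        "fr": {"language": "fr", "value": "Gros plan de morceaux de sucre"},
    },
}

# A reader who prefers Swedish, then French, then English.
print(pick_caption(example_entity, ["sv", "fr", "en"]))
```

Because captions carry no wikitext or markup, the tool can display the returned string as-is.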

Next, click on the STRUCTURED DATA TAB and you can add DEPICTS statements - everything that is represented in the media file. In this example, sugar cubes. When you start typing it will look in the Wikidata ontology and autocomplete. You can also qualify it, give more information: quantity – it’s 12 sugar cubes, and also of the white kind, not brown sugar.
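For tool builders, the sugar cube example could be sketched as data roughly like this. It assumes the standard Wikibase statement JSON layout, with P180 (depicts) as the main property and P1114 (quantity) as the qualifier. The item ID is a placeholder, not the real item for sugar cubes, and the helper names are invented.

```python
import json

DEPICTS = "P180"    # Wikidata property "depicts"
QUANTITY = "P1114"  # Wikidata property "quantity"


def item_snak(prop: str, item_id: str) -> dict:
    """A snak whose value is a Wikidata item."""
    return {
        "snaktype": "value",
        "property": prop,
        "datavalue": {
            "type": "wikibase-entityid",
            "value": {"entity-type": "item", "id": item_id},
        },
    }


def quantity_snak(prop: str, amount: int) -> dict:
    """A snak whose value is a unitless, signed quantity such as "+12"."""
    return {
        "snaktype": "value",
        "property": prop,
        "datavalue": {
            "type": "quantity",
            "value": {"amount": format(amount, "+d"), "unit": "1"},
        },
    }


def depicts_statement(item_id: str, quantity=None) -> dict:
    """Build a depicts statement, optionally qualified with a quantity."""
    statement = {
        "type": "statement",
        "rank": "normal",
        "mainsnak": item_snak(DEPICTS, item_id),
    }
    if quantity is not None:
        statement["qualifiers"] = {QUANTITY: [quantity_snak(QUANTITY, quantity)]}
    return statement


# "Q12345" is a placeholder ID standing in for the sugar cube item.
print(json.dumps(depicts_statement("Q12345", quantity=12), indent=2))
```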

More recently, you can add other statements as well. Who CREATED it (e. g. photographer)? What’s the LICENSE? Also, Commons-specific property: “Commons quality assessment”, this is a quality image or a featured image.

I said you can’t search it yet, not entirely true – you can use haswbstatement:P180=Q10409638. That’s not human-readable yet, but you can already use it to figure out how many files on Commons have a license, a creator etc. Better, more human ways will come, rest assured.
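A sketch of how a script might use that search keyword through the standard MediaWiki search API (`action=query&list=search`). It only builds the request URL and does not call the network. The function name is made up, and the item ID is the one quoted above; which item it refers to was not checked here.

```python
from urllib.parse import urlencode, urlparse, parse_qs

API = "https://commons.wikimedia.org/w/api.php"


def haswbstatement_search_url(prop: str, value: str = "", limit: int = 10) -> str:
    """Build a search URL for files that have a statement with this property.

    With a value, the search is narrowed to that property=value pair.
    """
    term = f"haswbstatement:{prop}"
    if value:
        term += f"={value}"
    params = {
        "action": "query",
        "list": "search",
        "srsearch": term,
        "srnamespace": 6,  # namespace 6 is the File namespace
        "srlimit": limit,
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"


url = haswbstatement_search_url("P180", "Q10409638")
print(url)
# Round-trip check: the search term survives URL encoding unchanged.
print(parse_qs(urlparse(url).query)["srsearch"][0])
```

Dropping the value (for example `haswbstatement_search_url("P275")` for the license property) should be the way to count files that have any statement for a property at all.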

What’s coming next? The grant ends in 2019, but the work will not stop.

Before the end of the year we hope to finish SPARQL querying

Lua support (the scripting language with which you can use the data in Wikipedia or elsewhere) is under review

more datatypes – Wikidata experts will be familiar with e. g. dates, that’s not supported yet but will be, also geographical coordinates (you can already do it through the API...!)

Upload wizard for custom campaigns, make sure that participants use structured data

also, feature request: a campaign wants to pre-fill a field, e. g. which campaign (Wiki Loves …) the file belongs to

experimenting with machine vision - AI-based detection of what's in an image. An AI will make suggestions about what's in an image - human still has to approve. Working very hard to find the most unbiased algorithms...

We might do (but not promising it yet):

search depicts statements – perhaps as a gadget, instead of a full overhaul of search

depicts of depicts – not explained, there is no time. But it's useful!

image annotations: this particular bit of this image shows this bird (working on it but it’s further down the line)

Most important part:

GLAM pilot projects:


Content partnerships

Community initiatives

Tools and prototypes

How will this work for collaborations with cultural institutions?

Will also document best practices discovered through these pilots.

Is there a potential of SDoC to help with easier integration with Wikisource?

Q Florence[?]: depicts of depicts what?

A: it's complicated, come to hands-on session.
You have a photo of a sculpture, which is a sculpture of a bird. The image depicts a sculpture, and the sculpture will be a Wikidata item and depicts a bird.
We want to make sure that this is the correct way to model that.
At some point the search will go through these two steps, to see what the Wikidata item depicts, so that when you search for “pelican” you will also get that sculpture.
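The two-step lookup described in this answer can be sketched with toy data. This is an illustration of the idea only, not how Commons search is or will be implemented. All file names and item IDs are invented.

```python
# Toy data. Step 1: which items does each file depict?
file_depicts = {
    "File:Sculpture photo.jpg": {"Q_SCULPTURE"},
    "File:Pelican photo.jpg": {"Q_PELICAN"},
    "File:Sugar cubes.jpg": {"Q_SUGAR"},
}

# Step 2: which items does a depicted item itself depict (on Wikidata)?
item_depicts = {
    "Q_SCULPTURE": {"Q_PELICAN"},
}


def search_depicts(target: str) -> list:
    """Find files that depict the target directly or via one depicted item."""
    hits = []
    for name, items in file_depicts.items():
        direct = target in items
        indirect = any(target in item_depicts.get(item, set()) for item in items)
        if direct or indirect:
            hits.append(name)
    return sorted(hits)


# Searching for the pelican also returns the photo of the pelican sculpture.
print(search_depicts("Q_PELICAN"))
```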

André Costa (WMSE)

COO of Wikimedia Sweden, representing related work done in the FindingGLAMs project

(Andre and Sandra work closely together)

FindingGLAMs is run together with UNESCO and Wikimedia Foundation.

Goal: bring together as much freely licensed information about cultural heritage organizations as possible.

2 parts:

1// Creating a database of cultural heritage institutions and the collections they contain.

Partially finding and aggregating publicly available dataset, partially crowdsourcing[?] - FindingGLAMs - go to the session about that to find out more.

2// How can we support GLAMs to share their content

through case studies, best practices and guides

Project isn’t finished, about halfway through, SDoC parts are not done yet.

Case study: problematic data

Many collections contain images/materials with problematic data.

Metadata might use language that today we consider racist/discriminatory/otherwise offensive.

But also, through omission generalizing the depicted cultures.

This image [slides] for example, likely caption “Sapmi woman wearing traditional clothing”, or might use derogatory term instead of Sapmi.

Doesn’t say who she was – famous? Not captured in the metadata. What traditional clothing? From which region? Missing all this information.

And all the metadata will be in Swedish, not any of the Sapmi languages.

There is a workshop next week(?) on how to deal with this material once you make it available to a broader public that might not know the background, i.e. that you wouldn't use this language and perspective today.

Also looking at how to use SDoC for campaigns to encourage […] adding extra content that’s missing, fixing glaring mistakes,

enabling a shared ownership and curation of this material, which is held by a Swedish museum, not a Sapmi museum.

Now that it’s digital online, how can we share ownership?

Collaborate with Sapmi institutions, Sapmi high school.

There is a related talk later.

Case study: Historic music

(much less problematic)

Audio recordings and sheet music that these audio recordings "depict". And images related to the performances

How do you connect all of these in Structured Data?

If we digitize the scoresheet to make it machine-readable, how do we link that to the original image and the other material?

What part is on Commons, what on Wikidata, what elsewhere?

If we digitize these sheets that clarifies the copyright status - can we use that to encourage the creation of new music?

Case study: Missing metadata structure

[haven’t done much on this yet]

How can we describe metadata structures for different specialised topics?

If a GLAM institution has scanned coins, and we have a structure of that.

Same thing if we have pictures of clothing.

Which images are missing obvious key metadata?

And then try crowdsourcing to fill that out.
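One way to picture this: describe the expected metadata for a topic as a list of key fields, then list the files that lack any of them. The coin profile, field names and records below are hypothetical, since this case study had not yet defined any.

```python
# Hypothetical metadata profile for scanned coins.
COIN_PROFILE = ["material", "mint", "year"]


def missing_fields(record: dict, profile: list) -> list:
    """Return the profile fields that are absent or empty in the record."""
    return [field for field in profile if not record.get(field)]


def files_needing_work(records: dict, profile: list) -> dict:
    """Map each incomplete file to the fields a volunteer could fill in."""
    report = {}
    for name, record in records.items():
        gaps = missing_fields(record, profile)
        if gaps:
            report[name] = gaps
    return report


# Invented example records.
records = {
    "File:Coin 001.jpg": {"material": "silver", "mint": "", "year": "1652"},
    "File:Coin 002.jpg": {"material": "copper", "mint": "Avesta", "year": "1718"},
    "File:Coin 003.jpg": {"year": "1700"},
}

for name, gaps in files_needing_work(records, COIN_PROFILE).items():
    print(name, "is missing:", ", ".join(gaps))
```

The resulting list is what a crowdsourcing campaign could be fed with.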

For organizations that haven’t yet digitized their content, that would make it easier when contacting them to ensure that these key facts are already captured at digitization.

Often we get a hard drive with images, and a hard drive with scanned info cards. You need that to be captured at the same time.

Q: music notes - what about different versions? draft versions...

A: maybe you also have annotation from the people who performed the pieces... Yes! :)

Albin Larsson: Wikimedia Commons Data Roundtripping and Institutional Ingestion of Wikimedia Data

GLAMs have often donated data to Commons, Wikidata, …

then data is changed/improved in our projects, and GLAMs want it back.

encouraging people to add data around GLAMs

a subset of this project was about SDC - we ran several campaigns

  • ?
  • wikidata
  • SDC

Invited users to improve the metadata that GLAMs had donated.

Pilot with structured data on Commons: collected depicts statements for different fashion from Nordic museums.

You were presented with an image, searched for items on Wikidata, and added depicts statements.

GLAMs often use controlled vocabularies.

e. g. the Europeana Fashion vocabulary

There was often a mismatch between Wikidata and Europeana’s vocabulary.

Opportunity to find holes in existing vocabularies.
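A rough sketch of how such holes could be surfaced: compare the terms of a controlled vocabulary with the Wikidata labels used in depicts statements. The terms below are invented and the matching is a naive case-insensitive comparison. Real vocabulary alignment would need identifiers and multilingual labels.

```python
def normalise(term: str) -> str:
    """Naive normalisation: trim whitespace and ignore case."""
    return term.strip().casefold()


def vocabulary_gaps(glam_terms: list, wikidata_labels: list) -> dict:
    """Split terms into those found on one side only and those shared."""
    glam = {normalise(t) for t in glam_terms}
    wiki = {normalise(t) for t in wikidata_labels}
    return {
        "only_in_glam": sorted(glam - wiki),
        "only_in_wikidata": sorted(wiki - glam),
        "shared": sorted(glam & wiki),
    }


# Invented example terms.
glam_vocabulary = ["Dress", "Apron", "Bodice"]
wikidata_labels = ["dress", "apron", "mitten"]

gaps = vocabulary_gaps(glam_vocabulary, wikidata_labels)
print("Missing on Wikidata:", gaps["only_in_glam"])
print("Missing in the GLAM vocabulary:", gaps["only_in_wikidata"])
```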

"Nordic Museum Depicts"

For one pilot, we used machine learning for translation of entire descriptions.

We realized: there could be errors, and we had no way to communicate that these were machine translations.

We need to tell the audience of the data that this was created by a machine, but accepted by a human user. Communication. We also need to communicate the background on some old data that is really racist against Sami people.

Tools and code are open source, resources on Commons as well.

Two museums are currently working on putting this in production, even though it started as a research project.

Time for one question now, opportunity for more questions tomorrow.

[No question.]

Florence Devouard: The ISA tool

Came from a double need:

  • Wanted to do something related to Wiki Loves Africa, a photographic contest that’s already been running for five years, collected nearly 50 000 images. At that time there was no depicts, no structured data.
  • Wanted editors to improve the files all year.

Sandra came to Florence and Isla to discuss how they could use SDoC. More people: Eugene from Cameroon did not get a visa :( also Navino Evans[?] from Histropedia and Sean McBirnie (design).

ISA is a microcontribution tool that works on desktop but is also mobile-friendly.

Facilitate the addition of two types of structured data: captions and depicts. (In future: image quality assessment too?)

People could create campaigns, from a set of categories of images.

live demo

There are already some campaigns (click Explore Campaigns) - Participation requires login through a Wikimedia Commons account

People at Work across Africa campaign with 18000 images.

Proposes an interface where information about contribution through the tool is collected.

On the left: top contributors, on the right: top countries.

Challenges: fun moments where the organizer can challenge people to contribute over a limited time (during an event, or for a week), then reward the top contributors.

Going through the images; shows filename, description, basic metadata from commons; ability to add depicts statements and captions.

[gets agreement to vandalise from someone else's account! ;)]

This image depicts an angle grinder.

Save – it’s saved on Wikimedia Commons.

For adding captions, select up to five languages in your preferences.

[going back to slides]

In development

currently only in English and French - translation help welcome

Images are currently served in a random order. Would prefer images lacking information to be served first. With 18000 images, it’s hard to ensure that all images are worked on if they’re presented in order.
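The preferred serving order could be sketched as follows: shuffle first, then sort by how much structured data each image already has, so that images with nothing come first and ties stay random. This is an illustration, not ISA's actual code. The record fields and counts are hypothetical.

```python
import random


def serving_order(images: list, seed=None) -> list:
    """Order images so those with the least structured data are served first.

    Shuffling before a stable sort keeps the order random among images
    that have the same amount of information.
    """
    rng = random.Random(seed)
    shuffled = list(images)
    rng.shuffle(shuffled)
    return sorted(shuffled, key=lambda img: img["captions"] + img["depicts"])


# Invented example records: counts of existing captions and depicts statements.
images = [
    {"file": "File:A.jpg", "captions": 2, "depicts": 3},
    {"file": "File:B.jpg", "captions": 0, "depicts": 0},
    {"file": "File:C.jpg", "captions": 1, "depicts": 0},
    {"file": "File:D.jpg", "captions": 0, "depicts": 0},
]

for img in serving_order(images, seed=1):
    print(img["file"], img["captions"] + img["depicts"])
```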

Anyone can create campaigns!

Requires begin date, end date, categories, logo.

Then advertise this URL to the people involved in the campaign. (They don't have to go to the entry point of the tool, but can go directly to the campaign.)

Tool is focused on Wiki Loves … campaigns, based on the country. But it’s open to any kind of category. You can create a campaign for Category:Stockholm.

up to 5 categories per campaign
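The campaign requirements mentioned above (begin date, end date, categories, logo, at most 5 categories) could be modelled like this. It is a sketch of the rules as stated in the talk, not ISA's real data model. The class, field names and example values are assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

MAX_CATEGORIES = 5  # limit mentioned in the talk


@dataclass
class Campaign:
    name: str
    start: date
    end: date
    categories: List[str]
    logo: str

    def __post_init__(self):
        # Reject campaigns that break the stated rules.
        if self.end < self.start:
            raise ValueError("end date must not be before the begin date")
        if not 1 <= len(self.categories) <= MAX_CATEGORIES:
            raise ValueError(f"a campaign needs 1 to {MAX_CATEGORIES} categories")

    def is_running(self, today: date) -> bool:
        """True while the campaign is between its begin and end dates."""
        return self.start <= today <= self.end


# Hypothetical campaign built on a single Commons category.
stockholm = Campaign(
    name="Stockholm",
    start=date(2019, 8, 16),
    end=date(2019, 8, 31),
    categories=["Category:Stockholm"],
    logo="File:Example logo.svg",  # made-up file name
)
print(stockholm.is_running(date(2019, 8, 20)))
```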

Statistics: collect as much information as possible on participants.

often used during an event - this will help with reporting on the event - will even draw dataviz graphics for you

Information is already available as CSV, but next step is to make visualizations available directly.
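Until built-in visualizations exist, an organizer could summarize the CSV export with a few lines of script. The column names and rows below are invented, since the notes do not record the actual layout of ISA's CSV.

```python
import csv
import io
from collections import Counter

# Invented sample in an assumed layout: one row per contributor and session.
SAMPLE_CSV = """username,country,edits
user_a,Cameroon,12
user_b,Sweden,7
user_c,Cameroon,5
user_b,Sweden,4
"""


def top(rows: list, key: str, n: int = 3) -> list:
    """Sum the edit counts per value of `key` and return the n largest."""
    totals = Counter()
    for row in rows:
        totals[row[key]] += int(row["edits"])
    return totals.most_common(n)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print("Top contributors:", top(rows, "username"))
print("Top countries:", top(rows, "country"))
```

This mirrors the two leaderboards shown in the tool: top contributors and top countries.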


We will launch a campaign, possibly around the end of the month, for Wiki Loves Africa, to get people more excited.[?]

Also will put out a comprehensive manual for users and campaign organisers.


If anyone has technical questions, find Sean, he can answer them or knows how to reach the other developers.

Q Susanna: we have a Skolt Sami workshop next week (!), but Sami is not a MediaWiki user interface language - are there any obstacles for having ISA in a Sami language? It is available in Wikidata.

A Florence: Currently translations are integrated fairly directly in the tool. Integration with TranslateWiki is “almost there”, “next couple of days” (Sean).

The depicted items come directly from Wikidata, need to be translated there.

Satdeep Gill (WMF): Digitization and Wikisource pilot

trying to work with Wikisource, digitized books and SDC

For this pilot worked with Punjabi community.

Took a specific genre of Punjabi literature 19th and early 20th century, Qissa. Not much metadata available on these books. A lot of research was necessary to figure out whether the work is public domain or not.

We wanted to figure out the workflow for getting this content onto Wikidata and connecting it to Commons, and then to Wikisource.

Got 20 books from a private collection, collected metadata – some is on the books (name, author)… metadata collection is ongoing.

so far only 5 are known to be public domain

Wikidata items were already created, the bibliographic data is on Wikidata, at least whatever we have.

Works have also been scanned, ready to upload if public domain.

Listeria list of all the available data, including copyright status - larger Wikisource community hasn't been doing this a lot

Main problem with digitized works and SDoC: metadata for book editions will live on Wikidata, not Commons.

So what then goes to Commons? Information about that specific copy, e. g. that it comes from a private collection.

connect to Wikidata using the depicts statement

Currently, metadata about books lives in different places and is not connected.

Data on Wikidata is sometimes connected to Template:Book on Commons, sometimes not.

There is a duplication of metadata... we create metadata on Wikidata, but that information is also included in the Commons template - could we automate this..?

Open questions:

  • data modeling of published works?
  • which data lives where?
  • how to avoid duplication of data?

Questions or any other feedback.

Q James Heald: Wikidata has a class for “individual copy of book”, so why not put the individual copy on Wikidata as well? Don’t think anyone would object.

Audience member (Peter Isotalo, archivist): I would object! I’m an archivist by trade. How many copies do we want to keep track of? Is there a usefulness in having multiple Wikidata entries? If there are separate *editions*, it’s another thing.
J Heald: depends on how much you want to say about each individual copy.
A: we need good feedback on this before we can move forward... for now we are taking the text from the template on commons...

Q Peter Isotalo (archivist): Question about annotations! On Commons, there are annotations. What happens to them once we go to Wikidata? Example of a file with annotations on Commons:

File:MaryRose-gun furniture.jpg
A Sandra: I've done some. Our Brazilian community is great at that. On the “things to come” slide are annotations. Is it possible to figure out how often they are used?
Audience, Multichill: it's weird syntax on the page, you can search/count template usage. (Answer below: used 200 000 times)
Andre Costa: That’s also integrated on Wikidata to some extent. See also https://tools.wmflabs.org/wd-image-positions/

Sandra: Have been experimenting with IIIF integration

Q Scann: Question about image quality: is there a technical standard? What’s added under that property?

A Sandra: Process on commons - I made a nice photo, can you grade it? There was some discussion if you want to have more subjective statements that anyone can add.

There is a “problems and opportunities” session on Sunday afternoon.

Q Multichill: You mentioned that when we have Lua access, it will be available from other wikis too. Are you sure?

A Sandra: my colleagues said yes

Multichill: image annotations used 200 000 times

Comment: There are also featured images on other wikis.

A Sandra: Other wikis can also host images. In this grant we’ve only focused on Commons, but we’ve been talking about using it for other wikis. No promises, though.

SESSION SUMMARY

  • ...


You can follow along with the slides here: https://docs.google.com/presentation/d/1iG-rZGmcjjzZnmC36MgKjmbFBaJnX6eR9uz7WgFyOBo/edit#slide=id.g5cda8c7b2a_1_0


  • ...