Page MenuHomePhabricator

Wikimedia Technical Conference 2018 Session - Identifying and extracting data trapped in our content
Closed, ResolvedPublic


Session Themes and Topics

  • Theme: Increasing our technical capabilities to achieve our strategy
  • Topic: Where is data trapped in content and how do we get it out?

Session Leader

  • Michael Holloway


  • Kate Chapman


Wikipedia and other Wikimedia projects contain lots of potentially structurable data that is represented in an unstructured way in wikitext or HTML. With new data capabilities being developed, we can now represent this data in a structured way. The purpose of this session is to identify data that is already present, but trapped in unstructured content, and to explore techniques for modeling and extracting that data.

Questions to answer during this session

QuestionSignificance: Why is this question important? What is blocked by it remaining unanswered?
What types of data are currently stored in content that should be extracted and stored separately? What type of data is metadata and which is data to be composed into content? (Specifically discuss Categories and Infoboxes.)Identifying data within HTML content that we want extract into structured data is the first step in adding more semantic information about our content. This allows us to plan for the types of data that we want to store and design ways to extract the data.
Should the data you identified be stored on the host wiki or should it be stored on Wikidata? How do you decide this?It is unclear where a lot of data should be stored and how we make this decision. Answering this allows us to plan where to store such data and provide future guidance to others.
Which types of data that were identified must support versioning?Knowing which type of data must support versioning allows us to make decisions on how to store it and assess its impact on infrastructure.
Do you anticipate having difficulties automating the extraction of any of the data that you have identified? Do you anticipate having difficulties modeling any of the data that you have identified? Why?Identifying data that has the potential for being difficult to extract or model will help us plan and prioritize extracting this data.

Facilitator and Scribe notes

Facilitator reminders

Session Structure

10:00 - 10:05 - Break Participants into Small Groups
10:05 - 10:15 - Brainstorm different data in our content that could be structured and stored. Write each type on a sticky note.
10:15 - 10:20 - Group reporters report out the different data, to be added to master list in front of room for the whole group, clustering similar items.
10:20 - 10:40 - Discuss questions in groups:

  • Store the data identified in Wikidata or on the host wiki?
  • Which data must support versioning?
  • What difficulties do you anticipate for modeling/extracting any of these?

10:40 - 10:55 - Groups report their conclusions to the full group (5 mins each)
10:55 - 11:00 - Full group discussion for any time remaining


  • ...

Session Leaders please:

  • Add more details to this task description.
  • Coordinate any pre-event discussions (here on Phab, IRC, email, hangout, etc).
  • Outline the plan for discussing this topic at the event.
  • Optionally, include what it will not try to solve.
  • Update this task with summaries of any pre-event discussions.
  • Include ways for people not attending to be involved in discussions before the event and afterwards.

Post-event Summary:

  • ...

Action items:

  • ...

Event Timeline

kchapman renamed this task from Wikimedia Technical Conference 2018 Session - Where is data trapped in content and how do we get it out? to Wikimedia Technical Conference 2018 Session - Identifying and extracting data trapped in our content.Oct 3 2018, 2:42 AM
debt added a subscriber: Mhollo.
debt updated the task description. (Show Details)
debt edited subscribers, added: kchapman, Halfak; removed: Mholloway.
Mholloway added subscribers: Mholloway, Mhollo.

Please use my @Mholloway account and not @Mhollo. The latter is an account I created solely for testing Phab code reviews and would delete if Phabricator afforded that functionality. Thank you.