Session Themes and Topics
- Theme: Increasing our technical capabilities to achieve our strategy
- Topic: Where is data trapped in content and how do we get it out?
- Michael Holloway
- Kate Chapman
Wikipedia and other Wikimedia projects contain lots of potentially structurable data that is represented in an unstructured way in wikitext or HTML. With new data capabilities being developed, we can now represent this data in a structured way. The purpose of this session is to identify data that is already present, but trapped in unstructured content, and to explore techniques for modeling and extracting that data.
Questions to answer during this session
|Question||Significance: Why is this question important? What is blocked by it remaining unanswered?|
|What types of data are currently stored in content that should be extracted and stored separately? What type of data is metadata and which is data to be composed into content? (Specifically discuss Categories and Infoboxes.)||Identifying data within HTML content that we want extract into structured data is the first step in adding more semantic information about our content. This allows us to plan for the types of data that we want to store and design ways to extract the data.|
|Should the data you identified be stored on the host wiki or should it be stored on Wikidata? How do you decide this?||It is unclear where a lot of data should be stored and how we make this decision. Answering this allows us to plan where to store such data and provide future guidance to others.|
|Which types of data that were identified must support versioning?||Knowing which type of data must support versioning allows us to make decisions on how to store it and assess its impact on infrastructure.|
|Do you anticipate having difficulties automating the extraction of any of the data that you have identified? Do you anticipate having difficulties modeling any of the data that you have identified? Why?||Identifying data that has the potential for being difficult to extract or model will help us plan and prioritize extracting this data.|
Facilitator and Scribe notes
10:00 - 10:05 - Break Participants into Small Groups
10:05 - 10:15 - Brainstorm different data in our content that could be structured and stored. Write each type on a sticky note.
10:15 - 10:20 - Group reporters report out the different data, to be added to master list in front of room for the whole group, clustering similar items.
10:20 - 10:40 - Discuss questions in groups:
- Store the data identified in Wikidata or on the host wiki?
- Which data must support versioning?
- What difficulties do you anticipate for modeling/extracting any of these?
10:40 - 10:55 - Groups report their conclusions to the full group (5 mins each)
10:55 - 11:00 - Full group discussion for any time remaining
Session Leaders please:
- Add more details to this task description.
- Coordinate any pre-event discussions (here on Phab, IRC, email, hangout, etc).
- Outline the plan for discussing this topic at the event.
- Optionally, include what it will not try to solve.
- Update this task with summaries of any pre-event discussions.
- Include ways for people not attending to be involved in discussions before the event and afterwards.