Maniphest T206073

Wikimedia Technical Conference 2018 Session - Identifying and extracting data trapped in our content
Closed, Resolved · Public

Assigned To

Authored By

	debt
	Oct 2 2018, 10:57 PM

Description

Session Themes and Topics

Theme: Increasing our technical capabilities to achieve our strategy
Topic: Where is data trapped in content and how do we get it out?

Session Leader

Michael Holloway

Facilitator

Kate Chapman

Description

Wikipedia and other Wikimedia projects contain lots of potentially structurable data that is represented in an unstructured way in wikitext or HTML. With new data capabilities being developed, we can now represent this data in a structured way. The purpose of this session is to identify data that is already present, but trapped in unstructured content, and to explore techniques for modeling and extracting that data.

Questions to answer during this session

Each question is followed by its significance: why the question is important, and what is blocked by it remaining unanswered.

Question: What types of data are currently stored in content that should be extracted and stored separately? Which data is metadata, and which is data to be composed into content? (Specifically discuss Categories and Infoboxes.)
Significance: Identifying data within HTML content that we want to extract into structured data is the first step in adding more semantic information about our content. This allows us to plan for the types of data that we want to store and design ways to extract the data.

Question: Should the data you identified be stored on the host wiki or should it be stored on Wikidata? How do you decide this?
Significance: It is unclear where a lot of data should be stored and how we make this decision. Answering this allows us to plan where to store such data and provide future guidance to others.

Question: Which types of data that were identified must support versioning?
Significance: Knowing which types of data must support versioning allows us to make decisions on how to store them and assess their impact on infrastructure.

Question: Do you anticipate having difficulties automating the extraction of any of the data that you have identified? Do you anticipate having difficulties modeling any of the data that you have identified? Why?
Significance: Identifying data that has the potential for being difficult to extract or model will help us plan and prioritize extracting this data.
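
As a rough illustration of the kind of extraction the first question has in mind, the Python sketch below pulls category names and top-level infobox parameters out of a wikitext string. It is only a sketch under stated assumptions: the function names (extract_categories, extract_infobox, split_top_level), the sample article text, and the regex plus brace-counting approach are all hypothetical and were not proposed in the session. Real wikitext (nested parser functions, comments, nowiki sections, localized namespace names) would need a proper parser.

```python
import re

# Matches [[Category:Name]] and [[Category:Name|sort key]] (English namespace only).
CATEGORY_RE = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]", re.IGNORECASE
)


def extract_categories(wikitext):
    """Return category names in order of appearance, without sort keys."""
    return [m.group(1).strip() for m in CATEGORY_RE.finditer(wikitext)]


def split_top_level(body):
    """Split on '|' characters that are not inside nested {{...}} or [[...]]."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    return parts


def extract_infobox(wikitext):
    """Return the named parameters of the first {{Infobox ...}} as a dict."""
    start = wikitext.find("{{Infobox")
    if start == -1:
        return {}
    # Walk forward counting braces to find the matching closing '}}'.
    depth, i, end = 0, start, None
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                end = i
                break
        else:
            i += 1
    if end is None:
        return {}
    body = wikitext[start + 2:end - 2]
    params = {}
    for part in split_top_level(body)[1:]:  # first part is the template name
        if "=" in part:
            key, value = part.split("=", 1)
            params[key.strip()] = value.strip()
    return params


# Hypothetical sample article text, for illustration only.
SAMPLE = """{{Infobox settlement
| name = Example Town
| population_total = 12345
| leader = [[Jane Doe|J. Doe]]
}}
'''Example Town''' is a town.
[[Category:Towns in Exampleland]]
[[Category:Populated places|Example]]
"""

print(extract_categories(SAMPLE))
print(extract_infobox(SAMPLE))
```

Note that the extracted values are still raw wikitext (the leader value keeps its link markup, and the population is a string), which hints at the modeling difficulties raised in the last question: pulling the text out is only the first step toward typed, structured data.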

Facilitator and Scribe notes

https://docs.google.com/document/d/1dnNQTvbRFkYdM1q-eNGnARVDnQI49OLRV2YUcFABKso

Facilitator reminders

https://www.mediawiki.org/wiki/Wikimedia_Technical_Conference/2018/Session_Guide#Session_Guidance_for_facilitators

Session Structure

10:00 - 10:05 - Break Participants into Small Groups
10:05 - 10:15 - Brainstorm different data in our content that could be structured and stored. Write each type on a sticky note.
10:15 - 10:20 - Group reporters report out the different data, to be added to a master list at the front of the room for the whole group, clustering similar items.
10:20 - 10:40 - Discuss questions in groups:

Should the data identified be stored in Wikidata or on the host wiki?
Which data must support versioning?
What difficulties do you anticipate for modeling/extracting any of these?

10:40 - 10:55 - Groups report their conclusions to the full group (5 mins each)
10:55 - 11:00 - Full group discussion for any time remaining

Resources:

Session Leaders please:

Add more details to this task description.
Coordinate any pre-event discussions (here on Phab, IRC, email, hangout, etc).
Outline the plan for discussing this topic at the event.
Optionally, include what it will not try to solve.
Update this task with summaries of any pre-event discussions.
Include ways for people not attending to be involved in discussions before the event and afterwards.

Post-event Summary:

Action items:

Event Timeline

debt created this task. Oct 2 2018, 10:57 PM

Restricted Application added a subscriber: Aklapper. Oct 2 2018, 10:57 PM

kchapman renamed this task from "Wikimedia Technical Conference 2018 Session - Where is data trapped in content and how do we get it out?" to "Wikimedia Technical Conference 2018 Session - Identifying and extracting data trapped in our content". Oct 3 2018, 2:42 AM

Rfarrand moved this task from Backlog to Core Session on the Wikimedia-Technical-Conference-2018 board. Oct 4 2018, 6:18 PM

debt updated the task description. Oct 4 2018, 11:54 PM

debt added a subscriber: Mhollo.

Mholloway removed a subscriber: Mhollo. Oct 5 2018, 5:28 AM

Mholloway subscribed.

debt updated the task description. Oct 10 2018, 1:59 PM

debt assigned this task to Mhollo. Oct 17 2018, 10:50 PM

debt updated the task description.

debt edited subscribers, added: kchapman, Halfak; removed: Mholloway.

Please use my @Mholloway account and not @Mhollo. The latter is an account I created solely for testing Phab code reviews and would delete if Phabricator afforded that functionality. Thank you.

Mholloway removed a subscriber: Mhollo. Oct 18 2018, 4:35 PM

Quiddity updated the task description. Oct 20 2018, 12:32 AM

Tgr subscribed. Oct 21 2018, 1:44 AM

kchapman updated the task description. Oct 21 2018, 9:02 PM

Attached images (JPEG, 2 MB each; Oct 25 2018, 12:17 AM):
F26778037: 0984C871-3105-4CFC-A82D-F16705E2A919.jpeg
F26778036: 738AD817-02ED-44A5-BBBE-B01C2797CADF.jpeg
F26778035: A001A882-56AF-4A67-B113-CF9882B94214.jpeg
F26778034: 8EACAAE3-DF04-482C-96A7-7E406E088749.jpeg
F26778033: 002085B9-9880-4CA1-8CEC-7CCA725CF3D5.jpeg
F26778032: D5B4F77C-E1DD-4150-AB45-F75DBA3499E4.jpeg
F26778031: 7FA4FF7E-E761-4071-AC99-612AAEFD33BF.jpeg

Notes on wiki: https://www.mediawiki.org/wiki/Wikimedia_Technical_Conference/2018/Session_notes/Identifying_and_extracting_data_trapped_in_our_content

Amire80 added a project: Language-strategy. Dec 18 2018, 2:58 PM

Mholloway closed this task as Resolved. Mar 29 2019, 5:34 PM