
Wikimedia Technical Conference 2018 Session - Integrating machine learning into our products
Open, Needs Triage · Public

Description

Session Themes and Topics

  • Theme: Defining our products, users and use cases
  • Topic: How do we integrate Machine Learning into our products?

Session Leader

  • Aaron Halfaker

Facilitator

  • Kate Chapman

Description

This session looks at the use of machine learning and other types of automated assessment in the Wikimedia ecosystem. We’ll discuss what Wikimedia needs to do in order to embrace the challenges of operating infrastructure for machine learning, and the interface between long-term maintenance of AI services and new product development.

Keep in mind:

  • We'll be covering a wide range of topics, from ecosystems to funding technology teams. Machine learning is a subject of discussion, but we'll have no time to discuss technical details.

Desired Outcomes:

  • A common understanding of what investments in AI will cost us
  • Alignment on which AIs to invest in next

Questions to answer during this session

For each question below, the significance explains why the question is important and what is blocked by it remaining unanswered.

Question: What is an ecosystem? What’s a technology ecosystem? What makes an ecosystem healthy?
Significance: We talk about our “technology ecosystem”, but does anyone really understand what an ecosystem is, how ecosystems operate, and what their constraints are? Answering this question lets us develop a common language and a common understanding of what technical ecosystem health looks like.

Question: Where has ML been used within the Wikimedia ecosystem? What are some successes we can be inspired by? What kinds of predictions/assessments/rankings do we want to have access to next?
Significance: Machine learning is a relatively new technology, and most people don’t understand what it is or what it can do for them. By discussing the impacts that ML has already had, participants will gain a grasp of what ML has to offer and why it may be worth a substantial investment of time and resources. Examples include simple classifiers (ORES), similarity indexes (Elasticsearch), and the merging of the two (learning to rank, LTR). Is the next step general recommender infrastructure? Image processing? Knowledge integrity? What do we need to do in the next 5 years?

Question: What does ML cost? What kind of time and resources do we need to make ML sustainable?
Significance: ML might seem like magic, but it’s definitely not free. ORES and the Scoring Platform team are an example of what it takes to invest in ML infrastructure. Knowing what it costs to maintain an ML service can help us plan our investments wisely. It can also help us avoid under-investing and thus building on weak foundations.

Question: How do we integrate automated assessments into the wiki interface? What concerns present themselves when machines begin to encroach on subjective judgement?
Significance: Automated analysis isn’t particularly useful on its own; it’s a tool designed to make life easier for the wiki communities. To achieve this, the outputs of these tools need to be meaningful and embedded in human-machine processes that our users will act out. How do we fit machines into current workflows, or use them to enable new workflows? How do we deal with the issues that will inevitably arise from having a machine take on roles that were once purely human? These are questions we must answer in order to proceed with augmented product development.

Facilitator and Scribe notes

https://docs.google.com/document/d/1Q7FvPUw6S1SNLkAbPnwKsDKyKa4ggR4BHhSL6VQJw3Y

Facilitator reminders

https://www.mediawiki.org/wiki/Wikimedia_Technical_Conference/2018/Session_Guide#Session_Guidance_for_facilitators

Resources:

Session Structure

  • Brief keynote by @Halfak (Ecosystems, AIs, ORES, and investments in the Technology dept.)
  • Break out groups to tackle questions
    • What is an ecosystem? What’s a technology ecosystem? What makes an ecosystem healthy?
    • Where has ML been used within the Wikimedia ecosystem? What next?
    • What does ML cost? What kind of time and resources do we need to make ML sustainable?
    • How do we integrate automated assessments into the wiki processes?

Session Leaders please:

  • Add more details to this task description.
  • Coordinate any pre-event discussions (here on Phab, IRC, email, hangout, etc).
  • Outline the plan for discussing this topic at the event.
  • Optionally, include what it will not try to solve.
  • Update this task with summaries of any pre-event discussions.
  • Include ways for people not attending to be involved in discussions before the event and afterwards.

Post-event Summary:

  • ...

Action items:

  • ...

Event Timeline

debt created this task. · Oct 2 2018, 10:54 PM
Restricted Application added a subscriber: Aklapper. · Oct 2 2018, 10:54 PM
kchapman renamed this task from Wikimedia Technical Conference 2018 Session - How do we integrate Machine Learning into our products? to Wikimedia Technical Conference 2018 Session - Integrating machine learning into our products. · Oct 3 2018, 2:40 AM
debt updated the task description. · Oct 4 2018, 11:50 PM
debt added a subscriber: Halfak.
debt updated the task description. · Oct 4 2018, 11:52 PM
ssastry assigned this task to Halfak. · Oct 5 2018, 9:50 PM
debt updated the task description. · Oct 10 2018, 2:04 PM
Halfak updated the task description. · Oct 16 2018, 3:04 PM
debt updated the task description. · Oct 17 2018, 11:37 PM
debt edited subscribers, added: kchapman; removed: Halfak.
Quiddity updated the task description. · Oct 20 2018, 12:13 AM
Halfak updated the task description. · Oct 23 2018, 4:51 PM


Capankajsmilyo added a comment. (Edited) · Oct 31 2018, 11:24 AM

Open-sourcing trained datasets can revolutionise the field of AI. Currently, object recognition, landmark recognition, face recognition, animal recognition, plant recognition, etc. are the specialisation of a few big players because of their ownership of data. Wikipedia has a huge database of such data on Commons, and if it open-sourced the trained datasets, this could bring huge improvement and development to the AI landscape worldwide. This could be a huge driver of innovation and modernisation for humanity as a whole.

I removed the tag by mistake; can someone please restore it? The mobile interface for Phabricator might need UX improvements.

Halfak added a subscriber: DarTar. · Oct 31 2018, 2:53 PM

I agree that we have a great opportunity in open access labeled data. @DarTar has been working to secure funding for development of our label gathering strategy and for hiring a program manager to manage outreach for labeling activities.

> I agree that we have a great opportunity in open access labeled data. @DarTar has been working to secure funding for development of our label gathering strategy and for hiring a program manager to manage outreach for labeling activities.

That's awesome. Besides labelled data, the training output can also be open-sourced. As far as I know (I'm just a beginner), ML involves generating a JSON schema by feeding labeled training data into an ML algorithm (let's call that the training output), then using that JSON (the training output) to predict or classify something.

Halfak added a comment. · Nov 1 2018, 3:24 PM

When you say "training output", are you referring to the trained model (an algorithm) or the predictions made by the trained model (new theoretical data)?

We've been producing some datasets on my team, e.g. https://figshare.com/articles/Monthly_Wikipedia_article_quality_predictions/3859800 Is this the kind of thing that you have in mind?

> When you say "training output", are you referring to the trained model (an algorithm) or the predictions made by the trained model (new theoretical data)?
> We've been producing some datasets on my team. E.g. https://figshare.com/articles/Monthly_Wikipedia_article_quality_predictions/3859800 Is this the kind of thing that you have in mind?

Not the prediction, but the model. Let me try to clarify with examples.

In https://github.com/BrainJS/brain.js/blob/master/README.md, see the JSON section.

const json = net.toJSON();

This is what I am suggesting we open-source. You can use this file directly in the run method.

Similarly, in https://github.com/anubhavshrimal/FaceRecognition/blob/master/README.md, classifier.pkl is what I'm talking about. You can feed this file directly into predict.py.

In the first example (brain.js), the data fed into the train method is what I'm calling the training data, and the net variable after that is the training output.

In the second example, train.py takes training data as input and produces classifier.pkl as the training output.
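The train-then-serialize workflow described above can be sketched in plain Python. This is a toy nearest-centroid classifier standing in for a real model like classifier.pkl; the class, feature values, and labels are made up for illustration, and pickle plays the role that net.toJSON() plays in the brain.js example.

```python
# Toy illustration of "training data in, serialized model out":
# the trained model (the "training output"), not the raw data or the
# predictions, is the artifact that gets saved and shared.
import pickle


class NearestCentroid:
    """Minimal classifier: predicts the label of the closest class centroid."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for features, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(features))
            for i, value in enumerate(features):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        # Average each class's feature vectors to get its centroid.
        self.centroids = {
            label: [v / counts[label] for v in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, features):
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(features, centroid))

        return min(self.centroids, key=lambda label: sq_dist(self.centroids[label]))


# "Training data": feature vectors plus labels.
X = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]]
y = ["low", "low", "high", "high"]

# The fitted model is the "training output".
model = NearestCentroid().fit(X, y)

# Serialize the trained model -- this blob is what could be open-sourced.
blob = pickle.dumps(model)

# Anyone with the blob can reload it and predict without the training data.
reloaded = pickle.loads(blob)
print(reloaded.predict([0.95, 1.0]))  # prints "high"
```

The same shape applies to both examples in the thread: brain.js serializes the trained network to JSON, and the face-recognition repo serializes its classifier with pickle; in each case, publishing the serialized model lets others run predictions without access to the original training data.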

What was the result?

The result is captured in the themed etherpads linked above. See also the notes I posted on Oct 27th: https://www.mediawiki.org/w/index.php?title=Wikimedia_Technical_Conference/2018/Session_notes/Integrating_machine_learning_into_our_products