
Wikimedia Technical Conference 2018 Session - Integrating machine learning into our products
Open, Needs Triage · Public

Description

Session Themes and Topics

  • Theme: Defining our products, users and use cases
  • Topic: How do we integrate Machine Learning into our products?

Session Leader

  • Aaron Halfaker

Facilitator

  • Kate Chapman

Description

This session looks at the use of machine learning and other types of automated assessment in the Wikimedia ecosystem. We’ll discuss what Wikimedia needs to do in order to embrace the challenges of operating infrastructure for machine learning, and the interface between long-term maintenance of AI services and new product development.

Keep in mind:

  • We'll be covering a wide range of topics, from ecosystems to funding technology teams. Machine learning is a subject of discussion, but we'll have no time to discuss technical details.

Desired Outcomes:

  • A common understanding of what investments in AI will cost us
  • Alignment on which AIs to invest in next

Questions to answer during this session

For each question below, the significance explains why the question is important and what is blocked by it remaining unanswered.

Question: What is an ecosystem? What’s a technology ecosystem? What makes an ecosystem healthy?
Significance: We talk about our “technology ecosystem”, but does anyone really understand what an ecosystem is, how ecosystems operate, and what their constraints are? Answering this question lets us develop a common language and a common understanding of what technical ecosystem health looks like.

Question: Where has ML been used within the Wikimedia ecosystem? What are some successes we can be inspired by? What kinds of predictions/assessments/rankings do we want to have access to next?
Significance: Machine learning is a relatively new technology, and most people don’t understand what it is or what it can do for them. By discussing the impacts that ML has already had, participants will gain a grasp of what ML has to offer and why it may be worth a substantial investment of time and resources. Examples include simple classifiers (ORES), similarity indexes (Elasticsearch), and the merging of the two (learning to rank, LTR). Is the next step general recommender infrastructure? Image processing? Knowledge integrity? What do we need to do in the next 5 years?

Question: What does ML cost? What kind of time and resources do we need to make ML sustainable?
Significance: ML might seem like magic, but it’s definitely not free. ORES and the Scoring Platform team are an example of what it takes to invest in ML infrastructure. Knowing what it costs to maintain an ML service can help us plan our investments wisely. It can also help us avoid under-investing and thus building on weak foundations.

Question: How do we integrate automated assessments into the wiki interface? What concerns present themselves when machines begin to encroach on subjective judgement?
Significance: Automated analysis isn’t particularly useful on its own; it’s a tool designed to make life easier for the wiki communities. To achieve this, the outputs of these tools need to be meaningful and embedded in human-machine processes that our users will act out. How do we fit machines into current workflows, or use them to enable new workflows? How do we deal with the issues that will inevitably arise from having a machine take on roles that were once purely human? These are questions we must answer in order to proceed with augmented product development.

Facilitator and Scribe notes

https://docs.google.com/document/d/1Q7FvPUw6S1SNLkAbPnwKsDKyKa4ggR4BHhSL6VQJw3Y

Facilitator reminders

https://www.mediawiki.org/wiki/Wikimedia_Technical_Conference/2018/Session_Guide#Session_Guidance_for_facilitators

Resources:

Session Structure

  • Brief keynote by @Halfak (Ecosystems, AIs, ORES, and investments in the Technology dept.)
  • Break out groups to tackle questions
    • What is an ecosystem? What’s a technology ecosystem? What makes an ecosystem healthy?
    • Where has ML been used within the Wikimedia ecosystem? What next?
    • What does ML cost? What kind of time and resources do we need to make ML sustainable?
    • How do we integrate automated assessments into the wiki processes?

Session Leaders please:

  • Add more details to this task description.
  • Coordinate any pre-event discussions (here on Phab, IRC, email, hangout, etc).
  • Outline the plan for discussing this topic at the event.
  • Optionally, include what it will not try to solve.
  • Update this task with summaries of any pre-event discussions.
  • Include ways for people not attending to be involved in discussions before the event and afterwards.

Post-event Summary:

  • ...

Action items:

  • ...

Event Timeline

debt created this task. · Oct 2 2018, 10:54 PM
Restricted Application added a subscriber: Aklapper. · Oct 2 2018, 10:54 PM
kchapman renamed this task from Wikimedia Technical Conference 2018 Session - How do we integrate Machine Learning into our products? to Wikimedia Technical Conference 2018 Session - Integrating machine learning into our products. · Oct 3 2018, 2:40 AM
debt updated the task description. · Oct 4 2018, 11:50 PM
debt added a subscriber: Halfak.
debt updated the task description. · Oct 4 2018, 11:52 PM
ssastry assigned this task to Halfak. · Oct 5 2018, 9:50 PM
debt updated the task description. · Oct 10 2018, 2:04 PM
Halfak updated the task description. · Oct 16 2018, 3:04 PM
debt updated the task description. · Oct 17 2018, 11:37 PM
debt edited subscribers, added: kchapman; removed: Halfak.
Quiddity updated the task description. · Oct 20 2018, 12:13 AM
Halfak updated the task description. · Oct 23 2018, 4:51 PM


Capankajsmilyo added a comment. (Edited) · Oct 31 2018, 11:24 AM

Open-sourcing trained datasets can revolutionise the field of AI. Currently, object recognition, landmark recognition, face recognition, animal recognition, plant recognition, etc. are the specialisation of a few big players because of their ownership of data. Wikipedia has a huge database of such data on Commons, and if it open-sourced the trained datasets, this could bring huge improvement and development to the AI landscape worldwide. This could be a huge driver of innovation and modernisation for humanity as a whole.

I removed the tag by mistake; can someone please restore it? The mobile interface for Phabricator might need UX improvements.

Halfak added a subscriber: DarTar. · Oct 31 2018, 2:53 PM

I agree that we have a great opportunity in open access labeled data. @DarTar has been working to secure funding for development of our label gathering strategy and for hiring a program manager to manage outreach for labeling activities.

> I agree that we have a great opportunity in open access labeled data. @DarTar has been working to secure funding for development of our label gathering strategy and for hiring a program manager to manage outreach for labeling activities.

That's awesome. Besides labelled data, the training output can also be open-sourced. As far as I know (I'm just a beginner), ML involves generating a JSON schema by feeding labeled training data into an ML algorithm (let's call that the training output), then using that JSON (the training output) to predict or classify something.

Halfak added a comment. · Nov 1 2018, 3:24 PM

When you say "training output", are you referring to the trained model (an algorithm) or the predictions made by the trained model (new theoretical data)?

We've been producing some datasets on my team, e.g. https://figshare.com/articles/Monthly_Wikipedia_article_quality_predictions/3859800 Is this the kind of thing that you have in mind?

> When you say "training output", are you referring to the trained model (an algorithm) or the predictions made by the trained model (new theoretical data)?
> We've been producing some datasets on my team. E.g. https://figshare.com/articles/Monthly_Wikipedia_article_quality_predictions/3859800 Is this the kind of thing that you have in mind?

Not the prediction, but the model. Let me try to clarify with examples.

In https://github.com/BrainJS/brain.js/blob/master/README.md, see the JSON section.

const json = net.toJSON();

This is what I am suggesting we open-source. You can use this file directly in the run method.

Similarly, in https://github.com/anubhavshrimal/FaceRecognition/blob/master/README.md, classifier.pkl is what I'm talking about. You can feed this file directly into predict.py.

In the first example (brain.js), the data fed into the train method is what I'm calling the training data, and the net variable after that is the training output.

In the second example, train.py takes training data as input and produces classifier.pkl as the training output.
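The train-then-serialize workflow described above can be sketched in plain Python. This is a toy nearest-centroid classifier standing in for a real model like classifier.pkl; the class, feature values, and labels are made up for illustration, and pickle plays the role that net.toJSON() plays in the brain.js example.

```python
# Toy illustration of "training data in, serialized model out":
# the trained model (the "training output"), not the raw data or the
# predictions, is the artifact that gets saved and shared.
import pickle


class NearestCentroid:
    """Minimal classifier: predicts the label of the closest class centroid."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for features, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(features))
            for i, value in enumerate(features):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        # Average each class's feature vectors to get its centroid.
        self.centroids = {
            label: [v / counts[label] for v in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, features):
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(features, centroid))

        return min(self.centroids, key=lambda label: sq_dist(self.centroids[label]))


# "Training data": feature vectors plus labels.
X = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]]
y = ["low", "low", "high", "high"]

# The fitted model is the "training output".
model = NearestCentroid().fit(X, y)

# Serialize the trained model -- this blob is what could be open-sourced.
blob = pickle.dumps(model)

# Anyone with the blob can reload it and predict without the training data.
reloaded = pickle.loads(blob)
print(reloaded.predict([0.95, 1.0]))  # prints "high"
```

The same shape applies to both examples in the thread: brain.js serializes the trained network to JSON, and the face-recognition repo serializes its classifier with pickle; in each case, publishing the serialized model lets others run predictions without access to the original training data.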

What was the result?

The result is captured in the themed etherpads linked above. See also the notes I posted on Oct 27th: https://www.mediawiki.org/w/index.php?title=Wikimedia_Technical_Conference/2018/Session_notes/Integrating_machine_learning_into_our_products