
Use user-maintained bot run mode to gain stats and learn
Open, Needs Triage · Public

Description

The bot will have (at least) 2 run modes (see T137557):

  • auto: be conservative - no false positives, so as not to annoy Commons users with unreliable bot work (and not to give the maintainer a lot of work fixing things)
  • user-maintained: be more experimental - show ALL possible results (no matter how significant) and let the user decide which ones are valid
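
As a rough illustration of how the two modes differ, here is a minimal sketch. The names `Result` and `filter_results` and the 0.9 threshold are hypothetical and not taken from the bot's code; it only assumes that every result carries some confidence score.

```python
from dataclasses import dataclass

# Hypothetical names throughout: Result, filter_results and the 0.9
# threshold are illustrative and not taken from the bot's code.


@dataclass
class Result:
    label: str         # e.g. a proposed category
    confidence: float  # detector score in [0, 1]


def filter_results(results, mode, auto_threshold=0.9):
    """Return the results a given run mode would act on or display."""
    ranked = sorted(results, key=lambda r: r.confidence, reverse=True)
    if mode == "auto":
        # Conservative: keep only high-confidence results.
        return [r for r in ranked if r.confidence >= auto_threshold]
    if mode == "user-maintained":
        # Experimental: show everything; the user decides what is valid.
        return ranked
    raise ValueError(f"unknown run mode: {mode!r}")
```

In this sketch the only thing separating the modes is the cut-off, so anything learned about a good cut-off during user-maintained runs would carry over directly to auto mode.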

The idea now is to use the user-maintained mode to learn more about how to run the auto mode.

  1. Is it possible to gather stats during user-maintained runs that allow us to develop a better (hard-coded) bot?
  2. Is it possible to use machine learning during user-maintained runs to train the bot according to user wishes or in general (per-user or global "config" files)?
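
For question 1, one possible approach (a sketch under assumptions, not a settled design) would be to log each user decision together with the result's confidence and derive a hard-coded auto-mode threshold from that log. The function `pick_auto_threshold` and the `(confidence, accepted)` log format are hypothetical.

```python
def pick_auto_threshold(decisions):
    """Lowest confidence above which the user accepted every result.

    decisions: iterable of (confidence, accepted) pairs logged during
    user-maintained runs (hypothetical log format).
    Returns None if no such threshold exists.
    """
    decisions = list(decisions)
    rejected = [c for c, ok in decisions if not ok]
    highest_rejected = max(rejected, default=float("-inf"))
    # Accepted results that score strictly above every rejected one.
    safe = [c for c, ok in decisions if ok and c > highest_rejected]
    return min(safe) if safe else None
```

A threshold chosen this way produces no false positives on the logged decisions; whether it generalizes to new files depends on how representative the log is.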

Event Timeline

DrTrigon created this task. Jun 17 2016, 9:52 PM
Restricted Application added subscribers: Zppix, Aklapper. Jun 17 2016, 9:52 PM

This is an interesting question, and the major issue I see here is that the user's computer will hang if we do use it.

So, the best method may be to do what a lot of software does: ask "Would you like to send usage statistics to the owner to make the software better?"

Then, in the next release, use that information to create a more comprehensive training set.
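
A minimal sketch of such an opt-in gate could look like the following. The class `UsageStats` and its methods are hypothetical; how the exported payload would reach the maintainers is deliberately left out.

```python
import json


class UsageStats:
    """Collects events only after the user has opted in (hypothetical helper)."""

    def __init__(self, opted_in=False):
        self.opted_in = opted_in
        self.events = []

    def record(self, event):
        # Drop the event silently unless the user agreed to share statistics.
        if self.opted_in:
            self.events.append(dict(event))

    def export(self):
        # Serialized payload that could be sent to the maintainers;
        # the transport itself is out of scope here.
        return json.dumps(self.events, sort_keys=True)
```

Defaulting `opted_in` to `False` means nothing is collected unless the user explicitly agrees.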

Note that a larger training set can often make the learning agent worse. It basically depends on where you want the hyperplane to be drawn, etc. So the training set *needs* to be well curated.
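
One simple way to curate collected decisions (an illustrative heuristic, not something prescribed in this thread) is to keep only items on which several users clearly agree. The function `curate`, its `(item_id, label)` input format, and its default thresholds are assumptions.

```python
from collections import Counter, defaultdict


def curate(votes, min_votes=2, min_agreement=0.8):
    """Keep items whose majority label is backed by enough agreeing votes.

    votes: iterable of (item_id, label) pairs (hypothetical format).
    Returns {item_id: majority_label} for the items that pass.
    """
    by_item = defaultdict(list)
    for item_id, label in votes:
        by_item[item_id].append(label)

    curated = {}
    for item_id, labels in by_item.items():
        label, count = Counter(labels).most_common(1)[0]
        # Require a minimum number of votes and a clear majority.
        if len(labels) >= min_votes and count / len(labels) >= min_agreement:
            curated[item_id] = label
    return curated
```

Items with too few votes or contradictory votes are dropped rather than added as noisy training samples.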

On 18 June 2016 05:31:48 CEST, AbdealiJK <no-reply@phabricator.wikimedia.org> wrote:

> AbdealiJK added a comment.
> This is an interesting question, and the major issue I see here is that
> the user's computer will hang if we do use it.
> So, the best method may be to do what a lot of software does: ask
> "Would you like to send usage statistics to the owner to make the
> software better?"
> Then, in the next release, use that information to create a more
> comprehensive training set.

I think that could add value and allow us to tune parameters on a much larger database (user experience) and a wider range of file formats than just our own.
So we should formulate the questions we want to answer and then work out which stats we have to store in order to answer them. Everything must be anonymized.
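
One conservative way to approach the anonymization requirement is an allowlist: store only fields that are known to be needed and drop everything else. The field names below are hypothetical, and whether the remaining fields are truly non-identifying would still need to be checked against real data.

```python
# Hypothetical field names; only these are ever stored.
ALLOWED_FIELDS = ("file_format", "proposed_category", "confidence", "accepted")


def anonymize(record):
    """Return a copy of record restricted to the allowlisted fields."""
    return {key: record[key] for key in ALLOWED_FIELDS if key in record}
```

With an allowlist, a field that is added to the raw records later (a user name, a file title) is excluded by default instead of leaking into the stats.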

> Note that a larger training set can often make the learning agent
> worse. It basically depends on where you want the hyperplane to be
> drawn, etc. So the training set *needs* to be well curated.

Indeed, that is an important point. Just adding a lot of noise makes it worse. The training set has to suit the data.
Knowing which persons appear in the data and using that for training might be appropriate for a face matcher, like http://docs.opencv.org/master/dc/dc3/tutorial_py_matcher.html#gsc.tab=0

Dr. Trigon

DrTrigon moved this task from Backlog to GSoC on the User-DrTrigon board.