Configure ORES to publish new drafttopic scores to Kafka
Closed, Resolved (Public)

Description

As the first step of getting ORES drafttopic scores into ElasticSearch, they need to be somewhere in reach of the analytics infrastructure. The easiest approach is pushing them to some Kafka topic, which is already done for some other ORES models.

Event Timeline

I think if drafttopic is added to the list of 'precache' scores for changeprop, it will automatically get added. Ping @Pchelolo

We have the 'revision-score' topic, where an event is pushed on every page edit via https://github.com/wikimedia/change-propagation/blob/master/sys/ores_updates.js, which calls a precache URI in ORES. Do we need the scores in a new topic? Can we just add them to this event?

I think (hope!) this event will be fine!

If you just put this score into the existing topic, it will show up in the event.mediawiki_revision_score hive table.

Thanks both! Where is the list of scores defined? I don't see it either in the EventBus emitter or the changeprop config or the ORES request logic (which, if I trace things correctly, just sends the revision data to the ORES /v3/precache endpoint, with no further arguments). So presumably this would only have to be changed in the ORES config?

yeah. AFAIK it's set up in the ORES config. @Ladsgroup knows much more about this.

While the current schema seems flexible enough to support the use case, it is also a bit vague as to what the topics look like. What we've found so far are fairly free-form strings (e.g. "Culture.Language and literature"), which is quite error-prone. Having some kind of ID (topic1, topic2, etc.) and keeping the mapping to English (or other languages) outside of the technical pipeline would probably help reduce errors.

@Gehel not sure I totally understand your requirements, but maybe what you need can be done by filtering and transforming the mediawiki.revision-score stream (and/or Hive table?) into whatever you need?

Well, a string is "some kind of ID". What would be needed for a better one? Just having a more predictable character set (lowercase, dash instead of space etc)? Being published in some machine-readable format and location?
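
To make the "more predictable character set" option concrete, here is a minimal Python sketch. The function name normalize_topic and its rules (lowercase, dashes for spaces and underscores) are assumptions for illustration only; nothing in ORES is known to do this.

```python
import re

def normalize_topic(label: str) -> str:
    """Hypothetical normalizer: lowercase each dot-separated part and
    replace runs of whitespace or underscores with a single dash."""
    parts = label.split(".")
    return ".".join(re.sub(r"[\s_]+", "-", part.strip().lower()) for part in parts)

# Labels taken from the drafttopic output quoted in this task.
print(normalize_topic("Culture.Language and literature"))
print(normalize_topic("History_And_Society.Business and economics"))
```

Any such scheme would still need a published mapping back to the human-readable names.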

(As an aside, ORES/Draft topic could really use more information.)

In any case, this is a v2 problem, right? As in, we can roll out drafttopic scores the way they are, and then change the names of the topics in the future if we so wish (they will probably change occasionally anyway, as people figure out the level of detail that works best for users).
Or are there technical complications with the current names as far as the ElasticSearch part of the system is concerned (e.g. having to escape spaces)?

In terms of functionality, you can index the strings exactly as provided (previous example, Culture.Language and literature) and as long as we search for exactly that string it will be returned. For example, Culture:Language and literature will not match, and neither will Culture.Language and Literature or literature on its own. This could be changed, but the understanding so far was that these topics are unique identifiers that will be used as such. My interpretation of @Gehel's suggestion is that, as far as identifiers go, these seem very free-form. When an identifier is Q142342, every person who looks at it (hopefully) knows it must be reproduced exactly. When I look at Culture.Language and literature, it seems less like an identifier and more like a description.

But indeed, this is not a technical problem with respect to indexing or search, it will work as is.
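
The exact-match behaviour described above can be mimicked with a plain Python set. This is only an illustration of the matching semantics, not of how ElasticSearch is configured; the names indexed_topics and matches are made up for the sketch.

```python
# Topic labels as they appear in the drafttopic output quoted in this task.
indexed_topics = {"Culture.Language and literature", "Geography.Countries"}

def matches(query: str) -> bool:
    """Exact, case-sensitive comparison with no tokenization or normalization."""
    return query in indexed_topics

queries = [
    "Culture.Language and literature",  # identical string: matches
    "Culture:Language and literature",  # different separator: no match
    "Culture.Language and Literature",  # different capitalization: no match
    "literature",                       # partial string: no match
]
for query in queries:
    print(repr(query), matches(query))
```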

Halfak moved this task from Backlog/Lift Wing to Unsorted on the Machine-Learning-Team board.
Halfak added a subscriber: Halfak.

This is now done with T240609

For clarity, ORES now produces an event every time an article is edited in English Wikipedia, via the ChangeProp/precache mechanism.

Thanks so much @Halfak! Is this task complete then or is there anything else you need support on?

Just checked, and drafttopic scores are now in the event.mediawiki_revision_score table:

select scores["drafttopic"].prediction, count(*) as cnt
from event.mediawiki_revision_score
where scores["drafttopic"] IS NOT NULL and
size(scores["drafttopic"].prediction) > 0 
and year=2019 and month=12 and day=19 and hour=10
group by scores["drafttopic"].prediction
order by cnt desc limit 10;

...

prediction	cnt
["Culture.Internet culture"]	32
["Culture.Language and literature"]	9
["Geography.Countries"]	7
["STEM.Time"]	5
["History_And_Society.Business and economics"]	5
["Assistance.Maintenance","STEM.Time"]	5
["Culture.Language and literature","Geography.Countries"]	5
["STEM.Time","Geography.Countries"]	2
["STEM.Technology","Culture.Internet culture"]	2
["STEM.Medicine"]	2

Ottomata claimed this task.

hive (default)> select day, hour, count(*) from event.mediawiki_revision_score where scores["drafttopic"] IS NOT NULL and year=2019 and month=12 and ( day=18 or day=19) group by day, hour order by day, hour limit 10000;

day	hour	_c2
18	0	264
18	1	485
18	2	520
18	3	327
18	4	286
18	5	174
18	6	219
18	7	223
18	8	226
18	9	166
18	10	214
(snip)

hive (default)> select count(*) from event.mediawiki_revision_score where scores["drafttopic"] IS NOT NULL and year=2019 and month=12 and day=18;

6999

That seems too small - enwiki has 50K content space edits daily per https://stats.wikimedia.org/v2/#/en.wikipedia.org/content/edited-pages/normal|line|2019-11-07~2019-12-01|page_type~content|daily
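
As a back-of-the-envelope check of the gap, using the 6999 count from the query above and the approximate 50K daily figure (both numbers come from this thread; the variable names are just for illustration):

```python
scored_events = 6999          # rows with a drafttopic score on 2019-12-18
daily_content_edits = 50_000  # approximate enwiki content edits per day

coverage = scored_events / daily_content_edits
print(f"{coverage:.1%}")  # prints 14.0%
```

So only about one in seven content edits appears to carry a drafttopic score.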

Woops! You're right. I just manually sent an event and it looks like it isn't getting picked up.

$ python
Python 3.5.1+ (default, Mar 30 2016, 22:46:26) 
[GCC 5.3.1 20160330] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> import json
>>> event = json.loads('{"$schema":" <snip> "rev_content_changed":true}')
>>> response = requests.post("https://ores.wikimedia.org/v3/precache", json=event); print(response.text)
{
  "enwiki": {
    "models": {
      "damaging": {
        "version": "0.5.0"
      },
      "goodfaith": {
        "version": "0.5.0"
      }
    },
    "scores": {
      "931601462": {
        "damaging": {
          "score": {
            "prediction": false,
            "probability": {
              "false": 0.8946838511642203,
              "true": 0.10531614883577967
            }
          }
        },
        "goodfaith": {
          "score": {
            "prediction": true,
            "probability": {
              "false": 0.049044124186712335,
              "true": 0.9509558758132877
            }
          }
        }
      }
    }
  }
}

This was a content page edit. I expected to see a drafttopic prediction in the output. I'll dig into this.
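
A check like this could be scripted. The sketch below assumes the response shape shown in the output above (wiki, then "models", then model names); the helper name missing_models is made up for illustration.

```python
def missing_models(response: dict, wiki: str, expected: set) -> set:
    """Return the expected model names that are absent from a precache response."""
    returned = set(response.get(wiki, {}).get("models", {}))
    return set(expected) - returned

# Trimmed copy of the response shown above (the "scores" section is omitted).
response = {
    "enwiki": {
        "models": {
            "damaging": {"version": "0.5.0"},
            "goodfaith": {"version": "0.5.0"},
        }
    }
}

print(missing_models(response, "enwiki", {"damaging", "goodfaith", "drafttopic"}))
```

For the response above this reports drafttopic as the only missing model.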

Change 559624 had a related patch set uploaded (by Halfak; owner: Halfak):
[mediawiki/services/ores/deploy@master] Replace 'content_edit' event with 'main_edit'

https://gerrit.wikimedia.org/r/559624

Got it! I had a typo in the config for the event. I named the event "main_edit", not "content_edit", and mixed that up in the configuration.
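
A mismatch of this kind could in principle be caught by a simple consistency check. The sketch below uses a hypothetical, simplified config layout (a set of defined event names plus a per-model list of referenced events); the real ORES config format may well differ, and the "edit" event is invented for the example.

```python
# Hypothetical config layout, for illustration only; not the real ORES format.
defined_events = {"edit", "main_edit"}
precache_config = {
    "damaging": ["edit"],
    "drafttopic": ["content_edit"],  # the mixed-up name described above
}

def undefined_event_refs(config: dict, events: set) -> dict:
    """Map each model to the event names it references that are not defined."""
    problems = {}
    for model, names in config.items():
        unknown = [name for name in names if name not in events]
        if unknown:
            problems[model] = unknown
    return problems

print(undefined_event_refs(precache_config, defined_events))
```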

It looks like it is too late to deploy this before the holiday. Bummer.

From our discussions with @MMiller_WMF, it looks like we'll be deploying an updated model in early January anyway, so I have hope that this will not be a blocker.

We have some data in the table (page creations only I guess?), that should be enough to unblock the next steps.

Change 559624 merged by Halfak:
[mediawiki/services/ores/deploy@master] Replace 'content_edit' event with 'main_edit'

https://gerrit.wikimedia.org/r/559624