As the first step toward getting ORES drafttopic scores into ElasticSearch, the scores need to be somewhere within reach of the analytics infrastructure. The easiest approach is to push them to a Kafka topic, which is already done for some other ORES models.
Description
Details
Project | Branch | Lines +/- | Subject
---|---|---|---
mediawiki/services/ores/deploy | master | +1 -1 | Replace 'content_edit' event with 'main_edit'
Status | Assigned | Task
---|---|---
Resolved | Rileych | T240517 [EPIC] Growth: Newcomer tasks 1.1.1 (ORES topics)
Resolved | EBernhardson | T240556 Load ORES articletopic data into ElasticSearch via the weekly bulk update
Invalid | None | T240553 Consume ORES articletopic data from Kafka and store it in HDFS
Resolved | Halfak | T240725 ORES deployment mid-Dec. 2019
Resolved | Ottomata | T240549 Configure ORES to publish new drafttopic scores to Kafka
Resolved | Halfak | T240609 Produce drafttopic score events on every edit to English Wikipedia articles
Event Timeline
I think if drafttopic is added to the list of 'precache' scores for changeprop, it will automatically get added. Ping @Pchelolo
We have the 'revision-score' topic where an event is pushed on every page edit via https://github.com/wikimedia/change-propagation/blob/master/sys/ores_updates.js and calling a recache uri in ores. Do we need the scores in a new topic? Can we just add it to this event?
If you just put this score into the existing topic, it will show up in the event.mediawiki_revision_score hive table.
Thanks both! Where is the list of scores defined? I don't see it either in the EventBus emitter or the changeprop config or the ORES request logic (which, if I trace things correctly, just sends the revision data to the ORES /v3/precache endpoint, with no further arguments). So presumably this would only have to be changed in the ORES config?
While the current schema seems flexible enough to support the use case, it is also a bit vague as to what the topics look like. What we've found so far are fairly free-form strings (e.g. "Culture.Language and literature"). This is quite error-prone. Having some kind of ID (topic1, topic2, etc.) and keeping the mapping to English (or other languages) outside of the technical pipeline would probably help reduce errors.
@Gehel not sure I totally understand your requirements, but maybe what you need can be done by filtering and transforming the mediawiki.revision-score stream (and/or Hive table?) into whatever you need?
Well, a string is "some kind of ID". What would be needed for a better one? Just having a more predictable character set (lowercase, dash instead of space etc)? Being published in some machine-readable format and location?
(As an aside, ORES/Draft topic could really use more information.)
In any case, this is a v2 problem, right? As in, we can roll out drafttopic scores the way they are, and then change the names of the topics in the future if we so wish (they will probably change occasionally anyway, as people figure out the level of detail that works best for users).
Or are there technical complications with the current names as far as the ElasticSearch part of the system is concerned (e.g. having to escape spaces)?
In terms of functionality, you can index the strings exactly as provided (previous example, "Culture.Language and literature") and as long as we search for exactly that string it will be returned. For example, "Culture:Language and literature" will not match, and neither will "Culture.Language and Literature" or "literature" on its own. This could be changed, but the understanding so far was that these topics are unique identifiers that will be used as such. My interpretation of @Gehel's suggestion is that, as far as identifiers go, these seem very free-form. When an identifier is Q142342, every person who looks at it (hopefully) knows it must be reproduced exactly. When I look at "Culture.Language and literature", it seems less like an identifier and more like a description.
But indeed, this is not a technical problem with respect to indexing or search, it will work as is.
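To make the exact-match behaviour described above concrete, here is a minimal sketch in plain Python. It does not use ElasticSearch at all: a set lookup stands in for a keyword-style field, and `to_slug` is a hypothetical normalisation (lowercase, dashes instead of spaces) of the kind floated earlier in the thread, not something ORES or the search cluster currently does.

```python
# Topic labels copied from the Hive output in this thread.
INDEXED_TOPICS = {
    "Culture.Language and literature",
    "Culture.Internet culture",
    "Geography.Countries",
}


def exact_match(query: str) -> bool:
    """Stand-in for a keyword lookup: the query must equal an indexed label exactly."""
    return query in INDEXED_TOPICS


def to_slug(label: str) -> str:
    """Hypothetical normalisation: lowercase, with spaces and underscores turned into dashes."""
    return label.strip().lower().replace("_", "-").replace(" ", "-")


print(exact_match("Culture.Language and literature"))  # True
print(exact_match("Culture:Language and literature"))  # False (colon instead of dot)
print(exact_match("Culture.Language and Literature"))  # False (capital L)
print(exact_match("literature"))                       # False (partial label)
print(to_slug("History_And_Society.Business and economics"))
# history-and-society.business-and-economics
```

A slug like this would only change how the identifier looks; it would still have to be reproduced exactly by whoever queries it.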
For clarity, ORES now produces an event every time that an article is edited in English Wikipedia via the ChangeProp/precache mechanism.
Thanks so much @Halfak! Is this task complete then or is there anything else you need support on?
Just checked, and drafttopic scores are now in the event.mediawiki_revision_score table:
    select scores["drafttopic"].prediction, count(*) as cnt
    from event.mediawiki_revision_score
    where scores["drafttopic"] IS NOT NULL
      and size(scores["drafttopic"].prediction) > 0
      and year=2019 and month=12 and day=19 and hour=10
    group by scores["drafttopic"].prediction
    order by cnt desc
    limit 10;
    ...
    prediction	cnt
    ["Culture.Internet culture"]	32
    ["Culture.Language and literature"]	9
    ["Geography.Countries"]	7
    ["STEM.Time"]	5
    ["History_And_Society.Business and economics"]	5
    ["Assistance.Maintenance","STEM.Time"]	5
    ["Culture.Language and literature","Geography.Countries"]	5
    ["STEM.Time","Geography.Countries"]	2
    ["STEM.Technology","Culture.Internet culture"]	2
    ["STEM.Medicine"]	2
    hive (default)> select day, hour, count(*)
                    from event.mediawiki_revision_score
                    where scores["drafttopic"] IS NOT NULL
                      and year=2019 and month=12 and (day=18 or day=19)
                    group by day, hour
                    order by day, hour
                    limit 10000;
    day	hour	_c2
    18	0	264
    18	1	485
    18	2	520
    18	3	327
    18	4	286
    18	5	174
    18	6	219
    18	7	223
    18	8	226
    18	9	166
    18	10	214
    (snip)

    hive (default)> select count(*)
                    from event.mediawiki_revision_score
                    where scores["drafttopic"] IS NOT NULL
                      and year=2019 and month=12 and day=18;
    6999
That seems too small - enwiki has 50K content space edits daily per https://stats.wikimedia.org/v2/#/en.wikipedia.org/content/edited-pages/normal|line|2019-11-07~2019-12-01|page_type~content|daily
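As a rough back-of-the-envelope check of that gap, the arithmetic can be sketched as follows. Both inputs come from this thread: 6999 is the Hive count for 2019-12-18, and 50,000 is only the approximate daily figure quoted from the stats page, so the result is an estimate and not a measured coverage rate.

```python
# Count of rows with a drafttopic score on 2019-12-18, from the Hive query in this thread.
drafttopic_events = 6999

# Approximate daily content-namespace edits on enwiki, as quoted from stats.wikimedia.org.
approx_daily_content_edits = 50_000

# Share of content edits that ended up with a drafttopic score.
coverage = drafttopic_events / approx_daily_content_edits
print(f"{coverage:.0%}")  # 14%
```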
Whoops! You're right. I just manually sent an event and it looks like it isn't getting picked up.
    $ python
    Python 3.5.1+ (default, Mar 30 2016, 22:46:26)
    [GCC 5.3.1 20160330] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import requests
    >>> import json
    >>> event = json.loads('{"$schema":" <snip> "rev_content_changed":true}')
    >>> response = requests.post("https://ores.wikimedia.org/v3/precache", json=event); print(response.text)
    {
      "enwiki": {
        "models": {
          "damaging": {
            "version": "0.5.0"
          },
          "goodfaith": {
            "version": "0.5.0"
          }
        },
        "scores": {
          "931601462": {
            "damaging": {
              "score": {
                "prediction": false,
                "probability": {
                  "false": 0.8946838511642203,
                  "true": 0.10531614883577967
                }
              }
            },
            "goodfaith": {
              "score": {
                "prediction": true,
                "probability": {
                  "false": 0.049044124186712335,
                  "true": 0.9509558758132877
                }
              }
            }
          }
        }
      }
    }
This was a content page edit. I expected to see a drafttopic prediction in the output. I'll dig into this.
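A check like this can be automated. The sketch below is only an illustration: `missing_models` is a hypothetical helper, and the `response` dict is a trimmed copy of the precache output pasted in this thread (probabilities omitted), so nothing here calls the live ORES service.

```python
def missing_models(response: dict, wiki: str, rev_id: str, expected: set) -> set:
    """Return the expected model names that have no score for the given revision."""
    scores = response.get(wiki, {}).get("scores", {}).get(rev_id, {})
    return set(expected) - set(scores)


# Trimmed copy of the precache response shown in this thread.
response = {
    "enwiki": {
        "models": {
            "damaging": {"version": "0.5.0"},
            "goodfaith": {"version": "0.5.0"},
        },
        "scores": {
            "931601462": {
                "damaging": {"score": {"prediction": False}},
                "goodfaith": {"score": {"prediction": True}},
            }
        },
    }
}

expected = {"damaging", "goodfaith", "drafttopic"}
print(missing_models(response, "enwiki", "931601462", expected))  # {'drafttopic'}
```

An empty result would mean every expected model was scored for that revision.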
Change 559624 had a related patch set uploaded (by Halfak; owner: Halfak):
[mediawiki/services/ores/deploy@master] Replace 'content_edit' event with 'main_edit'
Got it! I had a typo in the config for the event. I named the event "main_edit", not "content_edit", and mixed that up in the configuration.
It looks like it is too late to deploy this before the holiday. Bummer.
From our discussions with @MMiller_WMF, it looks like we'll be deploying an updated model in early January anyway, so I have hope that this will not be a blocker.
We have some data in the table (page creations only I guess?), that should be enough to unblock the next steps.
Change 559624 merged by Halfak:
[mediawiki/services/ores/deploy@master] Replace 'content_edit' event with 'main_edit'