
Consume new event from the AddLink feature
Closed, Resolved | Public | 8 Estimated Story Points

Description

See T261407 for details.

The event should contain the page id and page namespace so that it is easy to map to its target elastic index.
The frequency of the job is yet to be defined but could be hourly.
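
As a rough illustration of why the namespace matters: CirrusSearch typically splits each wiki into a content index and a general index, so a consumer needs both the wiki database name and the namespace to pick a target. The sketch below assumes that naming scheme and a hypothetical CONTENT_NAMESPACES set; the real list of content namespaces is per-wiki configuration, and the function names are illustrative only.

```python
# Hypothetical sketch: choose a target index from an event's database and namespace.
# Assumption: indices are named "<database>_content" / "<database>_general" and
# only namespace 0 counts as content; real wikis configure this per wiki.
from typing import Tuple

CONTENT_NAMESPACES = frozenset({0})


def target_index(database: str, page_namespace: int) -> str:
    """Return the assumed index name for a page."""
    suffix = "content" if page_namespace in CONTENT_NAMESPACES else "general"
    return f"{database}_{suffix}"


def document_key(event: dict) -> Tuple[str, int]:
    """Return (index name, page id), which together address one document."""
    return target_index(event["database"], event["page_namespace"]), event["page_id"]
```

With both fields present in the event, the consumer can address the document without an extra lookup against the wiki's database.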

Once the new event is published, we need to consume it as part of the Elasticsearch update pipeline. The update pipeline itself will need to be modified to support an update frequency other than the current weekly.

Event Timeline

CBogen triaged this task as High priority. Sep 14 2020, 3:52 PM
CBogen moved this task from needs triage to Current work on the Discovery-Search board.

The data will be stored in Elasticsearch within the predicted_classes field (currently named ores_articletopic, but it will be renamed at some point), along with the ORES predictions already stored there.

FWIW we do expect other recommendation types (there will be one for images in Q3-Q4 and, if the concept works out, probably more in the future) and they aren't quite classifier predictions. Right now we only care about presence or absence, but in the future we might want to filter on things like recommendation source (e.g. did the image recommendation come from Commons search, Wikidata, or other language versions of the article?). So I'm not sure that trying to fit it into the same data structure as ORES would work well.

Indeed, in recent discussions we agreed to store only the flag required to filter the articles that are fit for a particular task (add_link here). The actual data required for the editing task will have to be fetched from other sources.
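
To make the "flag only" idea concrete, here is a minimal sketch of adding a presence marker next to existing predictions. The tag string and the "prefix/name|weight" layout are assumptions chosen for illustration, not the confirmed field format, and both function names are hypothetical.

```python
# Hypothetical sketch: keep existing predictions and add a presence-only flag.
# The tag strings and the "prefix/name|weight" layout are assumptions.
from typing import Iterable, List


def recommendation_flag(recommendation_type: str) -> str:
    """Build a presence-only marker; it carries no recommendation payload."""
    return f"recommendation.{recommendation_type}/exists|1"


def merge_tags(existing: Iterable[str], recommendation_type: str) -> List[str]:
    """Replace any previous flag of this type, leaving other tags untouched."""
    prefix = f"recommendation.{recommendation_type}/"
    kept = [tag for tag in existing if not tag.startswith(prefix)]
    return kept + [recommendation_flag(recommendation_type)]
```

Namespacing the flag by recommendation type keeps it separate from the classifier predictions, so a later type such as images would only need a different prefix.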

To trigger the event, you can use code like this:

use MediaWiki\MediaWikiServices;

// The recommendation type ends up in the event's recommendation_type field.
$recommendationType = 'link';
// Any existing revision works for a test; this uses the main page's latest one.
$revision = MediaWikiServices::getInstance()->getRevisionLookup()->getRevisionByTitle( Title::newMainPage() );

// Build the event for the stream and send it through EventBus.
$eventBusFactory = MediaWikiServices::getInstance()->get( 'EventBus.EventBusFactory' );
$eventBus = $eventBusFactory->getInstanceForStream( 'mediawiki.revision-recommendation-create' );
$eventFactory = $eventBus->getFactory();
$event = $eventFactory->createRecommendationCreateEvent( 'mediawiki.revision-recommendation-create', $recommendationType, $revision );
$eventBus->send( $event );

Example event:

{
  "$schema": "/mediawiki/revision/recommendation-create/1.0.0",
  "meta": {
    "uri": "https://examplewiki.wikipedia.org/wiki/TestPage10",
    "dt": "2020-06-10T18:57:16Z",
    "domain": "test.wikipedia.org",
    "stream": "mediawiki.revision-recommendation-create"
  },
  "database": "examplewiki",
  "page_id": 123,
  "page_title": "TestPage10",
  "page_namespace": 0,
  "rev_id": 123,
  "rev_timestamp": "2020-06-10T18:57:16Z",
  "rev_sha1": "mr0szy90m5qbn6tek7ch3nebaild3tm",
  "rev_minor_edit": false,
  "rev_len": 3,
  "rev_content_model": "wikitext",
  "rev_content_format": "text/x-wiki",
  "performer": {
    "user_text": "example_user_text",
    "user_groups": [
      "*",
      "user",
      "autoconfirmed"
    ],
    "user_is_bot": false,
    "user_id": 123,
    "user_registration_dt": "2016-01-29T21:13:24Z",
    "user_edit_count": 1
  },
  "page_is_redirect": false,
  "rev_parent_id": 122,
  "rev_content_changed": true,
  "recommendation_type": "link"
}
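
On the consuming side, only a handful of these fields are needed to flag a page. The following is a minimal sketch of extracting them from one serialized event; the PageFlag type and parse_event function are hypothetical names, not part of any existing pipeline.

```python
# Hypothetical sketch: pull the fields a search updater needs out of one event.
import json
from typing import NamedTuple, Optional


class PageFlag(NamedTuple):
    database: str
    page_id: int
    page_namespace: int
    rev_id: int


def parse_event(raw: str, wanted_type: str = "link") -> Optional[PageFlag]:
    """Return the page reference, or None for other recommendation types."""
    event = json.loads(raw)
    if event.get("recommendation_type") != wanted_type:
        return None
    return PageFlag(
        event["database"],
        event["page_id"],
        event["page_namespace"],
        event["rev_id"],
    )
```

Returning None for other recommendation types lets the same stream carry future types without affecting a consumer that only handles link recommendations.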

The update pipeline itself will need to be modified to support an update frequency other than the current weekly.

For this part I'm thinking that the convert_to_esbulk process is already configuration driven, but the configuration is done from Python. Perhaps a minimal shim can be put together to load this config from YAML or JSON. The script can then be run hourly and weekly with separate sets of tables to read from. That doesn't seem critical to implement first, though.
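
A shim of that kind could be as small as the sketch below. It uses JSON because the standard library can parse it; the config keys and table names are invented for illustration and do not reflect the actual convert_to_esbulk configuration.

```python
# Hypothetical sketch: load per-schedule table lists from JSON instead of Python.
# The config keys and table names below are made up for illustration.
import json
from typing import Dict, List

EXAMPLE_CONFIG = """
{
  "hourly": ["discovery.hypothetical_link_recommendations"],
  "weekly": ["discovery.hypothetical_popularity", "discovery.hypothetical_topics"]
}
"""


def load_tables(config_text: str, schedule: str) -> List[str]:
    """Return the tables to read for one schedule, failing loudly on unknown names."""
    config: Dict[str, List[str]] = json.loads(config_text)
    if schedule not in config:
        raise ValueError(f"unknown schedule {schedule!r}; expected one of {sorted(config)}")
    return list(config[schedule])
```

The hourly and weekly runs would then call the same script with a different schedule name, each reading only its own set of tables.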

As a first step, I'm going to work out the scripts to process the events into a form suitable for convert_to_esbulk and test out the plan for merging data.
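
One plausible piece of that processing, sketched here under assumptions, is collapsing an hour of events so that each page appears once. The function name is hypothetical, and the real scripts may merge data quite differently.

```python
# Hypothetical sketch: collapse one hour of events to the newest event per page.
from typing import Dict, Iterable, List, Tuple


def latest_per_page(events: Iterable[dict]) -> List[dict]:
    """Keep, for each (database, page_id), the event with the greatest rev_timestamp."""
    newest: Dict[Tuple[str, int], dict] = {}
    for event in events:
        key = (event["database"], event["page_id"])
        current = newest.get(key)
        # ISO 8601 UTC timestamps in the same format compare correctly as strings.
        if current is None or event["rev_timestamp"] > current["rev_timestamp"]:
            newest[key] = event
    return list(newest.values())
```

Deduplicating before the bulk conversion means each document receives at most one update per run.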

Change 647762 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[wikimedia/discovery/analytics@master] implement link recommendations imports

https://gerrit.wikimedia.org/r/647762

Change 657880 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[eventgate-wikimedia@master] Bump schema repo shas to get mediawiki/revision/recommendation-create

https://gerrit.wikimedia.org/r/657880

Change 657880 merged by Ottomata:
[eventgate-wikimedia@master] Bump schema repo shas to get mediawiki/revision/recommendation-create

https://gerrit.wikimedia.org/r/657880

Expecting to deploy the new data pipelines and hourly scheduling early next week. Probably Monday, but we'll see. I will emit some test events on testwiki and see if everything talks happily.

Change 657885 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-main - bump to 2021-01-22-173634-production

https://gerrit.wikimedia.org/r/657885

Change 657885 merged by Ottomata:
[operations/deployment-charts@master] eventgate-main - bump to 2021-01-22-173634-production

https://gerrit.wikimedia.org/r/657885

Change 658413 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-main - precache /mediawiki/revision/recommendation-create/1.0.0

https://gerrit.wikimedia.org/r/658413

Change 658413 merged by Ottomata:
[operations/deployment-charts@master] eventgate-main - precache /mediawiki/revision/recommendation-create/1.0.0

https://gerrit.wikimedia.org/r/658413

Change 647762 merged by jenkins-bot:
[wikimedia/discovery/analytics@master] implement link recommendations imports

https://gerrit.wikimedia.org/r/647762

Change 659319 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[operations/mediawiki-config@master] Enable canary events for recommendation create

https://gerrit.wikimedia.org/r/659319

Change 659319 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable canary events for recommendation create

https://gerrit.wikimedia.org/r/659319

I believe this is complete; hourly events are processing. No data has run through the pipeline as far as I can tell, only the canary events.

Yeah, I expect it will be a couple more weeks before we enable the feature in production.

Tgr reopened this task as Open. Edited Mar 12 2021, 11:35 AM

We enabled event generation on testwiki yesterday, and the pipeline seems to be working fine. Thanks for all the work that went into it!

@EBernhardson we enabled cswiki today (processing started at 13:27 UTC) and while there are 3,866 items in the database, the hasrecommendation:link query returns no results. Do you have an idea of what might be going on?

Sorry, of course right after I posted this we began to see items (684) in the search.
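
For anyone repeating this check, the hit count for the keyword can be read from the search API. The sketch below only builds the request URL; the hits_url helper is a hypothetical name, and the host is passed in by the caller.

```python
# Hypothetical sketch: build a search API URL that counts pages matching the keyword.
from urllib.parse import urlencode


def hits_url(host: str, keyword: str = "hasrecommendation:link") -> str:
    """Return a MediaWiki API URL requesting only the total hit count."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": keyword,
        "srinfo": "totalhits",
        "srlimit": 1,
        "format": "json",
    }
    return f"https://{host}/w/api.php?{urlencode(params)}"
```

The count is reported under query.searchinfo.totalhits in the response. Because the import runs as an hourly batch, that number can be expected to lag behind the count of items in the database.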