Add a link engineering: Recommendation version
Closed, Resolved (Public)

Description

Details of the link recommendations relevant to cross-component communication (e.g. how exactly instance_occurrence is defined) are bound to change over time. Since we cache recommendations, multiple versions might exist in parallel, so we should include some kind of version number. The service would have to output it, the cache table would have to store it, and the MediaWiki API would have to pass it on so that the frontend can use separate processing logic for different versions.
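
To illustrate the "separate processing logic" idea, here is a minimal sketch of dispatching on a version number that travels with each cached recommendation. The handler names, the registry, and the assumption that the version lives under `meta.format_version` are illustrative only, not a description of the actual frontend or extension code.

```python
from typing import Any, Callable, Dict, List


def _process_v1(recommendation: Dict[str, Any]) -> List[str]:
    # Hypothetical v1 handler: return the link targets in order.
    return [link["link_target"] for link in recommendation["links"]]


# Hypothetical registry: one handler per known format version.
HANDLERS: Dict[int, Callable[[Dict[str, Any]], List[str]]] = {1: _process_v1}


def process_recommendation(recommendation: Dict[str, Any]) -> List[str]:
    """Pick the processing logic that matches the stored format version."""
    version = recommendation.get("meta", {}).get("format_version")
    handler = HANDLERS.get(version)
    if handler is None:
        # Unknown or missing version: refuse rather than misread the data.
        raise ValueError(f"unsupported recommendation format version: {version!r}")
    return handler(recommendation)
```

With this shape, supporting a new format means adding one handler to the registry, and cached entries in an older format keep working until they expire.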

Event Timeline

Not really sure if we want this for initial release - could be left for whenever we want to change the format.

I think this is a good idea, but it's probably a decent amount of work, so I propose we revisit if/when the format changes.

The URL for retrieving recommendations now indicates v0 i.e. it's unstable. If we decide to make major changes or declare that it's stable, we can switch to v1 and then do the additional work for storing the version number in the cache table.

We are now on v1 since we're serving the API via API-Portal.

Another use case is knowing which datasets were used to generate a recommendation. For example, https://gerrit.wikimedia.org/r/c/research/mwaddlink/+/675172 was just merged; it filters out links for things like years, list pages, and disambiguation pages from the anchors dictionary dataset. Suppose we cached link recommendations for an article before this change was deployed; it would be useful to know the dataset hash that was used for those recommendations.

So, I think we want to know:

  1. the recommendation format version, which probably doesn't change often but would cover things like whether the instance_occurrence calculation is defined differently, or whether we switched from wikitext to HTML as the source data
  2. the hash and date (since we don't archive old datasets) of the dataset used to generate the recommendations. We could probably use link_model.json.checksum for this.

Additionally we might want to record the application version (we currently don't have any kind of application version stored, so we'd need a mechanism for that) used to generate the recommendations. So it would be three things:

  1. application version (would use semantic versioning)
  2. recommendation format version (changed rarely, could use semver?)
  3. datasets version (ideally also using semver). This is tricky because we don't currently have a version, only hashes of various files. We recently added some logic for excluding dates and calendar years from recommendations; it would be useful to associate that kind of metadata with a dataset version, so that someone inspecting the recommendation data in the cache could see which dataset a recommendation was generated from. Maybe there would be a file like https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/cswiki/VERSION that contains the current version number, plus a separate file that maps version numbers to dataset hashes and describes what changed in each feature version increment?
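
A rough sketch of what the "separate file" mapping versions to hashes could look like once loaded, assuming a simple list of releases. The class, the registry contents, and the checksum strings are placeholders made up for illustration; no such registry exists yet.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DatasetRelease:
    version: str               # e.g. the contents of a per-wiki VERSION file
    checksums: Dict[str, str]  # dataset name -> checksum
    notes: str                 # what changed in this release


# Hypothetical registry entries; the checksums are placeholders, not real hashes.
REGISTRY: List[DatasetRelease] = [
    DatasetRelease("1.0.0", {"anchors": "aaa", "model": "bbb"}, "Initial datasets."),
    DatasetRelease(
        "1.1.0",
        {"anchors": "ccc", "model": "bbb"},
        "Exclude dates and calendar years from anchors.",
    ),
]


def find_release(checksums: Dict[str, str]) -> Optional[DatasetRelease]:
    """Return the release whose checksums match exactly, or None if unknown."""
    for release in REGISTRY:
        if release.checksums == checksums:
            return release
    return None
```

Someone inspecting a cached recommendation could then go from its checksums to a human-readable version and change note.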

We will talk about this in engineer chat and make a decision this week.

If we want to avoid having to keep track of versions manually, the dataset version could be the timestamp at which the datasets were generated and the application version could be the git commit ID of the repository head. (That assumes the application is the one used for generating the dataset, not the one in Kubernetes, which might be a different revision, and which does not have access to the repo.) Format changes are a big deal and probably need backward-compatibility code, so a version number is appropriate there (not sure if semver adds much value over a plain integer but there isn't any downside either).
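
As a sketch of the "no manual bookkeeping" option, the dataset-generation step could stamp its output with the current commit and a UTC timestamp. The function names and the key names in the returned dict are assumptions; `git rev-parse --short HEAD` only works where the repository checkout is available, which per the above is the generation environment rather than Kubernetes.

```python
import subprocess
from datetime import datetime, timezone
from typing import Dict, Optional


def git_head_commit(repo_dir: str = ".") -> str:
    """Return the abbreviated commit ID of HEAD in repo_dir."""
    result = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise if git fails, e.g. outside a checkout
    )
    return result.stdout.strip()


def build_version_stamp(commit: str, generated_at: Optional[datetime] = None) -> Dict[str, str]:
    """Combine a commit ID and a generation time into one metadata record."""
    generated_at = generated_at or datetime.now(timezone.utc)
    return {
        "application_version": commit,
        "dataset_generated_at": generated_at.astimezone(timezone.utc).isoformat(timespec="seconds"),
    }
```

In the generation script this would be called as `build_version_stamp(git_head_commit())` and written out next to the dataset files.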

If we want to avoid having to keep track of versions manually, the dataset version could be the timestamp at which the datasets were generated

As mentioned later in your comment, I think we want:

  • git commit / application version which generated the dataset
  • date the dataset was generated

I think we'll need a new table for that, or we could reuse the existing lr_checksum table, where the lookup key would be the checksum and the value would be a blob containing the git commit / application version and the date the dataset was generated.
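
To make the checksum-to-blob lookup concrete, here is a small sketch using an in-memory SQLite database as a stand-in for the real storage. The table name `checksum_metadata`, its columns, and the JSON keys are hypothetical; the actual lr_checksum schema is not shown in this task and may differ.

```python
import json
import sqlite3
from typing import Dict, Optional


def create_table(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: checksum as the key, a JSON blob as the value.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checksum_metadata ("
        "checksum TEXT PRIMARY KEY, metadata TEXT NOT NULL)"
    )


def store_metadata(conn: sqlite3.Connection, checksum: str,
                   application_version: str, generated_at: str) -> None:
    blob = json.dumps(
        {"application_version": application_version, "generated_at": generated_at}
    )
    conn.execute(
        "INSERT OR REPLACE INTO checksum_metadata (checksum, metadata) VALUES (?, ?)",
        (checksum, blob),
    )


def lookup_metadata(conn: sqlite3.Connection, checksum: str) -> Optional[Dict[str, str]]:
    """Return the decoded blob for a checksum, or None if it was never recorded."""
    row = conn.execute(
        "SELECT metadata FROM checksum_metadata WHERE checksum = ?", (checksum,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```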

and the application version could be the git commit ID of the repository head. (That assumes the application is the one used for generating the dataset, not the one in Kubernetes, which might be a different revision, and which does not have access to the repo.)

That could work.

Format changes are a big deal and probably need backward-compatibility code, so a version number is appropriate there (not sure if semver adds much value over a plain integer but there isn't any downside either).

OK, yeah we could just use a plain integer.

I've gone for something relatively simple, which would allow us to inspect recommendations on a wiki (via script or by hand) and determine which datasets and application version were used to return the response. We don't have a registry of checksum versions, so in practice we can use this information to tell whether the link recommendations were generated using the latest dataset on https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/{wikiId}.
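
The "latest dataset" check could be scripted roughly as follows. This is only a sketch: `stale_datasets` is a made-up helper, the `dataset_checksums` key is assumed to match what the service emits, and fetching the currently published checksums is left out, so they are passed in as a plain dict.

```python
from typing import Dict, List


def stale_datasets(meta: Dict, latest_checksums: Dict[str, str]) -> List[str]:
    """List the datasets whose cached checksum differs from the latest one.

    A dataset that is missing from the cached metadata counts as stale.
    """
    cached = meta.get("dataset_checksums", {})
    return sorted(
        name
        for name, checksum in latest_checksums.items()
        if cached.get(name) != checksum
    )
```

An empty result means the recommendation was generated from the datasets that are currently published; anything else names the datasets that have changed since.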

Change 679783 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[research/mwaddlink@main] Add version metadata to responses

https://gerrit.wikimedia.org/r/679783

Change 679783 merged by jenkins-bot:

[research/mwaddlink@main] Add version metadata to responses

https://gerrit.wikimedia.org/r/679783

Change 680985 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/GrowthExperiments@master] [WIP] Store link recommendation metadata

https://gerrit.wikimedia.org/r/680985

Change 680985 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Use link recommendation metadata in fetching/storing recommendations

https://gerrit.wikimedia.org/r/680985

kostajh moved this task from Code Review to QA on the Growth-Team (Current Sprint) board.
kostajh added a subscriber: Etonkovidova.

@Etonkovidova the service now outputs some metadata in its response (see https://api.wikimedia.org/service/linkrecommendation/apidocs/), and new entries in the growthexperiments_link_recommendations table should include that metadata in the gelr_data column.

New entries are displayed (e.g. enwiki betalabs)

gelr_revision: 495970
    gelr_page: 89251
    gelr_data: {"links":[{"link_text":"continental United States","link_target":"Contiguous United States","match_index":0,"wikitext_offset":4713,"score":0.564686119556427,"context_before":"te of the ","context_after":", all of M","link_index":0},{"link_text":"Yucat\u00e1n Peninsula","link_target":"Yucat\u00e1n Peninsula","match_index":0,"wikitext_offset":4765,"score":0.5742266774177551,"context_before":"xcept the ","context_after":", and the ","link_index":1}],"meta":{"application_version":"5b4709d","dataset_checksums":{"anchors":"5a186f92365cecca979a38de3e133bbf6089984f35b95cacd93e19a0403e00ab","model":"dd650648b87d69547b1721560f0e16027e5f81ccb5dd2dcfdcf121d460abb53c","pageids":"9e1b004e7fa84b0187a486c682a1cbf152ca52d2e769d9c3a140eefe9b540d44","redirects":"57c5ce51d0ce4048743674a7ebb53e7a527d325765fd0632fa1455be46f7eb3c","w2vfiltered":"096a5f2ca30708ce41408897c99efb854204ac8322fbada8a8147d8156b031e8"},"format_version":1}}
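
For checking rows like this by script, a minimal sketch that pulls the version metadata out of a gelr_data value. The helper name is made up, and the assumption that rows cached before the change simply lack a "meta" key has not been verified here. The sample is a trimmed copy of the row above, with the links omitted and only two of the checksums kept.

```python
import json
from typing import Any, Dict


def summarize_meta(gelr_data: str) -> Dict[str, Any]:
    """Pull the version metadata out of one gelr_data JSON string."""
    data = json.loads(gelr_data)
    meta = data.get("meta")
    if meta is None:
        # Assumption: rows cached before the metadata was added lack "meta".
        return {"has_meta": False}
    return {
        "has_meta": True,
        "application_version": meta.get("application_version"),
        "format_version": meta.get("format_version"),
        "datasets": sorted(meta.get("dataset_checksums", {})),
    }


# Trimmed copy of the row above: links omitted, only two checksums kept.
sample = json.dumps({
    "links": [],
    "meta": {
        "application_version": "5b4709d",
        "dataset_checksums": {
            "anchors": "5a186f92365cecca979a38de3e133bbf6089984f35b95cacd93e19a0403e00ab",
            "model": "dd650648b87d69547b1721560f0e16027e5f81ccb5dd2dcfdcf121d460abb53c",
        },
        "format_version": 1,
    },
})
print(summarize_meta(sample))
```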