
Add an image: pre-deployment model refresh
Closed, Resolved · Public · 3 Estimated Story Points

Description

At the beginning of August, Platform team ran the image suggestions model via T285816: Add an image: generate static file of suggestions, and Search team loaded those suggestions to the search index via T285817: Add an image: load static file to search index. Our hope was that the articles and suggestions from August would still be sufficiently fresh for a release in October/November.

Now that we are nearing release, we see that a sizable number of articles with image suggestions in the search index (especially from Bengali Wikipedia) are now illustrated. This may have happened because of concerted efforts by the communities to illustrate articles, unrelated to our work. In short, the modeled suggestions have gotten stale faster than we expected.

We always knew that the modeled suggestions would grow stale, and so the Growth team's interface handles these moments by telling the user that the suggestion is actually not available after they click on it. But we don't want this to happen frequently to users, as it will make for a poor experience.

Therefore, we would like to refresh the apparatus in advance of our deployment, if possible. Our deployment is November 29. After talking with @sdkim, it sounds like the main things that would need to happen are:

  1. Re-run the model.
  2. Make the refreshed suggestions available for the image suggestions API.
  3. Load the refreshed suggestions to the search index.

On this task, we want to discuss whether such a refresh would be simple enough to do before November 29.

Event Timeline

Speaking on behalf of the image suggestion API team: once the model is rerun and exported, the image suggestion API should be updated within about a day's worth of work, tracked by this task: https://phabricator.wikimedia.org/T295326

cc: @nnikkhoui @BPirkle

@MMiller_WMF
Speaking for my team: currently, when the search index is updated, only the pages referenced in the new data load are updated with new values; pages that are not referenced keep their old values. In other words, by default the old suggestions are only cleared on pages that appear in the new data load.

Is this the desired behavior, or should all old suggestions be cleared regardless of whether they are referenced by the new suggestions?

> Is this the desired behavior, or should all old suggestions be cleared regardless of whether they are referenced by the new suggestions?

All old suggestions should be cleared. The goal of the change is to remove a number of unwanted entries from the search index (e.g. articles to which images have been added in the WPWP campaign, or articles which are about numbers or are otherwise uninteresting). Updating the recommendations for the articles which are included in both the old and the new data is just a side benefit.
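(For illustration, here is a minimal Python sketch of the behaviour being discussed, with made-up page IDs and suggestion values: a plain load only touches pages present in the new dump, so anything present only in the old load keeps its stale entry unless it is cleared explicitly.)

```python
# Hypothetical page_id -> suggestion maps; shapes and values are illustrative only.
old_index = {101: "image-A", 102: "image-B", 103: "image-C"}
new_dump  = {101: "image-A2", 104: "image-D"}

# Default behaviour: only pages referenced in the new dump are (re)written.
index_after_load = dict(old_index)
index_after_load.update(new_dump)
# -> pages 102 and 103 still carry their old, now-stale suggestions.

# Desired behaviour: also clear everything that is not in the new dump.
stale_pages = set(old_index) - set(new_dump)   # {102, 103}
for page_id in stale_pages:
    index_after_load.pop(page_id, None)

print(sorted(index_after_load))   # [101, 104]
```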

Thanks!
@EBernhardson, did you have other questions about what this task would require for Search?

We can clear the old suggestions; that's not a problem. I wanted to make clear, though, that the search systems only update the pages that are referenced. If we want to update pages that are not referenced in a data dump, it has to be done explicitly; it doesn't just happen.

An alternative solution is to not reuse names. We currently import the dumps under a particular prefix. Those prefixes aren't exposed to end users, so they could be versioned. With a versioned import name there is no question about whether the old data is still referenced; the only remaining question is cleaning up after ourselves.
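(A hedged sketch of the versioned-name idea: the base prefix and date-suffix scheme below are assumptions for illustration; only the general shape, a stable internal name plus an import-specific suffix, comes from the comment above.)

```python
from datetime import date

# Assumed naming scheme: suffix the internal import prefix with the date of the
# data load. The prefix itself is never shown to end users.
BASE_PREFIX = "recommendation.image"

def versioned_prefix(import_date: date) -> str:
    return f"{BASE_PREFIX}.{import_date:%Y%m%d}"

previous = versioned_prefix(date(2021, 8, 2))    # "recommendation.image.20210802"
current  = versioned_prefix(date(2021, 11, 15))  # "recommendation.image.20211115"

# Because each load writes under a fresh name, nothing from the old load is ever
# partially overwritten; the old name just has to be garbage-collected later.
retired_versions = [previous]
```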

(Moving to watching; there's nothing (AFAIK) that is needed from Growth-Team to make this happen.)

I think in the long term we'd like the search data to be refreshed automatically on a regular schedule (e.g. monthly), so versioning wouldn't be easy to manage on the client side. As a short-term solution it would work fine for us, if that's your preferred approach.

> I think in the long term we'd like the search data to be refreshed automatically on a regular schedule (e.g. monthly), so versioning wouldn't be easy to manage on the client side. As a short-term solution it would work fine for us, if that's your preferred approach.

If we want a monthly batch process, instead of a continuous update process, a versioned value sounds simpler to me. Post-deploy automation would only need to update a single value somewhere (it could be an Elastic doc in our MetaStore; the update would probably be done by the mjolnir bulk daemon after it completes the data import), and everything keying off that value would switch to the new variant. Clearing the old values in a versioned setup can work off a second list: we add the retired version to a list of values to clear, and the Saneitizer, which already updates pages every 8 weeks, can be extended to clear out data we aren't using anymore while it issues the standard re-renders. Clearing the old values in the current setup is relatively non-trivial; the only obvious way is to perform searches and then delete documents that come back, after cross-referencing the results against the new dataset.
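(A toy sketch of the promotion-plus-cleanup-list idea: the MetaStore is modelled as a plain in-memory dict, and the dataset key and version strings are assumptions, not the real document layout.)

```python
# Toy in-memory stand-in for the MetaStore document described above; the real
# update would be written by the import job (e.g. the mjolnir bulk daemon).
metastore = {
    "promoted": {"image_recommendations": "recommendation.image.20210802"},
    "to_clear": [],
}

def promote(dataset: str, new_version: str) -> None:
    """Point readers at new_version and queue the old version for cleanup."""
    old_version = metastore["promoted"].get(dataset)
    metastore["promoted"][dataset] = new_version
    if old_version and old_version != new_version:
        # A background pass (e.g. the Saneitizer run that already re-renders
        # pages every 8 weeks) can strip any version found in this list.
        metastore["to_clear"].append(old_version)

promote("image_recommendations", "recommendation.image.20211115")
assert metastore["to_clear"] == ["recommendation.image.20210802"]
```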

@EBernhardson right, versioning would be simpler for the infrastructure, but more complicated for clients and users, who would have to somehow figure out which search keyword to use.

Clearing the old values could be done by preserving the penultimate static dump and diffing it with the new dump, but it's certainly an awkward process.
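(A minimal sketch of that diffing approach, assuming the dumps are TSV files with a page_id column; the file and column names are hypothetical.)

```python
import csv

def page_ids(dump_path: str, id_column: str = "page_id") -> set[str]:
    """Collect the page IDs referenced in a (hypothetical) TSV suggestion dump."""
    with open(dump_path, newline="") as f:
        return {row[id_column] for row in csv.DictReader(f, delimiter="\t")}

# Pages present in the previous load but absent from the new one are exactly the
# pages whose stale recommendations have to be cleared explicitly.
to_clear = page_ids("suggestions_2021-08.tsv") - page_ids("suggestions_2021-11.tsv")
```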

> @EBernhardson right, versioning would be simpler for the infrastructure, but more complicated for clients and users, who would have to somehow figure out which search keyword to use.

The exact versioning shouldn't be exposed to end users; indeed, it would be crazy to expect end users to know which dump they should be referencing. That's what I was suggesting the MetaStore for: the translation from the end-user-visible keyword to whatever internal version is currently promoted can happen by maintaining an array/map of the currently promoted versions in MetaStore. That map can be updated by the importing process whenever a new version finishes importing.

> Clearing the old values could be done by preserving the penultimate static dump and diffing it with the new dump, but it's certainly an awkward process.
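(A toy sketch of the keyword-to-promoted-version translation described above; the keyword name, the hasrecommendation:image example, and the tag values are assumptions rather than the production configuration.)

```python
# Toy model of the query-time side: the user-facing keyword stays stable while
# its implementation looks up whichever versioned tag is currently promoted.
PROMOTED = {"image": "recommendation.image.20211115"}

def resolve_keyword(keyword_arg: str) -> str:
    """Map e.g. a hasrecommendation:image query to the promoted internal tag."""
    try:
        return PROMOTED[keyword_arg]
    except KeyError:
        raise ValueError(f"no promoted dataset for {keyword_arg!r}") from None

assert resolve_keyword("image") == "recommendation.image.20211115"
```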

MPhamWMF set the point value for this task to 5. (Nov 15 2021, 4:47 PM)
MPhamWMF changed the point value for this task from 5 to 3.

The Search team talked about this in planning today, and we can do the lower effort reload this time, but will think about a versioning approach for the future in order to reduce maintenance work on our team.

Change 739383 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Support for partial clearing of weighted_tags

https://gerrit.wikimedia.org/r/739383

Change 739383 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Support for partial clearing of weighted_tags

https://gerrit.wikimedia.org/r/739383

@EBernhardson The updated data can be found under clarakosi.search_imagerec. Please let me know if you have any problems accessing the table.

Data loaded from clarakosi.search_imagerec into the eqiad and codfw cirrus clusters. This updated ~75k pages in each DC; the majority of the import was no-op'd at indexing time because it did not cause any change to the indexed content. I've started the process to clear pages; a dry run reported that it will clear old recommendations from ~70k pages per cluster, and I expect it to finish in an hour or so.
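(A rough illustration of the noop behaviour mentioned here, assuming a weighted_tags field holding tag|weight strings; the real check happens inside the indexing pipeline, not in client code like this.)

```python
# An update is only written when it would actually change the indexed content;
# otherwise it is skipped as a noop. Field name and tag format are assumptions.
def apply_update(indexed_doc: dict, new_tags: list[str]) -> bool:
    """Return True if the document changed, False if the update was a noop."""
    if indexed_doc.get("weighted_tags") == new_tags:
        return False                        # identical content: nothing to write
    indexed_doc["weighted_tags"] = new_tags
    return True

doc = {"weighted_tags": ["recommendation.image/exists|1"]}
assert apply_update(doc, ["recommendation.image/exists|1"]) is False   # noop
assert apply_update(doc, ["recommendation.image/exists"]) is True      # real change
```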

Thanks a lot, @Clarakosi and @EBernhardson! The recommendation stream looks much better now.

Gehel claimed this task.