
Update ORES articletopic data score in ElasticSearch when an article gets edited
Closed, Declined (Public)

Description

We plan to bulk load ORES drafttopic scores to ElasticSearch via T240556: Load ORES articletopic data into ElasticSearch via the weekly bulk update. That will have a delay of up to one week, which is not ideal in some cases (especially for recently created articles; for mature articles the topic scores presumably don't change much over time). To fix that, we could hook into the ES update logic in MediaWiki (probably via ContentHandler::getDataForSearchIndex), pull the score from the ORES service, and add it to the ES document.

Document updates happen much more often than edits (e.g. they are also triggered by template edits), so a naive implementation might overload the ORES service. We need to figure out what load is acceptable and limit this functionality accordingly. Maybe we can get away with only doing it on real edits; otherwise we could limit it to new / recent articles, or to wikis where GrowthExperiments is deployed.
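As a rough illustration of the limiting options above, the check could look something like the following sketch. It is written in Python for brevity rather than as MediaWiki code, and every name in it (`DocumentUpdate`, `should_fetch_score`, the 30-day window, the example wiki set) is a hypothetical placeholder, not an existing API or an agreed threshold.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical values for illustration only; nothing here is decided.
RECENT_WINDOW = timedelta(days=30)   # assumed cutoff for "new / recent" articles
GROWTH_WIKIS = {"arwiki"}            # example set; arwiki is the Growth wiki named in this task


@dataclass
class DocumentUpdate:
    """Hypothetical description of one search-document update."""
    wiki: str
    is_real_edit: bool       # False for updates caused by e.g. template edits
    page_created: datetime


def should_fetch_score(update: DocumentUpdate, now: datetime,
                       recent_only: bool = True,
                       growth_only: bool = False) -> bool:
    """Decide whether this document update should trigger an ORES request."""
    if not update.is_real_edit:
        return False  # skip re-renders that no edit to the page itself caused
    if growth_only and update.wiki not in GROWTH_WIKIS:
        return False
    if recent_only and now - update.page_created > RECENT_WINDOW:
        return False
    return True
```

Each restriction is a separate switch so the options can be tried independently once an acceptable load is known.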

The code would either live in MediaWiki-extensions-ORES (where the search keyword code will live as well) or GrowthExperiments (if we limit it to those wikis).

Open questions:

Event Timeline

From eyeballing wikistats, monthly:

| Wiki | Content edits | Content page creations |
| --- | --- | --- |
| all Wikipedias | 10-15M | 200-300K |
| enwiki | 3-4M | 15-25K |
| arwiki | 100K-1M | 10-100K |

(arwiki is the largest Growth wiki. The huge fluctuation is presumably due to bots; it nearly doubled in the last year.)

Per minute, taking the upper edge of the range, that would be

| Wiki | Content edits | Content page creations |
| --- | --- | --- |
| all Wikipedias | ~350 | ~7 |
| enwiki | ~100 | ~0.5 |
| arwiki | ~25 | ~2 |
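For reference, the per-minute figures follow from dividing the upper edge of each monthly range by the number of minutes in a month. A minimal sketch of that arithmetic, assuming a 30-day month (the assumption is mine; the original estimate does not state one):

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200, assuming a 30-day month

# Upper edge of each monthly range: (content edits, content page creations)
monthly_upper = {
    "all Wikipedias": (15_000_000, 300_000),
    "enwiki": (4_000_000, 25_000),
    "arwiki": (1_000_000, 100_000),
}


def per_minute(monthly_count: int) -> float:
    """Convert a monthly count into an average per-minute rate."""
    return monthly_count / MINUTES_PER_MONTH


for wiki, (edits, creations) in monthly_upper.items():
    print(f"{wiki}: {per_minute(edits):.1f} edits/min, "
          f"{per_minute(creations):.2f} creations/min")
```

The exact quotients are slightly off from the rounded values quoted (about 347 edits per minute for all Wikipedias, about 93 for enwiki), which is fine for an order-of-magnitude estimate. These are averages and say nothing about peaks.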

That does not say much about spikes though (especially on wikis where bots dominate article creation).

@Halfak what would be the cost of fetching ORES scores from the MediaWiki LinksUpdate job after an edit (near realtime)? Given that changeprop already sent that same edit to ORES (also near realtime), wouldn't one free-ride on the other, and not increase the current load much?

It's true. We have a dedupe mechanism that should make the cost essentially free if we're already generating the score with ChangeProp.
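To illustrate why a second request for the same revision would be close to free, here is a toy sketch of deduplication keyed on wiki, model and revision. It is not the actual ORES implementation, whose internals are not described in this task; `DedupingScorer` and its `compute` callback are hypothetical names.

```python
from typing import Callable, Dict, Tuple

ScoreKey = Tuple[str, str, int]  # (wiki, model, revision id)


class DedupingScorer:
    """Toy model of score deduplication; not the real ORES mechanism."""

    def __init__(self, compute: Callable[[str, str, int], dict]) -> None:
        self._compute = compute            # the expensive scoring function
        self._results: Dict[ScoreKey, dict] = {}
        self.compute_calls = 0             # counts how often real work was done

    def score(self, wiki: str, model: str, rev_id: int) -> dict:
        key = (wiki, model, rev_id)
        if key not in self._results:
            # First request for this revision (e.g. from ChangeProp) pays the cost.
            self.compute_calls += 1
            self._results[key] = self._compute(wiki, model, rev_id)
        # Later requests (e.g. from the LinksUpdate job) reuse the stored result.
        return self._results[key]
```

In this model, the request from the LinksUpdate job only adds load when it asks for a revision or model that ChangeProp did not already request.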

@MMiller_WMF how do we prioritize this? Given the advanced state of T240556: Load ORES articletopic data into ElasticSearch via the weekly bulk update, this probably won't be needed to have ORES search, but without it scores only update weekly (so recently created articles won't be returned); with it updates would be real-time.

@Tgr -- thanks for checking on this. We don't need to have this for the initial switch over from morelike to ORES, but I think we can prioritize it later on. I think we should remember that this would work well together with some process by which community members put maintenance templates on new articles.

FYI:

Hey folks, it looks like articletopic (a slightly different model that we now have in production) is a better option than the drafttopic model. Once we release the native models for ar, cs, ko, and viwiki, they'll only have an articletopic model for use.

The main annoyance here is that we'd have to duplicate the threshold fetching logic (and cache the thresholds and keep that cache fresh, since fetching them is too slow to do in the LinksUpdate job).
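The caching part could be as simple as a time-based cache around the slow fetch. The sketch below is only meant to show the shape of the problem: `ThresholdCache`, the one-hour TTL and the `fetch` callback are hypothetical, and the real threshold format is not specified here.

```python
import time
from typing import Callable, Dict, Optional


class ThresholdCache:
    """Hypothetical TTL cache around a slow threshold fetch."""

    def __init__(self, fetch: Callable[[], Dict[str, float]],
                 ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # slow call that returns topic -> threshold
        self._ttl = ttl_seconds      # assumed freshness window
        self._clock = clock          # injectable so the expiry can be tested
        self._value: Optional[Dict[str, float]] = None
        self._fetched_at = 0.0

    def get(self) -> Dict[str, float]:
        now = self._clock()
        if self._value is None or now - self._fetched_at >= self._ttl:
            # Cache is empty or stale: pay for the slow fetch once.
            self._value = self._fetch()
            self._fetched_at = now
        return self._value
```

With a lazy refresh like this, the caller that hits an expired entry still pays for the slow fetch, so in the job itself the cache would more likely need to be refreshed out of band.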

@Tgr can this be declined/marked as invalid?

Tgr renamed this task from Update ORES drafttopic data score in ElasticSearch when an article gets edited to Update ORES articletopic data score in ElasticSearch when an article gets edited. May 31 2022, 9:34 PM
Tgr closed this task as Declined.

Yeah, not worth the effort it would take, and eventually Lift Wing will handle this anyway (I assume).