
Investigate moving incoming_links computation to a batch job
Closed, Resolved · Public · 8 Estimated Story Points

Description

The way the field containing the number of incoming links is fed today has multiple caveats that make it hard to port as-is to a new search update pipeline.

The main issue is that CirrusSearch uses its own index to extract this number:

  • a page X is edited to add a link to page Y and remove a link to page Z (known from MW LinksUpdateComplete hook)
  • a job to re-compute the number of incoming_links to Y and another one for page Z are scheduled with a delay
  • the job to update page X is assumed to run before Y & Z pages are updated
  • the elasticsearch index is assumed to be refreshed before the Y & Z jobs run the count(outgoing_links:X) query against elasticsearch (see the sketch below)
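
For reference, a minimal sketch of the kind of per-page count query described above, using Elasticsearch's _count API over HTTP. The URL, index name and field name here are illustrative assumptions following the wording of this task, not the exact CirrusSearch internals.

```
import requests

# Hypothetical sketch: count how many documents link to a given page by
# querying the outgoing links field across the index. The URL, index name
# and field name are illustrative assumptions.
ES_URL = "http://localhost:9200"
INDEX = "enwiki_content"

def incoming_link_count(title: str) -> int:
    """Number of documents whose outgoing links field contains `title`."""
    body = {"query": {"term": {"outgoing_links": title}}}
    resp = requests.post(f"{ES_URL}/{INDEX}/_count", json=body, timeout=10)
    resp.raise_for_status()
    return resp.json()["count"]

# e.g. the delayed job for page Y would effectively do:
# incoming_link_count("Y")
```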

Caveats:

  • delaying via changeprop is done by re-submitting a kafka message
  • everything assumes that updating page X went well and that the index was refreshed within the given delay

Replicating this technique in the new search update pipeline does not seem wise.

Given that the number of incoming_links is mainly a relevance signal, knowing the value in real-time does not seem to be a strong requirement.

We could investigate whether there are ways to have this field updated from a batch job, similar to how we update the popularity_score field, so that we can better evaluate how the new search update pipeline should approach this field.

AC:

  • investigate possible ways to compute and refresh the number of incoming links from a batch job

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change 844075 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] Import cirrus indexes to hdfs

https://gerrit.wikimedia.org/r/844075

The first step, loading the cirrus indices into the hadoop cluster so we can further process them, is mostly ready. We already have data in yarn to work with, and the automated job should be deployed next week. Some notes:

  • spark is having issues with very large rows, such as multiple megabytes of text. We often need to allocate significant amounts of off-heap memory overhead to get them running (4+GB, vs 384M by default). Both parquet and avro output formats needed excessive memory here; it seems like something we have to accept.
  • Setting --conf spark.executor.extraJavaOptions=-XX:MaxDirectMemorySize=1024M (values vary) seems to help a bit when reading, but it depends on context and isn't fully reliable (see the sketch after this list). Related online discussions suggest the problem is libraries waiting for the GC to clean up buffer references, but since the buffers don't apply heap pressure they never force a GC.
  • Looked over various online mentions of similar issues. Direct memory buffers are used for a few different things: notably, when spark fetches from the shuffle service the HTTP library it uses relies on them, and they are also used when decompressing snappy (via parquet). Probably other cases as well. Spark does not seem to have a consistent story about how to handle these.
  • Switching to avro (from parquet) seems to have made reading the data back from hdfs much less error prone. With 768M of memory overhead, tasks rarely seem to fail.
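
For context, a minimal sketch of the kind of executor memory settings discussed in the list above. The values and app name are illustrative; the actual import job in discovery/analytics may configure this differently.

```
from pyspark.sql import SparkSession

# Illustrative values only; the real import job may differ. The defaults
# noted above are 384M of executor memory overhead.
spark = (
    SparkSession.builder
    .appName("import_cirrus_indexes_sketch")  # hypothetical app name
    # Extra off-heap headroom for multi-megabyte rows.
    .config("spark.executor.memoryOverhead", "4g")
    # Cap direct buffers used by shuffle fetches and snappy/parquet decompression.
    .config("spark.executor.extraJavaOptions", "-XX:MaxDirectMemorySize=1024M")
    .getOrCreate()
)

# Reading the avro output back (requires the spark-avro package); the path
# is a placeholder.
df = spark.read.format("avro").load("hdfs:///path/to/cirrus_dump")
```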

With the data now available in hdfs I've run some stats over it:

  • Number of pages: 495M
  • Number of distinct outgoing links: 595M. I believe this exceeds the page count because outgoing links include red links; I didn't expect red links to be quite this prevalent though. Might be worth further investigation.
  • Number of pages in weekly popularity_score update: 183M
  • Time to import weekly update: 32h

Pushing a batch job with 500M updates is potentially too much. One idea we've talked about for making this job easier to push into elasticsearch is to exclude pages with very few incoming links from the batch update. This follows a pattern I've read about for dealing with large amounts of long-tailed data: set a minimum value and assume all values below it are equal to the minimum. For example, with a minimum value of 10 we wouldn't send a value <= 10 for any page; instead, relevance calculations would assume that a missing value equals 10. A sketch of this thresholding is shown below.
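
A minimal PySpark sketch of that thresholding idea; the table name, column names and the threshold of 10 are placeholders for illustration.

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

MIN_INCOMING_LINKS = 10  # the example threshold from the paragraph above

# Hypothetical table of per-page counts; name and columns are placeholders.
counts = spark.read.table("discovery.incoming_link_counts")

# Only ship updates for pages above the threshold; relevance scoring would
# then treat a missing incoming_links value as equal to MIN_INCOMING_LINKS.
updates = counts.filter(F.col("incoming_links") > MIN_INCOMING_LINKS)
```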

Some stats on how frequently pages have very few incoming links:

num incoming links | unique pages
> 0                |  595,000,445
> 1                |  226,732,557
> 2                |  151,754,751
> 3                |  118,936,875
> 4                |  100,144,604
> 5                |   87,738,561
> 6                |   78,681,425
> 7                |   71,776,125
> 8                |   66,193,937
> 9                |   61,428,518
> 10               |   57,265,663

I've run out of time today, but will look into the overlap of this data with the popularity data. That is a little more involved since the outgoing_links don't have page_ids; the data I have currently is all title-based. It should hopefully be solvable, though, since we store the namespace_text and title for each page in the index.
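
A rough sketch of the kind of title-based join that would make the overlap check possible; the table and column names here are assumptions, not existing datasets.

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs; table and column names are placeholders.
link_counts = spark.read.table("discovery.incoming_link_counts_by_title")
popularity = spark.read.table("discovery.popularity_score")

# Join on the title-based key since the link data has no page_ids.
overlap = link_counts.join(
    popularity,
    on=["wikiid", "namespace_text", "title"],
    how="inner",
)
print(overlap.count())
```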

Change 844075 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] Import cirrus indexes to hdfs

https://gerrit.wikimedia.org/r/844075

Change 855655 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] Implement incoming_links update as batch job

https://gerrit.wikimedia.org/r/855655

After some analysis I've found we don't even need to filter out pages with low numbers of incoming links directly. The current state of the index is somewhat far from the exact values, which is expected due to our usage of super_detect_noop with a 20% threshold. The first run of this will send ~115M pages to be updated. Analysis of two database dumps more than a week apart suggests that most pages have relatively constant incoming_links counts across weeks: of 185M pages in the dumps with incoming_links, only 6M changed between 10-30 and 11-11.
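
For context, a sketch of what a "within 20%" threshold implies. This is an illustration of the threshold semantics only, not the actual super_detect_noop handler from the search-extra plugin.

```
def would_be_noop(old: int, new: int, threshold: float = 0.2) -> bool:
    """Illustration only: an update is skipped when the new value is within
    `threshold` (20%) of the old one. The real behaviour lives in the
    super_detect_noop handler, not in this sketch."""
    if old == 0:
        return new == 0
    return abs(new - old) / abs(old) <= threshold

# With a 20% threshold, 100 -> 115 would be dropped as a noop,
# while 100 -> 125 would be written.
assert would_be_noop(100, 115)
assert not would_be_noop(100, 125)
```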

Change 856692 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Add config option to disable incoming link counting

https://gerrit.wikimedia.org/r/856692

Change 856692 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Add config option to disable incoming link counting

https://gerrit.wikimedia.org/r/856692

Change 855655 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] Implement incoming_links update as batch job

https://gerrit.wikimedia.org/r/855655

Change 859137 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] incoming_links: Rename wiki to wikiid

https://gerrit.wikimedia.org/r/859137

Change 859137 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] incoming_links: Rename wiki to wikiid

https://gerrit.wikimedia.org/r/859137

Change 861413 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] Relax wait time for wait_for_incoming_links

https://gerrit.wikimedia.org/r/861413

Change 861413 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] Relax wait time for wait_for_incoming_links

https://gerrit.wikimedia.org/r/861413

Change 862343 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] cirrus: Disable incoming link counting

https://gerrit.wikimedia.org/r/862343

Change 862343 merged by jenkins-bot:

[operations/mediawiki-config@master] cirrus: Disable incoming link counting

https://gerrit.wikimedia.org/r/862343

Mentioned in SAL (#wikimedia-operations) [2023-01-12T21:56:58Z] <thcipriani@deploy1002> Started scap: Backport for [[gerrit:879161|cirrus: Divert requests with x-public-cloud set to a dedicated pool counter (T326757)]], [[gerrit:862343|cirrus: Disable incoming link counting (T317023)]]

Mentioned in SAL (#wikimedia-operations) [2023-01-12T21:58:32Z] <thcipriani@deploy1002> thcipriani and ebernhardson: Backport for [[gerrit:879161|cirrus: Divert requests with x-public-cloud set to a dedicated pool counter (T326757)]], [[gerrit:862343|cirrus: Disable incoming link counting (T317023)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-01-12T22:06:22Z] <thcipriani@deploy1002> Finished scap: Backport for [[gerrit:879161|cirrus: Divert requests with x-public-cloud set to a dedicated pool counter (T326757)]], [[gerrit:862343|cirrus: Disable incoming link counting (T317023)]] (duration: 09m 23s)