
Investigate moving incoming_links computation to a batch job
Closed, Resolved · Public · 8 Estimated Story Points

Description

The way the field containing the number of incoming links is fed today has multiple caveats that make it hard to port as-is to a new search update pipeline.

The main issue is that CirrusSearch uses its own index to extract this number:

  • a page X is edited to add a link to page Y and remove a link to page Z (known from MW LinksUpdateComplete hook)
  • a job to re-compute the number of incoming_links to Y and another one for page Z are scheduled with a delay
  • the job to update page X is assumed to run before Y & Z pages are updated
  • the elasticsearch index is assumed to be refreshed before the Y & Z jobs run the count(outgoing_links:X) query against elasticsearch (see the sketch below)
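
For reference, a minimal sketch of the kind of per-page count query described above, using Elasticsearch's _count API over HTTP. The URL, index name and field name here are illustrative assumptions following the wording of this task, not the exact CirrusSearch internals.

```
import requests

# Hypothetical sketch: count how many documents link to a given page by
# querying the outgoing links field across the index. The URL, index name
# and field name are illustrative assumptions.
ES_URL = "http://localhost:9200"
INDEX = "enwiki_content"

def incoming_link_count(title: str) -> int:
    """Number of documents whose outgoing links field contains `title`."""
    body = {"query": {"term": {"outgoing_links": title}}}
    resp = requests.post(f"{ES_URL}/{INDEX}/_count", json=body, timeout=10)
    resp.raise_for_status()
    return resp.json()["count"]

# e.g. the delayed job for page Y would effectively do:
# incoming_link_count("Y")
```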

Caveats:

  • delaying via changeprop is done by re-submitting a kafka message
  • everything assumes that updating page X went well and that the index was refreshed within the given delay

Replicating this technique in the new search update pipeline does not seem wise.

Given that the number of incoming_links is mainly a relevance signal, knowing the value in real-time does not seem to be a strong requirement.

We could investigate whether there are ways to have this field updated from a batch job, similar to how we update the popularity_score field, so that we can better evaluate how the new search update pipeline should approach this field.

AC:

  • investigate possible ways to compute and refresh the number of incoming links from a batch job

Event Timeline

Restricted Application added a subscriber: Aklapper.

Change 844075 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] Import cirrus indexes to hdfs

https://gerrit.wikimedia.org/r/844075

The first step, loading the cirrus indices into the hadoop cluster so we can further process them, is mostly ready. We already have data in yarn to work with, and the automated job should be deployed next week. Some notes:

  • spark is having issues with very large rows, such as multiple megabytes of text. We often need to allocate significant amounts of off-heap memory overhead to get them running (4+GB, vs 384M by default). Both parquet and avro output formats needed excessive memory here; it seems like something we have to accept.
  • Setting --conf spark.executor.extraJavaOptions=-XX:MaxDirectMemorySize=1024M (values vary) seems to help a bit when reading, but it depends on context and isn't fully reliable (see the sketch after this list). Related online discussions suggest the problem is libraries waiting for the GC to clean up buffer references, but since the buffers don't apply heap pressure they never force a GC.
  • Looked over various online mentions of similar issues. Direct memory buffers are used for a few different things: notably, when spark fetches from the shuffle service the HTTP library it uses relies on them, and they are also used when decompressing snappy (via parquet). Probably other cases as well. Spark does not seem to have a consistent story about how to handle these.
  • Switching to avro (from parquet) seems to have made reading the data back from hdfs much less error prone. With 768M of memory overhead, tasks rarely seem to fail.
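
For context, a minimal sketch of the kind of executor memory settings discussed in the list above. The values and app name are illustrative; the actual import job in discovery/analytics may configure this differently.

```
from pyspark.sql import SparkSession

# Illustrative values only; the real import job may differ. The defaults
# noted above are 384M of executor memory overhead.
spark = (
    SparkSession.builder
    .appName("import_cirrus_indexes_sketch")  # hypothetical app name
    # Extra off-heap headroom for multi-megabyte rows.
    .config("spark.executor.memoryOverhead", "4g")
    # Cap direct buffers used by shuffle fetches and snappy/parquet decompression.
    .config("spark.executor.extraJavaOptions", "-XX:MaxDirectMemorySize=1024M")
    .getOrCreate()
)

# Reading the avro output back (requires the spark-avro package); the path
# is a placeholder.
df = spark.read.format("avro").load("hdfs:///path/to/cirrus_dump")
```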

With the data now available in hdfs I've run some stats over it:

  • Number of pages: 495M
  • Number of distinct outgoing links: 595M. I believe this exceeds the page count because outgoing links include red links; I didn't expect red links to be quite this prevalent though. Might be worth further investigation.
  • Number of pages in weekly popularity_score update: 183M
  • Time to import weekly update: 32h

Pushing a batch job with 500M updates is potentially too much. One idea we've talked about for making this job easier to push into elasticsearch is to exclude pages with very few incoming links from the batch update. This follows a pattern I've read about for dealing with large amounts of long-tailed data: set a minimum value and assume all values below it are equal to the minimum. For example, with a minimum value of 10 we wouldn't send a value <= 10 for any page; instead, relevance calculations would assume that a missing value equals 10. A sketch of this thresholding is shown below.
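
A minimal PySpark sketch of that thresholding idea; the table name, column names and the threshold of 10 are placeholders for illustration.

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

MIN_INCOMING_LINKS = 10  # the example threshold from the paragraph above

# Hypothetical table of per-page counts; name and columns are placeholders.
counts = spark.read.table("discovery.incoming_link_counts")

# Only ship updates for pages above the threshold; relevance scoring would
# then treat a missing incoming_links value as equal to MIN_INCOMING_LINKS.
updates = counts.filter(F.col("incoming_links") > MIN_INCOMING_LINKS)
```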

Some stats on how frequently pages have very few incoming links:

num incoming links | unique pages
> 0                |  595,000,445
> 1                |  226,732,557
> 2                |  151,754,751
> 3                |  118,936,875
> 4                |  100,144,604
> 5                |   87,738,561
> 6                |   78,681,425
> 7                |   71,776,125
> 8                |   66,193,937
> 9                |   61,428,518
> 10               |   57,265,663

I've run out of time today, but will look into the overlap of this data with the popularity data. That is a little more involved since the outgoing_links don't have page_ids; the data I have currently is all title-based. It should hopefully be solvable, though, since we store the namespace_text and title for each page in the index.
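
A rough sketch of the kind of title-based join that would make the overlap check possible; the table and column names here are assumptions, not existing datasets.

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs; table and column names are placeholders.
link_counts = spark.read.table("discovery.incoming_link_counts_by_title")
popularity = spark.read.table("discovery.popularity_score")

# Join on the title-based key since the link data has no page_ids.
overlap = link_counts.join(
    popularity,
    on=["wikiid", "namespace_text", "title"],
    how="inner",
)
print(overlap.count())
```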

Change 844075 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] Import cirrus indexes to hdfs

https://gerrit.wikimedia.org/r/844075

Change 855655 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] Implement incoming_links update as batch job

https://gerrit.wikimedia.org/r/855655

After some analysis I've found we don't even need to filter out pages with low numbers of incoming links directly. The current state of the index is somewhat far from the exact values, which is expected due to our usage of super_detect_noop with a 20% threshold. The first run of this will send ~115M pages to be updated. Analysis of two database dumps more than a week apart suggests that most pages have relatively constant incoming_links counts across weeks: of 185M pages in the dumps with incoming_links, only 6M changed between 10-30 and 11-11.
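
For context, a sketch of what a "within 20%" threshold implies. This is an illustration of the threshold semantics only, not the actual super_detect_noop handler from the search-extra plugin.

```
def would_be_noop(old: int, new: int, threshold: float = 0.2) -> bool:
    """Illustration only: an update is skipped when the new value is within
    `threshold` (20%) of the old one. The real behaviour lives in the
    super_detect_noop handler, not in this sketch."""
    if old == 0:
        return new == 0
    return abs(new - old) / abs(old) <= threshold

# With a 20% threshold, 100 -> 115 would be dropped as a noop,
# while 100 -> 125 would be written.
assert would_be_noop(100, 115)
assert not would_be_noop(100, 125)
```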

Change 856692 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Add config option to disable incoming link counting

https://gerrit.wikimedia.org/r/856692

Change 856692 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Add config option to disable incoming link counting

https://gerrit.wikimedia.org/r/856692

Change 855655 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] Implement incoming_links update as batch job

https://gerrit.wikimedia.org/r/855655

Change 859137 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] incoming_links: Rename wiki to wikiid

https://gerrit.wikimedia.org/r/859137

Change 859137 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] incoming_links: Rename wiki to wikiid

https://gerrit.wikimedia.org/r/859137

Change 861413 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[wikimedia/discovery/analytics@master] Relax wait time for wait_for_incoming_links

https://gerrit.wikimedia.org/r/861413

Change 861413 merged by jenkins-bot:

[wikimedia/discovery/analytics@master] Relax wait time for wait_for_incoming_links

https://gerrit.wikimedia.org/r/861413

Change 862343 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] cirrus: Disable incoming link counting

https://gerrit.wikimedia.org/r/862343

Change 862343 merged by jenkins-bot:

[operations/mediawiki-config@master] cirrus: Disable incoming link counting

https://gerrit.wikimedia.org/r/862343

Mentioned in SAL (#wikimedia-operations) [2023-01-12T21:56:58Z] <thcipriani@deploy1002> Started scap: Backport for [[gerrit:879161|cirrus: Divert requests with x-public-cloud set to a dedicated pool counter (T326757)]], [[gerrit:862343|cirrus: Disable incoming link counting (T317023)]]

Mentioned in SAL (#wikimedia-operations) [2023-01-12T21:58:32Z] <thcipriani@deploy1002> thcipriani and ebernhardson: Backport for [[gerrit:879161|cirrus: Divert requests with x-public-cloud set to a dedicated pool counter (T326757)]], [[gerrit:862343|cirrus: Disable incoming link counting (T317023)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-01-12T22:06:22Z] <thcipriani@deploy1002> Finished scap: Backport for [[gerrit:879161|cirrus: Divert requests with x-public-cloud set to a dedicated pool counter (T326757)]], [[gerrit:862343|cirrus: Disable incoming link counting (T317023)]] (duration: 09m 23s)