New Impact module: Run backend updating logic on all Wikipedias
Closed, Resolved (Public)

Description

In the parent task (T336203), we want to deploy the new Impact module to all Wikipedias. For that to be possible, the backend updating logic needs to be available on all Wikipedias. This should be done prior to the actual deployment, so we can be sure that updating impact data on each edit doesn't cause technical issues that we didn't observe on the smaller wikis. (On the other hand, the new Impact module is already enabled on frwiki and arwiki, both fairly large wikis, so the probability is low, but we should test anyway.)

Checklist
  • Enable wgGERefreshUserImpactDataMaintenanceScriptEnabled on large Wikipedias (P50571 is my proposal)
  • Observe logs (Logstash + maintenance job logs), ensure impact data gets populated. Suggested duration: 1 week
  • Resolve T344428: refreshUserImpactJob logs mysterious fatal errors
  • Enable wgGERefreshUserImpactDataMaintenanceScriptEnabled everywhere
  • Once impact data populates on all Wikipedias, resolve the task.

Event Timeline

Urbanecm_WMF triaged this task as Medium priority.

Change 949033 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[operations/mediawiki-config@master] Growth: Enable new Impact backend on large Wikipedias

https://gerrit.wikimedia.org/r/949033

Change 949034 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[operations/mediawiki-config@master] Growth: Enable new Impact backend everywhere

https://gerrit.wikimedia.org/r/949034

Change 949033 merged by jenkins-bot:

[operations/mediawiki-config@master] Growth: Enable new Impact backend on large Wikipedias

https://gerrit.wikimedia.org/r/949033

Mentioned in SAL (#wikimedia-operations) [2023-08-16T13:52:19Z] <urbanecm@deploy1002> Started scap: Backport for [[gerrit:949178|jobqueue: Disallow cross-wiki JobQueueGroup calls that require JobClasses (T344223 T343291)]], [[gerrit:949033|Growth: Enable new Impact backend on large Wikipedias (T344143)]]

Mentioned in SAL (#wikimedia-operations) [2023-08-16T13:53:55Z] <urbanecm@deploy1002> urbanecm and d3r1ck01: Backport for [[gerrit:949178|jobqueue: Disallow cross-wiki JobQueueGroup calls that require JobClasses (T344223 T343291)]], [[gerrit:949033|Growth: Enable new Impact backend on large Wikipedias (T344143)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessibl

Mentioned in SAL (#wikimedia-operations) [2023-08-16T14:06:32Z] <urbanecm@deploy1002> Finished scap: Backport for [[gerrit:949178|jobqueue: Disallow cross-wiki JobQueueGroup calls that require JobClasses (T344223 T343291)]], [[gerrit:949033|Growth: Enable new Impact backend on large Wikipedias (T344143)]] (duration: 14m 13s)

UserImpact jobs started for the P50571 wikis. I saw a bunch of error messages in the logs; some of them happened on the wikis we newly deployed to, some happened elsewhere. Example log entries:

Exception executing job: refreshUserImpactJob Спеціальна: impactDataBatch=array(100) staleBefore=1692174841 requestId=e75360eec015cb3c92297e77 : Wikimedia\Rdbms\DBConnectionError: Cannot access the database: could not connect to any replica DB server; Unknown error while connecting (db1148)

[4d5e3babcf44dff5854e1f00] /rpc/RunSingleJob.php   Wikimedia\Rdbms\DBConnectionError: Cannot access the database: could not connect to any replica DB server; Unknown error while connecting (db1170:3317)

Exception executing job: refreshUserImpactJob Spécial: impactDataBatch=array(100) staleBefore=1692174729 requestId=4d5e3babcf44dff5854e1f00 : Wikimedia\Rdbms\DBConnectionError: Cannot access the database: could not connect to any replica DB server; Unknown error while connecting (db1170:3317)
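To see which replica hosts the failures cluster on, entries like the ones above can be tallied with a short script. This is a minimal sketch, not something that was run as part of this task: the names `parse_entry` and `count_hosts` are hypothetical, the regexes assume the message layout of the three sample lines, and `staleBefore` is assumed to be a Unix timestamp in seconds.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# The three sample entries quoted above (raw strings keep the backslashes intact).
SAMPLE_ENTRIES = [
    r"Exception executing job: refreshUserImpactJob Спеціальна: impactDataBatch=array(100) staleBefore=1692174841 requestId=e75360eec015cb3c92297e77 : Wikimedia\Rdbms\DBConnectionError: Cannot access the database: could not connect to any replica DB server; Unknown error while connecting (db1148)",
    r"[4d5e3babcf44dff5854e1f00] /rpc/RunSingleJob.php   Wikimedia\Rdbms\DBConnectionError: Cannot access the database: could not connect to any replica DB server; Unknown error while connecting (db1170:3317)",
    r"Exception executing job: refreshUserImpactJob Spécial: impactDataBatch=array(100) staleBefore=1692174729 requestId=4d5e3babcf44dff5854e1f00 : Wikimedia\Rdbms\DBConnectionError: Cannot access the database: could not connect to any replica DB server; Unknown error while connecting (db1170:3317)",
]

# Assumed layout: the failing host (with optional port) is in trailing parentheses.
HOST_RE = re.compile(r"while connecting \(([\w.-]+(?::\d+)?)\)")
STALE_RE = re.compile(r"staleBefore=(\d+)")
REQUEST_RE = re.compile(r"requestId=([0-9a-f]+)")


def parse_entry(line):
    """Extract the DB host, request ID and staleBefore time from one log line.

    Fields that do not appear in the line are returned as None.
    """
    host = HOST_RE.search(line)
    stale = STALE_RE.search(line)
    request = REQUEST_RE.search(line)
    return {
        "host": host.group(1) if host else None,
        "request_id": request.group(1) if request else None,
        "stale_before": (
            datetime.fromtimestamp(int(stale.group(1)), tz=timezone.utc)
            if stale
            else None
        ),
    }


def count_hosts(lines):
    """Count connection failures per database host, skipping lines without one."""
    hosts = (parse_entry(line)["host"] for line in lines)
    return Counter(h for h in hosts if h is not None)


if __name__ == "__main__":
    for host, count in count_hosts(SAMPLE_ENTRIES).most_common():
        print(host, count)
```

On the three samples this counts db1170:3317 twice and db1148 once. The second and third entries carry the same request ID (4d5e3babcf44dff5854e1f00), so they describe a single failure logged twice; a real tally would need to deduplicate on that ID, which the sketch does not do.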

I'm not sure why the errors are happening. It might be that there was a lot to do with all the projects enabled at once. Let's monitor the logs for a couple more days to see whether this recurs, and then we can proceed. The next scheduled executions should be tomorrow morning UTC (05:15 and 07:45 respectively).

The errors recurred this morning :-(. I believe they have the same root cause as described in T341658#9042285, as connecting to a database is (also) handled in a way similar to opening files. I filed T344428 as an umbrella task.

Considering the amount of logspam, we should probably resolve T344428 and its subtasks before scaling the new Impact backend logic further.

Change 949034 merged by jenkins-bot:

[operations/mediawiki-config@master] Growth: Enable new Impact backend everywhere

https://gerrit.wikimedia.org/r/949034

Mentioned in SAL (#wikimedia-operations) [2023-10-26T08:39:24Z] <urbanecm@deploy2002> Started scap: Backport for [[gerrit:949034|Growth: Enable new Impact backend everywhere (T344143)]]

Mentioned in SAL (#wikimedia-operations) [2023-10-26T08:40:48Z] <urbanecm@deploy2002> urbanecm: Backport for [[gerrit:949034|Growth: Enable new Impact backend everywhere (T344143)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2023-10-26T08:48:53Z] <urbanecm@deploy2002> Finished scap: Backport for [[gerrit:949034|Growth: Enable new Impact backend everywhere (T344143)]] (duration: 09m 29s)

I've enabled the new Impact backend globally. The last remaining step is to validate that the impact data got populated. I'll do that in a couple of hours, when the first update (triggered in T344428#9283254) finishes everywhere.

Urbanecm_WMF updated the task description.

The new Impact backend successfully executed on all Wikipedias. Resolving.