
🟡 Query Service updater outage on 19th of November 2023
Closed, Resolved · Public

Description

On the morning of the 19th of November 2023, the Query Service updater started crashing, resulting in no data being updated in the Query Service for any wiki on wikibase.cloud.

Event Timeline

Evelien_WMDE renamed this task from Investigate QueryService missing data root cause to 🟡 Investigate QueryService missing data root cause. Oct 9 2023, 9:52 AM

This is kind of a shot in the dark, but I remembered an issue we had a couple of years ago that might be useful here. A bit of context: we had a manual installation of Wikibase (without the Docker image, because it wasn't available back then). Later we installed the Query Service via the Docker image and made some manual adjustments to the URLs in runUpdate.sh. It worked fine for a while, but at some point the updater broke and we didn't notice right away. After some edits a while later we noticed, got the updater running again, but the new items still wouldn't appear. IIRC the cause was that the updater would query the recent changes, but since the last import into Blazegraph was more than 90 days ago, it could no longer find the last point to connect to. The fix was just to increase $wgRCMaxAge and run maintenance/rebuildrecentchanges.php again.
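
For reference, a minimal sketch of that old fix (assuming a standard MediaWiki install and the default 90-day retention; the exact value is up to you):

// LocalSettings.php: keep recentchanges entries longer than the default
// 90 days (value is in seconds), so the updater can still find its last
// sync point
$wgRCMaxAge = 180 * 24 * 3600; // e.g. 180 days

# then repopulate the recentchanges table, run from the MediaWiki root
php maintenance/rebuildrecentchanges.php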

So how does this relate to this bug: while thinking about this issue over the last few days, my thought was that maybe a similar thing happened here, only that instead of going over the time limit, it went over the limit on the number of items in recent changes, which should be 5000 (or at least it was a while ago). So while the updater was ingesting the first chunk of changes, many new edits were being made, and by the time it was done, more than 5000 changes had occurred. In my case at least this is plausible, because I created close to 9000 new items in total, and I was surprised how fast WikibaseIntegrator was running those (compared to QuickStatements, which I had used for the first batch of imports).

As I said, this was a while ago (~5-6 years) and I am not sure if and how the functionality of the updater has changed since, but I thought I'd share this as a potential lead. Maybe it helps.
In the meantime, I will add a little time.sleep() call in the WikibaseIntegrator loop when doing further imports or updates, to make sure I don't overwhelm the service.

Clearly something went awry around 2am this morning. Have marked 2839220 as done to try and unblock things.

Still failing; I have tried to hack past the issue so updates will continue from now on.
I have run $qs->update(['eventTo' => 14795244]) on the most recent QsBatch. This was the most recent eventId at the time of writing. We will need to replay the events between around 2am this morning and this eventId.

The previous value of eventTo on the one existing QsBatch was 14738104.
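
For the record, a minimal tinker sketch of that "fast forward" workaround (assuming the eventTo field above and that the highest id is the most recent batch; not a verbatim transcript of the session):

>>> $qs = QsBatch::orderBy('id', 'desc')->first()  // most recent batch (assumed ordering)
>>> $qs->eventTo                                   // => 14738104, the stale value
>>> $qs->update(['eventTo' => 14795244])           // fast-forward past the stuck range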

Pull Request for fixing the issue: https://github.com/wbstack/api/pull/685

What was causing the outage was the following:

  • with https://phabricator.wikimedia.org/T333164 we changed the ordering mechanism that determines the order in which entity batches are processed. This means that heavily edited wikis might now be deprioritized because of the way we merge batches
  • one heavily edited wiki had amassed so many entity IDs that their concatenated form was too long to write to the database (see the sketch below)
  • as this merge currently happens every time the updater polls for batches, every request would fail, essentially keeping the updater from doing any work
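
To illustrate the failure mode, here is a minimal, self-contained sketch (hypothetical names and column limit; not the actual wbstack/api code):

<?php
// Merging batches concatenates all entity IDs for a wiki into a single
// database column; one heavily edited wiki can push that string past the
// column limit, so the write fails on every poll and no batch progresses.

$columnLimit = 65535; // assumed TEXT-sized column limit

// pretend one wiki has accumulated tens of thousands of edited entities
$entityIds = array_map(fn ($i) => "Q$i", range(1, 60000));

$merged = implode(',', array_unique($entityIds));

if (strlen($merged) > $columnLimit) {
    // in production this surfaced as a failed database write, which aborted
    // the poll and blocked batch processing for every wiki
    echo 'merged id list is ' . strlen($merged) . " bytes, over the limit\n";
}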

This was worked around by @Tarrow by "fast forwarding" batches in the database. After deploying the fix, we should rewind and replay all PageUpdateEvents since Sunday morning.

Fring removed Fring as the assignee of this task. Nov 23 2023, 5:34 PM
Fring moved this task from Doing to In Review on the Wikibase Cloud (Kanban board Q4 2023) board.
Fring subscribed.
Fring renamed this task from 🟡 Investigate QueryService missing data root cause to 🟡 Query Service updater outage on 19th of November 2023. Nov 27 2023, 10:02 AM
Fring updated the task description.

After deploying, we plan to set the QsCheckpoint to 14738103, which means we'll reprocess all updates since the incident.

We should test this in staging by rewinding there too.

Fring removed Fring as the assignee of this task. Nov 27 2023, 12:07 PM
Fring moved this task from Doing to In Review on the Wikibase Cloud (Kanban board Q4 2023) board.

In production, migrations were applied and the checkpoint updated:

root@api-app-backend-845b4db576-hfpj8:/var/www/html# php artisan migrate       
**************************************
*     Application In Production!     *
**************************************

 Do you really wish to run this command? (yes/no) [no]:
 > yes

Migrating: 2023_11_21_091331_create_qs_checkpoints_table
Migrated:  2023_11_21_091331_create_qs_checkpoints_table (51.15ms)
root@api-app-backend-845b4db576-hfpj8:/var/www/html# php artisan tinker
Psy Shell v0.11.1 (PHP 7.4.33 — cli) by Justin Hileman
>>> QsCheckpoint::set(14738103)
[!] Aliasing 'QsCheckpoint' to 'App\QsCheckpoint' for this Tinker session.
=> null
>>> QsCheckpoint::get()
=> 14738103

After a bit of back and forth because of memory consumption issues, this is now running in production.
The initial job run created roughly 4k batches:

>>> QsCheckpoint::get()
=> 14980646
>>> QsBatch::where(['done' => 0])->count()
=> 4046

Evelien_WMDE claimed this task.