
🟡 Query Service updater outage on 19th of November 2023
Closed, Resolved · Public

Description

On the morning of the 19th of November 2023, the Query Service updater started crashing, resulting in no data being updated in the Query Service for any wiki on wikibase.cloud.

Event Timeline

Evelien_WMDE renamed this task from Investigate QueryService missing data root cause to 🟡 Investigate QueryService missing data root cause. Oct 9 2023, 9:52 AM

This is kind of a shot in the dark, but I remembered an issue we had a couple of years ago that might be useful here. A bit of context: we had a manual installation of Wikibase (without the Docker image, because it wasn't available back then). Later we installed the Query Service via the Docker image and made some manual adjustments to the URLs in runUpdate.sh. It worked fine for a while, but at some point the updater broke and we didn't notice right away. After some edits a while later we noticed, got the updater running again, but the new items still wouldn't appear. IIRC the cause was that the updater would query the recent changes, but since the last import into Blazegraph was more than 90 days ago, it could no longer find the last point to connect to. The fix was just to increase $wgRCMaxAge and run maintenance/rebuildrecentchanges.php again.
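
For reference, a minimal sketch of that old fix (assuming a standard MediaWiki install and the default 90-day retention; the exact value is up to you):

// LocalSettings.php: keep recentchanges entries longer than the default
// 90 days (value is in seconds), so the updater can still find its last
// sync point
$wgRCMaxAge = 180 * 24 * 3600; // e.g. 180 days

# then repopulate the recentchanges table, run from the MediaWiki root
php maintenance/rebuildrecentchanges.php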

So how does this relate to this bug: while thinking about this issue over the last few days, my thought was that maybe a similar thing happened here, only that instead of going over the time limit, it went over the limit on the number of items in recent changes, which should be 5000 (or at least it was a while ago). So while the updater was ingesting the first chunk of changes, many new edits were being made, and by the time it was done, more than 5000 changes had occurred. In my case at least this is plausible, because I created close to 9000 new items in total, and I was surprised how fast WikibaseIntegrator was running those (compared to QuickStatements, which I had used for the first batch of imports).

As I said, this was a while ago (~5-6 years) and I am not sure if and how the functionality of the updater has changed since, but I thought I'd share this as a potential lead. Maybe it helps.
In the meantime, I will add a little time.sleep() call in the WikibaseIntegrator loop when doing further imports or updates, to make sure I don't overwhelm the service.

Clearly something went awry around 2am this morning. Have marked 2839220 as done to try and unblock things.

Still failing; I have tried to hack past the issue so updates will continue from now on.
I have run $qs->update(['eventTo' => 14795244]) on the most recent QsBatch. This was the most recent eventId at the time of writing. We will need to replay the events between around 2am this morning and this eventId.

The previous value of eventTo on the one existing QsBatch was 14738104.
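
For the record, a minimal tinker sketch of that "fast forward" workaround (assuming the eventTo field above and that the highest id is the most recent batch; not a verbatim transcript of the session):

>>> $qs = QsBatch::orderBy('id', 'desc')->first()  // most recent batch (assumed ordering)
>>> $qs->eventTo                                   // => 14738104, the stale value
>>> $qs->update(['eventTo' => 14795244])           // fast-forward past the stuck range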

Pull Request for fixing the issue: https://github.com/wbstack/api/pull/685

What was causing the outage was the following:

  • with https://phabricator.wikimedia.org/T333164 we changed the ordering mechanism that determines the order in which entity batches are processed. This means that heavily edited wikis might now be deprioritized because of the way we merge batches
  • one heavily edited wiki had amassed so many entity IDs that their concatenated form was too long to write to the database (see the sketch below)
  • as this merge currently happens every time the updater polls for batches, every request would fail, essentially keeping the updater from doing any work
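
To illustrate the failure mode, here is a minimal, self-contained sketch (hypothetical names and column limit; not the actual wbstack/api code):

<?php
// Merging batches concatenates all entity IDs for a wiki into a single
// database column; one heavily edited wiki can push that string past the
// column limit, so the write fails on every poll and no batch progresses.

$columnLimit = 65535; // assumed TEXT-sized column limit

// pretend one wiki has accumulated tens of thousands of edited entities
$entityIds = array_map(fn ($i) => "Q$i", range(1, 60000));

$merged = implode(',', array_unique($entityIds));

if (strlen($merged) > $columnLimit) {
    // in production this surfaced as a failed database write, which aborted
    // the poll and blocked batch processing for every wiki
    echo 'merged id list is ' . strlen($merged) . " bytes, over the limit\n";
}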

This was worked around by @Tarrow by "fast forwarding" batches in the database. After deploying the fix, we should rewind and replay all PageUpdateEvents since Sunday morning.

Fring removed Fring as the assignee of this task. Nov 23 2023, 5:34 PM
Fring moved this task from Doing to In Review on the Wikibase Cloud (Kanban board Q4 2023) board.
Fring subscribed.
Fring renamed this task from 🟡 Investigate QueryService missing data root cause to 🟡 Query Service updater outage on 19th of November 2023. Nov 27 2023, 10:02 AM
Fring updated the task description.

After deploying, we plan to set the QsCheckpoint to 14738103, which means we'll reprocess all updates since the incident.

We should test this in staging by rewinding there too.

Fring removed Fring as the assignee of this task. Nov 27 2023, 12:07 PM
Fring moved this task from Doing to In Review on the Wikibase Cloud (Kanban board Q4 2023) board.

In production, migrations were applied and the checkpoint updated:

root@api-app-backend-845b4db576-hfpj8:/var/www/html# php artisan migrate       
**************************************
*     Application In Production!     *
**************************************

 Do you really wish to run this command? (yes/no) [no]:
 > yes

Migrating: 2023_11_21_091331_create_qs_checkpoints_table
Migrated:  2023_11_21_091331_create_qs_checkpoints_table (51.15ms)
root@api-app-backend-845b4db576-hfpj8:/var/www/html# php artisan tinker
Psy Shell v0.11.1 (PHP 7.4.33 — cli) by Justin Hileman
>>> QsCheckpoint::set(14738103)
[!] Aliasing 'QsCheckpoint' to 'App\QsCheckpoint' for this Tinker session.
=> null
>>> QsCheckpoint::get()
=> 14738103

After a bit of back and forth because of memory consumption issues, this is now running in production.
The initial job run created roughly 4k batches:

>>> QsCheckpoint::get()
=> 14980646
>>> QsBatch::where(['done' => 0])->count()
=> 4046

Evelien_WMDE claimed this task.