
🟡 Investigate QueryService missing data root cause
Closed, Resolved · Public

Description

Since Wikibase Cloud went live, there have been recurring issues with the QueryService missing data. The oldest report of this was in November last year: https://phabricator.wikimedia.org/T322753

Over the last few weeks, we've been seeing an increased number of reports:

We're already aware of 2 potential solutions we want to try out:

There are multiple places where data injection into the QueryService can be missed. In addition to the 2 solutions we're already planning to implement, we want to run an investigation to find the root cause(s).

AC:

Event Timeline

As far as I can tell, all of the linked tickets are caused by edits not being picked up by the updater, which is what we've been optimizing over the last few weeks. Right now, I can't think of any cases that are not covered by the mechanism that is currently running.

The issues I would still like to raise in the existing setup are:

Database locking does not work as intended

The code is currently written under the assumption that it has an exclusive read/write lock on the qs_batches table. To achieve this lock, a DB::transaction wrapper is used. However, I've seen cases where batches would get processed multiple times in parallel, which must mean the locking does not work.

This currently does not create any real problems (it's safe to process batches multiple times), but to improve the predictability of this rather complex system we should look into whether we can make the locking truly exclusive. That would make the code a lot easier to reason about and help us when debugging future problems.
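To illustrate why the current wrapper doesn't prevent parallel processing, here is a minimal sketch of the pattern, assuming hypothetical column names (done, pending_since) rather than the actual QsController code: the transaction makes the read-then-update atomic, but the initial SELECT is a non-locking read, so two workers running it concurrently can pick up the same batches.

```php
use Illuminate\Support\Facades\DB;

// Illustrative sketch only, not the real controller code.
$batches = DB::transaction(function () {
    $batches = DB::table('qs_batches')
        ->where('done', 0)   // assumed "not yet processed" flag
        ->orderBy('id')
        ->limit(10)
        ->get();             // no FOR UPDATE: a parallel worker can read the same rows

    DB::table('qs_batches')
        ->whereIn('id', $batches->pluck('id'))
        ->update(['pending_since' => now()]);

    return $batches;
});
```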

Processes around failed batches

Currently, batches that have failed three times are removed from the queue and an error is raised. We should have an idea of what to do in such cases. Do we manually retry batches? Do we check the entities? Do we introduce a limit on the number of failures we tolerate per week / month?

When do we rebuild

We have a reusable mechanism in place for rebuilding all query service data for a given wiki. In which cases do we want to use this again? Is it the first thing we try on complaints, or is it our last resort? Where do we document this?

Fring removed Fring as the assignee of this task. Dec 20 2023, 1:27 PM
Fring moved this task from Doing to In Review on the Wikibase Cloud (Kanban board Q4 2023) board.
Database locking does not work as intended

Looking through the code at https://github.com/wbstack/api/blob/main/app/Http/Controllers/Backend/QsController.php

It wasn't initially super clear to me, but I think perhaps it's as simple as transactions not acquiring locks by default.

I believe that if we want this we should use pessimistic locks. We should probably "lock for update"; see the Laravel docs on pessimistic locking. I'm no expert, but I would imagine it's better to use lockForUpdate rather than the broader sharedLock, since any other calls to this function will also attempt to get the lock, fail, and then roll back the transaction. In this case we probably also want to tweak the transaction to allow it to retry in the event of a deadlock, so that we minimise the number of errors that actually bubble up to the updater service. I'd guess 5 is a perfectly acceptable number here? According to the docs we just pass a second param to DB::transaction for this.
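A rough sketch of that suggestion (again with assumed column names, not the real controller code): the batch selection takes a row lock via lockForUpdate, and the second argument to DB::transaction lets it retry the whole closure on deadlock.

```php
use Illuminate\Support\Facades\DB;

// Hedged sketch of the proposed change, not the actual QsController implementation.
$batches = DB::transaction(function () {
    $batches = DB::table('qs_batches')
        ->where('done', 0)
        ->orderBy('id')
        ->limit(10)
        ->lockForUpdate()    // SELECT ... FOR UPDATE: a concurrent worker blocks here
        ->get();

    DB::table('qs_batches')
        ->whereIn('id', $batches->pluck('id'))
        ->update(['pending_since' => now()]);

    return $batches;
}, 5);                       // retry the transaction up to 5 times on deadlock
```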

Processes around failed batches

I think at this stage we probably "ought" to manually retry / investigate, but only after trying a rebuild (see below). I like the idea of tolerating failure, but in reality this probably results in a very bad user experience. Perhaps this is something more product-y to talk about with Charlie.

When do we rebuild

I think this is probably the cheapest solution in terms of engineering cost, and it also allows us to realistically take some action on behalf of the user in a timely manner. A proper investigation for each failed batch is quite expensive in human time, and I worry that we won't be able to do it quickly enough to help out the upset user. I'm not sure of the absolute best place to put this, but as a first guess I would propose sticking it somewhere in the wbaas-deploy docs.

Other Issues

However, these little thoughts should be noted alongside the discussions we've had during the rebuild, where we noticed that data may be missing when updates run while the platform is down.

Tarrow subscribed.
Tarrow claimed this task.