
🟡 Investigate QueryService missing data root cause
Closed, Resolved · Public

Description

Since Wikibase Cloud went live, there have been recurring issues with the QueryService missing data. The oldest report of this was in November last year: https://phabricator.wikimedia.org/T322753

Over the last few weeks, we've been seeing an increased number of reports:

We're already aware of 2 potential solutions we want to try out:

There are multiple places where data injection into the QueryService can be missed. In addition to the 2 solutions we're already planning to implement, we want to run an investigation to find the root cause(s).

AC:

Event Timeline

As far as I can tell, all of the linked tickets are caused by edits not being picked up by the updater, which is what we've been optimizing over the last few weeks. Right now, I can't think of any cases that are not covered by the mechanism that is currently running.

The issues I would still like to raise in the existing setup are:

Database locking does not work as intended

The code is currently written under the assumption that it has an exclusive read/write lock on the qs_batches table. To achieve this lock, a DB::transaction wrapper is used. However, I've seen cases where batches would get processed multiple times in parallel, which must mean the locking does not work.

This currently does not create any real problems (it's safe to process batches multiple times), but to improve the predictability of this rather complex system we should look into whether we can make the locking truly exclusive. That would make the code a lot easier to reason about and help us when debugging future problems.
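To illustrate why the current wrapper doesn't prevent parallel processing, here is a minimal sketch of the pattern, assuming hypothetical column names (done, pending_since) rather than the actual QsController code: the transaction makes the read-then-update atomic, but the initial SELECT is a non-locking read, so two workers running it concurrently can pick up the same batches.

```php
use Illuminate\Support\Facades\DB;

// Illustrative sketch only, not the real controller code.
$batches = DB::transaction(function () {
    $batches = DB::table('qs_batches')
        ->where('done', 0)   // assumed "not yet processed" flag
        ->orderBy('id')
        ->limit(10)
        ->get();             // no FOR UPDATE: a parallel worker can read the same rows

    DB::table('qs_batches')
        ->whereIn('id', $batches->pluck('id'))
        ->update(['pending_since' => now()]);

    return $batches;
});
```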

Processes around failed batches

Currently, batches that have failed three times are removed from the queue and an error is raised. We should have an idea of what to do in such cases. Do we manually retry batches? Do we check the entities? Do we introduce a limit on the number of failures we tolerate per week / month?

When do we rebuild

We have a reusable mechanism in place for rebuilding all query service data for a given wiki. In which cases do we want to use this again? Is it the first thing we try on complaints, or is it our last resort? Where do we document this?

Fring removed Fring as the assignee of this task. Dec 20 2023, 1:27 PM
Fring moved this task from Doing to In Review on the Wikibase Cloud (Kanban board Q4 2023) board.
Database locking does not work as intended

Looking through the code at https://github.com/wbstack/api/blob/main/app/Http/Controllers/Backend/QsController.php

It wasn't initially super clear to me, but I think perhaps it's as simple as transactions not acquiring locks by default.

I believe that if we want this we should use pessimistic locks. We should probably "lock for update"; see the Laravel docs on pessimistic locking. I'm no expert, but I would imagine it's better to use lockForUpdate rather than the broader sharedLock, since any other calls to this function will also attempt to get the lock, fail, and then roll back the transaction. In this case we probably also want to tweak the transaction to allow it to retry in the event of a deadlock, so that we minimise the number of errors that actually bubble up to the updater service. I'd guess 5 is a perfectly acceptable number here? According to the docs we just pass a second param to DB::transaction for this.
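A rough sketch of that suggestion (again with assumed column names, not the real controller code): the batch selection takes a row lock via lockForUpdate, and the second argument to DB::transaction lets it retry the whole closure on deadlock.

```php
use Illuminate\Support\Facades\DB;

// Hedged sketch of the proposed change, not the actual QsController implementation.
$batches = DB::transaction(function () {
    $batches = DB::table('qs_batches')
        ->where('done', 0)
        ->orderBy('id')
        ->limit(10)
        ->lockForUpdate()    // SELECT ... FOR UPDATE: a concurrent worker blocks here
        ->get();

    DB::table('qs_batches')
        ->whereIn('id', $batches->pluck('id'))
        ->update(['pending_since' => now()]);

    return $batches;
}, 5);                       // retry the transaction up to 5 times on deadlock
```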

Processes around failed batches

I think at this stage we probably "ought" to manually retry / investigate, but only after trying a rebuild (see below). I like the idea of tolerating failure, but in reality this probably results in a very bad user experience. Perhaps this is something more product-y to talk about with Charlie.

When do we rebuild

I think this is probably the cheapest solution in terms of engineering cost, and it also allows us to realistically take some action on behalf of the user in a timely manner. A proper investigation for each failed batch is quite expensive in human time, and I worry that we won't be able to do it quickly enough to help out the upset user. I'm not sure of the absolute best place to put this, but as a first guess I would propose sticking it somewhere in the wbaas-deploy docs.

Other Issues

However, these little thoughts should be noted alongside the discussions we've had during the rebuild, where we noticed that data may be missing when updates run while the platform is down.

Tarrow subscribed.
Tarrow claimed this task.