
Elasticsearch indices went read-only causing huge lag
Closed, Resolved (Public)

Description

Reported at https://www.wikidata.org/wiki/Wikidata:Project_chat#Severe_problems_editing_Wikidata

https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?orgId=1&panelId=5&fullscreen&from=now-2d&to=now suggests that the jobs have not been running since yesterday, or at least not as fast as usual.

<•dcausse> hu wikidatawiki index is readonly...
10:53 AM update: /wikidatawiki_content_1537536135/page/60100428 caused blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];

Affected wikis: P8289
Time: 2019-03-27 from 07h40 to 11h20 UTC

Event Timeline

Addshore triaged this task as Unbreak Now! priority. Mar 27 2019, 10:58 AM

Could be related to the Wikibase(Lexeme)CirrusSearch extraction (T190022) or the ElasticSearch upgrade (T194199)?

From IRC:

<•dcausse> gehel: seems like there's a new settings in elastic
10:58 AM read_only_allow_delete is set to true when disk space goes low
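To see which indices carry the block, one option is to inspect the response of `GET /_all/_settings`. The sketch below is only illustrative: it assumes the usual nested response shape (index name, then `settings`, `index`, `blocks`), the helper name `blocked_indices` is hypothetical, and the sample data is made up apart from the index name quoted in the description.

```python
def blocked_indices(settings_response):
    """Return the sorted names of indices whose read_only_allow_delete block is set.

    `settings_response` is assumed to be the parsed JSON body of
    GET /_all/_settings, i.e. {index_name: {"settings": {"index": {...}}}}.
    """
    blocked = []
    for name, body in settings_response.items():
        blocks = body.get("settings", {}).get("index", {}).get("blocks", {})
        # Elasticsearch returns setting values as strings ("true"/"false").
        if str(blocks.get("read_only_allow_delete", "false")).lower() == "true":
            blocked.append(name)
    return sorted(blocked)


# Illustrative data only; "examplewiki_content" is a hypothetical index name.
sample = {
    "wikidatawiki_content_1537536135": {
        "settings": {"index": {"blocks": {"read_only_allow_delete": "true"}}}
    },
    "examplewiki_content": {"settings": {"index": {"number_of_shards": "1"}}},
}

if __name__ == "__main__":
    print(blocked_indices(sample))
```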

It is being worked on :)

Mentioned in SAL (#wikimedia-operations) [2019-03-27T11:06:13Z] <dcausse> elasticsearch search cluster: setting "index.blocks.read_only_allow_delete" to null on all indices in omega/psi/chi@omega (T219364)

Mentioned in SAL (#wikimedia-operations) [2019-03-27T11:10:21Z] <dcausse> elasticsearch search cluster: setting cluster.routing.allocation.disk.watermark.flood_stage to 100% on omega/psi/chi@eqiad (T219364)
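For reference, the two changes logged above map onto the Elasticsearch index settings and cluster settings APIs. This is a minimal sketch, not the commands that were actually run: `ES_URL` is a placeholder, the helper names are hypothetical, and the use of a `transient` (rather than `persistent`) cluster setting is an assumption, since the log entries do not say which scope was used.

```python
import json
import urllib.request

# Placeholder endpoint; the real cluster addresses are not given in this task.
ES_URL = "http://localhost:9200"


def clear_read_only_block():
    """Build the request that resets the read-only/allow-delete block on all indices."""
    # Setting the value to null (None in Python) resets it to its default.
    return "PUT", "/_all/_settings", {"index.blocks.read_only_allow_delete": None}


def raise_flood_stage(value="100%"):
    """Build the request that changes the flood-stage disk watermark."""
    # Assumption: transient scope; the SAL entry does not specify it.
    return "PUT", "/_cluster/settings", {
        "transient": {
            "cluster.routing.allocation.disk.watermark.flood_stage": value
        }
    }


def to_request(method, path, body, base_url=ES_URL):
    """Turn a (method, path, body) triple into an unsent urllib Request."""
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        base_url + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    for spec in (clear_read_only_block(), raise_flood_stage()):
        req = to_request(*spec)
        print(req.get_method(), req.full_url, req.data.decode("utf-8"))
        # urllib.request.urlopen(req)  # uncomment to actually send the request
```

The requests are only built and printed here; sending them is left commented out so the sketch can be read without touching a cluster.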

dcausse lowered the priority of this task from Unbreak Now! to High. Mar 27 2019, 1:35 PM

The backlog of updates is being processed. Once we have caught up on these updates, we will run a maintenance script to reindex the lost updates.
Lowering to High as the immediate actions have been taken. It may now take a few days to fully sync the index and the database for the affected wikis.

dcausse renamed this task from "Wikidata search lagging behind" to "Elasticsearch indices went read-only causing huge lag". Mar 27 2019, 2:01 PM
dcausse updated the task description.

The backlog of updates is now completely absorbed, and a script has been run to catch up on the lost updates. There is nothing more we can do at this point except wait for the maintenance script to finish, so moving this to done.