
db-eqiad and db-codfw sectionsByLoad can get out of sync
Closed, Resolved (Public)

Description

During the June 2021 DC switchover, tr.wikivoyage.org was briefly unavailable because the sectionsByLoad entry for codfw contained a typo: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/702136/

Given that wikis are expected to be in the same section in both datacenters (DBAs, please correct me if this is wrong), the two lists should never fall out of sync.

Possible ways to resolve this bug (non-exhaustive):

  • Implement a test to verify the two lists are the same
  • Move the lists to a db-common.php file that is used by both DCs
  • Automate generation of the lists from the dblists.
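The first option could look roughly like the sketch below. It assumes each datacenter's mapping is available as a plain wiki => section array; the function name findSectionMismatches and the sample data are hypothetical, and the typo shown is only an illustration, not the actual one from the incident.

```php
<?php
// Hypothetical helper: report every wiki whose section differs between the
// two datacenter maps, including wikis present in only one of them.
function findSectionMismatches(array $eqiad, array $codfw): array {
    $mismatches = [];
    $allWikis = array_unique(array_merge(array_keys($eqiad), array_keys($codfw)));
    sort($allWikis);
    foreach ($allWikis as $wiki) {
        $inEqiad = $eqiad[$wiki] ?? null;
        $inCodfw = $codfw[$wiki] ?? null;
        if ($inEqiad !== $inCodfw) {
            $mismatches[$wiki] = ['eqiad' => $inEqiad, 'codfw' => $inCodfw];
        }
    }
    return $mismatches;
}

// Illustrative data only: the codfw key is deliberately misspelled.
$eqiad = ['enwiki' => 's1', 'trwikivoyage' => 's5'];
$codfw = ['enwiki' => 's1', 'trwikivoyge' => 's5'];

$mismatches = findSectionMismatches($eqiad, $codfw);
foreach ($mismatches as $wiki => $sections) {
    echo $wiki . ': eqiad=' . var_export($sections['eqiad'], true)
        . ' codfw=' . var_export($sections['codfw'], true) . "\n";
}
```

A test along these lines would simply assert that the returned array is empty, so a misspelled key fails CI instead of surfacing during a switchover.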

Marking as high priority because this caused a small outage.

https://wikitech.wikimedia.org/wiki/Incident_documentation/2021-06-29_trwikivoyage_primary_db


Original task description by @Urbanecm

It is necessary for all wikis to be in the dblist for the shard they live at, as well as the shard being defined by db-*.php files. Otherwise, various parts of the configuration might be confused.
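One way to guarantee that, sketched here under assumptions, is to derive the wiki => section map from the dblist contents rather than maintaining it by hand. The sketch assumes a dblist is plain text with one wiki name per line and optional # comments; the function name buildSectionsByWiki and the sample lists are hypothetical.

```php
<?php
// Hypothetical generator: build a wiki => section map from per-section
// dblist contents, passed in as section => file contents.
function buildSectionsByWiki(array $dblistContents): array {
    $map = [];
    foreach ($dblistContents as $section => $content) {
        foreach (preg_split('/\R/', $content) as $line) {
            // Drop trailing comments and surrounding whitespace.
            $wiki = trim(preg_replace('/#.*$/', '', $line));
            if ($wiki === '') {
                continue;
            }
            if (isset($map[$wiki])) {
                throw new RuntimeException(
                    "$wiki is listed in both {$map[$wiki]} and $section"
                );
            }
            $map[$wiki] = $section;
        }
    }
    ksort($map);
    return $map;
}

// Illustrative contents only, not the real dblists.
$dblists = [
    's1' => "# largest wiki\nenwiki\n",
    's5' => "dewiki\ntrwikivoyage\n",
];
$sections = buildSectionsByWiki($dblists);
```

Rejecting a wiki that appears in two lists is a design choice in this sketch: it turns an ambiguous configuration into a hard failure at generation time.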

Event Timeline

Legoktm renamed this task from Ensure dblist shard files match db-*.php definitions to db-eqiad and db-codfw sectionsByLoad can get out of sync. Jun 29 2021, 8:08 PM
Legoktm updated the task description.

They are expected to be in the same section in both DCs, that's correct.

Legoktm subscribed.

So... the only meaningful difference between db-eqiad and db-codfw is the list of parsercache hosts:

--- db-codfw.php 2021-06-29 18:36:31.116279952 -0700
+++ db-eqiad.php 2021-06-29 18:36:31.116279952 -0700
@@ -8,14 +8,14 @@
 # $wgReadOnly = "Wikimedia Sites are currently read-only during maintenance, please try again soon.";

 $wmgParserCacheDBs = [
- 'pc1' => '10.192.0.104', # pc2007, A1 4.4TB 256GB # pc1
- 'pc2' => '10.192.16.35', # pc2008, B3 4.4TB 256GB # pc2
- 'pc3' => '10.192.32.10', # pc2009, C1 4.4TB 256GB # pc3
- # 'spare' => '10.192.48.14', # pc2010, D3 4.4TB 256GB # spare host. Use it to replace any of the above if needed
+ 'pc1' => '10.64.0.180', # pc1007, A6 4.4TB 256GB # pc1
+ 'pc2' => '10.64.16.20', # pc1008, B8 4.4TB 256GB # pc2
+ 'pc3' => '10.64.32.29', # pc1009, C3 4.4TB 256GB # pc3
+ # 'spare' => '10.64.48.174', # pc1010, D3 4.4TB 256GB # spare host. Use it to replace any of the above if needed
 ];

 # LOOKING FOR $wmgOldExtTemplate ? It no longer lives in the PHP configs.
-# Instead try https://noc.wikimedia.org/dbconfig/codfw.json (see 'es1')
+# Instead try https://noc.wikimedia.org/dbconfig/eqiad.json (see 'es1')
 # For more info see also https://wikitech.wikimedia.org/wiki/dbctl

 $wgLBFactoryConf = [
@@ -134,8 +134,8 @@
 # function (the master is not included, as by definition has lag 0).
 #
 # LOOKING FOR THE LOAD LISTS? They no longer live in the PHP configs.
-# Instead try https://noc.wikimedia.org/db.php?dc=codfw and
-# https://noc.wikimedia.org/dbconfig/codfw.json
+# Instead try https://noc.wikimedia.org/db.php?dc=eqiad and
+# https://noc.wikimedia.org/dbconfig/eqiad.json
 # For more info see also https://wikitech.wikimedia.org/wiki/dbctl

 'serverTemplate' => [
@@ -149,7 +149,7 @@
 'lagDetectionMethod' => 'pt-heartbeat',
 'variables' => [
 'innodb_lock_wait_timeout' => 15
- ]
+ ],
 ],

 'templateOverridesBySection' => [
@@ -196,18 +196,18 @@
 ],

 # LOOKING FOR GROUP LOADS? They no longer live in the PHP configs.
-# Instead try https://noc.wikimedia.org/dbconfig/codfw.json
+# Instead try https://noc.wikimedia.org/dbconfig/eqiad.json
 # For more info see also https://wikitech.wikimedia.org/wiki/dbctl

 'groupLoadsByDB' => [],

 # LOOKING FOR HOSTS BY NAME? They no longer live in the PHP configs.
-# Instead try https://noc.wikimedia.org/dbconfig/codfw.json
+# Instead try https://noc.wikimedia.org/dbconfig/eqiad.json
 # For more info see also https://wikitech.wikimedia.org/wiki/dbctl
 'hostsByName' => [],

 # LOOKING FOR EXTERNAL LOADS? They no longer live in the PHP configs.
-# Instead try https://noc.wikimedia.org/dbconfig/codfw.json (see es1/es2/es3/x1)
+# Instead try https://noc.wikimedia.org/dbconfig/eqiad.json (see es1/es2/es3/x1)
 # For more info see also https://wikitech.wikimedia.org/wiki/dbctl
 'externalLoads' => [],

@@ -248,11 +248,11 @@
 # infrastructure if possible (IRC, other webpages) or infrastructure not prepared to absorve
 # large traffic (phabricator) because they tend to collapse. A meta page would be appropiate.
 #
-# Also keep these read only messages if codfw is not the active dc, to prevent accidental writes
+# Also keep these read only messages if eqiad is not the active dc, to prevent accidental writes
 # getting trasmmitted from codfw to eqiad when the master dc is eqiad.
 'readOnlyBySection' => [
 # LOOKING FOR READONLY SECTIONS? They no longer live in the PHP configs.
-# Instead try https://noc.wikimedia.org/dbconfig/codfw.json
+# Instead try https://noc.wikimedia.org/dbconfig/eqiad.json
 # For more info see also https://wikitech.wikimedia.org/wiki/dbctl
 ],

My proposal is to move it to a db-production file, and then merge $wmgParserCacheDBs so it's keyed by datacenter. Then CommonSettings can just use the primary DC's entry. Will submit a patch shortly.
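A minimal sketch of what a datacenter-keyed parser cache list might look like, using the host addresses from the diff above. The variable name $wmgParserCacheDBsByDC, the helper parserCacheHostsFor, and the hard-coded $primaryDC are assumptions for illustration; the actual patch may be structured differently.

```php
<?php
// Sketch: a single shared file holds both DCs' parser cache hosts,
// keyed by datacenter name (variable name is hypothetical).
$wmgParserCacheDBsByDC = [
    'eqiad' => [
        'pc1' => '10.64.0.180',
        'pc2' => '10.64.16.20',
        'pc3' => '10.64.32.29',
    ],
    'codfw' => [
        'pc1' => '10.192.0.104',
        'pc2' => '10.192.16.35',
        'pc3' => '10.192.32.10',
    ],
];

// Hypothetical helper: pick one DC's hosts and fail loudly on an unknown DC.
function parserCacheHostsFor(array $byDC, string $dc): array {
    if (!isset($byDC[$dc])) {
        throw new InvalidArgumentException("Unknown datacenter: $dc");
    }
    return $byDC[$dc];
}

// Assumption: stands in for however production determines the primary DC.
$primaryDC = 'codfw';
$wmgParserCacheDBs = parserCacheHostsFor($wmgParserCacheDBsByDC, $primaryDC);
```

With this shape, everything outside the parser cache list is shared by construction, so the section mapping cannot drift between datacenters.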

My proposal is to move it to a db-production file. […]

+1

[…] merge $wmgParserCacheDBs so it's keyed by datacenter. And then in CommonSettings just use the primary DC.

Another place for this might be ProductionServices/LabsServices.php and then assign the information from there in CommonSettings.php as we do for other backend services already, including db-related services (such as m2 for xhgui, and likely x2 soon for mainstash-db).

I've re-added the original task description. This is mainly to keep Urbanecm's time-travel device a secret, and alleviate any concerns about them having reported today's incident some eleven months before it occurred in our timeline.

Change 702421 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/mediawiki-config@master] Merge db-codfw.php and db-eqiad.php into db-production.php

https://gerrit.wikimedia.org/r/702421

Is there any possible situation where we would want to apply different configuration to a section in one DC temporarily, especially thinking forward to Multi-DC? If so, maybe having a common default configuration (db-common.php) with a possibility of per-DC, per-section override (db-<DC>.php, empty under normal conditions) would be the safest approach.

Yes, there is. It is being discussed on the patch.
Essentially, these are situations like the one we'll have with wikitech in a few days, when we migrate wikis from one section to another. They do not happen often, but they might in the future.
In the patch, Lego has provided a way to override the configuration if needed, which I think is a good approach for these kinds of situations.

Legoktm triaged this task as High priority. Jul 7 2021, 9:59 PM

I rebased the patch; it should be ready to go come January, but it will need someone else to shepherd the deployment.

I can try, but note that this file is mentioned in lots of places, including the new wiki creation tool. NBD, but keep it in mind.

Is there any possible situation where we would want to apply different configuration to a section in one DC temporarily, especially thinking forward to Multi-DC? If so, maybe having a common default configuration (db-common.php) with a possibility of per-DC, per-section override (db-<DC>.php, empty under normal conditions) would be the safest approach.

You can easily add an if condition in the PHP code if it's really needed. CommonSettings.php is full of them (they use $GLOBALS['wmfDatacenter']). It's not great, but it would work.
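As a rough illustration of such a conditional override, the sketch below layers an optional per-DC map on top of a shared default. The function name applyDatacenterOverrides, the wiki-to-section assignments, and the override data are all made up; in production the datacenter name would presumably come from $GLOBALS['wmfDatacenter'] as mentioned above.

```php
<?php
// Hypothetical helper: start from the shared mapping and apply any
// temporary overrides defined for the given datacenter.
function applyDatacenterOverrides(array $common, array $overridesByDC, string $dc): array {
    // With string keys, array_merge lets later values replace earlier ones.
    return array_merge($common, $overridesByDC[$dc] ?? []);
}

// Illustrative data only: section names do not reflect the real layout.
$commonSections = ['enwiki' => 's1', 'labswiki' => 's6'];

// Empty under normal conditions; populated only during a migration.
$overridesByDC = [
    'codfw' => ['labswiki' => 's8'],
];

$codfwSections = applyDatacenterOverrides($commonSections, $overridesByDC, 'codfw');
$eqiadSections = applyDatacenterOverrides($commonSections, $overridesByDC, 'eqiad');
```

Keeping the override map empty by default means any divergence between datacenters is explicit and easy to spot in review.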

Change 702421 merged by jenkins-bot:

[operations/mediawiki-config@master] Merge db-codfw.php and db-eqiad.php into db-production.php

https://gerrit.wikimedia.org/r/702421

Mentioned in SAL (#wikimedia-operations) [2022-01-12T14:14:17Z] <ladsgroup@deploy1002> Synchronized wmf-config/db-production.php: Config: [[gerrit:702421|Merge db-codfw.php and db-eqiad.php into db-production.php (T260297)]], Part I (duration: 01m 07s)

Mentioned in SAL (#wikimedia-operations) [2022-01-12T14:15:36Z] <ladsgroup@deploy1002> Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:702421|Merge db-codfw.php and db-eqiad.php into db-production.php (T260297)]], Part II (duration: 01m 08s)

Mentioned in SAL (#wikimedia-operations) [2022-01-12T14:17:11Z] <ladsgroup@deploy1002> Synchronized wmf-config: Config: [[gerrit:702421|Merge db-codfw.php and db-eqiad.php into db-production.php (T260297)]], Part III (duration: 01m 07s)

Change 753464 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/mediawiki-config@master] docroot: Clean up db.php after db-$dc.php

https://gerrit.wikimedia.org/r/753464

Change 753464 abandoned by Ladsgroup:

[operations/mediawiki-config@master] docroot: Clean up db.php after db-$dc.php

Reason:

https://gerrit.wikimedia.org/r/753464

Did a scap pull and restarted Apache on mwmaint, which fixed the issue.

Change 753465 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[analytics/wmde/scripts@master] Move away from db-eqiad.php

https://gerrit.wikimedia.org/r/753465

Change 753468 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[analytics/refinery@master] Fix outdated documentation

https://gerrit.wikimedia.org/r/753468

Change 753470 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[analytics/reportupdater@master] Fix outdated documentation

https://gerrit.wikimedia.org/r/753470

Change 753465 merged by jenkins-bot:

[analytics/wmde/scripts@master] Move away from db-eqiad.php

https://gerrit.wikimedia.org/r/753465

Change 753489 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[analytics/wmde/scripts@production] Move away from db-eqiad.php

https://gerrit.wikimedia.org/r/753489

Change 753489 merged by jenkins-bot:

[analytics/wmde/scripts@production] Move away from db-eqiad.php

https://gerrit.wikimedia.org/r/753489

Change 753468 merged by Joal:

[analytics/refinery@master] Fix outdated documentation

https://gerrit.wikimedia.org/r/753468

Legoktm claimed this task.

Thanks @Ladsgroup for deploying :) The remaining patch is just a comment-only change so I'm closing this as resolved.

Change 753470 merged by Mforns:

[analytics/reportupdater@master] Fix outdated documentation

https://gerrit.wikimedia.org/r/753470