
Enable es4 and es5 as writable new external store sections and set es2 and es3 as read only
Closed, Resolved, Public

Description

We are in the process of setting up the new es4 and es5 sections, so that es2 and es3 can be made read-only due to space constraints.
We have already set up the servers (T243052), created the wikis and tables (T245720) and added them to etcd (T245806).
Steps pending are: enable them as writable in MW and once we are happy with that step, set es3 to read-only in MW and also on a mysql level (and disconnect replication between them).

es4 is cluster26, and its master is es1020
es5 is cluster27, and its master is es1023

root@cumin1001:/home/marostegui# mysql.py -hes1020 enwiki -e "show tables"
+------------------+
| Tables_in_enwiki |
+------------------+
| blobs_cluster26  |
+------------------+
root@cumin1001:/home/marostegui# mysql.py -hes1023 enwiki -e "show tables"
+------------------+
| Tables_in_enwiki |
+------------------+
| blobs_cluster27  |
+------------------+

The idea would be to set es4 as writable and once we are ok with it, then follow with es5 to minimize risks.
We are not fully sure what needs to be done apart from the following patch (for es4): https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/574696/

Can you also let us know if there is a way to actually test this before deploying it everywhere?

Event Timeline

Marostegui triaged this task as Medium priority. Feb 25 2020, 8:50 AM
Marostegui added a project: Goal.
Marostegui moved this task from Triage to In progress on the DBA board.

Change 574696 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Enable es4 as new external store

https://gerrit.wikimedia.org/r/574696

jcrespo added a subscriber: jcrespo. Edited Feb 25 2020, 4:35 PM

Manuel and I discussed this. Unless someone can figure out an ingenious way to test it, writes cannot be deployed partially (e.g. only on mwdebug): it is an all-or-nothing situation for subsequent reads, which will follow the references stored on the metadata servers. The only alternative is the cluster-by-cluster on/off scheme proposed by Manuel in the summary. But please feel free to propose amendments for a better deployment and testing strategy :-D.

Maybe some smoke tests using mw tooling?

I'm not terribly familiar with this process either, but as I understand it the process might go something like this:

  1. Add cluster26 and cluster27 to templateOverridesByCluster, but not to $wgDefaultExternalStore yet.
  2. Use shell.php to verify that ExternalStore can connect to it, probably something like seeing that ExternalStore::fetchFromURL( 'DB://cluster26/1' ) returns false instead of throwing an error.
  3. You might do further tests by setting $wgDefaultExternalStore to the new clusters on one of the mwdebug servers and using X-Wikimedia-Debug to do some edits, and/or by setting $wgDefaultExternalStore with the new clusters only for certain wikis.
  4. Add the new clusters to $wgDefaultExternalStore everywhere.
  5. (maybe at the same time as step 4) Remove the old clusters from $wgDefaultExternalStore everywhere. Verify that this stopped MediaWiki from directing writes to them.
  6. Set templateOverridesByCluster to specify 'is static' => true (like the other retired clusters) and reconfigure the former cluster24 and cluster25 masters to read-only and disable replication. I'd guess based on T135690 that the 'is static' bit should be done first.
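
The check in step 2 comes down to telling "the cluster answered but has no such blob" apart from "the cluster could not be reached". Below is a minimal Python sketch of that distinction. It does not call MediaWiki: `ConnectionFailure`, `classify_fetch` and `fake_fetch` are hypothetical names invented for illustration, and `None` stands in for the `false` that the PHP call returns.

```python
from typing import Callable, Optional


class ConnectionFailure(Exception):
    """Hypothetical stand-in for a DB connection error raised by the fetch call."""


def classify_fetch(fetch: Callable[[str], Optional[bytes]], url: str) -> str:
    """Classify one read probe against an external store URL.

    'unreachable'      -> the connection itself failed
    'reachable-empty'  -> the cluster answered, but no blob has that id
    'reachable-data'   -> the cluster answered with a blob
    """
    try:
        blob = fetch(url)
    except ConnectionFailure:
        return "unreachable"
    return "reachable-empty" if blob is None else "reachable-data"


def fake_fetch(url: str) -> Optional[bytes]:
    # Hypothetical backend: cluster26 is not configured yet, cluster24 is.
    if url.startswith("DB://cluster26/"):
        raise ConnectionFailure("No working replica DB server")
    if url == "DB://cluster24/1":
        return b"blob"
    return None


for probe in ("DB://cluster26/1", "DB://cluster24/9999999999", "DB://cluster24/1"):
    print(probe, "->", classify_fetch(fake_fetch, probe))
```

After step 1 is deployed, the probe against the new cluster should move from the "unreachable" case to one of the "reachable" cases.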

BTW, it looks like last time this sort of thing was done was back in 2012. rOMWCac7f0b8e84e4: adding new es clusters (but not moving writes) did step 1 and rOMWC497ee914dd6b: moving ES writes from es1 (cluster23) to es2 (cluster24) and es3 (cluster25) did steps 4 and 5. The 'is static' stuff didn't exist back then.

Can a cluster be set as read only on one mw server and read-write on others? Would that work well, as far as you know (routing reads)?

Note the process has changed a bit by using dynamic configuration, so it is not exactly the same procedure as it used to be.

We discussed this, Manuel and I, and unless someone can figure out an ingenious way to test it, writes cannot be deployed partially (e.g. only on mwdebug), as it is an all-or-nothing situation for subsequent reads, which will be references from the metadata servers

Fortunately we can enable (step 1) and sort-of test (step 2) reads before doing any writes. And the only reads that will be directed to the new clusters are for things that were previously written to them, so there's little risk of breaking something user-visible if we only do test writes on pages users won't be looking at (step 3).

Thanks, that solves my fears.

Can a cluster be set as read only on one mw server and read-write on others? Would that work well, as far as you know (routing reads)?

Whether writes are directed to an ExternalStore cluster should depend on $wgDefaultExternalStore. If you remove the cluster from that on a wiki, then it shouldn't be getting any more writes from that wiki.

Once data is written, the location includes the cluster. All reads will go to the recorded cluster. So as long as that recorded cluster is accessible from everywhere, reads should work fine. That's what my step 2 was intended to test, and we could test it again during step 3.
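
To make the routing rule concrete, here is a toy Python model of it. This is not MediaWiki code: `ToyExternalStore` and its methods are hypothetical, and picking a write cluster at random is an assumption of the sketch. It only illustrates that writes follow the default list while reads follow the cluster recorded in the URL.

```python
import random
from typing import Dict, List


class ToyExternalStore:
    """Toy model: writes go to a default cluster, reads follow the stored URL."""

    def __init__(self, clusters: List[str], default_for_writes: List[str]):
        self.blobs: Dict[str, Dict[int, bytes]] = {c: {} for c in clusters}
        self.default_for_writes = list(default_for_writes)

    def insert(self, data: bytes) -> str:
        # Pick one of the writable clusters (random choice is an assumption).
        cluster = random.choice(self.default_for_writes)
        blob_id = len(self.blobs[cluster]) + 1
        self.blobs[cluster][blob_id] = data
        return f"DB://{cluster}/{blob_id}"

    def fetch(self, url: str) -> bytes:
        # "DB://cluster26/1" -> ("cluster26", "1")
        cluster, blob_id = url[len("DB://"):].split("/")
        return self.blobs[cluster][int(blob_id)]


store = ToyExternalStore(
    clusters=["cluster24", "cluster25", "cluster26"],
    default_for_writes=["cluster24", "cluster25"],
)
old_url = store.insert(b"old revision")

# Move writes to the new cluster; the old clusters stay readable.
store.default_for_writes = ["cluster26"]
new_url = store.insert(b"new revision")

print(old_url, store.fetch(old_url))
print(new_url, store.fetch(new_url))
```

In this model, removing a cluster from the default list stops new writes to it without affecting reads of blobs already stored there, which is the property steps 5 and 6 rely on.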

Thanks @Anomie for the detailed steps. That's super helpful.
I think we can try that with es4 and one of the mwdebug hosts indeed. To isolate steps.
For testing I can only use my talk page or something similar.
I'm not familiar with shell.php so I'll have a look at how that works. If you have an example handy, that will also help.

I will amend my existing patch to reflect what would be pushed to complete step 1.

Thank you again!

I'm not familiar with shell.php so I'll have a look at how that works. If you have an example handy, that will also help.

It's a PHP REPL with MediaWiki set up, useful for poking at things from a PHP perspective. There's documentation at https://www.mediawiki.org/wiki/Manual:Shell.php. A sample usage could look like

anomie@mwmaint1002:~$ mwscript shell.php --wiki=enwiki
Psy Shell v0.9.12 (PHP 7.2.26-1+0~20191218.33+debian9~1.gbpb5a340+wmf1 — cli) by Justin Hileman
>>> ExternalStore::fetchFromURL( 'DB://cluster26/1' )
Wikimedia/Rdbms/DBConnectionError with message 'Cannot access the database: No working replica DB server: Unknown error'
>>> ExternalStore::fetchFromURL( 'DB://cluster24/9999999999' )
=> false
>>>

If you have more questions, feel free to ask.

Thank you very much - you even wrote the test case for us to pretty much copy and paste!
So useful! thanks again :-)

Krinkle removed a subscriber: Krinkle. Feb 25 2020, 10:39 PM

Change 575016 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Add es4 to default ES cluster

https://gerrit.wikimedia.org/r/575016

We are aiming to fully deploy es4 Tuesday 3rd March, starting at 09:00 AM UTC.
The idea would be to follow what @Anomie kindly described at T246072#5916506

root@mwmaint1002:~# mwscript shell.php --wiki=enwiki
Psy Shell v0.9.12 (PHP 7.2.26-1+0~20191218.33+debian9~1.gbpb5a340+wmf1 — cli) by Justin Hileman
>>> ExternalStore::fetchFromURL( 'DB://cluster26/1' )
Wikimedia/Rdbms/DBConnectionError with message 'Cannot access the database: No working replica DB server: Unknown error'
>>> ExternalStore::fetchFromURL( 'DB://cluster25/1' )
=> b"""
   [binary blob contents omitted]

  • Change mwdebug1001:/srv/mediawiki/wmf-config/db-eqiad.php and db-codfw.php

And add:

$wgDefaultExternalStore = [
        'DB://cluster24',
        'DB://cluster25',
+      'DB://cluster26',
];

Mentioned in SAL (#wikimedia-operations) [2020-03-02T14:39:15Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Give weight to es4 and es5 unused codfw slaves T246072', diff saved to https://phabricator.wikimedia.org/P10578 and previous config saved to /var/cache/conftool/dbconfig/20200302-143915-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-03-02T14:40:33Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Give weight to es4 and es5 unused eqiad slaves T246072', diff saved to https://phabricator.wikimedia.org/P10579 and previous config saved to /var/cache/conftool/dbconfig/20200302-144033-marostegui.json

Change 574696 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Add es4 as new ES, for initial testing

https://gerrit.wikimedia.org/r/574696

Mentioned in SAL (#wikimedia-operations) [2020-03-03T09:07:35Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Add es4 to the available es sections, not in use yet - T246072 (duration: 00m 57s)

Mentioned in SAL (#wikimedia-operations) [2020-03-03T09:09:59Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Add es4 to the available es sections, not in use yet - T246072 (duration: 00m 57s)

Change 575016 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Add es4 to default ES cluster

https://gerrit.wikimedia.org/r/575016

Mentioned in SAL (#wikimedia-operations) [2020-03-03T09:32:11Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Enable es4 as new writable external store section - T246072 (duration: 00m 57s)

Mentioned in SAL (#wikimedia-operations) [2020-03-03T09:33:22Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Enable es4 as new writable external store section - T246072 (duration: 00m 56s)

Mentioned in SAL (#wikimedia-operations) [2020-03-03T09:33:47Z] <marostegui@deploy1001> sync-file aborted: Enable es4 as new writable external store section - T246072 (duration: 00m 02s)

Mentioned in SAL (#wikimedia-operations) [2020-03-03T09:36:10Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Enable es4 as new writable external store section - T246072 (duration: 00m 56s)

es4 has been deployed successfully; it is now being written to.
We are going to give it 24h and tomorrow at 09:00 AM UTC we will set es2 to read-only.

Change 576286 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Set es2 on read only

https://gerrit.wikimedia.org/r/576286

Marostegui renamed this task from Enable es4 and es5 as writable new external store sections to Enable es4 and es5 as writable new external store sections and set es2 and es3 as read only. Wed, Mar 4, 6:47 AM
Marostegui updated the task description.

Change 576549 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] es2 hosts: Change them to standalone

https://gerrit.wikimedia.org/r/576549

Change 576286 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Set es2 on read only

https://gerrit.wikimedia.org/r/576286

Mentioned in SAL (#wikimedia-operations) [2020-03-04T09:09:58Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Set es2 as RO - T246072 (duration: 01m 14s)

Mentioned in SAL (#wikimedia-operations) [2020-03-04T09:21:37Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Set es2 as RO - T246072 (duration: 01m 04s)

Mentioned in SAL (#wikimedia-operations) [2020-03-04T09:43:56Z] <marostegui> Set es1015 (es2 master) on read_only - T246072

Change 576549 merged by Marostegui:
[operations/puppet@production] es2 hosts: Change them to standalone

https://gerrit.wikimedia.org/r/576549

es2 after read_only is set on es1015 and pt-heartbeat stopped:

root@cumin1001:/home/marostegui# ./section es2 | while read host port; do echo "$host:$port"; mysql.py -h$host:$port -e "show master status\G show slave status\G"; done
es2016.codfw.wmnet:3306
*************************** 1. row ***************************
            File: es2016-bin.004034
        Position: 406645764
    Binlog_Do_DB:
Binlog_Ignore_DB:
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: es1015.eqiad.wmnet
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: es1015-bin.004402
          Read_Master_Log_Pos: 750373395
               Relay_Log_File: es2016-relay-bin.000769
                Relay_Log_Pos: 750373684
        Relay_Master_Log_File: es1015-bin.004402
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 750373395
              Relay_Log_Space: 750374027
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: Yes
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 171974840
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 0-171966470-411334248,171974840-171974840-441937577,180367401-180367401-77345258,171966470-171966470-356813944
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
es2015.codfw.wmnet:3306
*************************** 1. row ***************************
            File: es2015-bin.003154
        Position: 795796435
    Binlog_Do_DB:
Binlog_Ignore_DB:
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: es2016.codfw.wmnet
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: es2016-bin.004034
          Read_Master_Log_Pos: 406645764
               Relay_Log_File: es2015-relay-bin.001286
                Relay_Log_Pos: 302754301
        Relay_Master_Log_File: es2016-bin.004034
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 406645764
              Relay_Log_Space: 406646966
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: Yes
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 180367401
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 0-171966470-411334248,171974840-171974840-441937577,180367401-180367401-120463553,171966470-171966470-356813944
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
es2014.codfw.wmnet:3306
*************************** 1. row ***************************
            File: es2014-bin.004028
        Position: 795815759
    Binlog_Do_DB:
Binlog_Ignore_DB:
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: es2016.codfw.wmnet
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: es2016-bin.004034
          Read_Master_Log_Pos: 406645764
               Relay_Log_File: es2014-relay-bin.001295
                Relay_Log_Pos: 302754301
        Relay_Master_Log_File: es2016-bin.004034
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 406645764
              Relay_Log_Space: 406646966
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: Yes
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 180367401
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 0-171966470-411334248,171974840-171974840-441937577,180367401-180367401-120463553,171966470-171966470-356813944
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
es1015.eqiad.wmnet:3306
*************************** 1. row ***************************
            File: es1015-bin.004402
        Position: 750373395
    Binlog_Do_DB:
Binlog_Ignore_DB:
es1013.eqiad.wmnet:3306
*************************** 1. row ***************************
            File: es1013-bin.004400
        Position: 327166547
    Binlog_Do_DB:
Binlog_Ignore_DB:
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: es1015.eqiad.wmnet
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: es1015-bin.004402
          Read_Master_Log_Pos: 750373395
               Relay_Log_File: es1013-relay-bin.000324
                Relay_Log_Pos: 224290831
        Relay_Master_Log_File: es1015-bin.004402
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 750373395
              Relay_Log_Space: 559515006
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: Yes
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 171974840
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 0-171966470-411334248,171974840-171974840-441937577,180367401-180367401-77345258,171966470-171966470-356813944
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
es1011.eqiad.wmnet:3306
*************************** 1. row ***************************
            File: es1011-bin.004416
        Position: 124678110
    Binlog_Do_DB:
Binlog_Ignore_DB:
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: es1015.eqiad.wmnet
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: es1015-bin.004402
          Read_Master_Log_Pos: 750373395
               Relay_Log_File: es1011-relay-bin.001102
                Relay_Log_Pos: 224290831
        Relay_Master_Log_File: es1015-bin.004402
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 750373395
              Relay_Log_Space: 559515006
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: Yes
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 171974840
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 0-171966470-411334248,171974840-171974840-441937577,180367401-180367401-77345258,171966470-171966470-356813944
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
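
Before resetting replication it is worth confirming that each replica has executed everything its master has logged, which is what the matching `Position` and `Exec_Master_Log_Pos` values above show. The following Python sketch automates that comparison on the vertical (`\G`) output. `parse_vertical` and `replica_caught_up` are hypothetical helper names, not part of any tool used in this task, and the sample values are copied from the es1015 and es1013 output above.

```python
from typing import Dict


def parse_vertical(output: str) -> Dict[str, str]:
    """Parse `key: value` lines from mysql's vertical (\\G) output."""
    fields: Dict[str, str] = {}
    for line in output.splitlines():
        if line.startswith("*") or ":" not in line:
            continue  # skip row separators and lines without a field
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def replica_caught_up(master_status: str, replica_status: str) -> bool:
    """True if the replica has executed everything the master has logged."""
    master = parse_vertical(master_status)
    replica = parse_vertical(replica_status)
    return (
        replica["Slave_IO_Running"] == "Yes"
        and replica["Slave_SQL_Running"] == "Yes"
        and replica["Relay_Master_Log_File"] == master["File"]
        and replica["Exec_Master_Log_Pos"] == master["Position"]
    )


# Values copied from the es1015 (master) and es1013 (replica) output above.
ES1015_MASTER = """\
*************************** 1. row ***************************
            File: es1015-bin.004402
        Position: 750373395
"""
ES1013_REPLICA = """\
*************************** 1. row ***************************
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
        Relay_Master_Log_File: es1015-bin.004402
          Exec_Master_Log_Pos: 750373395
"""

print(replica_caught_up(ES1015_MASTER, ES1013_REPLICA))
```

The comparison is only meaningful once the master is read-only and pt-heartbeat is stopped, since otherwise the master position keeps moving.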

Mentioned in SAL (#wikimedia-operations) [2020-03-04T09:55:15Z] <marostegui> Reset replication on es2 hosts - T246072

Mentioned in SAL (#wikimedia-operations) [2020-03-04T09:59:20Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Give some weight to es2 master es1015 and es2016, now standalone - T246072', diff saved to https://phabricator.wikimedia.org/P10609 and previous config saved to /var/cache/conftool/dbconfig/20200304-095919-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-03-04T10:05:52Z] <marostegui> es2 maintenance window over T246072

Marostegui updated the task description. Wed, Mar 4, 10:07 AM

Mentioned in SAL (#wikimedia-operations) [2020-03-04T10:10:47Z] <marostegui> Update shards table to set es2 display=0 - T246072

Change 576651 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] prometheus-mysqld-exporter: Add es2 to the list of standalone sections

https://gerrit.wikimedia.org/r/576651

Change 576655 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] prometheus-mysqld-exporter: Add es3 to the list of standalone sections

https://gerrit.wikimedia.org/r/576655

Mentioned in SAL (#wikimedia-operations) [2020-03-04T10:20:39Z] <marostegui> Remove es2 eqiad and codfw from zarcillo.masters table - T246072

Change 576651 merged by Jcrespo:
[operations/puppet@production] prometheus-mysqld-exporter: Add es2 to the list of standalone sections

https://gerrit.wikimedia.org/r/576651

es2 was set to read-only successfully.

We will be deploying es5 on Tuesday 10th at 09:00 AM UTC.
If that goes fine, we will set es3 to read-only on Wednesday 11th at 09:00 AM UTC.

Prometheus config after updating the script and the db:

- labels:
    role: standalone
    shard: es1
  targets:
  - es1012:9104
  - es1016:9104
  - es1018:9104
- labels:
    role: standalone
    shard: es2
  targets:
  - es1011:9104
  - es1013:9104
  - es1015:9104
- labels:
    role: master
    shard: es3
  targets:
  - es1017:9104
- labels:
    role: slave
    shard: es3
  targets:
  - es1014:9104
  - es1019:9104
- labels:
    role: master
    shard: es4
  targets:
  - es1020:9104
- labels:
    role: slave
    shard: es4
  targets:
  - es1021:9104
  - es1022:9104
- labels:
    role: slave
    shard: es5
  targets:
  - es1023:9104
  - es1024:9104
  - es1025:9104
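
As an illustration of how target groups like the ones above could be derived from per-host metadata plus a per-section standalone flag, here is a small Python sketch. The real exporter script is not shown in this task, so everything here is an assumption: the `Instance` tuple, the `target_groups` function, and the way roles are decided are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Set, Tuple


class Instance(NamedTuple):
    host: str
    section: str
    is_master: bool


def target_groups(
    instances: Iterable[Instance],
    standalone_sections: Set[str],
    port: int = 9104,
) -> List[dict]:
    """Group hosts by (section, role); standalone sections get a single role."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for inst in instances:
        if inst.section in standalone_sections:
            role = "standalone"  # no replication, so master/slave is meaningless
        elif inst.is_master:
            role = "master"
        else:
            role = "slave"
        groups[(inst.section, role)].append(f"{inst.host}:{port}")
    return [
        {"labels": {"role": role, "shard": section}, "targets": sorted(targets)}
        for (section, role), targets in sorted(groups.items())
    ]


# Host names taken from the config above; the is_master flags are assumed.
instances = [
    Instance("es1011", "es2", False),
    Instance("es1013", "es2", False),
    Instance("es1015", "es2", True),
    Instance("es1017", "es3", True),
    Instance("es1014", "es3", False),
    Instance("es1019", "es3", False),
]
for group in target_groups(instances, standalone_sections={"es2"}):
    print(group)
```

With es2 flagged as standalone, all three es2 hosts land in one group regardless of which one used to be the master, matching the es2 entry in the config above.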

Change 577185 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Add es5 as new ES, for initial testing

https://gerrit.wikimedia.org/r/577185

Change 577189 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Set es5 as writable

https://gerrit.wikimedia.org/r/577189

Mentioned in SAL (#wikimedia-operations) [2020-03-10T09:00:45Z] <marostegui> Start es5 deployment window T246072

Change 577185 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Add es5 as new ES, for initial testing

https://gerrit.wikimedia.org/r/577185

Mentioned in SAL (#wikimedia-operations) [2020-03-10T09:03:00Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Add es5 to the available es sections, not in use yet - T246072 (duration: 01m 01s)

Mentioned in SAL (#wikimedia-operations) [2020-03-10T09:04:08Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Add es5 to the available es sections, not in use yet - T246072 (duration: 00m 59s)

Change 577189 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Set es5 as writable

https://gerrit.wikimedia.org/r/577189

Mentioned in SAL (#wikimedia-operations) [2020-03-10T09:25:02Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Enable es5 as new writable external store section - T246072 (duration: 00m 59s)

Mentioned in SAL (#wikimedia-operations) [2020-03-10T09:26:10Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Enable es5 as new writable external store section - T246072 (duration: 00m 58s)

Mentioned in SAL (#wikimedia-operations) [2020-03-10T09:27:20Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Enable es5 as new writable external store section - T246072 (duration: 00m 57s)

Mentioned in SAL (#wikimedia-operations) [2020-03-10T09:34:20Z] <marostegui> es5 deployment window finished T246072

es5 has been enabled correctly.
We will either disable es3 and set it as RO today or tomorrow (as originally scheduled).

Marostegui updated the task description. Tue, Mar 10, 9:42 AM

Change 578547 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] prometheus-mysqld-exporter: Detect special "standalone replicas" from the db

https://gerrit.wikimedia.org/r/578547

Change 578547 merged by Jcrespo:
[operations/puppet@production] prometheus-mysqld-exporter: Detect special "standalone replicas" from the db

https://gerrit.wikimedia.org/r/578547

Change 576655 abandoned by Jcrespo:
prometheus-mysqld-exporter: Add es3 to the list of standalone sections

Reason:
Obsolete with https://gerrit.wikimedia.org/r/c/operations/puppet/+/578547
@Marostegui we (including me) should now remember to update the sections table when setting a new section as "standalone" (no replication).

zarcillo> UPDATE sections SET standalone=1 WHERE name='es3' LIMIT 1;

https://gerrit.wikimedia.org/r/576655

Change 578766 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Set es3 as RO

https://gerrit.wikimedia.org/r/578766

Change 578766 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Set es3 as RO

https://gerrit.wikimedia.org/r/578766

Mentioned in SAL (#wikimedia-operations) [2020-03-11T09:06:04Z] <marostegui@deploy1001> Synchronized wmf-config/db-codfw.php: Set es3 as RO - T246072 (duration: 01m 08s)

Mentioned in SAL (#wikimedia-operations) [2020-03-11T09:09:01Z] <marostegui@deploy1001> Synchronized wmf-config/db-eqiad.php: Set es3 as RO - T246072 (duration: 01m 08s)

Mentioned in SAL (#wikimedia-operations) [2020-03-11T09:18:53Z] <marostegui> Set es1017 (es3 master) in read only on mysql T246072

Change 578816 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] es3 hosts: Change them to standalone

https://gerrit.wikimedia.org/r/578816

Change 578816 merged by Marostegui:
[operations/puppet@production] es3 hosts: Change them to standalone

https://gerrit.wikimedia.org/r/578816

After stopping heartbeat on es3 master es1017:

root@cumin1001:/home/marostegui# ./section es3 | while read host port; do echo "$host:$port"; mysql.py -h$host:$port -e "show master status\G show slave status\G"; done
es2019.codfw.wmnet:3306
*************************** 1. row ***************************
            File: es2019-bin.003904
        Position: 590913484
    Binlog_Do_DB:
Binlog_Ignore_DB:
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: es2017.codfw.wmnet
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: es2017-bin.004053
          Read_Master_Log_Pos: 266746454
               Relay_Log_File: es2019-relay-bin.001361
                Relay_Log_Pos: 143899800
        Relay_Master_Log_File: es2017-bin.004053
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 266746454
              Relay_Log_Space: 247599992
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: Yes
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 180355214
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 171970747-171970747-336240191,0-180355214-447846542,171974721-171974721-468813919,180355214-180355214-80619015,180359340-180359340-563
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
es2018.codfw.wmnet:3306
*************************** 1. row ***************************
            File: es2018-bin.004052
        Position: 590913468
    Binlog_Do_DB:
Binlog_Ignore_DB:
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: es2017.codfw.wmnet
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: es2017-bin.004053
          Read_Master_Log_Pos: 266746454
               Relay_Log_File: es2018-relay-bin.001366
                Relay_Log_Pos: 143899800
        Relay_Master_Log_File: es2017-bin.004053
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 266746454
              Relay_Log_Space: 266747672
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: Yes
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 180355214
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 171970747-171970747-336240191,0-180355214-447846542,171974721-171974721-468813919,180355214-180355214-80619015,180359340-180359340-563
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
es2017.codfw.wmnet:3306
*************************** 1. row ***************************
            File: es2017-bin.004053
        Position: 266746454
    Binlog_Do_DB:
Binlog_Ignore_DB:
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: es1017.eqiad.wmnet
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: es1017-bin.004424
          Read_Master_Log_Pos: 552431229
               Relay_Log_File: es2017-relay-bin.000841
                Relay_Log_Pos: 552431518
        Relay_Master_Log_File: es1017-bin.004424
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 552431229
              Relay_Log_Space: 552431861
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: Yes
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 171974721
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 171970747-171970747-336240191,0-180359340-438187354,171974721-171974721-468813919,180355214-180355214-36897786,180359340-180359340-87
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
es1019.eqiad.wmnet:3306
*************************** 1. row ***************************
            File: es1019-bin.004418
        Position: 574755788
    Binlog_Do_DB:
Binlog_Ignore_DB:
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: es1017.eqiad.wmnet
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: es1017-bin.004424
          Read_Master_Log_Pos: 552431229
               Relay_Log_File: es1019-relay-bin.000221
                Relay_Log_Pos: 7188532
        Relay_Master_Log_File: es1017-bin.004424
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 552431229
              Relay_Log_Space: 552432392
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: Yes
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 171974721
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 171970747-171970747-336240191,0-180359340-438187354,171974721-171974721-468813919,180355214-180355214-36897786,180359340-180359340-87
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
es1017.eqiad.wmnet:3306
*************************** 1. row ***************************
            File: es1017-bin.004424
        Position: 552431229
    Binlog_Do_DB:
Binlog_Ignore_DB:
es1014.eqiad.wmnet:3306
*************************** 1. row ***************************
            File: es1014-bin.004437
        Position: 859966535
    Binlog_Do_DB:
Binlog_Ignore_DB:
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: es1017.eqiad.wmnet
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: es1017-bin.004424
          Read_Master_Log_Pos: 552431229
               Relay_Log_File: es1014-relay-bin.001207
                Relay_Log_Pos: 7188532
        Relay_Master_Log_File: es1017-bin.004424
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 552431229
              Relay_Log_Space: 552432392
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: Yes
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 171974721
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 171970747-171970747-336240191,0-180359340-438187354,171974721-171974721-468813919,180355214-180355214-36897786,180359340-180359340-87
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
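
The check done by eye on the output above (every replica has both threads running, reports zero lag, and has executed up to the same binlog file and position that its master reports) could be scripted roughly as follows. This is only a minimal sketch: the helper names `parse_vertical` and `replica_at_master_position` are hypothetical, the sample is a shortened copy of the fields pasted above for a replica of es1017, and it assumes the vertical output has already been captured as text.

```python
def parse_vertical(text):
    """Parse one vertical-format result row ("Key: value" per line) into a dict."""
    row = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("*"):
            continue  # skip blank lines and the "*** 1. row ***" separator
        key, _, value = line.partition(":")
        row[key.strip()] = value.strip()
    return row


def replica_at_master_position(status, master_file, master_pos):
    """True if the replica is healthy and has applied up to the master's position."""
    return (
        status.get("Slave_IO_Running") == "Yes"
        and status.get("Slave_SQL_Running") == "Yes"
        and status.get("Seconds_Behind_Master") == "0"
        # Everything fetched from the master must also have been executed.
        and status.get("Master_Log_File") == master_file
        and status.get("Relay_Master_Log_File") == master_file
        and status.get("Read_Master_Log_Pos") == str(master_pos)
        and status.get("Exec_Master_Log_Pos") == str(master_pos)
    )


# Shortened copy of the SHOW SLAVE STATUS fields pasted above (replica of es1017).
sample = """
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: es1017.eqiad.wmnet
              Master_Log_File: es1017-bin.004424
          Read_Master_Log_Pos: 552431229
        Relay_Master_Log_File: es1017-bin.004424
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
          Exec_Master_Log_Pos: 552431229
        Seconds_Behind_Master: 0
"""

status = parse_vertical(sample)
# File and position taken from the SHOW MASTER STATUS output of es1017 above.
in_sync = replica_at_master_position(status, "es1017-bin.004424", 552431229)
print(status["Master_Host"], in_sync)
```

Comparing against the master's own file and position, rather than only looking at `Seconds_Behind_Master`, is what makes this meaningful once writes are stopped: a replica can report zero lag and still be missing the last events if its IO thread has not fetched them yet.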

Mentioned in SAL (#wikimedia-operations) [2020-03-11T09:29:23Z] <marostegui> Disconnect replication on all es3 hosts T246072

Mentioned in SAL (#wikimedia-operations) [2020-03-11T09:43:03Z] <marostegui> Finish es3 maintenance window T246072

Marostegui updated the task description. Wed, Mar 11, 9:43 AM

Mentioned in SAL (#wikimedia-operations) [2020-03-11T10:38:02Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Slowly give weight to es3 old masters - T246072', diff saved to https://phabricator.wikimedia.org/P10684 and previous config saved to /var/cache/conftool/dbconfig/20200311-103802-marostegui.json

Mentioned in SAL (#wikimedia-operations) [2020-03-11T11:03:35Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Give normal 100 weight to es3 old masters - T246072', diff saved to https://phabricator.wikimedia.org/P10685 and previous config saved to /var/cache/conftool/dbconfig/20200311-110334-marostegui.json

Marostegui closed this task as Resolved. Wed, Mar 11, 11:07 AM
Marostegui claimed this task.
Marostegui updated the task description.

I am considering this all done, including the documentation, which will probably need some tweaking here and there, and some expansion if we feel like it.
Thanks _a lot_, @jcrespo, for all the help.