Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts
Open, Needs TriagePublic

Description

The following shards needs to be imported and sanitized into db1095 (sanitarium2) and labsdb1009,1010 and 1011

  • s2
  • s4
  • s5
  • s6
  • s7
There are a very large number of changes, so older changes are hidden. Show Older Changes

Mentioned in SAL (#wikimedia-operations) [2017-02-03T09:42:54Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1064 - T153743 (duration: 00m 40s)

Mentioned in SAL (#wikimedia-operations) [2017-02-03T09:48:06Z] <marostegui> Restart mysql on db1064 to get its binary log changed to ROW - T153743

Mentioned in SAL (#wikimedia-operations) [2017-02-03T09:59:17Z] <marostegui> Upgrade db1064 from MariaDB 10.0.23 to 10.0.29 - T153743

Mentioned in SAL (#wikimedia-operations) [2017-02-03T10:04:31Z] <marostegui> Reboot db1064 to pick up the new kernels T153743

Mentioned in SAL (#wikimedia-operations) [2017-02-03T10:19:54Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1064 - T153743 (duration: 00m 42s)

db1064 the future master of db1095 for s4 has now been:

  • Changed its binary log to ROW base
  • Upgraded from 10.0.23 to 10.0.29 (and jessie upgraded from 8.5 to 8.7)
  • Rebooted to pick up the last kernel

db1095 has finished transferring its snapshot to: es1017:/srv/tmp
I have tested to decompress it and it worked fine.

db1095 is back up and the slaves have reconnected.

Mentioned in SAL (#wikimedia-operations) [2017-02-06T07:30:53Z] <marostegui> Stop MySQL on db1095 to snapshot it to es1017 - T153743

Change 336194 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Depool db1064

https://gerrit.wikimedia.org/r/336194

Change 336194 merged by jenkins-bot:
db-eqiad.php: Depool db1064

https://gerrit.wikimedia.org/r/336194

Mentioned in SAL (#wikimedia-operations) [2017-02-06T08:43:28Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1064 - T153743 (duration: 00m 41s)

Mentioned in SAL (#wikimedia-operations) [2017-02-06T09:30:26Z] <marostegui> Stop MySQL Replication on db1064 for maintenance - T153743

Mentioned in SAL (#wikimedia-operations) [2017-02-06T10:14:41Z] <marostegui> Started to transfer commonswiki (ibd and cfg) from db1064 to labsdb1011 - T153743

I have hit this: T151607#2826415 when importing the tables on labsdb1011.
I am mysqldumping these three tables (which are not really used, at least two of them) and will import them manually once the rest of the tables have been finished in labsdb1011:

root@PRODUCTION s4[commonswiki]> select t.table_schema,t.engine,t.table_name,c.column_name,c.column_type from information_schema.tables t    inner join information_schema.columns c on c.table_schema = t.table_schema and c.table_name = t.table_name   left outer join information_schema.innodb_sys_tables ist on ist.name = concat(t.table_schema,'/',t.table_name)   left outer join information_schema.innodb_sys_columns isc on isc.table_id = ist.table_id and isc.name = c.column_name  where c.column_type in ('time','timestamp','datetime')   and t.table_schema not in ('mysql','information_schema','performance_schema', 'ops', 'sys')   and t.table_type = 'base table'   and (t.engine != 'innodb' or (t.engine = 'innodb' and isc.mtype = 6)) order by t.table_schema,t.table_name,c.column_name;
+--------------+--------+-----------------+--------------+-------------+
| table_schema | engine | table_name      | column_name  | column_type |
+--------------+--------+-----------------+--------------+-------------+
| commonswiki  | InnoDB | categorylinks   | cl_timestamp | timestamp   |
| commonswiki  | InnoDB | objectcache     | exptime      | datetime    |
| commonswiki  | InnoDB | __wmf_checksums | ts           | timestamp   |
+--------------+--------+-----------------+--------------+-------------+
3 rows in set (0.05 sec)
jcrespo added a subscriber: jcrespo.Feb 6 2017, 2:37 PM

__wmf_checksums can just be deleted, specially on labs.

Mentioned in SAL (#wikimedia-operations) [2017-02-06T18:06:56Z] <marostegui> Start to transfer commonswiki ibd and cfg from db1064 to labsdb1010 - https://phabricator.wikimedia.org/T153743

labsdb1011 has now commonswiki imported. I have restarted a couple of times MySQL there without any issues. Also I have done a SELECT limit in all the tables and they all went fine.

Mentioned in SAL (#wikimedia-operations) [2017-02-07T07:06:33Z] <marostegui> Importing commonswiki tables on labsdb1010 - T153743

Mentioned in SAL (#wikimedia-operations) [2017-02-07T07:08:19Z] <marostegui> Transferring commonswiki tables from db1064 to labsdb1009 - T153743

Mentioned in SAL (#wikimedia-operations) [2017-02-07T08:40:36Z] <marostegui> Transferring commonswiki tables from db1064 to db1095 - T153743

All the tables have been imported to labsdb1010.
Stopping and starting mysql worked fine, doing a SELECT over all the tables worked and I have just run: "optimize table" over a few of them and no issues.
A few select count over the biggest tables also worked fine.

Mentioned in SAL (#wikimedia-operations) [2017-02-07T13:44:59Z] <marostegui> Importing commonswiki tables on db1095 - T153743

Mentioned in SAL (#wikimedia-operations) [2017-02-07T14:05:53Z] <marostegui> Importing commonswiki tables on labsdb1009 - T153743

Mentioned in SAL (#wikimedia-operations) [2017-02-08T07:23:02Z] <marostegui> Restart MySQL db1095 and labsdb1009 for maintenance - T153743

All the tables have been imported to labsdb1009.
Stopping and starting mysql worked fine, doing a SELECT over all the tables worked and I have just run: "optimize table" over a few of them and no issues.
A few select count over the biggest tables also worked fine.

All the tables have been imported to db1095.
Stopping and starting mysql worked fine, doing a SELECT over all the tables worked and I have just run: "optimize table" over a few of them and no issues.
A few select count over the biggest tables also worked fine.

Change 336600 had a related patch set uploaded (by Marostegui):
check_private_data.py: Add missing quote

https://gerrit.wikimedia.org/r/336600

Change 336600 merged by Marostegui:
check_private_data.py: Add missing quote

https://gerrit.wikimedia.org/r/336600

data has been sanitized in all the hosts (sanitarium2, labsdb1009,10,11).
Triggers have been enabled on commonswiki on sanitarium2

I have run the check_private_data script on all the hosts and nothing is being reported there.

Pending:

  • enable replication db1095->db1064
  • create the views - @chasemp / @yuvipanda is that something you will take care of?
  • add the grants for commonswiki
jcrespo added a comment.EditedFeb 8 2017, 1:40 PM

create the views - @chasemp / @yuvipanda is that something you will take care of?
add the grants for commonswiki

BTW, we said that if we took care manually of the grants, maybe that could be automatized so we do not bother you much? (it is ok to say no/not yet, I just do not want to bother you unnecessarily).

Mentioned in SAL (#wikimedia-operations) [2017-02-08T13:40:45Z] <marostegui> Enable replication between db1095 and db1064 - T153743

Mentioned in SAL (#wikimedia-operations) [2017-02-08T13:41:29Z] <marostegui> Start replication on db1064 - T153743

Replication has been started on db1095 and db1064 and so far it is looking good (they are both catching up after 2 days stopped)

We also need to compress commonswiki on all the hosts.

data has been sanitized in all the hosts (sanitarium2, labsdb1009,10,11).
Triggers have been enabled on commonswiki on sanitarium2

I have run the check_private_data script on all the hosts and nothing is being reported there.

Pending:

create the views - @chasemp / @yuvipanda is that something you will take care of?
add the grants for commonswiki

BTW, we said that if we took care manually of the grants, maybe that could be automatized so we do not bother you much? (it is ok to say no/not yet, I just do not want to bother you unnecessarily).

Sure, I'll run it today. No worries on the ping for views, once we have everything there and it's down to changes to actual existing views I think I'll make the views run a notify in Puppet on demand. I'm still feeling more comfortable running it actively for now though. Still looking for surprises as it were :)

maintain-views --all-databases --replace-all --debug

on all three, small note labsdb1010 had that same issue which needed a flush privs or it would push back with:

pymysql.err.OperationalError: (1142, "CREATE VIEW command denied to user 'maintainviews'@'localhost' for table 'abuse_filter_action'"

We are not exactly sure why this pops up randomly at the moment.

maintain-views --all-databases --replace-all --debug

on all three, small note labsdb1010 had that same issue which needed a flush privs or it would push back with:

pymysql.err.OperationalError: (1142, "CREATE VIEW command denied to user 'maintainviews'@'localhost' for table 'abuse_filter_action'"

We are not exactly sure why this pops up randomly at the moment.

So for the record, the above issue has been fixed by @jcrespo doing a flush privileges.
As per @chasemp the views are looking good.
And I have just added the labsdbuser grants to access commonswiki view on labsdb1009,10 and 11

thanks guys!

Mentioned in SAL (#wikimedia-operations) [2017-02-09T07:41:38Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1064 - T153743 (duration: 00m 40s)

Change 336760 had a related patch set uploaded (by Marostegui):
db-eqiad.php: Add comment for db1064

https://gerrit.wikimedia.org/r/336760

Change 336760 merged by jenkins-bot:
db-eqiad.php: Add comment for db1064

https://gerrit.wikimedia.org/r/336760

Mentioned in SAL (#wikimedia-operations) [2017-02-09T07:47:16Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Added comment for db1064 being master of db1095 - T153743 (duration: 00m 40s)

Mentioned in SAL (#wikimedia-operations) [2017-02-09T08:17:18Z] <marostegui> Compressing commonswiki tables on db1095 - T153743

Mentioned in SAL (#wikimedia-operations) [2017-02-09T16:15:35Z] <marostegui> Compressing commonswiki on labsdb1009 - T153743

Mentioned in SAL (#wikimedia-operations) [2017-02-13T07:03:21Z] <marostegui> Compressing commonswiki tables on labsdb1010 and labsdb1011 - T153743

Marostegui added a comment.EditedFeb 13 2017, 7:07 AM

commonswiki has been compressed on db1095 and labsdb1009. It has gone from around 1.3T to 495G. There is still 1.6T available on db1095, so I believe we can try to import dewiki and wikidatawiki there.
Checking the master there db1049 shows that dewiki is around 130G and wikidatawiki is around 260G which sanitized and compressed will be a lot less.

db1071 a random slaves shows bigger numbers though (which are probably the ones we need to keep in mind for db1095)

dewiki -> 382G
wikidatawiki -> 858G

Those will be reduced a lot anyways by the sanitization and the compression

On all three new servers :)

2017-02-14 17:03:54,237 DEBUG SQL:
                    CREATE OR REPLACE
                    DEFINER=viewmaster
                    VIEW `gnwikibooks_p`.`abuse_filter_action`
                    AS SELECT * FROM `gnwikibooks`.`abuse_filter_action`;

maintain-views --all-databases --replace-all --debug
pymysql.err.OperationalError: (1142, "CREATE VIEW command denied to user 'maintainviews'@'localhost' for table 'abuse_filter_action'")

gnwikibooks?

Try now on labsdb1009.

Oh I thought that was solved a few posts before with a FLUSH PRIVILEGES

gnwikibooks?

It seems order of creation for views is not deterministic. that was on labsdb1011.

labsdb1009

2017-02-14 17:03:26,853 DEBUG SQL:
                   CREATE OR REPLACE
                   DEFINER=viewmaster
                   VIEW `ugwiki_p`.`abuse_filter_action`
                   AS SELECT * FROM `ugwiki`.`abuse_filter_action`;

labsdb1010

2017-02-14 17:03:42,993 DEBUG SQL:
                    CREATE OR REPLACE
                    DEFINER=viewmaster
                    VIEW `wowiktionary_p`.`abuse_filter_action`
                    AS SELECT * FROM `wowiktionary`.`abuse_filter_action`;

If I get specific it's the same issue though:

root@labsdb1010:~# maintain-views --databases enwiki --replace-all --debug
2017-02-14 17:10:42,937 DEBUG Removing 0 dbs as sensitive
2017-02-14 17:10:42,939 INFO Full views for enwiki:
2017-02-14 17:10:42,942 INFO [abuse_filter_action]
2017-02-14 17:10:42,942 DEBUG SQL:
                    CREATE OR REPLACE
                    DEFINER=viewmaster
                    VIEW `enwiki_p`.`abuse_filter_action`
                    AS SELECT * FROM `enwiki`.`abuse_filter_action`;

pymysql.err.OperationalError: (1142, "CREATE VIEW command denied to user 'maintainviews'@'localhost' for table 'abuse_filter_action'")

Try now on labsdb1009.

nope

root@labsdb1009:~# maintain-views --databases enwiki --replace-all

Traceback (most recent call last):
  File "/usr/local/sbin/maintain-views", line 442, in <module>
    ops.execute(fullviews, customviews)
  File "/usr/local/sbin/maintain-views", line 272, in execute
    self.do_fullview(view)
  File "/usr/local/sbin/maintain-views", line 125, in do_fullview
    """.format(self.definer, self.db_p, view, self.db))
  File "/usr/local/sbin/maintain-views", line 53, in write_execute
    self.cursor.execute(query)
  File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 167, in execute
    result = self._query(query)
  File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 323, in _query
    conn.query(q)
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 836, in query
    self._affected_rows = self._read_query_result(unbuffered=unbuffered)
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 1020, in _read_query_result
    result.read()
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 1303, in read
    first_packet = self.connection._read_packet()
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 982, in _read_packet
    packet.check_error()
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 394, in check_error
    err.raise_mysql_exception(self._data)
  File "/usr/lib/python3/dist-packages/pymysql/err.py", line 120, in raise_mysql_exception
    _check_mysql_exception(errinfo)
  File "/usr/lib/python3/dist-packages/pymysql/err.py", line 112, in _check_mysql_exception
    raise errorclass(errno, errorvalue)
pymysql.err.OperationalError: (1142, "CREATE VIEW command denied to user 'maintainviews'@'localhost' for table 'abuse_filter_action'")

Then this is a different issue than the FLUSH STATUS one. Checking.

gnwikibooks?

It seems order of creation for views is not deterministic. that was on labsdb1011.

labsdb1009

2017-02-14 17:03:26,853 DEBUG SQL:
                    CREATE OR REPLACE
                    DEFINER=viewmaster
                    VIEW `ugwiki_p`.`abuse_filter_action`
                    AS SELECT * FROM `ugwiki`.`abuse_filter_action`;

labsdb1010

2017-02-14 17:03:42,993 DEBUG SQL:
                     CREATE OR REPLACE
                     DEFINER=viewmaster
                     VIEW `wowiktionary_p`.`abuse_filter_action`
                     AS SELECT * FROM `wowiktionary`.`abuse_filter_action`;

If I get specific it's the same issue though:

root@labsdb1010:~# maintain-views --databases enwiki --replace-all --debug
 2017-02-14 17:10:42,937 DEBUG Removing 0 dbs as sensitive
 2017-02-14 17:10:42,939 INFO Full views for enwiki:
 2017-02-14 17:10:42,942 INFO [abuse_filter_action]
 2017-02-14 17:10:42,942 DEBUG SQL:
                     CREATE OR REPLACE
                     DEFINER=viewmaster
                     VIEW `enwiki_p`.`abuse_filter_action`
                     AS SELECT * FROM `enwiki`.`abuse_filter_action`;

pymysql.err.OperationalError: (1142, "CREATE VIEW command denied to user 'maintainviews'@'localhost' for table 'abuse_filter_action'")

But the view already exists, right? At least on labsdb1009

Maybe it lacks the grants for replace (drop). I have dropped [ugwiki_p]> drop view abuse_filter_action;. Can you run the same query only for ugwiki ?

Yes, it does exist but to update the views it's easier to say "update all of them to match canonical source replacing any that already exist". Mainly because a certain custom view for a certain table can change and even if the view exists it is now stale when compared to our yaml definition. That has been the historical scheme anyway. I added logic to specify a certain wikidb for ease and sanity (and even a particular table of a particular wikidb) but it's still common if there are a wide variety of new wikidb's or odds and ends to update to do it en mass and replace existing.

It is failing to replace an existing view, which may be interesting?

Could be but the grants are ALL as per this (labsdb1009) so I would assume the replace should go through no?:

GRANT ALL PRIVILEGES ON `%wik%\_p`.* TO 'maintainviews'@'localhost'
chasemp added a comment.EditedFeb 14 2017, 5:28 PM

Maybe it lacks the grants for replace (drop). I have dropped [ugwiki_p]> drop view abuse_filter_action;. Can you run the same query only for ugwiki ?

root@labsdb1009:~# maintain-views --databases ugwiki --table abuse_filter_action
pymysql.err.OperationalError: (1142, "CREATE VIEW command denied to user 'maintainviews'@'localhost' for table 'abuse_filter_action'")

edit:

2017-02-14 17:28:56,409 DEBUG SQL:
                    CREATE OR REPLACE
                    DEFINER=viewmaster
                    VIEW `ugwiki_p`.`abuse_filter_action`
                    AS SELECT * FROM `ugwiki`.`abuse_filter_action`;

Silly question, are you connecting through localhost specifically?
I am thinking about the issue we had in September where it was doing something weird with localhost/127.0.0.1 and the client was getting a bit crazy. I cannot remember all the details...

Silly question, are you connecting through localhost specifically?
I am thinking about the issue we had in September where it was doing something weird with localhost/127.0.0.1 and the client was getting a bit crazy. I cannot remember all the details...

Yup, not silly at all. It hasn't changed in awhile but it's using:

dbh = pymysql.connect(
    user=config["mysql_user"],
    passwd=config["mysql_password"],
    unix_socket="/tmp/mysql.sock",
    charset="utf8"
)

where user/pass are defined in /etc/maintain-views.yaml

I've added a workaround that makes no sense but that works for now ,we need to revisit it in the future. Maybe there is a bug on the latest version?

I've added a workaround that makes no sense but that works for now ,we need to revisit it in the future. Maybe there is a bug on the latest version?

Just for the record and future references, I believe this is what was added:

GRANT SELECT, DROP, CREATE VIEW ON `%wik%`.* TO 'maintainviews'@'localhost'

Which as Jaime says, makes not much sense that it needs those privileges (DROP and CREATE) on the original databases...

@Marostegui : Quick update on that one as well - How many shards are still to be done?
Thanks !

@Marostegui : Quick update on that one as well - How many shards are still to be done?
Thanks !

Hi!

We are limited by disk space on the hosts, I believe we will soon import s5 (dewiki and wikidata). After that we need to reavaluate how much disk space we still available there.
Reminder it already has:

s1 (enwiki)
s3 (around 800 wikis)
s4 (commons)

Thanks for the headsup @Marostegui :)

Shall we go for s5 or s2 next? I am not sure we can do both (because of disk space issues).
s2 has more wikis, but s5 has wikidata :)

s5 for me

Agreed.
If that is the case, I believe db1070 can be a good option to be sanitarium's master.
It is the vslow slave now, and by checking the future hardware and how the shard will look like after the new hardware comes, db1070 will remain there being vslow slave. So we'd not need to change sanitarium's masters after all the hardware refresh.

Change 341007 had a related patch set uploaded (by marostegui):
[operations/puppet] site.pp: Enable ROW binlog for db1070

https://gerrit.wikimedia.org/r/341007

Change 341759 had a related patch set uploaded (by Marostegui):
[operations/mediawiki-config] db-eqiad.php: Depool db1070

https://gerrit.wikimedia.org/r/341759

Change 341007 merged by Marostegui:
[operations/puppet] site.pp: Enable ROW binlog for db1070

https://gerrit.wikimedia.org/r/341007

Change 341759 merged by jenkins-bot:
[operations/mediawiki-config] db-eqiad.php: Depool db1070

https://gerrit.wikimedia.org/r/341759

Mentioned in SAL (#wikimedia-operations) [2017-03-08T08:32:45Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1070 - T153743 (duration: 00m 41s)

Mentioned in SAL (#wikimedia-operations) [2017-03-08T08:36:15Z] <marostegui> Restart mysql on db1070 to change binlog to ROW - T153743

Mentioned in SAL (#wikimedia-operations) [2017-03-08T09:00:58Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Repool db1070 - T153743 (duration: 00m 41s)

Change 341797 had a related patch set uploaded (by Marostegui):
[operations/mediawiki-config] db-eqiad.php: Add a few comments

https://gerrit.wikimedia.org/r/341797

Change 341797 merged by jenkins-bot:
[operations/mediawiki-config] db-eqiad.php: Add a few comments

https://gerrit.wikimedia.org/r/341797

db1070 has been restarted with ROW based replication to serve as a master for db1095.

Mentioned in SAL (#wikimedia-operations) [2017-03-08T17:52:26Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: db1070 ROW based replication comments - T153743 (duration: 00m 41s)

In order to start getting ready to import s5 on sanitarium2 and labsdb1009,10,11 I am going to start:

  • Compressing InnoDB on db1070 (master for sanitarium2)
  • Backuping db1095 just in case

Change 342427 had a related patch set uploaded (by Marostegui):
[operations/mediawiki-config] db-eqiad.php: Depool db1070

https://gerrit.wikimedia.org/r/342427

Change 342427 merged by jenkins-bot:
[operations/mediawiki-config] db-eqiad.php: Depool db1070

https://gerrit.wikimedia.org/r/342427

Mentioned in SAL (#wikimedia-operations) [2017-03-13T08:27:40Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db1070 - T153743 (duration: 00m 41s)

Mentioned in SAL (#wikimedia-operations) [2017-03-13T08:30:42Z] <marostegui> Stop MySQL on db1095 (sanitarium2) to take a backup - T153743

Mentioned in SAL (#wikimedia-operations) [2017-03-13T08:31:30Z] <marostegui> Stop replication on labsdb1009,10 and 11 - T153743

labsdb1009,10 and 11 - replication stopped
db1095 replication stopped and mysql down
data transfer between db1095 and dbstore1001 is now happening.

Mentioned in SAL (#wikimedia-operations) [2017-03-13T08:40:43Z] <marostegui> Compress dewiki - db1070 - T153743

Marostegui moved this task from Next to In progress on the DBA board.Mon, Mar 13, 9:46 AM

I just realised that db1070 doesn't have .ibd files because of this: T137191
So I think I will reclone that host from a host that does have file per table and then we can export dewiki and wikidata do sanitarium2.

Probably also do a re-image won't hurt.

Marostegui moved this task from In progress to Next on the DBA board.Mon, Mar 13, 10:21 AM