Fri, May 17
Thu, May 16
Wed, May 15
I don't see a (thank) link on the non-working ones for bots and anonymous users, but I would guess that is intended.
Tue, May 14
dbproxy1006 switched over completely. The above patch (plus the db1139 shutdown) will be done hours before the maintenance.
Ignore the above, that is unrelated.
It now says: CRITICAL: Devices (12) not equal to PDs (2)
Not sure if related, but now there seems to be contention on cebwiki for LinksUpdate::updateLinksTimestamp. This one looks more like a structural problem, as it seems a lot of connections are trying to update the same page row at the same time:
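(Not the actual excerpt; a hypothetical sketch of the pattern. The page_id is made up, and page.page_links_updated is the column LinksUpdate::updateLinksTimestamp touches:)

-- Many concurrent connections issue this kind of statement for the
-- same page, so they all serialize on one row lock (page_id made up):
UPDATE page SET page_links_updated = '20190514000000' WHERE page_id = 12345;

-- The lock queue can be observed while it happens:
SELECT * FROM information_schema.INNODB_LOCK_WAITS;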
If you want fast, in-memory, replicated storage, we have memcached. It doesn't have the fancy data types redis has, but that hardly makes it impossible to use for the same purpose.
If you want less fast, reliable, eventually consistent multi-dc storage, you have cassandra via kask.
Mon, May 13
It would be nice to have a mockup of the API to test soon (with no production effect, except maybe some debug information). That will allow testing automation from scripts we already have. I think that would be step #6?
Sat, May 11
All 9 + 9 backups worked, starting at 20:00 UTC, and the last one finished at 10:15 the next day. 14.4 TB of backups were produced in that interval (~5.6 TB after compression).
In comparison, dumps do 14 + 13 backups, taking from 17:00 UTC to 00:12 the next day, with a total size of 2.9 TB after compression.
Fri, May 10
Things pending I would like to work on:
MySQL and Prometheus have been stopped on the above hosts. This is almost ready; the only pending step is to wait some time and see if there is something we would like to keep from these old hosts.
I'd say, after closing all children, that this is done.
A first version has been integrated into transfer.py:
firstname.lastname@example.org[zarcillo]> select file_path, file_name, round(size/1024/1024/1024 * 100)/100 as GB FROM backup_files where backup_id=1311 order by size desc LIMIT 20;
+--------------+------------------------------------------------------+----------+
| file_path    | file_name                                            | GB       |
+--------------+------------------------------------------------------+----------+
| wikidatawiki | wb_terms.ibd                                         | 475.8500 |
| wikidatawiki | revision.ibd                                         | 188.8200 |
| wikidatawiki | revision_actor_temp.ibd                              | 161.7200 |
| wikidatawiki | pagelinks.ibd                                        | 155.5300 |
| wikidatawiki | slots.ibd                                            |  87.2900 |
| wikidatawiki | content.ibd                                          |  86.7900 |
| wikidatawiki | revision_comment_temp.ibd                            |  51.8000 |
| wikidatawiki | comment.ibd                                          |  45.0700 |
| wikidatawiki | change_tag.ibd                                       |  42.6300 |
| wikidatawiki | text.ibd                                             |  36.4000 |
| wikidatawiki | cu_changes.ibd                                       |  20.6700 |
| wikidatawiki | page_props.ibd                                       |  15.8800 |
| wikidatawiki | wb_items_per_site.ibd                                |  12.0900 |
| wikidatawiki | externallinks.ibd                                    |  10.4100 |
| wikidatawiki | recentchanges.ibd                                    |   8.5500 |
| wikidatawiki | page.ibd                                             |   6.7300 |
| wikidatawiki | wikimedia_editor_tasks_entity_description_exists.ibd |   6.2200 |
| wikidatawiki | wb_changes_subscription.ibd                          |   5.2400 |
| wikidatawiki | __wmf_checksums.ibd                                  |   4.7200 |
| wikidatawiki | watchlist.ibd                                        |   4.3500 |
+--------------+------------------------------------------------------+----------+
20 rows in set (0.00 sec)
email@example.com[zarcillo]> select IF(LOCATE('eqiad', source), 'eqiad', 'codfw') as dc, section, round(total_size / 1024/1024/1024 * 100)/100 as GB FROM backups where start_date > now() - interval 23 hour ORDER BY total_size DESC;
+-------+---------+-----------+
| dc    | section | GB        |
+-------+---------+-----------+
| eqiad | s8      | 1450.8400 |
| codfw | s8      | 1150.2200 |
| eqiad | s4      |  994.5700 |
| codfw | s4      |  969.4600 |
| eqiad | s1      |  896.9300 |
| eqiad | s2      |  805.2100 |
| eqiad | s7      |  801.9200 |
| eqiad | s3      |  800.4600 |
| codfw | s3      |  713.2600 |
| codfw | s5      |  608.6100 |
| codfw | x1      |  110.6500 |
| eqiad | x1      |   95.5100 |
| codfw | s1      |      NULL |
+-------+---------+-----------+
13 rows in set (0.00 sec)
Multichill, I think a solution to your problems can be done on wikireplicas (or at a similar level): wikireplicas don't need to have the same structure as production, and additional tables or indexes can be created there. I don't think internal production needs should cater to tool needs. That doesn't mean tool needs should not be provided for; on the contrary, better query methods should be provided, but these are different problems that should not be confused with one another. Better APIs should be made available with a stable interface, I 100% agree with that, but the database cannot be an interface that is guaranteed to be stable.
Thu, May 9
Compression has finished for these hosts.
Now it says proton1001: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
Wed, May 8
Executed as I mentioned above:
I think it overloaded again today (times are CEST):
To set up replication on the destination, a question: does the metadata file contain only GTID coordinates, so that we have to do the "translation" by looking at the master's binlog?
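For reference, a sketch of both options in MariaDB syntax (all coordinates and host names below are made up, not taken from any metadata file):

-- If the metadata has GTID coordinates, they can be applied directly:
SET GLOBAL gtid_slave_pos = '0-171970580-123456789';
CHANGE MASTER TO
    MASTER_HOST = 'db1234.eqiad.wmnet',  -- made-up master
    MASTER_USE_GTID = slave_pos;
START SLAVE;

-- Otherwise, the "translation": locate the equivalent binlog file and
-- position on the master and use classic coordinates instead:
CHANGE MASTER TO
    MASTER_HOST = 'db1234.eqiad.wmnet',
    MASTER_LOG_FILE = 'db1234-bin.001234',  -- made up
    MASTER_LOG_POS = 98765432,
    MASTER_USE_GTID = no;
START SLAVE;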
root@cumin2001:~$ time transfer.py --no-checksum --no-encrypt --type=decompress dbprov2001.codfw.wmnet:/srv/backups/snapshots/latest/snapshot.s6.2019-05-07--20-00-02.tar.gz db2117.codfw.wmnet:/srv/sqldata
...
WARNING: Original size is 207411963527 but transferred size is 494357898256 for copy to db2117.codfw.wmnet
494357898256 bytes correctly transferred from dbprov2001.codfw.wmnet to db2117.codfw.wmnet
Mon, May 6
Blocked on bacula setup.
Does T220002#5158901 conflict with setting it as spare? I wanted to set it as spare soon-ish, decom later.
You might be confusing it with db2047; I don't recall db2049 having a disk replaced lately.
Sun, May 5
All the hosts have been set up and provisioned. The only pending patch to deploy is https://gerrit.wikimedia.org/r/507925 There are, however, a few iterations of table optimization and compression still to do.
T222526 db2049 (again?)
@Papaul Please see if you have spare 600 GB disks (this is unlikely to be covered by warranty) to replace this. In case you don't, we can look at other options.
Thu, May 2
How does it sound?
eqiad is complete too, also pending only possible recompressions to save space, like most of the codfw servers here.
-- Logs begin at Sat 2019-04-20 15:06:53 UTC, end at Thu 2019-05-02 16:07:12 UTC. --
May 02 14:53:39 dbproxy1005 haproxy: Backup Server mariadb/db1117:3325 is DOWN. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
May 02 14:55:25 dbproxy1005 haproxy: Server mariadb/db1073 is DOWN. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
May 02 14:55:25 dbproxy1005 haproxy: proxy mariadb has no server available!
May 02 15:04:22 dbproxy1005 haproxy: Backup Server mariadb/db1117:3325 is DOWN. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
May 02 15:04:22 dbproxy1005 haproxy: proxy mariadb has no server available!
Both servers were detected as down, so this is likely a network- or application-level issue on the proxy, not the databases.
This happened again; restarting the proxy, as I don't see a clear connection with max_connections. Network instability?
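For the record, a sketch of how one would rule max_connections in or out on the backends (standard MariaDB variables, nothing specific to these hosts):

SHOW GLOBAL VARIABLES LIKE 'max_connections';
SHOW GLOBAL STATUS LIKE 'Max_used_connections';
SHOW GLOBAL STATUS LIKE 'Threads_connected';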
Installed; implementation (provisioning) will be handled at T220572.
@Cmjohnson In case this is useful for you, I have documented how to enable ipmi on ilo5 from the web interface here: https://wikitech.wikimedia.org/w/index.php?title=Management_Interfaces&diff=1824940&oldid=1823217
Either DNS, remote IPMI, or the password may not be configured properly:
Error: Unable to establish IPMI v2 / RMCP+ session
11:23:36 | Unable to run wmf-auto-reimage: Remote IPMI failed for mgmt 'db1139.mgmt.eqiad.wmnet': Command '['ipmitool', '-I', 'lanplus', '-H', 'db1139.mgmt.eqiad.wmnet', '-U', 'root', '-E', 'chassis', 'power', 'status']' returned non-zero exit status 1
Trying to debug following the workbook.
db2102 is set up, pending data loading, which is being done now while testing the latest recover_dump.py version and the generated backup at the same time.
+1 to add namespace to the "title" table.
Tue, Apr 30
All tables that are being used in a master-replica database need to have a PK (preferably an auto_increment integer PK).
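As a sketch, with made-up table, column, and schema names, retrofitting a missing PK and finding the offenders looks like this:

-- Retrofit an auto_increment integer PK (names are hypothetical):
ALTER TABLE mytable
    ADD COLUMN mt_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    ADD PRIMARY KEY (mt_id);

-- List base tables without a primary key in a given schema:
SELECT t.table_schema, t.table_name
FROM information_schema.tables t
LEFT JOIN information_schema.table_constraints c
    ON  c.table_schema = t.table_schema
    AND c.table_name = t.table_name
    AND c.constraint_type = 'PRIMARY KEY'
WHERE t.table_schema = 'mydb'  -- made-up schema
  AND t.table_type = 'BASE TABLE'
  AND c.constraint_name IS NULL;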
Mon, Apr 29
98 and 99 done, although they need recompression (especially s3).
Sat, Apr 27
The deadlocks came back, so I don't think we can close this for now. I still do not think it is high priority, but it is an ongoing event: https://logstash.wikimedia.org/goto/e5a2230fb3cd9c90155d5391fa54a484
Based on T187153#5101883 and the rate at https://logstash.wikimedia.org/goto/c930467fddcf4aaa4d4c0f8f00838498, there are lots of hits of this on fiwiki right now. The logging is the blocker issue, not the actual problem.
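For whoever picks this up, a sketch of how to inspect the deadlocks on the database side (standard InnoDB facilities, not something already applied):

-- Log every deadlock to the error log instead of only keeping the last:
SET GLOBAL innodb_print_all_deadlocks = ON;

-- The most recent one is always visible under "LATEST DETECTED DEADLOCK" in:
SHOW ENGINE INNODB STATUS\G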
Recompression is ongoing on db2097, but technically it is done.
root@db2097:/srv$ mysql -A -BN -S /run/mysqld/mysqld.s6.sock \
    -e "select CONCAT(table_schema, '.', table_name) FROM information_schema.tables where table_schema like '%wik%' and engine='InnoDB' and row_format != 'COMPRESSED'" \
    | head -n 1 | while read table; do
        echo "$table...";
        mysql -S /run/mysqld/mysqld.s6.sock -e "set session sql_log_bin=0; ALTER TABLE $table row_format=COMPRESSED";
    done
frwiki.actor...
ERROR 1062 (23000) at line 1: Duplicate entry 'X.X.X.X' for key 'actor_name'
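A diagnostic sketch (illustrative, not something that was run): locating the duplicate values that make the unique-key rebuild fail during the ALTER:

-- Find actor_name values present more than once (table name taken
-- from the error above; run against the s6 instance):
SELECT actor_name, COUNT(*) AS copies
FROM frwiki.actor
GROUP BY actor_name
HAVING copies > 1;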
Fri, Apr 26
Thu, Apr 25
I suggest closing this rather than keeping it around if someone already checked it. No reason to keep a backlog item if it is not clearly actionable, and we can reopen if it reoccurs. Normally I don't create a ticket for this kind of error, but I was worried that it kept happening for days rather than hours or minutes.
This is now done; both servers are in production (although not with 100% of the final load, only all logical backups and one snapshot each). There is still the issue with permissions for free disk monitoring, which is important but not fatal. It will be fixed at the same time as the issues with the dbprov2* servers (T218336#5081898).
Wed, Apr 24
I think it is better to hardcode the constants in modules/profile/manifests/mariadb/ferm.pp (for now, not as an ideal situation) than to go for a multi-file refactoring commit without even notifying or seeking input from the code maintainer (note I also inherited that code).
Tue, Apr 23
(also not the same disk slot, so different issues and should be tracked separately)
@Harej My question is more: is the summary still accurate about the result of the conversations (e.g. ramp-up of 1%, bots technically not allowed, etc.)? If yes, no problem; if not, I was asking to update it to reflect the latest agreement.
wmf-mariadb103 doesn't exist, and even if it existed, it wouldn't work: we don't support it yet, as we found some bugs and we are not working on those at the moment. The plan is to support only wmf-mariadb101 for stretch, and wmf-mariadb103 for buster (we stopped supporting wmf-mariadb for ubuntu and wmf-mariadb10 for jessie). If you need a roadmap, please ask us.
Let's keep them separated for now; my bet is they are the same underlying issue, but the effects (Special:Log vs Special:Contributions) are different and they may need different resolutions (e.g. different query hints or indexes).
possible duplicate of T221380
Raising to Unbreak Now, because as far as I can see all edit hooks may be broken, causing long-lasting issues on the metadata. Change the priority if that is not true.