Description

The disks have been received (T143874#2890442), so we need to place them in dbstore1001 and reinstall it as jessie.
Backups should be moved to (and generated on) dbstore2001 to avoid downtime on them while installing and reimporting all the data.
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
linux-host-entries: Remove trusty from dbstore1001 | operations/puppet | production | +0 -2
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Marostegui | T153768 Install and reimage dbstore1001 as jessie
Resolved | | Cmjohnson | T158893 dbstore1001 troubleshoot IPMI issue
Event Timeline
I have agreed with Chris on the 23rd of Feb for the replacement. This week we have the labsdb failover on the 15th, plus Monday is a holiday in the US and many people will be away. That means we have until the 22nd to prepare a service replacement and copy all the data away.
Sounds good to me!
As we discussed, the idea is:
- enable backups on dbstore2001 (so we can test them on that "new" host and see how they go)
- copy the content of dbstore1001:/srv/ to a server with available space (e.g. external storage)
- reimage dbstore1001
- decide whether we want to leave the backups running on dbstore2001, while enabling backup generation on dbstore1001 again anyway.
@jcrespo after lots of thought on it, I think we should try to go for multi-instance rather than multi-source, for the following reasons (a minimal config sketch follows the list):
- we can have GTID enabled (which is something I don't foresee happening with multi-source any time soon: see https://jira.mariadb.org/browse/MDEV-12012?focusedCommentId=91865&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-91865)
- we can do maintenance on one shard without affecting others
- we can have different settings (if required) per shard.
- we can restore/recover/reimport a shard without the risk of importing tablespaces and crashing the whole InnoDB tablespace.
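For reference, a minimal sketch of what a multi-instance layout could look like using mysqld_multi. The section numbers, ports, paths and shard names below are illustrative placeholders, not the final production config:

# Hypothetical multi-instance my.cnf sketch: one MariaDB instance per
# shard, each with its own port, socket and datadir.
[mysqld_multi]
mysqld     = /opt/wmf-mariadb10/bin/mysqld_safe
mysqladmin = /opt/wmf-mariadb10/bin/mysqladmin

[mysqld1]
# s1 shard (placeholder values)
port    = 3311
socket  = /run/mysqld/mysqld.s1.sock
datadir = /srv/sqldata.s1

[mysqld2]
# s2 shard (placeholder values)
port    = 3312
socket  = /run/mysqld/mysqld.s2.sock
datadir = /srv/sqldata.s2

This is what enables per-shard maintenance: a single instance can be started or stopped without touching the others, e.g. mysqld_multi stop 2.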
Mentioned in SAL (#wikimedia-operations) [2017-02-20T13:35:10Z] <marostegui> Transferring dbstore1001:/srv/backups (the last 2 backups) to dbstore2001:/srv/backup/dbstore1001 - T153768
I have started the transfer of the last two generated backups (find -mtime -14) from dbstore1001:/srv/backups to dbstore2001:/srv/backup/dbstore1001.
It is being encrypted in transit too.
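A rough sketch of the kind of compress-encrypt-send pipeline used for this sort of cross-DC transfer; the port, cipher and key file are illustrative, not the exact commands run here, and the netcat flags vary by flavor:

# On the receiving host (dbstore2001): listen, decrypt, unpack.
nc -l -p 4444 \
  | openssl enc -d -aes-256-cbc -pass file:/root/transfer.key \
  | tar -C /srv/backup/dbstore1001 -xf -

# On the sending host (dbstore1001): pack the backups, encrypt, send.
tar -C /srv/backups -cf - . \
  | openssl enc -aes-256-cbc -pass file:/root/transfer.key \
  | nc -q 0 dbstore2001.codfw.wmnet 4444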
In order to back up dbstore1001:/srv/sqldata we'd need 4.3T available (probably less, as it would be compressed), but that is still quite a big amount of space.
The only hosts with enough space within eqiad are, I am afraid, the es10XX hosts.
Once today's backups have finished I will copy them to dbstore2001 as well, and start the transfer of /srv/sqldata to one of the es10XX servers within the same DC.
Mentioned in SAL (#wikimedia-operations) [2017-02-22T22:52:17Z] <jynus> stopping dbstore1001 mariadb in preparation for tomorrow's reimage T153768
Change 339351 had a related patch set uploaded (by Marostegui):
linux-host-entries: Remove trusty from dbstore1001
Change 339351 merged by Marostegui:
linux-host-entries: Remove trusty from dbstore1001
The data copy has finished, so we are good to go.
@Cmjohnson we would need you to install the disks and create the RAID10 for us if you can (last time I tried to do it myself it was impossible due to the latency). We'll power off the server for you once you are ready.
Once that is done, just leave it off and we will reimage the server.
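For reference, creating a RAID10 virtual drive on this controller family is typically a MegaCli one-liner along these lines. The enclosure:slot pairs and adapter number below are placeholders that must match the real disk layout:

# Hypothetical MegaCli RAID10 sketch: two spans of two disks each;
# extend the -ArrayN[E:S,E:S] list to cover all disks in the array.
megacli -CfgSpanAdd -r10 -Array0[32:0,32:1] -Array1[32:2,32:3] WB RA Direct CachedBadBBU -a0

# Verify the resulting virtual drive afterwards:
megacli -LDInfo -Lall -a0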
As far as I know, @Volans already fixed the issue with reimaging a powered-off server that we encountered some months ago when reinstalling dbstore2001 (T150448 - https://gerrit.wikimedia.org/r/#/c/321633/).
Mentioned in SAL (#wikimedia-operations) [2017-02-23T15:08:06Z] <marostegui> Power off dbstore1001 to change its disks and reimage - T153768
Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts:
['dbstore1001.eqiad.wmnet']
The log can be found in /var/log/wmf-auto-reimage/201702231550_marostegui_23247.log.
The server got reinstalled, unfortunately manually, as there were some issues with ipmitool (I will explain further in T150160, as this server was in that list of affected hosts).
Anyway, the server is now good and the data transfer from es1017 to dbstore1001 is flowing.
root@dbstore1001:/srv/sqldata# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 8.7 (jessie)
Release:        8.7
Codename:       jessie

Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 10.913 TB
Sector Size         : 512
Mirror Data         : 10.913 TB
State               : Optimal
Strip Size          : 256 KB

root@dbstore1001:/srv/sqldata# df -hT /srv/
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs    11T   39G   11T   1% /srv

root@dbstore1001:/srv/sqldata# dpkg -l | grep wmf-mariadb10
ii  wmf-mariadb10  10.0.29-2  amd64  MariaDB plus patches.

root@dbstore1001:/srv/sqldata# ps aux | grep salt-minion
root      1257  0.0  0.0 351304 49612 ?  Ssl  18:13   0:01 /usr/bin/python /usr/bin/salt-minion
Thanks @jcrespo @Volans and @Cmjohnson for all the help getting this host back to life!
dbstore1001 is looking good after the weekend! :-)
Let's wait until Wednesday to see how the backups get generated; if all goes well, I believe we can close this ticket.
The misc backups were generated:
root@dbstore1001:/srv/backups# ls -lh
total 39G
-rw-r--r-- 1 root root  16G Mar  1 01:59 m1-20170301010001.sql.gz
-rw-r--r-- 1 root root 162M Mar  1 02:01 m2-20170301015920.sql.gz
-rw-r--r-- 1 root root  20G Mar  1 02:23 m3-phabricator-phlegal-20170301020133.sql.gz
-rw-r--r-- 1 root root 617M Mar  1 02:26 m5-20170301022322.sql.gz
-rw-r--r-- 1 root root 3.1G Mar  1 02:40 x1-20170301022632.sql.gz
However, the rest of them were not (as a reminder, last week they were also not generated and we had to force them manually). I am checking the logs.
Looks like the job is queued (along with lots of other jobs), or at least Bacula reports so:
Running Jobs:
Console connected at 01-Mar-17 07:48
48872 Full dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups.2017-03-01_02.05.00_25 is waiting on max Storage jobs
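"is waiting on max Storage jobs" normally means the storage daemon's concurrency limit has been reached; the knob is the Maximum Concurrent Jobs directive in bacula-sd.conf. Roughly (the resource name and value below are illustrative, not our actual config):

# bacula-sd.conf on the storage host; value is illustrative.
Storage {
  Name = helium-sd
  Maximum Concurrent Jobs = 10   # queued jobs wait when this is exhausted
}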
Let's wait a bit longer.
At least, helium can connect to it (although it says nothing is running):
Connecting to Client dbstore1001.eqiad.wmnet-fd at dbstore1001.eqiad.wmnet:9102

dbstore1001.eqiad.wmnet-fd Version: 5.2.6 (21 February 2012) x86_64-pc-linux-gnu debian jessie/sid
Daemon started 23-Feb-17 19:18. Jobs: run=0 running=0.
 Heap: heap=270,336 smbytes=17,783 max_bytes=17,800 bufs=73 max_bufs=73
 Sizeof: boffset_t=8 size_t=8 debug=0 trace=0

Running Jobs:
Director connected at: 01-Mar-17 07:49
No Jobs running.
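For reference, that output comes from a standard client status query in bconsole:

echo "status client=dbstore1001.eqiad.wmnet-fd" | bconsole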
I am doubtful it is going to run at all.
If we look at the past events:
01-Feb 23:31 dbstore1001.eqiad.wmnet-fd JobId 46761: shell command: run ClientRunBeforeJob "/etc/bacula/scripts/predump"
01-Feb 23:31 helium.eqiad.wmnet JobId 46761: Start Backup JobId 46761, Job=dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups.2017-02-01_02.05.00_06
08-Feb 02:05 dbstore1001.eqiad.wmnet-fd JobId 47352: shell command: run ClientRunBeforeJob "/etc/bacula/scripts/predump"
08-Feb 02:05 helium.eqiad.wmnet JobId 47352: Start Backup JobId 47352, Job=dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups.2017-02-08_02.05.00_09
15-Feb 02:05 dbstore1001.eqiad.wmnet-fd JobId 47871: shell command: run ClientRunBeforeJob "/etc/bacula/scripts/predump"
15-Feb 02:05 helium.eqiad.wmnet JobId 47871: Start Backup JobId 47871, Job=dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups.2017-02-15_02.05.00_16
We can see that on the 1st of Feb (a Wednesday) the job started almost 24 hours after its scheduled time, whereas the following two weeks it ran fine.
Today it is not running yet; maybe we should wait until tomorrow morning to see whether it is just delayed or actually failing. Last week we forced it manually, as we needed the backups transferred and couldn't wait. But by giving it some more hours we might find out whether it is broken or just delayed.
The one from last week (21st of Feb) failed, and we were not sure whether this error had something to do with it, but it is the same JobId:
17-Feb 03:10 helium.eqiad.wmnet-fd JobId 47871: Recycled volume "production0016" on device "FileStorage1" (/srv/baculasd1), all previous data lost.
17-Feb 03:10 helium.eqiad.wmnet-fd JobId 47871: New volume "production0016" mounted on device "FileStorage1" (/srv/baculasd1) at 17-Feb-2017 03:10.
17-Feb 03:10 helium.eqiad.wmnet JobId 47871: Fatal error: catreq.c:586 attribute create error. Query failed: DROP TABLE DelCandidates: ERR=Unknown table 'bacula.DelCandidates'
21-Feb 13:55 helium.eqiad.wmnet JobId 47871: Error: Watchdog sending kill after 518405 secs to thread stalled reading File daemon.
21-Feb 13:55 helium.eqiad.wmnet JobId 47871: Fatal error: Network error with FD during Backup: ERR=Resource temporarily unavailable
21-Feb 13:55 helium.eqiad.wmnet JobId 47871: Fatal error: No Job status returned from FD.
21-Feb 13:55 helium.eqiad.wmnet JobId 47871: Error: Bacula helium.eqiad.wmnet 5.2.5 (26Jan12):
  Build OS:               x86_64-pc-linux-gnu ubuntu 12.04
  JobId:                  47871
  Job:                    dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups.2017-02-15_02.05.00_16
  Backup Level:           Full
  Client:                 "dbstore1001.eqiad.wmnet-fd" 5.2.6 (21Feb12) x86_64-pc-linux-gnu,ubuntu,14.04
  FileSet:                "mysql-srv-backups" 2015-01-28 02:05:00
  Pool:                   "production" (From Job resource)
  Catalog:                "production" (From Client resource)
  Storage:                "helium-FileStorage1" (From Pool resource)
  Scheduled time:         15-Feb-2017 02:05:00
  Start time:             15-Feb-2017 13:56:09
  End time:               21-Feb-2017 13:55:55
  Elapsed time:           5 days 23 hours 59 mins 46 secs
  Priority:               10
  FD Files Written:       0
  SD Files Written:       0
  FD Bytes Written:       0 (0 B)
  SD Bytes Written:       0 (0 B)
  Rate:                   0.0 KB/s
  Software Compression:   None
  VSS:                    no
  Encryption:             no
  Accurate:               no
  Volume name(s):         production0017|production0015|production0041|production0016
  Volume Session Id:      4502
  Volume Session Time:    1479834296
  Last Volume Bytes:      102,489,069 (102.4 MB)
  Non-fatal FD errors:    2
  SD Errors:              0
  FD termination status:  Error
  SD termination status:  Running
  Termination:            *** Backup Error ***
The current JobId assigned to this week's backups is 48872.
But grepping in the log for that or something related to it shows nothing so far:
root@helium:/var/log/bacula# grep -iR "48872" *
root@helium:/var/log/bacula#
Looking at other jobs, it seems they are quite delayed too:
--
Scheduled time:         26-Feb-2017 04:05:02
Start time:             28-Feb-2017 12:27:04
--
Scheduled time:         28-Feb-2017 04:05:01
Start time:             28-Feb-2017 12:27:16
--
Scheduled time:         26-Feb-2017 04:05:03
Start time:             28-Feb-2017 12:27:18
--
Scheduled time:         28-Feb-2017 04:05:01
Start time:             28-Feb-2017 12:27:30
If by the end of the day the backup hasn't started yet, I will run it manually.
No jobs from March have finished just yet:
root@helium:/var/log/bacula# cat log.1 | grep -A1 "Scheduled" | grep Mar
root@helium:/var/log/bacula#
root@helium:/var/log/bacula# cat log.1 | grep -A1 "Start" | grep Mar
root@helium:/var/log/bacula#
root@helium:~$ echo "list jobs" | bconsole | head -n 10
Connecting to Director helium.eqiad.wmnet:9101
1000 OK: helium.eqiad.wmnet Version: 5.2.5 (26 January 2012)
Enter a period to cancel a command.
list jobs
Automatically selected Catalog: production
Using Catalog "production"
+--------+------------------------------------------------------------------+---------------------+------+-------+----------+-------------------+-----------+
| JobId  | Name                                                             | StartTime           | Type | Level | JobFiles | JobBytes          | JobStatus |
+--------+------------------------------------------------------------------+---------------------+------+-------+----------+-------------------+-----------+
| 22,251 | Offsite Job                                                      | 0000-00-00 00:00:00 | c    | F     | 0        | 0                 | f         |

root@helium:~$ echo "list jobs" | bconsole | grep dbstore1001 | grep 2017
| 44,830 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups | 2017-01-08 21:09:46 | B    | F     | 1,819    | 1,177,430,260,044 | T         |
| 45,310 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups | 2017-01-11 02:05:00 | B    | F     | 0        | 0                 | f         |
| 45,805 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups | 2017-01-18 02:05:00 | B    | F     | 0        | 0                 | f         |
| 46,287 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups | 2017-01-25 02:05:00 | B    | F     | 0        | 0                 | f         |
| 46,353 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups | 2017-01-26 02:58:46 | B    | I     | 0        | 0                 | f         |
| 46,761 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups | 2017-02-02 12:10:35 | B    | F     | 1,826    | 1,232,270,897,726 | T         |
| 47,352 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups | 2017-02-08 13:42:10 | B    | F     | 2,730    | 1,793,193,161,486 | T         |
| 47,871 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups | 2017-02-15 13:56:09 | B    | F     | 0        | 0                 | f         |
Exactly: so far in Feb they never took this much time to start:
root@helium:~# cal -m 2
    February 2017
Su Mo Tu We Th Fr Sa
          1  2  3  4
 5  6  7  8  9 10 11
12 13 14 15 16 17 18
19 20 21 22 23 24 25
26 27 28
Let's keep an eye on them today...
They are definitely not happening so we will need to run them manually on Monday:
root@helium:/var/log/bacula# echo "list jobs" | bconsole | grep dbstore1001 | grep 2017
| 44,830 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups | 2017-01-08 21:09:46 | B | F | 1,819 | 1,177,430,260,044 | T |
| 45,310 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups | 2017-01-11 02:05:00 | B | F | 0 | 0 | f |
| 45,805 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups | 2017-01-18 02:05:00 | B | F | 0 | 0 | f |
| 46,287 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups | 2017-01-25 02:05:00 | B | F | 0 | 0 | f |
| 46,353 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups | 2017-01-26 02:58:46 | B | I | 0 | 0 | f |
| 46,761 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups | 2017-02-02 12:10:35 | B | F | 1,826 | 1,232,270,897,726 | T |
| 47,352 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups | 2017-02-08 13:42:10 | B | F | 2,730 | 1,793,193,161,486 | T |
| 47,871 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups | 2017-02-15 13:56:09 | B | F | 0 | 0 | f |
This is the status of the JobId assigned to the dbstore1001 backup:
status jobid=48872

Device status:
Device "FileStorage1" (/srv/baculasd1) is not open.
    Device is BLOCKED waiting to create a volume for:
       Pool:        production
       Media type:  File
Device "FileStorage2" (/srv/baculasd2) is not open.
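A device BLOCKED "waiting to create a volume" usually means the pool has no appendable volume left. Standard bconsole commands to check and, if appropriate, fix that (a sketch; the volume name is just the one from the error log above):

# List the volumes in the pool and their status (Append/Full/Error/Read-Only):
echo "list volumes pool=production" | bconsole

# If a usable volume is wrongly marked read-only/error, reset its status:
echo "update volume=production0016 volstatus=Append" | bconsole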
baculasd2 actually looks quite small:
root@helium:/srv# df -hT
Filesystem                    Type      Size  Used Avail Use% Mounted on
/dev/md0                      ext3      390G  6.6G  364G   2% /
udev                          devtmpfs  3.9G  4.0K  3.9G   1% /dev
tmpfs                         tmpfs     798M  336K  798M   1% /run
none                          tmpfs     5.0M     0  5.0M   0% /run/lock
none                          tmpfs     3.9G     0  3.9G   0% /run/shm
/dev/mapper/bacula-baculasd2  ext4      886G  528G  359G  60% /srv/baculasd2
/dev/mapper/bacula-baculasd1  ext4       30T   25T  4.0T  87% /srv/baculasd1
root@helium:/srv# ls -lh baculasd2
total 528G
-rw-r----- 1 bacula tape 500G Oct  8  2015 archive0003
-rw-r----- 1 bacula tape  28G Oct  8  2015 archive0055
-rw-r--r-- 1 root   root  63K Apr 19  2016 lala
drwx------ 2 root   root  16K Jul 13  2015 lost+found
Update - @Joe is kindly helping, and we have seen a few errors which suggest that the storage itself might be having issues:
[52078943.954044] ata1.00: BMDMA stat 0x24
[52078943.957955] ata1.00: failed command: READ DMA EXT
[52078943.962968] ata1.00: cmd 25/00:00:00:11:00/00:04:00:00:00/e0 tag 0 dma 524288 in
[52078943.962969]          res 51/40:e4:1c:11:00/40:03:00:00:00/e0 Emask 0x9 (media error)
[52078943.978756] ata1.00: status: { DRDY ERR }
[52078943.983102] ata1.00: error: { UNC }
[52078944.065909] ata1.00: configured for UDMA/133
[52078944.066038] ata1: EH complete
[52083624.809231] md: md0: data-check done.
and:
05-Mar 20:16 helium.eqiad.wmnet-fd JobId 48861: Fatal error: block.c:439 Attempt to write on read-only Volume. dev="FileStorage1" (/srv/baculasd1)
For the record: the broken disk above looks like sda, which is not part of the bacula volume but of /.
Although mdstat doesn't see it as broken:
root@helium:/var/log/bacula# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb1[1] sda1[0]
      415037304 blocks super 1.2 [2/2] [UU]
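Since md considers the mirror healthy but the kernel logs media errors, checking the drive's own SMART data would be the natural next step; a standard smartmontools check, for example:

# Overall health verdict and on-drive error log for the suspect disk:
smartctl -H /dev/sda
smartctl -l error /dev/sda

# Reallocated/pending sector attributes are the usual smoking gun:
smartctl -A /dev/sda | grep -Ei 'realloc|pending|uncorrect'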
I am manually executing the "predump" script on dbstore1001 in a root screen session called "dumps", so that we at least have a local copy of the backups until we fix the Bacula issue.
I will wait for it to finish on dbstore1001 before running it on dbstore2001, as I don't want the script to first delete all the backups we copied over from dbstore1001 and then run; in that case we would not have local copies anywhere. Let's make sure we always have at least one local copy, either on dbstore1001 or dbstore2001 :-)
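For reference, a detached root screen session for a long-running dump can be started and checked like this (a sketch, using the predump path that appears in the Bacula logs above):

# Start a detached screen session named "dumps" running the predump script:
screen -dmS dumps /etc/bacula/scripts/predump

# Reattach later to check on progress:
screen -r dumps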
There are now two backups happening on dbstore1001, one from 201703061505 and another from 201703061552: one from the screen session, the other from bacula-fd :-/
I saw no entries on tendril, so I assumed the backups that started running yesterday had finished. I wanted to check whether they were OK on Bacula, and I saw this:
Automatically selected FileSet: mysql-srv-backups
+--------+-------+----------+-------------------+---------------------+----------------+
| JobId  | Level | JobFiles | JobBytes          | StartTime           | VolumeName     |
+--------+-------+----------+-------------------+---------------------+----------------+
| 47,352 | F     | 2,730    | 1,793,193,161,486 | 2017-02-08 13:42:10 | production0023 |
| 47,352 | F     | 2,730    | 1,793,193,161,486 | 2017-02-08 13:42:10 | production0037 |
| 47,352 | F     | 2,730    | 1,793,193,161,486 | 2017-02-08 13:42:10 | production0038 |
| 47,352 | F     | 2,730    | 1,793,193,161,486 | 2017-02-08 13:42:10 | production0039 |
| 47,352 | F     | 2,730    | 1,793,193,161,486 | 2017-02-08 13:42:10 | production0040 |
+--------+-------+----------+-------------------+---------------------+----------------+
That date is obviously not March. @jcrespo how do you normally verify them on bacula?
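For reference, one standard way to check from bconsole is to list the jobs recorded for the client and look at the StartTime and JobStatus columns (T = terminated OK, f = fatal):

echo "list jobs client=dbstore1001.eqiad.wmnet-fd" | bconsole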
The local copies on dbstore1001 do look ok to me:
root@dbstore1001:/srv/backups# find . -size 0
root@dbstore1001:/srv/backups#
root@dbstore1001:/srv/backups# ls -lh enwiki-* commonswiki-2017030615*
-rw-r--r-- 1 root root 72G Mar  6 23:14 commonswiki-201703061505.sql.gz
-rw-r----- 1 root root 72G Mar  7 00:37 commonswiki-201703061552.sql.gz
-rw-r--r-- 1 root root 87G Mar  7 02:31 enwiki-201703061505.sql.gz
-rw-r----- 1 root root 87G Mar  7 03:58 enwiki-201703061552.sql.gz
Good news: the backups were scheduled to run today at 2AM and the first ones started around 45 minutes ago! :-)
So I believe we are in good shape here; let's let them finish and finally close this ticket!
The most recent backups for dbstore1001 are still only from Feb, not March:
+--------+-------+----------+-------------------+---------------------+----------------+
| JobId  | Level | JobFiles | JobBytes          | StartTime           | VolumeName     |
+--------+-------+----------+-------------------+---------------------+----------------+
| 47,352 | F     | 2,730    | 1,793,193,161,486 | 2017-02-08 13:42:10 | production0023 |
| 47,352 | F     | 2,730    | 1,793,193,161,486 | 2017-02-08 13:42:10 | production0037 |
| 47,352 | F     | 2,730    | 1,793,193,161,486 | 2017-02-08 13:42:10 | production0038 |
| 47,352 | F     | 2,730    | 1,793,193,161,486 | 2017-02-08 13:42:10 | production0039 |
| 47,352 | F     | 2,730    | 1,793,193,161,486 | 2017-02-08 13:42:10 | production0040 |
+--------+-------+----------+-------------------+---------------------+----------------+
Although it looks like we still have the job running:
Running Jobs:
Console connected at 10-Mar-17 10:02
Console connected at 10-Mar-17 10:03
 JobId Level   Name                                                                                    Status
======================================================================
 49465 Full    dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups.2017-03-08_09.54.47_16  is running
====
I think we are good to go now \o/:
Automatically selected FileSet: mysql-srv-backups
+--------+-------+----------+-------------------+---------------------+----------------+
| JobId  | Level | JobFiles | JobBytes          | StartTime           | VolumeName     |
+--------+-------+----------+-------------------+---------------------+----------------+
| 49,465 | F     | 2,786    | 1,813,206,487,408 | 2017-03-10 02:39:04 | production0001 |
| 49,465 | F     | 2,786    | 1,813,206,487,408 | 2017-03-10 02:39:04 | production0004 |
| 49,465 | F     | 2,786    | 1,813,206,487,408 | 2017-03-10 02:39:04 | production0005 |
| 49,465 | F     | 2,786    | 1,813,206,487,408 | 2017-03-10 02:39:04 | production0006 |
| 49,465 | F     | 2,786    | 1,813,206,487,408 | 2017-03-10 02:39:04 | production0007 |
+--------+-------+----------+-------------------+---------------------+----------------+
$ ls enwiki-*
enwiki-201703061505.sql.gz  enwiki-201703061552.sql.gz  enwiki-201703081217.sql.gz
The duplicate for the 6th is probably due to the fact that we ran the backups manually, and then a few hours later the scheduled job started after @akosiaris fixed dbstore1001.
The backup from the 8th is the one that ran normally as scheduled, as can be seen at T153768#3090617.
Thanks everyone!!!