Install and reimage dbstore1001 as jessie
Closed, Resolved · Public

Description

The disks have been received (T143874#2890442), so we need to place them in dbstore1001 and reinstall it as jessie.
Backups should be moved to (and generated on) dbstore2001, to avoid backup downtime while dbstore1001 is being reinstalled and all the data reimported.

Marostegui created this object in space Restricted Space.
Restricted Application added a project: procurement. · Dec 20 2016, 5:50 PM
Restricted Application added a subscriber: Aklapper.
Marostegui removed a parent task: Unknown Object (Task). · Dec 20 2016, 5:51 PM
Restricted Application added a project: Operations. · Dec 20 2016, 5:51 PM
Marostegui shifted this object from the Restricted Space space to the S1 Public space. · Dec 20 2016, 5:52 PM
Marostegui removed projects: Operations, procurement.
Marostegui updated the task description.
Marostegui moved this task from Triage to Next on the DBA board. · Jan 9 2017, 7:48 PM

I have agreed on 23 Feb for the replacement with Chris. This week we have the labsdb failover on the 15th, plus Monday is a holiday in the US and many people will be away. That means we have until the 22nd to prepare a service replacement and copy away all the data.

Sounds good to me!
As we spoke the idea is:

  • enable backups on dbstore2001 (so we can test them on that "new" host and see how they go)
  • copy the content of dbstore1001:/srv/ to one server with available space (ie: external storage)
  • reimage dbstore1001
  • decide whether we want to leave the backups running on dbstore2001, but enable backup generation on dbstore1001 anyway.

@jcrespo after lots of thought on it, I think we should try to go for multi instance rather than multi source for the following reasons:

Mentioned in SAL (#wikimedia-operations) [2017-02-20T13:35:10Z] <marostegui> Transferring dbstore1001:/srv/backups (the last 2 backups) to dbstore2001:/srv/backup/dbstore1001 - T153768

I have started the transfer of dbstore1001:/srv/backups (the last two generated backups, i.e. find -mtime -14) to dbstore2001:/srv/backup/dbstore1001.
The transfer is being encrypted too.
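The task doesn't record the exact transfer command, but an encrypted streaming copy of this kind can be sketched as below. Everything here is a hypothetical stand-in (file names, passphrase); in production the stream would be piped over nc or ssh rather than into a local file:

```shell
set -e
# Hypothetical stand-ins for the real backup files and passphrase.
src=$(mktemp -d); dst=$(mktemp -d); pass=$(mktemp); enc=$(mktemp)
echo "dummy sql dump" > "$src/m1-sample.sql"
echo "not-the-real-passphrase" > "$pass"

# "Sender" side: tar the backup directory and encrypt the stream symmetrically
# (in production this would be piped straight to the remote host, not a file).
tar -C "$src" -c . | openssl enc -aes-256-cbc -pbkdf2 -pass file:"$pass" > "$enc"

# "Receiver" side: decrypt and unpack.
openssl enc -d -aes-256-cbc -pbkdf2 -pass file:"$pass" < "$enc" | tar -C "$dst" -x

cmp "$src/m1-sample.sql" "$dst/m1-sample.sql" && echo "round-trip OK"
```

Streaming through the encryption step keeps the data unreadable in transit without ever needing a second full copy on disk.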

In order to back up dbstore1001:/srv/sqldata we'd need 4.3T available (probably less, as it would be compressed), but that is still quite a large amount of space.
The only hosts with enough space within eqiad are, unfortunately, the es10XX hosts.

Almost done: 800GB out of 1.1T (in 17 hours).
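For a rough sense of the remaining time, the throughput implied by those figures can be computed (1.1T taken as roughly 1126 GB; all numbers are approximations from the comment above):

```shell
copied_gb=800; total_gb=1126; hours=17      # 1.1T ≈ 1126 GB
# Average throughput in MB/s (GB copied * 1024 MB/GB, over elapsed seconds).
rate=$(awk -v c="$copied_gb" -v h="$hours" 'BEGIN {printf "%.1f", c*1024/(h*3600)}')
# Estimated hours left at the same rate.
eta=$(awk -v c="$copied_gb" -v t="$total_gb" -v h="$hours" 'BEGIN {printf "%.1f", (t-c)/(c/h)}')
echo "rate=${rate} MB/s, eta=${eta}h"       # roughly 13 MB/s, about 7 hours left
```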

Once today's backups have finished I will copy them to dbstore2001 as well, and start the transfer of /srv/sqldata to one of the esXX servers within the same DC.

Mentioned in SAL (#wikimedia-operations) [2017-02-22T22:52:17Z] <jynus> stopping dbstore1001 mariadb in preparation for tomorrow's reimage T153768

Change 339351 had a related patch set uploaded (by Marostegui):
linux-host-entries: Remove trusty from dbstore1001

https://gerrit.wikimedia.org/r/339351

Change 339351 merged by Marostegui:
linux-host-entries: Remove trusty from dbstore1001

https://gerrit.wikimedia.org/r/339351

The data copy has been finished, so we'd be good to go.

@Cmjohnson we would need you to install the disks and, if you can, create the RAID10 for us (last time I tried to do it myself it was impossible due to the latency). Once you are ready, we'd power off the server for you.
Once that is done, just leave it off and we will reimage the server.

As far as I know, @Volans already fixed the issue with reimaging a powered-off server that we encountered some months ago when reinstalling dbstore2001 (T150448 - https://gerrit.wikimedia.org/r/#/c/321633/).

Please give me 5 additional minutes to check that we have copied everything we wanted.

Sure, no rush. I was not expecting Chris to be online in the next hour or so :)

Mentioned in SAL (#wikimedia-operations) [2017-02-23T15:08:06Z] <marostegui> Power off dbstore1001 to change its disks and reimage - T153768

Script wmf_auto_reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts:

['dbstore1001.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201702231550_marostegui_23247.log.

Server got reinstalled, unfortunately manually, as there were some issues with ipmitool (I will explain further in T150160, as that server was on the list of affected hosts).
Anyway, the server is now good and the data transfer from es1017 is flowing to dbstore1001.

root@dbstore1001:/srv/sqldata# lsb_release -a
No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 8.7 (jessie)
Release:	8.7
Codename:	jessie


Name                :
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
Size                : 10.913 TB
Sector Size         : 512
Mirror Data         : 10.913 TB
State               : Optimal
Strip Size          : 256 KB


root@dbstore1001:/srv/sqldata# df -hT /srv/
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs    11T   39G   11T   1% /srv

root@dbstore1001:/srv/sqldata# dpkg -l | grep wmf-mariadb10
ii  wmf-mariadb10                  10.0.29-2                  amd64        MariaDB plus patches.

root@dbstore1001:/srv/sqldata# ps aux | grep salt-minion
root      1257  0.0  0.0 351304 49612 ?        Ssl  18:13   0:01 /usr/bin/python /usr/bin/salt-minion

Thanks @jcrespo @Volans and @Cmjohnson for all the help getting this host back to life!

Marostegui mentioned this in Unknown Object (Task). · Feb 24 2017, 9:20 AM

dbstore1001 is looking good after the weekend! :-)
Let's wait until Wednesday to see how the backups get generated; if all goes well, I believe we can close this ticket.

The misc backups were generated:

root@dbstore1001:/srv/backups# ls -lh
total 39G
-rw-r--r-- 1 root root  16G Mar  1 01:59 m1-20170301010001.sql.gz
-rw-r--r-- 1 root root 162M Mar  1 02:01 m2-20170301015920.sql.gz
-rw-r--r-- 1 root root  20G Mar  1 02:23 m3-phabricator-phlegal-20170301020133.sql.gz
-rw-r--r-- 1 root root 617M Mar  1 02:26 m5-20170301022322.sql.gz
-rw-r--r-- 1 root root 3.1G Mar  1 02:40 x1-20170301022632.sql.gz

However, the rest of them were not (as a reminder, last week they were also not generated and we had to force them manually). I am checking the logs.

Looks like the job is running (along with lots of other jobs), or at least bacula reports so:

Running Jobs:
Console connected at 01-Mar-17 07:48
 48872 Full    dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups.2017-03-01_02.05.00_25 is waiting on max Storage jobs

Let's wait a bit longer.

At least, helium can connect to it (although it says nothing is running):

Connecting to Client dbstore1001.eqiad.wmnet-fd at dbstore1001.eqiad.wmnet:9102

dbstore1001.eqiad.wmnet-fd Version: 5.2.6 (21 February 2012)  x86_64-pc-linux-gnu debian jessie/sid
Daemon started 23-Feb-17 19:18. Jobs: run=0 running=0.
 Heap: heap=270,336 smbytes=17,783 max_bytes=17,800 bufs=73 max_bufs=73
 Sizeof: boffset_t=8 size_t=8 debug=0 trace=0
Running Jobs:
Director connected at: 01-Mar-17 07:49
No Jobs running.

I am doubtful whether it is going to run sometime.
If we look at the past events:

01-Feb 23:31 dbstore1001.eqiad.wmnet-fd JobId 46761: shell command: run ClientRunBeforeJob "/etc/bacula/scripts/predump"
01-Feb 23:31 helium.eqiad.wmnet JobId 46761: Start Backup JobId 46761, Job=dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups.2017-02-01_02.05.00_06
08-Feb 02:05 dbstore1001.eqiad.wmnet-fd JobId 47352: shell command: run ClientRunBeforeJob "/etc/bacula/scripts/predump"
08-Feb 02:05 helium.eqiad.wmnet JobId 47352: Start Backup JobId 47352, Job=dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups.2017-02-08_02.05.00_09
15-Feb 02:05 dbstore1001.eqiad.wmnet-fd JobId 47871: shell command: run ClientRunBeforeJob "/etc/bacula/scripts/predump"
15-Feb 02:05 helium.eqiad.wmnet JobId 47871: Start Backup JobId 47871, Job=dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups.2017-02-15_02.05.00_16

We can see that on the 1st of Feb (Wednesday) the job started almost 24 hours after its scheduled time, whereas the following two weeks it ran fine.
Today it is not yet running; maybe we should wait until tomorrow morning to see whether it is just delayed or actually failing. Last week we forced it manually, as we needed the backups transferred and couldn't wait. By giving it some more hours we might find out whether it is broken or just delayed.

The one from last week (21st of Feb) failed, and we were not sure whether this error had something to do with it, but it is the same JobId:

17-Feb 03:10 helium.eqiad.wmnet-fd JobId 47871: Recycled volume "production0016" on device "FileStorage1" (/srv/baculasd1), all previous data lost.
17-Feb 03:10 helium.eqiad.wmnet-fd JobId 47871: New volume "production0016" mounted on device "FileStorage1" (/srv/baculasd1) at 17-Feb-2017 03:10.
17-Feb 03:10 helium.eqiad.wmnet JobId 47871: Fatal error: catreq.c:586 attribute create error. Query failed: DROP TABLE DelCandidates: ERR=Unknown table 'bacula.DelCandidates'
21-Feb 13:55 helium.eqiad.wmnet JobId 47871: Error: Watchdog sending kill after 518405 secs to thread stalled reading File daemon.
21-Feb 13:55 helium.eqiad.wmnet JobId 47871: Fatal error: Network error with FD during Backup: ERR=Resource temporarily unavailable
21-Feb 13:55 helium.eqiad.wmnet JobId 47871: Fatal error: No Job status returned from FD.
21-Feb 13:55 helium.eqiad.wmnet JobId 47871: Error: Bacula helium.eqiad.wmnet 5.2.5 (26Jan12):
  Build OS:               x86_64-pc-linux-gnu ubuntu 12.04
  JobId:                  47871
  Job:                    dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups.2017-02-15_02.05.00_16
  Backup Level:           Full
  Client:                 "dbstore1001.eqiad.wmnet-fd" 5.2.6 (21Feb12) x86_64-pc-linux-gnu,ubuntu,14.04
  FileSet:                "mysql-srv-backups" 2015-01-28 02:05:00
  Pool:                   "production" (From Job resource)
  Catalog:                "production" (From Client resource)
  Storage:                "helium-FileStorage1" (From Pool resource)
  Scheduled time:         15-Feb-2017 02:05:00
  Start time:             15-Feb-2017 13:56:09
  End time:               21-Feb-2017 13:55:55
  Elapsed time:           5 days 23 hours 59 mins 46 secs
  Priority:               10
  FD Files Written:       0
  SD Files Written:       0
  FD Bytes Written:       0 (0 B)
  SD Bytes Written:       0 (0 B)
  Rate:                   0.0 KB/s
  Software Compression:   None
  VSS:                    no
  Encryption:             no
  Accurate:               no
  Volume name(s):         production0017|production0015|production0041|production0016
  Volume Session Id:      4502
  Volume Session Time:    1479834296
  Last Volume Bytes:      102,489,069 (102.4 MB)
  Non-fatal FD errors:    2
  SD Errors:              0
  FD termination status:  Error
  SD termination status:  Running
  Termination:            *** Backup Error ***

The current JobId assigned to this week's backups is 48872.
But grepping the logs for it, or anything related to it, shows nothing so far:

root@helium:/var/log/bacula# grep -iR "48872" *
root@helium:/var/log/bacula#

Looking at other jobs, it looks like they are quite delayed too:

--
  Scheduled time:         26-Feb-2017 04:05:02
  Start time:             28-Feb-2017 12:27:04
--
  Scheduled time:         28-Feb-2017 04:05:01
  Start time:             28-Feb-2017 12:27:16
--
  Scheduled time:         26-Feb-2017 04:05:03
  Start time:             28-Feb-2017 12:27:18
--
  Scheduled time:         28-Feb-2017 04:05:01
  Start time:             28-Feb-2017 12:27:30
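The scheduled-vs-start gap above can be quantified with GNU date (a small sketch; the timestamp format is the one bacula prints, and `tr` rewrites it into a form GNU date parses reliably):

```shell
# Hours between two bacula-style timestamps such as "26-Feb-2017 04:05:02".
delay_hours() {
  local s e
  s=$(date -d "$(echo "$1" | tr '-' ' ')" +%s)   # "26 Feb 2017 04:05:02" parses with GNU date
  e=$(date -d "$(echo "$2" | tr '-' ' ')" +%s)
  echo $(( (e - s) / 3600 ))
}
delay_hours "26-Feb-2017 04:05:02" "28-Feb-2017 12:27:04"   # over two days late
```

Applied to the first pair above, the job started more than 56 hours after its scheduled time.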

If the backup hasn't started by the end of the day, I will run it manually.
There are no finished jobs from March just yet:

root@helium:/var/log/bacula# cat log.1 | grep -A1 "Scheduled" | grep Mar
root@helium:/var/log/bacula#
root@helium:/var/log/bacula# cat log.1 | grep -A1 "Start" | grep Mar
root@helium:/var/log/bacula#
root@helium:~$ echo "list jobs" | bconsole | head -n 10
Connecting to Director helium.eqiad.wmnet:9101
1000 OK: helium.eqiad.wmnet Version: 5.2.5 (26 January 2012)
Enter a period to cancel a command.
list jobs
Automatically selected Catalog: production
Using Catalog "production"
+--------+------------------------------------------------------------------------------------+---------------------+------+-------+-----------+-------------------+-----------+
| JobId  | Name                                                                               | StartTime           | Type | Level | JobFiles  | JobBytes          | JobStatus |
+--------+------------------------------------------------------------------------------------+---------------------+------+-------+-----------+-------------------+-----------+
| 22,251 | Offsite Job                                                                        | 0000-00-00 00:00:00 | c    | F     |         0 |                 0 | f         |
root@helium:~$ echo "list jobs" | bconsole | grep dbstore1001 | grep 2017
| 44,830 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups                    | 2017-01-08 21:09:46 | B    | F     |     1,819 | 1,177,430,260,044 | T         |
| 45,310 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups                    | 2017-01-11 02:05:00 | B    | F     |         0 |                 0 | f         |
| 45,805 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups                    | 2017-01-18 02:05:00 | B    | F     |         0 |                 0 | f         |
| 46,287 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups                    | 2017-01-25 02:05:00 | B    | F     |         0 |                 0 | f         |
| 46,353 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups                    | 2017-01-26 02:58:46 | B    | I     |         0 |                 0 | f         |
| 46,761 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups                    | 2017-02-02 12:10:35 | B    | F     |     1,826 | 1,232,270,897,726 | T         |
| 47,352 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups                    | 2017-02-08 13:42:10 | B    | F     |     2,730 | 1,793,193,161,486 | T         |
| 47,871 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups                    | 2017-02-15 13:56:09 | B    | F     |         0 |                 0 | f         |

Exactly: so far in Feb the jobs never took this long to start:

root@helium:~# cal -m 2
   February 2017
Su Mo Tu We Th Fr Sa
          1  2  3  4
 5  6  7  8  9 10 11
12 13 14 15 16 17 18
19 20 21 22 23 24 25
26 27 28

Let's keep an eye on them today...

Marostegui moved this task from Next to In progress on the DBA board. · Mar 3 2017, 3:58 PM

They are definitely not happening, so we will need to run them manually on Monday:

root@helium:/var/log/bacula# echo "list jobs" | bconsole | grep dbstore1001 | grep 2017
| 44,830 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups                    | 2017-01-08 21:09:46 | B    | F     |     1,819 | 1,177,430,260,044 | T         |
| 45,310 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups                    | 2017-01-11 02:05:00 | B    | F     |         0 |                 0 | f         |
| 45,805 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups                    | 2017-01-18 02:05:00 | B    | F     |         0 |                 0 | f         |
| 46,287 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups                    | 2017-01-25 02:05:00 | B    | F     |         0 |                 0 | f         |
| 46,353 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups                    | 2017-01-26 02:58:46 | B    | I     |         0 |                 0 | f         |
| 46,761 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups                    | 2017-02-02 12:10:35 | B    | F     |     1,826 | 1,232,270,897,726 | T         |
| 47,352 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups                    | 2017-02-08 13:42:10 | B    | F     |     2,730 | 1,793,193,161,486 | T         |
| 47,871 | dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups                    | 2017-02-15 13:56:09 | B    | F     |         0 |                 0 | f         |

Yes, but let's try forcing/managing bacula first; local backups are not that useful.

jcrespo raised the priority of this task from Normal to High. · Mar 5 2017, 12:26 PM

This is the status of the jobid assigned to dbstore1001 backup:

status jobid=48872

Device status:
Device "FileStorage1" (/srv/baculasd1) is not open.
    Device is BLOCKED waiting to create a volume for:
       Pool:        production
       Media type:  File
Device "FileStorage2" (/srv/baculasd2) is not open.

baculasd2 actually looks quite small:

root@helium:/srv# df -hT
Filesystem                   Type      Size  Used Avail Use% Mounted on
/dev/md0                     ext3      390G  6.6G  364G   2% /
udev                         devtmpfs  3.9G  4.0K  3.9G   1% /dev
tmpfs                        tmpfs     798M  336K  798M   1% /run
none                         tmpfs     5.0M     0  5.0M   0% /run/lock
none                         tmpfs     3.9G     0  3.9G   0% /run/shm
/dev/mapper/bacula-baculasd2 ext4      886G  528G  359G  60% /srv/baculasd2
/dev/mapper/bacula-baculasd1 ext4       30T   25T  4.0T  87% /srv/baculasd1
root@helium:/srv# ls -lh baculasd2
total 528G
-rw-r----- 1 bacula tape 500G Oct  8  2015 archive0003
-rw-r----- 1 bacula tape  28G Oct  8  2015 archive0055
-rw-r--r-- 1 root   root  63K Apr 19  2016 lala
drwx------ 2 root   root  16K Jul 13  2015 lost+found
Marostegui added a subscriber: Joe.

Update - @Joe is kindly helping, and we have seen a few errors which suggest that the storage itself might be having issues:

[52078943.954044] ata1.00: BMDMA stat 0x24
[52078943.957955] ata1.00: failed command: READ DMA EXT
[52078943.962968] ata1.00: cmd 25/00:00:00:11:00/00:04:00:00:00/e0 tag 0 dma 524288 in
[52078943.962969]          res 51/40:e4:1c:11:00/40:03:00:00:00/e0 Emask 0x9 (media error)
[52078943.978756] ata1.00: status: { DRDY ERR }
[52078943.983102] ata1.00: error: { UNC }
[52078944.065909] ata1.00: configured for UDMA/133
[52078944.066038] ata1: EH complete
[52083624.809231] md: md0: data-check done.

and:

05-Mar 20:16 helium.eqiad.wmnet-fd JobId 48861: Fatal error: block.c:439 Attempt to write on read-only Volume. dev="FileStorage1" (/srv/baculasd1)

For the record: the broken disk above looks like sda, which is not part of the bacula volume but of /.
However, mdstat doesn't see it as broken:

root@helium:/var/log/bacula# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdb1[1] sda1[0]
      415037304 blocks super 1.2 [2/2] [UU]
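That is consistent: mdstat only flags members that have actually been dropped from the array ([UU] vs [U_]), so a disk can throw media errors like the ata1.00 ones above while md still reports it healthy; a SMART query (e.g. smartctl -a /dev/sda) would be the next check. A minimal sketch of the member-flag check, using the excerpt above as sample input:

```shell
# Sample /proc/mdstat status lines taken from the output above.
mdstat='md0 : active raid1 sdb1[1] sda1[0]
      415037304 blocks super 1.2 [2/2] [UU]'

# A "_" inside the member-status brackets means a failed/missing member.
if echo "$mdstat" | grep -qE '\[U*_[U_]*\]'; then
  status="degraded"
else
  status="all members up"
fi
echo "$status"
```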
Marostegui edited subscribers, added: akosiaris; removed: Cmjohnson, Papaul. · Mar 6 2017, 3:06 PM

I am manually executing the "predump" script on dbstore1001 in a root screen called "dumps", so that we at least have a local copy of the backups until we fix the bacula issue.

I will wait for it to finish on dbstore1001 before running it on dbstore2001, as I don't want the script to first delete all the backups we copied from dbstore1001; in that case we would not have local copies anywhere. First, let's make sure we always have one local copy, either on dbstore1001 or dbstore2001 :-)

There are now 2 backups happening on dbstore1001, one on 201703061505 and another on 201703061552: one from the screen session, the other from bacula-fd :-/

We can kill the screen one with no problem and let bacula finish its backup.

I saw no entries on tendril, so I assumed the backups that started yesterday had finished. I wanted to check whether they were OK on bacula, and I saw this:

Automatically selected FileSet: mysql-srv-backups
+--------+-------+----------+-------------------+---------------------+----------------+
| JobId  | Level | JobFiles | JobBytes          | StartTime           | VolumeName     |
+--------+-------+----------+-------------------+---------------------+----------------+
| 47,352 | F     |    2,730 | 1,793,193,161,486 | 2017-02-08 13:42:10 | production0023 |
| 47,352 | F     |    2,730 | 1,793,193,161,486 | 2017-02-08 13:42:10 | production0037 |
| 47,352 | F     |    2,730 | 1,793,193,161,486 | 2017-02-08 13:42:10 | production0038 |
| 47,352 | F     |    2,730 | 1,793,193,161,486 | 2017-02-08 13:42:10 | production0039 |
| 47,352 | F     |    2,730 | 1,793,193,161,486 | 2017-02-08 13:42:10 | production0040 |
+--------+-------+----------+-------------------+---------------------+----------------+

That date is obviously not March. @jcrespo, how do you normally verify them on bacula?
The local copies on dbstore1001 do look ok to me:

root@dbstore1001:/srv/backups# find . -size 0
root@dbstore1001:/srv/backups#
root@dbstore1001:/srv/backups# ls -lh enwiki-* commonswiki-2017030615*
-rw-r--r-- 1 root root 72G Mar  6 23:14 commonswiki-201703061505.sql.gz
-rw-r----- 1 root root 72G Mar  7 00:37 commonswiki-201703061552.sql.gz
-rw-r--r-- 1 root root 87G Mar  7 02:31 enwiki-201703061505.sql.gz
-rw-r----- 1 root root 87G Mar  7 03:58 enwiki-201703061552.sql.gz
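Besides checking for zero-size files, gzip -t can catch truncated or corrupt dumps; a small sketch on hypothetical sample files (file names are made up for the demo):

```shell
set -e
dir=$(mktemp -d)
echo "dummy dump" | gzip > "$dir/enwiki-sample.sql.gz"   # a valid archive
: > "$dir/truncated.sql.gz"                              # simulate a damaged/empty dump

# gzip -t reads the whole archive and verifies its CRC without extracting it.
for f in "$dir"/*.sql.gz; do
  if gzip -t "$f" 2>/dev/null; then
    echo "OK  $(basename "$f")"
  else
    echo "BAD $(basename "$f")"
  fi
done
```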

Good news: the backups were scheduled to run today at 2AM, and the first ones started around 45 minutes ago! :-)
So I believe we are in good shape here; let's let them finish and finally close this ticket!

The most recent backups for dbstore1001 are still only from Feb, and not March:

+--------+-------+----------+-------------------+---------------------+----------------+
| JobId  | Level | JobFiles | JobBytes          | StartTime           | VolumeName     |
+--------+-------+----------+-------------------+---------------------+----------------+
| 47,352 | F     |    2,730 | 1,793,193,161,486 | 2017-02-08 13:42:10 | production0023 |
| 47,352 | F     |    2,730 | 1,793,193,161,486 | 2017-02-08 13:42:10 | production0037 |
| 47,352 | F     |    2,730 | 1,793,193,161,486 | 2017-02-08 13:42:10 | production0038 |
| 47,352 | F     |    2,730 | 1,793,193,161,486 | 2017-02-08 13:42:10 | production0039 |
| 47,352 | F     |    2,730 | 1,793,193,161,486 | 2017-02-08 13:42:10 | production0040 |
+--------+-------+----------+-------------------+---------------------+----------------+

Although it looks like we still have the job running:

Running Jobs:
Console connected at 10-Mar-17 10:02
Console connected at 10-Mar-17 10:03
 JobId Level   Name                       Status
======================================================================
 49465 Full    dbstore1001.eqiad.wmnet-Weekly-Wed-production-mysql-srv-backups.2017-03-08_09.54.47_16 is running
====

I think we are good to go now \o/:

Automatically selected FileSet: mysql-srv-backups
+--------+-------+----------+-------------------+---------------------+----------------+
| JobId  | Level | JobFiles | JobBytes          | StartTime           | VolumeName     |
+--------+-------+----------+-------------------+---------------------+----------------+
| 49,465 | F     |    2,786 | 1,813,206,487,408 | 2017-03-10 02:39:04 | production0001 |
| 49,465 | F     |    2,786 | 1,813,206,487,408 | 2017-03-10 02:39:04 | production0004 |
| 49,465 | F     |    2,786 | 1,813,206,487,408 | 2017-03-10 02:39:04 | production0005 |
| 49,465 | F     |    2,786 | 1,813,206,487,408 | 2017-03-10 02:39:04 | production0006 |
| 49,465 | F     |    2,786 | 1,813,206,487,408 | 2017-03-10 02:39:04 | production0007 |
+--------+-------+----------+-------------------+---------------------+----------------+
$ ls enwiki-*
enwiki-201703061505.sql.gz
enwiki-201703061552.sql.gz
enwiki-201703081217.sql.gz

The duplication for the 6th is probably because we ran the backups manually and, a few hours later, the scheduled job started to run after @akosiaris fixed dbstore1001.
The backup from the 8th is the one that ran normally as scheduled, as can be seen at T153768#3090617.

Thanks everyone!!!

Marostegui closed this task as Resolved. · Mar 13 2017, 8:54 AM