
Marostegui (Manuel Aróstegui)
Staff Database Administrator

User Details

User Since
Sep 1 2016, 6:48 AM (216 w, 5 d)
Availability
Available
IRC Nick
marostegui
LDAP User
Marostegui
MediaWiki User
MArostegui (WMF) [ Global Accounts ]

TZ: UTC +1/+2

Recent Activity

Yesterday

Marostegui added a comment to T266485: Populating orchestrator metadata on a per-server basis.

Forgetting the existing hosts:

root@dborch1001:~# orchestrator-client -c forget-cluster -alias pc1007
root@dborch1001:~#
Tue, Oct 27, 10:01 AM · Patch-For-Review, DBA
Marostegui created P13075 (An Untitled Masterwork).
Tue, Oct 27, 9:46 AM
Marostegui added a comment to T266432: Increase on database writes and deletes activity on Commonswiki leads to some replication lag.

Another spike from 08:05 to 08:06 and this is what the binlog shows (number of statements) in terms of writes during that timeframe:

Tue, Oct 27, 8:26 AM · Platform Engineering, Wikimedia-production-error, DBA, Commons, Release-Engineering-Team, Operations
Marostegui added a comment to T260370: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet.

Per my chat with Chris, updating the rack location from A2 to A1 and from C2 to C3

Tue, Oct 27, 8:08 AM · Operations, DBA, ops-eqiad, DC-Ops
Marostegui updated the task description for T260370: (Need By: 2020-08-31) rack/setup/install es10[26-34].eqiad.wmnet.
Tue, Oct 27, 8:08 AM · Operations, DBA, ops-eqiad, DC-Ops
Marostegui added a comment to T266432: Increase on database writes and deletes activity on Commonswiki leads to some replication lag.

Another spike yesterday on DELETEs

Tue, Oct 27, 7:31 AM · Platform Engineering, Wikimedia-production-error, DBA, Commons, Release-Engineering-Team, Operations
Marostegui reopened T265323: Add toil::systemd_scope_cleanup to dbprov hosts as "Open".

This has happened again:

[05:55:19]  <+icinga-wm>	PROBLEM - Check systemd state on dbprov1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
Tue, Oct 27, 6:09 AM · Data-Persistence-Backup, Operations, SRE-tools
Marostegui reopened T265323: Add toil::systemd_scope_cleanup to dbprov hosts, a subtask of T199911: Systemd session creation fails under I/O load, as Open.
Tue, Oct 27, 6:08 AM · Operations, SRE-tools
Marostegui added a comment to T265344: Monitor the growth of CheckUser tables at large wikis.

@Huji 20MB for ruwiki means around 1GB per year at current growth (assuming it keeps growing at the same rate). That is perfectly acceptable.
However, we do need to do this same exercise for the big wikis excluded at T253802#6536344 once it is enabled there.

Tue, Oct 27, 6:06 AM · DBA
Marostegui added a comment to T266003: orchestrator: Select backend database solution.

db2093 now hosts the orchestrator database.
@Kormat the only pending thing is to decide what to do with monitoring and read_only, right? Right now the read_only check alerts on db2093, as the host is supposed to be read_only=true, but that can no longer be the case because this host needs to be writable for orchestrator.

Tue, Oct 27, 6:00 AM · Patch-For-Review, Data-Persistence, User-Kormat, DBA
Marostegui moved T266483: Enable report_host for mariadb from Refine to Ready on the DBA board.
Tue, Oct 27, 5:58 AM · Patch-For-Review, DBA, User-Kormat

Mon, Oct 26

Marostegui closed T261914: Enable replication eqiad -> codfw and other checks as Resolved.

This is all done

Mon, Oct 26, 5:14 PM · DBA
Marostegui closed T261914: Enable replication eqiad -> codfw and other checks, a subtask of T243318: FY2020-2021 Q1 codfw -> eqiad switchback, as Resolved.
Mon, Oct 26, 5:14 PM · Operations
Marostegui updated the task description for T261914: Enable replication eqiad -> codfw and other checks.
Mon, Oct 26, 5:14 PM · DBA
Marostegui triaged T266486: Schema change to turn user_last_timestamp.user_newtalk to binary(14) as Medium priority.
Mon, Oct 26, 4:39 PM · DBA, Blocked-on-schema-change
Marostegui added a comment to T266432: Increase on database writes and deletes activity on Commonswiki leads to some replication lag.

We just had another huge spike of DELETEs

Mon, Oct 26, 4:06 PM · Platform Engineering, Wikimedia-production-error, DBA, Commons, Release-Engineering-Team, Operations
Marostegui added a comment to T266485: Populating orchestrator metadata on a per-server basis.

Maybe this can be placed on the ops database already.
We'd need to deploy the following grants everywhere:

GRANT SELECT ON ops.cluster TO 'orchestrator'@'orc_host';
Mon, Oct 26, 4:02 PM · Patch-For-Review, DBA
Marostegui moved T266483: Enable report_host for mariadb from Triage to Refine on the DBA board.
Mon, Oct 26, 3:58 PM · Patch-For-Review, DBA, User-Kormat
Marostegui triaged T266485: Populating orchestrator metadata on a per-server basis as Medium priority.
Mon, Oct 26, 3:58 PM · Patch-For-Review, DBA
Marostegui created T266485: Populating orchestrator metadata on a per-server basis.
Mon, Oct 26, 3:58 PM · Patch-For-Review, DBA
Marostegui updated the task description for T266483: Enable report_host for mariadb.
Mon, Oct 26, 3:53 PM · Patch-For-Review, DBA, User-Kormat
Marostegui created P13069 (An Untitled Masterwork).
Mon, Oct 26, 3:07 PM
Marostegui added a comment to T167973: Move database for wikitech (labswiki) to a main cluster section.

The DC switchover is tomorrow, so we can try to plan for it in Q2, if we find the time for it.
I will ping you once I've come up with a plan!
Thanks!

Mon, Oct 26, 2:40 PM · wikitech.wikimedia.org, DBA
Marostegui added a comment to T266432: Increase on database writes and deletes activity on Commonswiki leads to some replication lag.

Adding Wikimedia-production-error as it seems to coincide with a non-train deploy at 16:45 on the 22nd. I am unable to find it on SAL, however.

Mon, Oct 26, 12:24 PM · Platform Engineering, Wikimedia-production-error, DBA, Commons, Release-Engineering-Team, Operations
Marostegui triaged T266452: Integrate orchestrator with !log as Medium priority.
Mon, Oct 26, 10:27 AM · Operations, User-Kormat, DBA
Marostegui created T266452: Integrate orchestrator with !log.
Mon, Oct 26, 10:27 AM · Operations, User-Kormat, DBA
Marostegui renamed T266432: Increase on database writes and deletes activity on Commonswiki leads to some replication lag from Increase on database writes and deletes activity on Commonswiki lead to some replication lag to Increase on database writes and deletes activity on Commonswiki leads to some replication lag.
Mon, Oct 26, 8:12 AM · Platform Engineering, Wikimedia-production-error, DBA, Commons, Release-Engineering-Team, Operations
Marostegui raised the priority of T266432: Increase on database writes and deletes activity on Commonswiki leads to some replication lag from Medium to High.

Setting to high as this might be causing cross dc lag

Mon, Oct 26, 8:12 AM · Platform Engineering, Wikimedia-production-error, DBA, Commons, Release-Engineering-Team, Operations
Marostegui triaged T266432: Increase on database writes and deletes activity on Commonswiki leads to some replication lag as Medium priority.
Mon, Oct 26, 8:11 AM · Platform Engineering, Wikimedia-production-error, DBA, Commons, Release-Engineering-Team, Operations
Marostegui created T266432: Increase on database writes and deletes activity on Commonswiki leads to some replication lag.
Mon, Oct 26, 8:11 AM · Platform Engineering, Wikimedia-production-error, DBA, Commons, Release-Engineering-Team, Operations
Marostegui updated the task description for T261914: Enable replication eqiad -> codfw and other checks.
Mon, Oct 26, 7:48 AM · DBA
Marostegui updated the task description for T261914: Enable replication eqiad -> codfw and other checks.
Mon, Oct 26, 7:19 AM · DBA
Marostegui moved T266086: Nuria's volunteer account from Awaiting User Input to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Mon, Oct 26, 6:36 AM · Analytics-Radar, Operations, SRE-Access-Requests
Marostegui triaged T253986: update bacula-sd config so that it listens on IPv6 as Medium priority.
Mon, Oct 26, 6:36 AM · Operations, IPv6
Marostegui triaged T266338: orchestrator: Add service monitoring as Low priority.

We don't have it in production, so setting this to Low, as we aren't in a hurry for this as of today.

Mon, Oct 26, 6:35 AM · Operations, User-Kormat, DBA
Marostegui triaged T266428: Orchestrator: Create basic documentation as Low priority.

We are far from having it in production, so this is not urgent at the moment.

Mon, Oct 26, 6:34 AM · DBA
Marostegui created T266428: Orchestrator: Create basic documentation.
Mon, Oct 26, 6:33 AM · DBA
Marostegui added a comment to T261914: Enable replication eqiad -> codfw and other checks.

Double checked that replication is enabled on all masters (on both dcs)

Mon, Oct 26, 6:26 AM · DBA
Marostegui added a comment to T266086: Nuria's volunteer account.

The offboarding script has an option for "stay volunteer".

Mon, Oct 26, 5:45 AM · Analytics-Radar, Operations, SRE-Access-Requests
Marostegui added a comment to T266086: Nuria's volunteer account.

I can confirm https://phabricator.wikimedia.org/L2 has been signed by @Nuria
We'd still need C-level approval for this. I will seek Grant's approval.

Mon, Oct 26, 5:43 AM · Analytics-Radar, Operations, SRE-Access-Requests

Fri, Oct 23

Marostegui added a comment to T261914: Enable replication eqiad -> codfw and other checks.

All tables came back clean. All those that reported differences were confirmed as false positives by second runs.
The false positives were found at:

Fri, Oct 23, 12:08 PM · DBA
Marostegui updated the task description for T261914: Enable replication eqiad -> codfw and other checks.
Fri, Oct 23, 12:07 PM · DBA
Marostegui moved T266331: Cloud: define relationship between wikimediacloud.org domain, CIDR prefixes and netbox automation from Backlog to Acknowledged on the Operations board.
Fri, Oct 23, 11:57 AM · Traffic, Operations, netbox, DNS, netops, cloud-services-team (Kanban)
Marostegui updated the task description for T261914: Enable replication eqiad -> codfw and other checks.
Fri, Oct 23, 11:32 AM · DBA
Marostegui updated the task description for T261914: Enable replication eqiad -> codfw and other checks.
Fri, Oct 23, 9:29 AM · DBA
Marostegui triaged T266314: Decom cookbook should also remove keytabs as Medium priority.
Fri, Oct 23, 8:54 AM · User-MoritzMuehlenhoff, Operations
Marostegui moved T266086: Nuria's volunteer account from Manager/NDA Approval/Confirmation to Awaiting User Input on the SRE-Access-Requests board.
Fri, Oct 23, 7:53 AM · Analytics-Radar, Operations, SRE-Access-Requests
Marostegui moved T266249: Requesting access to production shell groups for JAnstee from Awaiting User Input to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Fri, Oct 23, 7:53 AM · Analytics, Operations, SRE-Access-Requests
Marostegui moved T266250: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung from Awaiting User Input to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Fri, Oct 23, 7:53 AM · Analytics, Operations, SRE-Access-Requests
Marostegui updated the task description for T261914: Enable replication eqiad -> codfw and other checks.
Fri, Oct 23, 6:56 AM · DBA
Marostegui added a comment to T261411: Add a link engineering: Determine format for accessing and storing link recommendations.

If that needs to happen, we'd need to add some sleep between iterations to avoid replication lag: even if each table is small, there will be one per wiki, so there will be lots of them, and that can cause some replication lag on x1.

Fri, Oct 23, 6:25 AM · Growth-Team (Current Sprint), DBA, Growth-Structured-Tasks
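The "sleep between iterations" idea from the T261411 comment above could be sketched roughly as follows. This is only an illustration: `run_throttled`, `create_table` and `get_lag_seconds` are hypothetical names, and the thresholds are assumptions rather than values taken from the task.

```python
import time
from typing import Callable, Iterable, List


def run_throttled(
    wikis: Iterable[str],
    create_table: Callable[[str], None],
    get_lag_seconds: Callable[[], float],
    max_lag: float = 1.0,
    pause: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Run create_table(wiki) once per wiki, pausing between iterations.

    All callables are hypothetical hooks supplied by the caller:
    create_table performs the per-wiki change, get_lag_seconds returns
    the current replication lag of the section being written to.
    """
    done = []
    for wiki in wikis:
        # Hold off while replicas are behind by more than max_lag seconds.
        while get_lag_seconds() > max_lag:
            sleep(pause)
        create_table(wiki)
        done.append(wiki)
        # Unconditional pause so that many small changes do not pile up.
        sleep(pause)
    return done
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real waiting; in real use the default `time.sleep` would apply.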
Marostegui updated the task description for T261914: Enable replication eqiad -> codfw and other checks.
Fri, Oct 23, 6:11 AM · DBA
Marostegui added a comment to T266249: Requesting access to production shell groups for JAnstee.

Confirmed janstee@wikimedia.org via ldap corp as staff.
@JAnstee_WMF we'd need your manager to sign this off.
Thanks!

Fri, Oct 23, 6:01 AM · Analytics, Operations, SRE-Access-Requests
Marostegui moved T266249: Requesting access to production shell groups for JAnstee from Untriaged to Awaiting User Input on the SRE-Access-Requests board.
Fri, Oct 23, 6:00 AM · Analytics, Operations, SRE-Access-Requests
Marostegui updated the task description for T266249: Requesting access to production shell groups for JAnstee.
Fri, Oct 23, 6:00 AM · Analytics, Operations, SRE-Access-Requests
Marostegui updated the task description for T266250: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung.
Fri, Oct 23, 5:59 AM · Analytics, Operations, SRE-Access-Requests
Marostegui added a comment to T266250: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung.

Confirmed that @Rmaung is staff by checking via ldap-corp.
@Rmaung we'd also need your manager to sign off this request.

Fri, Oct 23, 5:59 AM · Analytics, Operations, SRE-Access-Requests
Marostegui updated the task description for T266249: Requesting access to production shell groups for JAnstee.
Fri, Oct 23, 5:55 AM · Analytics, Operations, SRE-Access-Requests
Marostegui added a project to T266249: Requesting access to production shell groups for JAnstee: Analytics.
Fri, Oct 23, 5:54 AM · Analytics, Operations, SRE-Access-Requests
Marostegui moved T266250: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung from Untriaged to Awaiting User Input on the SRE-Access-Requests board.

@KFrancis can you confirm if @Rmaung has a valid NDA signed? I cannot see it on the NDA tracking sheet.

Fri, Oct 23, 5:53 AM · Analytics, Operations, SRE-Access-Requests
Marostegui added a project to T266250: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung: Analytics.
Fri, Oct 23, 5:53 AM · Analytics, Operations, SRE-Access-Requests
Marostegui updated the task description for T266250: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung.
Fri, Oct 23, 5:48 AM · Analytics, Operations, SRE-Access-Requests
Marostegui updated the task description for T266250: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung.
Fri, Oct 23, 5:45 AM · Analytics, Operations, SRE-Access-Requests
Marostegui added a comment to T261411: Add a link engineering: Determine format for accessing and storing link recommendations.

I think this is done, per the last two comments (please correct me if I'm overinterpreting). We'll go with MySQL, x1, one table per wiki, one row per page ("denormalized" format). Thanks all!
We'll file a separate task for the actual schema change.

Fri, Oct 23, 5:42 AM · Growth-Team (Current Sprint), DBA, Growth-Structured-Tasks
Marostegui triaged T266249: Requesting access to production shell groups for JAnstee as Medium priority.
Fri, Oct 23, 5:38 AM · Analytics, Operations, SRE-Access-Requests
Marostegui triaged T266250: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung as Medium priority.
Fri, Oct 23, 5:38 AM · Analytics, Operations, SRE-Access-Requests
Marostegui updated the task description for T261914: Enable replication eqiad -> codfw and other checks.
Fri, Oct 23, 5:34 AM · DBA
Marostegui updated the task description for T261914: Enable replication eqiad -> codfw and other checks.
Fri, Oct 23, 5:25 AM · DBA
Marostegui added a comment to T261410: Add a link engineering: Create MySQL table for caching link recommendations.

Although, looking at https://gerrit.wikimedia.org/r/635912, I don't see the JSON data type, just a mediumblob used to store that sort of data; that would work. I thought you specifically wanted to use the JSON column data type.
Never mind then!
Thanks!

Fri, Oct 23, 5:08 AM · Patch-For-Review, Growth-Team (Current Sprint), Growth-Structured-Tasks
Marostegui added a comment to T261410: Add a link engineering: Create MySQL table for caching link recommendations.

We decided on a JSON-in-MySQL approach. See T261411#6546055 and T261411#6562849.

Fri, Oct 23, 5:06 AM · Patch-For-Review, Growth-Team (Current Sprint), Growth-Structured-Tasks

Thu, Oct 22

Marostegui removed a project from T103084: Have a cron job delete files that haven't been modified in the last X days / months in /data/scratch: Operations.
Thu, Oct 22, 1:25 PM · Data-Services, cloud-services-team (Kanban)
Marostegui moved T266119: mariadb::config: parameterize event_scheduler from Backlog to Acknowledged on the Operations board.
Thu, Oct 22, 1:17 PM · DBA, Operations, User-Kormat
Marostegui moved T265969: Add sbisson to analytics-privatedata-users and create a kerberos identity from Acknowledged to Radar on the Operations board.
Thu, Oct 22, 1:17 PM · Operations, Analytics, SRE-Access-Requests
Marostegui moved T266198: Move labstore1004 and labstore1005 to 10G Ethernet from Backlog to Acknowledged on the Operations board.
Thu, Oct 22, 1:16 PM · Epic, ops-eqiad, Data-Services, cloud-services-team (Kanban), Operations
Marostegui moved T266199: Move labstore1005 to 10Gbps rack and ethernet from Backlog to Acknowledged on the Operations board.
Thu, Oct 22, 1:16 PM · ops-eqiad, Data-Services, cloud-services-team (Kanban), Operations
Marostegui moved T266214: Degraded RAID on ms-be2017 from Backlog to Acknowledged on the Operations board.
Thu, Oct 22, 1:15 PM · SRE-swift-storage, Operations, ops-codfw
Marostegui removed a project from T207253: Automatically compare a few tables per section between hosts and DC: User-Banyek.
Thu, Oct 22, 1:14 PM · Sustainability (Incident Followup), Patch-For-Review, DBA
Marostegui updated the task description for T261914: Enable replication eqiad -> codfw and other checks.
Thu, Oct 22, 11:42 AM · DBA
Marostegui updated the task description for T261914: Enable replication eqiad -> codfw and other checks.
Thu, Oct 22, 11:33 AM · DBA
Marostegui added a comment to T261914: Enable replication eqiad -> codfw and other checks.

The following tables will be checked across all the wikis:

revision rev_id
text old_id
user user_id
change_tag ct_id
actor actor_id
ipblocks ipb_id
comment comment_id
user user_id
watchlist wl_id
text old_id
logging log_id
page page_id
revision rev_id
revision_actor_temp revactor_rev
revision_comment_temp revcomment_rev
slots slot_revision_id
archive ar_id
Thu, Oct 22, 11:32 AM · DBA
Marostegui created P13051 MDEV-21813.
Thu, Oct 22, 11:21 AM
Marostegui added a comment to T265866: Run check table periodically on backup source hosts.

After running mysqlcheck under close supervision on almost all hosts, I can say this is not as easy as "just setting up a cron and running it every week". The CHECK TABLE command on all tables can take up to 24 hours per host, and it has a heavy impact. We don't have monitoring properly tuned to handle this, plus it makes backups fail frequently if both run concurrently (at least 3 snapshots failed because of ongoing checks).

With this I am not saying we shouldn't do this, but it is going to be even harder to implement than T104459, and it is a large project to get right, even if restricted to backup source hosts only, due to its impact on lag and on taking backups.

Thu, Oct 22, 10:42 AM · Data-Persistence-Backup
Marostegui added a comment to T266214: Degraded RAID on ms-be2017.

Please note that I have disabled the event handler for those checks on this host, so it will need to be re-enabled once the disk is swapped.

Thu, Oct 22, 9:55 AM · SRE-swift-storage, Operations, ops-codfw
Marostegui edited Description on Operations.
Thu, Oct 22, 9:38 AM
Marostegui added a comment to T261914: Enable replication eqiad -> codfw and other checks.

Thanks Stevie for working on that!
We have double-checked independently, and replication is running, without GTID, everywhere within core in both directions:

db1081
                  Master_Host: db2090.codfw.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db1083
                  Master_Host: db2112.codfw.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db1086
                  Master_Host: db2118.codfw.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db1100
                  Master_Host: db2123.codfw.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db1103
                   Master_Host: db2096.codfw.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
db1104
                  Master_Host: db2079.codfw.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db1122
                  Master_Host: db2107.codfw.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db1123
                  Master_Host: db2105.codfw.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db1131
                  Master_Host: db2129.codfw.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db2079
                  Master_Host: db1104.eqiad.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db2090
                  Master_Host: db1081.eqiad.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db2096
                   Master_Host: db1103.eqiad.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
db2105
                  Master_Host: db1123.eqiad.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db2107
                  Master_Host: db1122.eqiad.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db2112
                  Master_Host: db1083.eqiad.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db2118
                  Master_Host: db1086.eqiad.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db2123
                  Master_Host: db1100.eqiad.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db2129
                  Master_Host: db1131.eqiad.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
es1021
                   Master_Host: es2021.codfw.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
es1024
                   Master_Host: es2023.codfw.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
es2021
                   Master_Host: es1021.eqiad.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
es2023
                   Master_Host: es1024.eqiad.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
pc1007
                   Master_Host: pc2007.codfw.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
pc1008
                   Master_Host: pc2008.codfw.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
pc1009
                   Master_Host: pc2009.codfw.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
pc2007
                   Master_Host: pc1007.eqiad.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
pc2008
                   Master_Host: pc1008.eqiad.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
pc2009
                   Master_Host: pc1009.eqiad.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
Thu, Oct 22, 9:18 AM · DBA
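The cross-DC check reported in the T261914 comment above (each master replicates from its peer, with zero lag and GTID off) could be automated along these lines. This is a sketch under assumptions: the input is text in the same "hostname, then indented `Field: value` lines" layout as that comment, and `parse_status` and `check_replication` are hypothetical helper names.

```python
from typing import Dict, List

# Small sample in the same layout as the comment above.
SAMPLE = """db1081
                  Master_Host: db2090.codfw.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db2090
                  Master_Host: db1081.eqiad.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
"""


def parse_status(text: str) -> Dict[str, Dict[str, str]]:
    """Group 'Field: value' lines under the hostname line preceding them."""
    hosts: Dict[str, Dict[str, str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if ":" in line:
            if current is not None:
                key, _, value = line.partition(":")
                hosts[current][key.strip()] = value.strip()
        else:
            current = line
            hosts[current] = {}
    return hosts


def check_replication(hosts: Dict[str, Dict[str, str]]) -> List[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for host, fields in sorted(hosts.items()):
        master = fields.get("Master_Host", "").split(".")[0]
        if fields.get("Seconds_Behind_Master") != "0":
            problems.append(f"{host}: lagging or not replicating")
        if fields.get("Using_Gtid") != "No":
            problems.append(f"{host}: GTID is enabled")
        # The master of this host must itself replicate from this host.
        peer = hosts.get(master, {})
        if peer.get("Master_Host", "").split(".")[0] != host:
            problems.append(f"{host}: {master or 'unknown'} does not replicate back")
    return problems


if __name__ == "__main__":
    for problem in check_replication(parse_status(SAMPLE)):
        print(problem)
```

Comparing only the short hostname (the part before the first dot) is a simplification that works for the host names shown in the comment.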
Marostegui added a comment to T266214: Degraded RAID on ms-be2017.
[1814258.987868] sd 0:1:0:8: [sdi] tag#17 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1814258.987874] sd 0:1:0:8: [sdi] tag#17 Sense Key : Hardware Error [current]
[1814258.987880] sd 0:1:0:8: [sdi] tag#17 Add. Sense: Logical unit failure
[1814258.987885] sd 0:1:0:8: [sdi] tag#17 CDB: Read(16) 88 00 00 00 00 01 53 85 d6 00 00 00 00 10 00 00
[1814258.987888] blk_update_request: critical target error, dev sdi, sector 5696247296
[1814258.987894] sd 0:1:0:8: [sdi] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1814258.987897] sd 0:1:0:8: [sdi] tag#10 Sense Key : Hardware Error [current]
[1814258.987900] sd 0:1:0:8: [sdi] tag#10 Add. Sense: Logical unit failure
[1814258.987904] sd 0:1:0:8: [sdi] tag#10 CDB: Read(16) 88 00 00 00 00 01 6f 6f c4 c0 00 00 00 08 00 00
[1814258.987906] blk_update_request: critical target error, dev sdi, sector 6164563136
[1814258.987911] sd 0:1:0:8: [sdi] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1814258.987915] sd 0:1:0:8: [sdi] tag#3 Sense Key : Hardware Error [current]
[1814258.987919] sd 0:1:0:8: [sdi] tag#3 Add. Sense: Logical unit failure
[1814258.987922] blk_update_request: critical target error, dev sdi, sector 5888386544
[1814258.987923] sd 0:1:0:8: [sdi] tag#3 CDB: Read(16) 88 00 00 00 00 01 16 93 db 90 00 00 00 08 00 00
[1814258.987926] sd 0:1:0:8: [sdi] tag#5 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1814258.987928] blk_update_request: critical target error, dev sdi, sector 4673756048
[1814258.987930] sd 0:1:0:8: [sdi] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1814258.987932] sd 0:1:0:8: [sdi] tag#4 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1814258.987934] sd 0:1:0:8: [sdi] tag#5 Sense Key : Hardware Error [current]
[1814258.987935] sd 0:1:0:8: [sdi] tag#0 Sense Key : Hardware Error [current]
[1814258.987937] sd 0:1:0:8: [sdi] tag#4 Sense Key : Hardware Error [current]
[1814258.987938] sd 0:1:0:8: [sdi] tag#5 Add. Sense: Logical unit failure
[1814258.987940] sd 0:1:0:8: [sdi] tag#0 Add. Sense: Logical unit failure
[1814258.987941] sd 0:1:0:8: [sdi] tag#4 Add. Sense: Logical unit failure
[1814258.987943] sd 0:1:0:8: [sdi] tag#5 CDB: Read(16) 88 00 00 00 00 00 c2 23 06 10 00 00 00 10 00 00
[1814258.987944] sd 0:1:0:8: [sdi] tag#0 CDB: Read(16) 88 00 00 00 00 00 21 30 78 10 00 00 00 10 00 00
[1814258.987946] blk_update_request: critical target error, dev sdi, sector 3257075216
[1814258.987948] sd 0:1:0:8: [sdi] tag#4 CDB: Read(16) 88 00 00 00 00 00 03 cb 3e 30 00 00 00 08 00 00
[1814258.987950] blk_update_request: critical target error, dev sdi, sector 556824592
[1814258.987952] sd 0:1:0:8: [sdi] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1814258.987954] blk_update_request: critical target error, dev sdi, sector 63651376
[1814258.987956] sd 0:1:0:8: [sdi] tag#1 Sense Key : Hardware Error [current]
[1814258.987959] sd 0:1:0:8: [sdi] tag#1 Add. Sense: Logical unit failure
[1814258.987963] XFS (sdi1): metadata I/O error: block 0x16f6fc2c0 ("xfs_trans_read_buf_map") error 121 numblks 8
[1814258.987965] sd 0:1:0:8: [sdi] tag#1 CDB: Read(16) 88 00 00 00 00 00 fb aa 35 60 00 00 00 10 00 00
[1814258.987967] blk_update_request: critical target error, dev sdi, sector 4222236000
[1814258.987972] XFS (sdi1): metadata I/O error: block 0x15ef9a3f0 ("xfs_trans_read_buf_map") error 121 numblks 16
[1814258.987977] XFS (sdi1): metadata I/O error: block 0x1708efe30 ("xfs_trans_read_buf_map") error 121 numblks 16
[1814258.987981] XFS (sdi1): xfs_do_force_shutdown(0x1) called from line 315 of file /build/linux-dqnRSc/linux-4.9.228/fs/xfs/xfs_trans_buf.c.  Return address = 0xffffffffc124a058
[1814258.987983] XFS (sdi1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -121.
Thu, Oct 22, 8:25 AM · SRE-swift-storage, Operations, ops-codfw
Marostegui triaged T266214: Degraded RAID on ms-be2017 as Medium priority.
Thu, Oct 22, 8:24 AM · SRE-swift-storage, Operations, ops-codfw
Marostegui closed T198209: Graphite returning 500 @ nagf and graphite url as Resolved.

@Paladox https://nagf.toolforge.org/?project=tools works for me, so maybe this is already gone. Going to close it.
If you feel this still needs investigation, please reopen!

Thu, Oct 22, 8:03 AM · Operations, Cloud-Services, Graphite
Marostegui moved T266119: mariadb::config: parameterize event_scheduler from Triage to Ready on the DBA board.
Thu, Oct 22, 7:15 AM · DBA, Operations, User-Kormat
Marostegui moved T266125: Drop table profiling from WMF wiki mariadb servers from Triage to Ready on the DBA board.
Thu, Oct 22, 7:15 AM · Data-Persistence-Backup, DBA
Marostegui triaged T266147: Port prometheus-openldap-exporter to Python 3 as Medium priority.
Thu, Oct 22, 5:11 AM · Python3-Porting, LDAP, Operations
Marostegui added a comment to T265866: Run check table periodically on backup source hosts.

Unfortunately that is expected. A table rebuild (alter table engine=innodb, force) fixes it in most cases, and then replication should flow again.

Thu, Oct 22, 5:00 AM · Data-Persistence-Backup

Wed, Oct 21

Marostegui committed rOSTD1c9948736099: global_tables: Empty table schemas for global tables (authored by Marostegui).
global_tables: Empty table schemas for global tables
Wed, Oct 21, 3:37 PM
Marostegui triaged T266118: Revisit use of swap and related kernel settings as Medium priority.
Wed, Oct 21, 11:59 AM · User-MoritzMuehlenhoff, Operations
Marostegui edited projects for T266003: orchestrator: Select backend database solution, added: Data-Persistence; removed Operations.

Upgraded db2093 from 10.4.12 to 10.4.15
Rebooted it to pick the new kernels too.

Wed, Oct 21, 11:09 AM · Patch-For-Review, Data-Persistence, User-Kormat, DBA
Marostegui closed T265982: eqiad: New ganeti instance for orchestrator installation, a subtask of T265990: orchestrator: Puppetize, as Resolved.
Wed, Oct 21, 10:46 AM · Patch-For-Review, Operations, User-Kormat, DBA
Marostegui closed T265982: eqiad: New ganeti instance for orchestrator installation as Resolved.

Finally this VM is up and running.

Wed, Oct 21, 10:46 AM · Operations, vm-requests, serviceops
Marostegui removed a project from T266064: Site: 1 VM request for Analytics test cluster: serviceops.
Wed, Oct 21, 10:32 AM · vm-requests, Operations
Marostegui added a comment to T264703: Race condition when re-importing a logical backup and a new one is generated.

Aside from that, what I can do is add a check just before rotation to "latest" to see if there is something "reading" the dir and kill it before moving it to latest? Maybe restrict it to myloader pid/recovery script?

Wed, Oct 21, 9:51 AM · Data-Persistence-Backup
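The pre-rotation check floated in the T264703 comment above (see whether anything is still reading the directory before moving it to "latest") might look something like the sketch below. It only detects readers and does not kill anything. It assumes a Linux `/proc` filesystem and sufficient privileges to read other processes' `fd` directories; `pids_using_dir` is a hypothetical name, and filtering by process name (for example, only myloader) is left out.

```python
import os
from typing import List


def pids_using_dir(directory: str, proc_root: str = "/proc") -> List[int]:
    """Return PIDs whose cwd or open file descriptors point inside directory."""
    target = os.path.realpath(directory)
    prefix = target + os.sep
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        links = [os.path.join(proc_root, entry, "cwd")]
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            links += [os.path.join(fd_dir, fd) for fd in os.listdir(fd_dir)]
        except OSError:
            pass  # process exited, or we lack permission to inspect it
        for link in links:
            try:
                path = os.readlink(link)
            except OSError:
                continue
            if path == target or path.startswith(prefix):
                pids.append(int(entry))
                break
    return sorted(pids)
```

A rotation script could call this just before the move and either wait or abort when the returned list is non-empty. There is still a window between the check and the move, so this narrows the race rather than removing it.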
Marostegui closed T266060: Add Nahid to WMF-NDA group as Resolved.

Indeed, I believe I was following the wrong procedure.
I have added @Nahid to WMF-NDA

Wed, Oct 21, 9:15 AM · WMF-NDA-Requests