
Marostegui (Manuel Aróstegui)
Staff Database Administrator

User Details

User Since
Sep 1 2016, 6:48 AM (216 w, 3 d)
Availability
Available
IRC Nick
marostegui
LDAP User
Marostegui
MediaWiki User
MArostegui (WMF) [ Global Accounts ]

TZ: UTC +1/+2

Recent Activity

Fri, Oct 23

Marostegui added a comment to T261914: Enable replication eqiad -> codfw and other checks.

All tables came clean. All those that reported differences were confirmed as false positives by second runs.
The false positives were found at:

Fri, Oct 23, 12:08 PM · Data-Persistence, DBA
Marostegui updated the task description for T261914: Enable replication eqiad -> codfw and other checks.
Fri, Oct 23, 12:07 PM · Data-Persistence, DBA
Marostegui moved T266331: Cloud: define relationship between wikimediacloud.org domain, CIDR prefixes and netbox automation from Backlog to Acknowledged on the Operations board.
Fri, Oct 23, 11:57 AM · Traffic, Operations, netbox, DNS, netops, cloud-services-team (Kanban)
Marostegui updated the task description for T261914: Enable replication eqiad -> codfw and other checks.
Fri, Oct 23, 11:32 AM · Data-Persistence, DBA
Marostegui updated the task description for T261914: Enable replication eqiad -> codfw and other checks.
Fri, Oct 23, 9:29 AM · Data-Persistence, DBA
Marostegui triaged T266314: Decom cookbook should also remove keytabs as Medium priority.
Fri, Oct 23, 8:54 AM · User-MoritzMuehlenhoff, Operations
Marostegui moved T266086: Nuria's volunteer account from Manager/NDA Approval/Confirmation to Awaiting User Input on the SRE-Access-Requests board.
Fri, Oct 23, 7:53 AM · Analytics, Operations, SRE-Access-Requests
Marostegui moved T266249: Requesting access to production shell groups for JAnstee from Awaiting User Input to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Fri, Oct 23, 7:53 AM · Analytics, Operations, SRE-Access-Requests
Marostegui moved T266250: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung from Awaiting User Input to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Fri, Oct 23, 7:53 AM · Analytics, SRE-Access-Requests, Operations
Marostegui updated the task description for T261914: Enable replication eqiad -> codfw and other checks.
Fri, Oct 23, 6:56 AM · Data-Persistence, DBA
Marostegui added a comment to T261411: Add a link engineering: Determine format for accessing and storing link recommendations.

If that needs to happen, we'd need to sleep between iterations to avoid replication lag: even if each table is small, there will be one per wiki, so there will be lots of them, and that can cause some replication lag on x1.
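
A minimal sketch of the "sleep between iterations" idea, assuming the per-wiki work can be wrapped in a callable. The names run_throttled, apply_change and pause_seconds are hypothetical and not taken from any existing tooling; the right pause length would depend on observed lag on x1.

```python
import time
from typing import Callable, Iterable, List


def run_throttled(
    wikis: Iterable[str],
    apply_change: Callable[[str], None],
    pause_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Apply a change to each wiki in turn, pausing between iterations."""
    done: List[str] = []
    for index, wiki in enumerate(wikis):
        if index > 0:
            # Pause before every iteration except the first, so replicas
            # get a chance to catch up between writes.
            sleep(pause_seconds)
        apply_change(wiki)
        done.append(wiki)
    return done
```

The sleep function is injectable so the loop can be exercised without actually waiting; in real use the default time.sleep would apply.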

Fri, Oct 23, 6:25 AM · Growth-Team (Current Sprint), Data-Persistence, DBA, Growth-Structured-Tasks
Marostegui updated the task description for T261914: Enable replication eqiad -> codfw and other checks.
Fri, Oct 23, 6:11 AM · Data-Persistence, DBA
Marostegui added a comment to T266249: Requesting access to production shell groups for JAnstee.

Confirmed janstee@wikimedia.org via ldap corp as staff.
@JAnstee_WMF we'd need your manager to sign this off.
Thanks!

Fri, Oct 23, 6:01 AM · Analytics, Operations, SRE-Access-Requests
Marostegui moved T266249: Requesting access to production shell groups for JAnstee from Untriaged to Awaiting User Input on the SRE-Access-Requests board.
Fri, Oct 23, 6:00 AM · Analytics, Operations, SRE-Access-Requests
Marostegui updated the task description for T266249: Requesting access to production shell groups for JAnstee.
Fri, Oct 23, 6:00 AM · Analytics, Operations, SRE-Access-Requests
Marostegui updated the task description for T266250: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung.
Fri, Oct 23, 5:59 AM · Analytics, SRE-Access-Requests, Operations
Marostegui added a comment to T266250: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung.

Confirmed that @Rmaung is staff by checking via ldap-corp.
@Rmaung we'd also need your manager to sign off on this request.

Fri, Oct 23, 5:59 AM · Analytics, SRE-Access-Requests, Operations
Marostegui updated the task description for T266249: Requesting access to production shell groups for JAnstee.
Fri, Oct 23, 5:55 AM · Analytics, Operations, SRE-Access-Requests
Marostegui added a project to T266249: Requesting access to production shell groups for JAnstee: Analytics.
Fri, Oct 23, 5:54 AM · Analytics, Operations, SRE-Access-Requests
Marostegui moved T266250: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung from Untriaged to Awaiting User Input on the SRE-Access-Requests board.

@KFrancis can you confirm if @Rmaung has a valid NDA signed? I cannot see it on the NDA tracking sheet.

Fri, Oct 23, 5:53 AM · Analytics, SRE-Access-Requests, Operations
Marostegui added a project to T266250: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung: Analytics.
Fri, Oct 23, 5:53 AM · Analytics, SRE-Access-Requests, Operations
Marostegui updated the task description for T266250: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung.
Fri, Oct 23, 5:48 AM · Analytics, SRE-Access-Requests, Operations
Marostegui updated the task description for T266250: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung.
Fri, Oct 23, 5:45 AM · Analytics, SRE-Access-Requests, Operations
Marostegui added a comment to T261411: Add a link engineering: Determine format for accessing and storing link recommendations.

I think this is done, per the last two comments (please correct me if I'm overinterpreting). We'll go with MySQL, x1, one table per wiki, one row per page ("denormalized" format). Thanks all!
We'll file a separate task for the actual schema change.

Fri, Oct 23, 5:42 AM · Growth-Team (Current Sprint), Data-Persistence, DBA, Growth-Structured-Tasks
Marostegui triaged T266249: Requesting access to production shell groups for JAnstee as Medium priority.
Fri, Oct 23, 5:38 AM · Analytics, Operations, SRE-Access-Requests
Marostegui triaged T266250: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung as Medium priority.
Fri, Oct 23, 5:38 AM · Analytics, SRE-Access-Requests, Operations
Marostegui updated the task description for T261914: Enable replication eqiad -> codfw and other checks.
Fri, Oct 23, 5:34 AM · Data-Persistence, DBA
Marostegui updated the task description for T261914: Enable replication eqiad -> codfw and other checks.
Fri, Oct 23, 5:25 AM · Data-Persistence, DBA
Marostegui added a comment to T261410: Add a link engineering: Create MySQL table for caching link recommendations.

Although, looking at https://gerrit.wikimedia.org/r/635912, I don't see the JSON data type, just a mediumblob used to store that sort of data; that would work. I thought you specifically wanted to use the JSON column data type.
Never mind then!
Thanks!

Fri, Oct 23, 5:08 AM · Patch-For-Review, Growth-Team (Current Sprint), Growth-Structured-Tasks
Marostegui added a comment to T261410: Add a link engineering: Create MySQL table for caching link recommendations.

We decided on a JSON-in-MySQL approach. See T261411#6546055 and T261411#6562849.

Fri, Oct 23, 5:06 AM · Patch-For-Review, Growth-Team (Current Sprint), Growth-Structured-Tasks

Thu, Oct 22

Marostegui removed a project from T103084: Have a cron job delete files that haven't been modified in the last X days / months in /data/scratch: Operations.
Thu, Oct 22, 1:25 PM · Cloud-Services
Marostegui moved T266119: mariadb::config: parameterize event_scheduler from Backlog to Acknowledged on the Operations board.
Thu, Oct 22, 1:17 PM · DBA, Operations, User-Kormat
Marostegui moved T265969: Add sbisson to analytics-privatedata-users and create a kerberos identity from Acknowledged to Radar on the Operations board.
Thu, Oct 22, 1:17 PM · Operations, Analytics, SRE-Access-Requests
Marostegui moved T266198: Move labstore1004 and labstore1005 to 10G Ethernet from Backlog to Acknowledged on the Operations board.
Thu, Oct 22, 1:16 PM · ops-eqiad, Data-Services, cloud-services-team (Kanban), Operations
Marostegui moved T266199: Move labstore1005 to 10Gbps rack and ethernet from Backlog to Acknowledged on the Operations board.
Thu, Oct 22, 1:16 PM · ops-eqiad, Data-Services, cloud-services-team (Kanban), Operations
Marostegui moved T266214: Degraded RAID on ms-be2017 from Backlog to Acknowledged on the Operations board.
Thu, Oct 22, 1:15 PM · SRE-swift-storage, Operations, ops-codfw
Marostegui removed a project from T207253: Automatically compare a few tables per section between hosts and DC: User-Banyek.
Thu, Oct 22, 1:14 PM · Sustainability (Incident Followup), Patch-For-Review, DBA
Marostegui updated the task description for T261914: Enable replication eqiad -> codfw and other checks.
Thu, Oct 22, 11:42 AM · Data-Persistence, DBA
Marostegui updated the task description for T261914: Enable replication eqiad -> codfw and other checks.
Thu, Oct 22, 11:33 AM · Data-Persistence, DBA
Marostegui added a comment to T261914: Enable replication eqiad -> codfw and other checks.

The following tables will be checked across all the wikis:

revision rev_id
text old_id
user user_id
change_tag ct_id
actor actor_id
ipblocks ipb_id
comment comment_id
user user_id
watchlist wl_id
text old_id
logging log_id
page page_id
revision rev_id
revision_actor_temp revactor_rev
revision_comment_temp revcomment_rev
slots slot_revision_id
archive ar_id
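
The list above repeats a few entries (revision, text and user each appear twice). A minimal sketch, not part of the original comment, of collapsing such a listing into unique table/column pairs while keeping the original order. The names CHECKED_TABLES and unique_table_keys are made up for illustration.

```python
from typing import List, Tuple

# The table / key-column listing exactly as posted, duplicates included.
CHECKED_TABLES = """\
revision rev_id
text old_id
user user_id
change_tag ct_id
actor actor_id
ipblocks ipb_id
comment comment_id
user user_id
watchlist wl_id
text old_id
logging log_id
page page_id
revision rev_id
revision_actor_temp revactor_rev
revision_comment_temp revcomment_rev
slots slot_revision_id
archive ar_id
"""


def unique_table_keys(listing: str) -> List[Tuple[str, ...]]:
    """Return (table, key column) pairs in first-seen order, without repeats."""
    pairs = (tuple(line.split()) for line in listing.splitlines() if line.strip())
    # dict.fromkeys keeps insertion order and drops later duplicates.
    return list(dict.fromkeys(pairs))
```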
Thu, Oct 22, 11:32 AM · Data-Persistence, DBA
Marostegui created P13051 MDEV-21813.
Thu, Oct 22, 11:21 AM
Marostegui added a comment to T265866: Run check table periodically on backup source hosts.

After running mysqlcheck under close supervision on almost all hosts, I can say this is not as easy as "just setting up a cron and run it every week". The CHECK TABLES command on all tables can take up to 24 hours per host, and it has a heavy impact. We don't have monitoring tuned to handle this, plus it makes backups fail frequently if both run concurrently (at least 3 snapshots failed because of ongoing checks).

I'm not saying we shouldn't do this, but it is going to be even harder to implement than T104459, and a large project to get right, even if restricted to backup source hosts, because of its impact on lag and on taking backups.

Thu, Oct 22, 10:42 AM · Data-Persistence-Backup
Marostegui added a comment to T266214: Degraded RAID on ms-be2017.

Please note that I have disabled the event handler for this host for those checks, so it will need to be re-enabled once the disk is swapped.

Thu, Oct 22, 9:55 AM · SRE-swift-storage, Operations, ops-codfw
Marostegui edited Description on Operations.
Thu, Oct 22, 9:38 AM
Marostegui added a comment to T261914: Enable replication eqiad -> codfw and other checks.

Thanks Stevie for working on that!
We have double-checked independently and replication is running, without GTID, everywhere within core in both directions:

db1081
                  Master_Host: db2090.codfw.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db1083
                  Master_Host: db2112.codfw.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db1086
                  Master_Host: db2118.codfw.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db1100
                  Master_Host: db2123.codfw.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db1103
                   Master_Host: db2096.codfw.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
db1104
                  Master_Host: db2079.codfw.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db1122
                  Master_Host: db2107.codfw.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db1123
                  Master_Host: db2105.codfw.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db1131
                  Master_Host: db2129.codfw.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db2079
                  Master_Host: db1104.eqiad.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db2090
                  Master_Host: db1081.eqiad.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db2096
                   Master_Host: db1103.eqiad.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
db2105
                  Master_Host: db1123.eqiad.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db2107
                  Master_Host: db1122.eqiad.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db2112
                  Master_Host: db1083.eqiad.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db2118
                  Master_Host: db1086.eqiad.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db2123
                  Master_Host: db1100.eqiad.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db2129
                  Master_Host: db1131.eqiad.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
es1021
                   Master_Host: es2021.codfw.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
es1024
                   Master_Host: es2023.codfw.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
es2021
                   Master_Host: es1021.eqiad.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
es2023
                   Master_Host: es1024.eqiad.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
pc1007
                   Master_Host: pc2007.codfw.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
pc1008
                   Master_Host: pc2008.codfw.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
pc1009
                   Master_Host: pc2009.codfw.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
pc2007
                   Master_Host: pc1007.eqiad.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
pc2008
                   Master_Host: pc1008.eqiad.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
pc2009
                   Master_Host: pc1009.eqiad.wmnet
         Seconds_Behind_Master: 0
                    Using_Gtid: No
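
A check like the one above could be scripted roughly as follows. This is only a sketch that parses text in the same shape as the listing (a host name line followed by "Field: value" lines); it does not connect to any database, and the function names are hypothetical.

```python
from typing import Dict, List, Optional

SAMPLE = """\
db1081
                  Master_Host: db2090.codfw.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
db2090
                  Master_Host: db1081.eqiad.wmnet
        Seconds_Behind_Master: 0
                   Using_Gtid: No
"""


def parse_replication_report(report: str) -> Dict[str, Dict[str, str]]:
    """Group 'Field: value' lines under the host name line that precedes them."""
    hosts: Dict[str, Dict[str, str]] = {}
    current: Optional[str] = None
    for raw in report.splitlines():
        line = raw.strip()
        if not line:
            continue
        if ":" in line:
            if current is None:
                continue  # field line before any host header, ignore it
            key, _, value = line.partition(":")
            hosts[current][key.strip()] = value.strip()
        else:
            current = line
            hosts[current] = {}
    return hosts


def unhealthy_hosts(hosts: Dict[str, Dict[str, str]]) -> List[str]:
    """Hosts that are lagging or that report GTID as enabled."""
    return [
        host
        for host, fields in hosts.items()
        if fields.get("Seconds_Behind_Master") != "0"
        or fields.get("Using_Gtid") != "No"
    ]


def missing_reverse_links(hosts: Dict[str, Dict[str, str]]) -> List[str]:
    """Hosts whose master does not replicate back from them."""
    missing = []
    for host, fields in hosts.items():
        master = fields.get("Master_Host", "").split(".")[0]
        back = hosts.get(master, {}).get("Master_Host", "").split(".")[0]
        if back != host:
            missing.append(host)
    return missing
```

For the two-host sample, both unhealthy_hosts and missing_reverse_links return an empty list, which is what "running in both directions, without GTID, no lag" would look like.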
Thu, Oct 22, 9:18 AM · Data-Persistence, DBA
Marostegui added a comment to T266214: Degraded RAID on ms-be2017.
[1814258.987868] sd 0:1:0:8: [sdi] tag#17 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1814258.987874] sd 0:1:0:8: [sdi] tag#17 Sense Key : Hardware Error [current]
[1814258.987880] sd 0:1:0:8: [sdi] tag#17 Add. Sense: Logical unit failure
[1814258.987885] sd 0:1:0:8: [sdi] tag#17 CDB: Read(16) 88 00 00 00 00 01 53 85 d6 00 00 00 00 10 00 00
[1814258.987888] blk_update_request: critical target error, dev sdi, sector 5696247296
[1814258.987894] sd 0:1:0:8: [sdi] tag#10 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1814258.987897] sd 0:1:0:8: [sdi] tag#10 Sense Key : Hardware Error [current]
[1814258.987900] sd 0:1:0:8: [sdi] tag#10 Add. Sense: Logical unit failure
[1814258.987904] sd 0:1:0:8: [sdi] tag#10 CDB: Read(16) 88 00 00 00 00 01 6f 6f c4 c0 00 00 00 08 00 00
[1814258.987906] blk_update_request: critical target error, dev sdi, sector 6164563136
[1814258.987911] sd 0:1:0:8: [sdi] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1814258.987915] sd 0:1:0:8: [sdi] tag#3 Sense Key : Hardware Error [current]
[1814258.987919] sd 0:1:0:8: [sdi] tag#3 Add. Sense: Logical unit failure
[1814258.987922] blk_update_request: critical target error, dev sdi, sector 5888386544
[1814258.987923] sd 0:1:0:8: [sdi] tag#3 CDB: Read(16) 88 00 00 00 00 01 16 93 db 90 00 00 00 08 00 00
[1814258.987926] sd 0:1:0:8: [sdi] tag#5 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1814258.987928] blk_update_request: critical target error, dev sdi, sector 4673756048
[1814258.987930] sd 0:1:0:8: [sdi] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1814258.987932] sd 0:1:0:8: [sdi] tag#4 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1814258.987934] sd 0:1:0:8: [sdi] tag#5 Sense Key : Hardware Error [current]
[1814258.987935] sd 0:1:0:8: [sdi] tag#0 Sense Key : Hardware Error [current]
[1814258.987937] sd 0:1:0:8: [sdi] tag#4 Sense Key : Hardware Error [current]
[1814258.987938] sd 0:1:0:8: [sdi] tag#5 Add. Sense: Logical unit failure
[1814258.987940] sd 0:1:0:8: [sdi] tag#0 Add. Sense: Logical unit failure
[1814258.987941] sd 0:1:0:8: [sdi] tag#4 Add. Sense: Logical unit failure
[1814258.987943] sd 0:1:0:8: [sdi] tag#5 CDB: Read(16) 88 00 00 00 00 00 c2 23 06 10 00 00 00 10 00 00
[1814258.987944] sd 0:1:0:8: [sdi] tag#0 CDB: Read(16) 88 00 00 00 00 00 21 30 78 10 00 00 00 10 00 00
[1814258.987946] blk_update_request: critical target error, dev sdi, sector 3257075216
[1814258.987948] sd 0:1:0:8: [sdi] tag#4 CDB: Read(16) 88 00 00 00 00 00 03 cb 3e 30 00 00 00 08 00 00
[1814258.987950] blk_update_request: critical target error, dev sdi, sector 556824592
[1814258.987952] sd 0:1:0:8: [sdi] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[1814258.987954] blk_update_request: critical target error, dev sdi, sector 63651376
[1814258.987956] sd 0:1:0:8: [sdi] tag#1 Sense Key : Hardware Error [current]
[1814258.987959] sd 0:1:0:8: [sdi] tag#1 Add. Sense: Logical unit failure
[1814258.987963] XFS (sdi1): metadata I/O error: block 0x16f6fc2c0 ("xfs_trans_read_buf_map") error 121 numblks 8
[1814258.987965] sd 0:1:0:8: [sdi] tag#1 CDB: Read(16) 88 00 00 00 00 00 fb aa 35 60 00 00 00 10 00 00
[1814258.987967] blk_update_request: critical target error, dev sdi, sector 4222236000
[1814258.987972] XFS (sdi1): metadata I/O error: block 0x15ef9a3f0 ("xfs_trans_read_buf_map") error 121 numblks 16
[1814258.987977] XFS (sdi1): metadata I/O error: block 0x1708efe30 ("xfs_trans_read_buf_map") error 121 numblks 16
[1814258.987981] XFS (sdi1): xfs_do_force_shutdown(0x1) called from line 315 of file /build/linux-dqnRSc/linux-4.9.228/fs/xfs/xfs_trans_buf.c.  Return address = 0xffffffffc124a058
[1814258.987983] XFS (sdi1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -121.
Thu, Oct 22, 8:25 AM · SRE-swift-storage, Operations, ops-codfw
Marostegui triaged T266214: Degraded RAID on ms-be2017 as Medium priority.
Thu, Oct 22, 8:24 AM · SRE-swift-storage, Operations, ops-codfw
Marostegui closed T198209: Graphite returning 500 @ nagf and graphite url as Resolved.

@Paladox https://nagf.toolforge.org/?project=tools works for me, so maybe this is already gone. Going to close it.
If you feel this still needs investigation, please reopen!

Thu, Oct 22, 8:03 AM · Operations, Cloud-Services, Graphite
Marostegui moved T266119: mariadb::config: parameterize event_scheduler from Triage to Ready on the DBA board.
Thu, Oct 22, 7:15 AM · DBA, Operations, User-Kormat
Marostegui moved T266125: Drop table profiling from WMF wiki mariadb servers from Triage to Ready on the DBA board.
Thu, Oct 22, 7:15 AM · Data-Persistence-Backup, DBA
Marostegui triaged T266147: Port prometheus-openldap-exporter to Python 3 as Medium priority.
Thu, Oct 22, 5:11 AM · Python3-Porting, LDAP, Operations
Marostegui added a comment to T265866: Run check table periodically on backup source hosts.

Unfortunately that is expected. A table rebuild (alter table engine=innodb,force) fixes it in most cases, and then replication should flow again.

Thu, Oct 22, 5:00 AM · Data-Persistence-Backup

Wed, Oct 21

Marostegui committed rOSTD1c9948736099: global_tables: Empty table schemas for global tables (authored by Marostegui).
global_tables: Empty table schemas for global tables
Wed, Oct 21, 3:37 PM
Marostegui triaged T266118: Revisit use of swap and related kernel settings as Medium priority.
Wed, Oct 21, 11:59 AM · User-MoritzMuehlenhoff, Operations
Marostegui edited projects for T266003: orchestrator: Select backend database solution, added: Data-Persistence; removed Operations.

Upgraded db2093 from 10.4.12 to 10.4.15
Rebooted it to pick the new kernels too.

Wed, Oct 21, 11:09 AM · Data-Persistence, User-Kormat, DBA
Marostegui closed T265982: eqiad: New ganeti instance for orchestrator installation, a subtask of T265990: orchestrator: Puppetize , as Resolved.
Wed, Oct 21, 10:46 AM · Patch-For-Review, Operations, User-Kormat, DBA
Marostegui closed T265982: eqiad: New ganeti instance for orchestrator installation as Resolved.

Finally this VM is up and running.

Wed, Oct 21, 10:46 AM · Operations, vm-requests, serviceops
Marostegui removed a project from T266064: Site: 1 VM request for Analytics test cluster: serviceops.
Wed, Oct 21, 10:32 AM · vm-requests, Operations
Marostegui added a comment to T264703: Race condition when re-importing a logical backup and a new one is generated.

Aside from that, what I can do is add a check just before rotation to "latest" to see if there is something "reading" the dir and kill it before moving it to latest? Maybe restrict it to myloader pid/recovery script?

Wed, Oct 21, 9:51 AM · Data-Persistence, Data-Persistence-Backup
Marostegui closed T266060: Add Nahid to WMF-NDA group as Resolved.

Indeed, I was following the wrong procedure I believe.
I have added @Nahid to WMF-NDA

Wed, Oct 21, 9:15 AM · WMF-NDA-Requests
Marostegui added a member for WMF-NDA: Nahid.
Wed, Oct 21, 9:15 AM
Marostegui added a comment to T265490: rate limited etherpad.

Same here, I am not being rate limited anymore.

Wed, Oct 21, 8:03 AM · Patch-For-Review, Operations, Wikimedia-Etherpad
Marostegui moved T266086: Nuria's volunteer account from Awaiting User Input to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.
Wed, Oct 21, 7:30 AM · Analytics, Operations, SRE-Access-Requests
Marostegui moved T266086: Nuria's volunteer account from Untriaged to Awaiting User Input on the SRE-Access-Requests board.
Wed, Oct 21, 7:30 AM · Analytics, Operations, SRE-Access-Requests
Marostegui reassigned T265835: Create a mailing list for frwiki nominators from Marostegui to Kvardek_du.
Wed, Oct 21, 7:30 AM · Operations, Wikimedia-Mailing-lists
Marostegui updated subscribers of T266086: Nuria's volunteer account.

Thanks Luca!
@Nuria can we get a manager to approve this as well? @faidon or @mark maybe?
Further, can you sign https://phabricator.wikimedia.org/L2

Wed, Oct 21, 6:56 AM · Analytics, Operations, SRE-Access-Requests
Marostegui placed T264900: Prepare and check storage layer for smnwiki up for grabs.
Wed, Oct 21, 6:04 AM · User-bd808, cloud-services-team (Kanban), Data-Services, DBA
Marostegui added a comment to T264703: Race condition when re-importing a logical backup and a new one is generated.

So the second part is kinda expected: "Running myloader..." will indicate that the process has started, and it won't finish as long as the underlying myloader hasn't finished... which, unless I understood incorrectly, it hadn't (it was blocked)?

Wed, Oct 21, 5:49 AM · Data-Persistence, Data-Persistence-Backup
Marostegui added a comment to T265323: Add toil::systemd_scope_cleanup to dbprov hosts.

@Marostegui 2 questions:

  • When you said:

the disk went full

only full in activity, not on disk space, right?

Wed, Oct 21, 5:34 AM · Data-Persistence-Backup, Operations, SRE-tools
Marostegui created P13039 (An Untitled Masterwork).
Wed, Oct 21, 5:30 AM
Marostegui updated the task description for T265344: Monitor the growth of CheckUser tables at large wikis.
Wed, Oct 21, 5:26 AM · Data-Persistence, DBA
Marostegui added a comment to T265344: Monitor the growth of CheckUser tables at large wikis.

@Huji thanks for the ping. I have a calendar alert for this, but yesterday I was super busy and I couldn't do it, but it is on my radar.

Wed, Oct 21, 5:21 AM · Data-Persistence, DBA
Marostegui triaged T266060: Add Nahid to WMF-NDA group as Medium priority.
Wed, Oct 21, 5:17 AM · WMF-NDA-Requests
Marostegui added a comment to T266060: Add Nahid to WMF-NDA group.

@jrbs are you @Nahid's manager? We'd need that signed off by them.

Wed, Oct 21, 5:17 AM · WMF-NDA-Requests
Marostegui triaged T266086: Nuria's volunteer account as Medium priority.

@MoritzMuehlenhoff can you advise on what the process is for handling this?
I guess we need to follow https://wikitech.wikimedia.org/wiki/Volunteer_NDA ?

Wed, Oct 21, 5:14 AM · Analytics, Operations, SRE-Access-Requests
Marostegui triaged T266075: Using $facts['networking']['ip'] breaks puppet on cloud hosts as Medium priority.
Wed, Oct 21, 5:12 AM · cloud-services-team (Kanban), Puppet, Operations
Marostegui triaged T266064: Site: 1 VM request for Analytics test cluster as Medium priority.
Wed, Oct 21, 5:12 AM · vm-requests, Operations
Marostegui triaged T266023: orchestrator: Get packages into WMF apt as Medium priority.
Wed, Oct 21, 5:11 AM · Operations, User-Kormat, DBA
Marostegui added a comment to T265135: wikireplicas: Define MW sections per host.

We could switch from model a to model b, but that means we'd need to repopulate the data across the hosts, which means downtime for them. I would estimate around 5-7 days if all goes well.

Wed, Oct 21, 5:07 AM · Data-Services, cloud-services-team (Kanban)

Tue, Oct 20

Marostegui added a comment to T263587: CAPEX for ParserCache for Parsoid.

I'm going by the Dell quotes for the hw, backtracking from the racking task. If those are wrong, can $someone point me to the right ones?

I went by cat /sys/block/sda/queue/rotational and memory, maybe I am wrong.
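
For reference, the rotational flag mentioned above is "1" for spinning disks and "0" for non-rotational devices such as SSDs. A minimal sketch of reading it, with hypothetical helper names; note that a device behind a hardware RAID controller may not report the flag of the underlying drives, so this is a hint rather than proof.

```python
from pathlib import Path


def parse_rotational_flag(text: str) -> bool:
    """True if the sysfs flag says the device is rotational (a spinning disk)."""
    return text.strip() == "1"


def is_rotational(device: str, sys_block: Path = Path("/sys/block")) -> bool:
    """Read <sys_block>/<device>/queue/rotational, e.g. for device 'sda'."""
    flag_file = sys_block / device / "queue" / "rotational"
    return parse_rotational_flag(flag_file.read_text())
```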

Tue, Oct 20, 1:47 PM · DBA, serviceops, Platform Team Workboards (Green), MediaWiki-Parser, Parsoid
Marostegui added a comment to T263587: CAPEX for ParserCache for Parsoid.

@ArielGlenn the current parsercache hosts run SSDs.

Tue, Oct 20, 1:35 PM · DBA, serviceops, Platform Team Workboards (Green), MediaWiki-Parser, Parsoid
Marostegui added a comment to T263587: CAPEX for ParserCache for Parsoid.

I guess we have to begin here.

TLDR of the problem is that we will not have enough space in MySQL for ParserCache for transitioning from old PHP parser to Parsoid. We would need to roughly double the storage capacity of the cache and very roughly triple throughput.

We have 3 options:

  1. Buy more hardware for ParserCache, keep using MySQL. The downside of this is procurement time and, more importantly, that once the transition to Parsoid is complete in several years we will end up with a drastically over-provisioned cluster.
Tue, Oct 20, 1:18 PM · DBA, serviceops, Platform Team Workboards (Green), MediaWiki-Parser, Parsoid
Marostegui moved T266002: orchestrator: integrate promotion rules into puppet from Triage to Refine on the DBA board.
Tue, Oct 20, 12:55 PM · Operations, User-Kormat, DBA
Marostegui triaged T266002: orchestrator: integrate promotion rules into puppet as Medium priority.

It is especially important to specify hosts that should never be masters

Tue, Oct 20, 12:55 PM · Operations, User-Kormat, DBA
Marostegui updated the task description for T261914: Enable replication eqiad -> codfw and other checks.
Tue, Oct 20, 12:51 PM · Data-Persistence, DBA
Marostegui added a comment to T170298: sshd stretch puppet support.

On buster UsePrivilegeSeparation is deprecated

Tue, Oct 20, 12:13 PM · Patch-For-Review, Operations
Marostegui closed T143556: Setting up grafana should also setup Anonymous read-only access for the default org as Resolved.

Per our IRC chat, let's close this for now!

Tue, Oct 20, 11:52 AM · observability, Cloud-Services, Operations
Marostegui added a comment to T236292: php-fpm invalid opcode on mw1317.

@jijiki what do you want to do with this task?

Tue, Oct 20, 11:43 AM · Operations, serviceops
Marostegui closed T243149: Increased latency in CODFW API and APP monitoring urls (~07:20 UTC 19 Jan 2020) as Resolved.

I am going to close this, as there is not much else we can really do here and it looks like a one-time thing.

Tue, Oct 20, 11:41 AM · Performance-Team (Radar), serviceops, Operations
Marostegui lowered the priority of T226908: ops-monitoring-bot creating dupes from Medium to Low.
Tue, Oct 20, 11:38 AM · SRE-tools, Icinga, observability, Operations
Marostegui added a subtask for T253810: Alert on ECC warnings in SEL: T197084: Report problems found in server's IPMI SEL.
Tue, Oct 20, 11:38 AM · User-MoritzMuehlenhoff, Wikimedia-Incident, observability, Operations
Marostegui added a parent task for T197084: Report problems found in server's IPMI SEL: T253810: Alert on ECC warnings in SEL.
Tue, Oct 20, 11:37 AM · Operations, observability
Marostegui added a comment to T254011: Why do we have 2 sets of squid proxies?.

@Dzahn good to close after Alex's answer?

Tue, Oct 20, 11:33 AM · Operations
Marostegui removed a project from T224475: Return sulfur to spares: Operations.
Tue, Oct 20, 11:30 AM · Operations, decommission-hardware, ops-eqiad
Marostegui added a comment to T234698: ms-be1020 - firmware upgrade: (was: host went down).

Is this firmware upgrade still needed?

Tue, Oct 20, 11:27 AM · ops-eqiad, SRE-swift-storage, Operations
Marostegui closed T174432: Unclear LVS bandwidth graph in "load balancers" dashboard as Resolved.

Per the last two comments, looks like this is fixed.

Tue, Oct 20, 11:17 AM · Traffic, Operations
Marostegui closed T185306: ms-be2023 unresponsive while rebuilding one disk as Resolved.

Unlikely it can be reproduced again, closing!
Reopen if you feel it still needs work

Tue, Oct 20, 11:10 AM · SRE-swift-storage, Operations
Marostegui triaged T265620: Rename an-scheduler1001 to an-coord1002 as Medium priority.
Tue, Oct 20, 10:58 AM · Operations, Analytics-Clusters
Marostegui closed T101980: Icinga alert for labnet1001 for conntrack saturation graphite check as Resolved.

This host is no more: T221818: Decommission labnet1001 & labnet1002

Tue, Oct 20, 10:56 AM · Operations, Cloud-Services
Marostegui closed T116627: Include 5xx numbers in fluorine fatalmonitor as Resolved.

Closing per: T116627#3452914

Tue, Oct 20, 10:37 AM · Operations