Marostegui (Manuel Aróstegui)
User

User Details

User Since
Sep 1 2016, 6:48 AM (170 w, 4 d)
Availability
Available
IRC Nick
marostegui
LDAP User
Marostegui
MediaWiki User
MArostegui (WMF)

TZ: UTC +1/+2

Recent Activity

Today

Marostegui updated subscribers of T240177: backup2001 rebooted itself.
Mon, Dec 9, 6:42 AM · Operations, DBA
Marostegui created T240177: backup2001 rebooted itself.
Mon, Dec 9, 6:41 AM · Operations, DBA

Thu, Dec 5

Marostegui added a comment to T239901: Disallow 'weight: 0' for MW db config in dbctl.

I am not sure if I want to fully disallow weight 0 for replicas, there are some cases where we might actually want that,
Cross posting from: T239900

Thu, Dec 5, 5:25 PM · Operations, DBA, Wikimedia-Incident
Marostegui added a comment to T239900: Sync understanding of MediaWiki rdbms 'weight' behaviour with DBAs.
  1. Does MediaWiki as used by WMF (wikimedia/rdbms, LBFactoryMulti) support giving the master db non-zero weight in terms of read queries intended for replicas?

From looking through the code, it looks like the answer is "Yes". Unless I missed something, this code is very complex.

Thu, Dec 5, 5:21 PM · Core Platform Team Workboards (Clinic Duty Team), DBA, Wikimedia-Rdbms
Marostegui added a comment to T239900: Sync understanding of MediaWiki rdbms 'weight' behaviour with DBAs.

I would definitely like to be able to give read traffic to the master. It won't happen often and in 3 years I only recall once or twice where we gave some read traffic to the master (usually during emergencies or where multiple slaves need to go for maintenance), but we should have the possibility.

Thu, Dec 5, 1:22 PM · Core Platform Team Workboards (Clinic Duty Team), DBA, Wikimedia-Rdbms
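
The weight discussion in T239900 can be illustrated with a minimal sketch of weight-proportional host selection. This is not MediaWiki's rdbms code; the function name `pick_read_host` and the host names and weights are hypothetical, chosen only to show that a master with weight 0 receives no read traffic while a master with a non-zero weight can.

```python
import random


def pick_read_host(loads, rng=random):
    """Pick one host with probability proportional to its weight.

    Hosts with weight 0 are never picked. `loads` maps host -> weight.
    This is an illustrative sketch, not the MediaWiki load balancer.
    """
    candidates = {host: weight for host, weight in loads.items() if weight > 0}
    if not candidates:
        raise ValueError("no host with a non-zero weight")
    hosts = list(candidates)
    weights = [candidates[host] for host in hosts]
    return rng.choices(hosts, weights=weights, k=1)[0]


# Hypothetical weights: the master takes no reads in normal operation.
normal = {"master": 0, "replica-a": 200, "replica-b": 400}

# Hypothetical emergency: replicas are out for maintenance, so the
# master is given a non-zero weight and becomes eligible for reads.
emergency = {"master": 100, "replica-a": 0, "replica-b": 0}

print(pick_read_host(normal))
print(pick_read_host(emergency))
```

With the `normal` mapping only the two replicas can ever be returned; with the `emergency` mapping the master is the only candidate.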
Marostegui added a comment to T239877: After deploy of 1.35.0-wmf.8 to group1, surge of "Cannot access the database: Unknown error".

Given the dbs (in s7) I highly doubt it's wikidata but I also want to mention that s7 has only frwiktionary and metawiki as their group1 wikis. Is there anything special about those?

Thu, Dec 5, 10:22 AM · MW-1.35-notes (1.35.0-wmf.8; 2019-11-26), Core Platform Team Workboards (Clinic Duty Team), DBA, Wikimedia-production-error
Marostegui added a comment to T239877: After deploy of 1.35.0-wmf.8 to group1, surge of "Cannot access the database: Unknown error".

These queries showed up at the time of the incident, and I cannot see them happening before it:
{P9822}

Thu, Dec 5, 10:20 AM · MW-1.35-notes (1.35.0-wmf.8; 2019-11-26), Core Platform Team Workboards (Clinic Duty Team), DBA, Wikimedia-production-error
Marostegui updated the task description for T239791: DB: perform rolling restart of mariadb deamons to pick up CA changes.
Thu, Dec 5, 9:18 AM · DBA, User-jbond, Puppet, Operations
Marostegui updated the task description for T239791: DB: perform rolling restart of mariadb deamons to pick up CA changes.
Thu, Dec 5, 9:10 AM · DBA, User-jbond, Puppet, Operations
Marostegui claimed T239791: DB: perform rolling restart of mariadb deamons to pick up CA changes.
Thu, Dec 5, 8:41 AM · DBA, User-jbond, Puppet, Operations
Marostegui updated the task description for T239791: DB: perform rolling restart of mariadb deamons to pick up CA changes.
Thu, Dec 5, 8:25 AM · DBA, User-jbond, Puppet, Operations
Marostegui updated the task description for T239188: Decommission db1062.eqiad.wmnet.
Thu, Dec 5, 8:07 AM · DBA
Marostegui moved T239046: decommission db2065.codfw.wmnet from Backlog to pending onsite steps (codfw) on the decommission board.
Thu, Dec 5, 7:46 AM · Operations, DC-Ops, ops-codfw, decommission
Marostegui updated the task description for T228258: Decommission db2043-db2070.
Thu, Dec 5, 7:46 AM · Operations, DBA
Marostegui updated the task description for T239791: DB: perform rolling restart of mariadb deamons to pick up CA changes.
Thu, Dec 5, 7:13 AM · DBA, User-jbond, Puppet, Operations
Marostegui added a comment to T239046: decommission db2065.codfw.wmnet.

Host ready for @Papaul

Thu, Dec 5, 6:54 AM · Operations, DC-Ops, ops-codfw, decommission
Marostegui reassigned T239046: decommission db2065.codfw.wmnet from Marostegui to Papaul.
Thu, Dec 5, 6:54 AM · Operations, DC-Ops, ops-codfw, decommission
Marostegui added a comment to T239877: After deploy of 1.35.0-wmf.8 to group1, surge of "Cannot access the database: Unknown error".

I have not been able to find errors for those two hosts outside of the spikes, which I believe means that they are normally working fine and only had issues during those two spikes, which match the train.
First spike starts at 20:15 (per logstash https://logstash.wikimedia.org/goto/cebe10b2507c977c10fd5bace04d937c) which matches:

20:14 brennen@deploy1001: Synchronized php: group1 wikis to 1.35.0-wmf.8 (duration: 01m 29s)
Thu, Dec 5, 6:27 AM · MW-1.35-notes (1.35.0-wmf.8; 2019-11-26), Core Platform Team Workboards (Clinic Duty Team), DBA, Wikimedia-production-error
Marostegui added a comment to T239217: Degraded RAID on dbstore1003.

It looks like disk #4:

root@dbstore1003:~# megacli -LDPDInfo -aAll
Thu, Dec 5, 6:13 AM · Analytics, ops-eqiad, Operations
Marostegui updated the task description for T239791: DB: perform rolling restart of mariadb deamons to pick up CA changes.
Thu, Dec 5, 6:01 AM · DBA, User-jbond, Puppet, Operations
Marostegui moved T239877: After deploy of 1.35.0-wmf.8 to group1, surge of "Cannot access the database: Unknown error" from Triage to In progress on the DBA board.
Thu, Dec 5, 6:00 AM · MW-1.35-notes (1.35.0-wmf.8; 2019-11-26), Core Platform Team Workboards (Clinic Duty Team), DBA, Wikimedia-production-error
Marostegui added a comment to T239877: After deploy of 1.35.0-wmf.8 to group1, surge of "Cannot access the database: Unknown error".

I believe db1094 and db1136 being unreachable was the consequence and not the cause.
Both hosts had a huge spike in connections:
db1094: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=1575476805856&to=1575515716494&var-dc=eqiad%20prometheus%2Fops&var-server=db1094&var-port=9104&fullscreen&panelId=37
db1136: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=1575476805856&to=1575515716494&fullscreen&panelId=37&var-dc=eqiad%20prometheus%2Fops&var-server=db1136&var-port=9104

Thu, Dec 5, 5:56 AM · MW-1.35-notes (1.35.0-wmf.8; 2019-11-26), Core Platform Team Workboards (Clinic Duty Team), DBA, Wikimedia-production-error
Marostegui updated the task description for T232446: Compress new Wikibase tables.
Thu, Dec 5, 5:45 AM · DBA
Marostegui added a comment to T235599: Recompress special slaves across eqiad and codfw.

All done

Thu, Dec 5, 5:45 AM · DBA
Marostegui closed T235599: Recompress special slaves across eqiad and codfw as Resolved.
Thu, Dec 5, 5:45 AM · DBA
Marostegui updated the task description for T233135: Schema change for refactored actor and comment storage.
Thu, Dec 5, 5:43 AM · Core Platform Team, Blocked-on-schema-change, DBA
Marostegui added a comment to T239874: MediaWiki: "host db1062 is unreachable" (Connection refused).

Thanks for tackling this and apologies for not depooling it when I should've.
db1062 stopped being a master a week ago

--- eqiad/readOnlyBySection live
+++ eqiad/readOnlyBySection generated
@@ -1,3 +1 @@
-{
- "s7": "Maintenance till 06:30AM UTC T238044"
-}
+{}
--- eqiad/sectionLoads live
+++ eqiad/sectionLoads generated
@@ -78,25 +78,25 @@
 "db1085": 300,
 "db1088": 500,
 "db1093": 400,
 "db1096:3316": 1,
 "db1098:3316": 1,
 "db1113:3316": 1
 }
 ],
 "s7": [
 {
- "db1062": 0
+ "db1086": 0
 },
 {
+ "db1062": 0,
 "db1079": 200,
- "db1086": 0,
 "db1090:3317": 1,
 "db1094": 500,
 "db1098:3317": 150,
 "db1101:3317": 150,
 "db1136": 400
 }
 ],
 "s8": [
 {
 "db1109": 0

Thu, Dec 5, 5:42 AM · DBA, Wikimedia-production-error
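
To make the T239874 diff easier to read, here is a minimal sketch that splits one section of a sectionLoads structure into its master and its replicas and lists the replicas with weight 0. It assumes that the first mapping in each section holds the master and the second holds the replicas, which is how the diff and the comment about db1062 read; the helper name `split_section` is hypothetical, and the data is the "generated" side of the s7 hunk.

```python
def split_section(section_loads, section):
    """Return (master, replicas) for one section.

    Assumes the layout seen in the diff: a two-element list whose
    first mapping contains exactly one host (the master) and whose
    second mapping contains the replicas with their weights.
    """
    master_group, replica_group = section_loads[section]
    (master,) = master_group  # fails loudly unless there is exactly one key
    return master, dict(replica_group)


# The "generated" side of the s7 hunk in the diff.
generated = {
    "s7": [
        {"db1086": 0},
        {
            "db1062": 0,
            "db1079": 200,
            "db1090:3317": 1,
            "db1094": 500,
            "db1098:3317": 150,
            "db1101:3317": 150,
            "db1136": 400,
        },
    ]
}

master, replicas = split_section(generated, "s7")
zero_weight = [host for host, weight in replicas.items() if weight == 0]
print(master)       # db1086
print(zero_weight)  # ['db1062']
```

In this state db1062 is still listed as a replica with weight 0, which matches the situation described in the comment: the host had stopped being the master but had not been depooled.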

Wed, Dec 4

Marostegui added a comment to T143870: Some mw snapshot hosts are accessing main db servers.

For what it's worth, confirming that db1118 is not configured on vslow/dumps:

root@cumin1001:~#  dbctl instance db1118 get
{
    "db1118": {
        "host_ip": "10.64.16.12",
        "note": "",
        "port": 3306,
        "sections": {
            "s1": {
                "percentage": 100,
                "pooled": true,
                "weight": 500
            }
        }
    },
    "tags": "datacenter=eqiad"
}
Wed, Dec 4, 7:09 PM · MW-1.35-notes (1.35.0-wmf.10; 2019-12-10), Dumps-Generation, DBA
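
The check in T143870 can be sketched as a small script over the JSON printed above. The sketch assumes that group membership such as vslow or dumps would appear under a "groups" key inside a section entry; that key does not appear in the output shown, so treat it as an assumption about the dbctl schema. The function name `groups_for` is hypothetical.

```python
import json

# The output of `dbctl instance db1118 get` as shown in the comment.
DBCTL_OUTPUT = """
{
    "db1118": {
        "host_ip": "10.64.16.12",
        "note": "",
        "port": 3306,
        "sections": {
            "s1": {
                "percentage": 100,
                "pooled": true,
                "weight": 500
            }
        }
    },
    "tags": "datacenter=eqiad"
}
"""


def groups_for(instance_json, name):
    """Collect group names configured for an instance across its sections.

    Assumption: groups live under a "groups" key in each section entry.
    A section without that key contributes no groups.
    """
    data = json.loads(instance_json)
    found = set()
    for section in data[name]["sections"].values():
        found.update(section.get("groups", {}))
    return found


groups = groups_for(DBCTL_OUTPUT, "db1118")
print(groups & {"vslow", "dumps"})  # set()
```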
Marostegui updated the task description for T239791: DB: perform rolling restart of mariadb deamons to pick up CA changes.
Wed, Dec 4, 2:02 PM · DBA, User-jbond, Puppet, Operations
Marostegui updated the task description for T239791: DB: perform rolling restart of mariadb deamons to pick up CA changes.
Wed, Dec 4, 1:59 PM · DBA, User-jbond, Puppet, Operations
Marostegui moved T239814: Automate DB upgrades from Triage to Backlog on the DBA board.
Wed, Dec 4, 1:50 PM · DBA
Marostegui triaged T239814: Automate DB upgrades as Medium priority.
Wed, Dec 4, 1:49 PM · DBA
Marostegui created T239814: Automate DB upgrades.
Wed, Dec 4, 1:48 PM · DBA
Marostegui added a comment to T239791: DB: perform rolling restart of mariadb deamons to pick up CA changes.

Some of them will probably be covered by the reimage to buster, I would say.

Wed, Dec 4, 11:19 AM · DBA, User-jbond, Puppet, Operations
Marostegui added a subtask for T239791: DB: perform rolling restart of mariadb deamons to pick up CA changes: T239238: Switchover s8 primary database master db1109 -> db1104 - Date TBD.
Wed, Dec 4, 11:01 AM · DBA, User-jbond, Puppet, Operations
Marostegui added a parent task for T239238: Switchover s8 primary database master db1109 -> db1104 - Date TBD: T239791: DB: perform rolling restart of mariadb deamons to pick up CA changes.
Wed, Dec 4, 11:01 AM · DBA
Marostegui updated the task description for T239791: DB: perform rolling restart of mariadb deamons to pick up CA changes.
Wed, Dec 4, 10:58 AM · DBA, User-jbond, Puppet, Operations
Marostegui moved T239791: DB: perform rolling restart of mariadb deamons to pick up CA changes from Triage to In progress on the DBA board.
Wed, Dec 4, 10:47 AM · DBA, User-jbond, Puppet, Operations
Marostegui added a comment to T236277: Extend Puppet CA Expiry date .

Checks performed:

Wed, Dec 4, 10:45 AM · DBA, Patch-For-Review, User-jbond, Puppet, Operations
Marostegui moved T238399: Reimport wikidatawiki.{pagelinks,page} on labsdb1010 from Next to In progress on the DBA board.
Wed, Dec 4, 8:59 AM · Data-Services, DBA
Marostegui updated the task description for T238399: Reimport wikidatawiki.{pagelinks,page} on labsdb1010.
Wed, Dec 4, 8:58 AM · Data-Services, DBA
Marostegui added a comment to T238399: Reimport wikidatawiki.{pagelinks,page} on labsdb1010.

wikidatawiki.page has been reimported from labsdb1012.
Views for that table have been recreated as well.

Wed, Dec 4, 8:58 AM · Data-Services, DBA
Marostegui updated the task description for T239684: Decommission db2070.codfw.wmnet.
Wed, Dec 4, 6:10 AM · DBA
Marostegui moved T239046: decommission db2065.codfw.wmnet from Next to In progress on the DBA board.
Wed, Dec 4, 6:07 AM · Operations, DC-Ops, ops-codfw, decommission
Marostegui updated the task description for T233135: Schema change for refactored actor and comment storage.
Wed, Dec 4, 5:57 AM · Core Platform Team, Blocked-on-schema-change, DBA
Marostegui added a comment to T229686: #dbctl: manage 'externalLoads' data.

Thank you! :-)

Wed, Dec 4, 5:57 AM · Performance-Team, DBA, conftool

Tue, Dec 3

Marostegui updated the task description for T239453: Remove partitions from revision table.
Tue, Dec 3, 10:31 AM · DBA
Marostegui moved T239188: Decommission db1062.eqiad.wmnet from Next to In progress on the DBA board.
Tue, Dec 3, 10:25 AM · DBA
Marostegui updated the task description for T239046: decommission db2065.codfw.wmnet.
Tue, Dec 3, 9:45 AM · Operations, DC-Ops, ops-codfw, decommission
Marostegui updated the task description for T239188: Decommission db1062.eqiad.wmnet.
Tue, Dec 3, 8:27 AM · DBA
Marostegui added a comment to T235743: Prepare and check storage layer for mnwwiki.

Change 554159 merged by Phamhi:
[operations/puppet@production] wmcs: don't process lines starting with a comment
https://gerrit.wikimedia.org/r/554159

Tue, Dec 3, 7:38 AM · cloud-services-team (Kanban), Data-Services, DBA
Marostegui updated the task description for T228258: Decommission db2043-db2070.
Tue, Dec 3, 6:22 AM · Operations, DBA
Marostegui triaged T239684: Decommission db2070.codfw.wmnet as Medium priority.
Tue, Dec 3, 6:22 AM · DBA
Marostegui created T239684: Decommission db2070.codfw.wmnet.
Tue, Dec 3, 6:21 AM · DBA
Marostegui closed T208323: Predictive failures on disk S.M.A.R.T. status as Resolved.

All these hosts have been sent for decommissioning.
Going to close this for now.

Tue, Dec 3, 5:57 AM · Operations, DBA
Marostegui updated the task description for T208323: Predictive failures on disk S.M.A.R.T. status.
Tue, Dec 3, 5:56 AM · Operations, DBA
Marostegui updated the task description for T228258: Decommission db2043-db2070.
Tue, Dec 3, 5:56 AM · Operations, DBA
Marostegui moved T233185: Decommission db2067.codfw.wmnet from Backlog to pending onsite steps (codfw) on the decommission board.

host ready for @Papaul to take over the last steps

Tue, Dec 3, 5:56 AM · Patch-For-Review, Operations, DC-Ops, ops-codfw, decommission
Marostegui reassigned T233185: Decommission db2067.codfw.wmnet from Marostegui to Papaul.
Tue, Dec 3, 5:56 AM · Patch-For-Review, Operations, DC-Ops, ops-codfw, decommission
Marostegui updated the task description for T234704: Remove ar_comment from sanitarium triggers.
Tue, Dec 3, 5:48 AM · DBA

Mon, Dec 2

Marostegui added a comment to T239217: Degraded RAID on dbstore1003.

Any update on this? Thanks!

Mon, Dec 2, 5:02 PM · Analytics, ops-eqiad, Operations
Marostegui updated the task description for T234704: Remove ar_comment from sanitarium triggers.
Mon, Dec 2, 3:00 PM · DBA
Marostegui closed T238113: Repurpose db1107 as a generic database, a subtask of T234826: Repurpose db1108 as generic Analytics db replica, as Resolved.
Mon, Dec 2, 2:26 PM · Patch-For-Review, User-Elukey, Analytics-Kanban, Analytics
Marostegui closed T238113: Repurpose db1107 as a generic database, a subtask of T159170: Sunset MySQL data store for eventlogging, as Resolved.
Mon, Dec 2, 2:26 PM · Analytics-Kanban, Analytics-EventLogging
Marostegui closed T238113: Repurpose db1107 as a generic database as Resolved.

db1107 has been reimaged to buster and placed on test-s1 with MariaDB 10.3.20, replicating from the enwiki master and acting as an intermediate master for db1114 (running percona mysql 8).

Mon, Dec 2, 2:26 PM · Analytics, DBA
Marostegui added a comment to T193224: Evaluate and decide the future of relational datastore at WMF after the upgrade of MariaDB 10.1 is finished.

db1107 is now running the latest MariaDB release, 10.3.20, and replicating from the s1 master; db1114 (which runs percona-server 8.0) replicates from it.

Mon, Dec 2, 2:25 PM · MediaWiki-General, Operations, DBA
Marostegui reopened T239041: cp3053 is unreachable, a subtask of T238305: servers freeze across the caching cluster, as Open.
Mon, Dec 2, 10:25 AM · Traffic, Operations
Marostegui reopened T239041: cp3053 is unreachable as "Open".

This host went down again:

And [10:23:27] <+icinga-wm> PROBLEM - Host cp3053 is DOWN: PING CRITICAL - Packet loss = 100%
Mon, Dec 2, 10:25 AM · Operations, Traffic, ops-esams
Marostegui added a comment to T238305: servers freeze across the caching cluster.

And [10:23:27] <+icinga-wm> PROBLEM - Host cp3053 is DOWN: PING CRITICAL - Packet loss = 100% which already failed: T239041

Mon, Dec 2, 10:24 AM · Traffic, Operations
Marostegui added a comment to T229686: #dbctl: manage 'externalLoads' data.

Any rough ETA on when externalLoads will be able to be handled by dbctl?

Mon, Dec 2, 10:10 AM · Performance-Team, DBA, conftool
Marostegui updated the task description for T233135: Schema change for refactored actor and comment storage.
Mon, Dec 2, 6:26 AM · Core Platform Team, Blocked-on-schema-change, DBA
Marostegui added a comment to T233135: Schema change for refactored actor and comment storage.

s3 eqiad progress

  • labsdb1012
  • labsdb1011
  • labsdb1010
  • labsdb1009
  • dbstore1004
  • db1124
  • db1123
  • db1112
  • db1095
  • db1078
  • db1075
Mon, Dec 2, 5:55 AM · Core Platform Team, Blocked-on-schema-change, DBA
Marostegui removed a project from T239524: My Toolforge bot can't execute an SQL request that quickly executes on Quarry: DBA.

I have tested both queries on the hosts themselves and they work fine (as expected, given the Quarry results you pasted). Could it be something with your client/connectors? Are you running the latest versions?

Mon, Dec 2, 5:51 AM · Data-Services
Marostegui added a comment to T239238: Switchover s8 primary database master db1109 -> db1104 - Date TBD.

Absolutely! Thanks

Mon, Dec 2, 5:47 AM · DBA

Sat, Nov 30

Marostegui added a comment to T238305: servers freeze across the caching cluster.
06:13:59 <+icinga-wm> PROBLEM - Host cp3057 is DOWN: PING CRITICAL - Packet loss = 100
Sat, Nov 30, 7:02 AM · Traffic, Operations

Fri, Nov 29

Marostegui updated the task description for T233135: Schema change for refactored actor and comment storage.
Fri, Nov 29, 6:25 PM · Core Platform Team, Blocked-on-schema-change, DBA
Marostegui added a comment to T239453: Remove partitions from revision table.

MariaDB confirmed it is a bug: https://jira.mariadb.org/browse/MDEV-21176?focusedCommentId=138891&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-138891

Fri, Nov 29, 1:41 PM · DBA
Marostegui updated the task description for T234704: Remove ar_comment from sanitarium triggers.
Fri, Nov 29, 9:44 AM · DBA
Marostegui updated the task description for T233135: Schema change for refactored actor and comment storage.
Fri, Nov 29, 9:20 AM · Core Platform Team, Blocked-on-schema-change, DBA
Marostegui added a comment to T239453: Remove partitions from revision table.

I have put the ALTERs in separate transactions, because I am seeing something weird when testing it in my lab. I have reported it to MariaDB just in case: https://jira.mariadb.org/browse/MDEV-21176

Fri, Nov 29, 9:17 AM · DBA
Marostegui updated the task description for T239453: Remove partitions from revision table.
Fri, Nov 29, 8:34 AM · DBA
Marostegui created P9782 (An Untitled Masterwork).
Fri, Nov 29, 8:34 AM
Marostegui added a comment to T223151: Review special replica partitioning of certain tables by `xx_user`.

Nothing has shown up with db1089, so I believe we should slowly start moving forward with removing the revision table partitions from those wikis where they exist.
I have created T239453 to track/discuss

Fri, Nov 29, 7:13 AM · mariadb-optimizer-bug, Core Platform Team, MW-1.34-notes (1.34.0-wmf.24; 2019-09-24), Performance Issue, DBA
Marostegui triaged T239453: Remove partitions from revision table as Medium priority.
Fri, Nov 29, 7:10 AM · DBA
Marostegui created T239453: Remove partitions from revision table.
Fri, Nov 29, 7:10 AM · DBA
Marostegui updated the task description for T232446: Compress new Wikibase tables.
Fri, Nov 29, 6:49 AM · DBA
Marostegui updated the task description for T228258: Decommission db2043-db2070.
Fri, Nov 29, 6:19 AM · Operations, DBA
Marostegui moved T238297: decommission db1067.eqiad.wmnet from Backlog to pending onsite steps (eqiad) on the decommission board.
Fri, Nov 29, 6:18 AM · DC-Ops, decommission, ops-eqiad, Operations
Marostegui moved T238624: Decommission db1061.eqiad.wmnet from Backlog to pending onsite steps (eqiad) on the decommission board.
Fri, Nov 29, 6:18 AM · Operations, DC-Ops, ops-eqiad, decommission
Marostegui moved T238726: Decommission db2062.codfw.wmnet from Backlog to pending onsite steps (codfw) on the decommission board.

Host ready for @Papaul to take over

Fri, Nov 29, 6:18 AM · Patch-For-Review, Operations, DC-Ops, ops-codfw, decommission
Marostegui updated the task description for T238726: Decommission db2062.codfw.wmnet.
Fri, Nov 29, 6:17 AM · Patch-For-Review, Operations, DC-Ops, ops-codfw, decommission
Marostegui reassigned T238726: Decommission db2062.codfw.wmnet from Marostegui to Papaul.
Fri, Nov 29, 6:17 AM · Patch-For-Review, Operations, DC-Ops, ops-codfw, decommission
Marostegui moved T233185: Decommission db2067.codfw.wmnet from Next to In progress on the DBA board.
Fri, Nov 29, 6:10 AM · Patch-For-Review, Operations, DC-Ops, ops-codfw, decommission
Marostegui closed T234066: Schema change to rename user_newtalk indexes, a subtask of T233240: Remove MySQL aliasing for user_newtalk indexes, as Resolved.
Fri, Nov 29, 6:06 AM · MW-1.34-notes (1.34.0-wmf.25; 2019-10-01), User-Marostegui, Schema-change, MediaWiki-General, Core Platform Team Workboards (Clinic Duty Team)
Marostegui closed T234066: Schema change to rename user_newtalk indexes as Resolved.

All done

Fri, Nov 29, 6:06 AM · Blocked-on-schema-change, DBA
Marostegui updated the task description for T234066: Schema change to rename user_newtalk indexes.
Fri, Nov 29, 6:06 AM · Blocked-on-schema-change, DBA
Marostegui updated the task description for T234066: Schema change to rename user_newtalk indexes.
Fri, Nov 29, 6:03 AM · Blocked-on-schema-change, DBA

Thu, Nov 28

Elitre awarded T234801: Community Relations support needed for a read-only window for s1 (enwiki) a Like token.
Thu, Nov 28, 5:39 PM · CommRel-Specialists-Support (Oct-Dec-2019)
Marostegui updated the task description for T232446: Compress new Wikibase tables.
Thu, Nov 28, 3:40 PM · DBA
Marostegui updated the task description for T234066: Schema change to rename user_newtalk indexes.
Thu, Nov 28, 2:57 PM · Blocked-on-schema-change, DBA