
MatthewVernon (Matthew Vernon)
User

User Details

User Since
Aug 2 2021, 1:52 PM (18 w, 2 d)
Availability
Available
LDAP User
MVernon
MediaWiki User
MVernon (WMF) [ Global Accounts ]

Recent Activity

Yesterday

MatthewVernon closed T296289: swift-proxy not starting on ms-fe2009 due to missing python-monotonic as Resolved.
Wed, Dec 8, 4:44 PM · SRE-swift-storage
MatthewVernon claimed T296289: swift-proxy not starting on ms-fe2009 due to missing python-monotonic.

> AFAIK rewrite in puppet is now the canonical place for this work, i.e. SwiftMedia is no longer

Wed, Dec 8, 3:42 PM · SRE-swift-storage

Tue, Dec 7

MatthewVernon added a comment to T296289: swift-proxy not starting on ms-fe2009 due to missing python-monotonic.

OK, I know what the problem is (at least at one level). Our swift front-ends use a bit of middleware, wmf.rewrite, which we ship from puppet; that calls import monotonic. But as far as I can see there's nothing to say that python-monotonic (or python3-monotonic) should be installed on swift front-ends.
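
For illustration only (this is not the actual wmf.rewrite source), middleware like this could avoid the hard dependency with a guarded import that falls back to the standard library on Python 3:

# Hypothetical sketch, not the wmf.rewrite code: prefer the external
# python3-monotonic package if it is installed, otherwise fall back to the
# monotonic clock that ships with Python 3.3+.
try:
    from monotonic import monotonic  # provided by python-monotonic / python3-monotonic
except ImportError:
    from time import monotonic       # stdlib fallback

start = monotonic()
# ... time-sensitive work ...
elapsed = monotonic() - start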

Tue, Dec 7, 4:28 PM · SRE-swift-storage
MatthewVernon moved T296945: Deploy research_poc Swift credentials to Hadoop from Inbox to In progress on the SRE-swift-storage board.
Tue, Dec 7, 10:14 AM · Data-Engineering-Kanban, Data-Engineering, SRE-swift-storage

Thu, Dec 2

MatthewVernon added a comment to T294380: Storage request for datasets published by research team.

For S3, you need three things: access key, secret key, and endpoint.
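
As a minimal sketch of how those three values get used (the endpoint URL and credentials below are placeholders, not the real ones), with boto3:

import boto3

# Placeholder endpoint and credentials, for illustration only.
s3 = boto3.client(
    's3',
    endpoint_url='https://thanos-swift.example.org',
    aws_access_key_id='ACCESS_KEY',
    aws_secret_access_key='SECRET_KEY',
)

# List the buckets visible to this account.
for bucket in s3.list_buckets().get('Buckets', []):
    print(bucket['Name'])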

Thu, Dec 2, 3:59 PM · SRE-swift-storage

Tue, Nov 30

MatthewVernon added a comment to T295563: hw troubleshooting: disk failure (sdr) for ms-be2059.codfw.wmnet.
Tue, Nov 30, 1:42 PM · SRE, SRE-swift-storage, ops-codfw, DC-Ops

Thu, Nov 25

MatthewVernon added a comment to T295965: Test MariaDB 10.4 with Bullseye.

The problem is that /lib/systemd/system/mariadb.service lacks the changes from https://gerrit.wikimedia.org/r/c/operations/software/+/715926

Thu, Nov 25, 2:17 PM · Patch-For-Review, DBA

Tue, Nov 23

MatthewVernon added a comment to T294380: Storage request for datasets published by research team.

The account is created; I gather the usual approach is to instruct puppet to write a configuration file with the relevant details in it (taken from profile::thanos::swift::accounts_keys), similar to how objstore.yaml is written by modules/thanos/manifests/compact.pp or the lookups in modules/profile/manifests/docker_registry_ha/registry.pp.

Tue, Nov 23, 5:00 PM · SRE-swift-storage
MatthewVernon archived P17802 thanos restart failure.
Tue, Nov 23, 3:30 PM
MatthewVernon created P17802 thanos restart failure.
Tue, Nov 23, 3:29 PM
MatthewVernon committed rLPRI964a8f919fdc: profile::thanos::swift: fake creds for research_poc (authored by MatthewVernon).
profile::thanos::swift: fake creds for research_poc
Tue, Nov 23, 3:17 PM
MatthewVernon added a comment to T294016: Swift-recon -d overstates disk capacity and usage.

This patch has been sent upstream as https://review.opendev.org/c/openstack/swift/+/818881.

Tue, Nov 23, 11:17 AM · SRE-swift-storage

Wed, Nov 17

MatthewVernon added a comment to T295563: hw troubleshooting: disk failure (sdr) for ms-be2059.codfw.wmnet.

xfs_repair found a number of problems with the filesystem, and more medium errors were reported by the kernel:

Nov 17 14:24:40 ms-be2059 kernel: [21720811.039143] sd 0:2:17:0: [sdr] tag#2 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Nov 17 14:24:40 ms-be2059 kernel: [21720811.039147] sd 0:2:17:0: [sdr] tag#2 Sense Key : Medium Error [current] 
Nov 17 14:24:40 ms-be2059 kernel: [21720811.039150] sd 0:2:17:0: [sdr] tag#2 Add. Sense: No additional sense information
Nov 17 14:24:40 ms-be2059 kernel: [21720811.039153] sd 0:2:17:0: [sdr] tag#2 CDB: Read(16) 88 00 00 00 00 00 0b 57 42 f0 00 00 00 18 00 00
Nov 17 14:24:40 ms-be2059 kernel: [21720811.039155] blk_update_request: I/O error, dev sdr, sector 190268144
Nov 17 14:49:37 ms-be2059 kernel: [21722304.380409] sd 0:2:17:0: [sdr] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Nov 17 14:49:37 ms-be2059 kernel: [21722304.380413] sd 0:2:17:0: [sdr] tag#3 Sense Key : Medium Error [current] 
Nov 17 14:49:37 ms-be2059 kernel: [21722304.380415] sd 0:2:17:0: [sdr] tag#3 Add. Sense: No additional sense information
Nov 17 14:49:37 ms-be2059 kernel: [21722304.380418] sd 0:2:17:0: [sdr] tag#3 CDB: Read(16) 88 00 00 00 00 00 0b 57 42 f8 00 00 00 08 00 00
Nov 17 14:49:37 ms-be2059 kernel: [21722304.380420] blk_update_request: I/O error, dev sdr, sector 190268152
Nov 17 15:36:21 ms-be2059 kernel: [21725102.144703] sd 0:2:17:0: [sdr] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Nov 17 15:36:21 ms-be2059 kernel: [21725102.144717] sd 0:2:17:0: [sdr] tag#0 Sense Key : Medium Error [current] 
Nov 17 15:36:21 ms-be2059 kernel: [21725102.144720] sd 0:2:17:0: [sdr] tag#0 Add. Sense: No additional sense information
Nov 17 15:36:21 ms-be2059 kernel: [21725102.144724] sd 0:2:17:0: [sdr] tag#0 CDB: Read(16) 88 00 00 00 00 00 0b 57 43 00 00 00 00 08 00 00
Nov 17 15:36:21 ms-be2059 kernel: [21725102.144727] blk_update_request: I/O error, dev sdr, sector 190268160

I hope this is sufficient to get this drive replaced?

Wed, Nov 17, 4:31 PM · SRE, SRE-swift-storage, ops-codfw, DC-Ops
MatthewVernon added a comment to T295563: hw troubleshooting: disk failure (sdr) for ms-be2059.codfw.wmnet.

@Papaul I'm trying to xfs_repair the filesystem, which is a lengthy process, but I'm seeing medium errors in the kernel log again:

Nov 17 13:43:29 ms-be2059 kernel: [21718345.521149] sd 0:2:17:0: [sdr] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Nov 17 13:43:29 ms-be2059 kernel: [21718345.521163] sd 0:2:17:0: [sdr] tag#1 Sense Key : Medium Error [current] 
Nov 17 13:43:29 ms-be2059 kernel: [21718345.521166] sd 0:2:17:0: [sdr] tag#1 Add. Sense: No additional sense information
Nov 17 13:43:29 ms-be2059 kernel: [21718345.521171] sd 0:2:17:0: [sdr] tag#1 CDB: Read(16) 88 00 00 00 00 00 0b 57 42 f8 00 00 00 08 00 00
Nov 17 13:43:29 ms-be2059 kernel: [21718345.521175] blk_update_request: I/O error, dev sdr, sector 190268152

That's the same sector as previously, which really does make me think there's a hardware fault here. Is that not enough to convince Dell?
[I'll update when xfs_repair completes]

Wed, Nov 17, 2:13 PM · SRE, SRE-swift-storage, ops-codfw, DC-Ops
MatthewVernon added a comment to T295118: Can't commit on asw-b-codfw.

I don't think so, no - the frontends will not route requests to down servers (at least in theory!); we'll be more vulnerable to failures elsewhere, but I think we have to live with that.

Wed, Nov 17, 10:53 AM · SRE-swift-storage, ops-codfw, SRE, Infrastructure-Foundations, netops

Thu, Nov 11

MatthewVernon added a comment to T295563: hw troubleshooting: disk failure (sdr) for ms-be2059.codfw.wmnet.


Here's the SupportAssistCollection output.

Thu, Nov 11, 5:14 PM · SRE, SRE-swift-storage, ops-codfw, DC-Ops
MatthewVernon created T295563: hw troubleshooting: disk failure (sdr) for ms-be2059.codfw.wmnet.
Thu, Nov 11, 5:06 PM · SRE, SRE-swift-storage, ops-codfw, DC-Ops

Wed, Nov 10

MatthewVernon claimed T294380: Storage request for datasets published by research team.

Please let me know what the next steps are.

Wed, Nov 10, 1:38 PM · SRE-swift-storage

Nov 8 2021

MatthewVernon added a comment to T294380: Storage request for datasets published by research team.

Hi,

  1. Are you OK with using the S3 protocol (rather than the Swift protocol)?

Yes, that would work well with the largish, file-like objects we intend to store. However, I was under the impression that the S3 protocol will only be enabled for the new misc swift cluster, which will not be available for a while.

It's true that the main Swift cluster doesn't support S3, but the Thanos cluster does, and that has enough capacity at least for your proof-of-concept needs.

Nov 8 2021, 5:03 PM · SRE-swift-storage

Nov 3 2021

MatthewVernon added a comment to T294380: Storage request for datasets published by research team.

Sorry for the delay in getting back to you. I have a couple of questions about your request, if I may:

  1. Are you OK with using the S3 protocol (rather than the Swift protocol)?
  2. Do you have an idea of how much performance/bandwidth you need (read and write)?
  3. What sort of timescale do you need this storage on?
Nov 3 2021, 4:47 PM · SRE-swift-storage

Oct 26 2021

MatthewVernon closed T288458: Put ms-be20[62-65] in service as Resolved.
Oct 26 2021, 8:56 AM · User-fgiunchedi, SRE-swift-storage

Oct 25 2021

MatthewVernon added a comment to T294001: Degraded RAID on ms-be2028.

@Papaul thanks :)

Oct 25 2021, 2:37 PM · Data-Persistence, SRE-swift-storage, SRE, ops-codfw
MatthewVernon added a watcher for SRE-swift-storage: MatthewVernon.
Oct 25 2021, 9:48 AM
MatthewVernon added a comment to T294001: Degraded RAID on ms-be2028.

[subscribing so I get a ping once we know if there's an available spare or not]

Oct 25 2021, 8:48 AM · Data-Persistence, SRE-swift-storage, SRE, ops-codfw
MatthewVernon added a project to T294001: Degraded RAID on ms-be2028: Data-Persistence.
Oct 25 2021, 8:46 AM · Data-Persistence, SRE-swift-storage, SRE, ops-codfw

Oct 21 2021

MatthewVernon moved T265117: Consider swift ring management automation from Backlog to In progress on the SRE-swift-storage board.
Oct 21 2021, 1:38 PM · SRE-swift-storage
MatthewVernon moved T294019: Monitoring (?+alerting) for Swift capacity from Inbox to In progress on the SRE-swift-storage board.
Oct 21 2021, 1:38 PM · SRE-swift-storage
MatthewVernon created T294019: Monitoring (?+alerting) for Swift capacity.
Oct 21 2021, 1:38 PM · SRE-swift-storage
MatthewVernon added a comment to T294016: Swift-recon -d overstates disk capacity and usage.

Sticking the patch here just in case...

diff --git a/swift/cli/recon.py b/swift/cli/recon.py
index cd0952875..304a75a90 100644
--- a/swift/cli/recon.py
+++ b/swift/cli/recon.py
@@ -895,6 +895,7 @@ class SwiftRecon(object):
         percents = {}
         top_percents = [(None, 0)] * top
         low_percents = [(None, 100)] * lowest
+        hosts_checked = []
         recon = Scout("diskusage", self.verbose, self.suppress_errors,
                       self.timeout)
         print("[%s] Checking disk usage now" % self._ptime())
@@ -902,6 +903,10 @@ class SwiftRecon(object):
                 recon.scout, hosts):
             if status == 200:
                 hostusage = []
+                host = urlparse(url).netloc.split(':')[0]
+                if host in hosts_checked:
+                    continue
+                hosts_checked.append(host)
                 for entry in response:
                     if not isinstance(entry['mounted'], bool):
                         print("-> %s/%s: Error: %s" % (url, entry['device'],
Oct 21 2021, 1:30 PM · SRE-swift-storage
MatthewVernon added a project to T294016: Swift-recon -d overstates disk capacity and usage: Data-Persistence.
Oct 21 2021, 1:28 PM · SRE-swift-storage
MatthewVernon moved T294016: Swift-recon -d overstates disk capacity and usage from Inbox to In progress on the SRE-swift-storage board.
Oct 21 2021, 1:26 PM · SRE-swift-storage
MatthewVernon created T294016: Swift-recon -d overstates disk capacity and usage.
Oct 21 2021, 1:26 PM · SRE-swift-storage
MatthewVernon claimed T288458: Put ms-be20[62-65] in service.
Oct 21 2021, 1:15 PM · User-fgiunchedi, SRE-swift-storage

Oct 20 2021

MatthewVernon added a comment to T257056: Add alert for prometheus-mysql-exporter failing to scrape mysql.

Yes, it actually fired on 2021-10-09 (and resolved when I restarted the offending unit):

<jinxer-wm> (PrometheusMysqldExporterFailed) firing: Prometheus-mysqld-exporter failed (db1119:9104)  - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org
<jinxer-wm> (PrometheusMysqldExporterFailed) resolved: Prometheus-mysqld-exporter failed (db1119:9104)  - https://grafana.wikimedia.org/d/000000278/mysql-aggregated -https://alerts.wikimedia.org

This was in #wikimedia-data-persistence channel.

Oct 20 2021, 12:23 PM · User-Kormat, DBA

Oct 18 2021

MatthewVernon closed T290881: Spontaneous reboot of ms-be2045 as Resolved.

Full weight restored, so closing this (again ;-) )

Oct 18 2021, 9:57 AM · Patch-For-Review, SRE, SRE-swift-storage

Oct 12 2021

MatthewVernon claimed T265117: Consider swift ring management automation.

I think we have an outline of how to make this work, so I'll take ownership of this item.

Oct 12 2021, 3:31 PM · SRE-swift-storage

Oct 11 2021

MatthewVernon claimed T290881: Spontaneous reboot of ms-be2045.

@Papaul system was stable over the weekend, so I'll take this ticket and start restoring this system to the Swift rings. Thanks!

Oct 11 2021, 2:26 PM · Patch-For-Review, SRE, SRE-swift-storage
MatthewVernon archived P17451 rebalancing in swift-recon.
Oct 11 2021, 11:08 AM
MatthewVernon added a comment to T292957: Investigate the root cause of prometheus-mysqld-exporter alert on db1119.

In case it's useful for future comparison, this is what lsof says about the exporter process now (i.e. during normal behaviour):

COMMAND    PID       USER   FD      TYPE             DEVICE SIZE/OFF      NODE NAME
prometheu 8593 prometheus  cwd       DIR                8,1     4096         2 /
prometheu 8593 prometheus  rtd       DIR                8,1     4096         2 /
prometheu 8593 prometheus  txt       REG                8,1  9681936   1982817 /usr/bin/prometheus-mysqld-exporter
prometheu 8593 prometheus  mem       REG                8,1  1824496   1971450 /usr/lib/x86_64-linux-gnu/libc-2.28.so
prometheu 8593 prometheus  mem       REG                8,1   146968   1971464 /usr/lib/x86_64-linux-gnu/libpthread-2.28.so
prometheu 8593 prometheus  mem       REG                8,1   165632   1971446 /usr/lib/x86_64-linux-gnu/ld-2.28.so
prometheu 8593 prometheus    0r      CHR                1,3      0t0         6 /dev/null
prometheu 8593 prometheus    1u     unix 0xffff968af4e47400      0t0 719869330 type=STREAM
prometheu 8593 prometheus    2u     unix 0xffff968af4e47400      0t0 719869330 type=STREAM
prometheu 8593 prometheus    3u     IPv6          719782869      0t0       TCP *:9104 (LISTEN)
prometheu 8593 prometheus    4u  a_inode               0,13        0     10390 [eventpoll]
prometheu 8593 prometheus    5u     IPv6          719811990      0t0       TCP db1119.eqiad.wmnet:9104->prometheus1003.eqiad.wmnet:36304 (ESTABLISHED)
prometheu 8593 prometheus    6u     IPv6          719801884      0t0       TCP db1119.eqiad.wmnet:9104->prometheus1004.eqiad.wmnet:52084 (ESTABLISHED)
Oct 11 2021, 10:42 AM · DBA
MatthewVernon added a comment to T292957: Investigate the root cause of prometheus-mysqld-exporter alert on db1119.

journalctl output is below; you can see the restart I did on 9 October. I presume the failure on 5 April is the "pme started before mysqld" error that we resolved with the work on T289488, so this does look like a different issue. At least we know the alerting works ;-)

-- Logs begin at Mon 2021-04-05 11:12:08 UTC, end at Mon 2021-10-11 10:09:09 UTC. --
Apr 05 11:12:15 db1119 systemd[1]: Started Prometheus exporter for MySQL server.
Apr 05 11:12:15 db1119 prometheus-mysqld-exporter[692]: time="2021-04-05T11:12:15Z" level=info msg="Starting mysqld_exporter (version=0.11.0+ds, branch=debian/sid, revision=0.11.0+ds-1+b20)" source="mysqld_exporter.go:206"
Apr 05 11:12:15 db1119 prometheus-mysqld-exporter[692]: time="2021-04-05T11:12:15Z" level=info msg="Build context (go=go1.11.5, user=pkg-go-maintainers@lists.alioth.debian.org, date=20190311-02:06:43)" source="mysqld_exporter.go:207"
Apr 05 11:12:15 db1119 prometheus-mysqld-exporter[692]: time="2021-04-05T11:12:15Z" level=info msg="Enabled scrapers:" source="mysqld_exporter.go:218"
Apr 05 11:12:15 db1119 prometheus-mysqld-exporter[692]: time="2021-04-05T11:12:15Z" level=info msg=" --collect.global_status" source="mysqld_exporter.go:222"
Apr 05 11:12:15 db1119 prometheus-mysqld-exporter[692]: time="2021-04-05T11:12:15Z" level=info msg=" --collect.global_variables" source="mysqld_exporter.go:222"
Apr 05 11:12:15 db1119 prometheus-mysqld-exporter[692]: time="2021-04-05T11:12:15Z" level=info msg=" --collect.slave_status" source="mysqld_exporter.go:222"
Apr 05 11:12:15 db1119 prometheus-mysqld-exporter[692]: time="2021-04-05T11:12:15Z" level=info msg=" --collect.info_schema.processlist" source="mysqld_exporter.go:222"
Apr 05 11:12:15 db1119 prometheus-mysqld-exporter[692]: time="2021-04-05T11:12:15Z" level=info msg="Listening on :9104" source="mysqld_exporter.go:232"
Apr 05 11:12:36 db1119 prometheus-mysqld-exporter[692]: time="2021-04-05T11:12:36Z" level=error msg="Error pinging mysqld: dial unix /run/mysqld/mysqld.sock: connect: no such file or directory" source="exporter.go:119"
Apr 05 11:12:36 db1119 prometheus-mysqld-exporter[692]: time="2021-04-05T11:12:36Z" level=error msg="Error pinging mysqld: dial unix /run/mysqld/mysqld.sock: connect: no such file or directory" source="exporter.go:119"
Apr 05 11:13:36 db1119 prometheus-mysqld-exporter[692]: time="2021-04-05T11:13:36Z" level=error msg="Error pinging mysqld: dial unix /run/mysqld/mysqld.sock: connect: no such file or directory" source="exporter.go:119"
Apr 05 11:13:36 db1119 prometheus-mysqld-exporter[692]: time="2021-04-05T11:13:36Z" level=error msg="Error pinging mysqld: dial unix /run/mysqld/mysqld.sock: connect: no such file or directory" source="exporter.go:119"
Apr 05 11:14:36 db1119 prometheus-mysqld-exporter[692]: time="2021-04-05T11:14:36Z" level=error msg="Error pinging mysqld: dial unix /run/mysqld/mysqld.sock: connect: no such file or directory" source="exporter.go:119"
Apr 05 11:14:36 db1119 prometheus-mysqld-exporter[692]: time="2021-04-05T11:14:36Z" level=error msg="Error pinging mysqld: dial unix /run/mysqld/mysqld.sock: connect: no such file or directory" source="exporter.go:119"
Apr 05 11:17:15 db1119 systemd[1]: Stopping Prometheus exporter for MySQL server...
Apr 05 11:17:15 db1119 systemd[1]: prometheus-mysqld-exporter.service: Main process exited, code=killed, status=15/TERM
Apr 05 11:17:15 db1119 systemd[1]: prometheus-mysqld-exporter.service: Succeeded.
Apr 05 11:17:15 db1119 systemd[1]: Stopped Prometheus exporter for MySQL server.
Apr 05 11:17:15 db1119 systemd[1]: Started Prometheus exporter for MySQL server.
Apr 05 11:17:15 db1119 prometheus-mysqld-exporter[3610]: time="2021-04-05T11:17:15Z" level=info msg="Starting mysqld_exporter (version=0.11.0+ds, branch=debian/sid, revision=0.11.0+ds-1+b20)" source="mysqld_exporter.go:206"
Apr 05 11:17:15 db1119 prometheus-mysqld-exporter[3610]: time="2021-04-05T11:17:15Z" level=info msg="Build context (go=go1.11.5, user=pkg-go-maintainers@lists.alioth.debian.org, date=20190311-02:06:43)" source="mysqld_exporter.go:207"
Apr 05 11:17:15 db1119 prometheus-mysqld-exporter[3610]: time="2021-04-05T11:17:15Z" level=info msg="Enabled scrapers:" source="mysqld_exporter.go:218"
Apr 05 11:17:15 db1119 prometheus-mysqld-exporter[3610]: time="2021-04-05T11:17:15Z" level=info msg=" --collect.global_status" source="mysqld_exporter.go:222"
Apr 05 11:17:15 db1119 prometheus-mysqld-exporter[3610]: time="2021-04-05T11:17:15Z" level=info msg=" --collect.global_variables" source="mysqld_exporter.go:222"
Apr 05 11:17:15 db1119 prometheus-mysqld-exporter[3610]: time="2021-04-05T11:17:15Z" level=info msg=" --collect.slave_status" source="mysqld_exporter.go:222"
Apr 05 11:17:15 db1119 prometheus-mysqld-exporter[3610]: time="2021-04-05T11:17:15Z" level=info msg=" --collect.info_schema.processlist" source="mysqld_exporter.go:222"
Apr 05 11:17:15 db1119 prometheus-mysqld-exporter[3610]: time="2021-04-05T11:17:15Z" level=info msg="Listening on :9104" source="mysqld_exporter.go:232"
Oct 01 05:22:00 db1119 systemd[1]: Stopping Prometheus exporter for MySQL server...
Oct 01 05:22:00 db1119 systemd[1]: prometheus-mysqld-exporter.service: Main process exited, code=killed, status=15/TERM
Oct 01 05:22:00 db1119 systemd[1]: prometheus-mysqld-exporter.service: Succeeded.
Oct 01 05:22:00 db1119 systemd[1]: Stopped Prometheus exporter for MySQL server.
Oct 01 05:23:59 db1119 systemd[1]: Started Prometheus exporter for MySQL server.
Oct 01 05:23:59 db1119 prometheus-mysqld-exporter[25582]: time="2021-10-01T05:23:59Z" level=info msg="Starting mysqld_exporter (version=0.11.0+ds, branch=debian/sid, revision=0.11.0+ds-1+b20)" source="mysqld_exporter.go:206"
Oct 01 05:23:59 db1119 prometheus-mysqld-exporter[25582]: time="2021-10-01T05:23:59Z" level=info msg="Build context (go=go1.11.5, user=pkg-go-maintainers@lists.alioth.debian.org, date=20190311-02:06:43)" source="mysqld_exporter.go:207"
Oct 01 05:23:59 db1119 prometheus-mysqld-exporter[25582]: time="2021-10-01T05:23:59Z" level=info msg="Enabled scrapers:" source="mysqld_exporter.go:218"
Oct 01 05:23:59 db1119 prometheus-mysqld-exporter[25582]: time="2021-10-01T05:23:59Z" level=info msg=" --collect.global_status" source="mysqld_exporter.go:222"
Oct 01 05:23:59 db1119 prometheus-mysqld-exporter[25582]: time="2021-10-01T05:23:59Z" level=info msg=" --collect.global_variables" source="mysqld_exporter.go:222"
Oct 01 05:23:59 db1119 prometheus-mysqld-exporter[25582]: time="2021-10-01T05:23:59Z" level=info msg=" --collect.slave_status" source="mysqld_exporter.go:222"
Oct 01 05:23:59 db1119 prometheus-mysqld-exporter[25582]: time="2021-10-01T05:23:59Z" level=info msg=" --collect.info_schema.processlist" source="mysqld_exporter.go:222"
Oct 01 05:23:59 db1119 prometheus-mysqld-exporter[25582]: time="2021-10-01T05:23:59Z" level=info msg="Listening on :9104" source="mysqld_exporter.go:232"
Oct 09 09:12:02 db1119 prometheus-mysqld-exporter[25582]: time="2021-10-09T09:12:02Z" level=error msg="Error pinging mysqld: Error 1226: User 'prometheus' has exceeded the 'max_user_connections' resource (current value: 5)" source="exporter.go:119"
Oct 09 09:12:02 db1119 prometheus-mysqld-exporter[25582]: time="2021-10-09T09:12:02Z" level=error msg="Error pinging mysqld: Error 1226: User 'prometheus' has exceeded the 'max_user_connections' resource (current value: 5)" source="exporter.go:119"
Oct 09 09:12:02 db1119 prometheus-mysqld-exporter[25582]: time="2021-10-09T09:12:02Z" level=error msg="Error pinging mysqld: Error 1226: User 'prometheus' has exceeded the 'max_user_connections' resource (current value: 5)" source="exporter.go:119"
Oct 09 09:12:02 db1119 prometheus-mysqld-exporter[25582]: time="2021-10-09T09:12:02Z" level=error msg="Error pinging mysqld: Error 1226: User 'prometheus' has exceeded the 'max_user_connections' resource (current value: 5)" source="exporter.go:119"
Oct 09 10:11:43 db1119 systemd[1]: Stopping Prometheus exporter for MySQL server...
Oct 09 10:11:43 db1119 systemd[1]: prometheus-mysqld-exporter.service: Main process exited, code=killed, status=15/TERM
Oct 09 10:11:43 db1119 systemd[1]: prometheus-mysqld-exporter.service: Succeeded.
Oct 09 10:11:43 db1119 systemd[1]: Stopped Prometheus exporter for MySQL server.
Oct 09 10:11:43 db1119 systemd[1]: Started Prometheus exporter for MySQL server.
Oct 09 10:11:43 db1119 prometheus-mysqld-exporter[8593]: time="2021-10-09T10:11:43Z" level=info msg="Starting mysqld_exporter (version=0.11.0+ds, branch=debian/sid, revision=0.11.0+ds-1+b20)" source="mysqld_exporter.go:206"
Oct 09 10:11:43 db1119 prometheus-mysqld-exporter[8593]: time="2021-10-09T10:11:43Z" level=info msg="Build context (go=go1.11.5, user=pkg-go-maintainers@lists.alioth.debian.org, date=20190311-02:06:43)" source="mysqld_exporter.go:207"
Oct 09 10:11:43 db1119 prometheus-mysqld-exporter[8593]: time="2021-10-09T10:11:43Z" level=info msg="Enabled scrapers:" source="mysqld_exporter.go:218"
Oct 09 10:11:43 db1119 prometheus-mysqld-exporter[8593]: time="2021-10-09T10:11:43Z" level=info msg=" --collect.global_status" source="mysqld_exporter.go:222"
Oct 09 10:11:43 db1119 prometheus-mysqld-exporter[8593]: time="2021-10-09T10:11:43Z" level=info msg=" --collect.global_variables" source="mysqld_exporter.go:222"
Oct 09 10:11:43 db1119 prometheus-mysqld-exporter[8593]: time="2021-10-09T10:11:43Z" level=info msg=" --collect.slave_status" source="mysqld_exporter.go:222"
Oct 09 10:11:43 db1119 prometheus-mysqld-exporter[8593]: time="2021-10-09T10:11:43Z" level=info msg=" --collect.info_schema.processlist" source="mysqld_exporter.go:222"
Oct 09 10:11:43 db1119 prometheus-mysqld-exporter[8593]: time="2021-10-09T10:11:43Z" level=info msg="Listening on :9104" source="mysqld_exporter.go:232"
Oct 11 2021, 10:34 AM · DBA
MatthewVernon created P17451 rebalancing in swift-recon.
Oct 11 2021, 8:49 AM

Oct 7 2021

MatthewVernon reopened T290881: Spontaneous reboot of ms-be2045 as "Open".

Hi @Papaul, we reimaged this host today to try and bring it back into service. After about half an hour of uptime it dropped off the network, and from the management console it looks like the network hardware has failed?

Oct 7 2021, 10:04 AM · Patch-For-Review, SRE, SRE-swift-storage
MatthewVernon created P17433 ms-be2045 kernel log.
Oct 7 2021, 9:57 AM
MatthewVernon created P17432 Puppet sadness on ms-be2045.
Oct 7 2021, 8:16 AM

Oct 5 2021

MatthewVernon closed T257056: Add alert for prometheus-mysql-exporter failing to scrape mysql as Resolved.

The two changes just merged (to be auto-deployed) should be necessary and sufficient to resolve this.

Oct 5 2021, 2:34 PM · User-Kormat, DBA
MatthewVernon claimed T257056: Add alert for prometheus-mysql-exporter failing to scrape mysql.
Oct 5 2021, 2:33 PM · User-Kormat, DBA
MatthewVernon committed rOALE3921c3d2300c: data-protection: add alerting for prometheus-mysqld-exporter failing (authored by MatthewVernon).
data-protection: add alerting for prometheus-mysqld-exporter failing
Oct 5 2021, 2:32 PM

Sep 23 2021

MatthewVernon added a comment to T257056: Add alert for prometheus-mysql-exporter failing to scrape mysql.

AFAICT from https://wikitech.wikimedia.org/wiki/Alertmanager, alerts are (now?) meant to go into operations/alerts rather than into puppet directly. So I think we want a data-persistence routing for alerts, and then a suitable alert defined in operations/alerts (I've checked, and mysql_exporter_last_scrape_error is in prometheus).
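
For what it's worth, a quick way to double-check that a metric is being scraped is the standard Prometheus HTTP query API; the server URL below is a placeholder:

import requests

# Placeholder Prometheus URL; /api/v1/query is the standard instant-query endpoint.
resp = requests.get(
    'https://prometheus.example.org/api/v1/query',
    params={'query': 'mysql_exporter_last_scrape_error'},
)
resp.raise_for_status()
series = resp.json()['data']['result']
print('series found:', len(series))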

Sep 23 2021, 2:20 PM · User-Kormat, DBA

Sep 21 2021

MatthewVernon added a comment to T276961: Support Openstack Swift APIs via the radosgw.

[I was pointed at this task from IRC; I'm new in the data persistence team and used to do quite a bit of Ceph at the Sanger.]

Sep 21 2021, 10:24 AM · cloud-services-team (Kanban), Data-Services, Cloud-VPS, User-Marostegui

Sep 15 2021

MatthewVernon closed T289488: Systemd enhancements for mariadb and prometheus-mysql-exporter as Resolved.

Marking this as resolved - as we deploy 10.4.21-2 everywhere, the fix will get rolled out.

Sep 15 2021, 1:16 PM · Patch-For-Review, DBA
MatthewVernon added a comment to T289488: Systemd enhancements for mariadb and prometheus-mysql-exporter.

I've also checked the stop/start/restart behaviour, which is as expected.
Likewise, on reboot PME isn't started, but when you start mariadb it does then get started for you.

Sep 15 2021, 12:29 PM · Patch-For-Review, DBA

Sep 13 2021

MatthewVernon added a comment to T290881: Spontaneous reboot of ms-be2045.

Hi @Papaul, this system seems to have had a hardware fault (or faults), and is (just) still within its warranty; could you get the hardware checked out, please? Thanks :)

Sep 13 2021, 3:50 PM · Patch-For-Review, SRE, SRE-swift-storage
MatthewVernon assigned T290881: Spontaneous reboot of ms-be2045 to Papaul.
Sep 13 2021, 3:49 PM · Patch-For-Review, SRE, SRE-swift-storage
MatthewVernon added a comment to T290881: Spontaneous reboot of ms-be2045.

On reboot, the disks came back, but many of the filesystems are unhappy:
mvernon@ms-be2045:~$ sudo dmesg | grep 'Shutting down filesystem'
[ 18.244602] XFS (sda3): Corruption of in-memory data detected. Shutting down filesystem
[ 18.724649] XFS (sdf1): Corruption of in-memory data detected. Shutting down filesystem
[ 19.448076] XFS (sdg1): Corruption of in-memory data detected. Shutting down filesystem
[ 20.610420] XFS (sdj1): I/O Error Detected. Shutting down filesystem
[ 20.745769] XFS (sdn1): I/O Error Detected. Shutting down filesystem
[ 20.938081] XFS (sdi1): I/O Error Detected. Shutting down filesystem
[ 23.719222] XFS (sdh1): I/O Error Detected. Shutting down filesystem
[ 24.802161] XFS (sde1): I/O Error Detected. Shutting down filesystem
[ 30.091057] XFS (sdm1): I/O Error Detected. Shutting down filesystem
[ 31.761276] XFS (sdc1): I/O Error Detected. Shutting down filesystem

Sep 13 2021, 3:46 PM · Patch-For-Review, SRE, SRE-swift-storage
MatthewVernon updated subscribers of T290881: Spontaneous reboot of ms-be2045.
Sep 13 2021, 3:31 PM · Patch-For-Review, SRE, SRE-swift-storage
MatthewVernon created T290881: Spontaneous reboot of ms-be2045.
Sep 13 2021, 3:19 PM · Patch-For-Review, SRE, SRE-swift-storage

Sep 7 2021

MatthewVernon added a comment to T289117: decommission pc2010.codfw.wmnet.

This host is ready for DC-Ops to decommission

Sep 7 2021, 2:07 PM · SRE, ops-codfw, DC-Ops, decommission-hardware
MatthewVernon reassigned T289117: decommission pc2010.codfw.wmnet from MatthewVernon to Papaul.
Sep 7 2021, 2:07 PM · SRE, ops-codfw, DC-Ops, decommission-hardware
MatthewVernon added a comment to T289116: decommission pc2009.codfw.wmnet.

This host is ready for DC-Ops to decommission

Sep 7 2021, 1:24 PM · SRE, ops-codfw, DC-Ops, decommission-hardware
MatthewVernon reassigned T289116: decommission pc2009.codfw.wmnet from MatthewVernon to Papaul.
Sep 7 2021, 1:24 PM · SRE, ops-codfw, DC-Ops, decommission-hardware
MatthewVernon added a comment to T289115: decommission pc2008.codfw.wmnet.

This host is ready for DC-Ops to decommission

Sep 7 2021, 10:55 AM · Patch-For-Review, SRE, ops-codfw, DC-Ops, decommission-hardware
MatthewVernon reassigned T289115: decommission pc2008.codfw.wmnet from MatthewVernon to Papaul.
Sep 7 2021, 10:55 AM · Patch-For-Review, SRE, ops-codfw, DC-Ops, decommission-hardware
MatthewVernon claimed T289117: decommission pc2010.codfw.wmnet.
Sep 7 2021, 10:39 AM · SRE, ops-codfw, DC-Ops, decommission-hardware
MatthewVernon claimed T289116: decommission pc2009.codfw.wmnet.
Sep 7 2021, 10:39 AM · SRE, ops-codfw, DC-Ops, decommission-hardware
MatthewVernon added a comment to T289122: decommission pc1010.eqiad.wmnet.

This host is ready for DC-Ops to decommission

Sep 7 2021, 10:29 AM · Patch-For-Review, SRE, ops-eqiad, DC-Ops, decommission-hardware
MatthewVernon reassigned T289122: decommission pc1010.eqiad.wmnet from MatthewVernon to wiki_willy.
Sep 7 2021, 10:28 AM · Patch-For-Review, SRE, ops-eqiad, DC-Ops, decommission-hardware
MatthewVernon claimed T289115: decommission pc2008.codfw.wmnet.
Sep 7 2021, 10:24 AM · Patch-For-Review, SRE, ops-codfw, DC-Ops, decommission-hardware
MatthewVernon claimed T289122: decommission pc1010.eqiad.wmnet.
Sep 7 2021, 9:55 AM · Patch-For-Review, SRE, ops-eqiad, DC-Ops, decommission-hardware
MatthewVernon added a comment to T289120: decommission pc1009.eqiad.wmnet.

This host is ready for DC-Ops to decommission

Sep 7 2021, 9:48 AM · SRE, ops-eqiad, DC-Ops, decommission-hardware
MatthewVernon assigned T289120: decommission pc1009.eqiad.wmnet to wiki_willy.
Sep 7 2021, 9:48 AM · SRE, ops-eqiad, DC-Ops, decommission-hardware
MatthewVernon added a comment to T289119: decommission pc1008.eqiad.wmnet.

This host is ready for DC-Ops to decommission

Sep 7 2021, 8:54 AM · Patch-For-Review, SRE, ops-eqiad, DC-Ops, decommission-hardware
MatthewVernon reassigned T289119: decommission pc1008.eqiad.wmnet from MatthewVernon to wiki_willy.
Sep 7 2021, 8:53 AM · Patch-For-Review, SRE, ops-eqiad, DC-Ops, decommission-hardware
MatthewVernon added a comment to T289118: decommission pc1007.eqiad.wmnet..

This host is ready for DC-Ops to decommission

Sep 7 2021, 8:46 AM · SRE, ops-eqiad, DC-Ops, decommission-hardware
MatthewVernon claimed T289119: decommission pc1008.eqiad.wmnet.
Sep 7 2021, 8:18 AM · Patch-For-Review, SRE, ops-eqiad, DC-Ops, decommission-hardware

Sep 6 2021

MatthewVernon edited projects for T289118: decommission pc1007.eqiad.wmnet., added: DC-Ops, ops-eqiad; removed DBA.
Sep 6 2021, 3:04 PM · SRE, ops-eqiad, DC-Ops, decommission-hardware
MatthewVernon reassigned T289118: decommission pc1007.eqiad.wmnet. from MatthewVernon to wiki_willy.
Sep 6 2021, 3:03 PM · SRE, ops-eqiad, DC-Ops, decommission-hardware
MatthewVernon created P17227 (An Untitled Masterwork).
Sep 6 2021, 2:55 PM
MatthewVernon claimed T289118: decommission pc1007.eqiad.wmnet..
Sep 6 2021, 2:01 PM · SRE, ops-eqiad, DC-Ops, decommission-hardware
MatthewVernon updated the task description for T289118: decommission pc1007.eqiad.wmnet..
Sep 6 2021, 1:48 PM · SRE, ops-eqiad, DC-Ops, decommission-hardware
MatthewVernon closed T289488: Systemd enhancements for mariadb and prometheus-mysql-exporter as Resolved.
Sep 6 2021, 1:38 PM · Patch-For-Review, DBA
MatthewVernon added a comment to T257056: Add alert for prometheus-mysql-exporter failing to scrape mysql.

I think this could be achieved by setting a Grafana alert on the MySQL Aggregated dashboard? But I don't really know much about how alerts are set up at WMF, or when one should use a Grafana alert vs an Icinga one, or...

Sep 6 2021, 1:34 PM · User-Kormat, DBA
MatthewVernon updated subscribers of T289488: Systemd enhancements for mariadb and prometheus-mysql-exporter.

I think we concluded that the mariadb.target idea isn't all that useful (since, as I think @Kormat said, folk mostly don't stop and start more than one instance at once).

Sep 6 2021, 1:28 PM · Patch-For-Review, DBA
MatthewVernon added a comment to T289488: Systemd enhancements for mariadb and prometheus-mysql-exporter.

I know it is not totally related to this task, but maybe this can also be looked at as part of it? T257056: Add alert for prometheus-mysql-exporter failing to scrape mysql

Sep 6 2021, 1:25 PM · Patch-For-Review, DBA
MatthewVernon updated the task description for T289488: Systemd enhancements for mariadb and prometheus-mysql-exporter.
Sep 6 2021, 1:15 PM · Patch-For-Review, DBA

Aug 24 2021

MatthewVernon edited P17066 In which I break puppet-lint.
Aug 24 2021, 10:50 AM
MatthewVernon edited P17066 In which I break puppet-lint.
Aug 24 2021, 10:49 AM
MatthewVernon created P17066 In which I break puppet-lint.
Aug 24 2021, 10:48 AM

Aug 23 2021

MatthewVernon updated the task description for T289488: Systemd enhancements for mariadb and prometheus-mysql-exporter.
Aug 23 2021, 12:50 PM · Patch-For-Review, DBA
MatthewVernon added a comment to T252761: Research performance changes on prometheus-mysqld-exporter after buster/mariadb upgrade.

I've split the systemd bits into a separate task - T289488

Aug 23 2021, 12:38 PM · DBA
MatthewVernon created T289488: Systemd enhancements for mariadb and prometheus-mysql-exporter.
Aug 23 2021, 12:30 PM · Patch-For-Review, DBA
MatthewVernon added a comment to P17060 (An Untitled Masterwork).
-    ensure_packages('prometheus-mysqld-exporter', {'notify' => "Exec['systemctl try-restart prometheus-mysqld-exporter']"})
+    ensure_packages('prometheus-mysqld-exporter', {'notify' => Exec['systemctl try-restart prometheus-mysqld-exporter']})
Aug 23 2021, 10:11 AM
MatthewVernon created P17060 (An Untitled Masterwork).
Aug 23 2021, 9:51 AM

Aug 17 2021

MatthewVernon updated the task description for T288244: Upgrade s7 to Debian Buster and MariaDB 10.4.
Aug 17 2021, 1:30 PM · Patch-For-Review, DBA

Aug 9 2021

MatthewVernon added a comment to T252761: Research performance changes on prometheus-mysqld-exporter after buster/mariadb upgrade.

Another thought here: if we want the exporter to be automatically restarted when mysqld is restarted, we should be able to get systemd to do this for us.

Aug 9 2021, 4:27 PM · DBA

Aug 6 2021

MatthewVernon created T288350: Add Matthew Vernon (@mcv21) to Wikimedia github.
Aug 6 2021, 2:23 PM · Wikimedia-GitHub

Aug 4 2021

MatthewVernon created T288122: New VictorOps user request.
Aug 4 2021, 4:11 PM · SRE Observability (FY2021/2022-Q1)
MatthewVernon created T288038: Add Matthew Vernon to security@wikimedia.org.
Aug 4 2021, 9:10 AM · SecTeam-Processed, Security-Team