BTullis (Ben)
Senior SRE

User Details

User Since
Jun 29 2021, 9:56 AM (146 w, 2 h)
Availability
Available
IRC Nick
btullis
LDAP User
Btullis
MediaWiki User
BTullis (WMF) [ Global Accounts ]

Recent Activity

Today

BTullis added a comment to T361688: Upgrade datahub to v0.12.1.

The first upgrade attempt failed on codfw, with these errors from the datahub-main-nocode-migration-job:

ANTLR Tool version 4.5 used for code generation does not match the current runtime version 4.8
ANTLR Runtime version 4.5 used for parser compilation does not match the current runtime version 4.8
ANTLR Tool version 4.5 used for code generation does not match the current runtime version 4.8
ANTLR Runtime version 4.5 used for parser compilation does not match the current runtime version 4.8
java.net.ConnectException: Connection timed out (Connection timed out)

and

2024-04-16 11:31:12,768 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:15 - ERROR: Cannot connect to GMSat https://host datahub-gms-main-tls-service.datahub.svc.cluster.local port 8501. Make sure GMS is on the latest version and is running at that host before starting the migration.

Currently investigating this.
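One way to narrow down a timeout like this is to check plain TCP reachability of the GMS endpoint from a host or pod on the same network as the migration job. The following is only a rough sketch and not part of the DataHub tooling: the helper name `can_connect` is hypothetical, and the host and port in the usage comment are copied from the error message above.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for illustration only. It checks the network path,
    not TLS or the health of the GMS application itself.
    """
    try:
        # create_connection resolves the name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


# Example call, using the endpoint named in the log line above:
# can_connect("datahub-gms-main-tls-service.datahub.svc.cluster.local", 8501)
```

A False result here would point towards a network policy, DNS or service problem rather than a version mismatch in the migration job.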

Tue, Apr 16, 11:38 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review
BTullis removed a project from T361342: Provision the MPIC secrets in the private puppet repository: Epic.
Tue, Apr 16, 9:13 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis claimed T362602: Requesting kerberos identity for Surbhi Gupta .
Tue, Apr 16, 8:28 AM · SRE-Access-Requests, SRE, Data-Engineering

Yesterday

BTullis added a comment to T351552: Upgrade matomo (piwik.wikimedia.org) to latest stable version.

I feel that I'm making good progress on this now.

Mon, Apr 15, 11:53 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review
BTullis renamed T310293: HDFS Namenode fail-back failure from HDFS Namenode failover failure to HDFS Namenode fail-back failure.
Mon, Apr 15, 11:14 AM · Data-Platform-SRE
BTullis updated the task description for T310293: HDFS Namenode fail-back failure.
Mon, Apr 15, 11:14 AM · Data-Platform-SRE
BTullis closed T351657: Application Security Review Request : Matomo upgrade and its campaign reporter plugin as Resolved.

Thanks all for your input. We're currently proceeding to install matomo version 4.16.1, which is the most recent version available on https://debian.matomo.org/ at present.

Mon, Apr 15, 10:55 AM · SecTeam-Processed, secscrum, Security, Application Security Reviews
BTullis closed T351657: Application Security Review Request : Matomo upgrade and its campaign reporter plugin, a subtask of T351552: Upgrade matomo (piwik.wikimedia.org) to latest stable version, as Resolved.
Mon, Apr 15, 10:54 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review
BTullis closed T351657: Application Security Review Request : Matomo upgrade and its campaign reporter plugin, a subtask of T319013: Enable the Marketing Campaigns Reporting plugin for matomo, as Resolved.
Mon, Apr 15, 10:54 AM · Data-Platform-SRE (2024.03.04 - 2024.03.24), Data-Engineering, Foundational Technology Requests
BTullis added a comment to T352783: Change data platform-related IRC channels to improve communication.

I think that the last thing we discussed was to take a hint from what the cloud-services-team do and create an IRC channel named #wikimedia-data-platform-feed, which receives all of the automated chatter (gerrit, gitlab, phab, alertmanager).

Mon, Apr 15, 9:02 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), observability
BTullis moved T356303: Review wikitech:Search and write processes for k8s world from Blocked / Waiting to Done on the Data-Platform-SRE (2024.03.25 - 2024.04.14) board.
Mon, Apr 15, 8:37 AM · Data-Platform-SRE (2024.03.25 - 2024.04.14), Documentation, Discovery-Search (Current work)
BTullis moved T359993: Slowdown when querying via Hive from In Progress to Done on the Data-Platform-SRE (2024.03.25 - 2024.04.14) board.
Mon, Apr 15, 8:33 AM · Data-Platform-SRE (2024.03.25 - 2024.04.14), Data-Platform

Thu, Apr 11

BTullis added a comment to T359993: Slowdown when querying via Hive.

Thanks so much for that explanation @JAllemandou - it makes perfect sense.

Thu, Apr 11, 3:10 PM · Data-Platform-SRE (2024.03.25 - 2024.04.14), Data-Platform
BTullis added a comment to T362146: Site: eqiad 1 VM for Matomo.

I'm adding the second disk now.

btullis@ganeti1027:~$ sudo gnt-instance modify --disk add:size=80g matomo1003.eqiad.wmnet
Thu Apr 11 09:07:30 2024  - INFO: Waiting for instance matomo1003.eqiad.wmnet to sync disks
Thu Apr 11 09:07:30 2024  - INFO: - device disk/1:  0.10% done, 1h 7m 12s remaining (estimated)
Thu Apr 11 09:08:31 2024  - INFO: - device disk/1:  2.80% done, 35m 15s remaining (estimated)

I think that I will mount this as /srv and try to make the mariadb configuration more like one of our standard server setups.

Thu, Apr 11, 9:10 AM · Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), vm-requests, Infrastructure-Foundations, SRE

Wed, Apr 10

BTullis added a comment to T359993: Slowdown when querying via Hive.

I do see the following, which is suspicious.

image.png (547×1 px, 170 KB)

The final task had two attempts. One finished after 17 minutes, 5 seconds. The other was killed after 17 minutes, 22 seconds.

Wed, Apr 10, 6:25 PM · Data-Platform-SRE (2024.03.25 - 2024.04.14), Data-Platform
BTullis added a comment to T359993: Slowdown when querying via Hive.

This timing is confirmed by the mapreduce history server too.

image.png (587×1 px, 193 KB)

Wed, Apr 10, 5:20 PM · Data-Platform-SRE (2024.03.25 - 2024.04.14), Data-Platform
BTullis added a comment to T359993: Slowdown when querying via Hive.

I can certainly replicate the behaviour.

Wed, Apr 10, 5:11 PM · Data-Platform-SRE (2024.03.25 - 2024.04.14), Data-Platform
BTullis committed rLPRI563fc4a5e57d: Add dummy data for the new matomo service..
Add dummy data for the new matomo service.
Wed, Apr 10, 3:56 PM
BTullis added a comment to T351552: Upgrade matomo (piwik.wikimedia.org) to latest stable version.

I have pulled in the latest stable version from debian.matomo.org to the bookworm repository.

btullis@apt1002:/srv/wikimedia/conf$ sudo -i reprepro --component thirdparty/matomo checkupdate bookworm-wikimedia
Calculating packages to get...
Updates needed for 'bookworm-wikimedia|thirdparty/matomo|amd64':
'matomo': newly installed as '4.16.1-1' (from 'matomo'):
 files needed: pool/thirdparty/matomo/m/matomo/matomo_4.16.1-1_all.deb
btullis@apt1002:/srv/wikimedia/conf$ sudo -i reprepro --noskipold --component thirdparty/matomo update bookworm-wikimedia
Calculating packages to get...
Getting packages...
Installing (and possibly deleting) packages...
Exporting indices...
Wed, Apr 10, 3:08 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review
BTullis closed T325232: Migrate Dumpsdata and Htmldumper Hosts From Buster to Bullseye as Resolved.

Having discussed this with @ArielGlenn, we feel that we can resolve this ticket without re-populating the /data volume on dumpsdata1005.

Wed, Apr 10, 11:21 AM · Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review, Dumps-Generation
BTullis moved T325232: Migrate Dumpsdata and Htmldumper Hosts From Buster to Bullseye from Active to Done on the Dumps-Generation board.
Wed, Apr 10, 11:21 AM · Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review, Dumps-Generation
BTullis updated the task description for T325232: Migrate Dumpsdata and Htmldumper Hosts From Buster to Bullseye.
Wed, Apr 10, 10:47 AM · Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review, Dumps-Generation
BTullis added a comment to T325232: Migrate Dumpsdata and Htmldumper Hosts From Buster to Bullseye.

I made a mistake when reimaging dumpsdata1005, so the /data volume was formatted and is now empty.

Wed, Apr 10, 10:47 AM · Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review, Dumps-Generation
BTullis added a comment to T325232: Migrate Dumpsdata and Htmldumper Hosts From Buster to Bullseye.

@BTullis can you please update https://wikitech.wikimedia.org/wiki/Dumps/Dumpsdata_hosts once this task is done?

Wed, Apr 10, 10:25 AM · Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review, Dumps-Generation
BTullis moved T360531: [Commons Impact Metrics] Refactor/create helm config for AQS service accessing both Cassandra and Druid from Needs Review to Done on the Data-Platform-SRE (2024.03.25 - 2024.04.14) board.

I think that we can call this ticket done (for now) because the new chart is ready for use.
I'll move it to Done on the Data-Platform-SRE board, but I'll wait for confirmation from someone on the Data Products team to verify that it meets the requirements, before resolving it.

Wed, Apr 10, 9:40 AM · Data Products (Data Products Sprint 12), Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), Commons-Impact-Metrics
BTullis added a comment to T360531: [Commons Impact Metrics] Refactor/create helm config for AQS service accessing both Cassandra and Druid.

There is also an opportunity to remove the need to specify all of the cassandra server IPs by implementing T359423: Migrate charts to Calico Network Policies for this chart.

Only seeing this now, so this is a bit of a drive-by comment that's coming a little late, but there is also a cassandra sextant module in deployment-charts that will slot in easily. Not sure how the network policy module and it are reconciled though, as they overlap as far as defining the network side of things.

Wed, Apr 10, 9:35 AM · Data Products (Data Products Sprint 12), Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), Commons-Impact-Metrics
BTullis added a comment to T360531: [Commons Impact Metrics] Refactor/create helm config for AQS service accessing both Cassandra and Druid.

Thanks for your considered reply @Eevans.
I confess that I hadn't spotted this comment until moments after merging the CR above, so apologies for jumping the gun there.
On the positive side, it's easier to rename the chart(s) later than it is to decide on a good name, so we can definitely carry on discussing it. I didn't mean to shut down the conversation by proceeding.

If we step back and think about the class of things we want to use this chart for, is there anything about them that makes the term "http gateway" meaningful (or is that just being carried forward from the original name)?

I do see your point, but is there anything about it that makes the term 'http gateway' incorrect?
When creating this new chart (based on merging the functionality of the druid and cassandra charts used by AQS 2.0 services) I only intended the term 'http gateway' to be purely descriptive. I had no knowledge of The Cassandra HTTP Gateway™ as its precursor.

Wed, Apr 10, 9:24 AM · Data Products (Data Products Sprint 12), Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), Commons-Impact-Metrics
BTullis moved T336040: Bring stat1010 into service with GPU from stat1005 from In Progress to Blocked / Waiting on the Data-Platform-SRE (2024.03.25 - 2024.04.14) board.

Unfortunately, the cable hadn't arrived so we will have to re-schedule this work again.

Wed, Apr 10, 8:39 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis updated the task description for T349619: Migrate roles to puppet7.
Wed, Apr 10, 8:36 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, serviceops, collaboration-services, SRE-tools, Puppet-Core, Puppet (Puppet 7.0), Infrastructure-Foundations, SRE

Tue, Apr 9

BTullis closed T362146: Site: eqiad 1 VM for Matomo as Resolved.
Tue, Apr 9, 4:08 PM · Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), vm-requests, Infrastructure-Foundations, SRE
BTullis closed T362146: Site: eqiad 1 VM for Matomo, a subtask of T349397: Migrate matomo to Debian bullseye (or bookworm), as Resolved.
Tue, Apr 9, 4:06 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, Data-Engineering
BTullis claimed T325232: Migrate Dumpsdata and Htmldumper Hosts From Buster to Bullseye.
Tue, Apr 9, 3:51 PM · Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review, Dumps-Generation
BTullis added a parent task for T362146: Site: eqiad 1 VM for Matomo: T349397: Migrate matomo to Debian bullseye (or bookworm).
Tue, Apr 9, 3:22 PM · Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), vm-requests, Infrastructure-Foundations, SRE
BTullis added a subtask for T349397: Migrate matomo to Debian bullseye (or bookworm): T362146: Site: eqiad 1 VM for Matomo.
Tue, Apr 9, 3:22 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, Data-Engineering
BTullis moved T362146: Site: eqiad 1 VM for Matomo from Backlog to In Progress on the Data-Platform-SRE (2024.03.25 - 2024.04.14) board.
Tue, Apr 9, 1:21 PM · Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), vm-requests, Infrastructure-Foundations, SRE
BTullis moved T336040: Bring stat1010 into service with GPU from stat1005 from Blocked / Waiting to In Progress on the Data-Platform-SRE (2024.03.25 - 2024.04.14) board.
Tue, Apr 9, 1:20 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis added a comment to T362146: Site: eqiad 1 VM for Matomo.

The Ganeti cluster report looks like it's fairly evenly balanced at the moment.

DRY-RUN: START - Cookbook sre.ganeti.resource-report
+-------+-------+-----------+----------+-----------+---------+-----------+
| Group | Nodes | Instances |  MFree   | MFree avg |  DFree  | DFree avg |
+-------+-------+-----------+----------+-----------+---------+-----------+
|   A   |   8   |     35    | 291.7GiB |  36.5GiB  | 16.6TiB |   2.1TiB  |
|   B   |   7   |     36    | 232.2GiB |  33.2GiB  | 11.9TiB |   1.7TiB  |
|   C   |   8   |     37    | 289.2GiB |  36.1GiB  | 15.6TiB |   1.9TiB  |
|   D   |   7   |     32    | 276.7GiB |  39.5GiB  | 13.1TiB |   1.9TiB  |
+-------+-------+-----------+----------+-----------+---------+-----------+

matomo1002 is currently in cluster group C, so I suppose that if I use this group then it will remain balanced after I decom the old host.
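The balance reasoning can be checked with a little arithmetic. This is only an illustrative sketch: the instance counts are copied from the report above, the function name `after_migration` is hypothetical, and it assumes the new VM is placed in the chosen group before the old host is decommissioned.

```python
# Instance counts per Ganeti group, copied from the resource report above.
instances = {"A": 35, "B": 36, "C": 37, "D": 32}


def after_migration(counts: dict[str, int], group: str) -> tuple[dict[str, int], dict[str, int]]:
    """Return the counts while both VMs exist, and after the old one is removed.

    Assumes the old and new VM live in the same group (hypothetical helper).
    """
    during = dict(counts)
    during[group] += 1  # new VM created alongside the old one
    after = dict(during)
    after[group] -= 1   # old VM decommissioned
    return during, after


during, after = after_migration(instances, "C")
print(during["C"], after["C"])  # 38 37
```

Group C is temporarily one instance higher and then returns to its current count, so the distribution ends up unchanged.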

Tue, Apr 9, 11:11 AM · Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), vm-requests, Infrastructure-Foundations, SRE
BTullis claimed T362146: Site: eqiad 1 VM for Matomo.

I'll add the second disk after the initial creation by the cookbook. This will be useful to allow us to retain MariaDB data during an in-place reimage.

Tue, Apr 9, 11:08 AM · Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), vm-requests, Infrastructure-Foundations, SRE
BTullis created T362146: Site: eqiad 1 VM for Matomo.
Tue, Apr 9, 11:05 AM · Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), vm-requests, Infrastructure-Foundations, SRE
BTullis added a project to T325232: Migrate Dumpsdata and Htmldumper Hosts From Buster to Bullseye: Data-Platform-SRE (2024.03.25 - 2024.04.14).

I'm going to pick up this ticket and upgrade the last two remaining spare dumpsdata hosts.

Tue, Apr 9, 11:04 AM · Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review, Dumps-Generation
BTullis moved T325232: Migrate Dumpsdata and Htmldumper Hosts From Buster to Bullseye from Backlog to In Progress on the Data-Platform-SRE (2024.03.25 - 2024.04.14) board.
Tue, Apr 9, 11:04 AM · Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review, Dumps-Generation
BTullis claimed T349397: Migrate matomo to Debian bullseye (or bookworm).
Tue, Apr 9, 11:01 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, Data-Engineering
BTullis moved T349397: Migrate matomo to Debian bullseye (or bookworm) from Backlog to In Progress on the Data-Platform-SRE (2024.03.25 - 2024.04.14) board.
Tue, Apr 9, 11:01 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, Data-Engineering
BTullis claimed T351552: Upgrade matomo (piwik.wikimedia.org) to latest stable version.
Tue, Apr 9, 11:01 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review
BTullis moved T351552: Upgrade matomo (piwik.wikimedia.org) to latest stable version from Backlog to In Progress on the Data-Platform-SRE (2024.03.25 - 2024.04.14) board.
Tue, Apr 9, 11:01 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review
BTullis updated the task description for T288804: Upgrade the Data Engineering infrastructure to Debian Bullseye.
Tue, Apr 9, 10:53 AM · Data-Platform-SRE, Epic
BTullis edited projects for T349397: Migrate matomo to Debian bullseye (or bookworm), added: Data-Platform-SRE (2024.03.25 - 2024.04.14); removed Data-Platform-SRE.
Tue, Apr 9, 10:50 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, Data-Engineering
BTullis edited projects for T351552: Upgrade matomo (piwik.wikimedia.org) to latest stable version, added: Data-Platform-SRE (2024.03.25 - 2024.04.14); removed Data-Platform-SRE.
Tue, Apr 9, 10:50 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review
BTullis updated the task description for T325228: Migrate Dumps Snapshot hosts from Buster to Bullseye.
Tue, Apr 9, 10:47 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), SRE, Data-Engineering, Dumps-Generation
BTullis added a comment to T336040: Bring stat1010 into service with GPU from stat1005.

The cable has arrived and we're planning to shut down stat1010 to do the work today.

Tue, Apr 9, 10:47 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis moved T358314: Hardware requests for Data Platform Engineering - FY2024-2025 from Needs Review to Done on the Data-Platform-SRE (2024.03.25 - 2024.04.14) board.
Tue, Apr 9, 10:34 AM · Data-Platform-SRE (2024.03.25 - 2024.04.14)

Mon, Apr 8

BTullis added a comment to T361511: Try to reverse wipefs on host using DRAC/iLO and document.

This task might take a long time to achieve, for something that might seldom (if ever) be used again.
There are also a lot of variables between hosts which might come into play, such as the different iDRAC firmware capabilities. e.g. attaching virtual media.

Mon, Apr 8, 4:47 PM · Data-Platform-SRE
BTullis claimed T359993: Slowdown when querying via Hive.

I'm looking at this issue now, but I would tend to agree with @mpopov here. Both presto and spark3-sql are preferred options, compared to the hive cli. Hue has also been decommissioned since this ticket was created.
The hive cli uses the mapreduce shuffler for yarn, which is superseded by the spark3 shuffler service for yarn.

Mon, Apr 8, 4:21 PM · Data-Platform-SRE (2024.03.25 - 2024.04.14), Data-Platform
BTullis closed T353787: Decom dumpsdata100[1-2] as Resolved.
Mon, Apr 8, 4:04 PM · Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), Dumps-Generation
BTullis moved T353787: Decom dumpsdata100[1-2] from Other teams to Done on the Dumps-Generation board.
Mon, Apr 8, 4:03 PM · Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), Dumps-Generation
BTullis placed T362065: decommission dumpsdata1002.eqiad.wmnet up for grabs.
Mon, Apr 8, 3:58 PM · SRE, ops-eqiad, decommission-hardware
BTullis placed T362064: decommission dumpsdata1001.eqiad.wmnet up for grabs.
Mon, Apr 8, 3:44 PM · SRE, ops-eqiad, decommission-hardware
BTullis moved T362065: decommission dumpsdata1002.eqiad.wmnet from Backlog to In Progress on the Data-Platform-SRE (2024.03.25 - 2024.04.14) board.
Mon, Apr 8, 1:09 PM · SRE, ops-eqiad, decommission-hardware
BTullis moved T362064: decommission dumpsdata1001.eqiad.wmnet from Backlog to In Progress on the Data-Platform-SRE (2024.03.25 - 2024.04.14) board.
Mon, Apr 8, 1:09 PM · SRE, ops-eqiad, decommission-hardware
BTullis updated the task description for T362064: decommission dumpsdata1001.eqiad.wmnet.
Mon, Apr 8, 1:08 PM · SRE, ops-eqiad, decommission-hardware
BTullis added a subtask for T353787: Decom dumpsdata100[1-2]: T362064: decommission dumpsdata1001.eqiad.wmnet.
Mon, Apr 8, 12:35 PM · Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), Dumps-Generation
BTullis added a parent task for T362064: decommission dumpsdata1001.eqiad.wmnet: T353787: Decom dumpsdata100[1-2].
Mon, Apr 8, 12:35 PM · SRE, ops-eqiad, decommission-hardware
BTullis added a subtask for T353787: Decom dumpsdata100[1-2]: T362065: decommission dumpsdata1002.eqiad.wmnet.
Mon, Apr 8, 12:34 PM · Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), Dumps-Generation
BTullis added a parent task for T362065: decommission dumpsdata1002.eqiad.wmnet: T353787: Decom dumpsdata100[1-2].
Mon, Apr 8, 12:34 PM · SRE, ops-eqiad, decommission-hardware
BTullis created T362065: decommission dumpsdata1002.eqiad.wmnet.
Mon, Apr 8, 12:34 PM · SRE, ops-eqiad, decommission-hardware
BTullis created T362064: decommission dumpsdata1001.eqiad.wmnet.
Mon, Apr 8, 12:33 PM · SRE, ops-eqiad, decommission-hardware
BTullis claimed T353787: Decom dumpsdata100[1-2].
Mon, Apr 8, 12:31 PM · Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), Dumps-Generation
BTullis moved T353787: Decom dumpsdata100[1-2] from Backlog to In Progress on the Data-Platform-SRE (2024.03.25 - 2024.04.14) board.
Mon, Apr 8, 12:31 PM · Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), Dumps-Generation
BTullis added a comment to T252396: Split page-meta-history wikidata dump job across multiple hosts.

I have started another manual run of the wikidata dump job on snapshot1009 with the following command, running in a screen session owned by the dumpsgen user.

dumpsgen@snapshot1009:~$ /bin/bash /var/lib/dumpsgen/dumps_fillin_wd.sh --verbose --startday 07 --endday 11 --numjobs 28 --jobinfo 25,27 --wiki wikidatawiki --config /etc/dumps/confs/wikidump.conf.dumps:wd

We should aim to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/993659 this month, so that it happens automatically next month. We missed the start time of the job again this month.

Mon, Apr 8, 11:07 AM · Patch-For-Review, Dumps-Generation
BTullis added a comment to T361955: Create an `mpic` MariaDB database.

We have the Analytics_Meta database service that we can use for this.

Mon, Apr 8, 9:00 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05)

Thu, Mar 28

BTullis awarded T361225: Update GPU labels in Hadoop 's Yarn config a Yellow Medal token.
Thu, Mar 28, 3:02 PM · Data-Platform-SRE
BTullis added a comment to T361225: Update GPU labels in Hadoop 's Yarn config.

That's excellent, please feel free to proceed @elukey. I had forgotten to remove them.

Thu, Mar 28, 2:59 PM · Data-Platform-SRE
BTullis added a comment to T341895: Deprecate Hue and stop the services.

I have archived these Wikitech pages:

Thu, Mar 28, 11:36 AM · Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), Data-Engineering
BTullis moved T360531: [Commons Impact Metrics] Refactor/create helm config for AQS service accessing both Cassandra and Druid from In Progress to Needs Review on the Data-Platform-SRE (2024.03.25 - 2024.04.14) board.

@mforns - Does this unblock your work on the MW history and/or commons metrics?

Thu, Mar 28, 11:12 AM · Data Products (Data Products Sprint 12), Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), Commons-Impact-Metrics
BTullis added a comment to T336040: Bring stat1010 into service with GPU from stat1005.

We're still awaiting the required cable from Dell for enabling the GPU in stat1010. That's being tracked in T359089.

Thu, Mar 28, 11:09 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis added a comment to T360531: [Commons Impact Metrics] Refactor/create helm config for AQS service accessing both Cassandra and Druid.

I have created a stack of small patches to implement this.
https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1014655

image.png (222×454 px, 29 KB)

Thu, Mar 28, 11:06 AM · Data Products (Data Products Sprint 12), Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), Commons-Impact-Metrics

Wed, Mar 27

BTullis updated subscribers of T360531: [Commons Impact Metrics] Refactor/create helm config for AQS service accessing both Cassandra and Druid.

I've had an initial look at this and it seems that it's likely going to be easier to merge the existing cassandra-https-gateway and druid-http-gateway charts into a new combined chart, rather than make an additional one.

Wed, Mar 27, 5:03 PM · Data Products (Data Products Sprint 12), Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), Commons-Impact-Metrics
BTullis closed T358675: Update the From: addresses of all email from DPE pipelines so that they use routable addresses as Resolved.

I believe that this is all done, with the exception of this patch to refinery-source, which updates the default email address: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/1014004

Wed, Mar 27, 3:02 PM · Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review, Data-Engineering
BTullis closed T361096: hw troubleshooting: boot failure for an-worker1096.eqiad.wmnet as Resolved.

The reimage worked and the host is back in the cluster.

Wed, Mar 27, 1:59 PM · Data-Platform-SRE (2024.03.25 - 2024.04.14)
BTullis added a comment to T307505: Refine jobs should be scheduled by Airflow.

We (Data-Platform-SRE) have been working on updating the alerting system so that all emails sent by automated monitoring systems use routable domains. This work is being carried out under T358675: Update the From: addresses of all email from DPE pipelines so that they use routable addresses

Wed, Mar 27, 1:58 PM · Data-Engineering, Data Pipelines
BTullis added a comment to T314956: [Event Platform] Declare webrequest as an Event Platform stream.

Hello. FYI we are receiving some alerts about failed produce_canary_events jobs, due to being unable to find the webrequest schema. e.g.

24/03/27 13:30:45 ERROR ResourceLoader: Caught exception when trying to load resource.
org.wikimedia.eventutilities.core.util.ResourceLoadingException: Failed loading resource. (resource: https://schema.discovery.wmnet/repositories/primary/jsonschema/development/webrequest/latest)
	at org.wikimedia.eventutilities.core.util.ResourceLoader.loadFirst(ResourceLoader.java:122)

Let me know if there's anything I can do to help, but maybe this is all fine.

Wed, Mar 27, 1:38 PM · Patch-For-Review, Data-Engineering, Event-Platform
BTullis added a comment to T356363: [Refine Refactoring] Refactor refinery code for compatibility with Airflow integration.

I had a question about this, because I just found out that refinery has its own email sending code. (re: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/1014004)

Wed, Mar 27, 1:31 PM · Data-Engineering (Q4 2024 April 1st - June 30th), Patch-For-Review
BTullis created T361096: hw troubleshooting: boot failure for an-worker1096.eqiad.wmnet.
Wed, Mar 27, 12:23 PM · Data-Platform-SRE (2024.03.25 - 2024.04.14)
BTullis added a comment to T358675: Update the From: addresses of all email from DPE pipelines so that they use routable addresses.

I can confirm receipt of a new refinemonitor report, using the new email address.

image.png (93×690 px, 14 KB)

Wed, Mar 27, 9:47 AM · Data-Platform-SRE (2024.03.25 - 2024.04.14), Patch-For-Review, Data-Engineering

Tue, Mar 26

BTullis updated the task description for T359423: Migrate charts to Calico Network Policies.
Tue, Mar 26, 11:08 PM · Patch-For-Review, Data-Platform-SRE, Prod-Kubernetes, Kubernetes, serviceops
BTullis updated the task description for T359423: Migrate charts to Calico Network Policies.
Tue, Mar 26, 10:58 PM · Patch-For-Review, Data-Platform-SRE, Prod-Kubernetes, Kubernetes, serviceops
BTullis added a comment to T359423: Migrate charts to Calico Network Policies.

I needed to start work on the DataHub migration because SSO broke due to the switch of IDP servers: https://sal.toolforge.org/log/UMCPdY4BhuQtenzvi5Pe

Tue, Mar 26, 10:30 PM · Patch-For-Review, Data-Platform-SRE, Prod-Kubernetes, Kubernetes, serviceops
BTullis updated the task description for T359423: Migrate charts to Calico Network Policies.
Tue, Mar 26, 10:26 PM · Patch-For-Review, Data-Platform-SRE, Prod-Kubernetes, Kubernetes, serviceops
BTullis added a comment to T361024: NEW BUG REPORT SSL certificate verification error when using internal API endpoints from conda-analytics and Jupyter on stat host.

Great! I'm glad it worked for you. It's frustrating that we still have to do it.

Tue, Mar 26, 9:02 PM · Data-Platform-SRE, Data-Platform
BTullis claimed T360531: [Commons Impact Metrics] Refactor/create helm config for AQS service accessing both Cassandra and Druid.
Tue, Mar 26, 4:48 PM · Data Products (Data Products Sprint 12), Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), Commons-Impact-Metrics
BTullis moved T360531: [Commons Impact Metrics] Refactor/create helm config for AQS service accessing both Cassandra and Druid from Backlog to In Progress on the Data-Platform-SRE (2024.03.25 - 2024.04.14) board.
Tue, Mar 26, 4:48 PM · Data Products (Data Products Sprint 12), Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), Commons-Impact-Metrics
BTullis added a comment to T361024: NEW BUG REPORT SSL certificate verification error when using internal API endpoints from conda-analytics and Jupyter on stat host.

Please could you try again with the following environment variable set?

export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt

...or set it somehow in your notebook. Let us know if it makes a difference.
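For the notebook case, one option is to set the variable from Python before making any requests. This is a minimal sketch, assuming the client code uses the `requests` library with its default behaviour of honouring environment settings; the helper name `use_system_ca_bundle` is hypothetical.

```python
import os

# Path taken from the export command above.
CA_BUNDLE = "/etc/ssl/certs/ca-certificates.crt"


def use_system_ca_bundle(path: str = CA_BUNDLE) -> str:
    """Point the requests library at the system CA bundle for this process.

    Hypothetical helper: it only sets the environment variable, which
    requests consults when a request is made.
    """
    os.environ["REQUESTS_CA_BUNDLE"] = path
    return os.environ["REQUESTS_CA_BUNDLE"]


# Run this in the first notebook cell, before any HTTP calls are made.
use_system_ca_bundle()
```

The setting only applies to the current kernel, so it needs to be re-run after a kernel restart.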

Tue, Mar 26, 4:04 PM · Data-Platform-SRE, Data-Platform
BTullis closed T358268: Update maxmind download to pull databases from new url as Resolved.

I set a reminder for the Data-Platform-SRE channel in Slack for April 17th and 18th.

image.png (663×681 px, 75 KB)

That links back to this ticket and the mitigation steps that we can take if we need to.

Tue, Mar 26, 3:57 PM · Data-Platform-SRE (2024.03.25 - 2024.04.14)
BTullis added a comment to T358268: Update maxmind download to pull databases from new url.

Even with the warning...

you may need to upgrade to GeoIP Update 4.x or later version

...I'm not convinced that we will need to do anything.

Tue, Mar 26, 3:36 PM · Data-Platform-SRE (2024.03.25 - 2024.04.14)
BTullis added a comment to T358268: Update maxmind download to pull databases from new url.

Ascertaining how many puppet 5 clients would still be affected. It's 175 hosts.

btullis@cumin1002:~$ sudo cumin C:geoip::data::puppet 'cat /etc/debian_version'
698 hosts will be targeted:
an-airflow1005.eqiad.wmnet,an-coord[1001-1004].eqiad.wmnet,an-launcher1002.eqiad.wmnet,an-test-client1002.eqiad.wmnet,an-test-coord1001.eqiad.wmnet,an-test-worker[1001-1003].eqiad.wmnet,an-worker[1078-1095,1097-1175].eqiad.wmnet,analytics[1070-1077].eqiad.wmnet,centrallog2002.codfw.wmnet,centrallog1002.eqiad.wmnet,cloudweb2002-dev.wikimedia.org,cloudweb[1003-1004].wikimedia.org,cp[2027,2029,2031,2033,2035,2037,2039,2041].codfw.wmnet,cp[6009-6016].drmrs.wmnet,cp[1100,1102,1104,1106,1108,1110,1112,1114].eqiad.wmnet,cp[5017-5024].eqsin.wmnet,cp[3066-3073].esams.wmnet,cp[4037-4044].ulsfo.wmnet,deploy2002.codfw.wmnet,deploy1002.eqiad.wmnet,dns[1004-1006,2004-2006,3003-3004,4003-4004,5003-5004,6001-6002].wikimedia.org,kubernetes[2007-2014,2017-2060].codfw.wmnet,kubernetes[1007-1014,1017-1062].eqiad.wmnet,mw[2259-2279,2281-2339,2350-2451].codfw.wmnet,mw[1349-1496].eqiad.wmnet,mwdebug[2001-2002].codfw.wmnet,mwdebug[1001-1002].eqiad.wmnet,mwlog2002.codfw.wmnet,mwlog1002.eqiad.wmnet,mwmaint2002.codfw.wmnet,mwmaint1002.eqiad.wmnet,netflow2003.codfw.wmnet,netflow6001.drmrs.wmnet,netflow1002.eqiad.wmnet,netflow5002.eqsin.wmnet,netflow3003.esams.wmnet,netflow4002.ulsfo.wmnet,parse[2001-2020].codfw.wmnet,parse[1001-1024].eqiad.wmnet,scandium.eqiad.wmnet,snapshot[1008-1017].eqiad.wmnet,stat[1004-1011].eqiad.wmnet
OK to proceed on 698 hosts? Enter the number of affected hosts to confirm or "q" to quit: 698
===== NODE GROUP =====
(20) dns[1004-1006,2004-2006,3003-3004,4003-4004,5003-5004,6001-6002].wikimedia.org,netflow2003.codfw.wmnet,netflow6001.drmrs.wmnet,netflow1002.eqiad.wmnet,netflow5002.eqsin.wmnet,netflow3003.esams.wmnet,netflow4002.ulsfo.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----
12.5
===== NODE GROUP =====
(503) an-airflow1005.eqiad.wmnet,an-coord[1003-1004].eqiad.wmnet,an-test-client1002.eqiad.wmnet,an-test-coord1001.eqiad.wmnet,an-test-worker[1001-1003].eqiad.wmnet,an-worker[1078-1095,1097-1175].eqiad.wmnet,analytics[1070-1077].eqiad.wmnet,centrallog2002.codfw.wmnet,centrallog1002.eqiad.wmnet,cloudweb2002-dev.wikimedia.org,cloudweb[1003-1004].wikimedia.org,cp[2027,2029,2031,2033,2035,2037,2039,2041].codfw.wmnet,cp[6009-6016].drmrs.wmnet,cp[1100,1102,1104,1106,1108,1110,1112,1114].eqiad.wmnet,cp[5017-5024].eqsin.wmnet,cp[3066-3073].esams.wmnet,cp[4037-4044].ulsfo.wmnet,kubernetes[2007-2014,2017-2060].codfw.wmnet,kubernetes[1007-1014,1017-1062].eqiad.wmnet,mw[2260,2267,2282,2291-2297,2301,2310-2322,2335-2337,2350-2357,2366-2389,2394-2395,2406,2419-2431,2434-2437,2440,2442-2451].codfw.wmnet,mw[1349-1354,1356-1357,1360-1363,1367-1370,1374-1397,1408,1419,1423-1425,1430-1434,1439-1442,1451-1455,1457-1479,1482-1486,1488,1494-1496].eqiad.wmnet,mwlog2002.codfw.wmnet,mwlog1002.eqiad.wmnet,parse[2002-2020].codfw.wmnet,parse[1002-1024].eqiad.wmnet,snapshot[1014,1016-1017].eqiad.wmnet,stat[1009-1011].eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----                                                                                                                                                                    
11.9                                                                                                                                                                                                               
===== NODE GROUP =====                                                                                                                                                                                             
(175) an-coord[1001-1002].eqiad.wmnet,an-launcher1002.eqiad.wmnet,deploy2002.codfw.wmnet,deploy1002.eqiad.wmnet,mw[2259,2261-2266,2268-2279,2281,2283-2290,2298-2300,2302-2309,2323-2334,2338-2339,2358-2365,2390-2393,2396-2405,2407-2418,2432-2433,2438-2439,2441].codfw.wmnet,mw[1355,1358-1359,1364-1366,1371-1373,1398-1407,1409-1418,1420-1422,1426-1429,1435-1438,1443-1450,1456,1480-1481,1487,1489-1493].eqiad.wmnet,mwdebug[2001-2002].codfw.wmnet,mwdebug[1001-1002].eqiad.wmnet,mwmaint2002.codfw.wmnet,mwmaint1002.eqiad.wmnet,parse2001.codfw.wmnet,parse1001.eqiad.wmnet,scandium.eqiad.wmnet,snapshot[1008-1013,1015].eqiad.wmnet,stat[1004-1008].eqiad.wmnet
----- OUTPUT of 'cat /etc/debian_version' -----                                                                                                                                                                    
10.13                                                                                                                                                                                                              
================                                                                                                                                                                                                   
PASS |█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (698/698) [00:13<00:00, 50.34hosts/s]
FAIL |                                                                                                                                                                           |   0% (0/698) [00:13<?, ?hosts/s]
100.0% (698/698) success ratio (>= 100.0% threshold) for command: 'cat /etc/debian_version'.
100.0% (698/698) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
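
As a cross-check on the grouping above, here is a minimal sketch that tallies hosts per reported Debian version from cumin-style output. The function name `hosts_per_output` and the `SAMPLE` text are hypothetical; the sample keeps the group sizes and versions from the paste but shortens the host lists. It assumes each group has a single line of output, which holds for `cat /etc/debian_version` but not for commands in general.

```python
import re

# A group header line looks like "(175) host1,host2,..." in the paste above.
GROUP_RE = re.compile(r"^\((\d+)\)\s")


def hosts_per_output(cumin_text):
    """Map each distinct one-line command output to the number of hosts that produced it."""
    counts = {}
    lines = [line.strip() for line in cumin_text.splitlines()]
    for i, line in enumerate(lines):
        match = GROUP_RE.match(line)
        if not match:
            continue
        # Assumed layout: "(N) hosts", then "----- OUTPUT of ... -----",
        # then exactly one line containing the output.
        if i + 2 < len(lines) and lines[i + 1].startswith("----- OUTPUT"):
            output = lines[i + 2]
            counts[output] = counts.get(output, 0) + int(match.group(1))
    return counts


# Group sizes and versions are taken from the paste; host lists are shortened.
SAMPLE = """\
===== NODE GROUP =====
(20) dns[1004-1006].wikimedia.org (host list shortened)
----- OUTPUT of 'cat /etc/debian_version' -----
12.5
===== NODE GROUP =====
(503) an-airflow1005.eqiad.wmnet (host list shortened)
----- OUTPUT of 'cat /etc/debian_version' -----
11.9
===== NODE GROUP =====
(175) an-coord[1001-1002].eqiad.wmnet (host list shortened)
----- OUTPUT of 'cat /etc/debian_version' -----
10.13
================
"""

counts = hosts_per_output(SAMPLE)
print(counts)                # {'12.5': 20, '11.9': 503, '10.13': 175}
print(sum(counts.values()))  # 698
```

The three groups add up to the 698 targeted hosts, so no host is missing from the breakdown.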
Tue, Mar 26, 1:46 PM · Data-Platform-SRE (2024.03.25 - 2024.04.14)
BTullis added a comment to T358268: Update maxmind download to pull databases from new url.

For reference, I executed geoipupdate -v -d /tmp/btullis on both puppetmaster1001 and puppetserver1001 to verify that they currently download the files correctly.
There is no mention of a 302 redirect from either the 3.1.1 or 4.10.0 versions of geoipupdate.

Tue, Mar 26, 1:41 PM · Data-Platform-SRE (2024.03.25 - 2024.04.14)
BTullis added a comment to T358268: Update maxmind download to pull databases from new url.

Interestingly, the newer puppetserver (puppet 7) hosts have version 4.10 of geoipupdate.

btullis@puppetserver1001:~$ apt-cache policy geoipupdate
geoipupdate:
  Installed: 4.10.0-1
  Candidate: 4.10.0-1
  Version table:
 *** 4.10.0-1 500
        500 http://mirrors.wikimedia.org/debian bookworm/contrib amd64 Packages
        100 /var/lib/dpkg/status
Both the bullseye and bookworm hosts have the package installed from the `contrib` component of their debian distro, so it's not a component that we install from a third-party repository.
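
For comparing the installed version across hosts, a minimal sketch that pulls the `Installed:` field out of `apt-cache policy` output. The helper names `installed_version` and `upstream_version` are hypothetical, and the sample text is the puppetserver1001 output quoted above. Splitting the Debian revision at the last hyphen is a simplification that is adequate for a version such as `4.10.0-1`.

```python
import re


def installed_version(policy_text):
    """Return the value of the 'Installed:' field, or None if the package is not installed."""
    match = re.search(r"^\s*Installed:\s*(\S+)", policy_text, re.MULTILINE)
    if match is None or match.group(1) == "(none)":
        return None
    return match.group(1)


def upstream_version(debian_version):
    """Strip an optional epoch ('1:') and the Debian revision ('-1') from a version string."""
    no_epoch = debian_version.split(":", 1)[-1]
    if "-" in no_epoch:
        return no_epoch.rsplit("-", 1)[0]
    return no_epoch


# Output quoted from puppetserver1001 above.
SAMPLE = """\
geoipupdate:
  Installed: 4.10.0-1
  Candidate: 4.10.0-1
  Version table:
 *** 4.10.0-1 500
        500 http://mirrors.wikimedia.org/debian bookworm/contrib amd64 Packages
        100 /var/lib/dpkg/status
"""

version = installed_version(SAMPLE)
print(version)                    # 4.10.0-1
print(upstream_version(version))  # 4.10.0
```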
Tue, Mar 26, 1:06 PM · Data-Platform-SRE (2024.03.25 - 2024.04.14)
BTullis added a comment to T358268: Update maxmind download to pull databases from new url.

On initial inspection, I don't believe that we need to do anything in order for the maxmind downloads to keep working. However, I'm still double-checking.

Tue, Mar 26, 12:57 PM · Data-Platform-SRE (2024.03.25 - 2024.04.14)
BTullis closed T294772: Superset Timeout Logging as Declined.

I'll be bold and decline this ticket, but please feel free to reopen it if anyone feels strongly that we should be instrumenting Superset like this.
I think that the better way for us to gain visibility into query timeouts is probably as part of T269832, although that only covers Presto, not Druid.

Tue, Mar 26, 10:42 AM · Data-Platform-SRE (2024.03.25 - 2024.04.14), superset.wikimedia.org, Data-Engineering
BTullis moved T294772: Superset Timeout Logging from Needs Review to Done on the Data-Platform-SRE (2024.03.25 - 2024.04.14) board.
Tue, Mar 26, 10:42 AM · Data-Platform-SRE (2024.03.25 - 2024.04.14), superset.wikimedia.org, Data-Engineering