
Yesterday

JAllemandou awarded T359215: mediawiki_cirrussearch_request data is regularly late a Barnstar token.
Wed, Apr 24, 6:26 PM · Performance Issue, Data-Platform
BTullis closed T336040: Bring stat1010 into service with GPU from stat1005 as Resolved.

Great! I believe that it is working. Here is the output from radeontop.

image.png (484×654 px, 45 KB)

Wed, Apr 24, 4:49 PM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis added a comment to T336040: Bring stat1010 into service with GPU from stat1005.

Hmm. This isn't working correctly yet. I get the following results from radeontop.

btullis@stat1010:~$ sudo radeontop -l 10 -d -
Failed to open DRM node, no VRAM support.
Dumping to -, line limit 10.
1713976775.777684: bus 3d, gpu 100.00%, ee 100.00%, vgt 100.00%, ta 100.00%, sx 100.00%, sh 100.00%, spi 100.00%, sc 100.00%, pa 100.00%, db 100.00%, cb 100.00%
1713976776.777859: bus 3d, gpu 100.00%, ee 100.00%, vgt 100.00%, ta 100.00%, sx 100.00%, sh 100.00%, spi 100.00%, sc 100.00%, pa 100.00%, db 100.00%, cb 100.00%
1713976777.778032: bus 3d, gpu 100.00%, ee 100.00%, vgt 100.00%, ta 100.00%, sx 100.00%, sh 100.00%, spi 100.00%, sc 100.00%, pa 100.00%, db 100.00%, cb 100.00%
1713976778.778199: bus 3d, gpu 100.00%, ee 100.00%, vgt 100.00%, ta 100.00%, sx 100.00%, sh 100.00%, spi 100.00%, sc 100.00%, pa 100.00%, db 100.00%, cb 100.00%

There is nobody logged onto the server yet, so I can give it a reboot and try again.
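
For anyone wanting to spot this pattern automatically, here is a minimal sketch in Python that parses lines in the dump format shown above and flags samples where every counter reads exactly 100%. On a host with nobody logged in, that more likely points to a driver problem than to real load. The function names are hypothetical and the parsing assumes the line layout seen in this paste.

```python
import re

# Matches one sample line of the radeontop dump, e.g.
# "1713976775.777684: bus 3d, gpu 100.00%, ee 100.00%, ..."
LINE_RE = re.compile(r"^(?P<ts>\d+\.\d+): bus (?P<bus>\w+), (?P<rest>.*)$")


def parse_radeontop_line(line):
    """Return (bus, {counter: percent}) or None for non-sample lines."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    metrics = {}
    for field in m.group("rest").split(", "):
        # Keep only the counter name and its first value (the percentage).
        name, value = field.split()[:2]
        metrics[name] = float(value.rstrip("%"))
    return m.group("bus"), metrics


def looks_saturated(metrics):
    """True when every counter reads exactly 100%."""
    return bool(metrics) and all(v == 100.0 for v in metrics.values())
```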

Wed, Apr 24, 4:41 PM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis committed rLPRI26c694990bf3: Add dummy keytabs for new stats servers.
Add dummy keytabs for new stats servers
Wed, Apr 24, 4:22 PM
BTullis added a comment to T336040: Bring stat1010 into service with GPU from stat1005.

The GPU is now correctly detected.

btullis@stat1010:~$ sudo lshw -class display
  *-display                 
       description: VGA compatible controller
       product: Integrated Matrox G200eW3 Graphics Controller
       vendor: Matrox Electronics Systems Ltd.
       physical id: 0
       bus info: pci@0000:03:00.0
       version: 04
       width: 32 bits
       clock: 66MHz
       capabilities: pm vga_controller bus_master cap_list rom
       configuration: driver=mgag200 latency=0 maxlatency=32 mingnt=16
       resources: irq:16 memory:91000000-91ffffff memory:92808000-9280bfff memory:92000000-927fffff memory:c0000-dffff
  *-display UNCLAIMED
       description: VGA compatible controller
       product: Vega 10 XT [Radeon PRO WX 9100]
       vendor: Advanced Micro Devices, Inc. [AMD/ATI]
       physical id: 0
       bus info: pci@0000:3d:00.0
       version: 00
       width: 64 bits
       clock: 33MHz
       capabilities: pm pciexpress msi vga_controller cap_list
       configuration: latency=0
       resources: iomemory:38bf0-38bef iomemory:38bf0-38bef memory:38bfe0000000-38bfefffffff memory:38bff0000000-38bff01fffff ioport:6000(size=256) memory:ab000000-ab07ffff memory:ab0a0000-ab0bffff

I will make a patch to add the necessary packages.
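
The UNCLAIMED marker in the lshw output means that the card is visible on the PCI bus but no kernel driver has bound to it yet, which is why the packages are still needed. A minimal sketch of how that check could be scripted, assuming the plain-text layout shown above (the helper names are hypothetical):

```python
def parse_lshw_display(text):
    """Split `lshw -class display` text output into one dict per device."""
    devices = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("*-display"):
            # lshw appends UNCLAIMED when no driver has claimed the device.
            devices.append({"claimed": "UNCLAIMED" not in line})
        elif devices and ": " in line:
            key, value = line.split(": ", 1)
            devices[-1][key] = value
    return devices


def unclaimed_products(text):
    """Return the product names of display devices without a bound driver."""
    return [d.get("product", "?") for d in parse_lshw_display(text) if not d["claimed"]]
```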

Wed, Apr 24, 3:01 PM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis added a comment to T362993: Update cephosd100[1-5] with the most recent stable version of Ceph.

I have upgraded all of the mons successfully.

btullis@cephosd1001:~$ sudo ceph tell mon.* version
mon.cephosd1001: {
    "version": "17.2.7",
    "release": "quincy",
    "release_type": "stable"
}
mon.cephosd1002: {
    "version": "17.2.7",
    "release": "quincy",
    "release_type": "stable"
}
mon.cephosd1003: {
    "version": "17.2.7",
    "release": "quincy",
    "release_type": "stable"
}
mon.cephosd1004: {
    "version": "17.2.7",
    "release": "quincy",
    "release_type": "stable"
}
mon.cephosd1005: {
    "version": "17.2.7",
    "release": "quincy",
    "release_type": "stable"
}

I have also upgraded all of the crash services.
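
Checking five daemons by eye is fine, but the same comparison can be scripted. This is a rough sketch, assuming the output format shown above, where each daemon name is followed by a JSON object; the function names are made up for illustration.

```python
import json
import re


def parse_tell_versions(text):
    """Map daemon name -> version from `ceph tell <type>.* version` output."""
    versions = {}
    name, buf = None, []
    for line in text.splitlines():
        m = re.match(r"^(\S+): (\{.*)$", line)
        if m:
            # Start of a new daemon block, e.g. "mon.cephosd1001: {"
            name, buf = m.group(1), [m.group(2)]
        elif name is not None:
            buf.append(line)
        if name is not None and line.strip().endswith("}"):
            versions[name] = json.loads("\n".join(buf))["version"]
            name, buf = None, []
    return versions


def all_on(versions, expected):
    """True when at least one daemon reported and all report `expected`."""
    return bool(versions) and all(v == expected for v in versions.values())
```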

Wed, Apr 24, 1:54 PM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis added a comment to T362993: Update cephosd100[1-5] with the most recent stable version of Ceph.

The active manager has been upgraded.

btullis@cephosd1001:~$ sudo ceph tell mgr version
{
    "version": "17.2.7",
    "release": "quincy",
    "release_type": "stable"
}

Proceeding to restart the mon services in sequence.

Wed, Apr 24, 12:30 PM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis added a comment to T362993: Update cephosd100[1-5] with the most recent stable version of Ceph.

First standby mgr upgrade is fine.

btullis@cephosd1001:~$ sudo systemctl status ceph-mgr.target
● ceph-mgr.target - ceph target allowing to start/stop all ceph-mgr@.service instances at once
     Loaded: loaded (/lib/systemd/system/ceph-mgr.target; enabled; vendor preset: enabled)
     Active: active since Wed 2024-04-24 11:48:13 UTC; 8s ago
Wed, Apr 24, 12:10 PM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis added a comment to T362993: Update cephosd100[1-5] with the most recent stable version of Ceph.

Our Ceph cluster isn't managed by cephadm (yet) so we do not have ready access to the ceph orch upgrade command, which would make the upgrade simpler.

Wed, Apr 24, 12:01 PM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis added a comment to T362993: Update cephosd100[1-5] with the most recent stable version of Ceph.

I'm using a debdeploy spec to roll out the new packages to cephosd1001 first. I want to check if services are restarted as part of the upgrade.

btullis@cumin1002:~$ sudo debdeploy deploy -u 2024-04-24-ceph.yaml -Q cephosd1001.eqiad.wmnet
Rolling out ceph:
Library update, several services might need to be restarted
Wed, Apr 24, 11:35 AM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis added a comment to T362993: Update cephosd100[1-5] with the most recent stable version of Ceph.

I have pulled in the latest quincy packages for bullseye with:

btullis@apt1002:~$ sudo -i reprepro --component thirdparty/ceph-quincy --noskipold update bullseye-wikimedia
Calculating packages to get...
Getting packages...
Installing (and possibly deleting) packages...
Exporting indices...
Deleting files no longer referenced...
Wed, Apr 24, 11:23 AM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis moved T360439: Phase out cergen for Search Platform services from In Progress to Needs Review on the Data-Platform-SRE (2024.04.15 - 2024.05.05) board.
Wed, Apr 24, 11:09 AM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05), SRE
BTullis updated the task description for T360439: Phase out cergen for Search Platform services.
Wed, Apr 24, 10:38 AM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05), SRE
BTullis updated the task description for T360439: Phase out cergen for Search Platform services.
Wed, Apr 24, 10:37 AM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05), SRE
BTullis updated the task description for T360439: Phase out cergen for Search Platform services.
Wed, Apr 24, 10:32 AM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05), SRE
BTullis updated the task description for T360439: Phase out cergen for Search Platform services.
Wed, Apr 24, 10:23 AM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05), SRE
BTullis updated the task description for T360439: Phase out cergen for Search Platform services.
Wed, Apr 24, 10:20 AM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05), SRE
BTullis updated the task description for T360439: Phase out cergen for Search Platform services.
Wed, Apr 24, 9:45 AM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05), SRE
BTullis updated the task description for T360439: Phase out cergen for Search Platform services.
Wed, Apr 24, 9:39 AM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05), SRE
BTullis updated the task description for T327259: Support PersistentVolumeClaim objects on dse-k8s cluster.
Wed, Apr 24, 9:30 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis triaged T362993: Update cephosd100[1-5] with the most recent stable version of Ceph as High priority.
Wed, Apr 24, 8:56 AM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis claimed T362993: Update cephosd100[1-5] with the most recent stable version of Ceph.
Wed, Apr 24, 8:56 AM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis edited projects for T362993: Update cephosd100[1-5] with the most recent stable version of Ceph, added: Data-Platform-SRE (2024.04.15 - 2024.05.05); removed Data-Platform-SRE.
Wed, Apr 24, 8:55 AM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis claimed T362181: Encrypt Airflow connections to AQS Cassandra.
Wed, Apr 24, 8:54 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, Data-Engineering, Data-Persistence, Cassandra

Tue, Apr 23

BTullis moved T336040: Bring stat1010 into service with GPU from stat1005 from Blocked / Waiting to In Progress on the Data-Platform-SRE (2024.04.15 - 2024.05.05) board.

We now have the cable, so we are planning to carry out the work at 13:30 UTC tomorrow. I will send out the comms for that today.

Tue, Apr 23, 3:02 PM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis updated subscribers of T360439: Phase out cergen for Search Platform services.

I have a whitespace-only change in the nginx configuration for tlsproxy here: https://gerrit.wikimedia.org/r/1023440
It looks safe to me, but since it touches all the maps servers and every elasticsearch::cirrus server, I think that I had better get a review from @hnowlan and either @bking or @RKemper.

Tue, Apr 23, 3:02 PM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05), SRE
Dzahn awarded T358268: Update maxmind download to pull databases from new url a Like token.
Tue, Apr 23, 2:59 PM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis added a comment to T360439: Phase out cergen for Search Platform services.

I am starting by looking at the relforge cluster. I see that the certificates are served by nginx and they are still using the puppet CA based certificates.

btullis@relforge1003:/etc/nginx$ openssl x509 -in /etc/ssl/localcerts/relforge.svc.eqiad.wmnet.chained.crt -noout -text
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 7899 (0x1edb)
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN = Puppet CA: palladium.eqiad.wmnet
        Validity
            Not Before: Mar 18 02:55:32 2021 GMT
            Not After : Mar 18 02:55:32 2026 GMT
        Subject: CN = relforge.svc.eqiad.wmnet

I'll check to see if there is any code ready to deploy cfssl based certificates for nginx.
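
Since the same check will be needed on the other Search Platform hosts, here is a minimal sketch that pulls the issuer and expiry out of the `openssl x509 -noout -text` output shown above and flags certificates still issued by the Puppet CA. The helper names are hypothetical, and the date parsing assumes the English month abbreviations that openssl prints.

```python
import re
from datetime import datetime


def cert_summary(text):
    """Extract the issuer and Not After date from openssl's text output."""
    issuer = re.search(r"Issuer: (.+)", text)
    not_after = re.search(r"Not After : (.+)", text)
    return {
        "issuer": issuer.group(1).strip() if issuer else None,
        "not_after": (
            datetime.strptime(not_after.group(1).strip(), "%b %d %H:%M:%S %Y GMT")
            if not_after
            else None
        ),
    }


def issued_by_puppet_ca(summary):
    """True when the issuer string names a Puppet CA."""
    return "Puppet CA" in (summary["issuer"] or "")
```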

Tue, Apr 23, 12:26 PM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05), SRE
BTullis claimed T360439: Phase out cergen for Search Platform services.
Tue, Apr 23, 11:47 AM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05), SRE
BTullis moved T358518: Deploy streaming updater for 100% of writes to cloudelastic from Backlog to Done on the Data-Platform-SRE (2024.04.15 - 2024.05.05) board.
Tue, Apr 23, 11:44 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Discovery-Search (Current work)
BTullis closed T328473: hdfs-tools: Stop using git-fat as Resolved.

I did a successful deploy after having merged the patch to remove vestiges of git-fat integration, so I believe that we can close this ticket now.

btullis@deploy1002:/srv/deployment/analytics/hdfs-tools/deploy$ scap deploy
10:22:34 Started deploy [analytics/hdfs-tools/deploy@3618aab]
10:22:34 Deploying Rev: HEAD = 3618aab73c835f56a71f692c59b09dc1f973d094
10:22:34 Started deploy [analytics/hdfs-tools/deploy@3618aab]: (no justification provided)
10:22:34 
== DEFAULT ==
:* an-test-coord1001.eqiad.wmnet
:* an-launcher1002.eqiad.wmnet
:* stat1004.eqiad.wmnet
:* stat1006.eqiad.wmnet
:* stat1011.eqiad.wmnet
:* an-web1001.eqiad.wmnet
:* stat1007.eqiad.wmnet
:* an-coord1003.eqiad.wmnet
:* an-coord1004.eqiad.wmnet
:* clouddumps1001.wikimedia.org
:* stat1005.eqiad.wmnet
:* an-test-client1002.eqiad.wmnet
:* stat1009.eqiad.wmnet
:* stat1010.eqiad.wmnet
:* stat1008.eqiad.wmnet
10:22:42 analytics/hdfs-tools/deploy: fetch stage(s): 100% (in-flight: 0; ok: 15; fail: 0; left: 0) |
10:22:43 analytics/hdfs-tools/deploy: config_deploy stage(s): 100% (in-flight: 0; ok: 15; fail: 0; left: 0) |
10:22:44 analytics/hdfs-tools/deploy: promote stage(s): 100% (in-flight: 0; ok: 15; fail: 0; left: 0) |
10:22:44 default deploy successful
10:22:44 
== DEFAULT ==
:* an-test-coord1001.eqiad.wmnet
:* an-launcher1002.eqiad.wmnet
:* stat1004.eqiad.wmnet
:* stat1006.eqiad.wmnet
:* stat1011.eqiad.wmnet
:* an-web1001.eqiad.wmnet
:* stat1007.eqiad.wmnet
:* an-coord1003.eqiad.wmnet
:* an-coord1004.eqiad.wmnet
:* clouddumps1001.wikimedia.org
:* stat1005.eqiad.wmnet
:* an-test-client1002.eqiad.wmnet
:* stat1009.eqiad.wmnet
:* stat1010.eqiad.wmnet
:* stat1008.eqiad.wmnet
10:22:45 analytics/hdfs-tools/deploy: finalize stage(s): 100% (in-flight: 0; ok: 15; fail: 0; left: 0) |
10:22:45 default deploy successful
10:22:45 Finished deploy [analytics/hdfs-tools/deploy@3618aab]: (no justification provided) (duration: 00m 11s)
10:22:45 Finished deploy [analytics/hdfs-tools/deploy@3618aab] (duration: 00m 10s)
Tue, Apr 23, 10:26 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, Scap
BTullis closed T328473: hdfs-tools: Stop using git-fat, a subtask of T279509: git-fat replacement/removal, as Resolved.
Tue, Apr 23, 10:24 AM · git-lfs, Release-Engineering-Team (Now this 🫠), serviceops-radar, Scap, Python3-Porting
BTullis added a comment to T328473: hdfs-tools: Stop using git-fat.

Upon investigation it was discovered that hdfs-tools no longer has any requirement for large file support, so we have merged this patch, which removes the git-fat configuration from the repository.

Tue, Apr 23, 10:11 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, Scap
BTullis closed T358268: Update maxmind download to pull databases from new url as Resolved.

So now I think there was simply no update to the DBs. Perhaps the update command doesn't always overwrite the files with the latest version, but instead compares checksums and skips the download entirely when nothing has changed?

TLDR: The assumption that the timestamp must always change or the last pull failed might be wrong.
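
To illustrate the hypothesis, here is a minimal sketch of a checksum-based skip. It is not geoipupdate's actual code, and the function names and the use of MD5 are assumptions made for the example. The point is that when the local hash matches the one advertised upstream, nothing is written, so the file's timestamp stays where it was.

```python
import hashlib
import os


def file_md5(path):
    """Hex MD5 digest of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def needs_download(local_path, remote_md5):
    """Download only when the file is missing or its hash differs upstream."""
    if not os.path.exists(local_path):
        return True
    return file_md5(local_path) != remote_md5
```

Under this model an unchanged mtime after a successful run is expected behaviour, so a staleness alert would need a more generous threshold than one day.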

Tue, Apr 23, 10:09 AM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)

Fri, Apr 19

BTullis updated the task description for T362788: Migrate Airflow to the dse-k8s cluster.
Fri, Apr 19, 5:05 PM · Data-Platform-SRE, Epic
BTullis created T363003: Replicate airflow user group structure in LDAP.
Fri, Apr 19, 5:04 PM · Data-Platform-SRE
BTullis updated the task description for T362788: Migrate Airflow to the dse-k8s cluster.
Fri, Apr 19, 4:57 PM · Data-Platform-SRE, Epic
BTullis updated the task description for T362788: Migrate Airflow to the dse-k8s cluster.
Fri, Apr 19, 4:56 PM · Data-Platform-SRE, Epic
BTullis created T363001: Create a helm chart for airflow that is appropriate to our needs.
Fri, Apr 19, 4:55 PM · Data-Platform-SRE
BTullis updated subscribers of T362993: Update cephosd100[1-5] with the most recent stable version of Ceph.

For reference, the latest point release for Quincy is 17.2.7, whereas our apt server currently carries 17.2.5. We could sync these packages to our apt server at any time.

btullis@apt1002:~$ sudo -i reprepro --component thirdparty/ceph-quincy checkupdate bullseye-wikimedia
Calculating packages to get...
Updates needed for 'bullseye-wikimedia|thirdparty/ceph-quincy|amd64':
'ceph': '17.2.5-1~bpo11+1' will be upgraded to '17.2.7-1~bpo11+1' (from 'thirdparty/ceph-quincy'):
<snip snip>

The reef packages, which have been added recently by @MatthewVernon as part of T279621: Set up Misc Object Storage Service (moss), are at version 18.2.2, which is the latest point release.

btullis@apt1002:~$ sudo -i reprepro -C thirdparty/ceph-reef list bookworm-wikimedia ceph
bookworm-wikimedia|thirdparty/ceph-reef|amd64: ceph 18.2.2-1~bpo12+1
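
A small sketch for turning the checkupdate output above into a list of pending upgrades, assuming the "will be upgraded to" line format shown in this paste (the function name is hypothetical):

```python
import re

# Matches lines such as:
# 'ceph': '17.2.5-1~bpo11+1' will be upgraded to '17.2.7-1~bpo11+1' (from ...):
UPGRADE_RE = re.compile(
    r"^'(?P<pkg>[^']+)': '(?P<old>[^']+)' will be upgraded to '(?P<new>[^']+)'"
)


def pending_upgrades(text):
    """Map package name -> (current version, candidate version)."""
    upgrades = {}
    for line in text.splitlines():
        m = UPGRADE_RE.match(line.strip())
        if m:
            upgrades[m.group("pkg")] = (m.group("old"), m.group("new"))
    return upgrades
```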
Fri, Apr 19, 4:51 PM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis updated the task description for T362788: Migrate Airflow to the dse-k8s cluster.
Fri, Apr 19, 4:40 PM · Data-Platform-SRE, Epic
BTullis created T363000: Create an airflow container image using blubber/kokkuri.
Fri, Apr 19, 4:40 PM · Data-Platform-SRE
BTullis updated the task description for T362788: Migrate Airflow to the dse-k8s cluster.
Fri, Apr 19, 4:29 PM · Data-Platform-SRE, Epic
BTullis created T362999: Decide on which postgresql operator to use.
Fri, Apr 19, 4:28 PM · Data-Platform-SRE
BTullis updated the task description for T362788: Migrate Airflow to the dse-k8s cluster.
Fri, Apr 19, 4:00 PM · Data-Platform-SRE, Epic
BTullis added a subtask for T362788: Migrate Airflow to the dse-k8s cluster: T362993: Update cephosd100[1-5] with the most recent stable version of Ceph.
Fri, Apr 19, 3:59 PM · Data-Platform-SRE, Epic
BTullis added a parent task for T362993: Update cephosd100[1-5] with the most recent stable version of Ceph: T362788: Migrate Airflow to the dse-k8s cluster.
Fri, Apr 19, 3:59 PM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis updated the task description for T362993: Update cephosd100[1-5] with the most recent stable version of Ceph.
Fri, Apr 19, 3:59 PM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis created T362993: Update cephosd100[1-5] with the most recent stable version of Ceph.
Fri, Apr 19, 3:59 PM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis added a comment to T358612: Investigate replacing Archiva with Gitlab repositories.

The dependence on Archiva for git-fat has been almost completely removed now, as far as I am aware. There is one patch relating to blazegraph/wdqs that is still to be merged, then we will have completed that part of it.

Fri, Apr 19, 3:33 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Java-Scala-Standardization, Security, collaboration-services, Release-Engineering-Team
BTullis raised the priority of T316876: wdqs: replace git-fat with git-lfs from Low to High.

I don't have the necessary rights in gerrit to +2 the patch, but I should probably get them. Until then, perhaps @bking, @RKemper, or another member of the wikidata-deploy group can +2 it.

Fri, Apr 19, 3:32 PM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05), git-lfs, Release-Engineering-Team (Priority Backlog 📥), Wikidata, Wikidata-Query-Service, Scap
BTullis triaged T328473: hdfs-tools: Stop using git-fat as High priority.
Fri, Apr 19, 3:26 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, Scap
BTullis added a comment to T354936: Review the use of scap + git-fat for Data Platform Engineering use cases.

Ah, we already had a ticket for converting hdfs-tools but it was linked from the parent of this ticket: T328473: hdfs-tools: Stop using git-fat

Fri, Apr 19, 3:25 PM · Data-Platform-SRE
BTullis claimed T328473: hdfs-tools: Stop using git-fat.
Fri, Apr 19, 3:24 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, Scap
BTullis moved T328473: hdfs-tools: Stop using git-fat from Backlog to In Progress on the Data-Platform-SRE (2024.04.15 - 2024.05.05) board.
Fri, Apr 19, 3:22 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, Scap
BTullis moved T316876: wdqs: replace git-fat with git-lfs from Backlog to In Progress on the Data-Platform-SRE (2024.04.15 - 2024.05.05) board.
Fri, Apr 19, 3:07 PM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05), git-lfs, Release-Engineering-Team (Priority Backlog 📥), Wikidata, Wikidata-Query-Service, Scap
BTullis placed T310293: HDFS Namenode fail-back failure up for grabs.
Fri, Apr 19, 1:59 PM · Data-Platform-SRE
BTullis added a comment to T358268: Update maxmind download to pull databases from new url.

Interestingly, the files are also out of date on puppetserver1001, which is running version 4 of geoipupdate from bookworm.

btullis@puppetserver1001:~$ ls -lrt /srv/puppet_fileserver/volatile/GeoIP|tail -n 4
-rw-r--r-- 1 root root 109944609 Apr 17 03:30 GeoIP2-City.mmdb
-rw-r--r-- 1 root root  11613299 Apr 17 03:30 GeoIP2-Connection-Type.mmdb
-rw-r--r-- 1 root root   6426222 Apr 17 03:30 GeoIP2-Country.mmdb
-rw-r--r-- 1 root root  14591352 Apr 17 03:30 GeoIP2-ISP.mmdb
btullis@puppetserver1001:~$ apt-cache policy geoipupdate
geoipupdate:
  Installed: 4.10.0-1
  Candidate: 4.10.0-1
  Version table:
 *** 4.10.0-1 500
        500 http://mirrors.wikimedia.org/debian bookworm/contrib amd64 Packages
        100 /var/lib/dpkg/status

The service that is fired from the daily timer says that it's running correctly, so we're not receiving any errors from systemd.

btullis@puppetserver1001:~$ systemctl status geoip_update_main.service 
○ geoip_update_main.service - download geoip databases from MaxMind
     Loaded: loaded (/lib/systemd/system/geoip_update_main.service; static)
     Active: inactive (dead) since Fri 2024-04-19 03:30:04 UTC; 8h ago
TriggeredBy: ● geoip_update_main.timer
       Docs: https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
    Process: 2779247 ExecStart=/usr/bin/geoipupdate -f /etc/GeoIP.conf -d /srv/puppet_fileserver/volatile/GeoIP (code=exited, status=0/SUCCESS)
   Main PID: 2779247 (code=exited, status=0/SUCCESS)
        CPU: 384ms
Fri, Apr 19, 11:37 AM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis closed T349397: Migrate the matomo host to bookworm as Resolved.
Fri, Apr 19, 11:30 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, Data-Engineering
BTullis closed T349397: Migrate the matomo host to bookworm, a subtask of T288804: Upgrade the Data Engineering infrastructure to Debian Bullseye, as Resolved.
Fri, Apr 19, 11:28 AM · Data-Platform-SRE, Epic
BTullis updated the task description for T288804: Upgrade the Data Engineering infrastructure to Debian Bullseye.
Fri, Apr 19, 11:12 AM · Data-Platform-SRE, Epic
BTullis added a comment to T349397: Migrate the matomo host to bookworm.

I have updated the configuration on db1208 so that it replicates from matomo1003 instead of matomo1002.

Fri, Apr 19, 10:33 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, Data-Engineering
BTullis added a comment to T358268: Update maxmind download to pull databases from new url.

The maxmind databases are still not updated. They still have the date of April 17.

btullis@puppetmaster1001:~$ sudo ls -lrt /var/lib/puppet/volatile/GeoIP|tail -n 4
-rw-r--r-- 1 root root 109944609 Apr 17 03:30 GeoIP2-City.mmdb
-rw-r--r-- 1 root root  11613299 Apr 17 03:30 GeoIP2-Connection-Type.mmdb
-rw-r--r-- 1 root root   6426222 Apr 17 03:30 GeoIP2-Country.mmdb
-rw-r--r-- 1 root root  14591352 Apr 17 03:30 GeoIP2-ISP.mmdb
Fri, Apr 19, 9:08 AM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis closed T351552: Upgrade matomo (piwik.wikimedia.org) to latest stable version as Resolved.
Fri, Apr 19, 9:07 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review
BTullis closed T351552: Upgrade matomo (piwik.wikimedia.org) to latest stable version, a subtask of T349397: Migrate the matomo host to bookworm, as Resolved.
Fri, Apr 19, 9:06 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review, Data-Engineering

Thu, Apr 18

BTullis added a comment to T351552: Upgrade matomo (piwik.wikimedia.org) to latest stable version.

The marketing campaigns reporting plugin is active and is the latest 4.x version available.

image.png (103×1 px, 16 KB)

Thu, Apr 18, 5:06 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review
BTullis added a comment to T351552: Upgrade matomo (piwik.wikimedia.org) to latest stable version.

The new version is now live.

image.png (630×678 px, 58 KB)

I see 200 responses from beacon POST requests from this command:

btullis@matomo1003:~$ tail -f /var/log/apache2/other_vhosts_access.log
Thu, Apr 18, 5:04 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review
BTullis added a comment to T361688: Upgrade datahub to v0.12.1.

We found that the GMS pod wasn't starting properly on production, so it looks like it's unrelated to BOOTSTRAP_SYSTEM_UPDATE_WAIT_FOR_SYSTEM_UPDATE.

Thu, Apr 18, 2:05 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review
BTullis added a comment to T362871: hw troubleshooting: disk failure for an-worker1087.

The filesystem on the drive is unmounted and commented out in /etc/fstab, so the disk is out of service and can be hot-swapped at any time.

Thu, Apr 18, 11:36 AM · SRE, ops-eqiad, DC-Ops
BTullis moved T362860: Apparent disk failure on an-worker1087 from In Progress to Blocked / Waiting on the Data-Platform-SRE (2024.04.15 - 2024.05.05) board.

I've created T362871 to track the hardware replacement. I'll mark this ticket as blocked until the new drive is in place, then I'll use this ticket to track putting the disk back in service.

Thu, Apr 18, 11:33 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis added a subtask for T362860: Apparent disk failure on an-worker1087: T362871: hw troubleshooting: disk failure for an-worker1087.
Thu, Apr 18, 11:30 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis added a parent task for T362871: hw troubleshooting: disk failure for an-worker1087: T362860: Apparent disk failure on an-worker1087.
Thu, Apr 18, 11:30 AM · SRE, ops-eqiad, DC-Ops
BTullis created T362871: hw troubleshooting: disk failure for an-worker1087.
Thu, Apr 18, 11:30 AM · SRE, ops-eqiad, DC-Ops
BTullis added a comment to T358268: Update maxmind download to pull databases from new url.

I tried running the commands manually and downloading to a temporary directory and they seem fine.

image.png (827×1 px, 171 KB)

Thu, Apr 18, 10:49 AM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis reopened T358268: Update maxmind download to pull databases from new url as "Open".

I'm reopening this ticket because the files did not download correctly today.

image.png (152×734 px, 33 KB)

Thu, Apr 18, 10:44 AM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis moved T358268: Update maxmind download to pull databases from new url from Backlog to In Progress on the Data-Platform-SRE (2024.04.15 - 2024.05.05) board.
Thu, Apr 18, 10:44 AM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis added a comment to T362860: Apparent disk failure on an-worker1087.

It's looking increasingly like a hardware issue.

btullis@an-worker1087:~$ sudo fsck /dev/sdh1
fsck from util-linux 2.36.1
e2fsck 1.46.2 (28-Feb-2021)
/dev/sdh1 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Error reading block 660612973 (Input/output error) while reading directory block.  Ignore error<y>?

This is from dmesg -T.

[Thu Apr 18 10:10:22 2024] sd 0:2:7:0: [sdh] tag#304 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[Thu Apr 18 10:10:22 2024] sd 0:2:7:0: [sdh] tag#304 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[Thu Apr 18 10:10:22 2024] sd 0:2:7:0: [sdh] tag#304 Sense Key : Medium Error [current] 
[Thu Apr 18 10:10:22 2024] sd 0:2:7:0: [sdh] tag#304 Add. Sense: No additional sense information
[Thu Apr 18 10:10:22 2024] sd 0:2:7:0: [sdh] tag#304 CDB: Read(16) 88 00 00 00 00 01 3b 01 43 68 00 00 00 08 00 00
[Thu Apr 18 10:10:22 2024] blk_update_request: I/O error, dev sdh, sector 5284905832 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[Thu Apr 18 10:10:22 2024] sd 0:2:7:0: [sdh] tag#325 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[Thu Apr 18 10:10:22 2024] sd 0:2:7:0: [sdh] tag#326 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[Thu Apr 18 10:10:22 2024] sd 0:2:7:0: [sdh] tag#194 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[Thu Apr 18 10:10:22 2024] sd 0:2:7:0: [sdh] tag#262 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[Thu Apr 18 10:10:22 2024] sd 0:2:7:0: [sdh] tag#262 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[Thu Apr 18 10:10:22 2024] sd 0:2:7:0: [sdh] tag#263 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[Thu Apr 18 10:10:22 2024] sd 0:2:7:0: [sdh] tag#263 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[Thu Apr 18 10:10:22 2024] sd 0:2:7:0: [sdh] tag#263 Sense Key : Medium Error [current] 
[Thu Apr 18 10:10:22 2024] sd 0:2:7:0: [sdh] tag#263 Add. Sense: No additional sense information
[Thu Apr 18 10:10:22 2024] sd 0:2:7:0: [sdh] tag#263 CDB: Read(16) 88 00 00 00 00 01 3b 01 43 68 00 00 00 08 00 00
[Thu Apr 18 10:10:22 2024] blk_update_request: I/O error, dev sdh, sector 5284905832 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[Thu Apr 18 10:10:22 2024] Buffer I/O error on dev sdh1, logical block 660612973, async page read
[Thu Apr 18 10:10:22 2024] sd 0:2:7:0: [sdh] tag#844 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[Thu Apr 18 10:10:22 2024] sd 0:2:7:0: [sdh] tag#262 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[Thu Apr 18 10:10:22 2024] sd 0:2:7:0: [sdh] tag#263 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[Thu Apr 18 10:10:22 2024] sd 0:2:7:0: [sdh] tag#267 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[Thu Apr 18 10:10:22 2024] sd 0:2:7:0: [sdh] tag#327 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[Thu Apr 18 10:10:22 2024] sd 0:2:7:0: [sdh] tag#268 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
[Thu Apr 18 10:10:22 2024] sd 0:2:7:0: [sdh] tag#268 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[Thu Apr 18 10:10:22 2024] sd 0:2:7:0: [sdh] tag#268 Sense Key : Medium Error [current] 
[Thu Apr 18 10:10:22 2024] sd 0:2:7:0: [sdh] tag#268 Add. Sense: No additional sense information
[Thu Apr 18 10:10:22 2024] sd 0:2:7:0: [sdh] tag#268 CDB: Read(16) 88 00 00 00 00 01 3b 01 43 68 00 00 00 08 00 00
[Thu Apr 18 10:10:22 2024] blk_update_request: I/O error, dev sdh, sector 5284905832 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[Thu Apr 18 10:10:22 2024] Buffer I/O error on dev sdh1, logical block 660612973, async page read

I will file a hardware troubleshooting ticket for DC Ops.
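
As a cross-check that fsck and the kernel are complaining about the same spot on the disk, the numbers above can be tied together. This is a minimal sketch, assuming 512-byte logical sectors, a 4 KiB filesystem block size and a partition starting at sector 2048. None of these three values appears in the output above, but they are consistent with the figures reported.

```python
def cdb_read16_lba(cdb_hex):
    """Decode the big-endian 8-byte LBA (bytes 2..9) of a READ(16) CDB."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if cdb[0] != 0x88:
        raise ValueError("not a READ(16) CDB")
    return int.from_bytes(cdb[2:10], "big")


def sector_to_fs_block(sector, part_start=2048, sector_size=512, block_size=4096):
    """Translate a whole-disk sector to a filesystem block in the partition.

    part_start, sector_size and block_size are assumed defaults, not values
    taken from the host.
    """
    return (sector - part_start) * sector_size // block_size
```

With those assumptions, the LBA in the CDB decodes to the sector 5284905832 that blk_update_request reports, and that sector maps to logical block 660612973, which is the block that both fsck and the buffer I/O error name.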

Thu, Apr 18, 10:41 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis triaged T362860: Apparent disk failure on an-worker1087 as Medium priority.
Thu, Apr 18, 9:45 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis moved T362860: Apparent disk failure on an-worker1087 from Backlog to In Progress on the Data-Platform-SRE (2024.04.15 - 2024.05.05) board.

This is the current output from lsblk:

btullis@an-worker1087:~$ lsblk
NAME                               MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                                  8:0    0 111.3G  0 disk 
├─sda1                               8:1    0   953M  0 part /boot
├─sda2                               8:2    0     1K  0 part 
└─sda5                               8:5    0 110.3G  0 part 
  ├─an--worker1087--vg-swap        254:0    0   9.3G  0 lvm  [SWAP]
  ├─an--worker1087--vg-root        254:1    0  55.9G  0 lvm  /
  └─an--worker1087--vg-journalnode 254:2    0    10G  0 lvm  /var/lib/hadoop/journal
sdb                                  8:16   0   3.6T  0 disk 
└─sdb1                               8:17   0   3.6T  0 part /var/lib/hadoop/data/b
sdc                                  8:32   0   3.6T  0 disk 
└─sdc1                               8:33   0   3.6T  0 part /var/lib/hadoop/data/c
sdd                                  8:48   0   3.6T  0 disk 
└─sdd1                               8:49   0   3.6T  0 part /var/lib/hadoop/data/d
sde                                  8:64   0   3.6T  0 disk 
└─sde1                               8:65   0   3.6T  0 part /var/lib/hadoop/data/e
sdf                                  8:80   0   3.6T  0 disk 
└─sdf1                               8:81   0   3.6T  0 part /var/lib/hadoop/data/f
sdg                                  8:96   0   3.6T  0 disk 
└─sdg1                               8:97   0   3.6T  0 part /var/lib/hadoop/data/h
sdh                                  8:112  0   3.6T  0 disk 
└─sdh1                               8:113  0   3.6T  0 part /var/lib/hadoop/data/g
sdi                                  8:128  0   3.6T  0 disk 
└─sdi1                               8:129  0   3.6T  0 part /var/lib/hadoop/data/j
sdj                                  8:144  0   3.6T  0 disk 
└─sdj1                               8:145  0   3.6T  0 part /var/lib/hadoop/data/k
sdk                                  8:160  0   3.6T  0 disk 
└─sdk1                               8:161  0   3.6T  0 part /var/lib/hadoop/data/i
sdl                                  8:176  0   3.6T  0 disk 
└─sdl1                               8:177  0   3.6T  0 part /var/lib/hadoop/data/m
sdm                                  8:192  0   3.6T  0 disk 
└─sdm1                               8:193  0   3.6T  0 part /var/lib/hadoop/data/l

We can see that /dev/sdh1 is currently mounted at /var/lib/hadoop/data/g. The drive letter assignments might be different on the next boot, so this mapping only holds for the current boot.
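
A minimal sketch of pulling that device-to-mount-point mapping out of pasted lsblk text, in case it helps when checking it again after a reboot. The function name `mountpoints` and the shortened `SAMPLE` text are illustrative only; the sample reuses three partitions from the output above with the column spacing collapsed.

```python
SAMPLE = """\
NAME   MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdg      8:96   0 3.6T  0 disk
└─sdg1   8:97   0 3.6T  0 part /var/lib/hadoop/data/h
sdh      8:112  0 3.6T  0 disk
└─sdh1   8:113  0 3.6T  0 part /var/lib/hadoop/data/g
sdk      8:160  0 3.6T  0 disk
└─sdk1   8:161  0 3.6T  0 part /var/lib/hadoop/data/i
"""


def mountpoints(lsblk_text):
    """Map device names to mount points for rows that have a MOUNTPOINT."""
    result = {}
    for line in lsblk_text.splitlines():
        fields = line.split()
        # Skip the header and rows without a seventh (MOUNTPOINT) column.
        if len(fields) < 7 or fields[0] == "NAME":
            continue
        name = fields[0].lstrip("├└─│")  # drop lsblk tree-drawing characters
        result[name] = fields[6]
    return result


mapping = mountpoints(SAMPLE)
print(mapping["sdh1"])  # /var/lib/hadoop/data/g
```

Because the device letters and directory letters do not line up (sdg1 holds data/h, sdh1 holds data/g), looking up the mount point explicitly is safer than inferring it from the device name.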

Thu, Apr 18, 9:45 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis created T362860: Apparent disk failure on an-worker1087.
Thu, Apr 18, 9:35 AM · Data-Platform-SRE (2024.04.15 - 2024.05.05)

Wed, Apr 17

BTullis updated the task description for T362788: Migrate Airflow to the dse-k8s cluster.
Wed, Apr 17, 5:15 PM · Data-Platform-SRE, Epic
BTullis updated the task description for T362788: Migrate Airflow to the dse-k8s cluster.
Wed, Apr 17, 5:11 PM · Data-Platform-SRE, Epic
BTullis moved T327259: Support PersistentVolumeClaim objects on dse-k8s cluster from Backlog to In Progress on the Data-Platform-SRE (2024.04.15 - 2024.05.05) board.
Wed, Apr 17, 5:08 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis claimed T327259: Support PersistentVolumeClaim objects on dse-k8s cluster.
Wed, Apr 17, 4:44 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis added a subtask for T362788: Migrate Airflow to the dse-k8s cluster: T327259: Support PersistentVolumeClaim objects on dse-k8s cluster.
Wed, Apr 17, 4:01 PM · Data-Platform-SRE, Epic
BTullis added a parent task for T327259: Support PersistentVolumeClaim objects on dse-k8s cluster: T362788: Migrate Airflow to the dse-k8s cluster.
Wed, Apr 17, 4:01 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05)
BTullis triaged T362788: Migrate Airflow to the dse-k8s cluster as High priority.
Wed, Apr 17, 4:01 PM · Data-Platform-SRE, Epic
BTullis moved T362788: Migrate Airflow to the dse-k8s cluster from Incoming to Epics on the Data-Platform-SRE board.
Wed, Apr 17, 4:01 PM · Data-Platform-SRE, Epic
BTullis created T362788: Migrate Airflow to the dse-k8s cluster.
Wed, Apr 17, 3:27 PM · Data-Platform-SRE, Epic
BTullis added a comment to T354936: Review the use of scap + git-fat for Data Platform Engineering use cases.

I have found one more small repository that uses git-fat: https://gerrit.wikimedia.org/r/admin/repos/analytics/hdfs-tools/deploy,general

Wed, Apr 17, 2:36 PM · Data-Platform-SRE
BTullis added a comment to T316876: wdqs: replace git-fat with git-lfs.

I think I'm right in saying that we don't need to complete the Archiva migration before switching to git-lfs.
That's right, isn't it, @dancy? We migrated analytics/refinery recently and that is still using Archiva, for now.

Wed, Apr 17, 2:35 PM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05), git-lfs, Release-Engineering-Team (Priority Backlog 📥), Wikidata, Wikidata-Query-Service, Scap
BTullis edited projects for T316876: wdqs: replace git-fat with git-lfs, added: Data-Platform-SRE (2024.04.15 - 2024.05.05); removed Data-Platform-SRE.
Wed, Apr 17, 2:24 PM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05), git-lfs, Release-Engineering-Team (Priority Backlog 📥), Wikidata, Wikidata-Query-Service, Scap
BTullis edited parent tasks for T316876: wdqs: replace git-fat with git-lfs, added: T354936: Review the use of scap + git-fat for Data Platform Engineering use cases; removed: T279509: git-fat replacement/removal.
Wed, Apr 17, 2:22 PM · Patch-For-Review, Data-Platform-SRE (2024.04.15 - 2024.05.05), git-lfs, Release-Engineering-Team (Priority Backlog 📥), Wikidata, Wikidata-Query-Service, Scap
BTullis removed a subtask for T279509: git-fat replacement/removal: T316876: wdqs: replace git-fat with git-lfs.
Wed, Apr 17, 2:22 PM · git-lfs, Release-Engineering-Team (Now this 🫠), serviceops-radar, Scap, Python3-Porting
BTullis added a subtask for T354936: Review the use of scap + git-fat for Data Platform Engineering use cases: T316876: wdqs: replace git-fat with git-lfs.
Wed, Apr 17, 2:22 PM · Data-Platform-SRE
BTullis closed T328472: analytics/refinery: Stop using git-fat as Resolved.

Thanks so much @dancy and @hashar and everyone else who has helped.
I believe that this is resolved. If I'm wrong about that, please feel free to reopen.

Wed, Apr 17, 2:21 PM · Patch-For-Review, git-lfs, Release-Engineering-Team (Now this 🫠), Data-Engineering, Data-Platform-SRE, Scap
BTullis closed T328472: analytics/refinery: Stop using git-fat, a subtask of T279509: git-fat replacement/removal, as Resolved.
Wed, Apr 17, 2:20 PM · git-lfs, Release-Engineering-Team (Now this 🫠), serviceops-radar, Scap, Python3-Porting
BTullis closed T328472: analytics/refinery: Stop using git-fat, a subtask of T354936: Review the use of scap + git-fat for Data Platform Engineering use cases, as Resolved.
Wed, Apr 17, 2:20 PM · Data-Platform-SRE
BTullis added a comment to T362678: Package request: install elixir and erlang-otp to the analytics clients.

Hi @awight - I'm happy to try to help here, but as @MoritzMuehlenhoff points out, trying to get packages from the Debian repositories that match your requirements may be quite tricky.
Some of our stat hosts (stat1004-8) are also still running buster, so their versions will be even further behind, although we're currently working to bring them up to date.

Wed, Apr 17, 1:52 PM · Data-Platform-SRE, Data-Engineering
BTullis moved T351552: Upgrade matomo (piwik.wikimedia.org) to latest stable version from In Progress to To Be Deployed on the Data-Platform-SRE (2024.04.15 - 2024.05.05) board.
Wed, Apr 17, 12:10 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Patch-For-Review