
herron (Keith Herron)
Site Reliability Engineer

User Details

User Since
May 30 2017, 5:25 PM (457 w, 5 d)
Availability
Available
IRC Nick
herron
LDAP User
Herron
MediaWiki User
Unknown

Recent Activity

Fri, Mar 6

herron updated the task description for T418163: Sloth: onboard existing SLOs to sloth manifests.
Fri, Mar 6, 10:00 PM · Patch-For-Review, SRE-SLO

Mon, Mar 2

herron added a comment to T416745: ThanosCompactHalted: add series - symbol table size exceeds.
root@titan2001:~# journalctl -u thanos-compact | grep halt | tail -n 1 | sed -nr 's/^.*\[(.*)\].*$/\1/p' | tr -s ' ' '\n' | awk -F '/' '{print $NF}' | xargs -I % thanos tools bucket --objstore.config-file /etc/thanos-compact/objstore.yaml mark --id=% --marker=no-compact-mark.json --details="compactor halted due to size"
ts=2026-03-02T23:01:17.563717752Z caller=factory.go:54 level=info msg="loading bucket configuration"
ts=2026-03-02T23:01:18.006507177Z caller=block.go:406 level=info msg="block has been marked for no compaction" block=01KJ0NM3F93EV8AVPN8F6WBPD9
ts=2026-03-02T23:01:18.006537409Z caller=tools_bucket.go:1134 level=info msg="marking done" marker=no-compact-mark.json IDs=01KJ0NM3F93EV8AVPN8F6WBPD9
ts=2026-03-02T23:01:18.006565428Z caller=main.go:174 level=info msg=exiting
ts=2026-03-02T23:01:18.03052689Z caller=factory.go:54 level=info msg="loading bucket configuration"
ts=2026-03-02T23:01:18.477043378Z caller=block.go:406 level=info msg="block has been marked for no compaction" block=01KJ0R8JBTZXJ37MB15DFX75W0
ts=2026-03-02T23:01:18.477073746Z caller=tools_bucket.go:1134 level=info msg="marking done" marker=no-compact-mark.json IDs=01KJ0R8JBTZXJ37MB15DFX75W0
ts=2026-03-02T23:01:18.477104135Z caller=main.go:174 level=info msg=exiting
ts=2026-03-02T23:01:18.502219064Z caller=factory.go:54 level=info msg="loading bucket configuration"
ts=2026-03-02T23:01:18.968827169Z caller=block.go:406 level=info msg="block has been marked for no compaction" block=01KJ4YGDFGVC3X10837J9Q2MMW
ts=2026-03-02T23:01:18.968856936Z caller=tools_bucket.go:1134 level=info msg="marking done" marker=no-compact-mark.json IDs=01KJ4YGDFGVC3X10837J9Q2MMW
ts=2026-03-02T23:01:18.968892937Z caller=main.go:174 level=info msg=exiting
root@titan2001:~# systemctl restart thanos-compact.service
Mon, Mar 2, 11:03 PM · SRE Observability (FY2025/2026-Q3)
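
The one-liner in the comment above takes the most recent "halting" log line, pulls out the bracketed list of block directories, keeps the last path component of each (the block ID), and marks every block with a no-compact marker. Below is a minimal Python sketch of the same extraction, assuming the log line has the shape quoted in T416745 (one bracketed, space-separated list of block paths). The names `block_ids_from_halt_line` and `mark_commands` are hypothetical; the sketch only builds the `thanos tools bucket ... mark` argument lists from the flags shown in the comment and does not execute anything.

```python
import re
import shlex

# Path copied from the comment above.
OBJSTORE_CONFIG = "/etc/thanos-compact/objstore.yaml"


def block_ids_from_halt_line(line):
    """Return block IDs from the last bracketed path list in a halt log line."""
    # Same greedy pattern as the sed expression: capture the text between
    # the last "[" and the final "]" on the line.
    match = re.search(r"^.*\[(.*)\].*$", line)
    if match is None:
        return []
    # Split on whitespace (tr -s ' ' '\n') and keep the last path
    # component of each entry (awk -F '/' '{print $NF}').
    return [path.rsplit("/", 1)[-1] for path in match.group(1).split()]


def mark_commands(block_ids, details="compactor halted due to size"):
    """Build one argv list per block; nothing is run here."""
    return [
        [
            "thanos", "tools", "bucket",
            "--objstore.config-file", OBJSTORE_CONFIG,
            "mark", f"--id={block_id}",
            "--marker=no-compact-mark.json",
            f"--details={details}",
        ]
        for block_id in block_ids
    ]


# Sample shaped like the halt message quoted in T416745.
sample = (
    'level=error msg="critical error detected; halting" err="compaction: '
    "group 0@11257394797428657513: compact blocks "
    "[/srv/thanos-compact/compact/0@11257394797428657513/01KFHBRFDPPK0JMZPQFKTT1Y40 "
    "/srv/thanos-compact/compact/0@11257394797428657513/01KFHQ5VK2MP972WE4E14EQXRD]: "
    '2 errors: add series: symbol table size exceeds 4294967295 bytes: 4982340497"'
)

ids = block_ids_from_halt_line(sample)
for argv in mark_commands(ids):
    # shlex.join quotes the --details value so the line can be pasted into a shell.
    print(shlex.join(argv))
```

Printing the commands before running them gives a chance to review which blocks would be excluded from compaction.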

Sat, Feb 28

herron added a comment to T417742: ThanosCompactHalted: pre compaction overlap check - overlaps found while gathering blocks..
Feb 26 22:44:53 titan2001 thanos-compact[2096946]: ts=2026-02-26T22:44:53.899137199Z caller=compact.go:559 level=error msg="critical error detected; halting" err="compaction: 2 errors: group 300000@2015487672410861213: pre compaction overlap check: overlaps found while gathering blocks. [mint: 1762732800000, maxt: 1762905600000, range: 48h0m0s, blocks: 2]: <ulid: 01KJ8C9BV9TDJRFJ4WXWJ1JYC5, mint: 1762387200000, maxt: 1762905600000, range: 144h0m0s>, <ulid: 01K9VVXTVR1PTF1DXTBQ6FZBTW, mint: 1762732800000, maxt: 1762905600000, range: 48h0m0s>\n[mint: 1763078400000, maxt: 1763424000000, range: 96h0m0s, blocks: 2]: <ulid: 01KJ8G05G55J7SZWNFQ0Y5HGS9, mint: 1763078400000, maxt: 1763596800000, range: 144h0m0s>, <ulid: 01KAQWEKESYTBPRPQDKD2RYQ2X, mint: 1763078400000, maxt: 1763424000000, range: 96h0m0s>\n[mint: 1763596800000, maxt: 1763942400000, range: 96h0m0s, blocks: 2]: <ulid: 01KJ2X3K5EXEAQAWSC01ZHZD3M, mint: 1763596800000, maxt: 1763942400000, range: 96h0m0s>, <ulid: 01KBSQZ0NAAD4ZXWDSH63VED58, mint: 1763596800000, maxt: 1763942400000, range: 96h0m0s>\n[mint: 1766016000001, maxt: 1767225600000, range: 335h59m59s, blocks: 2]: <ulid: 01KJ3W4RSQMGVXHQH4SBWN9HPZ, mint: 1766016000001, maxt: 1767225600000, range: 335h59m59s>, <ulid: 01KDZWJKBBN3ETNQADX5JCSMRB, mint: 1766016000001, maxt: 1767225600000, range: 335h59m59s>\n[mint: 1761868800000, maxt: 1762214400000, range: 96h0m0s, blocks: 2]: <ulid: 01KJ83M1QDFJ0EGFRVW5SH7S78, mint: 1761868800000, maxt: 1762387200000, range: 144h0m0s>, <ulid: 01K9JFKZ5Y70SNWSHCVQXHZHHF, mint: 1761868800000, maxt: 1762214400000, range: 96h0m0s>\n[mint: 1763424000000, maxt: 1763596800000, range: 48h0m0s, blocks: 2]: <ulid: 01KJ8G05G55J7SZWNFQ0Y5HGS9, mint: 1763078400000, maxt: 1763596800000, range: 144h0m0s>, <ulid: 01KAKDHWR3R1XNHR05MV3NRF8C, mint: 1763424000000, maxt: 1763596800000, range: 48h0m0s>\n[mint: 1764288000000, maxt: 1764633600000, range: 96h0m0s, blocks: 2]: <ulid: 01KJ8SCY4GHZAVV4T995RZS2DG, mint: 1764115200000, maxt: 
1764806400000, range: 192h0m0s>, <ulid: 01KBT7GEGR283SQX4Z30ZSE99S, mint: 1764288000000, maxt: 1764633600000, range: 96h0m0s>\n[mint: 1764806400000, maxt: 1766016000000, range: 336h0m0s, blocks: 2]: <ulid: 01KDCJ3M6BM082MEQJYVPV2CE4, mint: 1764806400000, maxt: 1766016000000, range: 336h0m0s>, <ulid: 01KJ8X474PBTJ8GNRD8NFBVYPH, mint: 1764806400000, maxt: 1766016000000, range: 336h0m0s>\n[mint: 1760486400000, maxt: 1761004800000, range: 144h0m0s, blocks: 2]: <ulid: 01K8E6Y8R2R6RSRK57XS5256K0, mint: 1760486400000, maxt: 1761004800000, range: 144h0m0s>, <ulid: 01KJ780CRES6SNXJCZ0YD3ESR7, mint: 1760486400000, maxt: 1761177600000, range: 192h0m0s>\n[mint: 1761523200000, maxt: 1761696000000, range: 48h0m0s, blocks: 2]: <ulid: 01KJ7YQ7J4Q19D3QFSK80ZSA63, mint: 1761177600000, maxt: 1761696000000, range: 144h0m0s>, <ulid: 01K8QWZV1HKE3ED6N228X7ZP6V, mint: 1761523200000, maxt: 1761696000000, range: 48h0m0s>\n[mint: 1762214400000, maxt: 1762387200000, range: 48h0m0s, blocks: 2]: <ulid: 01KJ83M1QDFJ0EGFRVW5SH7S78, mint: 1761868800000, maxt: 1762387200000, range: 144h0m0s>, <ulid: 01K9DHKE61Y5AE6V1Z1BCPNSTJ, mint: 1762214400000, maxt: 1762387200000, range: 48h0m0s>\n[mint: 1763942400000, maxt: 1764115200000, range: 48h0m0s, blocks: 2]: <ulid: 01KB01HHAX2710WXDVGK53RW61, mint: 1763942400000, maxt: 1764115200000, range: 48h0m0s>, <ulid: 01KJ1QBCD7PR45AH6RY8QYHPVG, mint: 1763942400000, maxt: 1764115200000, range: 48h0m0s>\n[mint: 1764633600000, maxt: 1764806400000, range: 48h0m0s, blocks: 2]: <ulid: 01KJ8SCY4GHZAVV4T995RZS2DG, mint: 1764115200000, maxt: 1764806400000, range: 192h0m0s>, <ulid: 01KBNW9E3DQYG4Y189Z6KS2PKV, mint: 1764633600000, maxt: 1764806400000, range: 48h0m0s>\n[mint: 1758758400000, maxt: 1759276800000, range: 144h0m0s, blocks: 2]: <ulid: 01KJ23YVV02BAQJNFAP1SWFQP1, mint: 1758758400000, maxt: 1759276800000, range: 144h0m0s>, <ulid: 01KHRNCPS3MBKWBTYM8A8F9JNM, mint: 1758758400000, maxt: 1759449600000, range: 192h0m0s>\n[mint: 1759449600000, maxt: 1759795200000, 
range: 96h0m0s, blocks: 2]: <ulid: 01K7C3HK0RM901JYJRNRH4VW94, mint: 1759449600000, maxt: 1759795200000, range: 96h0m0s>, <ulid: 01KJ6E26QNK22BNX1HZ0G5QYF2, mint: 1759449600000, maxt: 1759968000000, range: 144h0m0s>\n[mint: 1759968000000, maxt: 1760313600000, range: 96h0m0s, blocks: 2]: <ulid: 01KJ2P5XHHN1VPFC7T0EJEC5HJ, mint: 1759968000000, maxt: 1760313600000, range: 96h0m0s>, <ulid: 01KHV02F6A4WB8NQ4CK09VWV7V, mint: 1759968000000, maxt: 1760486400000, range: 144h0m0s>\n[mint: 1761177600000, maxt: 1761523200000, range: 96h0m0s, blocks: 2]: <ulid: 01KJ7YQ7J4Q19D3QFSK80ZSA63, mint: 1761177600000, maxt: 1761696000000, range: 144h0m0s>, <ulid: 01K9J90PPJA8KXTD6GTRAZKHZF, mint: 1761177600000, maxt: 1761523200000, range: 96h0m0s>\n[mint: 1762387200000, maxt: 1762732800000, range: 96h0m0s, blocks: 2]: <ulid: 01KAQKRTHG7743PA4XAFZ8DYBK, mint: 1762387200000, maxt: 1762732800000, range: 96h0m0s>, <ulid: 01KJ8C9BV9TDJRFJ4WXWJ1JYC5, mint: 1762387200000, maxt: 1762905600000, range: 144h0m0s>\n[mint: 1762905600000, maxt: 1763078400000, range: 48h0m0s, blocks: 2]: <ulid: 01KJ15PF4N8QYCYR1XC4S7BRTX, mint: 1762905600000, maxt: 1763078400000, range: 48h0m0s>, <ulid: 01KA1A4XVD699TPV1MBD9CVD9N, mint: 1762905600000, maxt: 1763078400000, range: 48h0m0s>\n[mint: 1764115200000, maxt: 1764288000000, range: 48h0m0s, blocks: 2]: <ulid: 01KB4ZZ0GRDE9KCFKASYWVAH47, mint: 1764115200000, maxt: 1764288000000, range: 48h0m0s>, <ulid: 01KJ8SCY4GHZAVV4T995RZS2DG, mint: 1764115200000, maxt: 1764806400000, range: 192h0m0s>\n[mint: 1759795200000, maxt: 1759968000000, range: 48h0m0s, blocks: 2]: <ulid: 01KJ6E26QNK22BNX1HZ0G5QYF2, mint: 1759449600000, maxt: 1759968000000, range: 144h0m0s>, <ulid: 01K77KMHYKHT78NMQ9K2Z9PXAJ, mint: 1759795200000, maxt: 1759968000000, range: 48h0m0s>\n[mint: 1761004800000, maxt: 1761177600000, range: 48h0m0s, blocks: 2]: <ulid: 01KJ780CRES6SNXJCZ0YD3ESR7, mint: 1760486400000, maxt: 1761177600000, range: 192h0m0s>, <ulid: 01K89M9V91VBZYN2NX98T2PNDM, mint: 1761004800000, 
maxt: 1761177600000, range: 48h0m0s>\n[mint: 1761696000000, maxt: 1761868800000, range: 48h0m0s, blocks: 2]: <ulid: 01K8X1RE8GR0ABPA8878KDVWZE, mint: 1761696000000, maxt: 1761868800000, range: 48h0m0s>, <ulid: 01KJ0E56AHXF02N15VNPSCKNP7, mint: 1761696000000, maxt: 1761868800000, range: 48h0m0s>; group 300000@11257394797428657513: compact blocks [/srv/thanos-compact/compact/300000@11257394797428657513/01KJ0A98QFVAERZ2SW7PRK7GDM /srv/thanos-compact/compact/300000@11257394797428657513/01KJ0K2R2ZSSRKXYAQXPW2JH8F /srv/thanos-compact/compact/300000@11257394797428657513/01KJ0Q8GGTTE4WGMM1YDAY5KTX /srv/thanos-compact/compact/300000@11257394797428657513/01KJ4TTBYXCV65H4K0VSDXF6NS]: 2 errors: add series: symbol table size exceeds 4294967295 bytes: 8741100040; symbol table size exceeds 4294967295 bytes: 8741100040"
Sat, Feb 28, 12:35 AM · SRE Observability (FY2025/2026-Q3), Observability-Metrics
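
The halt message above lists, for each overlapping time range, the blocks covering it together with their `mint` and `maxt` in Unix milliseconds. The following is a rough Python sketch for turning such a message into readable UTC ranges and confirming that two blocks overlap. It assumes the `<ulid: ..., mint: ..., maxt: ..., range: ...>` layout shown in the log and treats each range as half-open, which is an assumption of this sketch; `parse_blocks`, `overlaps`, and `to_utc` are hypothetical helper names, not part of Thanos.

```python
import re
from datetime import datetime, timezone

# Matches one block entry as printed in the overlap error.
BLOCK_RE = re.compile(r"<ulid: (\w+), mint: (\d+), maxt: (\d+), range: [^>]+>")


def parse_blocks(message):
    """Return {ulid: (mint_ms, maxt_ms)} for every block entry in the message."""
    return {ulid: (int(lo), int(hi)) for ulid, lo, hi in BLOCK_RE.findall(message)}


def overlaps(a, b):
    """True when two [mint, maxt) ranges share any time (half-open assumption)."""
    return a[0] < b[1] and b[0] < a[1]


def to_utc(ms):
    """Format a Unix-millisecond timestamp as a UTC date and time."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).strftime("%Y-%m-%d %H:%M")


# First pair of blocks from the log line above.
sample = (
    "<ulid: 01KJ8C9BV9TDJRFJ4WXWJ1JYC5, mint: 1762387200000, "
    "maxt: 1762905600000, range: 144h0m0s>, "
    "<ulid: 01K9VVXTVR1PTF1DXTBQ6FZBTW, mint: 1762732800000, "
    "maxt: 1762905600000, range: 48h0m0s>"
)

blocks = parse_blocks(sample)
for ulid, (mint, maxt) in blocks.items():
    print(ulid, to_utc(mint), "->", to_utc(maxt))

first, second = blocks.values()
print("overlap:", overlaps(first, second))
```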

Fri, Feb 27

herron added a comment to T417163: Noise in #wikimedia-operations is making incident response more difficult.

We could consider setting bots to use direct messages to reduce the amount of chatter in the channel without losing notification/highlight functionality outright. Off hand:

Fri, Feb 27, 6:50 PM · SRE, Sustainability (Incident Followup)
herron updated the task description for T418163: Sloth: onboard existing SLOs to sloth manifests.
Fri, Feb 27, 6:02 PM · Patch-For-Review, SRE-SLO
herron updated the task description for T418163: Sloth: onboard existing SLOs to sloth manifests.
Fri, Feb 27, 6:01 PM · Patch-For-Review, SRE-SLO
herron updated the task description for T418163: Sloth: onboard existing SLOs to sloth manifests.
Fri, Feb 27, 5:56 PM · Patch-For-Review, SRE-SLO
herron updated the task description for T418163: Sloth: onboard existing SLOs to sloth manifests.
Fri, Feb 27, 5:50 PM · Patch-For-Review, SRE-SLO

Thu, Feb 26

herron updated the task description for T418163: Sloth: onboard existing SLOs to sloth manifests.
Thu, Feb 26, 4:32 PM · Patch-For-Review, SRE-SLO
herron updated the task description for T418163: Sloth: onboard existing SLOs to sloth manifests.
Thu, Feb 26, 4:29 PM · Patch-For-Review, SRE-SLO
herron updated the task description for T418163: Sloth: onboard existing SLOs to sloth manifests.
Thu, Feb 26, 4:25 PM · Patch-For-Review, SRE-SLO
herron updated the task description for T418163: Sloth: onboard existing SLOs to sloth manifests.
Thu, Feb 26, 4:21 PM · Patch-For-Review, SRE-SLO
herron updated the task description for T418163: Sloth: onboard existing SLOs to sloth manifests.
Thu, Feb 26, 4:17 PM · Patch-For-Review, SRE-SLO
herron updated the task description for T418163: Sloth: onboard existing SLOs to sloth manifests.
Thu, Feb 26, 4:14 PM · Patch-For-Review, SRE-SLO
herron updated the task description for T418163: Sloth: onboard existing SLOs to sloth manifests.
Thu, Feb 26, 4:08 PM · Patch-For-Review, SRE-SLO
herron updated the task description for T418163: Sloth: onboard existing SLOs to sloth manifests.
Thu, Feb 26, 4:03 PM · Patch-For-Review, SRE-SLO
herron updated the task description for T418163: Sloth: onboard existing SLOs to sloth manifests.
Thu, Feb 26, 3:45 PM · Patch-For-Review, SRE-SLO
herron updated the task description for T418163: Sloth: onboard existing SLOs to sloth manifests.
Thu, Feb 26, 3:28 PM · Patch-For-Review, SRE-SLO
herron updated the task description for T418163: Sloth: onboard existing SLOs to sloth manifests.
Thu, Feb 26, 3:15 PM · Patch-For-Review, SRE-SLO

Wed, Feb 25

herron claimed T417002: Upgrade Observability mwlog hosts to trixie.

I can grab this. We recently racked the mwlog[12]003 hardware and can tackle the hardware refresh and trixie upgrade at the same time.

Wed, Feb 25, 4:19 PM · SRE Observability (FY2025/2026-Q3)
herron added a subtask for T417336: RAM upgrade availability for Titan hosts: T416741: RAM upgrade availability for Titan hosts.
Wed, Feb 25, 2:57 PM · SRE, DC-Ops, ops-codfw
herron added a parent task for T416741: RAM upgrade availability for Titan hosts: T417336: RAM upgrade availability for Titan hosts.
Wed, Feb 25, 2:57 PM · SRE Observability, SRE, ops-eqiad, DC-Ops

Mon, Feb 23

herron updated the task description for T418163: Sloth: onboard existing SLOs to sloth manifests.
Mon, Feb 23, 8:23 PM · Patch-For-Review, SRE-SLO
herron renamed T418163: Sloth: onboard existing SLOs to sloth manifests from Sloth: migrate existing SLOs to sloth manifests to Sloth: onboard existing SLOs to sloth manifests.
Mon, Feb 23, 8:05 PM · Patch-For-Review, SRE-SLO
herron created T418163: Sloth: onboard existing SLOs to sloth manifests.
Mon, Feb 23, 6:16 PM · Patch-For-Review, SRE-SLO
herron updated the task description for T416262: Sloth: productionize and onboard all SLOs.
Mon, Feb 23, 6:07 PM · SRE-SLO
herron closed T414579: sloth deployment as Resolved.
Mon, Feb 23, 5:35 PM · SRE Observability (FY2025/2026-Q3), Observability-Alerting, SRE-SLO
herron closed T414579: sloth deployment, a subtask of T416262: Sloth: productionize and onboard all SLOs, as Resolved.
Mon, Feb 23, 5:35 PM · SRE-SLO
herron updated the task description for T416262: Sloth: productionize and onboard all SLOs.
Mon, Feb 23, 5:30 PM · SRE-SLO
herron updated the task description for T416262: Sloth: productionize and onboard all SLOs.
Mon, Feb 23, 5:29 PM · SRE-SLO

Thu, Feb 19

herron added a comment to T416501: Grant sbassett, aranyap, and alexsanford expanded logstash access.

Hi there @tappof , we're able to authenticate to https://logs-api.svc.eqiad.wmnet/ (GET / works) using the credentials you gave us, but all search endpoints (/_search, /_msearch, _cat/*) return 401 Unauthorized.

Just wanted to check if we have permissions to execute read/search to run search queries or if there's another way we need to access this data.

Thu, Feb 19, 8:28 PM · FY2025-26 WE 4.6 - Account Security, Observability-Logging, Wikimedia-Logstash, Security-Team
herron updated the task description for T417934: puppetserver1002 /srv/git/operations/private out of sync.
Thu, Feb 19, 8:06 PM · Puppet, SRE
herron renamed T417934: puppetserver1002 /srv/git/operations/private out of sync from puppetserver1001 /srv/git/operations/private out of sync to puppetserver1002 /srv/git/operations/private out of sync.
Thu, Feb 19, 8:02 PM · Puppet, SRE
herron created T417934: puppetserver1002 /srv/git/operations/private out of sync.
Thu, Feb 19, 8:01 PM · Puppet, SRE
herron closed T417336: RAM upgrade availability for Titan hosts as Resolved.

Thanks @Jhancock.wm! RAM upgrades on titan200[12] look good!

Thu, Feb 19, 5:25 PM · SRE, DC-Ops, ops-codfw
herron added a comment to T417336: RAM upgrade availability for Titan hosts.

I'll be available all day today, Eastern time. If that works, ping me on IRC when you're ready to start on titan2001? I'll depool and shut down the host for you. FYI, after rebooting it takes about 30 minutes for services to fully reload before we can move on to the next host.

Thu, Feb 19, 3:35 PM · SRE, DC-Ops, ops-codfw

Wed, Feb 18

herron added a comment to T417336: RAM upgrade availability for Titan hosts.

We have plenty of decommissioned servers we can pull from to increase the RAM in these servers. I can check in the morning whether their modules are the same speed as the ones already in the server (2666 MT/s).

Wed, Feb 18, 4:54 PM · SRE, DC-Ops, ops-codfw
herron updated subscribers of T417001: Upgrade Observability Kafka hosts to trixie.

Looks like this would be the first Kafka cluster on trixie. For Kafka 3.5 we would likely bring 7.5.12-1 (3.5) to trixie, along with an appropriate JDK version. I'm not sure offhand which JDK version is recommended for 3.5.

Wed, Feb 18, 4:09 PM · SRE Observability
herron merged task T417710: ThanosCompactHalted: pre compaction overlap check: overlaps found while gathering blocks. into T417742: ThanosCompactHalted: pre compaction overlap check - overlaps found while gathering blocks..
Wed, Feb 18, 2:51 PM · SRE Observability
herron merged T417710: ThanosCompactHalted: pre compaction overlap check: overlaps found while gathering blocks. into T417742: ThanosCompactHalted: pre compaction overlap check - overlaps found while gathering blocks..
Wed, Feb 18, 2:51 PM · SRE Observability (FY2025/2026-Q3), Observability-Metrics

Tue, Feb 17

herron created T417710: ThanosCompactHalted: pre compaction overlap check: overlaps found while gathering blocks..
Tue, Feb 17, 11:14 PM · SRE Observability
herron added a comment to T416741: RAM upgrade availability for Titan hosts.

Excellent! What is a good start time for you today? I can depool titan1001 ahead of that. I should also mention that the titan hosts can take about an hour to reach green state after restarting. We could stagger this as rolling maintenance over a couple hours, or alternatively days, whichever would be easier from the datacenter perspective. Thanks!

Tue, Feb 17, 3:11 PM · SRE Observability, SRE, ops-eqiad, DC-Ops
herron added a comment to T416741: RAM upgrade availability for Titan hosts.

Hi @VRiley-WMF sure, by the way were you able to reclaim any more RAM?

Tue, Feb 17, 3:03 PM · SRE Observability, SRE, ops-eqiad, DC-Ops

Fri, Feb 13

herron added a comment to T416745: ThanosCompactHalted: add series - symbol table size exceeds.
titan2001:~# journalctl -u thanos-compact | grep halt | tail -n 1 | sed -nr 's/^.*\[(.*)\].*$/\1/p' | tr -s ' ' '\n' | awk -F '/' '{print $NF}' | xargs -I % thanos tools bucket --objstore.config-file /etc/thanos-compact/objstore.yaml mark --id=% --marker=no-compact-mark.json --details="compactor halted due to size"
ts=2026-02-13T19:24:09.4810147Z caller=factory.go:54 level=info msg="loading bucket configuration"
ts=2026-02-13T19:24:09.883338265Z caller=block.go:406 level=info msg="block has been marked for no compaction" block=01KFDRK14WJRPDK0ANJT01XVMX
ts=2026-02-13T19:24:09.883373278Z caller=tools_bucket.go:1134 level=info msg="marking done" marker=no-compact-mark.json IDs=01KFDRK14WJRPDK0ANJT01XVMX
ts=2026-02-13T19:24:09.883404009Z caller=main.go:174 level=info msg=exiting
ts=2026-02-13T19:24:09.90964839Z caller=factory.go:54 level=info msg="loading bucket configuration"
ts=2026-02-13T19:24:10.284998705Z caller=block.go:406 level=info msg="block has been marked for no compaction" block=01KGQ9N12TEE5HD4QAMH0NNTSR
ts=2026-02-13T19:24:10.285037134Z caller=tools_bucket.go:1134 level=info msg="marking done" marker=no-compact-mark.json IDs=01KGQ9N12TEE5HD4QAMH0NNTSR
ts=2026-02-13T19:24:10.285071936Z caller=main.go:174 level=info msg=exiting
ts=2026-02-13T19:24:10.323480343Z caller=factory.go:54 level=info msg="loading bucket configuration"
ts=2026-02-13T19:24:10.715128045Z caller=block.go:406 level=info msg="block has been marked for no compaction" block=01KFV16XPTVSDXK24M7X7J27J4
ts=2026-02-13T19:24:10.715167028Z caller=tools_bucket.go:1134 level=info msg="marking done" marker=no-compact-mark.json IDs=01KFV16XPTVSDXK24M7X7J27J4
ts=2026-02-13T19:24:10.715202207Z caller=main.go:174 level=info msg=exiting
ts=2026-02-13T19:24:10.740656721Z caller=factory.go:54 level=info msg="loading bucket configuration"
ts=2026-02-13T19:24:11.20937999Z caller=block.go:406 level=info msg="block has been marked for no compaction" block=01KFV9FDXQN7PGAJZEKF74Q21C
ts=2026-02-13T19:24:11.209414608Z caller=tools_bucket.go:1134 level=info msg="marking done" marker=no-compact-mark.json IDs=01KFV9FDXQN7PGAJZEKF74Q21C
ts=2026-02-13T19:24:11.209452482Z caller=main.go:174 level=info msg=exiting
ts=2026-02-13T19:24:11.234464264Z caller=factory.go:54 level=info msg="loading bucket configuration"
ts=2026-02-13T19:24:11.674423726Z caller=block.go:406 level=info msg="block has been marked for no compaction" block=01KGQMFF377VXBJTMXTDR2D0RG
ts=2026-02-13T19:24:11.674470577Z caller=tools_bucket.go:1134 level=info msg="marking done" marker=no-compact-mark.json IDs=01KGQMFF377VXBJTMXTDR2D0RG
ts=2026-02-13T19:24:11.674507674Z caller=main.go:174 level=info msg=exiting
Fri, Feb 13, 7:24 PM · SRE Observability (FY2025/2026-Q3)
herron added a comment to T416741: RAM upgrade availability for Titan hosts.

Yes, I can certainly plan for this on Monday, if you'd like! Also, we have some more decomm servers we are addressing today. I'll be checking those servers to see if we can reclaim more RAM for this upgrade as well.

Fri, Feb 13, 7:22 PM · SRE Observability, SRE, ops-eqiad, DC-Ops
herron added a comment to T416741: RAM upgrade availability for Titan hosts.

@herron at EQIAD I have found a total of six (6) 32 GB RAM sticks that we could use for this upgrade. I would recommend that we add 64 GB per server (2 sticks) and hold on to the other 2 in case we want to expand further in the future. Is there any particular day you'd like to proceed with this? I can accommodate any time.

Fri, Feb 13, 12:01 AM · SRE Observability, SRE, ops-eqiad, DC-Ops

Thu, Feb 12

herron added a parent task for T417313: titan2001: expand ssd storage: T417308: DiskSpace: Disk space titan2001:9100:/srv 2.71% free.
Thu, Feb 12, 7:00 PM · SRE, DC-Ops, ops-codfw
herron added a subtask for T417308: DiskSpace: Disk space titan2001:9100:/srv 2.71% free: T417313: titan2001: expand ssd storage.
Thu, Feb 12, 7:00 PM · SRE Observability
herron created T417313: titan2001: expand ssd storage.
Thu, Feb 12, 6:58 PM · SRE, DC-Ops, ops-codfw
herron added a subtask for T410152: Disk space saturation (/srv) on Titan hosts: T417308: DiskSpace: Disk space titan2001:9100:/srv 2.71% free.
Thu, Feb 12, 6:55 PM · SRE Observability (FY2025/2026-Q3)
herron added a parent task for T417308: DiskSpace: Disk space titan2001:9100:/srv 2.71% free: T410152: Disk space saturation (/srv) on Titan hosts.
Thu, Feb 12, 6:55 PM · SRE Observability
herron closed T417308: DiskSpace: Disk space titan2001:9100:/srv 2.71% free as Resolved.

I've depooled frontend services on titan2001 and will plan to free the space used by this thanos-store instance as a stopgap measure to keep the compactor running

Thu, Feb 12, 6:54 PM · SRE Observability
herron created T417308: DiskSpace: Disk space titan2001:9100:/srv 2.71% free.
Thu, Feb 12, 6:45 PM · SRE Observability
herron added a comment to T416741: RAM upgrade availability for Titan hosts.

I'm assuming you're also looking to upgrade some at CODFW, correct @herron ?

Thu, Feb 12, 4:41 PM · SRE Observability, SRE, ops-eqiad, DC-Ops

Fri, Feb 6

herron added a comment to T416745: ThanosCompactHalted: add series - symbol table size exceeds.
Feb 06 21:29:32 titan2001 thanos-compact[1053544]: ts=2026-02-06T21:29:32.814266612Z caller=compact.go:559 level=error msg="critical error detected; halting" err="compaction: group 0@11257394797428657513: compact blocks [/srv/thanos-compact/compact/0@11257394797428657513/01KFHBRFDPPK0JMZPQFKTT1Y40 /srv/thanos-compact/compact/0@11257394797428657513/01KFHQ5VK2MP972WE4E14EQXRD]: 2 errors: add series: symbol table size exceeds 4294967295 bytes: 4982340497; symbol table size exceeds 4294967295 bytes: 4982340497"
Fri, Feb 6, 9:56 PM · SRE Observability (FY2025/2026-Q3)
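
The limit quoted in the error above, 4294967295 bytes, equals 2^32 - 1, the largest value an unsigned 32-bit integer can hold. Here is a small Python check of how far the reported symbol table sizes overshoot it, using the 4982340497-byte figure from this comment and the 8741100040-byte figure from the later halt in T417742. The helper name `overshoot` is hypothetical.

```python
LIMIT = 2**32 - 1  # 4294967295, the limit printed in the error message


def overshoot(size_bytes, limit=LIMIT):
    """Return (bytes over the limit, size as a multiple of the limit)."""
    return size_bytes - limit, size_bytes / limit


for size in (4982340497, 8741100040):
    over, ratio = overshoot(size)
    print(f"{size} bytes: {over} bytes over the limit, {ratio:.2f}x the limit")
```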

Feb 6 2026

herron added a comment to T416745: ThanosCompactHalted: add series - symbol table size exceeds.

omitted these blocks from compaction for now

Feb 6 2026, 9:16 PM · SRE Observability (FY2025/2026-Q3)
herron created T416745: ThanosCompactHalted: add series - symbol table size exceeds.
Feb 6 2026, 9:03 PM · SRE Observability (FY2025/2026-Q3)
herron created T416741: RAM upgrade availability for Titan hosts.
Feb 6 2026, 8:08 PM · SRE Observability, SRE, ops-eqiad, DC-Ops

Feb 4 2026

herron added a comment to T414579: sloth deployment.

In T414579#11581131, @tappof wrote:

  • Templates SLO manifests and allows default values (e.g. default alert state)
  • Allows sweeping changes centrally (e.g. introduce new tag, change window, etc.)

These are definitely pros of keeping the definitions in the Puppet repository. With a dedicated repository, we can set up a CI pipeline to highlight sensitive values that differ from the defaults.

Feb 4 2026, 4:09 AM · SRE Observability (FY2025/2026-Q3), Observability-Alerting, SRE-SLO

Feb 3 2026

herron added a comment to T414579: sloth deployment.

For the purposes of sloth onboarding I'd strongly prefer SLO definitions continue to live in puppet, for several reasons:

Feb 3 2026, 4:11 PM · SRE Observability (FY2025/2026-Q3), Observability-Alerting, SRE-SLO

Feb 2 2026

herron added a parent task for T414579: sloth deployment: T416262: Sloth: productionize and onboard all SLOs.
Feb 2 2026, 10:37 PM · SRE Observability (FY2025/2026-Q3), Observability-Alerting, SRE-SLO
herron added a subtask for T416262: Sloth: productionize and onboard all SLOs: T414579: sloth deployment.
Feb 2 2026, 10:37 PM · SRE-SLO
herron closed T416263: Sloth: create Debian package as Resolved.

Sloth package for 0.15.0 has been built via gitlab CI and uploaded to apt:

Feb 2 2026, 10:36 PM · SRE-SLO
herron closed T416263: Sloth: create Debian package, a subtask of T416262: Sloth: productionize and onboard all SLOs, as Resolved.
Feb 2 2026, 10:36 PM · SRE-SLO
herron updated the task description for T416262: Sloth: productionize and onboard all SLOs.
Feb 2 2026, 10:31 PM · SRE-SLO
herron created T416263: Sloth: create Debian package.
Feb 2 2026, 10:31 PM · SRE-SLO
herron created T416262: Sloth: productionize and onboard all SLOs.
Feb 2 2026, 10:29 PM · SRE-SLO
herron closed T409310: Sloth: onboard subset of existing SLOs to pilot, a subtask of T404171: Evaluate Sloth as a possible replacement for Pyrra, as Resolved.
Feb 2 2026, 10:23 PM · SRE-SLO
herron closed T409310: Sloth: onboard subset of existing SLOs to pilot as Resolved.

Closing this as pilot onboarding has finished, wider onboarding will be tracked in parent task!

Feb 2 2026, 10:23 PM · SRE-SLO
herron closed T404171: Evaluate Sloth as a possible replacement for Pyrra, a subtask of T403729: Pyrra calculations for the Initial error budget value of calendar windows, as Resolved.
Feb 2 2026, 10:22 PM · SRE-SLO
herron closed T404171: Evaluate Sloth as a possible replacement for Pyrra as Resolved.

SLO WG has decided together to proceed with a production roll out of sloth, which will be tracked in the parent task!

Feb 2 2026, 10:22 PM · SRE-SLO
herron updated the task description for T404171: Evaluate Sloth as a possible replacement for Pyrra.
Feb 2 2026, 10:21 PM · SRE-SLO

Jan 22 2026

herron added a comment to T412230: Q2:rack/setup/install mwlog1003.

Thanks so much for sorting through this @Papaul and @Jclark-ctr! Yes looks good to me, ready to revert to the reuse variant. Thanks again!

Jan 22 2026, 3:08 PM · SRE Observability (FY2025/2026-Q3), SRE, ops-eqiad, DC-Ops

Jan 14 2026

herron closed T414604: Increase Grafana VM memory as Resolved.

Bumped the Grafana VMs in codfw and eqiad to 12G (up from 4G) and used this as an opportunity to increase vCPUs to 2 as well.

Jan 14 2026, 4:56 PM · SRE Observability
herron claimed T414604: Increase Grafana VM memory.
Jan 14 2026, 4:38 PM · SRE Observability
herron added a comment to T412229: Q2:rack/setup/install mwlog2003.

@herron two things

  • Do you mind if I rack this in the new expansion cage at codfw?
Jan 14 2026, 12:44 AM · SRE Observability (FY2025/2026-Q3), SRE, ops-codfw, DC-Ops

Dec 17 2025

herron moved T412965: Tune phaultfinder task titles for SLO alerts from Inbox to Backlog on the SRE Observability board.
Dec 17 2025, 3:49 PM · SRE Observability
herron added a comment to T412493: ErrorBudgetBurn.

I think we can customize which fields go into the title, to add the slo field in there as you suggest, but I don't know how exactly -- I see in alertmanager.yml.erb that there's a title urlparam but I can't immediately tell what its format is. @herron, any ideas?

Dec 17 2025, 3:37 PM · Test Kitchen (Experiment Platform Sprint 20)
herron created T412965: Tune phaultfinder task titles for SLO alerts.
Dec 17 2025, 3:27 PM · SRE Observability
herron moved T412799: Improve MediaWiki periodic job alerts from Inbox to Radar on the SRE Observability board.
Dec 17 2025, 3:17 PM · ServiceOps new, MediaWiki-Platform-Team (Q3 Kanban Board), SRE Observability
herron moved T412801: Proof of Concept: Train Health Dashboard from Inbox to Radar on the SRE Observability board.
Dec 17 2025, 3:14 PM · Incident Tooling, ServiceOps new, ServiceOps-Mediawiki, Release-Engineering-Team
herron moved T412229: Q2:rack/setup/install mwlog2003 from Inbox to FY2025/2026-Q2 on the SRE Observability board.
Dec 17 2025, 2:47 PM · SRE Observability (FY2025/2026-Q3), SRE, ops-codfw, DC-Ops
herron edited projects for T412229: Q2:rack/setup/install mwlog2003, added: SRE Observability; removed observability.
Dec 17 2025, 2:47 PM · SRE Observability (FY2025/2026-Q3), SRE, ops-codfw, DC-Ops
herron moved T412230: Q2:rack/setup/install mwlog1003 from Inbox to FY2025/2026-Q2 on the SRE Observability board.
Dec 17 2025, 2:47 PM · SRE Observability (FY2025/2026-Q3), SRE, ops-eqiad, DC-Ops
herron edited projects for T412230: Q2:rack/setup/install mwlog1003, added: SRE Observability; removed observability.
Dec 17 2025, 2:46 PM · SRE Observability (FY2025/2026-Q3), SRE, ops-eqiad, DC-Ops

Dec 16 2025

herron added a comment to T412842: arclamp hosts ran out of space.

Change #1218821 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] arclamp: reduce compress days

https://gerrit.wikimedia.org/r/1218821

Dec 16 2025, 8:08 PM · Observability-Logging, Arc-Lamp

Dec 8 2025

herron updated the task description for T404171: Evaluate Sloth as a possible replacement for Pyrra.
Dec 8 2025, 8:06 PM · SRE-SLO
herron added a comment to T409310: Sloth: onboard subset of existing SLOs to pilot.

Updated the wikifunctions sloth pilot SLO to enable low priority "ticket" alerting.

Dec 8 2025, 8:05 PM · SRE-SLO

Dec 4 2025

herron added a comment to T409310: Sloth: onboard subset of existing SLOs to pilot.

onboarded wikifunctions today as well with config:

Dec 4 2025, 4:59 PM · SRE-SLO
herron updated the task description for T409310: Sloth: onboard subset of existing SLOs to pilot.
Dec 4 2025, 4:58 PM · SRE-SLO

Dec 2 2025

herron updated the task description for T404171: Evaluate Sloth as a possible replacement for Pyrra.
Dec 2 2025, 4:52 PM · SRE-SLO

Dec 1 2025

herron updated the task description for T404171: Evaluate Sloth as a possible replacement for Pyrra.
Dec 1 2025, 9:01 PM · SRE-SLO
herron added a comment to T409312: Sloth: adapt default month view to quarter view (pilot).

Made a couple more adjustments to https://grafana.wikimedia.org/d/slot-pilot-slo-detail/sloth-s-l-o-detail to clean up the rolling window portion

Dec 1 2025, 8:37 PM · SRE-SLO
herron updated the task description for T404171: Evaluate Sloth as a possible replacement for Pyrra.
Dec 1 2025, 4:15 PM · SRE-SLO
herron closed T409312: Sloth: adapt default month view to quarter view (pilot), a subtask of T404171: Evaluate Sloth as a possible replacement for Pyrra, as Resolved.
Dec 1 2025, 4:13 PM · SRE-SLO
herron closed T409312: Sloth: adapt default month view to quarter view (pilot) as Resolved.

Agreed, looks good!

Dec 1 2025, 4:13 PM · SRE-SLO
herron renamed T409312: Sloth: adapt default month view to quarter view (pilot) from Sloth: adapt default month view to quarter view to Sloth: adapt default month view to quarter view (pilot).
Dec 1 2025, 4:12 PM · SRE-SLO
herron updated the task description for T404171: Evaluate Sloth as a possible replacement for Pyrra.
Dec 1 2025, 4:11 PM · SRE-SLO

Nov 25 2025

herron added a comment to T407503: Verify that the Pyrra dashboard is measuring what we think it is, and what it should be measuring.

https://gerrit.wikimedia.org/r/1211177

Elukey, Patchset 1, 11:50 AM:

I think it could be a good test but I would try to explain why we get the difference outlined in https://w.wiki/GHoH, because IIUC we should really see the drops in the first place. Maybe there is something extra that we are not seeing?

Nov 25 2025, 7:41 PM · Abstract Wikipedia team (26Q3 (Jan–Mar)), Essential-Work

Nov 24 2025

herron created T410933: Add Druid as a Private Grafana Datasource.
Nov 24 2025, 6:26 PM · Observability-Metrics, SRE Observability (FY2025/2026-Q3), SRE