arclamp hosts ran out of space
Open, Needs Triage, Public

Assigned To
None
Authored By
tappof
Dec 16 2025, 4:33 PM
Referenced Files
F71090840: image.png
Dec 16 2025, 5:17 PM
F71090832: image.png
Dec 16 2025, 5:17 PM
F71090608: image.png
Dec 16 2025, 4:33 PM
F71090600: image.png
Dec 16 2025, 4:33 PM

Description

image.png (1×2 px, 178 KB)

image.png (1×2 px, 180 KB)

Event Timeline

image.png (1×2 px, 190 KB)

image.png (1×2 px, 83 KB)

Noticed a change in the pattern since Nov 26 (more or less).
Custom log rotation seems to be working fine.
Took a look at the Arclamp Grafana dashboard (https://grafana.wikimedia.org/goto/q6WP_PMvg?orgId=1) and found another perturbation during June/July, likely related to “Errors from MediaWiki sampler.”

On Gerrit (https://gerrit.wikimedia.org/r/q/mergedbefore:2025-11-26+mergedafter:2025-11-24+excimer), I found the patch https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1160800 with a comment by Timo (https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1160800/comments/b5f4cad1_158b730c) stating that, theoretically, it does not affect Excimer.

Change #1218821 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] arclamp: reduce compress days

https://gerrit.wikimedia.org/r/1218821

^ I've done this to free space in the past; it should be OK as a stopgap while the root cause is being addressed.

Change #1218821 merged by Herron:

[operations/puppet@production] arclamp: reduce compress days

https://gerrit.wikimedia.org/r/1218821

Maybe we should look into implementing a way for arclamp to create tasks when this issue happens.

I learned some things.

  1. Each excimer stack frame is stored 4 times in uncompressed log files: daily-all, daily-tag, hourly-all, and hourly-tag.
  2. The compress job sorts each log file serially into a separate log file, then compresses the sorted version. The sorting is done to improve the gzip compression ratio.
  3. Once the logs are compressed, the compressed log file is then uploaded to swift.
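
To illustrate why sorting before compression helps (point 2), here is a minimal Python sketch. It is not the actual arclamp compress job: the sample line format and the names `compressed_size` and `stacks` are hypothetical, chosen only to show that gzip does better when identical lines sit next to each other.

```python
import gzip
import random


def compressed_size(lines):
    """Return the gzip-compressed size, in bytes, of the given lines."""
    data = "\n".join(lines).encode("utf-8")
    return len(gzip.compress(data))


# Hypothetical collapsed-stack samples; real arclamp log lines may look different.
random.seed(0)
stacks = [
    f"index.php;Foo::run;Bar::step{i % 50};Baz::leaf{i % 7} 1"
    for i in range(20000)
]
random.shuffle(stacks)  # simulate samples arriving in arbitrary order

unsorted_size = compressed_size(stacks)
sorted_size = compressed_size(sorted(stacks))  # identical stacks become adjacent

print(f"unsorted: {unsorted_size} bytes, sorted: {sorted_size} bytes")
```

The trade-off is that the sorted copy has to exist somewhere before it is compressed, so each file temporarily needs extra disk space while the job runs.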

There was an increase in log volume comparing Nov 11-20 (avg 8,660,451) to Dec 1-10 (avg 9,909,823). (These averages are deduplicated, so 4x these values are on disk.)
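
For reference, the arithmetic behind those figures can be worked out as below. The two averages and the factor of 4 come from this task; the variable names are only illustrative, and the unit of the averages is whatever the numbers above measure (it is not stated here).

```python
# Deduplicated averages quoted above.
NOV_AVG = 8_660_451   # Nov 11-20
DEC_AVG = 9_909_823   # Dec 1-10
COPIES = 4            # daily-all, daily-tag, hourly-all, hourly-tag

increase = DEC_AVG - NOV_AVG
pct_increase = increase / NOV_AVG * 100

on_disk_nov = NOV_AVG * COPIES
on_disk_dec = DEC_AVG * COPIES

print(f"increase: {increase:,} ({pct_increase:.1f}%)")
print(f"on disk: {on_disk_nov:,} -> {on_disk_dec:,}")
```

That is an increase of roughly 14% in the deduplicated volume, and the same relative increase applies to the 4x copies on disk.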