Disk space full on vanadium from logs in /var/log/upstart
Closed, Resolved, Public

Description

Vanadium's / partition is full from logs in /var/log/upstart, of the form 'eventlogging_processor-client-side-events.log.1' etc. A lot of them seem to be large notices of events failing to validate?

Event Timeline

yuvipanda raised the priority of this task from to Needs Triage.
yuvipanda updated the task description.
Restricted Application added a subscriber: Aklapper. Mar 19 2015, 10:32 AM

I am temporarily moving the biggest of the files (51G eventlogging_processor-client-side-events.log.1) onto /srv. Someone who knows more about the EventLogging service should take a look.

The biggest logs seem to be filled with validation errors about https://meta.wikimedia.org/wiki/Schema:Edit

yuvipanda triaged this task as Normal priority. Mar 19 2015, 11:09 AM
yuvipanda raised the priority of this task from Normal to Unbreak Now!. Mar 19 2015, 11:58 AM

Free space is being gobbled up really fast still, and won't last more than a few hours.

@Nuria Is the eventlogging deploy you did yesterday / day before (brrr timezones?) responsible, maybe?

Thanks @yuvipanda for the emergency fix, I'm tagging our team and making this high importance.

Milimetric set Security to None.

Cool :) I also copied logs_02_06_onward folder onto /srv to make space as well.

Nuria added a comment. Mar 19 2015, 1:55 PM

Sorry, I did not send an e-mail to the team yesterday. The root cause is a huge volume of invalid events that we cannot deal with.

Volume needs to be lower and invalid events need to be fixed.

Change 197929 had a related patch set uploaded (by Ottomata):
Don't print out full exception details on validation error

https://gerrit.wikimedia.org/r/197929
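
For illustration, the idea in that patch title (log a short line per invalid event instead of the full exception details) might look roughly like the sketch below. This is not the code from the Gerrit change; `ValidationError`, `validate`, `summarize_error` and `process` are hypothetical names, and the 200-character cap is an assumed value.

```python
import logging

logger = logging.getLogger("processor_sketch")


class ValidationError(Exception):
    """Hypothetical stand-in for a schema validation failure."""


def validate(event):
    # Hypothetical check; a real validator would compare against a schema.
    if "schema" not in event:
        raise ValidationError("missing required field 'schema'")


def summarize_error(exc, limit=200):
    """Return a one-line, length-capped summary of an exception."""
    text = " ".join(str(exc).split())  # collapse newlines and runs of spaces
    if len(text) > limit:
        text = text[: limit - 3] + "..."
    return f"{type(exc).__name__}: {text}"


def process(events):
    """Yield valid events; log one short line per invalid event."""
    for event in events:
        try:
            validate(event)
        except ValidationError as exc:
            # logger.exception() would append the full traceback here,
            # which is what makes each failure expensive on disk.
            logger.error("Unable to validate event: %s", summarize_error(exc))
            continue
        yield event
```

With a high volume of invalid events, the cost per failure drops from a multi-line traceback to a single bounded line.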

Nuria added a comment. Mar 19 2015, 4:26 PM

Crisis averted: logs are now growing at 300kbs per sec, not *ahem* 2MB per sec. Resolving ticket.
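
To put those two rates in perspective, here is a small back-of-the-envelope calculation. It assumes "300kbs" means roughly 300 kilobytes per second and uses decimal units (1 GB = 10^9 bytes); both are assumptions, not figures confirmed in this task.

```python
SECONDS_PER_DAY = 86_400


def gb_per_day(bytes_per_second):
    """Convert a sustained write rate into decimal gigabytes per day."""
    return bytes_per_second * SECONDS_PER_DAY / 1e9


before = gb_per_day(2_000_000)  # about 2 MB per second
after = gb_per_day(300_000)     # about 300 kB per second (assumed unit)

print(f"before: {before:.1f} GB/day, after: {after:.1f} GB/day")
```

Under those assumptions the log growth drops from roughly 173 GB per day to roughly 26 GB per day, which is slower but still substantial.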

Nuria closed this task as Resolved. Mar 19 2015, 4:26 PM

Change 197929 merged by jenkins-bot:
Don't print out full exception details on validation error

https://gerrit.wikimedia.org/r/197929

yuvipanda reopened this task as Open. Mar 26 2015, 5:52 AM

PROBLEM - Disk space on vanadium is CRITICAL: DISK CRITICAL - free space: /srv 12711 MB (3% inode=99%):

is happening again, but on /srv
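
For anyone wanting to reproduce a check like the alert above by hand, a minimal sketch using only the Python standard library follows. The 5% critical threshold and the function names are assumptions for illustration, not the settings of the actual monitoring check, and inode usage is not covered.

```python
import shutil


def classify(free_bytes, total_bytes, critical_pct=5.0):
    """Format a status line from raw free/total byte counts."""
    free_pct = 100.0 * free_bytes / total_bytes
    state = "CRITICAL" if free_pct < critical_pct else "OK"
    free_mb = free_bytes // (1024 * 1024)
    return f"DISK {state} - free space: {free_mb} MB ({free_pct:.0f}%)"


def free_space_report(path, critical_pct=5.0):
    """Check the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return classify(usage.free, usage.total, critical_pct)


if __name__ == "__main__":
    print(free_space_report("/"))
```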

Nuria added a comment. Edited Mar 26 2015, 5:32 PM

I have cleaned up a bunch of logs that had been copied by hand to /srv. The issues with the logging system being too verbose were solved in this patch: https://gerrit.wikimedia.org/r/197929

Nuria added a comment. Mar 26 2015, 5:42 PM

Issues with getting a larger inflow of events, and thus the disk on /srv filling up (as events are rightfully logged there), can only be solved by swapping the box.

Change 199957 had a related patch set uploaded (by Nuria):
Vanadium to keep logs for 30 days

https://gerrit.wikimedia.org/r/199957

Change 199957 merged by Ottomata:
Vanadium to keep logs for 30 days

https://gerrit.wikimedia.org/r/199957
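
As a rough illustration of what a 30-day retention policy means in practice, the sketch below lists (and optionally deletes) files older than a cutoff based on modification time. It is not the mechanism used in the merged change; the function names and the dry-run default are assumptions made for this example.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86_400


def expired_logs(directory, max_age_days=30, now=None):
    """Return files in `directory` last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.stat().st_mtime < cutoff
    )


def prune(directory, max_age_days=30, dry_run=True):
    """Delete expired files unless dry_run is set; return the affected paths."""
    affected = []
    for path in expired_logs(directory, max_age_days):
        if not dry_run:
            path.unlink()
        affected.append(path)
    return affected
```

Calling `prune(some_directory)` with the default `dry_run=True` only reports what would be removed, which is a safer first step on a box that is already short on space.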

Dzahn added a subscriber: Dzahn. Mar 31 2015, 7:37 PM
Filesystem           Size  Used Avail Use% Mounted on
/dev/md1             111G   60G   45G  58% /
udev                 3.9G  4.0K  3.9G   1% /dev
tmpfs                798M  328K  798M   1% /run
none                 5.0M     0  5.0M   0% /run/lock
none                 3.9G     0  3.9G   0% /run/shm
/dev/mapper/vg0-lv0  352G  104G  249G  30% /srv

^ looks resolved now?

Dzahn lowered the priority of this task from Unbreak Now! to Normal. Mar 31 2015, 7:39 PM
Dzahn closed this task as Resolved. Apr 2 2015, 12:07 AM
Dzahn claimed this task.
kevinator moved this task from Next Up to Done on the Analytics-Kanban board. Apr 9 2015, 2:12 PM