eventlogging logs taking a huge amount of space on eventlog1002 and stat1005
Closed, Resolved | Public | 8 Estimated Story Points

Description

We should review the retention of the eventlogging logs on eventlog1002 and stat1005:

elukey@eventlog1002:~$ df -h
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/eventlog1002--vg-data  870G  717G  110G  87% /srv

elukey@eventlog1002:~$ du -hs /srv/log/eventlogging
702G	/srv/log/eventlogging

elukey@stat1005:~$ du -hs /srv/log/eventlogging/archive/
946G	/srv/log/eventlogging/archive/

Event Timeline

elukey triaged this task as High priority. Oct 9 2018, 3:13 PM
elukey created this task.

Change 465569 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::eventlogging::analytics::files: reduce retention for archive

https://gerrit.wikimedia.org/r/465569

elukey added a project: Analytics-Kanban.
elukey set the point value for this task to 5.
elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.

Change 465573 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] statistics::rsync::eventlogging: reduce retention for archive

https://gerrit.wikimedia.org/r/465573

Change 465569 merged by Elukey:
[operations/puppet@production] role::eventlogging::analytics::files: reduce retention for archive

https://gerrit.wikimedia.org/r/465569

Change 465573 merged by Elukey:
[operations/puppet@production] statistics::rsync::eventlogging: reduce retention for archive

https://gerrit.wikimedia.org/r/465573

Both changes are merged; disk usage should go down on both eventlog1002 and stat1005 after the next logrotate run. Keeping this task open to verify this.

Suggestion is:

  • make a camus job that ingests the raw client-side Kafka stream into HDFS
  • reduce retention of logs to 30 days on stats machines, letting HDFS handle the long retention (90 days)
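
To get a feel for how much a 30-day cutoff would free before changing the real retention settings, a dry-run listing can help. The following is only a minimal sketch: the actual retention in this task is managed through puppet, and the helper name `files_older_than` and the use of file modification time as the age criterion are assumptions made for illustration.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def files_older_than(directory, max_age_days, now=None):
    """Return files under `directory` whose mtime is older than `max_age_days`.

    Hypothetical helper for a dry run: it only lists candidates and
    never deletes anything.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    return sorted(
        p for p in Path(directory).rglob("*")
        if p.is_file() and p.stat().st_mtime < cutoff
    )
```

For example, `files_older_than("/srv/log/eventlogging/archive", 30)` would list the archive files that a 30-day retention would no longer keep.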

Change 467646 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::refinery::job::camus: use camus to backup el-client-side

https://gerrit.wikimedia.org/r/467646

Change 467648 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::eventlogging::analytics::files: lower down retention

https://gerrit.wikimedia.org/r/467648

Change 467646 merged by Elukey:
[operations/puppet@production] profile::analytics::refinery::job::camus: use camus to backup el-client-side

https://gerrit.wikimedia.org/r/467646

Change 467746 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] camus: fix eventlogging-client-side configuration

https://gerrit.wikimedia.org/r/467746

Change 467746 merged by Elukey:
[operations/puppet@production] camus: fix eventlogging-client-side configuration

https://gerrit.wikimedia.org/r/467746

Change 467909 had a related patch set uploaded (by Elukey; owner: Elukey):
[analytics/camus@master] Add the config option 'camus.message.json.setlenient'

https://gerrit.wikimedia.org/r/467909

Change 467926 had a related patch set uploaded (by Elukey; owner: Elukey):
[analytics/refinery@master] util.py: improve is_yarn_application_running pattern match

https://gerrit.wikimedia.org/r/467926

Change 467926 merged by Elukey:
[analytics/refinery@master] util.py: improve is_yarn_application_running pattern match

https://gerrit.wikimedia.org/r/467926

Change 467909 abandoned by Elukey:
Add the config option 'camus.message.json.setlenient'

Reason:
This needs to be done in the wmf branch

https://gerrit.wikimedia.org/r/467909

Change 468016 had a related patch set uploaded (by Elukey; owner: Elukey):
[analytics/camus@wmf] Add StringMessageDecoder to the list of kafka coders

https://gerrit.wikimedia.org/r/468016

Change 468044 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::refinery::job::camus: temp disable eventlogging-client-side

https://gerrit.wikimedia.org/r/468044

Change 468044 merged by Elukey:
[operations/puppet@production] profile::analytics::refinery::job::camus: temp disable eventlogging-client-side

https://gerrit.wikimedia.org/r/468044

Change 468016 merged by Elukey:
[analytics/camus@wmf] Add StringMessageDecoder to the list of kafka coders

https://gerrit.wikimedia.org/r/468016

Change 468369 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] camus: tune eventlogging-client-side config

https://gerrit.wikimedia.org/r/468369

Change 468369 merged by Elukey:
[operations/puppet@production] camus: tune eventlogging-client-side config

https://gerrit.wikimedia.org/r/468369

Change 467648 merged by Elukey:
[operations/puppet@production] role::eventlogging::analytics::files: lower down retention

https://gerrit.wikimedia.org/r/467648

Change 468998 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::refinery::job::camus: enable eventlogging-client-side

https://gerrit.wikimedia.org/r/468998

Change 468998 merged by Elukey:
[operations/puppet@production] profile::analytics::refinery::job::camus: enable eventlogging-client-side

https://gerrit.wikimedia.org/r/468998

We finally deployed the new camus eventlogging-client-side job, which periodically dumps raw eventlogging data to HDFS. The remaining step is to reduce stat1005's retention to something like 60 days to free space, but that can only be done once HDFS holds the same amount of data.
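
One way to decide when HDFS "holds the same amount of data" is to check that every hour in the window we want to drop locally is already present there. This is a sketch under assumptions: the function name `missing_hours` is made up, and the list of hours present in HDFS is assumed to have been collected separately (for instance by listing the hourly directories).

```python
from datetime import datetime, timedelta


def missing_hours(present, start, end):
    """Return every hour in [start, end) that is absent from `present`.

    `present` is an iterable of datetime objects truncated to the hour,
    one per hourly partition already imported (assumed input).
    """
    have = set(present)
    missing = []
    hour = start
    while hour < end:
        if hour not in have:
            missing.append(hour)
        hour += timedelta(hours=1)
    return missing
```

An empty result for the window in question would suggest it is safe to lower the local retention; any returned hour is a gap to investigate first.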

Change 469384 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::eventlogging::analytics::files: reduce retention to 7 days

https://gerrit.wikimedia.org/r/469384

Change 469384 merged by Elukey:
[operations/puppet@production] profile::eventlogging::analytics::files: reduce retention to 7 days

https://gerrit.wikimedia.org/r/469384

elukey changed the point value for this task from 5 to 8.

Did we update the docs with the new location for logs older than 90 days?

@elukey: confirming that we have set up deletion for files like hdfs dfs -text /wmf/data/raw/eventlogging_client_side/eventlogging-client-side/hourly/2018/10/23/11/eventlogging-client-side.1006.6.855145.1676061642.1540292400000 after 90 days?

Change 475078 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::analytics::refinery::job::data_purge: add purge for EL

https://gerrit.wikimedia.org/r/475078

Did we update the docs with the new location for logs older than 90 days?

Added a line in https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging#Hadoop_Raw_Data, let me know if it is enough :)

@elukey: confirming that we have set up deletion for files like hdfs dfs -text /wmf/data/raw/eventlogging_client_side/eventlogging-client-side/hourly/2018/10/23/11/eventlogging-client-side.1006.6.855145.1676061642.1540292400000 after 90 days?

It was still in my todos but I filed https://gerrit.wikimedia.org/r/475078 just now so we will not forget :)

Change 475078 merged by Elukey:
[operations/puppet@production] profile::analytics::refinery::job::data_purge: add purge for EL

https://gerrit.wikimedia.org/r/475078
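
As an illustration of the 90-day rule discussed above, the hourly directory layout in the example path (`.../hourly/2018/10/23/11/...`) can be parsed to decide whether a file falls outside the retention window. This is not the purge job added in change 475078, only a hedged sketch; `partition_hour` and `is_purgeable` are hypothetical names, and the regular expression assumes the `hourly/YYYY/MM/DD/HH` layout seen in that one path.

```python
import re
from datetime import datetime, timedelta

# Matches the hourly/YYYY/MM/DD/HH component of a raw data path.
HOURLY_RE = re.compile(r"/hourly/(\d{4})/(\d{2})/(\d{2})/(\d{2})(?:/|$)")


def partition_hour(path):
    """Return the hour encoded in `path` as a datetime, or None if absent."""
    match = HOURLY_RE.search(path)
    if match is None:
        return None
    year, month, day, hour = (int(group) for group in match.groups())
    return datetime(year, month, day, hour)


def is_purgeable(path, now, retention_days=90):
    """True if the partition hour of `path` is older than `retention_days`."""
    hour = partition_hour(path)
    return hour is not None and now - hour > timedelta(days=retention_days)
```

With the file quoted in the comment above, `partition_hour` yields 2018-10-23 11:00, so under a 90-day retention it would become purgeable after 2019-01-21 11:00.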