
eventlogging_db_sanitization script failed
Closed, Resolved | Public | 3 Estimated Story Points

Description

We got an alert at around 11:00 UTC about the eventlogging_db_sanitization script failing on both db1107 and db1108:

Oct 16 11:00:00 db1107 systemd[1]: Started Apply Analytics data retetion policies to the Eventlogging database.
Oct 16 11:00:00 db1107 eventlogging_cleaner[27817]: INFO: line 139: Executing command SELECT      table_name,      SUM(IF(column_name = 'timestamp', 1, 0)) AS has_timestamp_field,      SUM(IF(column_name LIKE 'event_%', 1, 0)) AS event_field_count FROM information_schema.
Oct 16 11:00:00 db1107 eventlogging_cleaner[27817]: ERROR: line 645: Some table prefixes in the whitelist do not match any table name retrieved from the database. Please review the following entries of the whitelist: ['ResourceTiming']
Oct 16 11:00:00 db1107 systemd[1]: eventlogging_db_sanitization.service: Main process exited, code=exited, status=1/FAILURE
Oct 16 11:00:00 db1107 systemd[1]: eventlogging_db_sanitization.service: Unit entered failed state.
Oct 16 11:00:00 db1107 systemd[1]: eventlogging_db_sanitization.service: Failed with result 'exit-code'.
root@db1107:/var/log/eventlogging_cleaner# systemctl list-units --state=failed
  UNIT                                 LOAD   ACTIVE SUB    DESCRIPTION
● eventlogging_db_sanitization.service loaded failed failed Apply Analytics data retetion policies to the Eventlogging database

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

1 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

Event Timeline

I've ack'ed the alerts for 3 hours on db1107 and db1108.

Marostegui renamed this task from "eventloggiong_db_sanitization script failed" to "eventlogging_db_sanitization script failed". Oct 16 2018, 12:22 PM
Marostegui updated the task description.

Thanks!

This should be a protection mechanism that caused a false positive in this case. https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/466607/ adds a new schema to the whitelist, but probably no event for that schema has been collected by Eventlogging yet (hence no table has been created in the DB).
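
For illustration, the kind of check that produced the error above could look roughly like this. It is a minimal sketch, not the actual code of eventlogging_cleaner.py: the helper name `unmatched_prefixes` and the sample table names are hypothetical, and it assumes a whitelist entry matches a table when the table name starts with that entry.

```python
def unmatched_prefixes(whitelist_prefixes, table_names):
    """Return the whitelist prefixes that match no table name.

    Hypothetical helper: assumes matching is a plain "starts with" test.
    """
    return [
        prefix
        for prefix in whitelist_prefixes
        if not any(name.startswith(prefix) for name in table_names)
    ]


# Hypothetical table names; ResourceTiming has no MySQL table yet.
tables = ["Edit_1", "NavigationTiming_2"]
whitelist = ["Edit", "NavigationTiming", "ResourceTiming"]

missing = unmatched_prefixes(whitelist, tables)
if missing:
    # A strict script would exit non-zero here, which systemd reports
    # as status=1/FAILURE.
    print("Whitelist entries without a matching table: %s" % missing)
```

With a check like this, any schema that is whitelisted before its first table exists makes the whole run fail, even though nothing is wrong with the data.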

@Gilles hi! Do you know when ResourceTiming will start registering events in Eventlogging?

0: jdbc:hive2://an-coord1001.eqiad.wmnet:1000> SELECT COUNT(*) FROM event.resourcetiming WHERE year = 2018;

[...]

6536590
1 row selected (48.308 seconds)

Thanks! So this might be a case of a schema present only in Hadoop and not in MySQL? If so, the logic that triggered the above check needs to be removed :)

@Gilles @elukey
Since we changed the EL blacklist, which prevented schemas from being loaded into MySQL, to a whitelist, new schemas are loaded only into Hive by default.
So this scenario will happen frequently from now on.

Yes, @elukey, we need to remove that check from the MySQL purging script.
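
As a rough idea of what dropping the check means in practice, a purging script could simply skip whitelist entries that have no MySQL table instead of aborting. The sketch below is an assumption for illustration only, not the merged patch: `match_whitelist_to_tables`, the logger name, and the "starts with" matching rule are all hypothetical.

```python
import logging

# Hypothetical logger name, used only in this sketch.
logger = logging.getLogger("eventlogging_cleaner_sketch")


def match_whitelist_to_tables(whitelist_prefixes, table_names):
    """Map each whitelist prefix to its tables, skipping prefixes with none.

    Hypothetical helper: assumes a prefix matches a table when the table
    name starts with it.
    """
    matched = {}
    for prefix in whitelist_prefixes:
        tables = [name for name in table_names if name.startswith(prefix)]
        if tables:
            matched[prefix] = tables
        else:
            # Schema with no MySQL table (for example Hive-only):
            # nothing to purge here, so log it and carry on.
            logger.info("No MySQL table for whitelist entry %r, skipping", prefix)
    return matched
```

The trade-off is that a genuinely misspelled whitelist entry would no longer stop the run, so it would only show up in the logs.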

Change 467679 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] eventlogging_cleaner.py: remove obsolete check

https://gerrit.wikimedia.org/r/467679

Change 467679 merged by Elukey:
[operations/puppet@production] eventlogging_cleaner.py: remove obsolete check

https://gerrit.wikimedia.org/r/467679

15:24 <icinga-wm> RECOVERY - Check systemd state on db1108 is OK: OK - running: The system is fully operational
15:29 <icinga-wm> RECOVERY - Check systemd state on db1107 is OK: OK - running: The system is fully operational

elukey triaged this task as High priority.
elukey set the point value for this task to 3.
elukey moved this task from Next Up to Done on the Analytics-Kanban board.