Maniphest T207165

eventlogging_db_sanitization script failed
Closed, Resolved · Public · 3 Estimated Story Points

Authored By

	• Marostegui
	Oct 16 2018, 12:20 PM

Description

We got an alert at around 11:00UTC about the eventlogging_db_sanitization script not running on both db1107 and db1108:

Oct 16 11:00:00 db1107 systemd[1]: Started Apply Analytics data retetion policies to the Eventlogging database.
Oct 16 11:00:00 db1107 eventlogging_cleaner[27817]: INFO: line 139: Executing command SELECT      table_name,      SUM(IF(column_name = 'timestamp', 1, 0)) AS has_timestamp_field,      SUM(IF(column_name LIKE 'event_%', 1, 0)) AS event_field_count FROM information_schema.
Oct 16 11:00:00 db1107 eventlogging_cleaner[27817]: ERROR: line 645: Some table prefixes in the whitelist do not match any table name retrieved from the database. Please review the following entries of the whitelist: ['ResourceTiming']
Oct 16 11:00:00 db1107 systemd[1]: eventlogging_db_sanitization.service: Main process exited, code=exited, status=1/FAILURE
Oct 16 11:00:00 db1107 systemd[1]: eventlogging_db_sanitization.service: Unit entered failed state.
Oct 16 11:00:00 db1107 systemd[1]: eventlogging_db_sanitization.service: Failed with result 'exit-code'.

root@db1107:/var/log/eventlogging_cleaner# systemctl list-units --state=failed
  UNIT                                 LOAD   ACTIVE SUB    DESCRIPTION
● eventlogging_db_sanitization.service loaded failed failed Apply Analytics data retetion policies to the Eventlogging database

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

1 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.

Details

	Subject	Repo	Branch	Lines +/-
	eventlogging_cleaner.py: remove obsolete check	operations/puppet	production	+0 -44

Event Timeline

• Marostegui created this task. Oct 16 2018, 12:20 PM

Restricted Application added a subscriber: Aklapper. Oct 16 2018, 12:20 PM

I've ack'ed the alerts for 3 hours on db1107 and db1108.

• Marostegui renamed this task from eventloggiong_db_sanitization script failed to eventlogging_db_sanitization script failed. Oct 16 2018, 12:22 PM

• Marostegui updated the task description.

Thanks!

This check is meant as a protection mechanism, but in this case it caused a false positive. https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/466607/ adds a new schema to the whitelist, but probably no event for that schema has been collected from EventLogging yet (hence no table has been created in the DB).
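
For illustration, here is a minimal sketch of the kind of check that fired, reconstructed only from the error message in the task description. It is not the actual code in eventlogging_cleaner.py. The helper name, the table names, and the use of a plain `startswith` prefix match are assumptions made for this example.

```python
def unmatched_whitelist_prefixes(whitelist_prefixes, table_names):
    """Return the whitelist prefixes that match no table name.

    Hypothetical helper: assumes a whitelist entry "matches" a table
    when the table name starts with that entry.
    """
    return [
        prefix
        for prefix in whitelist_prefixes
        if not any(name.startswith(prefix) for name in table_names)
    ]


# Made-up table names for the sketch; ResourceTiming has no table yet.
tables = ["Edit_123", "NavigationTiming_456"]
whitelist = ["Edit", "NavigationTiming", "ResourceTiming"]

missing = unmatched_whitelist_prefixes(whitelist, tables)
if missing:
    # A strict check of this kind aborts the whole run at this point.
    print("Whitelist prefixes with no matching table: %s" % missing)
```

With a strict check like this, a single whitelist entry without a table is enough to fail the whole sanitization run.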

@Gilles hi! Do you know when ResourceTiming will start registering events in Eventlogging?

It already is:

0: jdbc:hive2://an-coord1001.eqiad.wmnet:1000> SELECT COUNT(*) FROM event.resourcetiming WHERE year = 2018;

[...]

6536590
1 row selected (48.308 seconds)

Thanks! So this might be a case of a schema that is present only on Hadoop and not on MySQL? If so, the check that failed above needs to be removed :)

@Gilles @elukey
Since we replaced the EL blacklist, which prevented the listed schemas from being loaded into MySQL, with a whitelist, new schemas are loaded only into Hive by default.
So this scenario will happen frequently from now on.

Yes, @elukey, we need to remove that check from the MySQL purging script.
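
As a rough sketch of the intended behaviour once the check is gone, a whitelist entry with no MySQL table can simply be skipped instead of aborting the run. This is an illustration under assumptions, not the merged patch (which only deletes the check). The function name, logger name, and table names are hypothetical.

```python
import logging

logger = logging.getLogger("cleaner_sketch")  # hypothetical logger name


def tables_per_whitelist_entry(whitelist_prefixes, table_names):
    """Map each whitelist prefix to the tables it matches.

    Prefixes with no table (for example schemas loaded only into Hive)
    are logged and skipped rather than treated as an error.
    """
    matched = {}
    for prefix in whitelist_prefixes:
        hits = [name for name in table_names if name.startswith(prefix)]
        if hits:
            matched[prefix] = hits
        else:
            logger.info("No MySQL table for whitelist entry %r, skipping", prefix)
    return matched


# Made-up data: ResourceTiming exists only in Hive, so it has no table here.
result = tables_per_whitelist_entry(
    ["Edit", "ResourceTiming"],
    ["Edit_123", "NavigationTiming_456"],
)
```

Here `result` contains only the `Edit` entry, and the run continues for the tables that do exist.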

Change 467679 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] eventlogging_cleaner.py: remove obsolete check

https://gerrit.wikimedia.org/r/467679

gerritbot added a project: Patch-For-Review. Oct 16 2018, 1:07 PM

Change 467679 merged by Elukey:
[operations/puppet@production] eventlogging_cleaner.py: remove obsolete check

https://gerrit.wikimedia.org/r/467679

15:24 <icinga-wm> RECOVERY - Check systemd state on db1108 is OK: OK - running: The system is fully operational
15:29 <icinga-wm> RECOVERY - Check systemd state on db1107 is OK: OK - running: The system is fully operational

elukey claimed this task. Oct 16 2018, 1:30 PM

elukey triaged this task as High priority.

elukey edited projects, added Analytics-Kanban; removed Patch-For-Review, MediaWiki-extensions-EventLogging.

elukey set the point value for this task to 3.

elukey moved this task from Next Up to Done on the Analytics-Kanban board.

• fdans moved this task from Incoming to Operational Excellence on the Analytics board. Oct 18 2018, 5:00 PM

• Nuria closed this task as Resolved. Oct 19 2018, 2:23 AM
