Planet update service flapping/failing on planet1002
Closed, Resolved (Public)

Description

This alert seems to be firing a lot and then recovering:

          https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
00:27 -icinga-wm:#wikimedia-operations- PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service 
01:26 -icinga-wm:#wikimedia-operations- PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service
02:26 -icinga-wm:#wikimedia-operations- PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service
03:26 -icinga-wm:#wikimedia-operations- PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service
04:26 -icinga-wm:#wikimedia-operations- PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service
05:26 -icinga-wm:#wikimedia-operations- PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service
06:24 -icinga-wm:#wikimedia-operations- PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service
07:37 -icinga-wm:#wikimedia-operations- PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service
08:26 -icinga-wm:#wikimedia-operations- PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service
09:25 -icinga-wm:#wikimedia-operations- PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service
10:25 -icinga-wm:#wikimedia-operations- PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: planet-update-en.service

And indeed, the logs show:

Aug 30 10:21:32 planet1002 rawdog[4748]: An error occurred while reading state from /etc/rawdog/en/feeds/39a7970f.state.
Aug 30 10:21:32 planet1002 rawdog[4748]: This usually means the file is corrupt, and removing it will fix the problem.
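To see which state files rawdog is complaining about, the log lines above can be scanned for the "reading state from" message. This is a minimal sketch, assuming the message format stays exactly as quoted; the function name `corrupt_state_files` is hypothetical and not part of rawdog.

```python
import re

# Pattern taken from the rawdog log line quoted above (assumed stable).
STATE_ERROR = re.compile(r"An error occurred while reading state from (\S+)")


def corrupt_state_files(log_text):
    """Return the state file paths rawdog reported as unreadable, in first-seen order."""
    paths = []
    for line in log_text.splitlines():
        match = STATE_ERROR.search(line)
        if not match:
            continue
        # The log message ends with a full stop that is not part of the path.
        path = match.group(1).rstrip(".")
        if path not in paths:
            paths.append(path)
    return paths
```

Feeding it the journal output for the unit would list each distinct file once, which helps tell a single bad file apart from several.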

Event Timeline

@Dzahn do you perhaps know what to do, or who might know? Thank you!

Mentioned in SAL (#wikimedia-operations) [2021-09-01T12:41:50Z] <mutante> planet1002 - rm /etc/rawdog/en/feeds/39a7970f.state (corrupt) T289984

Mentioned in SAL (#wikimedia-operations) [2021-09-01T13:05:25Z] <mutante> planet1002 - temp removing feed from ad.huikeshoven - seems to cause corrupt state file (T289984)

FYI the alert is flapping again.

Change 720018 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] planet: remove ad.huikeshoven feed

https://gerrit.wikimedia.org/r/720018

Change 720018 merged by Dzahn:

[operations/puppet@production] planet: remove ad.huikeshoven feed

https://gerrit.wikimedia.org/r/720018

I removed the suspected feed, but the issue persisted.

Then I deleted ALL the existing state files for the "en" feed collection and ran updates again multiple times.

The issue was gone.
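The cleanup described above can be sketched as follows. This is a hedged illustration, not the commands that were actually run: it assumes the per-feed state files end in `.state` and sit directly in one directory (such as the `/etc/rawdog/en/feeds` path seen in the log), and the helper name `remove_state_files` is hypothetical. It defaults to a dry run, so nothing is deleted unless asked.

```python
from pathlib import Path


def remove_state_files(feeds_dir, dry_run=True):
    """Delete every *.state file directly inside feeds_dir.

    Returns the list of matching paths. With dry_run=True (the default)
    the files are only listed, not removed.
    """
    targets = sorted(Path(feeds_dir).glob("*.state"))
    if not dry_run:
        for path in targets:
            path.unlink()
    return targets
```

Calling `remove_state_files("/etc/rawdog/en/feeds")` first shows what would go; passing `dry_run=False` performs the deletion. rawdog should then rebuild the state on the next update run, as it did here.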

Finally, I reverted the removal of the feed above, and the issue is still gone. So apparently that feed was unrelated, and some other state file had become corrupted.

Running the update now finishes fine, the status of the systemd service is OK, and there are no more "--state=failed" units.
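The final check for failed units can be scripted along these lines. This is a sketch under assumptions: it shells out to `systemctl list-units --state=failed --no-legend --plain` and treats the first column of each output line as the unit name; the function names are hypothetical.

```python
import subprocess


def parse_failed_units(output):
    """Extract unit names (first column) from plain, legend-free systemctl output."""
    units = []
    for line in output.splitlines():
        fields = line.split()
        if fields:
            units.append(fields[0])
    return units


def list_failed_units():
    """Ask systemd for failed units; an empty list means nothing is failed."""
    result = subprocess.run(
        ["systemctl", "list-units", "--state=failed", "--no-legend", "--plain"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_failed_units(result.stdout)
```

On a healthy host `list_failed_units()` should return an empty list; while the alert was firing it would have been expected to include `planet-update-en.service`.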