
phabricator_task_dump.service Failed on phab1004
Closed, Resolved (Public)

Description

The weekly public task dump failed on Jan 22nd. This was the first run after the upgrade to Bullseye, and the failure is related to the removal of /usr/bin/python, which the dump script calls:

Jan 22 02:00:01 phab1004 systemd[1]: Starting phabricator public task dump...
Jan 22 02:00:01 phab1004 public_task_dump.py[3883637]: /usr/bin/env: ‘python’: No such file or directory
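
The log line above shows a shebang asking `env` for an interpreter named `python` that no longer exists on the host. A minimal sketch of a pre-upgrade check for this kind of breakage follows; the helper names `shebang_interpreter` and `interpreter_available` are hypothetical and not part of the tools repo, and the parsing assumes a plain `#!/usr/bin/env NAME` or `#!/path/to/interpreter` shebang without extra `env` flags.

```python
import shutil
from pathlib import Path
from typing import Optional


def shebang_interpreter(script_path: str) -> Optional[str]:
    """Return the interpreter a script's shebang asks for, or None if there is no shebang."""
    with open(script_path, "rb") as handle:
        first_line = handle.readline().decode("utf-8", errors="replace").strip()
    if not first_line.startswith("#!"):
        return None
    parts = first_line[2:].split()
    if not parts:
        return None
    # "#!/usr/bin/env python" -> the interpreter is the argument passed to env.
    if Path(parts[0]).name == "env" and len(parts) > 1:
        return parts[1]
    # "#!/usr/bin/python3" -> the interpreter is the path itself.
    return parts[0]


def interpreter_available(script_path: str) -> bool:
    """Return True if the shebang's interpreter resolves to an executable on this host."""
    interpreter = shebang_interpreter(script_path)
    if interpreter is None:
        return False
    # shutil.which searches PATH for bare names and checks absolute paths directly.
    return shutil.which(interpreter) is not None
```

Run against the dump script on a Bullseye host without a `python` binary, `interpreter_available` would be expected to return `False`, which matches the "No such file or directory" error in the journal.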

Event Timeline

LSobanski triaged this task as Medium priority. Jan 22 2024, 8:24 AM

Change 992189 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] [phabricator] Use python3 for task dump script

https://gerrit.wikimedia.org/r/992189

Change 992189 merged by Dzahn:

[operations/puppet@production] [phabricator] Use python3 for task dump script

https://gerrit.wikimedia.org/r/992189

While we keep working on getting this fixed, I think the best option is to remove the timer for the job entirely, to avoid spurious alerts.

Change 992416 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] [phabricator] Remove public task dump task timer

https://gerrit.wikimedia.org/r/992416

Change 992416 merged by EoghanGaffney:

[operations/puppet@production] [phabricator] Remove public task dump task timer

https://gerrit.wikimedia.org/r/992416

Change 993799 had a related patch set uploaded (by Brennen Bearnes; author: Brennen Bearnes):

[operations/puppet@production] phabricator: tools: install python3-pymsql for public_task_dump.py

https://gerrit.wikimedia.org/r/993799

Change 993799 merged by Dzahn:

[operations/puppet@production] phabricator: tools: install python3-pymsql for public_task_dump.py

https://gerrit.wikimedia.org/r/993799

Change 993801 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: fix typo in python3-pymysql package name

https://gerrit.wikimedia.org/r/993801

Change 993801 merged by Dzahn:

[operations/puppet@production] phabricator: fix typo in python3-pymysql package name

https://gerrit.wikimedia.org/r/993801

@brennen It doesn't look like puppet pulls from the tools repo on gitlab. Is the workflow that you manually pull and then deploy phabricator to get the contents onto the servers?

I resolved merge conflicts and merged this in gitlab.

Then I manually cloned it into /tmp and copied the public_task_dump script over on phab1004 so we wouldn't need another deployment first.

I tried running it, but:

ModuleNotFoundError: No module named 'bzlib'

Edit: not a surprise, since I copied only the file and not the libs. This should be fixed by a deployment.
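
A missing-module error like the one above can be caught before a script is run by asking the import system whether each dependency resolves. The sketch below is only an illustration: `missing_modules` is a hypothetical helper, and the module names passed to it are whatever top-level names the script imports.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found on this interpreter's path."""
    missing = []
    for name in names:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing
```

For example, `missing_modules(["pymysql"])` returning a non-empty list under `/usr/bin/python3` would suggest that the `python3-pymysql` package is not installed for that interpreter.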

Change 1003070 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: re-activate public dump job

https://gerrit.wikimedia.org/r/1003070

Mentioned in SAL (#wikimedia-operations) [2024-02-13T20:23:54Z] <mutante> phab1004 - running public_task_dump.py T355502

After T357464 I could run the dump script:

[phab1004:/srv/phab/tools] $ sudo /usr/bin/python3 /srv/phab/tools/public_task_dump.py
Dzahn changed the task status from Open to In Progress. Feb 13 2024, 8:46 PM
Dzahn moved this task from Consultation to Work in Progress on the collaboration-services board.

Change 1003070 merged by Dzahn:

[operations/puppet@production] phabricator: re-activate public dump job

https://gerrit.wikimedia.org/r/1003070

T355574#9540664

The dump script works again. It was successfully converted by Brennen.

After the latest phabricator deployment I started it manually and confirmed it wrote the dump file.

Then I reactivated the timer in puppet, and the relevant unit files were added back to phab1004.

00:30 < jinxer-wm> (SystemdUnitFailed) firing: phabricator_task_dump.service on phab1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
                   https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
00:33 < mutante> ^ this is just because the system didnt forget it failed 3 weeks ago
00:33 < mutante> the unit was just added back by me 
00:33 < mutante> I could have prevented it by doing a "systemctl reset-failed" before merging.
00:34 < mutante> I can tell because:       Active: failed (Result: exit-code) since Mon 2024-01-22 17:55:12 UTC; 3 weeks 1 days ago
00:35 < mutante> since I started the unit manually now it should be resolved
00:35 < jinxer-wm> (SystemdUnitFailed) resolved: phabricator_task_dump.service on phab1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
                   https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
00:35 < mutante> there we go
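
The stale alert in the log above came from systemd still remembering the old failure when the unit was re-added. A minimal sketch of clearing that state before re-enabling a unit follows, using `systemctl is-failed` and `systemctl reset-failed`. The function name `clear_stale_failure` and the injectable `run` parameter are assumptions made for illustration, and the reset generally needs root privileges.

```python
import subprocess
from typing import Callable, List

Runner = Callable[[List[str]], int]


def run_command(argv: List[str]) -> int:
    """Run a command and return its exit code without raising on failure."""
    return subprocess.run(argv, check=False).returncode


def clear_stale_failure(unit: str, run: Runner = run_command) -> bool:
    """Reset a unit's failed state if it is set; return True if a reset was issued."""
    # "systemctl is-failed" exits 0 when the unit is in the failed state.
    if run(["systemctl", "is-failed", "--quiet", unit]) != 0:
        return False
    run(["systemctl", "reset-failed", unit])
    return True
```

Calling `clear_stale_failure("phabricator_task_dump.service")` before merging the puppet change would correspond to the manual "systemctl reset-failed" mentioned in the log.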