
WikiWho is down
Closed, Resolved · Public · BUG REPORT

Description

Trying to gather authorship data through xTools. For example, this page: https://xtools.wmcloud.org/articleinfo/en.wikipedia.org/Boris_Berezovsky_(businessman)

Authorship tab shows the following errors.
Error querying Authorship API: timeout
Error querying Wikiwho API: Unknown

Event Timeline

Restricted Application added a subscriber: Aklapper.
TheresNoTime triaged this task as High priority. (Edited) · Nov 14 2023, 10:23 AM
TheresNoTime subscribed.

Have tried restarting ww_celery/ww_flower

samtar@wikiwho01:~$ sudo tail -F /var/log/celery/flower.log /var/log/celery/worker_default.log /var/log/celery/worker_long.log /var/log/celery/worker_user.log

==> /var/log/celery/flower.log <==
[W 231114 10:20:06 control:43] 'scheduled' inspect method failed
[W 231114 10:20:06 control:43] 'scheduled' inspect method failed
[W 231114 10:20:06 control:43] 'active' inspect method failed
[W 231114 10:20:06 control:43] 'active' inspect method failed
[W 231114 10:20:06 control:43] 'reserved' inspect method failed
[W 231114 10:20:06 control:43] 'reserved' inspect method failed
[W 231114 10:20:06 control:43] 'revoked' inspect method failed
[W 231114 10:20:06 control:43] 'revoked' inspect method failed
[W 231114 10:20:06 control:43] 'conf' inspect method failed
[W 231114 10:20:06 control:43] 'conf' inspect method failed

==> /var/log/celery/worker_default.log <==
[2023-11-13 15:24:28,260: WARNING/MainProcess] Soft time limit (120s) exceeded for api.tasks.process_article[970c82c3-b624-473c-a44f-7d59e72749f3]
[2023-11-13 15:27:30,653: WARNING/MainProcess] Soft time limit (120s) exceeded for api.tasks.process_article[03b2c016-a464-4fa8-ac89-0cdd68810593]
[2023-11-13 15:31:33,570: WARNING/MainProcess] Soft time limit (120s) exceeded for api.tasks.process_article[32fc7b20-11a4-4c10-82aa-d507809fae47]
[2023-11-13 16:52:49,209: WARNING/MainProcess] Soft time limit (120s) exceeded for api.tasks.process_article[fa212cfc-4a44-4a71-bdd9-583be5fbf479]
[2023-11-13 16:58:59,678: WARNING/MainProcess] Soft time limit (120s) exceeded for api.tasks.process_article[0c12fc9f-5b19-48bf-b6d8-7ea16d551294]
[2023-11-13 17:48:03,891: WARNING/MainProcess] Soft time limit (120s) exceeded for api.tasks.process_article[ef68ea2d-e857-46dc-8783-3b98ab547853]
[2023-11-13 18:12:56,834: WARNING/MainProcess] Soft time limit (120s) exceeded for api.tasks.process_article[d0a0f7f2-4425-4c35-8ba2-f87a18d44bb5]
[2023-11-13 18:15:01,699: WARNING/MainProcess] Soft time limit (120s) exceeded for api.tasks.process_article[2904cf61-6d31-4b45-a51a-793a715a08d9]
[2023-11-13 18:20:54,749: WARNING/MainProcess] Soft time limit (120s) exceeded for api.tasks.process_article[318350c4-2028-41dc-b995-c4af542d297f]
[2023-11-13 18:35:25,062: WARNING/MainProcess] Soft time limit (120s) exceeded for api.tasks.process_article[63d83fec-bca1-44de-aab4-7d7ac09816d7]

==> /var/log/celery/worker_long.log <==
    return self.on_inbound_frame(frame)
  File "/home/wikiwho/wikiwho_api/env/lib/python3.9/site-packages/amqp/method_framing.py", line 55, in on_frame
    callback(channel, method_sig, buf, None)
  File "/home/wikiwho/wikiwho_api/env/lib/python3.9/site-packages/amqp/connection.py", line 520, in on_inbound_method
    return self.channels[channel_id].dispatch_method(
  File "/home/wikiwho/wikiwho_api/env/lib/python3.9/site-packages/amqp/abstract_channel.py", line 145, in dispatch_method
    listener(*args)
  File "/home/wikiwho/wikiwho_api/env/lib/python3.9/site-packages/amqp/channel.py", line 279, in _on_close
    raise error_for_code(
amqp.exceptions.PreconditionFailed: (0, 0): (406) PRECONDITION_FAILED - delivery acknowledgement on channel 1 timed out. Timeout value used: 1800000 ms. This timeout value can be configured, see consumers doc guide to learn more

==> /var/log/celery/worker_user.log <==
  File "/home/wikiwho/wikiwho_api/env/lib/python3.9/site-packages/celery/apps/worker.py", line 284, in _handle_request
    raise exc(exitcode)
celery.exceptions.WorkerShutdown: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/wikiwho/wikiwho_api/env/lib/python3.9/site-packages/billiard/pool.py", line 1265, in mark_as_worker_lost
    raise WorkerLostError(
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM) Job: 1.
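The worker_default log shows process_article tasks running well past the 120 s Celery soft time limit, and the worker_long traceback is the broker (RabbitMQ) closing the channel because a delivery was not acknowledged within the default 30-minute consumer timeout (1800000 ms). If tasks on the long queue are genuinely expected to run that long, one option is raising that broker-side timeout. A minimal sketch, assuming the broker is a local RabbitMQ recent enough to expose the setting in rabbitmq.conf and that the default config path is in use (both assumptions):

  # /etc/rabbitmq/rabbitmq.conf (assumed path)
  # Raise the delivery acknowledgement timeout from the 30 min default to 2 h,
  # so long-running process_article deliveries can still be acked in time.
  consumer_timeout = 7200000

The broker would need a restart (e.g. sudo systemctl restart rabbitmq-server) for the change to take effect.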

samtar@wikiwho01:~$ sudo systemctl status ww_celery.service
● ww_celery.service - Celery Service
     Loaded: loaded (/etc/systemd/system/ww_celery.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Tue 2023-11-14 10:26:22 UTC; 14s ago
    Process: 146234 ExecStart=/bin/bash -c ${CELERY_BIN} -A ${CELERY_APP} multi start ${CELERYD_NODES}          --pidfile=${CELERY>
   Main PID: 146245 (code=exited, status=1/FAILURE)
        CPU: 1.394s

Nov 14 10:26:22 wikiwho01 systemd[1]: ww_celery.service: Scheduled restart job, restart counter is at 5.
Nov 14 10:26:22 wikiwho01 systemd[1]: Stopped Celery Service.
Nov 14 10:26:22 wikiwho01 systemd[1]: ww_celery.service: Consumed 1.394s CPU time.
Nov 14 10:26:22 wikiwho01 systemd[1]: ww_celery.service: Start request repeated too quickly.
Nov 14 10:26:22 wikiwho01 systemd[1]: ww_celery.service: Failed with result 'exit-code'.
Nov 14 10:26:22 wikiwho01 systemd[1]: Failed to start Celery Service.
Nov 14 10:26:23 wikiwho01 systemd[1]: ww_celery.service: Start request repeated too quickly.
Nov 14 10:26:23 wikiwho01 systemd[1]: ww_celery.service: Failed with result 'exit-code'.
Nov 14 10:26:23 wikiwho01 systemd[1]: Failed to start Celery Service.
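At this point systemd has tripped its restart rate limit ("Start request repeated too quickly"), so the unit will not come back on its own even after the underlying problem is fixed. A sketch of clearing that state and retrying, using standard systemctl/journalctl calls (nothing WikiWho-specific assumed beyond the unit name):

  sudo systemctl reset-failed ww_celery.service            # clear the failed/rate-limited state
  sudo systemctl start ww_celery.service
  sudo journalctl -u ww_celery.service -n 50 --no-pager    # see why ExecStart keeps exiting with status 1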
TheresNoTime raised the priority of this task from High to Needs Triage.

Steps to resolve this incident:

  • Created /var/run/celery, did chown wikiwho /var/run/celery (PID file was not being created)
  • Mounted /dev/sdb to /pickles
  • Mounted /dev/sdc to /pickles-02

Leaving open for investigation into why /var/run/celery was removed, and why these mounts are not in /etc/fstab.
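One possible explanation, offered as an assumption rather than a finding: if /var/run (i.e. /run) is a tmpfs on this instance, anything created there is lost at every boot, so a reboot alone would remove the directory without the instance having been rebuilt. A minimal sketch of making the PID directory reappear at boot via systemd-tmpfiles, assuming the workers run as the wikiwho user and group (the chown above only names the user), with a hypothetical file name:

  # /etc/tmpfiles.d/ww_celery.conf: recreate the runtime dir on every boot
  # format: d <path> <mode> <user> <group> <age>
  echo 'd /run/celery 0755 wikiwho wikiwho -' | sudo tee /etc/tmpfiles.d/ww_celery.conf
  sudo systemd-tmpfiles --create /etc/tmpfiles.d/ww_celery.conf   # apply it now without rebooting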

Added

UUID=dc441956-0a90-434a-803e-ef3ba7bcb4fd /pickles ext4 nofail,x-systemd.device-timeout=2s 0 2
UUID=7fa2581f-39cc-4eb6-94d0-952677c0f1b9 /pickles-02 ext4 nofail,x-systemd.device-timeout=2s 0 2

to /etc/fstab
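To sanity-check the new entries without waiting for a reboot, something along these lines (standard util-linux commands; the mount points are the two added above):

  sudo findmnt --verify                      # parse-check /etc/fstab
  sudo mount -a                              # mount anything in fstab that is not already mounted
  findmnt /pickles && findmnt /pickles-02    # confirm both volumes are mounted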

Mentioned in SAL (#wikimedia-cloud) [2023-11-14T12:04:54Z] <TheresNoTime> Restart ww_flower.service, T351172

I'm a little confused what went wrong. What you did for /var/run/celery sounds like it was necessary and probably explains why things go down occasionally? Or do you believe the PID file was getting created before, but the permissions changed or something, which you've now fixed?

As for fstab, I must admit I'm not familiar with it. When we mounted the two volumes, we went by the instructions at https://wikitech.wikimedia.org/wiki/Help:Adding_disk_space_to_Cloud_VPS_instances and used wmcs-prepare-cinder-volume, which evidently is supposed to do the fstab stuff for you. Perhaps you found a bug? Were the volumes completely unmounted when WikiWho was down?

> I'm a little confused what went wrong. What you did for /var/run/celery sounds like it was necessary and probably explains why things go down occasionally? Or do you believe the PID file was getting created before, but the permissions changed or something, which you've now fixed?

It's all pretty confusing; the fact that the directory didn't exist almost makes it seem like the instance was rebuilt?

> As for fstab, I must admit I'm not familiar with it. When we mounted the two volumes, we went by the instructions at https://wikitech.wikimedia.org/wiki/Help:Adding_disk_space_to_Cloud_VPS_instances and used wmcs-prepare-cinder-volume, which evidently is supposed to do the fstab stuff for you. Perhaps you found a bug? Were the volumes completely unmounted when WikiWho was down?

The volumes were unmounted when WikiWho was down, which suggests a reboot/failure with the volume storage?
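If it helps pin that down, the boot history can be read straight off the instance; a quick check along these lines (standard tools, nothing WikiWho-specific):

  uptime -s                          # when the instance last booted
  last -x reboot shutdown | head     # recent reboot/shutdown records from wtmp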

Noting that the worker_long@ww_host worker is dying after an inconsistent amount of time; there were ~140 items in its queue to process, and I have been manually restarting the ww_celery service when I notice it has died. It now has ~40 items in its queue. I'll keep doing this proactively until its queue is empty.
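For reference, the per-queue backlog can also be read straight from the broker, assuming the broker is a local RabbitMQ and the long worker consumes a queue literally named "long" (both assumptions, adjust to the real queue name); a sketch:

  sudo rabbitmqctl list_queues name messages consumers   # backlog and consumer count per queue
  sudo systemctl restart ww_celery.service               # the manual restart mentioned above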

Mentioned in SAL (#wikimedia-cloud) [2023-12-04T10:48:14Z] <TheresNoTime> samtar@wikiwho01:~$ sudo systemctl restart ww_celery.service T351172

Re. having to restart ^: this was due to worker_long failing and not restarting. State after restarting:

Screenshot 2023-12-04 at 10.52.30.png (attached screenshot of the worker/queue state after the restart)

Note the ~400 queued items for the long queue.

Service appears stable again :-)