
WikiWho is down
Closed, Resolved · Public · BUG REPORT

Description

Trying to gather authorship data through xTools. For example, this page: https://xtools.wmcloud.org/articleinfo/en.wikipedia.org/Boris_Berezovsky_(businessman)

Authorship tab shows the following errors.
Error querying Authorship API: timeout
Error querying Wikiwho API: Unknown

Event Timeline

Restricted Application added a subscriber: Aklapper.
TheresNoTime triaged this task as High priority. (Edited) · Nov 14 2023, 10:23 AM
TheresNoTime subscribed.

Have tried restarting ww_celery/ww_flower

samtar@wikiwho01:~$ sudo tail -F /var/log/celery/flower.log /var/log/celery/worker_default.log /var/log/celery/worker_long.log /var/log/celery/worker_user.log

==> /var/log/celery/flower.log <==
[W 231114 10:20:06 control:43] 'scheduled' inspect method failed
[W 231114 10:20:06 control:43] 'scheduled' inspect method failed
[W 231114 10:20:06 control:43] 'active' inspect method failed
[W 231114 10:20:06 control:43] 'active' inspect method failed
[W 231114 10:20:06 control:43] 'reserved' inspect method failed
[W 231114 10:20:06 control:43] 'reserved' inspect method failed
[W 231114 10:20:06 control:43] 'revoked' inspect method failed
[W 231114 10:20:06 control:43] 'revoked' inspect method failed
[W 231114 10:20:06 control:43] 'conf' inspect method failed
[W 231114 10:20:06 control:43] 'conf' inspect method failed

==> /var/log/celery/worker_default.log <==
[2023-11-13 15:24:28,260: WARNING/MainProcess] Soft time limit (120s) exceeded for api.tasks.process_article[970c82c3-b624-473c-a44f-7d59e72749f3]
[2023-11-13 15:27:30,653: WARNING/MainProcess] Soft time limit (120s) exceeded for api.tasks.process_article[03b2c016-a464-4fa8-ac89-0cdd68810593]
[2023-11-13 15:31:33,570: WARNING/MainProcess] Soft time limit (120s) exceeded for api.tasks.process_article[32fc7b20-11a4-4c10-82aa-d507809fae47]
[2023-11-13 16:52:49,209: WARNING/MainProcess] Soft time limit (120s) exceeded for api.tasks.process_article[fa212cfc-4a44-4a71-bdd9-583be5fbf479]
[2023-11-13 16:58:59,678: WARNING/MainProcess] Soft time limit (120s) exceeded for api.tasks.process_article[0c12fc9f-5b19-48bf-b6d8-7ea16d551294]
[2023-11-13 17:48:03,891: WARNING/MainProcess] Soft time limit (120s) exceeded for api.tasks.process_article[ef68ea2d-e857-46dc-8783-3b98ab547853]
[2023-11-13 18:12:56,834: WARNING/MainProcess] Soft time limit (120s) exceeded for api.tasks.process_article[d0a0f7f2-4425-4c35-8ba2-f87a18d44bb5]
[2023-11-13 18:15:01,699: WARNING/MainProcess] Soft time limit (120s) exceeded for api.tasks.process_article[2904cf61-6d31-4b45-a51a-793a715a08d9]
[2023-11-13 18:20:54,749: WARNING/MainProcess] Soft time limit (120s) exceeded for api.tasks.process_article[318350c4-2028-41dc-b995-c4af542d297f]
[2023-11-13 18:35:25,062: WARNING/MainProcess] Soft time limit (120s) exceeded for api.tasks.process_article[63d83fec-bca1-44de-aab4-7d7ac09816d7]

==> /var/log/celery/worker_long.log <==
    return self.on_inbound_frame(frame)
  File "/home/wikiwho/wikiwho_api/env/lib/python3.9/site-packages/amqp/method_framing.py", line 55, in on_frame
    callback(channel, method_sig, buf, None)
  File "/home/wikiwho/wikiwho_api/env/lib/python3.9/site-packages/amqp/connection.py", line 520, in on_inbound_method
    return self.channels[channel_id].dispatch_method(
  File "/home/wikiwho/wikiwho_api/env/lib/python3.9/site-packages/amqp/abstract_channel.py", line 145, in dispatch_method
    listener(*args)
  File "/home/wikiwho/wikiwho_api/env/lib/python3.9/site-packages/amqp/channel.py", line 279, in _on_close
    raise error_for_code(
amqp.exceptions.PreconditionFailed: (0, 0): (406) PRECONDITION_FAILED - delivery acknowledgement on channel 1 timed out. Timeout value used: 1800000 ms. This timeout value can be configured, see consumers doc guide to learn more

==> /var/log/celery/worker_user.log <==
  File "/home/wikiwho/wikiwho_api/env/lib/python3.9/site-packages/celery/apps/worker.py", line 284, in _handle_request
    raise exc(exitcode)
celery.exceptions.WorkerShutdown: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/wikiwho/wikiwho_api/env/lib/python3.9/site-packages/billiard/pool.py", line 1265, in mark_as_worker_lost
    raise WorkerLostError(
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM) Job: 1.
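The worker_default log shows process_article tasks running well past the 120 s Celery soft time limit, and the worker_long traceback is the broker (RabbitMQ) closing the channel because a delivery was not acknowledged within the default 30-minute consumer timeout (1800000 ms). If tasks on the long queue are genuinely expected to run that long, one option is raising that broker-side timeout. A minimal sketch, assuming the broker is a local RabbitMQ recent enough to expose the setting in rabbitmq.conf and that the default config path is in use (both assumptions):

  # /etc/rabbitmq/rabbitmq.conf (assumed path)
  # Raise the delivery acknowledgement timeout from the 30 min default to 2 h,
  # so long-running process_article deliveries can still be acked in time.
  consumer_timeout = 7200000

The broker would need a restart (e.g. sudo systemctl restart rabbitmq-server) for the change to take effect.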

samtar@wikiwho01:~$ sudo systemctl status ww_celery.service
● ww_celery.service - Celery Service
     Loaded: loaded (/etc/systemd/system/ww_celery.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Tue 2023-11-14 10:26:22 UTC; 14s ago
    Process: 146234 ExecStart=/bin/bash -c ${CELERY_BIN} -A ${CELERY_APP} multi start ${CELERYD_NODES}          --pidfile=${CELERY>
   Main PID: 146245 (code=exited, status=1/FAILURE)
        CPU: 1.394s

Nov 14 10:26:22 wikiwho01 systemd[1]: ww_celery.service: Scheduled restart job, restart counter is at 5.
Nov 14 10:26:22 wikiwho01 systemd[1]: Stopped Celery Service.
Nov 14 10:26:22 wikiwho01 systemd[1]: ww_celery.service: Consumed 1.394s CPU time.
Nov 14 10:26:22 wikiwho01 systemd[1]: ww_celery.service: Start request repeated too quickly.
Nov 14 10:26:22 wikiwho01 systemd[1]: ww_celery.service: Failed with result 'exit-code'.
Nov 14 10:26:22 wikiwho01 systemd[1]: Failed to start Celery Service.
Nov 14 10:26:23 wikiwho01 systemd[1]: ww_celery.service: Start request repeated too quickly.
Nov 14 10:26:23 wikiwho01 systemd[1]: ww_celery.service: Failed with result 'exit-code'.
Nov 14 10:26:23 wikiwho01 systemd[1]: Failed to start Celery Service.
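At this point systemd has tripped its restart rate limit ("Start request repeated too quickly"), so the unit will not come back on its own even after the underlying problem is fixed. A sketch of clearing that state and retrying, using standard systemctl/journalctl calls (nothing WikiWho-specific assumed beyond the unit name):

  sudo systemctl reset-failed ww_celery.service            # clear the failed/rate-limited state
  sudo systemctl start ww_celery.service
  sudo journalctl -u ww_celery.service -n 50 --no-pager    # see why ExecStart keeps exiting with status 1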
TheresNoTime raised the priority of this task from High to Needs Triage.

Steps to resolve this incident:

  • Created /var/run/celery, did chown wikiwho /var/run/celery (PID file was not being created)
  • Mounted /dev/sdb to /pickles
  • Mounted /dev/sdc to /pickles-02

Leaving open for investigation into why /var/run/celery was removed, and why these mounts are not in /etc/fstab.
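One possible explanation, offered as an assumption rather than a finding: if /var/run (i.e. /run) is a tmpfs on this instance, anything created there is lost at every boot, so a reboot alone would remove the directory without the instance having been rebuilt. A minimal sketch of making the PID directory reappear at boot via systemd-tmpfiles, assuming the workers run as the wikiwho user and group (the chown above only names the user), with a hypothetical file name:

  # /etc/tmpfiles.d/ww_celery.conf: recreate the runtime dir on every boot
  # format: d <path> <mode> <user> <group> <age>
  echo 'd /run/celery 0755 wikiwho wikiwho -' | sudo tee /etc/tmpfiles.d/ww_celery.conf
  sudo systemd-tmpfiles --create /etc/tmpfiles.d/ww_celery.conf   # apply it now without rebooting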

Added

UUID=dc441956-0a90-434a-803e-ef3ba7bcb4fd /pickles ext4 nofail,x-systemd.device-timeout=2s 0 2
UUID=7fa2581f-39cc-4eb6-94d0-952677c0f1b9 /pickles-02 ext4 nofail,x-systemd.device-timeout=2s 0 2

to /etc/fstab
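To sanity-check the new entries without waiting for a reboot, something along these lines (standard util-linux commands; the mount points are the two added above):

  sudo findmnt --verify                      # parse-check /etc/fstab
  sudo mount -a                              # mount anything in fstab that is not already mounted
  findmnt /pickles && findmnt /pickles-02    # confirm both volumes are mounted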

Mentioned in SAL (#wikimedia-cloud) [2023-11-14T12:04:54Z] <TheresNoTime> Restart ww_flower.service, T351172

I'm a little confused what went wrong. What you did for /var/run/celery sounds like it was necessary and probably explains why things go down occasionally? Or do you believe the PID file was getting created before, but the permissions changed or something, which you've now fixed?

As for fstab, I must admit I'm not familiar with it. When we mounted the two volumes, we went by the instructions at https://wikitech.wikimedia.org/wiki/Help:Adding_disk_space_to_Cloud_VPS_instances and used wmcs-prepare-cinder-volume, which evidently is supposed to do the fstab stuff for you. Perhaps you found a bug? Were the volumes completely unmounted when WikiWho was down?

> I'm a little confused what went wrong. What you did for /var/run/celery sounds like it was necessary and probably explains why things go down occasionally? Or do you believe the PID file was getting created before, but the permissions changed or something, which you've now fixed?

It's all pretty confusing; the fact that the directory didn't exist almost makes it seem like the instance was rebuilt?

> As for fstab, I must admit I'm not familiar with it. When we mounted the two volumes, we went by the instructions at https://wikitech.wikimedia.org/wiki/Help:Adding_disk_space_to_Cloud_VPS_instances and used wmcs-prepare-cinder-volume, which evidently is supposed to do the fstab stuff for you. Perhaps you found a bug? Were the volumes completely unmounted when WikiWho was down?

The volumes were unmounted when WikiWho was down, which suggests a reboot/failure with the volume storage?
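If it helps pin that down, the boot history can be read straight off the instance; a quick check along these lines (standard tools, nothing WikiWho-specific):

  uptime -s                          # when the instance last booted
  last -x reboot shutdown | head     # recent reboot/shutdown records from wtmp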

Noting that the worker_long@ww_host worker is dying after an inconsistent amount of time; there were ~140 items in its queue to process, and I have been manually restarting the ww_celery service when I notice it has died. It now has ~40 items in its queue. I'll keep doing this proactively until its queue is empty.
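For reference, the per-queue backlog can also be read straight from the broker, assuming the broker is a local RabbitMQ and the long worker consumes a queue literally named "long" (both assumptions, adjust to the real queue name); a sketch:

  sudo rabbitmqctl list_queues name messages consumers   # backlog and consumer count per queue
  sudo systemctl restart ww_celery.service               # the manual restart mentioned above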

Mentioned in SAL (#wikimedia-cloud) [2023-12-04T10:48:14Z] <TheresNoTime> samtar@wikiwho01:~$ sudo systemctl restart ww_celery.service T351172

Re. having to restart ^: this was due to worker_long failing and not restarting. State after restarting:

Screenshot 2023-12-04 at 10.52.30.png (attached screenshot of the worker/queue state after the restart)

Note the ~400 queued items for the long queue.

Service appears stable again :-)