
versions.toolforge.org is down
Closed, Resolved (Public)

Description

Currently, it is not possible to use https://versions.toolforge.org/. This tool provides information about the currently deployed wikiversions, which is useful to have while preparing a backport. The tool eventually loads, but without the wikiversions info:

image.png (663×1 px, 211 KB)

I can't find a dedicated place to report issues related to the tool, so I'm reporting it here, CCing maintainers: @bd808 @greg @Quiddity.

Event Timeline

The error.log shows various errors, but the “root” errors seem to be:

2024-11-25 09:17:21: (mod_fastcgi.c.421) FastCGI-stderr: PHP Warning:  file_get_contents(https://noc.wikimedia.org/conf/wikiversions.json): failed to open stream: HTTP request failed! in /data/project/versions/public_html/index.php on line 91
2024-11-25 09:17:21: (mod_fastcgi.c.421) FastCGI-stderr: PHP Warning:  file_get_contents(https://noc.wikimedia.org/conf/dblists/group0.dblist): failed to open stream: HTTP request failed! in /data/project/versions/public_html/index.php on line 91
2024-11-25 09:17:21: (mod_fastcgi.c.421) FastCGI-stderr: PHP Warning:  file_get_contents(https://noc.wikimedia.org/conf/dblists/group1.dblist): failed to open stream: HTTP request failed! in /data/project/versions/public_html/index.php on line 91
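To see at a glance which upstream URLs are failing, warnings like the ones above can be grouped per URL. This is only a rough sketch: the regular expression is inferred from the three quoted lines and may not match other entries in error.log, and the function name `summarize_fetch_failures` is made up for illustration.

```python
import re
from collections import Counter

# Field layout inferred from the three log lines quoted above (an assumption;
# other lines in error.log may look different and will simply be skipped).
WARNING_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): .*?"
    r"file_get_contents\((?P<url>[^)]+)\): (?P<reason>.+?) in "
    r"(?P<script>\S+) on line (?P<line>\d+)$"
)


def summarize_fetch_failures(log_lines):
    """Count failed file_get_contents() warnings per URL."""
    counts = Counter()
    for raw in log_lines:
        match = WARNING_RE.match(raw.strip())
        if match:
            counts[match.group("url")] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py < error.log
    for url, count in summarize_fetch_failures(sys.stdin).most_common():
        print(count, url)
```

If every failing URL points at the same host (here noc.wikimedia.org), that hints at a connectivity problem between the tool and that host rather than a bug in index.php.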

For convenience:

I don’t know where those errors come from; accessing the URLs from elsewhere (including another webservice shell) works fine AFAICT.

Mentioned in SAL (#wikimedia-cloud) [2024-11-25T10:36:30Z] <wmbot~lucaswerkmeister@tools-bastion-13> webservice restart # maybe it helps with T380703

I believe there are some DNS/resolver issues ongoing in cloud

(Switching hats, sorry.)

I can definitely reproduce the issue within the running webservice’s pod:

tools.versions@tools-bastion-13:~$ kubectl get pods
NAME                        READY   STATUS    RESTARTS   AGE
versions-57b986d64f-fvjff   1/1     Running   0          20d
tools.versions@tools-bastion-13:~$ kubectl exec -it versions-57b986d64f-fvjff -- bash
tools.versions@versions-57b986d64f-fvjff:~$ curl https://noc.wikimedia.org/conf/wikiversions.json | sha256sum 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:02:30 --:--:--     0

I think the webservice restart didn’t actually do anything (notice the “20d” age of the pod). But I’m a bit hesitant to hit it with a harder restart stick myself…
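A curl that sits at 0 bytes for minutes does not say whether name resolution or the TCP connection is the part that fails. A small probe like the one below, run from inside the pod, could help tell the two apart. It is a sketch under assumptions: it presumes a Python 3 interpreter is available in the container image, and the function name `probe` and its return labels are invented for this example.

```python
import socket


def probe(host, port=443, timeout=5.0):
    """Return 'dns_failure', 'connect_failure', or 'ok' for host:port."""
    try:
        # Note: getaddrinfo() has no timeout of its own; a broken resolver
        # can still make this call block for a while.
        addresses = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failure"

    # Try every resolved address (IPv4 and IPv6) until one accepts.
    for family, socktype, proto, _canonname, sockaddr in addresses:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return "ok"
        except OSError:
            # Covers refused connections, unreachable networks and timeouts.
            continue
    return "connect_failure"


if __name__ == "__main__":
    print(probe("noc.wikimedia.org"))
```

A result of `dns_failure` would point at the resolver, while `connect_failure` would point at routing or firewalling between the pod and the target host.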

> I believe there are some DNS/resolver issues ongoing in cloud

You mean T374830?

aborrero claimed this task.
aborrero subscribed.

This was a network outage caused by the operations at T380174: CloudVPS: IPv6 in eqiad1.

Should be fixed now.

There might be a mixture of issues here, as the original error seems to have happened before the network outage (November 25, 2024 at 11:28:51 AM GMT+1, https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/132#note_8339f66545cb4552cd41ab090f629bde43e09aba).
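The two timestamps are in different notations, so a quick conversion makes the ordering easier to check. This is a sketch that assumes the error.log timestamps are in UTC; the log lines themselves do not state a time zone.

```python
from datetime import datetime, timedelta, timezone

# Assumption: error.log timestamps are UTC (the log does not say).
first_error = datetime(2024, 11, 25, 9, 17, 21, tzinfo=timezone.utc)

# Timestamp quoted in the comment above, given as GMT+1.
outage_note = datetime(
    2024, 11, 25, 11, 28, 51, tzinfo=timezone(timedelta(hours=1))
)

gap = outage_note - first_error
print(outage_note.astimezone(timezone.utc).isoformat())  # 2024-11-25T10:28:51+00:00
print(gap)  # 1:11:30
```

Under that assumption the first logged error precedes the quoted timestamp by 1 hour, 11 minutes and 30 seconds. If the log were in GMT+1 instead, the gap would be an hour larger, so the ordering holds either way.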

Once we are back from the outage, let's see whether any issues are still happening.