User Details
- User Since
- Nov 2 2020, 11:59 AM (185 w, 1 d)
- Availability
- Available
- IRC Nick
- dcaro
- LDAP User
- David Caro
- MediaWiki User
- DCaro (WMF) [ Global Accounts ]
Wed, May 15
+1 @Tim-moody please remember to clean up after the migration; storage is reserved even if it's not used (unlike CPU/RAM), thanks!
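In case it helps, a rough sketch of how leftover resources could be listed and removed with the standard OpenStack CLI (which resources to remove depends on the migration, the instance name below is a placeholder):
openstack server list            # instances still allocated in the project
openstack volume list            # volumes still counted against the storage quota
openstack server delete <old-instance>   # only once the new setup is confirmed working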
This is deployed already:
dcaro@tools-bastion-13:~$ curl -v --insecure https://api.svc.tools.eqiad1.wikimedia.cloud:30003/openapi.json -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
* Trying 172.16.6.113:30003...
* Connected to api.svc.tools.eqiad1.wikimedia.cloud (172.16.6.113) port 30003 (#0)
* ALPN: offers h2,http/1.1
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [122 bytes data]
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
{ [25 bytes data]
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
{ [76 bytes data]
* TLSv1.3 (IN), TLS handshake, Certificate (11):
{ [462 bytes data]
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
{ [80 bytes data]
* TLSv1.3 (IN), TLS handshake, Finished (20):
{ [52 bytes data]
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
} [8 bytes data]
* TLSv1.3 (OUT), TLS handshake, Finished (20):
} [52 bytes data]
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN: server accepted http/1.1
* Server certificate:
*  subject: CN=api-gateway.api-gateway.svc
*  start date: May 2 18:07:37 2024 GMT
*  expire date: May 23 18:07:37 2024 GMT
*  issuer: CN=api-gateway.api-gateway.svc
*  SSL certificate verify result: self-signed certificate (18), continuing anyway.
* using HTTP/1.1
} [5 bytes data]
> GET /openapi.json HTTP/1.1
> Host: api.svc.tools.eqiad1.wikimedia.cloud:30003
> User-Agent: curl/7.88.1
> Accept: */*
>
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [297 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [297 bytes data]
* old SSL session ID is stale, removing
{ [5 bytes data]
< HTTP/1.1 200 OK
< Server: nginx/1.21.0
< Date: Wed, 15 May 2024 10:34:35 GMT
< Content-Type: application/json
< Content-Length: 27088
< Connection: keep-alive
<
{ [16227 bytes data]
100 27088  100 27088    0     0  98349      0 --:--:-- --:--:-- --:--:-- 98861
* Connection #0 to host api.svc.tools.eqiad1.wikimedia.cloud left intact
Deployed, the changes should be there already :)
Tue, May 14
moving it to the done column changed the status :/
Kyverno has native support for mutating existing resources: https://kyverno.io/docs/writing-policies/mutate/#mutate-existing-resources
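For reference, a rough, untested sketch of what such a policy could look like (following the linked docs; the resource kind and label are purely illustrative):
kubectl apply -f - <<'EOF'
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: label-existing-configmaps        # illustrative name
spec:
  mutateExistingOnPolicyUpdate: true     # re-apply the mutation when the policy changes
  rules:
    - name: add-label-to-existing
      match:
        any:
          - resources:
              kinds:
                - ConfigMap
      mutate:
        targets:                         # "mutate existing" acts on these targets
          - apiVersion: v1
            kind: ConfigMap
            namespace: "{{ request.object.metadata.namespace }}"
        patchStrategicMerge:
          metadata:
            labels:
              example.org/managed: "true"
EOF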
I have some questions:
On nfs-9, there are 3 errors that repeat over and over in the logs, right until the moment when the number of D processes starts to rise (a way to keep an eye on those is sketched after the log excerpt):
root@tools-k8s-worker-nfs-9:~# journalctl --since "2024-05-14 04:00:00" | grep kubelet | grep 'Error syncing pod'
...
May 14 05:13:37 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:13:37.341891 879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"scheduler\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=scheduler pod=airflow.scheduler-5b8d46bd8-4p42w_tool-airflow(2102a178-6c9a-47f6-9fd5-9d1817a415b6)\"" pod="tool-airflow/airflow.scheduler-5b8d46bd8-4p42w" podUID=2102a178-6c9a-47f6-9fd5-9d1817a415b6
May 14 05:13:47 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:13:47.340454 879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"webservice\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=webservice pod=airflow-589f44d589-sb69q_tool-airflow(d6f529fa-6f85-4f89-a9f0-14fd358fd7d6)\"" pod="tool-airflow/airflow-589f44d589-sb69q" podUID=d6f529fa-6f85-4f89-a9f0-14fd358fd7d6
May 14 05:13:47 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:13:47.340979 879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"job\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=job pod=k8s-20170515.signature-check.wiktionary-5669b58c56-hbt5s_tool-signature-checker(fa1e050d-3002-40a1-9672-0eed6956e925)\"" pod="tool-signature-checker/k8s-20170515.signature-check.wiktionary-5669b58c56-hbt5s" podUID=fa1e050d-3002-40a1-9672-0eed6956e925
May 14 05:13:49 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:13:49.339518 879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"scheduler\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=scheduler pod=airflow.scheduler-5b8d46bd8-4p42w_tool-airflow(2102a178-6c9a-47f6-9fd5-9d1817a415b6)\"" pod="tool-airflow/airflow.scheduler-5b8d46bd8-4p42w" podUID=2102a178-6c9a-47f6-9fd5-9d1817a415b6
May 14 05:13:59 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:13:59.340072 879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"webservice\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=webservice pod=airflow-589f44d589-sb69q_tool-airflow(d6f529fa-6f85-4f89-a9f0-14fd358fd7d6)\"" pod="tool-airflow/airflow-589f44d589-sb69q" podUID=d6f529fa-6f85-4f89-a9f0-14fd358fd7d6
May 14 05:14:00 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:14:00.339848 879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"scheduler\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=scheduler pod=airflow.scheduler-5b8d46bd8-4p42w_tool-airflow(2102a178-6c9a-47f6-9fd5-9d1817a415b6)\"" pod="tool-airflow/airflow.scheduler-5b8d46bd8-4p42w" podUID=2102a178-6c9a-47f6-9fd5-9d1817a415b6
May 14 05:14:13 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:14:13.340330 879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"scheduler\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=scheduler pod=airflow.scheduler-5b8d46bd8-4p42w_tool-airflow(2102a178-6c9a-47f6-9fd5-9d1817a415b6)\"" pod="tool-airflow/airflow.scheduler-5b8d46bd8-4p42w" podUID=2102a178-6c9a-47f6-9fd5-9d1817a415b6
May 14 05:14:14 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:14:14.339403 879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"webservice\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=webservice pod=airflow-589f44d589-sb69q_tool-airflow(d6f529fa-6f85-4f89-a9f0-14fd358fd7d6)\"" pod="tool-airflow/airflow-589f44d589-sb69q" podUID=d6f529fa-6f85-4f89-a9f0-14fd358fd7d6
May 14 05:14:25 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:14:25.340147 879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"webservice\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=webservice pod=airflow-589f44d589-sb69q_tool-airflow(d6f529fa-6f85-4f89-a9f0-14fd358fd7d6)\"" pod="tool-airflow/airflow-589f44d589-sb69q" podUID=d6f529fa-6f85-4f89-a9f0-14fd358fd7d6
May 14 05:14:27 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:14:27.342502 879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"scheduler\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=scheduler pod=airflow.scheduler-5b8d46bd8-4p42w_tool-airflow(2102a178-6c9a-47f6-9fd5-9d1817a415b6)\"" pod="tool-airflow/airflow.scheduler-5b8d46bd8-4p42w" podUID=2102a178-6c9a-47f6-9fd5-9d1817a415b6
May 14 05:14:36 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:14:36.340171 879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"webservice\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=webservice pod=airflow-589f44d589-sb69q_tool-airflow(d6f529fa-6f85-4f89-a9f0-14fd358fd7d6)\"" pod="tool-airflow/airflow-589f44d589-sb69q" podUID=d6f529fa-6f85-4f89-a9f0-14fd358fd7d6
May 14 05:14:51 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:14:51.340674 879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"webservice\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=webservice pod=airflow-589f44d589-sb69q_tool-airflow(d6f529fa-6f85-4f89-a9f0-14fd358fd7d6)\"" pod="tool-airflow/airflow-589f44d589-sb69q" podUID=d6f529fa-6f85-4f89-a9f0-14fd358fd7d6
May 14 05:15:04 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:15:04.341238 879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"webservice\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=webservice pod=airflow-589f44d589-sb69q_tool-airflow(d6f529fa-6f85-4f89-a9f0-14fd358fd7d6)\"" pod="tool-airflow/airflow-589f44d589-sb69q" podUID=d6f529fa-6f85-4f89-a9f0-14fd358fd7d6
##### stops here, no more errors after this
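To correlate that with the D-state buildup, a quick sketch of how to watch the stuck processes on the worker:
# processes in uninterruptible sleep (usually stuck on NFS I/O)
ps -eo state,pid,wchan:32,cmd | awk '$1 == "D"'
# or just track the count over time
watch -n 5 'ps -eo state= | grep -c D'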
The load on nfs-52 seems to come from osm4wiki processes, which have been killed several times for going over the memory limit. NFS is responsive, just slow when the load gets high, so this is probably unrelated.
+1 from me
nfs-9 has been drained without issues, so I can start debugging on it.
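For reference, a typical drain/uncordon sequence would be roughly (usual kubectl flags, adjust as needed):
kubectl drain tools-k8s-worker-nfs-9 --ignore-daemonsets --delete-emptydir-data
# ...debug on the node...
kubectl uncordon tools-k8s-worker-nfs-9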
The error on clouddumps1002 is:
May 14 04:55:00 clouddumps1002 rsync[4096653]: @ERROR: chroot failed
May 14 04:55:00 clouddumps1002 rsync[4096653]: rsync error: error starting client-server protocol (code 5) at main.c(1817) [sender=3.2.3]
May 14 04:55:00 clouddumps1002 systemd[1]: rsync_nginxlogs.service: Main process exited, code=exited, status=5/NOTINSTALLED
May 14 04:55:00 clouddumps1002 systemd[1]: rsync_nginxlogs.service: Failed with result 'exit-code'.
May 14 04:55:00 clouddumps1002 systemd[1]: Failed to start Regular jobs to rsync nginx logs.
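"@ERROR: chroot failed" usually means the rsync daemon could not chroot into the configured module path (missing directory or permissions). A couple of generic checks, as a sketch:
systemctl cat rsync_nginxlogs.service          # see how the unit invokes rsync and which module/path it targets
journalctl -u rsync_nginxlogs.service --since "2024-05-14"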
Mon, May 13
This all seems correct, although I reiterate that the interesting part is the scope creation or management. We don't currently have a good way to create databases that are explicitly tied to a particular tool -- even with the bespoke/by-hand approach we take now there's nothing to keep the tool and Trove ACLs in sync.
Got it :), I was able to reproduce it with ~30s of inactivity:
tools.wm-lol@tools-bastion-13:~$ time toolforge jobs logs -f test
2024-05-13T13:25:18+00:00 [test-64dc8bcc9d-bdtkn] Mon May 13 01:25:18 PM UTC 2024
ERROR: An internal error occured while executing this command.
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/response.py", line 443, in _error_catcher
    yield
  File "/usr/lib/python3/dist-packages/urllib3/response.py", line 815, in read_chunked
    self._update_chunk_length()
  File "/usr/lib/python3/dist-packages/urllib3/response.py", line 745, in _update_chunk_length
    line = self._fp.fp.readline()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/socket.py", line 706, in readinto
    return self._sock.recv_into(b)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/ssl.py", line 1278, in recv_into
    return self.read(nbytes, buffer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/ssl.py", line 1134, in read
    return self._sslobj.read(len, buffer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TimeoutError: The read operation timed out
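A rough way to reproduce it on demand (job name and image are illustrative): start a job that stays silent for about a minute and follow its logs:
toolforge jobs run sleepy-test --command 'sleep 60 && date' --image python3.11
time toolforge jobs logs -f sleepy-test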
The private one also works, but only if you use the URL https://object.eqiad1.wikimediacloud.org/swift/v1/AUTH_toolsbeta/dcarotest1/test.yaml and application credentials (for example); that is not exposed anywhere in the Horizon UI.
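As a sketch of what that looks like (application credential values are placeholders, and it assumes OS_AUTH_URL etc. are already set for the project, e.g. via clouds.yaml):
export OS_AUTH_TYPE=v3applicationcredential
export OS_APPLICATION_CREDENTIAL_ID=<id>
export OS_APPLICATION_CREDENTIAL_SECRET=<secret>
TOKEN=$(openstack token issue -f value -c id)
curl -H "X-Auth-Token: $TOKEN" \
  https://object.eqiad1.wikimediacloud.org/swift/v1/AUTH_toolsbeta/dcarotest1/test.yaml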
The private bucket has no such link though :/
I had a quick look at the code; it seems that the anonymous user is not extended and it's falling back to the Django default one, which does not have those fields.
Thanks for all the replies!
This should have reduced the error rate somewhat: if the job returns any logs, it will not time out (it will still time out if the job does not return anything for a while): https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/84
From dmesg:
Flagging as upstream to check in on it eventually
Tue, May 7
As a quick workaround, have you tried specifying the newer version?
I'm currently leaning towards using idp.w.o for the Toolforge API auth, and leaving the OpenStack auth only for when it's required for storage or database management. Unless we are 100% sure that we are never going to move to idp.w.o for Horizon (let me know if so).
I'm working on T363983: [toolforge] Investigate authentication, and I think both are very related, especially when we talk about how to authenticate to access the buckets/storage.
@MoritzMuehlenhoff mentioned on IRC that this is probably related to the glibc upgrade plus systemd issues under load, and should be very infrequent.
It might be related to T199911: Systemd session creation fails under I/O load
For the "wikireplicas" meeting, the notes are:
The sessions seem to be root logins from cumin2002 (for all nodes):
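(The session listing itself is not included above; as a sketch, these are the kind of checks that show it, commands are the standard ones:)
last -a root | head
grep 'Accepted publickey for root' /var/log/auth.log | grep cumin2002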
Mon, May 6
@bd808 This should have been fixed, can you try now?
Thu, May 2
Wait for T363983: [toolforge] Investigate authentication, as it might not be needed