
dcaro (David Caro)
SRE & amateur yak shaver

User Details

User Since
Nov 2 2020, 11:59 AM (185 w, 1 d)
Availability
Available
IRC Nick
dcaro
LDAP User
David Caro
MediaWiki User
DCaro (WMF) [ Global Accounts ]

Recent Activity

Wed, May 15

dcaro renamed T365015: [jobs-api] Cleanup deprecated blueprints from [jobs-api] Cleanup deprecated APIs to [jobs-api] Cleanup deprecated blueprints.
Wed, May 15, 2:03 PM · Toolforge
dcaro triaged T365015: [jobs-api] Cleanup deprecated blueprints as Medium priority.
Wed, May 15, 2:03 PM · Toolforge
dcaro updated the task description for T365014: [jobs-api,builds-api,envvars-api] consolidate api paths.
Wed, May 15, 2:01 PM · Toolforge
dcaro updated the task description for T365014: [jobs-api,builds-api,envvars-api] consolidate api paths.
Wed, May 15, 2:00 PM · Toolforge
dcaro triaged T365014: [jobs-api,builds-api,envvars-api] consolidate api paths as High priority.
Wed, May 15, 1:55 PM · Toolforge
dcaro added a comment to T361946: Request temporary storage quota increase for project iiab for migration to bookworm image.

+1 @Tim-moody please remember to clean up after the migration; storage is reserved even if it's not used (unlike CPU/RAM), thanks!

Wed, May 15, 1:41 PM · Cloud-VPS (Quota-requests)
dcaro placed T362082: [components-api] Add minimal cli with build-only features up for grabs.
Wed, May 15, 1:38 PM · Toolforge
dcaro triaged T363544: [envvars-cli] Add option to not show envvar values when listing as Medium priority.
Wed, May 15, 1:37 PM · Toolforge
dcaro moved T364878: Running out of "secrets" quota (envvars) produces unhelpful error message from `toolforge envvars create` from Backlog to Ready to be worked on on the Toolforge board.
Wed, May 15, 1:19 PM · Patch-For-Review, Toolforge
dcaro triaged T364878: Running out of "secrets" quota (envvars) produces unhelpful error message from `toolforge envvars create` as Medium priority.
Wed, May 15, 1:19 PM · Patch-For-Review, Toolforge
dcaro added a comment to T358496: [toolforge,storage] Provide per-tool access to cloud-vps object storage.

This all seems correct, although I reiterate that the interesting part is the scope creation or management. We don't currently have a good way to create databases that are explicitly tied to a particular tool -- even with the bespoke/by hand approach we take now there's nothing to keep the tool and trove ACLS in sync.

Can you elaborate on this?

The current database-for-a-tool solution is that we've created trove-only openstack projects to manage databases used by toolforge tools. Those projects may or may not have the same members as the tool that project is supporting; it's entirely ad-hoc. If we have a model where a tool corresponds directly to an openstack tenant then we can put the trove DBs in there and have consistent access and membership between tools and trove.

Wed, May 15, 12:49 PM · cloud-services-team, Toolforge
dcaro added a comment to T356261: [toolforge-cli,jobs-cli,builds-cli,envvars-cli] Explore OpenAPI SDK tooling for client consolidation.

Returning to the original intent of this task, there seem to only be two good alternatives, OpenAPI Generator and Swagger Codegen (links in task description). Of these, my recommendation would be OpenAPI Generator based on discussion in earlier comments.

@dcaro, do you think it would be okay to just pick one, or do we need a decision request? And/or further exploration, e.g. comparing generated SDKs?

Wed, May 15, 12:23 PM · Toolforge (Toolforge iteration 09)
dcaro updated the task description for T362299: [api-gateway] Add a python server to serve consolidated openapi docs.
Wed, May 15, 10:35 AM · Toolforge (Toolforge iteration 09), User-aborrero
dcaro added a comment to T362299: [api-gateway] Add a python server to serve consolidated openapi docs.

This is deployed already:

dcaro@tools-bastion-13:~$ curl -v --insecure https://api.svc.tools.eqiad1.wikimedia.cloud:30003/openapi.json -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 172.16.6.113:30003...
* Connected to api.svc.tools.eqiad1.wikimedia.cloud (172.16.6.113) port 30003 (#0)
* ALPN: offers h2,http/1.1
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
* TLSv1.3 (IN), TLS handshake, Server hello (2):
{ [122 bytes data]
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
{ [25 bytes data]
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
{ [76 bytes data]
* TLSv1.3 (IN), TLS handshake, Certificate (11):
{ [462 bytes data]
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
{ [80 bytes data]
* TLSv1.3 (IN), TLS handshake, Finished (20):
{ [52 bytes data]
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
} [8 bytes data]
* TLSv1.3 (OUT), TLS handshake, Finished (20):
} [52 bytes data]
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN: server accepted http/1.1
* Server certificate:
*  subject: CN=api-gateway.api-gateway.svc
*  start date: May  2 18:07:37 2024 GMT
*  expire date: May 23 18:07:37 2024 GMT
*  issuer: CN=api-gateway.api-gateway.svc
*  SSL certificate verify result: self-signed certificate (18), continuing anyway.
* using HTTP/1.1
} [5 bytes data]
> GET /openapi.json HTTP/1.1
> Host: api.svc.tools.eqiad1.wikimedia.cloud:30003
> User-Agent: curl/7.88.1
> Accept: */*
> 
{ [5 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [297 bytes data]
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
{ [297 bytes data]
* old SSL session ID is stale, removing
{ [5 bytes data]
< HTTP/1.1 200 OK
< Server: nginx/1.21.0
< Date: Wed, 15 May 2024 10:34:35 GMT
< Content-Type: application/json
< Content-Length: 27088
< Connection: keep-alive
< 
{ [16227 bytes data]
100 27088  100 27088    0     0  98349      0 --:--:-- --:--:-- --:--:-- 98861
* Connection #0 to host api.svc.tools.eqiad1.wikimedia.cloud left intact
Wed, May 15, 10:35 AM · Toolforge (Toolforge iteration 09), User-aborrero
dcaro closed T362299: [api-gateway] Add a python server to serve consolidated openapi docs as Resolved.
Wed, May 15, 10:35 AM · Toolforge (Toolforge iteration 09), User-aborrero
dcaro closed T362299: [api-gateway] Add a python server to serve consolidated openapi docs, a subtask of T354745: [jobs-api,buildservice-api,envvars-api] Investigate ways to present our multiple Openapi definitions to a future consolidated CLI client, as Resolved.
Wed, May 15, 10:33 AM · Toolforge (Toolforge iteration 09), Patch-For-Review, User-aborrero
dcaro added a comment to T364883: [maintain-kubeusers] Increment default Secrets (envvars) quota.

Deployed, the changes should be there already :)

Wed, May 15, 10:32 AM · Toolforge (Toolforge iteration 09)
dcaro closed T364883: [maintain-kubeusers] Increment default Secrets (envvars) quota as Resolved.
Wed, May 15, 10:32 AM · Toolforge (Toolforge iteration 09)

Tue, May 14

dcaro claimed T364883: [maintain-kubeusers] Increment default Secrets (envvars) quota.
Tue, May 14, 6:02 PM · Toolforge (Toolforge iteration 09)
dcaro triaged T364883: [maintain-kubeusers] Increment default Secrets (envvars) quota as Medium priority.
Tue, May 14, 6:02 PM · Toolforge (Toolforge iteration 09)
dcaro moved T364883: [maintain-kubeusers] Increment default Secrets (envvars) quota from Next Up to In Review on the Toolforge (Toolforge iteration 09) board.
Tue, May 14, 6:02 PM · Toolforge (Toolforge iteration 09)
dcaro changed the status of T364780: increase quota for services from Resolved to Invalid.

moving it to the done column changed the status :/

Tue, May 14, 1:55 PM · Toolforge (Toolforge iteration 09)
dcaro changed the status of T364780: increase quota for services from Duplicate to Resolved.
Tue, May 14, 1:55 PM · Toolforge (Toolforge iteration 09)
dcaro changed the status of T364822: [infra] NFS hangs in some workers until the worker is rebooted (2024-05-14) from Open to In Progress.
Tue, May 14, 1:54 PM · Toolforge (Toolforge iteration 09)
dcaro added a comment to T362872: Decision Request - Toolforge policy agent enforcement model.

Kyverno has native support for mutating existing resources: https://kyverno.io/docs/writing-policies/mutate/#mutate-existing-resources

Tue, May 14, 1:45 PM · Cloud Services Proposals, User-aborrero, cloud-services-team, Toolforge
dcaro added a comment to T364459: Migrate eqiad1 cloudnets to Neutron OVS agent.

I have some questions:

Tue, May 14, 10:35 AM · cloud-services-team (FY2023/2024-Q3-Q4), Cloud-VPS
dcaro added a comment to T364822: [infra] NFS hangs in some workers until the worker is rebooted (2024-05-14).

On nfs-9, three errors repeat over and over in the logs, right up until the moment the number of D-state processes starts to rise:

root@tools-k8s-worker-nfs-9:~# journalctl --since "2024-05-14 04:00:00" | grep kubelet | grep 'Error syncing pod'
...
May 14 05:13:37 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:13:37.341891  879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"scheduler\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=scheduler pod=airflow.scheduler-5b8d46bd8-4p42w_tool-airflow(2102a178-6c9a-47f6-9fd5-9d1817a415b6)\"" pod="tool-airflow/airflow.scheduler-5b8d46bd8-4p42w" podUID=2102a178-6c9a-47f6-9fd5-9d1817a415b6
May 14 05:13:47 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:13:47.340454  879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"webservice\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=webservice pod=airflow-589f44d589-sb69q_tool-airflow(d6f529fa-6f85-4f89-a9f0-14fd358fd7d6)\"" pod="tool-airflow/airflow-589f44d589-sb69q" podUID=d6f529fa-6f85-4f89-a9f0-14fd358fd7d6
May 14 05:13:47 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:13:47.340979  879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"job\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=job pod=k8s-20170515.signature-check.wiktionary-5669b58c56-hbt5s_tool-signature-checker(fa1e050d-3002-40a1-9672-0eed6956e925)\"" pod="tool-signature-checker/k8s-20170515.signature-check.wiktionary-5669b58c56-hbt5s" podUID=fa1e050d-3002-40a1-9672-0eed6956e925
May 14 05:13:49 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:13:49.339518  879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"scheduler\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=scheduler pod=airflow.scheduler-5b8d46bd8-4p42w_tool-airflow(2102a178-6c9a-47f6-9fd5-9d1817a415b6)\"" pod="tool-airflow/airflow.scheduler-5b8d46bd8-4p42w" podUID=2102a178-6c9a-47f6-9fd5-9d1817a415b6
May 14 05:13:59 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:13:59.340072  879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"webservice\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=webservice pod=airflow-589f44d589-sb69q_tool-airflow(d6f529fa-6f85-4f89-a9f0-14fd358fd7d6)\"" pod="tool-airflow/airflow-589f44d589-sb69q" podUID=d6f529fa-6f85-4f89-a9f0-14fd358fd7d6
May 14 05:14:00 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:14:00.339848  879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"scheduler\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=scheduler pod=airflow.scheduler-5b8d46bd8-4p42w_tool-airflow(2102a178-6c9a-47f6-9fd5-9d1817a415b6)\"" pod="tool-airflow/airflow.scheduler-5b8d46bd8-4p42w" podUID=2102a178-6c9a-47f6-9fd5-9d1817a415b6
May 14 05:14:13 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:14:13.340330  879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"scheduler\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=scheduler pod=airflow.scheduler-5b8d46bd8-4p42w_tool-airflow(2102a178-6c9a-47f6-9fd5-9d1817a415b6)\"" pod="tool-airflow/airflow.scheduler-5b8d46bd8-4p42w" podUID=2102a178-6c9a-47f6-9fd5-9d1817a415b6
May 14 05:14:14 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:14:14.339403  879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"webservice\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=webservice pod=airflow-589f44d589-sb69q_tool-airflow(d6f529fa-6f85-4f89-a9f0-14fd358fd7d6)\"" pod="tool-airflow/airflow-589f44d589-sb69q" podUID=d6f529fa-6f85-4f89-a9f0-14fd358fd7d6
May 14 05:14:25 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:14:25.340147  879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"webservice\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=webservice pod=airflow-589f44d589-sb69q_tool-airflow(d6f529fa-6f85-4f89-a9f0-14fd358fd7d6)\"" pod="tool-airflow/airflow-589f44d589-sb69q" podUID=d6f529fa-6f85-4f89-a9f0-14fd358fd7d6
May 14 05:14:27 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:14:27.342502  879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"scheduler\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=scheduler pod=airflow.scheduler-5b8d46bd8-4p42w_tool-airflow(2102a178-6c9a-47f6-9fd5-9d1817a415b6)\"" pod="tool-airflow/airflow.scheduler-5b8d46bd8-4p42w" podUID=2102a178-6c9a-47f6-9fd5-9d1817a415b6
May 14 05:14:36 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:14:36.340171  879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"webservice\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=webservice pod=airflow-589f44d589-sb69q_tool-airflow(d6f529fa-6f85-4f89-a9f0-14fd358fd7d6)\"" pod="tool-airflow/airflow-589f44d589-sb69q" podUID=d6f529fa-6f85-4f89-a9f0-14fd358fd7d6
May 14 05:14:51 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:14:51.340674  879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"webservice\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=webservice pod=airflow-589f44d589-sb69q_tool-airflow(d6f529fa-6f85-4f89-a9f0-14fd358fd7d6)\"" pod="tool-airflow/airflow-589f44d589-sb69q" podUID=d6f529fa-6f85-4f89-a9f0-14fd358fd7d6
May 14 05:15:04 tools-k8s-worker-nfs-9 kubelet[879691]: E0514 05:15:04.341238  879691 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"webservice\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=webservice pod=airflow-589f44d589-sb69q_tool-airflow(d6f529fa-6f85-4f89-a9f0-14fd358fd7d6)\"" pod="tool-airflow/airflow-589f44d589-sb69q" podUID=d6f529fa-6f85-4f89-a9f0-14fd358fd7d6
##### stops here, no more errors after this
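
A minimal sketch for tallying the repeated errors above per pod, assuming the log lines keep the `pod="namespace/name"` field seen in this excerpt. The function name `count_sync_errors` is hypothetical and not part of any Toolforge tooling.

```python
import re
from collections import Counter

# Matches the trailing pod="namespace/name" field of a kubelet log line.
# The unquoted pod=... fragment inside the err="..." text and the podUID=
# field are not matched, because neither is followed directly by a quote.
POD_RE = re.compile(r'\bpod="([^"]+)"')


def count_sync_errors(lines):
    """Count 'Error syncing pod' lines, keyed by namespace/pod name."""
    counts = Counter()
    for line in lines:
        # Ignore anything that is not one of the sync errors.
        if "Error syncing pod" not in line:
            continue
        match = POD_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

It could be fed from the same `journalctl` pipeline by passing `sys.stdin` as `lines`, which would show at a glance which pods account for the repeating errors.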
Tue, May 14, 8:30 AM · Toolforge (Toolforge iteration 09)
dcaro added a comment to T364822: [infra] NFS hangs in some workers until the worker is rebooted (2024-05-14).

The load on nfs-52 seems to come from osm4wiki processes, which have been killed several times for going over the memory limit. NFS is responsive, just slow when the load gets high, so this is probably unrelated.

Tue, May 14, 8:16 AM · Toolforge (Toolforge iteration 09)
dcaro added a comment to T364761: Request for access for user dr0ptp4kt for 'admin' tool.

+1 from me

Tue, May 14, 8:10 AM · cloud-services-team, Toolforge
dcaro added a comment to T364822: [infra] NFS hangs in some workers until the worker is rebooted (2024-05-14).

nfs-9 has been drained without issues, so I can start debugging on it.

Tue, May 14, 7:57 AM · Toolforge (Toolforge iteration 09)
dcaro updated the task description for T364822: [infra] NFS hangs in some workers until the worker is rebooted (2024-05-14).
Tue, May 14, 7:57 AM · Toolforge (Toolforge iteration 09)
dcaro moved T364822: [infra] NFS hangs in some workers until the worker is rebooted (2024-05-14) from Ready to be worked on to Toolforge iteration 09 on the Toolforge board.
Tue, May 14, 7:54 AM · Toolforge (Toolforge iteration 09)
dcaro triaged T364822: [infra] NFS hangs in some workers until the worker is rebooted (2024-05-14) as High priority.
Tue, May 14, 7:54 AM · Toolforge (Toolforge iteration 09)
dcaro raised the priority of T364822: [infra] NFS hangs in some workers until the worker is rebooted (2024-05-14) from High to Needs Triage.
Tue, May 14, 7:54 AM · Toolforge (Toolforge iteration 09)
dcaro triaged T364822: [infra] NFS hangs in some workers until the worker is rebooted (2024-05-14) as High priority.
Tue, May 14, 7:54 AM · Toolforge (Toolforge iteration 09)
dcaro created T364822: [infra] NFS hangs in some workers until the worker is rebooted (2024-05-14).
Tue, May 14, 7:54 AM · Toolforge (Toolforge iteration 09)
dcaro triaged T364821: [cookbook,infra] wmcs.toolforge.k8s.worker.drain failed to finish with `KeyError` on one node as Medium priority.
Tue, May 14, 7:50 AM · Toolforge
dcaro created T364821: [cookbook,infra] wmcs.toolforge.k8s.worker.drain failed to finish with `KeyError` on one node.
Tue, May 14, 7:50 AM · Toolforge
dcaro added a comment to T364820: The rsync_nginxlogs.service that sends web logs from clouddumps100[1-2] to a stats server isn't working.

The error on clouddumps1002 is:

May 14 04:55:00 clouddumps1002 rsync[4096653]: @ERROR: chroot failed
May 14 04:55:00 clouddumps1002 rsync[4096653]: rsync error: error starting client-server protocol (code 5) at main.c(1817) [sender=3.2.3]
May 14 04:55:00 clouddumps1002 systemd[1]: rsync_nginxlogs.service: Main process exited, code=exited, status=5/NOTINSTALLED
May 14 04:55:00 clouddumps1002 systemd[1]: rsync_nginxlogs.service: Failed with result 'exit-code'.
May 14 04:55:00 clouddumps1002 systemd[1]: Failed to start Regular jobs to rsync nginx logs.
Tue, May 14, 7:42 AM · Data-Platform-SRE (2024.05.06 - 2024.05.26)
dcaro claimed T364820: The rsync_nginxlogs.service that sends web logs from clouddumps100[1-2] to a stats server isn't working.
Tue, May 14, 7:38 AM · Data-Platform-SRE (2024.05.06 - 2024.05.26)
dcaro added a comment to T358496: [toolforge,storage] Provide per-tool access to cloud-vps object storage.

Here, we are only talking about allowing tools to set up and use s3-style buckets, correct? Is there any intersection/dependency between this use case and enabling ceph-backed PVCs in toolforge k8s, e.g. the toolforge k8s nodes being able to access and authenticate with the ceph cluster?

Tue, May 14, 7:34 AM · cloud-services-team, Toolforge

Mon, May 13

dcaro added a comment to T363683: Decision request - kubernetes upgrade workgroup.

Option 2 seems to me like the obviously good choice :)

On a first read, I was under the impression that the working group would exist only until we catch up, but from option 2 it seems clear that this would become a regular working group.

Mon, May 13, 4:36 PM · Cloud Services Proposals
dcaro updated the task description for T363983: [toolforge] Investigate authentication.
Mon, May 13, 4:26 PM · Toolforge (Toolforge iteration 09)
dcaro added a comment to T363683: Decision request - kubernetes upgrade workgroup.
Mon, May 13, 4:17 PM · Cloud Services Proposals
dcaro updated the task description for T363683: Decision request - kubernetes upgrade workgroup.
Mon, May 13, 4:15 PM · Cloud Services Proposals
dcaro added a comment to T358496: [toolforge,storage] Provide per-tool access to cloud-vps object storage.

This all seems correct, although I reiterate that the interesting part is the scope creation or management. We don't currently have a good way to create databases that are explicitly tied to a particular tool -- even with the bespoke/by hand approach we take now there's nothing to keep the tool and trove ACLS in sync.

Mon, May 13, 3:45 PM · cloud-services-team, Toolforge
dcaro added a comment to T364468: Toolforge jobs logs -f almost always ends in error.

Got it :), I was able to reproduce with ~30s of inactivity:

tools.wm-lol@tools-bastion-13:~$ time toolforge jobs logs -f test
2024-05-13T13:25:18+00:00 [test-64dc8bcc9d-bdtkn] Mon May 13 01:25:18 PM UTC 2024
ERROR: An internal error occured while executing this command.
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/response.py", line 443, in _error_catcher
    yield
  File "/usr/lib/python3/dist-packages/urllib3/response.py", line 815, in read_chunked
    self._update_chunk_length()
  File "/usr/lib/python3/dist-packages/urllib3/response.py", line 745, in _update_chunk_length
    line = self._fp.fp.readline()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/socket.py", line 706, in readinto
    return self._sock.recv_into(b)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/ssl.py", line 1278, in recv_into
    return self.read(nbytes, buffer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/ssl.py", line 1134, in read
    return self._sslobj.read(len, buffer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TimeoutError: The read operation timed out
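
One way a client could treat an idle read timeout like the one above as a normal end of the stream, rather than an internal error, is sketched below. This is an illustration only: `follow_lines` is a hypothetical helper, not the actual jobs-cli code, and it assumes the timeout reaches the caller as the built-in `TimeoutError` (an HTTP library may wrap it in its own exception type).

```python
def follow_lines(stream, on_timeout=None):
    """Yield lines from `stream` until it ends or a read times out.

    `stream` is any iterable of lines. `on_timeout` is an optional
    callback invoked with the exception when the read goes idle.
    """
    iterator = iter(stream)
    while True:
        try:
            line = next(iterator)
        except StopIteration:
            # The server closed the stream normally.
            return
        except TimeoutError as error:
            # Assumption: idle reads surface as the built-in TimeoutError.
            if on_timeout is not None:
                on_timeout(error)
            return
        yield line
```

A caller could pass an `on_timeout` callback that prints a short "no new logs, stopping" message instead of a traceback, or one that triggers a reconnect.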
Mon, May 13, 1:38 PM · Toolforge, Tool-spacemedia
dcaro updated the task description for T364706: [horizon,swift] When accessing any file (public/private) without authenticating first you get a 500 error.
Mon, May 13, 1:36 PM · cloud-services-team, Toolforge
dcaro closed T364706: [horizon,swift] When accessing any file (public/private) without authenticating first you get a 500 error, a subtask of T358496: [toolforge,storage] Provide per-tool access to cloud-vps object storage, as Invalid.
Mon, May 13, 1:34 PM · cloud-services-team, Toolforge
dcaro closed T364706: [horizon,swift] When accessing any file (public/private) without authenticating first you get a 500 error as Invalid.
Mon, May 13, 1:34 PM · cloud-services-team, Toolforge
dcaro added a comment to T364706: [horizon,swift] When accessing any file (public/private) without authenticating first you get a 500 error.

The private one also works, but only if you use the URL https://object.eqiad1.wikimediacloud.org/swift/v1/AUTH_toolsbeta/dcarotest1/test.yaml with application credentials (for example); that URL is not exposed anywhere in the Horizon UI.

Mon, May 13, 1:32 PM · cloud-services-team, Toolforge
dcaro added a comment to T358496: [toolforge,storage] Provide per-tool access to cloud-vps object storage.

May I ask why we are talking about tools interacting with Horizon? I would hope everything happens either via the APIs directly or via Striker instead of introducing a second admin interface for Toolforge :-)

Mon, May 13, 1:27 PM · cloud-services-team, Toolforge
dcaro added a comment to T364706: [horizon,swift] When accessing any file (public/private) without authenticating first you get a 500 error.

The private bucket has no such link though :/

Mon, May 13, 1:23 PM · cloud-services-team, Toolforge
dcaro added a comment to T364706: [horizon,swift] When accessing any file (public/private) without authenticating first you get a 500 error.

How does Horizon expose these /api/swift URLs?

Mon, May 13, 1:21 PM · cloud-services-team, Toolforge
dcaro added a comment to T364706: [horizon,swift] When accessing any file (public/private) without authenticating first you get a 500 error.

I had a quick look at the code: it seems that the anonymous user is not extended, so it falls back to the Django default one, which does not have those fields.

Mon, May 13, 1:07 PM · cloud-services-team, Toolforge
dcaro added a comment to T358496: [toolforge,storage] Provide per-tool access to cloud-vps object storage.

Thanks for all the replies!

Mon, May 13, 1:04 PM · cloud-services-team, Toolforge
dcaro added a comment to T364468: Toolforge jobs logs -f almost always ends in error.

No visible difference :( Is the change already live?

Mon, May 13, 12:51 PM · Toolforge, Tool-spacemedia
dcaro added a comment to T359953: `toolforge jobs logs -f` crashes after a while with internal k8s api errors.

This should have reduced the error rate somewhat: if the job returns any logs, it will not time out (it will still time out if the job does not return anything for a while): https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/84

Mon, May 13, 8:47 AM · Toolforge
dcaro added a comment to T364468: Toolforge jobs logs -f almost always ends in error.

This should have reduced the error rate somewhat: if the job returns any logs, it will not time out (it will still time out if the job does not return anything for a while): https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/84

Mon, May 13, 8:46 AM · Toolforge, Tool-spacemedia
dcaro added a comment to T364060: Degraded RAID on cloudcephosd1031.

From dmesg:

Mon, May 13, 8:39 AM · Cloud-VPS, Cloud-Services-Origin-Alert, cloud-services-team, DC-Ops, SRE, ops-eqiad
dcaro changed the status of T363417: [builds-builder] golang based images get infinite nested loops for procfile entries from In Progress to Stalled.
Mon, May 13, 8:32 AM · Upstream, Toolforge (Toolforge iteration 09)
dcaro added a project to T363417: [builds-builder] golang based images get infinite nested loops for procfile entries: Upstream.

Flagging as upstream to check in on it eventually

Mon, May 13, 8:32 AM · Upstream, Toolforge (Toolforge iteration 09)
dcaro claimed T363417: [builds-builder] golang based images get infinite nested loops for procfile entries.
Mon, May 13, 8:32 AM · Upstream, Toolforge (Toolforge iteration 09)
dcaro renamed T364706: [horizon,swift] When accessing any file (public/private) without authenticating first you get a 500 error from [horizon,swift] When accessing a private file without authenticating first you get a 500 error to [horizon,swift] When accessing any file (public/private) without authenticating first you get a 500 error.
Mon, May 13, 7:50 AM · cloud-services-team, Toolforge
dcaro created T364706: [horizon,swift] When accessing any file (public/private) without authenticating first you get a 500 error.
Mon, May 13, 7:40 AM · cloud-services-team, Toolforge

Tue, May 7

dcaro updated subscribers of T363983: [toolforge] Investigate authentication.

I'm trying to understand pros & cons of the different protocols, especially OIDC (OpenID Connect) vs CAS.

I played a bit with OIDC in a previous job, and I remember it as fairly complex but also quite well supported by a number of libraries that can be used to implement a client. I'm not sure how that compares with CAS.

A quick Google search led me to this presentation (although not very recent) with some useful comparisons: https://ldapcon.org/2017/wp-content/uploads/2017/08/16_Cl%C3%A9ment-Oudot_PRE_LDAPCon2017_SSO-1.pdf

Tue, May 7, 3:48 PM · Toolforge (Toolforge iteration 09)
dcaro added a comment to T363854: Upgrade golang buildpack to 1.22.

As a quick workaround, have you tried specifying the newer version?

Tue, May 7, 3:14 PM · Toolforge (Software install/update)
dcaro added a comment to T358496: [toolforge,storage] Provide per-tool access to cloud-vps object storage.

I'm currently leaning towards using idp.w.o for Toolforge API auth, and keeping the OpenStack auth only where it is required for storage or database management, unless we are 100% sure that we are never going to move to idp.w.o for Horizon (let me know if so).

Tue, May 7, 12:34 PM · cloud-services-team, Toolforge
dcaro added a comment to T358496: [toolforge,storage] Provide per-tool access to cloud-vps object storage.

I'm working on T363983: [toolforge] Investigate authentication, and I think both are closely related, especially when we talk about how to authenticate to access the buckets/storage.

Tue, May 7, 12:33 PM · cloud-services-team, Toolforge
dcaro closed T364376: cloudcephosd: the service unit user@0.service is in failed status as Resolved.

@MoritzMuehlenhoff mentioned on irc that this is probably related to the upgrade of glibc + systemd issues under load, should be very infrequent.

Tue, May 7, 11:47 AM · Cloud-Services, cloud-services-team
dcaro merged T364381: SystemdUnitDown into T364376: cloudcephosd: the service unit user@0.service is in failed status.
Tue, May 7, 11:43 AM · Cloud-Services, cloud-services-team
dcaro merged task T364381: SystemdUnitDown into T364376: cloudcephosd: the service unit user@0.service is in failed status.
Tue, May 7, 11:42 AM · cloud-services-team
dcaro claimed T364381: SystemdUnitDown.
Tue, May 7, 11:42 AM · cloud-services-team
dcaro added a comment to T364376: cloudcephosd: the service unit user@0.service is in failed status.

It might be related to T199911: Systemd session creation fails under I/O load

Tue, May 7, 10:47 AM · Cloud-Services, cloud-services-team
dcaro added a comment to T362443: Learn how to do what Taavi does.

For the "wikireplicas" meeting, the notes are:

Tue, May 7, 10:41 AM · Toolforge, Cloud-VPS, cloud-services-team
dcaro added a comment to T364376: cloudcephosd: the service unit user@0.service is in failed status.

The sessions seem to be root logins from cumin2002 (for all nodes):

Tue, May 7, 9:47 AM · Cloud-Services, cloud-services-team

Mon, May 6

dcaro added a comment to T363417: [builds-builder] golang based images get infinite nested loops for procfile entries.

@bd808 This should have been fixed, can you try now?

Mon, May 6, 1:39 PM · Upstream, Toolforge (Toolforge iteration 09)
dcaro updated the task description for T363983: [toolforge] Investigate authentication.
Mon, May 6, 12:34 PM · Toolforge (Toolforge iteration 09)
dcaro updated the task description for T363983: [toolforge] Investigate authentication.
Mon, May 6, 12:22 PM · Toolforge (Toolforge iteration 09)
dcaro updated the task description for T364293: [infra,k8s] Move to kubernetes PAVs and drop kyverno.
Mon, May 6, 9:18 AM · cloud-services-team, Toolforge
dcaro renamed T362869: [k8s,infra] Upgrade Toolforge to Uwubernetes (1.30) from Upgrade Toolforge to Uwubernetes to [k8s,infra] Upgrade Toolforge to Uwubernetes (1.30).
Mon, May 6, 9:17 AM · Toolforge
dcaro renamed T362868: [infra,k8s] Upgrade Toolforge Kubernetes to version 1.29 from [infra,k8s] pgrade Toolforge Kubernetes to version 1.29 to [infra,k8s] Upgrade Toolforge Kubernetes to version 1.29.
Mon, May 6, 9:16 AM · Toolforge
dcaro renamed T363417: [builds-builder] golang based images get infinite nested loops for procfile entries from Golang and Procfile buildpacks not working together as expected to [builds-builder] golang based images get infinite nested loops for procfile entries.
Mon, May 6, 8:39 AM · Upstream, Toolforge (Toolforge iteration 09)
dcaro changed the status of T363417: [builds-builder] golang based images get infinite nested loops for procfile entries from Open to In Progress.
Mon, May 6, 8:38 AM · Upstream, Toolforge (Toolforge iteration 09)
dcaro moved T363482: toolforge lima-kilo: refresh maintain-kubeusers test data from In Review to Done on the Toolforge (Toolforge iteration 09) board.
Mon, May 6, 8:38 AM · Patch-For-Review, Toolforge (Toolforge iteration 09), User-aborrero, cloud-services-team
dcaro moved T363417: [builds-builder] golang based images get infinite nested loops for procfile entries from Next Up to In Review on the Toolforge (Toolforge iteration 09) board.
Mon, May 6, 8:38 AM · Upstream, Toolforge (Toolforge iteration 09)
dcaro edited projects for T363417: [builds-builder] golang based images get infinite nested loops for procfile entries, added: Toolforge (Toolforge iteration 09); removed Toolforge.
Mon, May 6, 8:38 AM · Upstream, Toolforge (Toolforge iteration 09)
dcaro moved T364210: Add more granular schedule macro's to toolforge jobs from Backlog to Ready to be worked on on the Toolforge board.
Mon, May 6, 8:20 AM · Wikimedia-Hackathon-2024, Toolforge
dcaro triaged T364210: Add more granular schedule macro's to toolforge jobs as Medium priority.
Mon, May 6, 8:20 AM · Wikimedia-Hackathon-2024, Toolforge
dcaro moved T364293: [infra,k8s] Move to kubernetes PAVs and drop kyverno from Backlog to Ready to be worked on on the Toolforge board.
Mon, May 6, 8:19 AM · cloud-services-team, Toolforge
dcaro triaged T364293: [infra,k8s] Move to kubernetes PAVs and drop kyverno as High priority.
Mon, May 6, 8:19 AM · cloud-services-team, Toolforge
dcaro moved T364204: toolforge jobs load flushes out all jobs from Backlog to Ready to be worked on on the Toolforge board.
Mon, May 6, 8:19 AM · Wikimedia-Hackathon-2024, Toolforge
dcaro triaged T364204: toolforge jobs load flushes out all jobs as Medium priority.
Mon, May 6, 8:18 AM · Wikimedia-Hackathon-2024, Toolforge
dcaro created T364293: [infra,k8s] Move to kubernetes PAVs and drop kyverno.
Mon, May 6, 8:10 AM · cloud-services-team, Toolforge

Thu, May 2

dcaro added a comment to T363809: [envvars-api] Prefix all endpoints with `/tool/<toolname>`.

Wait for T363983: [toolforge] Investigate authentication, as it might not be needed

Thu, May 2, 9:57 AM · Toolforge
dcaro added a comment to T363346: [jobs-api] Prefix all endpoints with `/tool/<toolname>`.

Wait for T363983: [toolforge] Investigate authentication, as it might not be needed

Thu, May 2, 9:56 AM · Toolforge
dcaro added a comment to T363808: [builds-api] Prefix all endpoints with `/tool/<toolname>`.

Wait for T363983: [toolforge] Investigate authentication, as it might not be needed

Thu, May 2, 9:56 AM · Toolforge (Toolforge iteration 09)
dcaro added a parent task for T363983: [toolforge] Investigate authentication: T362069: [components-api] Get a skeleton of API webservice and implement `/tool/<toolname>/deploy` with build-only features.
Thu, May 2, 9:54 AM · Toolforge (Toolforge iteration 09)
dcaro added a subtask for T362069: [components-api] Get a skeleton of API webservice and implement `/tool/<toolname>/deploy` with build-only features: T363983: [toolforge] Investigate authentication.
Thu, May 2, 9:54 AM · Toolforge
dcaro triaged T363983: [toolforge] Investigate authentication as High priority.
Thu, May 2, 9:53 AM · Toolforge (Toolforge iteration 09)