
dcaro (David Caro)
SRE & amateur yak shaver

Today

  • No visible events.

Tomorrow

  • No visible events.

Tuesday

  • No visible events.

User Details

User Since
Nov 2 2020, 11:59 AM (265 w, 6 d)
Availability
Available
IRC Nick
dcaro
LDAP User
David Caro
MediaWiki User
DCaro (WMF) [ Global Accounts ]

Recent Activity

Thu, Nov 20

dcaro placed T410421: Add paging alert if Toolforge HAProxy connection limit is reached up for grabs.

I will not be able to get to this.

Thu, Nov 20, 3:50 PM · Patch-For-Review, User-dcaro, cloud-services-team, Toolforge

Wed, Nov 19

dcaro added a comment to T410421: Add paging alert if Toolforge HAProxy connection limit is reached.

Related to T343885: [promethus,haproxy] Move to haproxy internal metrics from haproxy_exporter

Wed, Nov 19, 9:04 AM · Patch-For-Review, User-dcaro, cloud-services-team, Toolforge
dcaro moved T410421: Add paging alert if Toolforge HAProxy connection limit is reached from To refine to Today on the User-dcaro board.
Wed, Nov 19, 8:55 AM · Patch-For-Review, User-dcaro, cloud-services-team, Toolforge
dcaro claimed T410421: Add paging alert if Toolforge HAProxy connection limit is reached.
Wed, Nov 19, 8:53 AM · Patch-For-Review, User-dcaro, cloud-services-team, Toolforge
dcaro triaged T410421: Add paging alert if Toolforge HAProxy connection limit is reached as High priority.
Wed, Nov 19, 8:37 AM · Patch-For-Review, User-dcaro, cloud-services-team, Toolforge
dcaro triaged T409328: sso failure in codfw1dev (labtesthorizon.wikimedia.org) as High priority.
Wed, Nov 19, 8:37 AM · Infrastructure-Foundations, CAS-SSO, cloud-services-team, Cloud-VPS
dcaro triaged T340180: Use GitLab CI to upload packages to the toolsbeta repo as Medium priority.
Wed, Nov 19, 8:37 AM · Patch-Needs-Improvement, cloud-services-team, Toolforge
dcaro triaged T410046: [jobs-cli] provides no meaningful feedback for restart as Low priority.
Wed, Nov 19, 8:36 AM · cloud-services-team, Toolforge
dcaro triaged T410048: [jobs-cli] provides no meaningful feedback for delete as Low priority.
Wed, Nov 19, 8:36 AM · cloud-services-team, Toolforge
dcaro triaged T410055: [logs-api] `--follow` returns inconsistent/artificial log entries as Medium priority.
Wed, Nov 19, 8:36 AM · cloud-services-team, Toolforge
dcaro triaged T410058: [builds-api] support specifying tag in build as Low priority.
Wed, Nov 19, 8:36 AM · cloud-services-team, Toolforge
dcaro renamed T410102: [bastion] redis-cli is absent from tools bastion hosts from redis-cli is absent from tools bastion hosts to [bastion] redis-cli is absent from tools bastion hosts.
Wed, Nov 19, 8:35 AM · User-bd808, cloud-services-team, Toolforge
dcaro triaged T410102: [bastion] redis-cli is absent from tools bastion hosts as Low priority.
Wed, Nov 19, 8:35 AM · User-bd808, cloud-services-team, Toolforge
dcaro triaged T410148: tofu-infra: add cinder volume types as Medium priority.
Wed, Nov 19, 8:35 AM · Cloud-VPS, User-aborrero, cloud-services-team
dcaro triaged T410265: [tofu-infra] "tofu plan" failing in codfw as Medium priority.
Wed, Nov 19, 8:34 AM · Cloud-VPS, cloud-services-team
dcaro triaged T410294: Site: codfw 1 VM request for codfw1dev CAS test/dev, hostname: cloudidp2001-dev as Medium priority.
Wed, Nov 19, 8:34 AM · Cloud-VPS, cloud-services-team, vm-requests, Infrastructure-Foundations, SRE
dcaro triaged T410382: Ensure ingress pods get scheduled on ingress nodes as Medium priority.
Wed, Nov 19, 8:33 AM · cloud-services-team, Toolforge
dcaro triaged T410410: Fix pacct rotation properly everywhere as Low priority.
Wed, Nov 19, 8:32 AM · cloud-services-team, Cloud-VPS
dcaro moved T410421: Add paging alert if Toolforge HAProxy connection limit is reached from Inbox to Clinic Duty on the cloud-services-team board.
Wed, Nov 19, 8:31 AM · Patch-For-Review, User-dcaro, cloud-services-team, Toolforge
dcaro moved T410470: cloudvirt1071 crash from Hardware to FY2025/26-Q1-Q2 on the cloud-services-team board.
Wed, Nov 19, 8:30 AM · cloud-services-team (FY2025/26-Q1-Q2), Cloud-VPS
dcaro moved T410470: cloudvirt1071 crash from Inbox to Hardware on the cloud-services-team board.
Wed, Nov 19, 8:30 AM · cloud-services-team (FY2025/26-Q1-Q2), Cloud-VPS
dcaro triaged T410470: cloudvirt1071 crash as High priority.
Wed, Nov 19, 8:30 AM · cloud-services-team (FY2025/26-Q1-Q2), Cloud-VPS

Tue, Nov 18

dcaro updated the task description for T359650: [jobs-api] Create storage layer, and save business models in persistent storage.
Tue, Nov 18, 6:27 PM · Toolforge (Toolforge iteration 25), User-Raymond_Ndibe
dcaro added a comment to T410352: "best-of" tool Unexpectedly Returning 500s.

Looks the same to me:

2025-11-18T10:22:37.611073+00:00 tools-k8s-haproxy-8 haproxy[766]: 213.55.247.35:24591 [18/Nov/2025:10:22:37.577] k8s-ingress-https~ k8s-ingress-http/tools-k8s-ingress-8.tools.eqiad1.wikimedia.cloud 0/0/3/28/31 500 18300 - - PH-- 1983/1716/489/183/0 0/0 "GET / HTTP/1.1" 0/0000000000000000/0/0/0 tool-db-usage.toolforge.org/TLSv1.3/TLS_AES_256_GCM_SHA384 host:"tool-db-usage.toolforge.org"
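Here is a minimal Python sketch for decoding a line like this one. It is not an existing tool; it extracts the standard HTTP-log fields.

It assumes HAProxy's default HTTP log layout between the frontend name and the quoted request, and it ignores the custom fields that follow. The code tables are a partial list paraphrased from HAProxy's logging documentation.

In that layout the four-character `PH--` field is the termination state:
- `P` says the proxy aborted the session.
- `H` says it was waiting for the server's response headers at the time.
- The last two characters only matter for persistence cookies.

```python
import re

# Partial tables of HAProxy termination-state codes (paraphrased from the HAProxy logging docs).
SESSION_END = {
    "C": "aborted by the client",
    "S": "aborted or refused by the server",
    "P": "prematurely aborted by the proxy (limit, deny rule or security check)",
    "L": "processed locally by HAProxy, not passed to a server",
    "R": "a resource on the proxy was exhausted",
    "I": "internal error detected by the proxy",
    "c": "client-side timeout",
    "s": "server-side timeout",
    "-": "normal completion",
}
SESSION_PHASE = {
    "R": "waiting for a complete, valid request from the client",
    "Q": "waiting in the queue for a connection slot",
    "C": "waiting for the connection to the server",
    "H": "waiting for complete, valid response headers from the server",
    "D": "data transfer phase",
    "L": "sending the last data to the client",
    "T": "request was tarpitted",
    "-": "normal completion",
}

# frontend backend/server TR/Tw/Tc/Tr/Ta status bytes req_cookie resp_cookie
# termination_state actconn/feconn/beconn/srv_conn/retries srv_queue/backend_queue "request"
HTTP_LOG_RE = re.compile(
    r"(?P<frontend>\S+) (?P<backend>[^\s/]+)/(?P<server>\S+) "
    r"(?P<timers>[-+\d]+(?:/[-+\d]+){4}) "
    r"(?P<status>-?\d+) (?P<bytes>\+?\d+) "
    r"(?P<req_cookie>\S+) (?P<resp_cookie>\S+) "
    r"(?P<term>\S{4}) "
    r"(?P<conns>[+\d]+(?:/[+\d]+){4}) "
    r"(?P<queues>\d+/\d+) "
    r'"(?P<request>[^"]*)"'
)


def parse_http_log(line: str) -> dict:
    """Extract the standard HTTP-log fields from one HAProxy log line."""
    m = HTTP_LOG_RE.search(line)
    if m is None:
        raise ValueError("not an HAProxy HTTP log line")
    f = m.groupdict()
    term = f["term"]
    return {
        "frontend": f["frontend"],
        "backend": f["backend"],
        "server": f["server"],
        "timers_ms": [int(t) for t in f["timers"].split("/")],  # TR/Tw/Tc/Tr/Ta
        "status": int(f["status"]),
        "bytes": int(f["bytes"]),
        "termination_state": term,
        "ended": SESSION_END.get(term[0], "unknown"),
        "phase": SESSION_PHASE.get(term[1], "unknown"),
        "conns": [int(c) for c in f["conns"].split("/")],  # actconn/feconn/beconn/srv_conn/retries
        "request": f["request"],
    }
```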
Tue, Nov 18, 10:24 AM · Tools, Toolforge, cloud-services-team
dcaro added a comment to T410352: "best-of" tool Unexpectedly Returning 500s.

Aren't the PH flags part of the other set of flags?

Tue, Nov 18, 10:19 AM · Tools, Toolforge, cloud-services-team
dcaro added a comment to T410352: "best-of" tool Unexpectedly Returning 500s.

We were also seeing some crawlers hitting HAProxy:

image.png (348×2 px, 104 KB)

Tue, Nov 18, 9:18 AM · Tools, Toolforge, cloud-services-team
dcaro added a comment to T410352: "best-of" tool Unexpectedly Returning 500s.

This might not be related, but it certainly does not help that our ingress pods are being scheduled on nodes other than the ingress workers, due (I think) to memory constraints: the ingress workers have ~8G of memory, ~3G of which is already used by the minimal setup, while we already request 5G for the ingress containers:

root@tools-k8s-control-9:~# kubectl get deployment -n ingress-nginx-gen2 ingress-nginx-gen2-controller -o json | jq '.spec.template.spec.containers[].resources'
{
  "limits": {
    "cpu": "3",
    "memory": "6G"
  },
  "requests": {
    "cpu": "2",
    "memory": "5G"
  }
}
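The arithmetic can be sketched in Python as follows. The sketch rests on three assumptions:
- The scheduler's memory check compares *requests* (not actual usage) against the node's allocatable memory.
- The ~3G used by the minimal setup corresponds to requests of pods already on the node.
- The allocatable values are illustrative; `kubectl describe node <node>` shows the real "Allocatable" and "Allocated resources" figures.

```python
# Kubernetes memory quantities: "G" is decimal (10**9 bytes); "Gi" would be 2**30.
G = 10**9


def fits_memory_request(allocatable: int, requested_on_node: list[int], pod_request: int) -> bool:
    """Mimic the scheduler's memory check: the new pod's request must fit in
    allocatable memory minus the requests of pods already bound to the node.
    Actual memory usage is not part of this check."""
    free = allocatable - sum(requested_on_node)
    return pod_request <= free


# Illustrative numbers only: ~3G of requests already on the node, 5G requested for ingress.
existing = [3 * G]
print(fits_memory_request(8 * G, existing, 5 * G))         # True: fits exactly, zero headroom
print(fits_memory_request(int(7.5 * G), existing, 5 * G))  # False: slightly less allocatable
```

Even with the full 8G allocatable, the 5G request only just fits, with no headroom. Allocatable is normally somewhat below physical memory (system and kube reservations, eviction thresholds). On a ~8G worker the pod will therefore likely not fit and gets scheduled elsewhere.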
Tue, Nov 18, 9:02 AM · Tools, Toolforge, cloud-services-team
dcaro added a comment to T410352: "best-of" tool Unexpectedly Returning 500s.

That did not take long:

In [10]: print(response.text)
<!DOCTYPE html>
<html lang="en">
<meta charset="utf-8">
<title>Wikimedia Error</title>
<link rel="shortcut icon" href="https://tools-static.wmflabs.org/admin/errors/favicon.ico">
<style>
* { margin: 0; padding: 0; }
body { background: #fff; font: 15px/1.6 sans-serif; color: #333; }
.content { margin: 7% auto 0; padding: 2em 1em 1em; max-width: 640px; display: flex; flex-direction: row; flex-wrap: wrap; }
.footer { clear: both; margin-top: 14%; border-top: 1px solid #e5e5e5; background: #f9f9f9; padding: 2em 0; font-size: 0.8em; text-align: center; }
img { margin: 0 2em 2em 0; }
a img { border: 0; }
h1 { margin-top: 1em; font-size: 1.2em; }
.content-text { flex: 1; }
p { margin: 0.7em 0 1em 0; }
a { color: #0645ad; text-decoration: none; }
a:hover { text-decoration: underline; }
code { font-family: sans-serif; }
summary { font-weight: bold; cursor: pointer; }
details[open] { background: #970302; color: #dfdedd; }
.text-muted { color: #777; }
@media (prefers-color-scheme: dark) {
  a { color: #9e9eff; }
  body { background: transparent; color: #ddd; }
  .footer { border-top: 1px solid #444; background: #060606; }
  #logo { filter: invert(1) hue-rotate(180deg); }
  .text-muted { color: #888; }
}
</style>
<meta name="color-scheme" content="light dark">
<div class="content" role="main">
<a href="https://wikitech.wikimedia.org/wiki/Portal:Toolforge"><img id="logo" src="https://tools-static.wmflabs.org/admin/errors/toolforge-logo.png" srcset="https://tools-static.wmflabs.org/admin/errors/toolforge-logo-2x.png 2x" alt="Wikimedia Toolforge" width="120" height="120">
</a>
<div class="content-text">
<h1>Wikimedia Toolforge Error</h1>
Tue, Nov 18, 8:57 AM · Tools, Toolforge, cloud-services-team
dcaro added a comment to T410352: "best-of" tool Unexpectedly Returning 500s.

I'm doing a quick check by running a while loop that requests that URL:

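import time
import requests

url = "https://best-of.toolforge.org/api/category/random?foo=7"
response = requests.get(url)
# Poll once per second until the tool returns something other than HTTP 200
while response.status_code == 200:
    time.sleep(1)
    print(".")
    response = requests.get(url)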

while watching the k8s ingress logs and the webservice logs, to see if I can catch it when it returns a 500.

Tue, Nov 18, 8:18 AM · Tools, Toolforge, cloud-services-team

Mon, Nov 17

dcaro edited P84261 (An Untitled Masterwork).
Mon, Nov 17, 5:50 PM
dcaro updated the task description for T407477: [docs] update all readmes with the same deployment docs.
Mon, Nov 17, 5:47 PM · Patch-For-Review, Toolforge (Toolforge iteration 25)
dcaro edited P84261 (An Untitled Masterwork).
Mon, Nov 17, 5:42 PM
dcaro edited P84261 (An Untitled Masterwork).
Mon, Nov 17, 5:41 PM
dcaro added a comment to T408387: CloudVPS instance for ProVe.

@Odinaldo you'll need to create developer accounts (https://www.mediawiki.org/wiki/Developer_account), or, if you already have one, link it to your Phabricator account (using the management link on the wiki page for the developer account).

Mon, Nov 17, 5:19 PM · cloud-services-team (FY2025/26-Q1-Q2), Cloud-VPS (Project-requests)
dcaro claimed T408387: CloudVPS instance for ProVe.
Mon, Nov 17, 10:52 AM · cloud-services-team (FY2025/26-Q1-Q2), Cloud-VPS (Project-requests)
dcaro added a comment to T408387: CloudVPS instance for ProVe.

Thank you Francesco (and Andrew). We will satisfy the requirements and ensure everything is transparent to allay any community concerns.

What would the next steps be please?

Mon, Nov 17, 10:52 AM · cloud-services-team (FY2025/26-Q1-Q2), Cloud-VPS (Project-requests)

Thu, Nov 13

dcaro updated the task description for T404157: [builds-api, maintain-harbor] fix build/image cleanup.
Thu, Nov 13, 5:44 PM · Toolforge (Toolforge iteration 25), Patch-For-Review
dcaro added a comment to T385604: Decision Request - How openstack projects relate to tofu-infra.

I vote for Option 1 (with Andrew's note about applying it only to non-automatic projects), though Option 3 would be a close second (I don't completely understand the flows, but they look interesting).

Thu, Nov 13, 5:30 PM · Cloud-VPS, cloud-services-team, User-aborrero, Cloud Services Proposals
dcaro added a comment to T410055: [logs-api] `--follow` returns inconsistent/artificial log entries.

@DamianZaremba btw, I think you are the last one using the jobs-api log endpoint; could you move your code to use the logs-api instead? (That way we can remove the logs code from jobs-api :) )

Thu, Nov 13, 5:16 PM · cloud-services-team, Toolforge
dcaro added a comment to T410055: [logs-api] `--follow` returns inconsistent/artificial log entries.

Some notes for whoever implements this:

Thu, Nov 13, 5:11 PM · cloud-services-team, Toolforge
dcaro added a comment to T410055: [logs-api] `--follow` returns inconsistent/artificial log entries.

That would also allow having a 'type' like "internal", to express the fact that no logs have been received yet.

Thu, Nov 13, 5:02 PM · cloud-services-team, Toolforge
dcaro added a comment to T410055: [logs-api] `--follow` returns inconsistent/artificial log entries.

Agreed, we can now try to extend the data structure that logs-api returns (LogEntry). Ideally we would also support different types of logs (build logs, system logs, etc.), so we might want to create a more generic one.
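Here is a minimal sketch of a more generic entry, in Python with hypothetical field names; the actual logs-api `LogEntry` fields aren't shown here.
- The `type` field separates job, build and system logs.
- An `internal` value can carry messages such as "no logs yet". Clients like `--follow` can then show or hide these instead of inventing artificial log lines.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from enum import Enum


class LogType(str, Enum):
    JOB = "job"
    BUILD = "build"
    SYSTEM = "system"
    INTERNAL = "internal"  # e.g. "no logs received yet" while following


@dataclass
class LogEntry:
    type: LogType
    message: str
    source: str  # hypothetical: job name, build id, component, ...
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def to_dict(self) -> dict:
        """JSON-friendly representation (enum value and ISO 8601 timestamp)."""
        data = asdict(self)
        data["type"] = self.type.value
        data["timestamp"] = self.timestamp.isoformat()
        return data
```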

Thu, Nov 13, 5:01 PM · cloud-services-team, Toolforge
dcaro added a comment to T410048: [jobs-cli] provides no meaningful feedback for delete.

I think we should return an info message (in the wrapper structure of the jobs-api responses, https://api-docs.toolforge.org/docs#/Jobs/jobs_list) saying "Job <myjob> deleted."

image.png (296×460 px, 15 KB)

Thu, Nov 13, 4:50 PM · cloud-services-team, Toolforge
dcaro added a comment to T410058: [builds-api] support specifying tag in build.

Can you elaborate a bit on the flow you have in mind? Would that be sorted if components-api allowed changing the image name?

Thu, Nov 13, 4:48 PM · cloud-services-team, Toolforge
dcaro added a comment to T410048: [jobs-cli] provides no meaningful feedback for delete.

I think it was an early decision on the jobs-cli side not to return anything when everything went well. I'm OK with changing it. In other CLIs we return all the information needed to recreate the job (or object) when it's deleted, in case the deletion was a mistake; that might be a good option here too.
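Here is a Python sketch of both ideas: an info message like "Job <myjob> deleted." plus the deleted definition echoed back. The `messages` wrapper and `deleted_job` key are hypothetical, not the documented jobs-api schema. The job fields are illustrative.

```python
import json


def delete_job_response(job: dict) -> dict:
    """Hypothetical jobs-api reply for a successful delete.

    The "messages" wrapper and the "deleted_job" key are illustrative
    assumptions, not the documented jobs-api schema.
    """
    return {
        "messages": {"info": [f"Job {job['name']} deleted."], "warning": [], "error": []},
        # Echo the full definition so the user can recreate the job if the delete was a mistake.
        "deleted_job": job,
    }


def render_delete(response: dict) -> str:
    """What a CLI could print for a delete_job_response() result."""
    lines = list(response["messages"]["info"])
    lines.append("To recreate it, use this definition:")
    lines.append(json.dumps(response["deleted_job"], indent=2, sort_keys=True))
    return "\n".join(lines)


print(render_delete(delete_job_response({"name": "myjob", "image": "python3.11", "command": "./run.sh"})))
```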

Thu, Nov 13, 3:59 PM · cloud-services-team, Toolforge
dcaro moved T408707: [jobs-api] apply topology constraints from Next Up to Done on the Toolforge (Toolforge iteration 25) board.
Thu, Nov 13, 3:55 PM · Toolforge (Toolforge iteration 25), Patch-For-Review, cloud-services-team
dcaro edited projects for T408707: [jobs-api] apply topology constraints, added: Toolforge (Toolforge iteration 25); removed Toolforge.
Thu, Nov 13, 3:54 PM · Toolforge (Toolforge iteration 25), Patch-For-Review, cloud-services-team
dcaro renamed T401110: [jobs-api,jobs-cli] Support `--timeout` for one-off jobs from Support `--timeout` for one-off jobs to [jobs-api,jobs-cli] Support `--timeout` for one-off jobs.
Thu, Nov 13, 3:19 PM · Toolforge, cloud-services-team
dcaro renamed T401422: [jobs-cli,logs-api] `toolforge jobs logs` breaks on long log lines from [TjfCliError] `toolforge jobs logs` breaks on long log lines to [jobs-cli,logs-api] `toolforge jobs logs` breaks on long log lines.
Thu, Nov 13, 3:19 PM · cloud-services-team, Toolforge
dcaro renamed T401552: [logs-api,jobs-cli] `toolforge jobs logs` has inconsistent ordering from `toolforge jobs logs` has inconsistent ordering to [logs-api,jobs-cli] `toolforge jobs logs` has inconsistent ordering.
Thu, Nov 13, 3:19 PM · cloud-services-team, Toolforge
dcaro moved T409847: [maintain-kubeusers,maintain-dbusers] user homes are not readable by replica_cnf so it fails to create replica.my.cnf files from In Progress to Done on the Toolforge (Toolforge iteration 25) board.
Thu, Nov 13, 1:45 PM · cloud-services-team (FY2025/26-Q1-Q2), Patch-For-Review, Toolforge (Toolforge iteration 25)
dcaro added a comment to T409970: Increase volume storage on project analytics.

+1

Thu, Nov 13, 11:58 AM · Cloud-VPS (Quota-requests)
dcaro added a comment to T409981: Request increased build quota for MilHistBot Toolforge tool.

+1

Thu, Nov 13, 11:58 AM · Toolforge (Quota-requests)
dcaro added a comment to T390885: Check for non-libre vscode-server installs/processes on Toolforge bastions.

I was hoping one way to do this would be to null route the domain name where VS Code downloads the server binary from. https://code.visualstudio.com/docs/remote/ssh#_what-are-the-connectivity-requirements-for-the-vs-code-server-when-it-is-running-on-a-remote-machine-vm suggests that update.code.visualstudio.com is that domain name, but it also suggests that the software will try to work around such network restrictions. Still worth a try IMHO.

Thu, Nov 13, 9:31 AM · Toolforge, cloud-services-team
dcaro added a comment to T410009: SSH session hangs after authentication for user delemike on login.toolforge.org. Logs show hang at debug1: pledge: filesystem..

You are reaching the limit of open SSH sessions by using VS Code remotely.

Thu, Nov 13, 9:20 AM · cloud-services-team, Toolforge
dcaro closed T409847: [maintain-kubeusers,maintain-dbusers] user homes are not readable by replica_cnf so it fails to create replica.my.cnf files as Resolved.

Fixed the permissions issue, added metrics, alerts, runbooks, and dashboard.

Thu, Nov 13, 9:12 AM · cloud-services-team (FY2025/26-Q1-Q2), Patch-For-Review, Toolforge (Toolforge iteration 25)
dcaro added a comment to T409029: Flapping wikitech-static icinga alert.

Last night it was especially flappy (almost every hour):

image.png (705×1 px, 356 KB)

Thu, Nov 13, 8:54 AM · wikitech.wikimedia.org, cloud-services-team

Wed, Nov 12

dcaro added a comment to T409191: [jobs-api] Investigate if we can reuse the 'web' flavour pre-built images as regular images.

quick throw-away script for simple deployments in lima-kilo using the web images:

That's ok, but can you test if we can use them as jobs from jobs-api?
I'm sure they can be pulled and run as plain images; the key point is running them as jobs (env vars, entrypoints, resources, security policies, ...).

For that, you can try using the image-config patch you created in lima-kilo and start one job for each image type (using jobs.yaml might be easier), making sure each one runs OK (e.g. logging some string and checking that the logs are sent correctly).
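Here is a possible way to generate that jobs.yaml, sketched in Python.
- The image names are hypothetical placeholders.
- The `name`/`image`/`command` keys are assumed from the jobs.yaml format.
- With no `schedule` or `continuous` key, each entry should run as a one-off job that just prints a marker string.

```python
import json
import re

# Hypothetical image names: the real list of 'web' flavour images is not given here.
WEB_IMAGES = ["node18-web", "php8.2-web", "python3.11-web"]


def smoke_test_jobs(images: list[str]) -> list[dict]:
    """One one-off job per image, each printing a marker string to look for in the logs."""
    jobs = []
    for image in images:
        # Assumes job names should only contain lowercase alphanumerics and '-'.
        name = "smoke-" + re.sub(r"[^a-z0-9-]", "-", image.lower())
        jobs.append({"name": name, "image": image, "command": f"echo smoke-test-ok {image}"})
    return jobs


def to_jobs_yaml(jobs: list[dict]) -> str:
    """Serialize as a YAML list; json.dumps yields valid YAML double-quoted scalars."""
    out = []
    for job in jobs:
        for i, (key, value) in enumerate(job.items()):
            prefix = "- " if i == 0 else "  "
            out.append(f"{prefix}{key}: {json.dumps(value)}")
    return "\n".join(out) + "\n"


print(to_jobs_yaml(smoke_test_jobs(WEB_IMAGES)), end="")
```

To use it:
1. Save the output as `jobs.yaml`.
2. Load it with `toolforge jobs load jobs.yaml`.
3. Check each job's output for `smoke-test-ok` with `toolforge jobs logs <name>`.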

this might likely require creating a test PR for jobs-api since we can't just deploy webservice images in jobs-api rn. I'll do that and respond here

Wed, Nov 12, 4:46 PM · Toolforge (Toolforge iteration 25)
dcaro moved T408574: [jobs-api] handle qualified image names from Next Up to Done on the Toolforge (Toolforge iteration 25) board.
Wed, Nov 12, 4:22 PM · Toolforge (Toolforge iteration 25), Patch-For-Review, cloud-services-team
dcaro closed T408574: [jobs-api] handle qualified image names as Resolved.
Wed, Nov 12, 4:21 PM · Toolforge (Toolforge iteration 25), Patch-For-Review, cloud-services-team
dcaro closed T408574: [jobs-api] handle qualified image names, a subtask of T403322: [builds-api] return image digest, as Resolved.
Wed, Nov 12, 4:21 PM · Patch-For-Review, cloud-services-team, Toolforge
dcaro closed T409007: [jobs-api] failed to create job from components as Resolved.
Wed, Nov 12, 3:26 PM · Toolforge (Toolforge iteration 25), cloud-services-team
dcaro edited projects for T409007: [jobs-api] failed to create job from components, added: Toolforge (Toolforge iteration 25); removed Toolforge.
Wed, Nov 12, 3:26 PM · Toolforge (Toolforge iteration 25), cloud-services-team
dcaro updated the task description for T409727: [builds-api,harbor,image-config] Move pre-built images to harbor.
Wed, Nov 12, 12:11 PM · Patch-For-Review, Toolforge (Toolforge iteration 25), cloud-services-team
dcaro updated the task description for T409727: [builds-api,harbor,image-config] Move pre-built images to harbor.
Wed, Nov 12, 12:09 PM · Patch-For-Review, Toolforge (Toolforge iteration 25), cloud-services-team
dcaro added a comment to T409847: [maintain-kubeusers,maintain-dbusers] user homes are not readable by replica_cnf so it fails to create replica.my.cnf files.

Merged the above patch, and things started getting unstuck:

Nov 12 10:28:58 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user vincentvega
Nov 12 10:28:58 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user joelyrookewmde
Nov 12 10:28:58 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user suzannewood
Nov 12 10:28:58 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user fritzbeing
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user shr0x-ya
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user ritika-bhambri11
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user goldenjdm
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user tmwyk
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user sadrettin
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user piastu
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user aydoh8
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user swampl
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user khajitdadddy
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user kspiers
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user weeks
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user jiji
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user vsdetoniprojetomais
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user gonyeahialam
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user elementaler7
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user guyfawcus
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user sisyph
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user olafjanssen
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user pfischer
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user devdoingdev
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user tausheefhassan
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user imanoobg
Nov 12 10:30:00 cloudcontrol1007 maintain-dbusers[2677814]: DEBUG [root.populate_accountsdb:751] Found 0 new tool accounts () and 0 removed tool accounts ()
Nov 12 10:31:35 cloudcontrol1007 maintain-dbusers[2677814]: DEBUG [root.populate_accountsdb:751] Found 1 new user accounts (rashitige) and 0 removed user accounts ()
Nov 12 10:31:35 cloudcontrol1007 maintain-dbusers[2677814]: DEBUG [urllib3.connectionpool._new_conn:1049] Starting new HTTPS connection (1): nfs.svc.toolforge.org:443
Nov 12 10:31:35 cloudcontrol1007 maintain-dbusers[2677814]: DEBUG [urllib3.connectionpool._make_request:544] https://nfs.svc.toolforge.org:443 "POST /v1/write-replica-cnf HTTP/1.1
Wed, Nov 12, 10:39 AM · cloud-services-team (FY2025/26-Q1-Q2), Patch-For-Review, Toolforge (Toolforge iteration 25)
dcaro updated subscribers of T409900: [jobs-api,image-config] Deprecate/update the list of supported pre-built images.

Maybe @komla can help here too

Wed, Nov 12, 10:14 AM · Toolforge (Toolforge iteration 25), cloud-services-team
dcaro created T409900: [jobs-api,image-config] Deprecate/update the list of supported pre-built images.
Wed, Nov 12, 10:14 AM · Toolforge (Toolforge iteration 25), cloud-services-team
dcaro added a comment to T409191: [jobs-api] Investigate if we can reuse the 'web' flavour pre-built images as regular images.

quick throw-away script for simple deployments in lima-kilo using the web images:

Wed, Nov 12, 9:34 AM · Toolforge (Toolforge iteration 25)
dcaro triaged T409725: [jobs-api,webservice] Fetch images from builds-api as Medium priority.
Wed, Nov 12, 8:41 AM · Toolforge (Toolforge iteration 25), cloud-services-team
dcaro triaged T409726: [builds-api] Add an endpoint to get all available images as Medium priority.
Wed, Nov 12, 8:41 AM · Patch-For-Review, Toolforge (Toolforge iteration 25), cloud-services-team
dcaro triaged T409727: [builds-api,harbor,image-config] Move pre-built images to harbor as Medium priority.
Wed, Nov 12, 8:41 AM · Patch-For-Review, Toolforge (Toolforge iteration 25), cloud-services-team
dcaro triaged T409728: [image-config] deprecate and move all data to builds-api as Medium priority.
Wed, Nov 12, 8:41 AM · Toolforge (Toolforge iteration 25), cloud-services-team
dcaro added a comment to T324487: Tool seems not to work.

It's hosted on GitHub, according to https://wikitech.wikimedia.org/wiki/Tool:Import-500px

Wed, Nov 12, 8:01 AM · import-500px

Tue, Nov 11

dcaro added a comment to T409847: [maintain-kubeusers,maintain-dbusers] user homes are not readable by replica_cnf so it fails to create replica.my.cnf files.

I think this should avoid the current errors: https://gerrit.wikimedia.org/r/c/operations/puppet/+/991653

Tue, Nov 11, 5:46 PM · cloud-services-team (FY2025/26-Q1-Q2), Patch-For-Review, Toolforge (Toolforge iteration 25)
dcaro added a comment to T409847: [maintain-kubeusers,maintain-dbusers] user homes are not readable by replica_cnf so it fails to create replica.my.cnf files.

I think it's failing to commit the fact that some users were already created, so it also counts them as created again every time.

Tue, Nov 11, 5:26 PM · cloud-services-team (FY2025/26-Q1-Q2), Patch-For-Review, Toolforge (Toolforge iteration 25)
dcaro added a comment to T409847: [maintain-kubeusers,maintain-dbusers] user homes are not readable by replica_cnf so it fails to create replica.my.cnf files.

It seems there are currently ~19 accounts affected:

root@cloudcontrol1007:~# journalctl -u maintain-dbusers.service -n 10000 | grep 'problem populating' | grep -o account_id.* | sort | uniq -c 
    134 account_id chk2605 failed without response.
    134 account_id davenyi failed without response.
    135 account_id devdoingdev failed without response.
    135 account_id elementaler7 failed without response.
    134 account_id fritzbeing failed without response.
    134 account_id hokwelum failed without response.
    135 account_id imanoobg failed without response.
    135 account_id jiji failed without response.
    134 account_id jordylizana failed without response.
    135 account_id khajitdadddy failed without response.
    135 account_id olafjanssen failed without response.
    135 account_id piastu failed without response.
    134 account_id sadrettin failed without response.
    135 account_id sisyph failed without response.
    135 account_id swampl failed without response.
    135 account_id tausheefhassan failed without response.
    134 account_id tmwyk failed without response.
    134 account_id vincentvega failed without response.
    135 account_id vsdetoniprojetomais failed without response.
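The same count can be done in Python. This sketch assumes every failure line contains both `problem populating` and `account_id <name> failed without response`, as in the output above.

```python
import re
from collections import Counter

# Matches the tail of the failure lines, e.g. "account_id chk2605 failed without response."
FAILURE_RE = re.compile(r"account_id (\S+) failed without response")


def affected_accounts(lines) -> Counter:
    """Roughly: grep 'problem populating' | grep -o 'account_id.*' | sort | uniq -c"""
    counts = Counter()
    for line in lines:
        if "problem populating" not in line:
            continue
        match = FAILURE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def report(counts: Counter) -> str:
    """One line per account, sorted by name like `sort | uniq -c`, plus a total."""
    rows = [f"{n:7d} {account}" for account, n in sorted(counts.items())]
    rows.append(f"{len(counts)} accounts affected")
    return "\n".join(rows)
```

For example, call `report(affected_accounts(sys.stdin))` in a script fed by `journalctl -u maintain-dbusers.service -n 10000`.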
Tue, Nov 11, 5:07 PM · cloud-services-team (FY2025/26-Q1-Q2), Patch-For-Review, Toolforge (Toolforge iteration 25)
dcaro added a comment to T409847: [maintain-kubeusers,maintain-dbusers] user homes are not readable by replica_cnf so it fails to create replica.my.cnf files.

Populating the accounts seems to have started failing on the 29th of September:

image.png (371×462 px, 23 KB)

https://grafana.wikimedia.org/goto/_feo-rkDR?orgId=1

Tue, Nov 11, 4:16 PM · cloud-services-team (FY2025/26-Q1-Q2), Patch-For-Review, Toolforge (Toolforge iteration 25)
dcaro created T409847: [maintain-kubeusers,maintain-dbusers] user homes are not readable by replica_cnf so it fails to create replica.my.cnf files.
Tue, Nov 11, 4:09 PM · cloud-services-team (FY2025/26-Q1-Q2), Patch-For-Review, Toolforge (Toolforge iteration 25)

Mon, Nov 10

dcaro removed a subtask for T348755: [jobs-api,webservice] Run webservices via the jobs framework: T409725: [jobs-api,webservice] Fetch images from builds-api.
Mon, Nov 10, 1:20 PM · Toolforge (Toolforge iteration 25), cloud-services-team, User-Raymond_Ndibe, Epic
dcaro removed a parent task for T409725: [jobs-api,webservice] Fetch images from builds-api: T348755: [jobs-api,webservice] Run webservices via the jobs framework.
Mon, Nov 10, 1:20 PM · Toolforge (Toolforge iteration 25), cloud-services-team
dcaro added a parent task for T409725: [jobs-api,webservice] Fetch images from builds-api: T409728: [image-config] deprecate and move all data to builds-api.
Mon, Nov 10, 1:19 PM · Toolforge (Toolforge iteration 25), cloud-services-team
dcaro added a subtask for T409728: [image-config] deprecate and move all data to builds-api: T409725: [jobs-api,webservice] Fetch images from builds-api.
Mon, Nov 10, 1:19 PM · Toolforge (Toolforge iteration 25), cloud-services-team
dcaro created T409728: [image-config] deprecate and move all data to builds-api.
Mon, Nov 10, 1:19 PM · Toolforge (Toolforge iteration 25), cloud-services-team
dcaro updated the task description for T409725: [jobs-api,webservice] Fetch images from builds-api.
Mon, Nov 10, 1:17 PM · Toolforge (Toolforge iteration 25), cloud-services-team
dcaro updated the task description for T409726: [builds-api] Add an endpoint to get all available images.
Mon, Nov 10, 1:17 PM · Patch-For-Review, Toolforge (Toolforge iteration 25), cloud-services-team
dcaro created T409727: [builds-api,harbor,image-config] Move pre-built images to harbor.
Mon, Nov 10, 1:16 PM · Patch-For-Review, Toolforge (Toolforge iteration 25), cloud-services-team
dcaro created T409726: [builds-api] Add an endpoint to get all available images.
Mon, Nov 10, 1:07 PM · Patch-For-Review, Toolforge (Toolforge iteration 25), cloud-services-team
dcaro created T409725: [jobs-api,webservice] Fetch images from builds-api.
Mon, Nov 10, 1:06 PM · Toolforge (Toolforge iteration 25), cloud-services-team

Nov 6 2025

dcaro updated the task description for T388092: [jobs-api] expose jobs-api continuous jobs to the internet via `toolname.toolforge.org`, just like webservice.
Nov 6 2025, 5:31 PM · Toolforge (Toolforge iteration 25), cloud-services-team (FY2025/26-Q1-Q2), User-Raymond_Ndibe, Epic
dcaro added a comment to T409404: [toolsdb] Add filesystem space alerts.

That sounds good to me, yes; it's similar to the other space alerts.

Nov 6 2025, 3:35 PM · Patch-For-Review, cloud-services-team (FY2025/26-Q1-Q2), Toolforge, Sustainability (Incident Followup)
dcaro closed T404199: [prometheus,infra] 2025-09-10 tools-prometheus-9 down as Resolved.

It seems the memory limit has completely stopped the full outages, so I'll close this, as the main issue has been worked around. It might be good to investigate the queries that kill it, but right now we don't have the capacity to dig deeper.

Nov 6 2025, 2:52 PM · Toolforge (Toolforge iteration 25)
dcaro moved T409009: [functional tests] leave a mess behind from Backlog to Ready to be worked on on the Toolforge board.
Nov 6 2025, 2:49 PM · cloud-services-team, Toolforge
dcaro added a comment to T409009: [functional tests] leave a mess behind.

The cleanup of what's in the home dir can happen at the start of the tests, so that if anything fails, you still have the leftovers to investigate.
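Here is a minimal Python sketch of cleaning at setup instead of at teardown. It assumes the tests keep their files in a dedicated directory (for example a hypothetical `~/functional-tests`) rather than wiping the whole home directory.

```python
import shutil
from pathlib import Path


def reset_test_dir(path: Path) -> None:
    """Wipe leftovers from the previous run *before* the tests start.

    Cleaning at setup rather than at teardown means a failed run leaves its
    files in place for investigation until the next run begins.
    """
    if path.exists():
        shutil.rmtree(path)
    path.mkdir(parents=True)
```

Call `reset_test_dir(Path.home() / "functional-tests")` (hypothetical path) at the start of each run. The previous run's files are then kept until that point, so they can still be inspected after a failure.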

Nov 6 2025, 2:49 PM · cloud-services-team, Toolforge
dcaro triaged T409009: [functional tests] leave a mess behind as Medium priority.

This was less of an issue in lima-kilo, where you rebuild the environment every now and then, but it becomes more relevant when running in a loop in prod.

Nov 6 2025, 2:48 PM · cloud-services-team, Toolforge
dcaro closed T409047: [elastic] add metrics as Resolved.
Nov 6 2025, 2:43 PM · Toolforge (Toolforge iteration 25)
dcaro added a comment to T409191: [jobs-api] Investigate if we can reuse the 'web' flavour pre-built images as regular images.

Also, what do you mean by "doesn't exist in the toollabs-images repo, but setup is likely like the other node image"? Those do exist there, just in a different revision; for example, for ruby 2.5: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/docker-images/toollabs-images/+/9aaeb88e4af82a42f50146ef4ba97f6932d1e1b6/ruby25-sssd/

Nov 6 2025, 1:58 PM · Toolforge (Toolforge iteration 25)
dcaro assigned T409191: [jobs-api] Investigate if we can reuse the 'web' flavour pre-built images as regular images to Raymond_Ndibe.

Sorry, I did not mean to unassign; I think we both edited at the same time.

Nov 6 2025, 1:54 PM · Toolforge (Toolforge iteration 25)
dcaro placed T409191: [jobs-api] Investigate if we can reuse the 'web' flavour pre-built images as regular images up for grabs.
Nov 6 2025, 8:03 AM · Toolforge (Toolforge iteration 25)

Nov 4 2025

dcaro added a subtask for T348755: [jobs-api,webservice] Run webservices via the jobs framework: T409191: [jobs-api] Investigate if we can reuse the 'web' flavour pre-built images as regular images.
Nov 4 2025, 5:13 PM · Toolforge (Toolforge iteration 25), cloud-services-team, User-Raymond_Ndibe, Epic