User Details
- User Since
- Nov 2 2020, 11:59 AM (265 w, 6 d)
- Availability
- Available
- IRC Nick
- dcaro
- LDAP User
- David Caro
- MediaWiki User
- DCaro (WMF) [ Global Accounts ]
Thu, Nov 20
will not be able to get to this
Wed, Nov 19
Tue, Nov 18
Looks the same to me:
2025-11-18T10:22:37.611073+00:00 tools-k8s-haproxy-8 haproxy[766]: 213.55.247.35:24591 [18/Nov/2025:10:22:37.577] k8s-ingress-https~ k8s-ingress-http/tools-k8s-ingress-8.tools.eqiad1.wikimedia.cloud 0/0/3/28/31 500 18300 - - PH-- 1983/1716/489/183/0 0/0 "GET / HTTP/1.1" 0/0000000000000000/0/0/0 tool-db-usage.toolforge.org/TLSv1.3/TLS_AES_256_GCM_SHA384 host:"tool-db-usage.toolforge.org"
Isn't the PH part of the other set of flags (the termination state)?
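For reference: in HAProxy's HTTP log format the four-character field (here `PH--`) is the session termination state; `P` means the session was aborted by the proxy and `H` means it happened while waiting for/processing response headers (typically an invalid or blocked server response). A quick sketch to pull that field out of lines like the one above; the regex is an assumption about the field layout (state token right before the connection counters), not the official log grammar:

```python
import re

# Example line (truncated) based on the log paste above.
LOG_LINE = (
    '213.55.247.35:24591 [18/Nov/2025:10:22:37.577] k8s-ingress-https~ '
    'k8s-ingress-http/tools-k8s-ingress-8.tools.eqiad1.wikimedia.cloud '
    '0/0/3/28/31 500 18300 - - PH-- 1983/1716/489/183/0 0/0 "GET / HTTP/1.1"'
)

def termination_state(line):
    """Return the 4-char termination state (e.g. 'PH--'), or None."""
    # Assumption: the state sits right before the 5-field conn counters.
    m = re.search(r'\s([A-Za-z-]{4})\s\d+/\d+/\d+/\d+/\d+\s', line)
    return m.group(1) if m else None
```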
We were also having some crawlers hitting the haproxy:
Might not be related, but it certainly does not help that our ingress pods are being scheduled somewhere other than the ingress workers, due to (I think) memory constraints: the ingress workers have ~8G of memory, ~3G of which is already used by the minimal setup, while we already request 5G for the ingress containers:
root@tools-k8s-control-9:~# kubectl get deployment -n ingress-nginx-gen2 ingress-nginx-gen2-controller -o json | jq '.spec.template.spec.containers[].resources'
{
"limits": {
"cpu": "3",
"memory": "6G"
},
"requests": {
"cpu": "2",
"memory": "5G"
}
}
That did not take long:
In [10]: print(response.text)
<!DOCTYPE html>
<html lang="en">
<meta charset="utf-8">
<title>Wikimedia Error</title>
<link rel="shortcut icon" href="https://tools-static.wmflabs.org/admin/errors/favicon.ico">
<style>
* { margin: 0; padding: 0; }
body { background: #fff; font: 15px/1.6 sans-serif; color: #333; }
.content { margin: 7% auto 0; padding: 2em 1em 1em; max-width: 640px; display: flex; flex-direction: row; flex-wrap: wrap; }
.footer { clear: both; margin-top: 14%; border-top: 1px solid #e5e5e5; background: #f9f9f9; padding: 2em 0; font-size: 0.8em; text-align: center; }
img { margin: 0 2em 2em 0; }
a img { border: 0; }
h1 { margin-top: 1em; font-size: 1.2em; }
.content-text { flex: 1; }
p { margin: 0.7em 0 1em 0; }
a { color: #0645ad; text-decoration: none; }
a:hover { text-decoration: underline; }
code { font-family: sans-serif; }
summary { font-weight: bold; cursor: pointer; }
details[open] { background: #970302; color: #dfdedd; }
.text-muted { color: #777; }
@media (prefers-color-scheme: dark) {
a { color: #9e9eff; }
body { background: transparent; color: #ddd; }
.footer { border-top: 1px solid #444; background: #060606; }
#logo { filter: invert(1) hue-rotate(180deg); }
.text-muted { color: #888; }
}
</style>
<meta name="color-scheme" content="light dark">
<div class="content" role="main">
<a href="https://wikitech.wikimedia.org/wiki/Portal:Toolforge"><img id="logo" src="https://tools-static.wmflabs.org/admin/errors/toolforge-logo.png" srcset="https://tools-static.wmflabs.org/admin/errors/toolforge-logo-2x.png 2x" alt="Wikimedia Toolforge" width="120" height="120">
</a>
<div class="content-text">
<h1>Wikimedia Toolforge Error</h1>
I'm doing a quick check, running a while loop requesting that URL:
In [6]: while response.status_code == 200:
...: response = requests.get("https://best-of.toolforge.org/api/category/random?foo=7")
...: time.sleep(1)
...: print(".")
...:
while looking at the k8s ingress logs + webservice logs, to see if I catch it when it returns a 500
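A reusable variant of that loop (a sketch, not what I actually ran): it takes any status-returning callable, so it can be pointed at the URL above with requests, and it stops and reports on the first non-200:

```python
import time

def probe_until_error(fetch, interval=1.0, max_tries=1000):
    """Call fetch() until it returns a non-200 HTTP status code.

    fetch: zero-arg callable returning a status code, e.g.
      lambda: requests.get("https://best-of.toolforge.org/api/category/random?foo=7").status_code
    Returns (last_status, attempts_made).
    """
    for attempt in range(1, max_tries + 1):
        status = fetch()
        if status != 200:
            return status, attempt
        time.sleep(interval)
    return 200, max_tries
```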
Mon, Nov 17
@Odinaldo you'll need to create developer accounts (https://www.mediawiki.org/wiki/Developer_account), or if you have one already, you'll have to link it to your phabricator account (from the management link in the wiki page for the developer account).
Thu, Nov 13
I vote for Option 1 (with Andrew's note about applying it only to non-automatic projects), though Option 3 would be a close second (I don't completely understand the flows, though they look interesting).
@DamianZaremba btw, I think that you are the last one using the jobs-api log endpoint; can you move your code to use the logs-api instead? (so we can remove the logs code from the jobs-api :) ).
Some notes for whoever implements this:
That would also allow having a 'type' that's something like "internal", and expressing there the fact that it has no logs yet.
Agreed, we can now try to extend the data structure that logs-api returns (LogEntry); ideally we would want to support different types of logs too (build logs, system logs, etc.), so we might want to create a more generic one.
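Something like this is what I have in mind; a rough sketch only, where the field and type names are made up here and are not the actual logs-api LogEntry schema:

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical, more generic log entry; names are illustrative only.
class LogType(str, Enum):
    JOB = "job"
    BUILD = "build"
    SYSTEM = "system"
    INTERNAL = "internal"   # e.g. "no logs yet" messages from the API itself

@dataclass
class GenericLogEntry:
    timestamp: str
    message: str
    type: LogType = LogType.JOB
```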
I think that returning an info message (in the wrapper structure of the jobs-api https://api-docs.toolforge.org/docs#/Jobs/jobs_list) with "Job <myjob> deleted." would be enough.
Can you elaborate a bit on the flow you have in mind? Would that be sorted if components-api allowed changing the image name?
I think it was an early decision on the jobs-cli side to not return anything when everything went well; I'm OK with changing it. In other CLIs we return all the information needed to recreate the job (or object) when it's deleted, in case it was a mistake; that might be a good option here too.
+1
+1
You are reaching the limit of open SSH sessions by using VS Code remotely.
Fixed the permissions issue, added metrics, alerts, runbooks, and dashboard.
Tonight it was especially flappy (roughly once an hour):
Wed, Nov 12
Merged the above patch, and things started getting unstuck:
Nov 12 10:28:58 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user vincentvega
Nov 12 10:28:58 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user joelyrookewmde
Nov 12 10:28:58 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user suzannewood
Nov 12 10:28:58 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user fritzbeing
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user shr0x-ya
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user ritika-bhambri11
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user goldenjdm
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user tmwyk
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user sadrettin
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user piastu
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user aydoh8
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user swampl
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user khajitdadddy
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user kspiers
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user weeks
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user jiji
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user vsdetoniprojetomais
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user gonyeahialam
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user elementaler7
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user guyfawcus
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user sisyph
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user olafjanssen
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user pfischer
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user devdoingdev
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user tausheefhassan
Nov 12 10:28:59 cloudcontrol1007 maintain-dbusers[2677814]: INFO [root._create_accounts_on_host:1014] Created account in clouddb1020.eqiad.wmnet:3363 for user imanoobg
Nov 12 10:30:00 cloudcontrol1007 maintain-dbusers[2677814]: DEBUG [root.populate_accountsdb:751] Found 0 new tool accounts () and 0 removed tool accounts ()
Nov 12 10:31:35 cloudcontrol1007 maintain-dbusers[2677814]: DEBUG [root.populate_accountsdb:751] Found 1 new user accounts (rashitige) and 0 removed user accounts ()
Nov 12 10:31:35 cloudcontrol1007 maintain-dbusers[2677814]: DEBUG [urllib3.connectionpool._new_conn:1049] Starting new HTTPS connection (1): nfs.svc.toolforge.org:443
Nov 12 10:31:35 cloudcontrol1007 maintain-dbusers[2677814]: DEBUG [urllib3.connectionpool._make_request:544] https://nfs.svc.toolforge.org:443 "POST /v1/write-replica-cnf HTTP/1.1
Maybe @komla can help here too
quick throw-away script for simple deployments in lima-kilo using the web images:
It's hosted on GitHub, according to https://wikitech.wikimedia.org/wiki/Tool:Import-500px
Tue, Nov 11
I think this should avoid the current errors: https://gerrit.wikimedia.org/r/c/operations/puppet/+/991653
I think it's failing to commit the fact that some users were already created, and re-counting them as created every time too
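To illustrate what I mean (hypothetical code, not the actual maintain-dbusers logic): if the "already created" state never gets committed, every run re-counts the same accounts as newly created.

```python
class AccountSync:
    """Toy model of the suspected bug: state lost when the commit fails."""

    def __init__(self):
        self.committed = set()       # what actually got persisted
        self.created_this_run = []

    def run(self, accounts, commit_works):
        # Everything not known to be committed gets "created" (and counted).
        self.created_this_run = [a for a in accounts if a not in self.committed]
        if commit_works:             # if the commit fails, we forget it all
            self.committed.update(self.created_this_run)
        return len(self.created_this_run)
```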
It seems there are currently ~19 accounts affected:
root@cloudcontrol1007:~# journalctl -u maintain-dbusers.service -n 10000 | grep 'problem populating' | grep -o account_id.* | sort | uniq -c
134 account_id chk2605 failed without response.
134 account_id davenyi failed without response.
135 account_id devdoingdev failed without response.
135 account_id elementaler7 failed without response.
134 account_id fritzbeing failed without response.
134 account_id hokwelum failed without response.
135 account_id imanoobg failed without response.
135 account_id jiji failed without response.
134 account_id jordylizana failed without response.
135 account_id khajitdadddy failed without response.
135 account_id olafjanssen failed without response.
135 account_id piastu failed without response.
134 account_id sadrettin failed without response.
135 account_id sisyph failed without response.
135 account_id swampl failed without response.
135 account_id tausheefhassan failed without response.
134 account_id tmwyk failed without response.
134 account_id vincentvega failed without response.
135 account_id vsdetoniprojetomais failed without response.
Populating the accounts seems to have started failing on the 29th of September:
https://grafana.wikimedia.org/goto/_feo-rkDR?orgId=1
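The same count as the grep | sort | uniq -c pipeline above, sketched in Python in case someone wants to post-process the journal further (the regex assumes the "account_id <name> failed without response" message format shown above):

```python
import re
from collections import Counter

def count_failed_accounts(journal_lines):
    """Count 'failed without response' occurrences per account_id."""
    counts = Counter()
    for line in journal_lines:
        m = re.search(r'account_id (\S+) failed without response', line)
        if m:
            counts[m.group(1)] += 1
    return counts
```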
Mon, Nov 10
Nov 6 2025
That sounds good to me, yes; it's similar to the other space.
It seems the memory limit has completely stopped the full outages, so I'll close this, as the main issue is worked around. It might be good to investigate the queries that kill it, but right now we don't have the bandwidth to dig deeper.
The cleanup of what's in the home dir can happen at the start of the tests, so in case anything fails, you still have some leftovers to investigate.
This was less of an issue in lima-kilo, where you rebuild it every now and then, but when running in a loop in prod it becomes more relevant.
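What I mean, as a sketch (directory and function names here are hypothetical): clean up before the test, not after, so a failed run leaves its state around for inspection, and re-running still starts from a clean slate.

```python
import os
import shutil

def clean_leftovers(home_dir):
    """Remove artifacts from a previous run, if any (hypothetical layout)."""
    shutil.rmtree(os.path.join(home_dir, "test-artifacts"), ignore_errors=True)

def run_test(home_dir):
    clean_leftovers(home_dir)                    # cleanup at the *start*
    workdir = os.path.join(home_dir, "test-artifacts")
    os.makedirs(workdir)
    # ... exercise the system under test, writing into workdir ...
    # intentionally no cleanup here: on failure, workdir stays around
```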
Also, what do you mean by "doesn't exist in the toollabs-images repo, but setup is likely like the other node image"? Those do exist there, just at a different revision; for example, for Ruby 2.5: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/docker-images/toollabs-images/+/9aaeb88e4af82a42f50146ef4ba97f6932d1e1b6/ruby25-sssd/
I did not mean to unassign, sorry; I think we both edited at the same time.



