User Details
- User Since
- Feb 17 2016, 9:54 PM (520 w, 4 d)
- Availability
- Available
- LDAP User
- DamianZaremba
- MediaWiki User
- Unknown
Tue, Jan 13
Mon, Jan 12
This is not yet migrated as I have been busy since September.
Nov 19 2025
I got another alert around 00:28 CET; I will investigate when I have a moment, but it had the same symptoms as this.
Nov 18 2025
Nov 14 2025
Nov 13 2025
https://cluebotng-trainer.toolforge.org/Report%20Interface%20Import/2025-11-13%2018:02:49/logs/bayes-train.log is an example of where this gets pulled in.
Eventually having 'system logs' (build, jobs, components) with a common 'trace id', so that for a deployment you can see everything that happened, would be very nice to have. It would probably require some effort to make sure internal things that could contain secrets don't get logged to places with tool access.
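As a sketch of the idea (the filter and component names here are hypothetical, not an existing Toolforge API): a single trace id generated per deployment can be attached to every log record via a logging filter, so records emitted by different components can later be joined on that id.

```python
import logging
import uuid


class TraceIdFilter(logging.Filter):
    """Attach a fixed trace id to every record passing through a logger."""

    def __init__(self, trace_id: str):
        super().__init__()
        self.trace_id = trace_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = self.trace_id
        return True


# One id per deployment; every component logs with the same value.
trace_id = uuid.uuid4().hex
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(trace_id)s %(name)s %(message)s"))

for component in ("build", "jobs", "components"):
    log = logging.getLogger(component)
    log.addFilter(TraceIdFilter(trace_id))
    log.addHandler(handler)
    log.setLevel(logging.INFO)

logging.getLogger("build").info("image built")
logging.getLogger("jobs").info("job restarted")
```

Grepping the combined logs for that one id would then reconstruct the whole deployment timeline.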
Yeah, we do this for example in the patch endpoint, but not the delete or restart endpoints (the structure is there but always empty).
Being able to change the component name in components-api would be useful, but for different reasons (especially with reuse_build the naming gets a bit funny, e.g. 'backup database' is used for 'run this cron').
Looks like this is working as expected;
Nov 12 2025
Verified this is working now (after the repo is updated)
tools.test-damian@tools-bastion-15:~/toolforge-deploy$ ./utils/run_functional_tests.sh -u https://gitlab.wikimedia.org/damian/toolforge-deploy.git -b test-break -c jobs-api --test-tool test-damian
Installed toolforge CLIs versions:
| component | type | package name | version | comment |
| :-------: | :--: | :----------: | :-----: | :-----: |
| builds-cli | package | toolforge-builds-cli | 0.0.24 | |
| components-cli | package | toolforge-components-cli | 0.0.16 | |
| envvars-cli | package | toolforge-envvars-cli | 0.0.15 | |
| jobs-cli | package | toolforge-jobs-cli | 16.1.25 | |
| misctools-cli | package | toolforge-misctools-cli | 1.49.3 | |
| toolforge-cli | package | toolforge-cli | 0.3.8 | |
| toolforge-weld | package | python3-toolforge-weld | 1.6.11 | |
| webservice-cli | package | toolforge-webservice | 0.103.18 | |
Nov 11 2025
Looks like a tool issue: it's trying to load things from https://tools.wmflabs.org/import-500px/ rather than https://import-500px.toolforge.org/, which is not allowed by CORS.
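A hedged sketch of the fix on the tool side: rewrite any hard-coded legacy tools.wmflabs.org URLs to the per-tool toolforge.org host. The helper below is illustrative, not part of any Toolforge library.

```python
from urllib.parse import urlparse


def to_toolforge_url(url: str) -> str:
    """Rewrite a legacy https://tools.wmflabs.org/<tool>/... URL to
    the per-tool https://<tool>.toolforge.org/... form."""
    parsed = urlparse(url)
    if parsed.netloc != "tools.wmflabs.org":
        return url  # already the new style (or an unrelated host)
    tool, _, rest = parsed.path.lstrip("/").partition("/")
    return f"https://{tool}.toolforge.org/{rest}"


print(to_toolforge_url("https://tools.wmflabs.org/import-500px/"))
# https://import-500px.toolforge.org/
```

With all asset URLs on the tool's own hostname, the requests are same-origin and CORS no longer applies.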
Hit again today while bumping releases on nearly everything
Deployment ID: 20251111-152523-4chqpjcf8c
Created: 20251111-152523
Status: failed
Long status: Got exception: Some builds failed to start: redis(error:409 Client Error: Conflict for url: https://api.svc.tools.eqiad1.wikimedia.cloud:30003/builds/v1/tool/cluebotng/builds) report-interface(error:409 Client Error: Conflict for url: https://api.svc.tools.eqiad1.wikimedia.cloud:30003/builds/v1/tool/cluebotng/builds)
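One way a deploy wrapper could cope with this (a sketch; `start_build` is a hypothetical stand-in for the builds API call, not the real client): treat 409 Conflict as "a build for this tool is already in flight" and retry with backoff instead of failing the whole deployment.

```python
import time


class ConflictError(Exception):
    """Stand-in for an HTTP 409 Conflict from the builds API."""


def start_build_with_retry(start_build, attempts: int = 5, delay: float = 1.0):
    """Call start_build(), retrying with linear backoff while it raises
    ConflictError (i.e. another build for the tool is still running)."""
    for attempt in range(1, attempts + 1):
        try:
            return start_build()
        except ConflictError:
            if attempt == attempts:
                raise  # give up after the final attempt
            time.sleep(delay * attempt)
```

Whether retrying is appropriate depends on why the conflict happened; for "previous build still running" it avoids a manual re-run of the whole deployment.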
Noticed this again today while tools-db was having an outage.
Nov 9 2025
Nov 6 2025
This was less of an issue in lima-kilo, where you rebuild it every now and then, but it becomes more relevant when running in a loop in prod.
Nov 5 2025
This also impacted ClueBot NG: the alert for the report interface being down was sent at 23:55 UTC (the alerting rule fires after 2 min), then the bot-not-editing alert at 00:39 UTC (alerting after 1 hour 5 min). Recovery at 08:45 UTC. I manually checked around 1am (CET) and it was reporting max connections used, per the above.
Nov 4 2025
Thanks for the logs.
Run on toolforge for now
Tagging SRE as not sure which team is responsible.
Nov 3 2025
Things are a lot better for me since the uwsgi wheels change.
There was also an ingress update recently (T383516)
Verified working in production with https://github.com/cluebotng/component-configs/blob/main/config/static-files/files/trove-mysql.sh
Nov 2 2025
Functional tests passing; both MRs need tagging with NeedsReview as usual.
It appears these images are stored in reggie, which is configured with a 24 hour TTL (x-ref https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/blob/main/reggie/reggie-values.yaml.tftpl#L39).
Oct 30 2025
Currently handling this in my tools as part of the deployment (network policy / ingress are the only 'unsupported' objects managed, including cross-tool access; everything else is components/jobs-api driven). It works just fine but is 'unsupported', so this is a 'long term nice to have' for me.
This has been migrated over for a while (https://phabricator.wikimedia.org/T401151#11084273); some logs have been reduced (debug -> info), and there are still some areas to clean up (IRC messages are almost gone, etc).
No longer using Github for the tool downloads.
So the problem is the repo is cloned under the test user, which uses the upstream repo.
Oct 29 2025
Oct 28 2025
That's helpful, I have a grafana instance on toolforge that I can use for now also.
The Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a project tag more specific to this task. Thanks!
Took the liberty of doing what the weekly person normally does for new tickets... feel free to change.
https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/234 is the more important issue (jobs will be scheduled onto non-NFS workers); probably should make a ticket for that, but found it here.
Should fail due to no --mount
local.tf-test@lima-kilo:~$ toolforge jobs run --continuous --filelog-stdout test.log --filelog --filelog-stderr test.log --image tool-tf-test/component1:latest --command web test-job
There is a bug in the job validation (which should set the path to be absolute)
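A minimal sketch of the missing normalization (hypothetical helpers, not the actual jobs-api code): filelog paths should be resolved to absolute paths under the tool home before the mount check, so a relative `test.log` cannot slip past validation.

```python
import posixpath


def normalize_filelog_path(path: str, tool_home: str) -> str:
    """Resolve a (possibly relative) filelog path against the tool's
    home directory, returning an absolute, normalized path."""
    if not posixpath.isabs(path):
        path = posixpath.join(tool_home, path)
    return posixpath.normpath(path)


def requires_nfs(path: str, tool_home: str) -> bool:
    """A job writing logs under the tool home needs the NFS mount,
    so validation should reject it when no mount is requested."""
    full = normalize_filelog_path(path, tool_home)
    return full.startswith(tool_home)
```

With this in place, the `--filelog-stdout test.log` case above would resolve under the tool home and fail validation when no mount is present.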
This is sort of covered by https://wikitech.wikimedia.org/wiki/Help:Toolforge/Building_container_images#Using_NFS_shared_storage
The /restart/ bit in components-api is done; once https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/220 lands, the inferred restart in jobs-api is done and this can finally be closed.
This was initially proposed as https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/merge_requests/229, which got some pushback.
Oct 27 2025
It is also not a trivial problem, so it will take some effort and time to get a "well paved" flow going.
Indeed and that is fine, what we have works in broad strokes, so this is more about the future. The primary consuming service is still in beta anyway.
Looks good to me - thanks for the extended effort on getting this deployed @dcaro
tools.cluebotng-monitoring@tools-bastion-15:~$ toolforge build show
Build ID: cluebotng-monitoring-buildpacks-pipelinerun-b7fqr
Start Time: 2025-09-16T14:29:01Z
End Time: 2025-09-16T14:31:24Z
Status: ok
Message: Tasks Completed: 1 (Failed: 0, Cancelled 0), Skipped: 0
Parameters:
Source URL: https://github.com/cluebotng/monitoring.git
Ref: refs/tags/v2.6.5
Envvars: N/A
Use latest versions: True
Destination Image: tools-harbor.wmcloud.org/tool-cluebotng-monitoring/prometheus:latest@sha256:d281c16aa8bdb1487638a8acb28901125ae16da88134ded1e208619aae90436a
Oct 26 2025
Oct 23 2025
14s (filtered) is not much faster than 17s (all entries), which makes me think there is a lot going on before the filtering.
https://wikitech.wikimedia.org/wiki/Help:Toolforge/Building_container_images states
Compared to the current images, you can use newer versions of the base languages (ex. python 3.12, see Specifying a Python runtime), and benefit when newer versions are added upstream.
Similar to https://phabricator.wikimedia.org/T401875
Related to https://phabricator.wikimedia.org/T380127
Oct 22 2025
I think something to consider is when we leak internal details (e.g. "got exception") vs wrapping them in human messages (e.g. the warnings in jobs-api).
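A sketch of the distinction (function and field names here are illustrative, not the real jobs-api): catch the internal exception at the API boundary and return a human-oriented message plus structured warnings, rather than leaking "got exception: ..." to the user.

```python
class UpstreamError(Exception):
    """Stand-in for a low-level failure (e.g. a k8s client error)."""


def restart_job_internal(name: str):
    # Simulate the internal call failing with a low-level error.
    raise UpstreamError(f"409 Conflict for url: .../jobs/{name}")


def restart_job(name: str) -> dict:
    """API-boundary wrapper: translate internal exceptions into a
    human-readable message plus structured warnings."""
    try:
        restart_job_internal(name)
        return {"status": "ok", "warnings": []}
    except UpstreamError:
        # The raw exception text stays server-side (logged there);
        # the user sees an actionable message instead.
        return {
            "status": "error",
            "message": f"Could not restart job '{name}', please retry shortly.",
            "warnings": ["another operation on this job is still in progress"],
        }
```

The raw exception can still go to server-side logs for debugging; only the wrapped message crosses the API boundary.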
Somewhat related, but for example the "config generate with some existing jobs works, is loadable and creates the same job (slow)" test relies on another test having run for the deployment to exist (presumably the deploy token test); this makes running specific tests/failures (as advertised by the CLI) fail for not very obvious reasons.