User Details
- User Since
- Mar 6 2020, 9:03 PM (319 w, 1 d)
- Availability
- Available
- IRC Nick
- Raymond_Ndibe
- LDAP User
- Raymond Ndibe
- MediaWiki User
- Raymond Ndibe [ Global Accounts ]
Wed, Apr 15
Before:
sudo radosgw-admin user info --uid 84baa6f9fe8d41afb4b7ca99891161f3\$84baa6f9fe8d41afb4b7ca99891161f3
{
"user_id": "84baa6f9fe8d41afb4b7ca99891161f3$84baa6f9fe8d41afb4b7ca99891161f3",
"display_name": "etherpads3",
...
"max_buckets": 1000,
...
"bucket_quota": {
"enabled": false,
...
},
"user_quota": {
"enabled": true,
"check_on_raw": false,
"max_size": 8589934592,
"max_size_kb": 8388608,
"max_objects": 4096
},
...
}
Command:
sudo radosgw-admin quota set --quota-scope=user --uid 84baa6f9fe8d41afb4b7ca99891161f3\$84baa6f9fe8d41afb4b7ca99891161f3 --max-objects=65536
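Side note: the user quota is already enabled in the output above; if it were not, it would also need to be switched on. A minimal sketch, using the same uid:
sudo radosgw-admin quota enable --quota-scope=user --uid 84baa6f9fe8d41afb4b7ca99891161f3\$84baa6f9fe8d41afb4b7ca99891161f3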
After:
sudo radosgw-admin user info --uid 84baa6f9fe8d41afb4b7ca99891161f3\$84baa6f9fe8d41afb4b7ca99891161f3
{
"user_id": "84baa6f9fe8d41afb4b7ca99891161f3$84baa6f9fe8d41afb4b7ca99891161f3",
"display_name": "etherpads3",
...
"max_buckets": 1000,
...
"bucket_quota": {
"enabled": false,
...
},
"user_quota": {
"enabled": true,
"check_on_raw": false,
"max_size": 8589934592,
"max_size_kb": 8388608,
"max_objects": 65536
},
...
}
Tue, Apr 14
sudo wmcs-openstack project show ciperformance
+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| description | In T420590 we are working on making CI faster to get faster feedback time for developers. Some changes are easy, some changes needs to be evaluated and we need |
| | a couple of instances to run performance test to test out different configurations settings before we push them to CI. We will use our CI runner Quibble to run |
| | different tests (Apache License) |
| domain_id | default |
| enabled | True |
| id | c47d8465fb8543b2b16d2435c1dd811a |
| is_domain | False |
| name | ciperformance |
| options | {} |
| parent_id | default |
| tags | [] |
+-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Mon, Apr 13
Oppose. Pushing a tag should be the action that triggers the release pipeline.
The tagging is still the action triggering the release.
This task is just about removing the need for a human to manually push the tag, which should make it more difficult to make mistakes.
If a human is not responsible for manually pushing the tag, then we don't have to worry about either the pypi release or other contributors having stale tags locally.
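For illustration only, a rough sketch of what the automated step could look like; reading the version from debian/changelog is an assumption, not something decided in this task:
# hypothetical CI step: derive the version from the changelog and push the tag,
# so the tag push (which still triggers the release) no longer needs a human
version="$(dpkg-parsechangelog --show-field Version)"
git tag -a "${version}" -m "release ${version}"
git push origin "${version}"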
Yes Danya, push-to-deploy doesn't support webservice yet. We are working on that, though it won't be out, say, tomorrow.
Fri, Apr 10
Can you verify that providing an environment variable fixes this problem for you, @MBH?
It seems like your job is missing a toolforge environment variable. To see your available environment variables, run toolforge envvars list.
tools.mbh@tools-bastion-15:~$ toolforge jobs list
+---------------+---------------------+------------------------------------------+
| Job name:     | Job type:           | Status:                                  |
+---------------+---------------------+------------------------------------------+
| countries     | one-off             | Running for 1h4m20s                      |
| daily         | schedule: 3 0 * * * | Last schedule time: 2026-04-10T00:03:00Z |
| monthly       | schedule: 5 0 2 * * | Last schedule time: 2026-04-01T00:05:00Z |
| file-renaming | continuous          | Not running                              |
+---------------+---------------------+------------------------------------------+
tools.mbh@tools-bastion-15:~$ toolforge jobs show file-renaming
+---------------+------------------------------------------------------------------------+
| Job name:     | file-renaming                                                          |
+---------------+------------------------------------------------------------------------+
| Command:      | mono /data/project/mbh/bots/file-renaming.exe                          |
+---------------+------------------------------------------------------------------------+
| Job type:     | continuous                                                             |
+---------------+------------------------------------------------------------------------+
| Image:        | mono6.8                                                                |
+---------------+------------------------------------------------------------------------+
| Port:         | none                                                                   |
+---------------+------------------------------------------------------------------------+
| File log:     | yes                                                                    |
+---------------+------------------------------------------------------------------------+
| Output log:   | /data/project/mbh/file-renaming.out                                    |
+---------------+------------------------------------------------------------------------+
| Error log:    | /data/project/mbh/file-renaming.err                                    |
+---------------+------------------------------------------------------------------------+
| Emails:       | none                                                                   |
+---------------+------------------------------------------------------------------------+
| Resources:    | default                                                                |
+---------------+------------------------------------------------------------------------+
| Replicas:     | 1                                                                      |
+---------------+------------------------------------------------------------------------+
| Mounts:       | none                                                                   |
+---------------+------------------------------------------------------------------------+
| Retry:        | no                                                                     |
+---------------+------------------------------------------------------------------------+
| Timeout:      | no                                                                     |
+---------------+------------------------------------------------------------------------+
| Health check: | none                                                                   |
+---------------+------------------------------------------------------------------------+
| Status:       | Not running                                                            |
+---------------+------------------------------------------------------------------------+
| Hints:        | Last run at 2026-03-19T16:17:00Z. Pod in 'Running' phase. Pod has been |
|               | restarted 9 times. State 'waiting'. Reason                             |
|               | 'CreateContainerConfigError'. Additional message:'secret               |
|               | "toolforge.envvar.v1.conn-string" not found'.                          |
+---------------+------------------------------------------------------------------------+
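If the variable simply needs to be (re)created, something like the following should get the job going again. This is a sketch assuming the missing variable is the conn-string one named in the error; adjust the name and value to whatever the bot actually reads:
toolforge envvars create conn-string    # prompts for the value
toolforge jobs restart file-renaming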
Tue, Apr 7
Most promising paths:
- Adding max_query_length: 336h to the loki configuration. This will always raise the "query time range exceeds the limit..." error whenever the range exceeds 336h; it works for 30d, 100d, 1000d, 10000d, etc.
- Adding max_query_length: 336h and max_query_lookback: 336h to the loki configuration. This will silently ignore the days that fall outside 336h and return everything else (see the config sketch after this list).
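A rough sketch of where those options would live in the loki configuration; the surrounding layout is assumed, only the two values come from the list above:
limits_config:
  max_query_length: 336h
  max_query_lookback: 336h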
Wed, Apr 1
@Soda You can now see all your logs, using --since to adjust how far back in the past the logs should be fetched from.
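For example (the job name and window are made up, and I'm assuming the jobs logs subcommand here; adjust to whatever command you use to read logs):
toolforge jobs logs <job-name> --since 24h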
Wed, Mar 25
Note to self:
For the link objects cleanup, maybe we can pull the image before the cleanup and verify that we can still pull it after the cleanup. If we can't (meaning the cleanup broke the image), push it back to re-establish the link. This can be done in addition to other ways of mitigating risk. Noting it down here so we don't forget what we discussed during the session with David.
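A minimal sketch of that safety check, with a hypothetical image name; the real cleanup would loop over the affected images:
image='harbor.example.org/tool-foo/bar:latest'   # hypothetical image reference
docker pull "$image"                             # take a local copy before the cleanup
# ... run the link objects cleanup here ...
if ! docker pull "$image"; then                  # the cleanup broke the image
    docker push "$image"                         # push the local copy back to re-establish the link
fi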
Mar 18 2026
I added documentation for the s3 lifecycle configuration part here https://wikitech.wikimedia.org/wiki/Help:Object_storage_user_guide#Known_issues_with_Docker_registries.
I also configured that on the harborstorage buckets for tools and toolsbeta, and that got rid of the massive 20G difference between what is reported on horizon and what is reported on harbor.
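For reference, the kind of lifecycle rule involved looks roughly like this (endpoint and bucket name are placeholders; the exact rule we applied is documented at the wikitech link above):
aws s3api put-bucket-lifecycle-configuration \
    --endpoint-url https://<object-storage-endpoint> \
    --bucket <harborstorage-bucket> \
    --lifecycle-configuration '{
      "Rules": [{
        "ID": "abort-incomplete-multipart-uploads",
        "Status": "Enabled",
        "Filter": {"Prefix": ""},
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 1}
      }]
    }'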
Update After Further Research
TL;DR:
- maintain-harbor job to handle harbor vs s3 orphaned objects cleanup (is this safe? are there safer alternatives? what happens if we just ignore this?)
- lifecycle rules to handle old multipart upload data and documentation of this on wikitech
- cookbook to handle ceph3 orphaned objects cleanup
Mar 16 2026
I dug deeper into this. https://github.com/goharbor/harbor/issues/22111 is one of our problems, but it is not the major one. Below are other related issues I researched:
Mar 11 2026
Not sure if we should do this. We had a decision request and the final decision was to handle version bumping manually: https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Decision_record_T373072_to_strictly_enforce_semantic_versioning_rules_for_toolforge_services_APIs_or_not
Mar 5 2026
Described the issue and solution here: https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-harbor/-/merge_requests/62#note_190853
Mar 3 2026
Done
Feb 16 2026
Before:
raymond-ndibe@cloudcontrol1006:~$ sudo wmcs-openstack quota show catalyst-dev
+-----------------------+-------+
| Resource              | Limit |
+-----------------------+-------+
| cores                 | 40    |
...
| ram                   | 81920 |
...
| gigabytes             | 750   |
...
+-----------------------+-------+
Feb 15 2026
@Otcenas11 I looked at your tool. Previous commands show you are running toolforge webservice --backend=kubernetes --mount=all buildservice start,
which is failing with the error below, because you have a manually created deployment that has the same name as the one webservice is trying to use:
tools.changedetection-io@tools-bastion-15:~$ toolforge webservice --backend=kubernetes --mount=all buildservice start
Traceback (most recent call last):
File "/usr/bin/toolforge-webservice", line 33, in <module>
sys.exit(load_entry_point('toolforge-webservice==0.103.18', 'console_scripts', 'toolforge-webservice')())
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
File "/usr/lib/python3/dist-packages/toolsws/cli/webservice.py", line 561, in main
start(job, "Starting webservice")
~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/toolsws/cli/webservice.py", line 88, in start
job.request_start()
~~~~~~~~~~~~~~~~~^^
File "/usr/lib/python3/dist-packages/toolsws/backends/kubernetes.py", line 647, in request_start
self.api.create_object(
~~~~~~~~~~~~~~~~~~~~~~^
"deployments", self._get_deployment(started_at)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/usr/lib/python3/dist-packages/toolforge_weld/kubernetes.py", line 249, in create_object
return self.post(
~~~~~~~~~^
kind,
^^^^^
...<2 lines>...
**kwargs,
^^^^^^^^^
)
^
File "/usr/lib/python3/dist-packages/toolforge_weld/api_client.py", line 190, in post
response = self._make_request("POST", url, **kwargs).json()
~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3/dist-packages/toolforge_weld/api_client.py", line 157, in _make_request
raise e
File "/usr/lib/python3/dist-packages/toolforge_weld/api_client.py", line 136, in _make_request
response.raise_for_status()
~~~~~~~~~~~~~~~~~~~~~~~~~^^
File "/usr/lib/python3/dist-packages/requests/models.py", line 1024, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 409 Client Error: Conflict for url: https://k8s.tools.eqiad1.wikimedia.cloud:6443/apis/apps/v1/namespaces/tool-changedetection-io/deployments
tools.changedetection-io@tools-bastion-15:~$ toolforge jobs list
tools.changedetection-io@tools-bastion-15:~$ kubectl get deployments
NAME                  READY   UP-TO-DATE   AVAILABLE   AGE
changedetection-io    1/1     1            1           7h15m
tools.changedetection-io@tools-bastion-15:~$
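One way out, assuming the manually created deployment is disposable and should be replaced by the webservice-managed one (please confirm that before deleting anything):
kubectl delete deployment changedetection-io
toolforge webservice --backend=kubernetes --mount=all buildservice start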
It seems like when an image with a digest (e.g. '192.168.5.15/tool-tf-test/cluebot3:latest@sha256:b43fe64ac24365bd7cf3731f010e08020b6ce7304dd7852cd689600318ff270d') is provided while attempting to create a job, what ends up in k8s is something like 192.168.5.15/tool-tf-test/cluebot3:latest; the digest is being dropped.
This is happening because we are running Image.from_url_or_name again here https://gitlab.wikimedia.org/repos/cloud/toolforge/jobs-api/-/blob/main/tjf/runtimes/k8s/runtime.py?ref_type=heads#L191, which drops the digest.
I'm not sure why we are doing that again in that line of code, since we already ran it in api/models.py. I will just remove it and see what happens.
@dancy I don't know much about catalyst. It'd be interesting to know why we are requesting increases for catalyst-dev similar to catalyst's.
To be clear, the question above won't prevent the increase if it's approved by the team (forwarding to IRC right now); it's just personal curiosity.
@Otcenas11 Can you let us know the name of the toolforge tool you tried deploying this on, and maybe give more details about how to reproduce the issue you are facing? I can try to check out what the problem is. We can proceed with this (pending approval from others) if it's confirmed that this can't be deployed on toolforge.
