
Raymond_Ndibe (Ray)
Software Engineer

Projects (9)

User Details

User Since
Mar 6 2020, 9:03 PM (301 w, 2 d)
Availability
Available
IRC Nick
Raymond_Ndibe
LDAP User
Raymond Ndibe
MediaWiki User
Raymond Ndibe [ Global Accounts ]

Recent Activity

Tue, Dec 9

Raymond_Ndibe created T412081: resource consumption issue on toolforge tool wpcleaner.
Tue, Dec 9, 9:25 AM · Toolforge (Toolforge iteration 25)

Fri, Nov 28

Raymond_Ndibe added a comment to T411208: [lima-kilo] error mounting docker cache.

@Volans ran into this issue some time ago, and I ran into it today. The workaround is using ./start-devenv.sh --no-cache.

Fri, Nov 28, 12:31 AM · cloud-services-team, Toolforge

Tue, Nov 25

Raymond_Ndibe changed the status of T409191: [jobs-api] Investigate if we can reuse the 'web' flavour pre-built images as regular images, a subtask of T348755: [jobs-api,webservice] Run webservices via the jobs framework, from Open to In Progress.
Tue, Nov 25, 1:46 AM · Toolforge (Toolforge iteration 25), cloud-services-team, User-Raymond_Ndibe, Epic
Raymond_Ndibe changed the status of T409191: [jobs-api] Investigate if we can reuse the 'web' flavour pre-built images as regular images from Open to In Progress.
Tue, Nov 25, 1:46 AM · Toolforge (Toolforge iteration 25)

Nov 13 2025

Raymond_Ndibe added a comment to T409191: [jobs-api] Investigate if we can reuse the 'web' flavour pre-built images as regular images.

It seems we don't need to do anything special to get the images to run @dcaro @fnegri

Nov 13 2025, 11:31 PM · Toolforge (Toolforge iteration 25)
Raymond_Ndibe added a comment to T409191: [jobs-api] Investigate if we can reuse the 'web' flavour pre-built images as regular images.

Lima-kilo env configurations for anyone who wants to recreate this (configmaps, limitranges, resourcequotas, etc.). I basically maxed everything out to ensure those never become an issue while running these tests. Keeping everything in a doc so things don't clutter the task:
https://docs.google.com/document/d/1LfXdcVB-Vh0I0IuoniCN325Tofzu7MK9bBnA8G9aLM0/edit?tab=t.0

Nov 13 2025, 11:29 PM · Toolforge (Toolforge iteration 25)

Nov 12 2025

Raymond_Ndibe added a comment to T409191: [jobs-api] Investigate if we can reuse the 'web' flavour pre-built images as regular images.

quick throw-away script for simple deployments in lima-kilo using the web images:

That's ok, but can you test if we can use them as jobs from jobs-api?
I'm sure that they will be able to be pulled and run as just images; the key point is running them as jobs (envvars, entrypoints, resources, security policies, ...).

For that you can try using the image-config patch you created in lima-kilo, starting one job for each image type (might be easier using jobs.yaml), and making sure each runs ok (e.g. logging some string and checking that the logs are sent ok).

Nov 12 2025, 4:24 PM · Toolforge (Toolforge iteration 25)
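A minimal jobs.yaml along the lines suggested might look like the sketch below. Job names, commands, and image identifiers are illustrative assumptions; the exact schema accepted by `toolforge jobs load` may differ from this.

```yaml
# Hypothetical sketch: one short job per image type, each logging a
# known string so the logs can be checked afterwards.
- name: test-bookworm
  command: echo "hello from bookworm"
  image: bookworm
- name: test-node18
  command: echo "hello from node18"
  image: node18
```

Such a file could then be loaded in one go and each job's logs inspected to confirm envvars, entrypoints, and log shipping behave as expected.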
Raymond_Ndibe added a comment to T409191: [jobs-api] Investigate if we can reuse the 'web' flavour pre-built images as regular images.

quick throw-away script for simple deployments in lima-kilo using the web images:

#!/usr/bin/env python3
Nov 12 2025, 2:00 AM · Toolforge (Toolforge iteration 25)

Nov 11 2025

Raymond_Ndibe changed the status of T408783: [docs] Update all toolforge repos in gitlab with contribution guidelines and license from Open to In Progress.
Nov 11 2025, 8:18 PM · Patch-For-Review, Toolforge (Toolforge iteration 25)
Raymond_Ndibe added a comment to T409191: [jobs-api] Investigate if we can reuse the 'web' flavour pre-built images as regular images.

I did not mean to unassign sorry, I think we both edited at the same time.

Can you manually test that that is the case? For example by running some code on each of them, even if it's a shell script of sorts.

Also, some setup present in the webservice images is missing from some other images; if you check some of the Dockerfiles for webservice, there are also envvars set in some.

And, can you share the code you use to generate that table? Could be useful.

Nov 11 2025, 7:06 PM · Toolforge (Toolforge iteration 25)
Raymond_Ndibe claimed T409725: [jobs-api,webservice] Fetch images from builds-api.
Nov 11 2025, 6:44 PM · Toolforge (Toolforge iteration 25), cloud-services-team
Raymond_Ndibe claimed T409726: [builds-api] Add an endpoint to get all available images.
Nov 11 2025, 6:44 PM · Patch-For-Review, Toolforge (Toolforge iteration 25), cloud-services-team
Raymond_Ndibe added a comment to T409726: [builds-api] Add an endpoint to get all available images.

image-config configmap has the below structure currently:
NOTE: the below entry is not an exact example of what is in the config; I just gathered many of the common aliases, state, and extras into a single config entry so we can talk about them.

apiVersion: v1
data:
  images-v1.yaml: |
    bookworm:
      aliases:
      - tf-bullseye-std
      - tf-bullseye-std-DEPRECATED
      state: stable
      variants:
        jobs-framework:
          image: docker-registry.tools.wmflabs.org/toolforge-bookworm-sssd
        webservice:
          extra:
            resources: jdk
            wstype: generic
          image: docker-registry.tools.wmflabs.org/toolforge-bookworm-web-sssd
...
kind: ConfigMap
...

A few things to think about:

  • How do we support aliases, state, and extra? First, which of those do we need (e.g. for backwards compatibility) and which are unnecessary? For the necessary ones, how do we support them in harbor if we want the endpoint to be as simple as making a request to harbor and parsing the response? We certainly don't want to maintain a yaml in builds-api that defines these, since that would basically mean moving image-config into builds-api. A few options come to mind:
    1. Extensive use of tags (e.g. aliases-tf-bullseye-std, aliases-tf-bullseye-std-DEPRECATED, state-stable, state-deprecated, resources-jdk, wstype-generic, etc.). If we go with this, we need a way of parsing these in builds-api (probably trivial). More importantly, we'll likely need a cookbook for maintaining these images (updating a tag to deprecated, specifying tags when uploading a new image, etc.).
    2. A helm chart on harbor (I hate this because it's no different from maintaining a local yaml on builds-api): with this you still have the images, then a chart defining these "tags".
Nov 11 2025, 6:43 PM · Patch-For-Review, Toolforge (Toolforge iteration 25), cloud-services-team
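The tag-parsing idea from option 1 can be sketched as follows; the tag prefixes and the resulting structure mirror the examples above, but the function and its conventions are hypothetical, not builds-api code.

```python
# Hypothetical sketch of option 1: turn a flat list of harbor tags
# (e.g. "aliases-tf-bullseye-std", "state-stable") back into the
# aliases/state/extra structure that image-config currently holds.
def parse_image_tags(tags):
    meta = {"aliases": [], "state": None, "extra": {}}
    for tag in tags:
        prefix, _, value = tag.partition("-")
        if prefix == "aliases":
            meta["aliases"].append(value)
        elif prefix == "state":
            meta["state"] = value
        else:
            # anything else (e.g. resources-jdk, wstype-generic) goes
            # into the webservice-style "extra" section
            meta["extra"][prefix] = value
    return meta

print(parse_image_tags(
    ["aliases-tf-bullseye-std", "state-stable", "resources-jdk", "wstype-generic"]
))
```

As the comment notes, the parsing itself is trivial; the real cost is the cookbook/tooling needed to keep such tags accurate on every image upload and deprecation.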
Raymond_Ndibe claimed T409728: [image-config] deprecate and move all data to builds-api.
Nov 11 2025, 6:42 PM · Toolforge (Toolforge iteration 25), cloud-services-team
Raymond_Ndibe updated the task description for T409727: [builds-api,harbor,image-config] Move pre-built images to harbor.
Nov 11 2025, 12:53 AM · Patch-For-Review, Toolforge (Toolforge iteration 25), cloud-services-team
Raymond_Ndibe updated the task description for T409727: [builds-api,harbor,image-config] Move pre-built images to harbor.
Nov 11 2025, 12:53 AM · Patch-For-Review, Toolforge (Toolforge iteration 25), cloud-services-team
Raymond_Ndibe updated the task description for T409727: [builds-api,harbor,image-config] Move pre-built images to harbor.
Nov 11 2025, 12:47 AM · Patch-For-Review, Toolforge (Toolforge iteration 25), cloud-services-team
Raymond_Ndibe updated the task description for T409727: [builds-api,harbor,image-config] Move pre-built images to harbor.
Nov 11 2025, 12:45 AM · Patch-For-Review, Toolforge (Toolforge iteration 25), cloud-services-team

Nov 10 2025

Raymond_Ndibe updated the task description for T409727: [builds-api,harbor,image-config] Move pre-built images to harbor.
Nov 10 2025, 11:36 PM · Patch-For-Review, Toolforge (Toolforge iteration 25), cloud-services-team
Raymond_Ndibe updated the task description for T409727: [builds-api,harbor,image-config] Move pre-built images to harbor.
Nov 10 2025, 11:34 PM · Patch-For-Review, Toolforge (Toolforge iteration 25), cloud-services-team
Raymond_Ndibe updated the task description for T409727: [builds-api,harbor,image-config] Move pre-built images to harbor.
Nov 10 2025, 11:34 PM · Patch-For-Review, Toolforge (Toolforge iteration 25), cloud-services-team
Raymond_Ndibe updated the task description for T409727: [builds-api,harbor,image-config] Move pre-built images to harbor.
Nov 10 2025, 11:34 PM · Patch-For-Review, Toolforge (Toolforge iteration 25), cloud-services-team
Raymond_Ndibe claimed T409727: [builds-api,harbor,image-config] Move pre-built images to harbor.
Nov 10 2025, 11:30 PM · Patch-For-Review, Toolforge (Toolforge iteration 25), cloud-services-team

Nov 7 2025

Raymond_Ndibe added a comment to T409191: [jobs-api] Investigate if we can reuse the 'web' flavour pre-built images as regular images.

Also, what do you mean by doesn't exist in the toollabs-images repo, but setup is likely like the other node image? Those do exist there, just at a different revision; for example, for ruby 2.5: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/docker-images/toollabs-images/+/9aaeb88e4af82a42f50146ef4ba97f6932d1e1b6/ruby25-sssd/

Nov 7 2025, 1:35 PM · Toolforge (Toolforge iteration 25)

Nov 6 2025

Raymond_Ndibe claimed T400917: [jobs-api] Allow customizing time to request Loki logs for.
Nov 6 2025, 5:34 AM · Toolforge (Toolforge iteration 25), cloud-services-team
Raymond_Ndibe added a comment to T409191: [jobs-api] Investigate if we can reuse the 'web' flavour pre-built images as regular images.

In all cases where both variants exist, the webservice image is functionally a superset of the jobs-framework image; therefore, the webservice image can most likely serve both purposes.

Nov 6 2025, 4:50 AM · Toolforge (Toolforge iteration 25)
Raymond_Ndibe claimed T409191: [jobs-api] Investigate if we can reuse the 'web' flavour pre-built images as regular images.
Nov 6 2025, 4:48 AM · Toolforge (Toolforge iteration 25)
Raymond_Ndibe claimed T408783: [docs] Update all toolforge repos in gitlab with contribution guidelines and license.
Nov 6 2025, 12:16 AM · Patch-For-Review, Toolforge (Toolforge iteration 25)

Nov 4 2025

Raymond_Ndibe added a comment to T408034: toolforge jobs dump includes booleans as strings.

This was done for backward compatibility rather than being a bug mistakenly introduced.
Initially this was returned as a string, so it was carried over like that to avoid breaking anything for anyone who uses the API directly and expects a string.

Nov 4 2025, 3:45 AM · Toolforge, cloud-services-team
Raymond_Ndibe closed T408002: [functional tests,toolforge-deploy] functional tests are optimistic about retries/timeouts as Resolved.
Nov 4 2025, 3:42 AM · Patch-For-Review, cloud-services-team, Toolforge

Oct 22 2025

Raymond_Ndibe added a comment to T359649: [jobs-api,infra] upgrade all the existing toolforge jobs to the latest job version.

The original message draft talked about the "v2 job spec", which is why I assumed this was about the job configuration and not some internal implementation detail.

But if this is just about internal implementation details, why are we asking tool maintainers to care about it in the first place? IMHO, in that case we should just handle it internally, like we've handled similar migrations in the past.

Oct 22 2025, 10:37 PM · Toolforge (Toolforge iteration 25), cloud-services-team (FY2025/26-Q1-Q2), Patch-For-Review, User-Raymond_Ndibe, User-aborrero
Raymond_Ndibe added a comment to T359649: [jobs-api,infra] upgrade all the existing toolforge jobs to the latest job version.

Job version upgrade email draft:

Immediate questions based on this:

  • How is the v2 job config format different than the v1 format? (This should be documented at https://wikitech.wikimedia.org/wiki/Help:Toolforge/Running_jobs and summarized here.)
  • I also see no differences with the format of the config file generated by toolforge jobs dump and the file checked in my version control. How do I check what exactly needs changing in my config file?

This has to do with the way the job is created in kubernetes, so a difference will not be reflected in the dumps. You know all the fields that exist as a result of the legacy k8s specs? We want to get rid of those. The easiest way is to get the k8s spec of a job and check the version number in the label.

Maybe we do need to explain what exactly will be changing, but the average user need not care about the change, since it's more on the k8s side than in the actual job spec they submit.

Oct 22 2025, 9:57 PM · Toolforge (Toolforge iteration 25), cloud-services-team (FY2025/26-Q1-Q2), Patch-For-Review, User-Raymond_Ndibe, User-aborrero
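The check described above can be sketched like this; the label key `app.kubernetes.io/version` and the manifest shape are assumptions for illustration, not the actual label the jobs-api sets.

```python
# Hypothetical sketch: decide whether a job still runs a legacy spec by
# reading a version label off its k8s manifest. The label key here is an
# assumption for illustration, not the real jobs-api label.
VERSION_LABEL = "app.kubernetes.io/version"

def needs_upgrade(job_manifest, current_version="2"):
    labels = job_manifest.get("metadata", {}).get("labels", {})
    return labels.get(VERSION_LABEL) != current_version

old_job = {"metadata": {"labels": {VERSION_LABEL: "1"}}}
new_job = {"metadata": {"labels": {VERSION_LABEL: "2"}}}
print(needs_upgrade(old_job), needs_upgrade(new_job))  # → True False
```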
Raymond_Ndibe added a comment to T359649: [jobs-api,infra] upgrade all the existing toolforge jobs to the latest job version.

Upgrade notification to individual maintainer draft

Upgrade Your Old Toolforge Jobs Version to V2 <name>
Oct 22 2025, 9:54 PM · Toolforge (Toolforge iteration 25), cloud-services-team (FY2025/26-Q1-Q2), Patch-For-Review, User-Raymond_Ndibe, User-aborrero
Raymond_Ndibe added a comment to T359649: [jobs-api,infra] upgrade all the existing toolforge jobs to the latest job version.

Job version upgrade email draft:

Immediate questions based on this:

  • How is the v2 job config format different than the v1 format? (This should be documented at https://wikitech.wikimedia.org/wiki/Help:Toolforge/Running_jobs and summarized here.)
  • I also see no differences with the format of the config file generated by toolforge jobs dump and the file checked in my version control. How do I check what exactly needs changing in my config file?
Oct 22 2025, 9:51 PM · Toolforge (Toolforge iteration 25), cloud-services-team (FY2025/26-Q1-Q2), Patch-For-Review, User-Raymond_Ndibe, User-aborrero
Raymond_Ndibe added a comment to T359649: [jobs-api,infra] upgrade all the existing toolforge jobs to the latest job version.

Affected tools:

actrial
adamant
admin
ahechtbot
air7538tools
alertlive
arkivbot
aswnbot
aw-gerrit-gitlab-bridge
bothasava
botorder
brandonbot
contribstats
croptool
csp-report
danmicholobot
dannys712-bot
deployment-calendar
dewikinews-rss
dexbot
dow
dykautobot
earwigbot
emijrpbot
erwin85
featured-content-bot
ffbot
fist
fontcdn
forrestbot
galobot
gerakibot
gerrit-reviewer-bot
h78c67c-bot
hay
hewiki-tools
highly-agitated-pages
itwiki
itwiki-scuola-italiana
jackbot
jarry-common
jorobot
kian
lists
logoscope
magnustools
maintgraph
map-of-monuments
mitmachen
mjolnir
most-wanted
nlwiki-herhaalbot
non-robot
openstack-browser
pagepile
pangolinbot1
patrocle
phabbot
phabsearchemail
phansearch
phpcs
pickme
quest
random-featured
rembot
sdbot
search-filters
sergobot-statistics
shex-simple
socksfinder
sourcemd
spur
status
svbot2
svgcheck
sz-iwbot
technischewuensche
tf-image-bot
thanatos
thanks
thesandbot
tnt-dev
toolhub-extension-demo
toolhunt-api
tools-edit-count
top25reportbot
topicmatcher
trainbow
tutor
typo-fixer
update-1lib1ref
vicbot2
video2commons
wd-flaw-finder
wdumps
welcomebot
wgmc
wiki-patrimonio
wiki-stat-portal
wikicup
wikidata-game
wikidata-todo
wikijournalbot
wikilinkbot
wikiloves
wikiprojectlist
wikivoyage
wm-domains
wmch
wmde-access
ws-cat-browser
zhmrtbot
zhwiki-teleirc
Oct 22 2025, 9:39 PM · Toolforge (Toolforge iteration 25), cloud-services-team (FY2025/26-Q1-Q2), Patch-For-Review, User-Raymond_Ndibe, User-aborrero
Raymond_Ndibe added a comment to T359649: [jobs-api,infra] upgrade all the existing toolforge jobs to the latest job version.

Job version upgrade email draft:

[Cloud-announce] Old Toolforge Jobs Upgrade To V2 on 2025-11-20
Oct 22 2025, 9:38 PM · Toolforge (Toolforge iteration 25), cloud-services-team (FY2025/26-Q1-Q2), Patch-For-Review, User-Raymond_Ndibe, User-aborrero
Raymond_Ndibe changed the status of T402568: [components-api] Queue builds when the build queue is full from In Progress to Stalled.
Oct 22 2025, 2:41 AM · Toolforge (Toolforge iteration 25), Patch-For-Review
Raymond_Ndibe changed the status of T402568: [components-api] Queue builds when the build queue is full, a subtask of T401851: [components-api,beta] Image should only be build once when re-used in components, from In Progress to Stalled.
Oct 22 2025, 2:41 AM · Toolforge (Toolforge iteration 25)

Oct 21 2025

Raymond_Ndibe changed the status of T407496: [maintain-harbor] Failing to cleanup stale artifacts from Open to In Progress.
Oct 21 2025, 2:57 AM · Toolforge (Toolforge iteration 25)
Raymond_Ndibe created T407822: alloy failing in macos lima-kilo.
Oct 21 2025, 1:26 AM · Toolforge (Toolforge iteration 24)
Raymond_Ndibe closed T401648: [components-api] exclude defaults when getting deployment as Resolved.
Oct 21 2025, 1:11 AM · Toolforge (Toolforge iteration 24), Patch-For-Review
Raymond_Ndibe closed T394595: [cicd] create cicd flow for non repo owners as Resolved.
Oct 21 2025, 1:10 AM · Toolforge (Toolforge iteration 24), cloud-services-team (FY2025/26-Q1-Q2), User-Raymond_Ndibe
Raymond_Ndibe closed T394595: [cicd] create cicd flow for non repo owners, a subtask of T392524: [cicd] Streamline toolforge cli deployment and external contributor ci flows, as Resolved.
Oct 21 2025, 1:10 AM · cloud-services-team (FY2025/26-Q1-Q2), Toolforge, User-Raymond_Ndibe, Epic
Raymond_Ndibe edited projects for T394595: [cicd] create cicd flow for non repo owners, added: Toolforge (Toolforge iteration 24); removed Toolforge.
Oct 21 2025, 1:10 AM · Toolforge (Toolforge iteration 24), cloud-services-team (FY2025/26-Q1-Q2), User-Raymond_Ndibe

Oct 20 2025

Raymond_Ndibe closed T407733: Quota increase request for project catalyst as Resolved.
Oct 20 2025, 6:52 PM · Catalyst, Cloud-VPS (Quota-requests)
Raymond_Ndibe added a comment to T407733: Quota increase request for project catalyst.

Command:

sudo cookbook wmcs.openstack.quota_increase --project catalyst --cores 16 --ram 32768 --task-id T407733 --cluster-name eqiad1
Oct 20 2025, 6:52 PM · Catalyst, Cloud-VPS (Quota-requests)

Oct 14 2025

Raymond_Ndibe closed T389118: [jobs-api] refactor models as Resolved.
Oct 14 2025, 7:46 PM · Toolforge (Toolforge iteration 24), Patch-For-Review, User-Raymond_Ndibe

Oct 8 2025

Raymond_Ndibe added a comment to T405018: [envvars] only mask secrets.

--type secret/config with the default being secret should work while creating envvars

Oct 8 2025, 11:22 PM · cloud-services-team, Toolforge
Raymond_Ndibe added a comment to T402764: [components-api] allow specifying `source_repo`+`ref` for the config.

I struggle to see why we need to handle configurations saved in a git repo differently from an ordinary url, given that the files in a particular branch in both gitlab and github (and most git servers) can be addressed as a simple url.
What am I missing?
A typical example is https://gitlab.wikimedia.org/repos/cloud/toolforge/components-api/-/raw/replace_destination_image_comparison_with_image_name/LICENSE?ref_type=heads
The above link can be read by anything; we do not need to know it's in a git repo on branch replace_destination_image_comparison_with_image_name.

Oct 8 2025, 10:57 PM · Toolforge (Toolforge iteration 25), Patch-For-Review, cloud-services-team
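The point that a branch file is just a URL can be illustrated with a small sketch; the URL pattern below follows GitLab's raw-file scheme, and the helper function itself is hypothetical.

```python
# Hypothetical helper: address a file on a given branch of a GitLab repo
# as a plain raw URL, with no git-specific handling on the client side.
def gitlab_raw_url(base, project, ref, path):
    return f"{base}/{project}/-/raw/{ref}/{path}?ref_type=heads"

url = gitlab_raw_url(
    "https://gitlab.wikimedia.org",
    "repos/cloud/toolforge/components-api",
    "replace_destination_image_comparison_with_image_name",
    "LICENSE",
)
print(url)
```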

Sep 25 2025

Raymond_Ndibe closed T405620: [components-api] deployment is marked `successful` before finishing as Resolved.
Sep 25 2025, 6:06 PM · cloud-services-team, Toolforge
Raymond_Ndibe claimed T405620: [components-api] deployment is marked `successful` before finishing.
Sep 25 2025, 5:01 PM · cloud-services-team, Toolforge

Sep 24 2025

Raymond_Ndibe moved T334629: Update maintain_kubeusers to use the toolstate database from Ready to be worked on to Toolforge iteration 24 on the Toolforge board.
Sep 24 2025, 11:08 PM · Toolforge (Toolforge iteration 25), User-Raymond_Ndibe, cloud-services-team
Raymond_Ndibe created T405463: [jobs-api] pod cpu request greater than limitrange in lima-kilo, broken.
Sep 24 2025, 11:56 AM · Toolforge (Toolforge iteration 24)

Sep 23 2025

Raymond_Ndibe added a comment to T403167: [components-api] rebuilds un-changed images.

Hi @Raymond_Ndibe,

Essentially what you describe is how you get into this state.

I included it as an example along the lines of: perhaps builds-api should be truly authoritative for images and treat harbor as a literal storage layer, rather than the storage layer doing cleanup async (on phone as I just arrived in Spain so can't quote right now).

A --force-build isn't a golden ticket, specifically for webservice images, which have to be built directly via builds rather than components (this will hopefully go away soon, as it's just a label in the runtime and webservice is inconsistent with the spec from jobs - there is a ticket for that but I can't get it right now).

This issue appears to happen especially with the monitoring code, I assume due to Grafana being quite big - it's not worth splitting the images out because of the issues around not retaining/promiscuous rebuilding of images combined with an effective hard limit of image combinations per component (4).

A simple solution for this would be to get a quota increase, but the "root cause" should still be fixed/documented/considered for the long term.

Sep 23 2025, 6:35 PM · cloud-services-team, Toolforge

Sep 15 2025

Raymond_Ndibe added a comment to T403167: [components-api] rebuilds un-changed images.

This is similar to the error message you got @DamianZaremba. @dcaro you should also see this

Sep 15 2025, 11:47 PM · cloud-services-team, Toolforge
Raymond_Ndibe added a comment to T403167: [components-api] rebuilds un-changed images.

Hello @DamianZaremba, can you help with reproducing the error in the last message you sent? From my experience the only way this can happen is if you ran toolforge components deployment create (without --force-build) immediately after running toolforge build clean. We need to revisit the clean command; right now it deletes all the images in harbor while leaving behind the builds (unfortunately, for our users an existing build automatically implies the image exists, which is the right UX, but that is not how it currently works).

Sep 15 2025, 11:46 PM · cloud-services-team, Toolforge
Raymond_Ndibe added a subtask for T403167: [components-api] rebuilds un-changed images: T404157: [builds-api, maintain-harbor] fix build/image cleanup.
Sep 15 2025, 8:09 PM · cloud-services-team, Toolforge
Raymond_Ndibe added a parent task for T404157: [builds-api, maintain-harbor] fix build/image cleanup: T403167: [components-api] rebuilds un-changed images.
Sep 15 2025, 8:09 PM · Toolforge (Toolforge iteration 25), Patch-For-Review
Raymond_Ndibe updated the task description for T404157: [builds-api, maintain-harbor] fix build/image cleanup.
Sep 15 2025, 8:08 PM · Toolforge (Toolforge iteration 25), Patch-For-Review

Sep 12 2025

Raymond_Ndibe changed the status of T402568: [components-api] Queue builds when the build queue is full from Open to In Progress.
Sep 12 2025, 6:45 PM · Toolforge (Toolforge iteration 25), Patch-For-Review
Raymond_Ndibe changed the status of T402568: [components-api] Queue builds when the build queue is full, a subtask of T401851: [components-api,beta] Image should only be build once when re-used in components, from Open to In Progress.
Sep 12 2025, 6:44 PM · Toolforge (Toolforge iteration 25)

Sep 10 2025

Raymond_Ndibe changed the status of T404157: [builds-api, maintain-harbor] fix build/image cleanup from Open to In Progress.
Sep 10 2025, 2:27 AM · Toolforge (Toolforge iteration 25), Patch-For-Review

Sep 9 2025

Raymond_Ndibe created T404157: [builds-api, maintain-harbor] fix build/image cleanup.
Sep 9 2025, 11:55 PM · Toolforge (Toolforge iteration 25), Patch-For-Review
Raymond_Ndibe claimed T402568: [components-api] Queue builds when the build queue is full.
Sep 9 2025, 11:02 PM · Toolforge (Toolforge iteration 25), Patch-For-Review
Raymond_Ndibe changed the status of T403513: [lima-kilo] fix permission of tool's home dir from Invalid to Resolved.
Sep 9 2025, 11:01 PM · Toolforge (Toolforge iteration 24)
Raymond_Ndibe closed T403513: [lima-kilo] fix permission of tool's home dir as Invalid.
Sep 9 2025, 11:00 PM · Toolforge (Toolforge iteration 24)
Raymond_Ndibe closed T350687: [harbor] Move harbor data to object storage service as Resolved.
Sep 9 2025, 7:30 PM · Toolforge (Toolforge iteration 24), cloud-services-team (FY2025/26-Q1-Q2), User-Raymond_Ndibe, Goal
Raymond_Ndibe closed T350687: [harbor] Move harbor data to object storage service, a subtask of T356301: [harbor] Deploy with Helm, as Resolved.
Sep 9 2025, 7:30 PM · cloud-services-team, Toolforge, User-Raymond_Ndibe, User-aborrero, Goal
Raymond_Ndibe updated the task description for T350687: [harbor] Move harbor data to object storage service.
Sep 9 2025, 7:30 PM · Toolforge (Toolforge iteration 24), cloud-services-team (FY2025/26-Q1-Q2), User-Raymond_Ndibe, Goal
Raymond_Ndibe moved T401994: [components-api] support port protocol in config from In Progress to In Review on the Toolforge (Toolforge iteration 24) board.
Sep 9 2025, 1:31 PM · Toolforge (Toolforge iteration 24)
Raymond_Ndibe moved T402572: [components-api] handle non-passed arguments and defaults consistently from In Progress to In Review on the Toolforge (Toolforge iteration 24) board.
Sep 9 2025, 1:31 PM · Toolforge (Toolforge iteration 24)
Raymond_Ndibe moved T401172: [jobs-api] make job status an enum, with clearly defined states from In Progress to In Review on the Toolforge (Toolforge iteration 24) board.
Sep 9 2025, 1:30 PM · Toolforge (Toolforge iteration 25), cloud-services-team (FY2025/26-Q1-Q2), Patch-For-Review, User-Raymond_Ndibe

Sep 2 2025

Raymond_Ndibe created T403513: [lima-kilo] fix permission of tool's home dir.
Sep 2 2025, 6:39 PM · Toolforge (Toolforge iteration 24)

Aug 27 2025

Raymond_Ndibe closed T402521: Quota increase request for catalyst-dev as Resolved.
Aug 27 2025, 4:53 PM · Catalyst, cloud-services-team, Cloud-VPS (Quota-requests)
Raymond_Ndibe added a comment to T402521: Quota increase request for catalyst-dev.

before

raymond-ndibe@cloudcontrol1006:~$ sudo wmcs-openstack quota show catalyst-dev
+-----------------------+-------+
| Resource              | Limit |
+-----------------------+-------+
| cores                 |     8 |
| ram                   | 16384 |
| gigabytes             |    80 |
...
+-----------------------+-------+

after

raymond-ndibe@cloudcontrol1006:~$ sudo wmcs-openstack quota show catalyst-dev
+-----------------------+-------+
| Resource              | Limit |
+-----------------------+-------+
| cores                 |    32 |
| ram                   | 65536 |
| gigabytes             |   670 |
...
+-----------------------+-------+
Aug 27 2025, 4:53 PM · Catalyst, cloud-services-team, Cloud-VPS (Quota-requests)
Raymond_Ndibe claimed T402521: Quota increase request for catalyst-dev.
Aug 27 2025, 4:30 PM · Catalyst, cloud-services-team, Cloud-VPS (Quota-requests)

Aug 26 2025

Raymond_Ndibe added a comment to T401172: [jobs-api] make job status an enum, with clearly defined states.

  • one-off | continuous jobs: examples:
    • {"short": "pending", "messages": ["restarting, maybe retrying?"], "duration": "00:00:32", "up_to_date": false} for jobs that are restarting either because of failure when backofflimit is specified (for jobs), or restarting because the command has exited (for deployments).
    • {"short": "pending", "messages": ["scheduling"], "duration": "00:00:32", "up_to_date": false} pod is waiting to be assigned to node
    • {"short": "pending", "messages": ["initializing"], "duration": "00:00:32", "up_to_date": false} pod init containers are still running, images still getting pulled
    • {"short": "running", "messages": ["running"], "duration": "00:00:32", "up_to_date": true} all containers in the pod are running
    • {"short": "succeeded", "messages": ["succeeded"], "duration": "00:00:32", "up_to_date": true} pod containers exited successfully
    • {"short": "stopped", "messages": ["stopped"], "duration": "00:00:32", "up_to_date": true} (upcoming) job was stopped by user, to maybe be restarted later.
    • {"short": "failed", "messages": ["Command not found"], "duration": "00:00:32", "up_to_date": true} the pod, container(s) failed to run
    • {"short": "unknown", "messages": ["unknown"], "duration": "00:00:32", "up_to_date": true} unable to get the status of the job for some reason
Aug 26 2025, 6:42 PM · Toolforge (Toolforge iteration 25), cloud-services-team (FY2025/26-Q1-Q2), Patch-For-Review, User-Raymond_Ndibe
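The states listed above could be modeled roughly as follows; the enum values and fields mirror the examples, but the types themselves are a sketch, not the jobs-api implementation.

```python
from dataclasses import dataclass, field
from enum import Enum

# Sketch of the proposed status enum; values mirror the "short" field
# in the example payloads above.
class JobState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    STOPPED = "stopped"
    FAILED = "failed"
    UNKNOWN = "unknown"

@dataclass
class JobStatus:
    short: JobState
    messages: list = field(default_factory=list)
    duration: str = "00:00:00"
    up_to_date: bool = True

status = JobStatus(JobState.PENDING, ["scheduling"], "00:00:32", False)
print(status.short.value, status.messages)
```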
Raymond_Ndibe added a comment to T401172: [jobs-api] make job status an enum, with clearly defined states.

This sounds like a good improvement.

Just a question regarding inconsistent/up_to_date - I can't quite parse "the saved spec will still be out of sync with the running spec in the runtime", is the intention for this to reflect:

  • Job config sent to the API has not been synced to the Job object in the runtime (k8s) - I think this is done synchronously during the API call?
  • Job instance running (pod) is using an older version of the spec than is in the Job (k8s object), i.e. it was started before the Job changed - Continuous would be restarted, One-off would just exit, so this would only really apply to Scheduled?
Aug 26 2025, 6:10 PM · Toolforge (Toolforge iteration 25), cloud-services-team (FY2025/26-Q1-Q2), Patch-For-Review, User-Raymond_Ndibe
Raymond_Ndibe added a comment to T402923: [builds-service] builds not working due to access issues in tools.

@Raymond_Ndibe I increased the quota too. For issues like this, can you drop into IRC instead? It's way easier to coordinate.

Aug 26 2025, 1:45 PM · Toolforge (Toolforge iteration 23)
Raymond_Ndibe added a comment to T402923: [builds-service] builds not working due to access issues in tools.

Mentioned in SAL (#wikimedia-cloud) [2025-08-26T13:42:06Z] <dcaro> extended object storage quota to 100G (T402923)

Aug 26 2025, 1:43 PM · Toolforge (Toolforge iteration 23)
Raymond_Ndibe added a comment to T402923: [builds-service] builds not working due to access issues in tools.

Yeah, I think I see where the problem is coming from.

raymond-ndibe@cloudcontrol1006:~$ sudo radosgw-admin user info --uid tools\$tools
{
    "user_id": "tools$tools",
...
    "bucket_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    },
    "user_quota": {
        "enabled": true,
        "check_on_raw": false,
        "max_size": 53687091200,
        "max_size_kb": 52428800,
        "max_objects": 51107
    },
...
}

max_size_kb is 52428800, which is equivalent to 50GB. The storage usage on Horizon is 49.9GB. I just manually ran garbage collection on Harbor. Let me see if I can build something right now.
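The unit conversion can be checked quickly from the two quota fields in the radosgw output (binary units, KiB and GiB):

```python
# Values from the user_quota block above.
max_size = 53687091200   # bytes
max_size_kb = 52428800   # KiB

# The two fields are consistent with each other...
assert max_size_kb * 1024 == max_size

# ...and correspond to exactly 50 GiB.
print(max_size / 1024**3)  # → 50.0
```

With Horizon reporting 49.9GB used, the bucket was right up against this 50GB ceiling, which matches the sudden build failures.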

Aug 26 2025, 1:42 PM · Toolforge (Toolforge iteration 23)
Raymond_Ndibe added a comment to T402923: [builds-service] builds not working due to access issues in tools.

Might be worth checking the storage quota of the harborstorage s3 bucket. The fact that it was working initially but stopped suddenly makes me think it's something to do with the storage quota. Let me check.

Aug 26 2025, 1:31 PM · Toolforge (Toolforge iteration 23)
Raymond_Ndibe added a comment to T402923: [builds-service] builds not working due to access issues in tools.

looking at this

Aug 26 2025, 1:18 PM · Toolforge (Toolforge iteration 23)

Aug 24 2025

Raymond_Ndibe closed T401957: Request creation of eseap VPS project as Resolved.
Aug 24 2025, 4:52 AM · Cloud-VPS (Project-requests)
Raymond_Ndibe added a comment to T401957: Request creation of eseap VPS project.

sudo cookbook wmcs.vps.create_project --user robertsky --user chlod --cluster-name eqiad1 --project eseap --task-id T401957 --description "To host eseap.org website and other related digital assets (for now a phorge task tracker) for ESEAP Hub"
...
raymond-ndibe@cloudcontrol1006:~$ sudo wmcs-openstack quota show eseap

+-----------------------+-------+
| Resource              | Limit |
+-----------------------+-------+
| cores                 |     8 |
| instances             |     8 |
| ram                   | 16384 |
| fixed_ips             |  None |
| networks              |   100 |
| volumes               |     8 |
| snapshots             |     4 |
| gigabytes             |    80 |
| backups               |    10 |
| volumes_high-iops     |    -1 |
| gigabytes_high-iops   |    -1 |
| snapshots_high-iops   |    -1 |
| volumes___DEFAULT__   |    -1 |
| gigabytes___DEFAULT__ |    -1 |
| snapshots___DEFAULT__ |    -1 |
| volumes_standard      |    -1 |
| gigabytes_standard    |    -1 |
| snapshots_standard    |    -1 |
| groups                |     4 |
| ports                 |   500 |
| rbac_policies         |    10 |
| routers               |    10 |
| subnets               |   100 |
| subnet_pools          |    -1 |
| injected-file-size    | 10240 |
| injected-path-size    |   255 |
| injected-files        |     5 |
| key-pairs             |   100 |
| properties            |   128 |
| server-group-members  |    10 |
| server-groups         |    10 |
| floating-ips          |     0 |
| secgroup-rules        |   100 |
| secgroups             |    40 |
| backup-gigabytes      |  1000 |
| per-volume-gigabytes  |    -1 |
+-----------------------+-------+

@Robertsky @Chlod, default quotas were used because no quota details were included in the request. If you need any of these values changed, please create a new request.

Aug 24 2025, 4:52 AM · Cloud-VPS (Project-requests)
Raymond_Ndibe claimed T401957: Request creation of eseap VPS project.
Aug 24 2025, 4:21 AM · Cloud-VPS (Project-requests)

Aug 20 2025

Raymond_Ndibe added a comment to T358496: [toolforge,storage] Provide per-tool access to cloud-vps object storage.

What am I missing? Why is this a bad approach? The upside is that all the problems of managing multiple auth tokens go away. We just do things the same way we currently do them in toolforge.

Note that s3 is a protocol, not a service: it defines a certain set of methods and flows for managing files in an object storage service.

So if I understand correctly, you are proposing implementing our own file management protocol (different from s3, probably some subset) on the storage-api, which will host the user objects in a single bucket on openstack?

If so, there are some drawbacks:

  • No libraries to interact with it, any existing software that has s3 integration would have to be rewritten
  • No tooling to interact with it, this includes s3cmd, s3fs, k8s volume integration
  • Re-implementing a subset of what s3 defines, but with fewer engineers and no upstream/community behind it
  • Vendor lock-in for users (custom storage code in your tool that's not easily portable to any other platform)
  • Re-implementing quotas and quota management on our side (as everything is now on the same quota under the openstack project hosting the bucket)
  • Architecturally we will need the extra throughput to go back-and-forth from toolforge APIs when moving files (potentially big files)

Note that I'm not saying that it's good or bad, just raising drawbacks that you'd have to deal with, so they are accounted for when doing the tradeoff analysis of the options.

Some of those can be alleviated with different decisions/designs:

  • if we just 'proxy' s3 requests through validating/forcing the bucket used, then users can still use s3 tooling and libs
  • using one bucket per tool allows for easier management (delete a tool, then delete its buckets), potential quotas (would have to look) and such
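The first alternative ('proxy' s3 requests while validating/forcing the bucket) could look roughly like this. A minimal sketch, assuming one bucket per tool and path-style s3 requests; the function and bucket-naming convention are hypothetical, not real toolforge code:

```python
def tool_bucket(tool_name: str) -> str:
    """Hypothetical naming convention: one bucket per tool."""
    return f"tool-{tool_name}"


def rewrite_s3_path(path: str, tool_name: str) -> str:
    """Force a path-style s3 request ('/bucket/key...') onto the
    tool's own bucket, preserving the object key. A proxy doing this
    lets users keep standard s3 tooling (s3cmd, s3fs, libraries)
    while never touching another tool's bucket."""
    parts = path.lstrip("/").split("/", 1)
    key = parts[1] if len(parts) > 1 else ""
    return f"/{tool_bucket(tool_name)}/{key}"


print(rewrite_s3_path("/somebucket/data/file.txt", "wpcleaner"))
# → /tool-wpcleaner/data/file.txt
```

A real proxy would also have to re-sign the rewritten request with the project credentials, since s3 signatures cover the path.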
Aug 20 2025, 3:29 PM · Toolforge, Patch-For-Review, cloud-services-team
Raymond_Ndibe added a comment to T358496: [toolforge,storage] Provide per-tool access to cloud-vps object storage.

For some reason we don't seem to be discussing the possibility of making one toolforge object store and having a toolforge-storage to group and manage objects belonging to each tool. This seems more consistent with what a platform as a service is, less_flexibility+auto_management. If a tool needs access to its own s3 bucket complete with keys and everything, aren't they better off creating an openstack project, etc?

Our users don't need to know anything about buckets or tokens or whatever. They just need to know they can store objects and retrieve them safely, preferably via toolforge alone. Anything outside of toolforge seems out-of-scope for what toolforge is about.

How do they access the objects if it's not using the s3 protocol?

How do you authenticate that access without some sort of token/password?

That said, I agree that ideally the underlying bucket creation and management could be hidden behind a storage-api service, so the user does not have to have full access to all the bucket creation/deletion/etc., just secure access to that bucket once the storage-api creates it, deletes it and such. So in that sense, if there's a way to secure buckets individually, the storage-api can have its own authentication to openstack to manage them. So far though, it seems that our current setup does not allow for that fine-grained authentication (user<->bucket), as ec2 credentials give access to all the buckets of the project. Maybe we can investigate if we can add that auth directly on the ceph side instead of openstack?

We could also put the direct access to the objects through the storage-api, and use whichever authentication we have to toolforge APIs as the gatekeeper, though that would make any data fetching/putting pass through the toolforge API before being pushed to openstack/ceph, with the extra traffic and round trips, but it would allow us more fine-grained control of that access.

Aug 20 2025, 10:23 AM · Toolforge, Patch-For-Review, cloud-services-team
Raymond_Ndibe added a comment to T358496: [toolforge,storage] Provide per-tool access to cloud-vps object storage.

For some reason we don't seem to be discussing the possibility of making one toolforge object store and having a toolforge-storage to group and manage objects belonging to each tool. This seems more consistent with what a platform as a service is, less_flexibility+auto_management. If a tool needs access to its own s3 bucket complete with keys and everything, aren't they better off creating an openstack project, etc?

Aug 20 2025, 9:57 AM · Toolforge, Patch-For-Review, cloud-services-team
Raymond_Ndibe closed T397933: Disable tools.maintain-harbor as Resolved.
Aug 20 2025, 7:49 AM · Toolforge (Toolforge iteration 23), cloud-services-team
Raymond_Ndibe edited projects for T397933: Disable tools.maintain-harbor, added: Toolforge (Toolforge iteration 23); removed Toolforge.
Aug 20 2025, 7:48 AM · Toolforge (Toolforge iteration 23), cloud-services-team
Raymond_Ndibe added a comment to T401172: [jobs-api] make job status an enum, with clearly defined states.

I believe status messages should be as uniform as possible. If we need to convey extra information, it's better to put that in some form of status detail field.

Aug 20 2025, 7:40 AM · Toolforge (Toolforge iteration 25), cloud-services-team (FY2025/26-Q1-Q2), Patch-For-Review, User-Raymond_Ndibe

Aug 19 2025

Raymond_Ndibe updated the task description for T350687: [harbor] Move harbor data to object storage service.
Aug 19 2025, 5:00 PM · Toolforge (Toolforge iteration 24), cloud-services-team (FY2025/26-Q1-Q2), User-Raymond_Ndibe, Goal
Raymond_Ndibe moved T350687: [harbor] Move harbor data to object storage service from In Progress to In Review on the Toolforge (Toolforge iteration 23) board.
Aug 19 2025, 3:29 PM · Toolforge (Toolforge iteration 24), cloud-services-team (FY2025/26-Q1-Q2), User-Raymond_Ndibe, Goal
Raymond_Ndibe updated the task description for T350687: [harbor] Move harbor data to object storage service.
Aug 19 2025, 3:13 PM · Toolforge (Toolforge iteration 24), cloud-services-team (FY2025/26-Q1-Q2), User-Raymond_Ndibe, Goal
Raymond_Ndibe updated the task description for T350687: [harbor] Move harbor data to object storage service.
Aug 19 2025, 1:46 PM · Toolforge (Toolforge iteration 24), cloud-services-team (FY2025/26-Q1-Q2), User-Raymond_Ndibe, Goal