
Provide a Redis container for use within a tool's namespace
Closed, Resolved · Public · Feature

Description

For some tools, the shared Toolforge Redis instance is a performance/functionality bottleneck. T318479: Intermittent redis connection timeouts in Toolforge is one documented example of this.

Today it is possible for a tool to use the Build Service and its Apt builder to create a custom image running Redis. This image can then be paired with a manually created Service object to expose a task running the image to other Pods within the tool's namespace. See https://gitlab.wikimedia.org/toolforge-repos/wikibugs2-znc#deploy-on-toolforge for documentation of this pattern for a similar non-public Service. T348758: [jobs-api,jobs-cli] Support services in jobs envisions a future where exposing the service port is a more easily accomplished process with toolforge jobs.
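
For anyone unfamiliar with that pattern, a minimal sketch of the kind of manually created Service object involved is shown below; the names and selector labels are illustrative placeholders (not taken from the wikibugs2-znc repo) and would need to match the labels on the tool's actual Deployment:

tool@toolforge:~$ kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: redis                        # hypothetical Service name
spec:
  selector:
    app.kubernetes.io/name: redis    # must match the Pod labels of the task running the image
  ports:
  - port: 6379
    protocol: TCP
    targetPort: 6379
EOF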

In this task I propose that Toolforge provide a 'redis' container image in the same general way that it provides a 'python3.11' container image today. This would make deploying a tool-local Redis service easier by eliminating the need for a tool to build its own container image. When T348758 is eventually implemented, it will make running an isolated Redis for task queuing and other latency-sensitive applications relatively trivial.

Details

Title: Setup Redis service
Reference: toolforge-repos/containers-redis!1
Author: bd808
Source Branch: work/bd808/redis
Dest Branch: main

Event Timeline

At this time I would propose that the Redis be configured without any persistent storage strategy. The lack of persistent volume claims (PVCs) in the Toolforge Kubernetes cluster would necessitate NFS-backed state storage, and that is unlikely to meet either the performance hopes for the dedicated Redis or the expectations of the admins managing the NFS service.

My near-term ulterior motive for this task is being able to add a tool-local Redis to Wikibugs without needing to repeat the amount of work I did for https://gitlab.wikimedia.org/toolforge-repos/wikibugs2-znc.

bd808 changed the task status from Open to In Progress.Mar 20 2024, 12:25 AM
bd808 claimed this task.
bd808 triaged this task as Medium priority.

Change 1012797 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/docker-images/toollabs-images@master] Add redis image

https://gerrit.wikimedia.org/r/1012797

bd808 changed the task status from In Progress to Stalled.Mar 21 2024, 3:21 PM

I have put a -2 lock on my https://gerrit.wikimedia.org/r/c/operations/docker-images/toollabs-images/+/1012797 patch. I think it is worth waiting a few weeks to see if a strong hard fork of Redis shows up following https://redis.com/blog/redis-adopts-dual-source-available-licensing/. To be clear, the Debian Bookworm distribution's Redis v7.0.15 is licensed under the three-clause Berkeley Software Distribution (BSD) license, which is OSI approved. My concern is introducing new dependencies on Redis now that we know that upstream will not continue to produce OSI-approved products in the future.

See also: T360596: Figure out a plan to move forward with regarding Redis License changes

bd808 changed the task status from Stalled to In Progress.May 23 2024, 3:00 PM

I have put a -2 lock on my https://gerrit.wikimedia.org/r/c/operations/docker-images/toollabs-images/+/1012797 patch. I think it is worth waiting a few weeks to see if a strong hard fork of Redis shows up following https://redis.com/blog/redis-adopts-dual-source-available-licensing/.

I have changed my thinking on this topic. Here is what I wrote on the Gerrit patch when I removed my -2 block there:

I have been thinking about my reaction here and have decided that waiting for a clear future Redis replacement is being too conservative.

Toolforge still offers ElasticSearch via that project's last FOSS release. We also still offer (quietly) a Python2 image. I don't want to argue that hanging on to old things is a best practice, but it is a reality. Another reality today is that we have users who would benefit from a local Redis for their tool. Let's unblock that happening sooner rather than later. We will be able to deal with a Redis-to-whatever migration in the future when the migration target is known.

Now that we have services in jobs (docs coming soon), @bd808 do you want to continue your work on this or do you want someone else to finish it up?

Now that we have services in jobs (docs coming soon), @bd808 do you want to continue your work on this or do you want someone else to finish it up?

If other folks are interested in moving this forward I would be fine with that. I invented other solutions for the problem that I had in T360378#9640026 mostly because of the unfortunate licensing changes upstream with Redis. I do however think there are a number of tools that would benefit from this general solution.

I think things are now at the point of needing a +2 on change 1012797, building and storing the new container, and then telling folks how to use this local in-memory Redis in their tools as part of the https://wikitech.wikimedia.org/wiki/Help:Toolforge/Redis_for_Toolforge documentation.

A related discussion thread on IRC has raised the idea of making this container with the build service (in the style of https://gitlab.wikimedia.org/toolforge-repos/pywikibot-buildservice) rather than adding it to the "legacy" shared container collection.

The pros for that include making it easier to bring the active users of the container into its longer-term maintenance. The cons are not well established yet. There are things that we do not yet know how to do with the build service, but they are also things that the proposed container does not need, such as building from C source, which would require something like T363033: [builds-builder] Support using custom buildpacks or T363027: [builds-builder] Support adding repositories for Apt buildpack.

I have a container that is ready for early-adopter folks to try out:

A REDIS_PASSWORD envvar is required to start the server and to authenticate to the deployed Redis. I decided to make the service require authentication so that folks can trust the integrity of the stored data. It may not be obvious to everyone, but Toolforge tools can connect to services exposed by other tools. There is currently no system to firewall by namespace in the Toolforge Kubernetes cluster.

Manual testing might look something like:

me@laptop:~$ ssh dev.toolforge.org
me@toolforge:~$ become $TOOL
tool@toolforge:~$ toolforge envvars create REDIS_PASSWORD
tool@toolforge:~$ toolforge jobs run \
  --image tool-containers/redis:latest \
  --command server \
  --continuous \
  --emails none \
  --port 6379 \
  redis
tool@toolforge:~$ webservice buildservice shell --buildservice-image tool-containers/redis:latest --mount none
I_have_no_name!@shell-1717890061:~$ client
redis:6379> INFO server
# Server
redis_version:6.0.16
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:a3fdef44459b3ad6
redis_mode:standalone
os:Linux 6.1.0-18-cloud-amd64 x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:11.2.0
process_id:7
run_id:91251ca3d842fe72cc45ac2eb0998cb30e7f2bed
tcp_port:6379
uptime_in_seconds:2204
uptime_in_days:0
hz:10
configured_hz:10
lru_clock:6614288
executable:/workspace/redis-server
config_file:/workspace/.config/redis.conf
io_threads_active:0
redis:6379>
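
Once the job is running, other pods in the same tool namespace can reach the Service using the job name as the hostname. A hedged sketch of authenticating from such a pod, assuming redis-cli is available in that container and the job is named redis as above:

$ REDISCLI_AUTH="$REDIS_PASSWORD" redis-cli -h redis -p 6379 PING
PONG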

Change #1012797 abandoned by BryanDavis:

[operations/docker-images/toollabs-images@master] Add redis image

Reason:

https://phabricator.wikimedia.org/T360378#9873488

https://gerrit.wikimedia.org/r/1012797

I have tried the approach described at https://wikitech.wikimedia.org/wiki/Tool:Containers#Redis_container and it works! The editgroups tool is functional again thanks to that. High five!

One problem I encountered when using the supplied commands was that I got an error saying that the redis job already exists. I had to prefix the name of the job with my tool name: toolforge jobs run … editgroups-redis. And then refer to it as editgroups-redis when connecting to it from other pods.
So I have taken the liberty to update the docs:
https://wikitech.wikimedia.org/w/index.php?title=Tool:Containers&diff=prev&oldid=2192396

Is that the right approach?

Thanks a ton for solving my problem in any case!

I have tried the approach described at https://wikitech.wikimedia.org/wiki/Tool:Containers#Redis_container and it works! The editgroups tool is functional again thanks to that. High five!

Yay!

One problem I encountered when using the supplied commands was that I got an error saying that the redis job already exists. I had to prefix the name of the job with my tool name: toolforge jobs run … editgroups-redis. And then refer to it as editgroups-redis when connecting to it from other pods.
So I have taken the liberty to update the docs:
https://wikitech.wikimedia.org/w/index.php?title=Tool:Containers&diff=prev&oldid=2192396

Is that the right approach?

The error seems unexpected as job names should be scoped to the tool's namespace. I'm not currently able to think of why this would not be the case. Looking at your tool's current namespace I do see a Service object named redis that seems to have leaked from the jobs framework based on its metadata:

$ kubectl get service redis -o yaml
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2024-06-09T08:02:36Z"
  labels:
    app.kubernetes.io/component: deployments
    app.kubernetes.io/created-by: editgroups
    app.kubernetes.io/managed-by: toolforge-jobs-framework
    app.kubernetes.io/name: redis
    app.kubernetes.io/version: "2"
    toolforge: tool
  name: redis
  namespace: tool-editgroups
  resourceVersion: "1989030307"
  uid: ef7e6933-c37b-45f9-b5a1-f15af0663e89
spec:
  clusterIP: 10.103.164.216
  clusterIPs:
  - 10.103.164.216
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - port: 6379
    protocol: TCP
    targetPort: 6379
  selector:
    app.kubernetes.io/component: deployments
    app.kubernetes.io/created-by: editgroups
    app.kubernetes.io/managed-by: toolforge-jobs-framework
    app.kubernetes.io/name: redis
    app.kubernetes.io/version: "2"
    toolforge: tool
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

I think that this set of circumstances is worthy of a bug report, especially if you can find a way to reproduce the issue. I did try to reproduce this state by starting a job with a service port exposed and then stopping the job, but in my test the job's Deployment, ReplicaSet, Pod, and Service were all removed from the namespace as expected. I was also able to restart that job again with the "redis" name.

That being said, the workaround you came up with of adding the tool's name as a prefix seems like a reasonable one. We can leave it in the docs until we know more about what went wrong.

Ok, I wonder if it could be linked to the fact that I tried creating the redis job earlier, back when the quota didn't allow for it? I'm not sure.

Ok, I wonder if it could be linked to the fact that I tried creating the redis job earlier, back when the quota didn't allow for it? I'm not sure.

That seems possible. I haven't looked at the code that implemented the --port option, but I could imagine the quota limit triggering a failure of the Deployment that didn't also lead to the Service being cleaned up.
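
If a similarly orphaned Service ever blocks reusing a job name again, a manual cleanup from the tool account might look like the following sketch, which assumes the leaked object is a Service named redis as in the output above:

tool@toolforge:~$ kubectl get services          # look for objects left behind without a matching job
tool@toolforge:~$ kubectl delete service redis  # remove the leaked Service so the name can be reused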

One problem I encountered when using the supplied commands was that I got an error saying that the redis job already exists. I had to prefix the name of the job with my tool name: toolforge jobs run … editgroups-redis. And then refer to it as editgroups-redis when connecting to it from other pods.
So I have taken the liberty to update the docs:
https://wikitech.wikimedia.org/w/index.php?title=Tool:Containers&diff=prev&oldid=2192396

That being said, the workaround you came up with of adding the tool's name as a prefix seems like a reasonable one. We can leave it in the docs until we know more about what went wrong.

I went ahead and reverted the documentation change. The $TOOL- prefix is still a reasonable addition for anyone who runs into naming collision problems like @Pintoch did, but it is not generally expected to be needed; the collision looks to have been caused by a resource leak in the editgroups Kubernetes namespace that I have not been able to recreate in other tools.