
New Service Request Toolhub
Closed, Resolved, Public

Description

Description: https://meta.wikimedia.org/wiki/Toolhub
Timeline: 2021-08-12 or before (depends on completion of T273020: Security Readiness Review For Toolhub)
Diagram: <Link to an architectural diagram>
Technologies:

  • Python >=3.7
  • Numerous 3rd party Python libraries (Django, etc)
  • Numerous 3rd party JavaScript libraries (vue.js, etc)
  • Task runner (Celery or similar)

WMF services this new service talks to:

Which services will connect to this service and how:

  • Direct usage by end-users on the internet (https)

Will this service use our event platform? No
Does this service talk to an external service? Yes. Part of the application is an HTTP(S) web crawler which will need to periodically fetch URLs registered by end users to populate/update information in the MySQL database.
Point person: @bd808

See also:

Checklist

Event Timeline

Restricted Application added a subscriber: Aklapper. Apr 21 2021, 10:31 PM
akosiaris triaged this task as Medium priority. Apr 23 2021, 12:19 PM
akosiaris moved this task from Inbox to Externally blocked on the Service-deployment-requests board.
Legoktm updated the task description.

I added a checklist. Assuming that containers already exist for everything, I think the next step would be to start creating a helm chart and test it locally with minikube. I don't think we have a scaffold for Python yet, but maybe you can copy linkrecommendation's chart, which I believe is the only other Python k8s service.
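For what it's worth, the local test loop usually looks something like the following (chart path and release name here are illustrative, not the actual toolhub chart layout):

```shell
# Assumes a local minikube cluster and a chart under charts/toolhub
# in operations/deployment-charts; names are placeholders.
minikube start

# Render templates locally to catch YAML and templating errors early
helm lint charts/toolhub
helm template toolhub charts/toolhub

# Install into minikube and watch the pods come up
helm install toolhub charts/toolhub
kubectl get pods -w
```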

Other questions:

  • What metrics will be exposed by Toolhub? Will it need additional prometheus exporters as sidecars? And do we already have images for all of those?
  • What format are logs in?
  • Connecting to MariaDB/elasticsearch should be fine, it just needs to be set in the egress.
  • For memcached and redis, do those need to use the shared cluster? Could they instead use containers in the pod? I'm not actually sure what we as serviceops prefer; T277711 is where we're at for MW. It's something I can bring up in next week's team meeting.
  • The HTTP crawler sounds like it might need to use the webproxy.
  • Will this work active/active? or active/passive?
  • What metrics will be exposed by Toolhub? Will it need additional prometheus exporters as sidecars? And do we already have images for all of those?

T276370: Add django-prometheus monitoring support adds a /metrics endpoint to the app itself to expose some metrics. I think that can be scraped directly. I'd love to hear about other things that I should expose somehow.
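For reference, the wiring that django-prometheus documents is roughly the following (these are generic settings/urls fragments from its documentation, not Toolhub's actual code):

```python
# settings.py fragment: register the app and wrap the middleware stack.
# PrometheusBeforeMiddleware must run first and PrometheusAfterMiddleware
# last so request latency is measured around everything in between.
INSTALLED_APPS = [
    "django_prometheus",
    # ... the project's other apps ...
]

MIDDLEWARE = [
    "django_prometheus.middleware.PrometheusBeforeMiddleware",
    # ... the project's other middleware ...
    "django_prometheus.middleware.PrometheusAfterMiddleware",
]

# urls.py fragment: this include() exposes the /metrics endpoint
# that Prometheus can scrape directly.
#
# from django.urls import include, path
# urlpatterns = [
#     path("", include("django_prometheus.urls")),
# ]
```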

  • What format are logs in?

Whatever format we configure for the Python logger. T276374: Figure out what production logging config needs to look like got the mostly unhelpful "make it work with ECS" response. I have not been able to find anything that looks like a working ECS config for a Django app. That logging "standard" is more of a hope that folks will do all the work to make everything conform to what is easy for Elasticsearch to handle, and I have honestly not been able to find projects using it natively. Most things I have found really use some ingester to reformat the data on its way into ES.
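A minimal app-side approach, sketched below, is a JSON formatter that emits a few ECS-style field names directly from the Python logging stack (this is an illustration of the idea, not a complete or validated ECS implementation; Elastic's `ecs-logging` package offers a `StdlibFormatter` doing something similar):

```python
import json
import logging
import time


class EcsFormatter(logging.Formatter):
    """Minimal ECS-style JSON formatter (illustrative; not the full ECS spec).

    Emits the handful of fields Elasticsearch-side tooling usually keys on:
    @timestamp, log.level, log.logger, and message.
    """

    def format(self, record):
        doc = {
            "@timestamp": time.strftime(
                "%Y-%m-%dT%H:%M:%S", time.gmtime(record.created)
            ),
            "log.level": record.levelname.lower(),
            "log.logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            doc["error.stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(doc)


# Wired into Django's LOGGING dict via the "()" factory key:
LOGGING = {
    "version": 1,
    "formatters": {"ecs": {"()": EcsFormatter}},
    "handlers": {
        "console": {"class": "logging.StreamHandler", "formatter": "ecs"},
    },
    "root": {"handlers": ["console"], "level": "INFO"},
}
```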

  • Connecting to MariaDB/elasticsearch should be fine, it just needs to be set in the egress.
  • For memcached and redis, do those need to use the shared cluster? Could they instead use containers in the pod? I'm not actually sure what we as serviceops prefer; T277711 is where we're at for MW. It's something I can bring up in next week's team meeting.

No, they do not need to be in the shared cluster. At the moment only memcached is needed (redis work queue is not coded yet, and will not make the 1.0 release). Memcached will be used as a pass-through cache in front of the db. Redis will eventually be used for a Celery work queue. Both should be "tiny" volume compared to what folks are used to from MediaWiki prod needs.
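The pass-through (cache-aside) pattern described above can be sketched like this; the client interface, key scheme, and `FakeCache` stand-in are all illustrative rather than Toolhub's actual code:

```python
import json


def get_tool(client, db_lookup, name, ttl=300):
    """Pass-through (cache-aside) read: try memcached, fall back to the DB.

    `client` is any object with pymemcache-style get/set; a dict-backed
    fake works for local testing. Key scheme and TTL are placeholders.
    """
    key = "toolhub:tool:%s" % name
    cached = client.get(key)
    if cached is not None:
        return json.loads(cached)
    record = db_lookup(name)  # authoritative read from the database
    client.set(key, json.dumps(record), ttl)
    return record


class FakeCache:
    """In-memory stand-in for a memcached client (ignores TTLs)."""

    def __init__(self):
        self.store = {}

    def get(self, key):
        return self.store.get(key)

    def set(self, key, value, ttl):
        self.store[key] = value
```

With this shape the cache is purely an optimization: losing a pod-local memcached container only costs extra database reads, which is part of why the shared cluster is not required.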

  • The HTTP crawler sounds like it might need to use the webproxy.

Yes, this seems very likely; ideally I can find a testing environment to work out how to do it.
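On the crawler side, routing through a webproxy is mostly a matter of plumbing a proxy URL into the HTTP client. A stdlib sketch (the proxy URL below is a placeholder; in production the real value would come from puppet/helm config, often via the standard `HTTPS_PROXY` environment variable):

```python
import os
import urllib.request


def build_crawler_opener(proxy_url=None):
    """Return a urllib opener that routes crawler traffic via a webproxy.

    Falls back to the HTTPS_PROXY environment variable when no explicit
    proxy URL is given; with neither set, a direct-connection opener is
    returned.
    """
    proxy_url = proxy_url or os.environ.get("HTTPS_PROXY")
    handlers = []
    if proxy_url:
        handlers.append(
            urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
        )
    return urllib.request.build_opener(*handlers)
```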

  • Will this work active/active? or active/passive?

I have been assuming active/passive. See T271480: Setup production database for Toolhub for some discussion with the DBA folks on the same topic. As I understand it, the misc db clusters are not currently designed for any multi-master replication which would be needed for active/active.

I have edited the description to remove celery and redis from the initial deployment requirements. There would only be one celery job to run with the current functionality, and it feels like it will be easier to implement it as a Kubernetes CronJob object (T276405: Setup periodic crawling).
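The CronJob approach mentioned above would look roughly like this (schedule, image, and command are assumptions for illustration; the real manifest would live in the toolhub helm chart):

```yaml
# Illustrative sketch only; not the deployed manifest.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: toolhub-crawler
spec:
  schedule: "0 * * * *"      # hourly, as an example
  concurrencyPolicy: Forbid  # never let crawl runs overlap
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: crawler
              image: docker-registry.wikimedia.org/toolhub:latest
              command: ["python3", "manage.py", "crawl"]
```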

Change 711624 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/puppet@production] Add tokens and users for toolhub service

https://gerrit.wikimedia.org/r/711624

Change 711625 had a related patch set uploaded (by Legoktm; author: Legoktm):

[labs/private@master] Add k8s users, tokens for toolhub service

https://gerrit.wikimedia.org/r/711625

Change 711625 merged by Legoktm:

[labs/private@master] Add k8s users, tokens for toolhub service

https://gerrit.wikimedia.org/r/711625

Change 711624 merged by Legoktm:

[operations/puppet@production] Add tokens and users for toolhub service

https://gerrit.wikimedia.org/r/711624

Change 711637 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/dns@master] Add Toolhub public DNS name

https://gerrit.wikimedia.org/r/711637

Change 711639 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/deployment-charts@master] admin_ng: Add toolhub namespace

https://gerrit.wikimedia.org/r/711639

Change 711648 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] Add toolhub to cache backends

https://gerrit.wikimedia.org/r/711648

Change 711639 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Add toolhub namespace

https://gerrit.wikimedia.org/r/711639

We're still missing the OAuth2 key/secret, but otherwise I think it should be possible to deploy to the staging/eqiad/codfw clusters now once the helmfile.d part is written.

And then we can set up the LVS stuff (needs pairing with Traffic).

Change 711702 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/puppet@production] Add toolhub to LVS

https://gerrit.wikimedia.org/r/711702

Change 711703 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/puppet@production] service: Switch toolhub to lvs_setup

https://gerrit.wikimedia.org/r/711703

Change 711704 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/puppet@production] service: Switch toolhub to monitoring_setup

https://gerrit.wikimedia.org/r/711704

Change 711705 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/puppet@production] service: Switch toolhub to production

https://gerrit.wikimedia.org/r/711705

Change 711726 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/dns@master] Add toolhub.svc.{codfw,eqiad}.wmnet

https://gerrit.wikimedia.org/r/711726

Change 711727 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/dns@master] Add toolhub to discovery

https://gerrit.wikimedia.org/r/711727

Change 711726 merged by Legoktm:

[operations/dns@master] Add toolhub.svc.{codfw,eqiad}.wmnet

https://gerrit.wikimedia.org/r/711726

Change 714867 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Add helmfile.d

https://gerrit.wikimedia.org/r/714867

Change 715286 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Add mcrouter sidecar for memcached access

https://gerrit.wikimedia.org/r/715286

Change 715604 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Set pod requests/limits

https://gerrit.wikimedia.org/r/715604

Change 715604 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Set pod requests/limits

https://gerrit.wikimedia.org/r/715604

Change 715286 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Add mcrouter sidecar for memcached access

https://gerrit.wikimedia.org/r/715286

Change 714867 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Add helmfile.d

https://gerrit.wikimedia.org/r/714867

Change 716521 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Add helmfile.d (second attempt)

https://gerrit.wikimedia.org/r/716521

Change 716521 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Add helmfile.d (second attempt)

https://gerrit.wikimedia.org/r/716521

Change 711702 merged by Giuseppe Lavagetto:

[operations/puppet@production] Add toolhub to LVS

https://gerrit.wikimedia.org/r/711702

Change 711703 merged by Giuseppe Lavagetto:

[operations/puppet@production] service: Switch toolhub to lvs_setup

https://gerrit.wikimedia.org/r/711703

Change 711704 merged by Giuseppe Lavagetto:

[operations/puppet@production] service: Switch toolhub to monitoring_setup

https://gerrit.wikimedia.org/r/711704

Change 711705 merged by Giuseppe Lavagetto:

[operations/puppet@production] service: Switch toolhub to production

https://gerrit.wikimedia.org/r/711705

Change 711727 merged by Legoktm:

[operations/dns@master] Add toolhub to discovery

https://gerrit.wikimedia.org/r/711727

Mentioned in SAL (#wikimedia-operations) [2021-09-28T19:04:54Z] <legoktm> adding toolhub to discovery DNS (T280881)

Change 711637 merged by Legoktm:

[operations/dns@master] Add Toolhub public DNS name

https://gerrit.wikimedia.org/r/711637

Change 711648 merged by Legoktm:

[operations/puppet@production] Add toolhub to cache backends

https://gerrit.wikimedia.org/r/711648

This has been deployed for some time, so I moved it to the Done column, but I see two remaining unchecked items in the Checklist section of the task:

  • Monitoring dashboard
  • Integration and Acceptance tests

@bd808, any news on those?

@bd808, any news on those?

It was not clear to me that these were tasks that the service requestor was responsible for attending to, so I guess that is the news here. Is there documentation somewhere about what the service ops team expects for these tasks?

@bd808, any news on those?

It was not clear to me that these were tasks that the service requestor was responsible for attending to, so I guess that is the news here. Is there documentation somewhere about what the service ops team expects for these tasks?

Yes there is (now), at least for the dashboard part. I just added https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org#Pipeline and edited a few more pages to make this part more prominent and easier to find.

I've also gone ahead and done the process once more to make sure it's fine; the toolhub dashboard is at https://grafana.wikimedia.org/d/wJHvm8Ank/toolhub?orgId=1&refresh=1m

The structure we'd like to have is already in place; the data, not so much. Saturation-wise, both at the total and the per-container level, the service is good to go, but we'll need to expose metrics for traffic, errors, and latencies as well. If there are any service-specific metrics that the service already emits, we can expose those too.

I've left the total saturation row expanded for now, mostly because the other rows don't have anything yet; they are geared towards showing service-runner services.

As far as the Integration and Acceptance tests go, that's actually up to the service owner/requestor. The pipeline does have the ability to run both unit tests and a form of integration tests. So far this has not been widely adopted, but the way it was envisioned is to execute a helm test step after the test deploy the pipeline performs in the CI environment.
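For context, a helm test is just a pod template carrying the test hook annotation that `helm test <release>` runs against a deployed release. A sketch of what such a template might look like in the chart (file path, image, and endpoint are assumptions):

```yaml
# e.g. charts/toolhub/templates/tests/smoke.yaml -- illustrative only.
apiVersion: v1
kind: Pod
metadata:
  name: "{{ .Release.Name }}-smoke-test"
  annotations:
    "helm.sh/hook": test
spec:
  restartPolicy: Never
  containers:
    - name: healthz
      image: docker-registry.wikimedia.org/bullseye:latest
      command: ["curl", "--fail", "http://{{ .Release.Name }}:8080/healthz"]
```

The pipeline would then invoke `helm test` after its test deploy; a non-zero exit from the pod fails the run.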

bd808 assigned this task to Legoktm.
bd808 updated Other Assignee, added: bd808.