
New Service Request Toolhub
Closed, Resolved, Public

Description

Description: https://meta.wikimedia.org/wiki/Toolhub
Timeline: 2021-08-12 or before (depends on completion of T273020: Security Readiness Review For Toolhub)
Diagram: <Link to an architectural diagram>
Technologies:

  • Python >=3.7
  • Numerous 3rd party Python libraries (Django, etc)
  • Numerous 3rd party JavaScript libraries (vue.js, etc)
  • Task runner (Celery or similar)

WMF services this new service talks to:

Which services will connect to this service and how:

  • Direct usage by end-users on the internet (https)

Will this service use our event platform? No
Does this service talk to an external service? Yes. Part of the application is an HTTP(S) web crawler which will need to periodically fetch URLs registered by end users to populate/update information in the MySQL database.
Point person: @bd808

See also:

Checklist

Event Timeline

Restricted Application added a subscriber: Aklapper. Apr 21 2021, 10:31 PM
akosiaris triaged this task as Medium priority. Apr 23 2021, 12:19 PM
akosiaris moved this task from Inbox to Externally blocked on the Service-deployment-requests board.
Legoktm updated the task description.

I added a checklist. Assuming that containers already exist for everything, I think the next step would be to start creating a helm chart and test it locally with minikube. I don't think we have a scaffold for Python yet, but maybe you can copy linkrecommendation's chart, which I believe is the only other Python k8s service.
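For what it's worth, the local test loop usually looks something like the following (chart path and release name here are illustrative, not the actual toolhub chart layout):

```shell
# Assumes a local minikube cluster and a chart under charts/toolhub
# in operations/deployment-charts; names are placeholders.
minikube start

# Render templates locally to catch YAML and templating errors early
helm lint charts/toolhub
helm template toolhub charts/toolhub

# Install into minikube and watch the pods come up
helm install toolhub charts/toolhub
kubectl get pods -w
```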

Other questions:

  • What metrics will be exposed by Toolhub? Will it need additional prometheus exporters as sidecars? And do we already have images for all of those?
  • What format are logs in?
  • Connecting to MariaDB/elasticsearch should be fine, it just needs to be set in the egress.
  • For memcached and redis, do those need to use the shared cluster? Could they instead use containers in the pod? I'm not actually sure what we as serviceops prefer; T277711 is where we're at for MW. It's something I can bring up in next week's team meeting.
  • The HTTP crawler sounds like it might need to use the webproxy.
  • Will this work active/active? or active/passive?
  • What metrics will be exposed by Toolhub? Will it need additional prometheus exporters as sidecars? And do we already have images for all of those?

T276370: Add django-prometheus monitoring support adds a /metrics endpoint to the app itself to expose some metrics. I think that can be scraped directly. I'd love to hear about other things that I should expose somehow.
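For reference, the wiring that django-prometheus documents is roughly the following (these are generic settings/urls fragments from its documentation, not Toolhub's actual code):

```python
# settings.py fragment: register the app and wrap the middleware stack.
# PrometheusBeforeMiddleware must run first and PrometheusAfterMiddleware
# last so request latency is measured around everything in between.
INSTALLED_APPS = [
    "django_prometheus",
    # ... the project's other apps ...
]

MIDDLEWARE = [
    "django_prometheus.middleware.PrometheusBeforeMiddleware",
    # ... the project's other middleware ...
    "django_prometheus.middleware.PrometheusAfterMiddleware",
]

# urls.py fragment: this include() exposes the /metrics endpoint
# that Prometheus can scrape directly.
#
# from django.urls import include, path
# urlpatterns = [
#     path("", include("django_prometheus.urls")),
# ]
```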

  • What format are logs in?

Whatever format we configure for the Python logger. T276374: Figure out what production logging config needs to look like got the mostly unhelpful "make it work with ECS" response. I have not been able to find anything that looks like a working ECS config for a Django app. That logging "standard" is more of a hope that folks will do all the work to make everything conform to what is easy for Elasticsearch to handle, and I have honestly not been able to find projects using it natively. Most things I have found really use some ingester to reformat the data on its way into ES.
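A minimal app-side approach, sketched below, is a JSON formatter that emits a few ECS-style field names directly from the Python logging stack (this is an illustration of the idea, not a complete or validated ECS implementation; Elastic's `ecs-logging` package offers a `StdlibFormatter` doing something similar):

```python
import json
import logging
import time


class EcsFormatter(logging.Formatter):
    """Minimal ECS-style JSON formatter (illustrative; not the full ECS spec).

    Emits the handful of fields Elasticsearch-side tooling usually keys on:
    @timestamp, log.level, log.logger, and message.
    """

    def format(self, record):
        doc = {
            "@timestamp": time.strftime(
                "%Y-%m-%dT%H:%M:%S", time.gmtime(record.created)
            ),
            "log.level": record.levelname.lower(),
            "log.logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            doc["error.stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(doc)


# Wired into Django's LOGGING dict via the "()" factory key:
LOGGING = {
    "version": 1,
    "formatters": {"ecs": {"()": EcsFormatter}},
    "handlers": {
        "console": {"class": "logging.StreamHandler", "formatter": "ecs"},
    },
    "root": {"handlers": ["console"], "level": "INFO"},
}
```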

  • Connecting to MariaDB/elasticsearch should be fine, it just needs to be set in the egress.
  • For memcached and redis, do those need to use the shared cluster? Could they instead use containers in the pod? I'm not actually sure what we as serviceops prefer; T277711 is where we're at for MW. It's something I can bring up in next week's team meeting.

No, they do not need to be in the shared cluster. At the moment only memcached is needed (redis work queue is not coded yet, and will not make the 1.0 release). Memcached will be used as a pass-through cache in front of the db. Redis will eventually be used for a Celery work queue. Both should be "tiny" volume compared to what folks are used to from MediaWiki prod needs.
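The pass-through (cache-aside) pattern described above can be sketched like this; the client interface, key scheme, and `FakeCache` stand-in are all illustrative rather than Toolhub's actual code:

```python
import json


def get_tool(client, db_lookup, name, ttl=300):
    """Pass-through (cache-aside) read: try memcached, fall back to the DB.

    `client` is any object with pymemcache-style get/set; a dict-backed
    fake works for local testing. Key scheme and TTL are placeholders.
    """
    key = "toolhub:tool:%s" % name
    cached = client.get(key)
    if cached is not None:
        return json.loads(cached)
    record = db_lookup(name)  # authoritative read from the database
    client.set(key, json.dumps(record), ttl)
    return record


class FakeCache:
    """In-memory stand-in for a memcached client (ignores TTLs)."""

    def __init__(self):
        self.store = {}

    def get(self, key):
        return self.store.get(key)

    def set(self, key, value, ttl):
        self.store[key] = value
```

With this shape the cache is purely an optimization: losing a pod-local memcached container only costs extra database reads, which is part of why the shared cluster is not required.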

  • The HTTP crawler sounds like it might need to use the webproxy.

Yes, this seems very likely; ideally I can find a testing environment to work out how to do it.
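On the crawler side, routing through a webproxy is mostly a matter of plumbing a proxy URL into the HTTP client. A stdlib sketch (the proxy URL below is a placeholder; in production the real value would come from puppet/helm config, often via the standard `HTTPS_PROXY` environment variable):

```python
import os
import urllib.request


def build_crawler_opener(proxy_url=None):
    """Return a urllib opener that routes crawler traffic via a webproxy.

    Falls back to the HTTPS_PROXY environment variable when no explicit
    proxy URL is given; with neither set, a direct-connection opener is
    returned.
    """
    proxy_url = proxy_url or os.environ.get("HTTPS_PROXY")
    handlers = []
    if proxy_url:
        handlers.append(
            urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
        )
    return urllib.request.build_opener(*handlers)
```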

  • Will this work active/active? or active/passive?

I have been assuming active/passive. See T271480: Setup production database for Toolhub for some discussion with the DBA folks on the same topic. As I understand it, the misc db clusters are not currently designed for any multi-master replication which would be needed for active/active.

I have edited the description to remove celery and redis from the initial deployment requirements. There would only be one celery job to run with the current functionality, and it feels like it will be easier to implement it as a Kubernetes CronJob object (T276405: Setup periodic crawling).
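The CronJob approach mentioned above would look roughly like this (schedule, image, and command are assumptions for illustration; the real manifest would live in the toolhub helm chart):

```yaml
# Illustrative sketch only; not the deployed manifest.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: toolhub-crawler
spec:
  schedule: "0 * * * *"      # hourly, as an example
  concurrencyPolicy: Forbid  # never let crawl runs overlap
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: crawler
              image: docker-registry.wikimedia.org/toolhub:latest
              command: ["python3", "manage.py", "crawl"]
```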

Change 711624 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/puppet@production] Add tokens and users for toolhub service

https://gerrit.wikimedia.org/r/711624

Change 711625 had a related patch set uploaded (by Legoktm; author: Legoktm):

[labs/private@master] Add k8s users, tokens for toolhub service

https://gerrit.wikimedia.org/r/711625

Change 711625 merged by Legoktm:

[labs/private@master] Add k8s users, tokens for toolhub service

https://gerrit.wikimedia.org/r/711625

Change 711624 merged by Legoktm:

[operations/puppet@production] Add tokens and users for toolhub service

https://gerrit.wikimedia.org/r/711624

Change 711637 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/dns@master] Add Toolhub public DNS name

https://gerrit.wikimedia.org/r/711637

Change 711639 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/deployment-charts@master] admin_ng: Add toolhub namespace

https://gerrit.wikimedia.org/r/711639

Change 711648 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] Add toolhub to cache backends

https://gerrit.wikimedia.org/r/711648

Change 711639 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: Add toolhub namespace

https://gerrit.wikimedia.org/r/711639

We're still missing the OAuth2 key/secret, but otherwise I think it should be possible to deploy to the staging/eqiad/codfw clusters now once the helmfile.d part is written.

And then we can set up the LVS stuff (needs pairing with Traffic).

Change 711702 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/puppet@production] Add toolhub to LVS

https://gerrit.wikimedia.org/r/711702

Change 711703 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/puppet@production] service: Switch toolhub to lvs_setup

https://gerrit.wikimedia.org/r/711703

Change 711704 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/puppet@production] service: Switch toolhub to monitoring_setup

https://gerrit.wikimedia.org/r/711704

Change 711705 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/puppet@production] service: Switch toolhub to production

https://gerrit.wikimedia.org/r/711705

Change 711726 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/dns@master] Add toolhub.svc.{codfw,eqiad}.wmnet

https://gerrit.wikimedia.org/r/711726

Change 711727 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/dns@master] Add toolhub to discovery

https://gerrit.wikimedia.org/r/711727

Change 711726 merged by Legoktm:

[operations/dns@master] Add toolhub.svc.{codfw,eqiad}.wmnet

https://gerrit.wikimedia.org/r/711726

Change 714867 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Add helmfile.d

https://gerrit.wikimedia.org/r/714867

Change 715286 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Add mcrouter sidecar for memcached access

https://gerrit.wikimedia.org/r/715286

Change 715604 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Set pod requests/limits

https://gerrit.wikimedia.org/r/715604

Change 715604 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Set pod requests/limits

https://gerrit.wikimedia.org/r/715604

Change 715286 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Add mcrouter sidecar for memcached access

https://gerrit.wikimedia.org/r/715286

Change 714867 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Add helmfile.d

https://gerrit.wikimedia.org/r/714867

Change 716521 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/deployment-charts@master] toolhub: Add helmfile.d (second attempt)

https://gerrit.wikimedia.org/r/716521

Change 716521 merged by jenkins-bot:

[operations/deployment-charts@master] toolhub: Add helmfile.d (second attempt)

https://gerrit.wikimedia.org/r/716521

Change 711702 merged by Giuseppe Lavagetto:

[operations/puppet@production] Add toolhub to LVS

https://gerrit.wikimedia.org/r/711702

Change 711703 merged by Giuseppe Lavagetto:

[operations/puppet@production] service: Switch toolhub to lvs_setup

https://gerrit.wikimedia.org/r/711703

Change 711704 merged by Giuseppe Lavagetto:

[operations/puppet@production] service: Switch toolhub to monitoring_setup

https://gerrit.wikimedia.org/r/711704

Change 711705 merged by Giuseppe Lavagetto:

[operations/puppet@production] service: Switch toolhub to production

https://gerrit.wikimedia.org/r/711705

Change 711727 merged by Legoktm:

[operations/dns@master] Add toolhub to discovery

https://gerrit.wikimedia.org/r/711727

Mentioned in SAL (#wikimedia-operations) [2021-09-28T19:04:54Z] <legoktm> adding toolhub to discovery DNS (T280881)

Change 711637 merged by Legoktm:

[operations/dns@master] Add Toolhub public DNS name

https://gerrit.wikimedia.org/r/711637

Change 711648 merged by Legoktm:

[operations/puppet@production] Add toolhub to cache backends

https://gerrit.wikimedia.org/r/711648

This has been deployed for some time, so I moved it to the Done column, but I see two remaining unchecked items in the Checklist section of the task:

  • Monitoring dashboard
  • Integration and Acceptance tests

@bd808, any news on those?

@bd808, any news on those?

It was not clear to me that these were tasks that the service requestor was responsible for attending to, so I guess that is the news here. Is there documentation somewhere about what the service ops team expects for these tasks?

@bd808, any news on those?

It was not clear to me that these were tasks that the service requestor was responsible for attending to, so I guess that is the news here. Is there documentation somewhere about what the service ops team expects for these tasks?

Yes there is (now), at least for the dashboard part. I just added https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org#Pipeline and edited a few more pages to make this part more prominent and easier to find.

I've also gone ahead and done the process once more to make sure it's fine; the toolhub dashboard is at https://grafana.wikimedia.org/d/wJHvm8Ank/toolhub?orgId=1&refresh=1m

The structure we'd like to have is already in place; the data, not so much. Saturation-wise, both at the total and the per-container level, the service is good to go, but we'll need to expose metrics for traffic, errors, and latencies as well. If there are any service-specific metrics that the service already emits, we can expose those too.

I've left the total saturation row expanded for now, mostly because the other rows don't have anything yet; they are geared towards showing service-runner services.

As far as the Integration and Acceptance tests go, that's actually up to the service owner/requestor. The pipeline does have the ability to run both unit tests and a form of integration tests. So far this has not been widely adopted, but the way it was envisioned is to execute a helm test step after the test deploy the pipeline performs in the CI environment.
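For context, a helm test is just a pod template carrying the test hook annotation that `helm test <release>` runs against a deployed release. A sketch of what such a template might look like in the chart (file path, image, and endpoint are assumptions):

```yaml
# e.g. charts/toolhub/templates/tests/smoke.yaml -- illustrative only.
apiVersion: v1
kind: Pod
metadata:
  name: "{{ .Release.Name }}-smoke-test"
  annotations:
    "helm.sh/hook": test
spec:
  restartPolicy: Never
  containers:
    - name: healthz
      image: docker-registry.wikimedia.org/bullseye:latest
      command: ["curl", "--fail", "http://{{ .Release.Name }}:8080/healthz"]
```

The pipeline would then invoke `helm test` after its test deploy; a non-zero exit from the pod fails the run.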

bd808 assigned this task to Legoktm.
bd808 updated Other Assignee, added: bd808.