
Setup production database for Toolhub
Closed, ResolvedPublic

Description

Toolhub will need a MySQL/MariaDB database in production. Talk to the DBA team well in advance of the planned August 12, 2021 launch to figure out where that database will live and any infra work that may be needed to ensure direct access from the k8s cluster.

Event Timeline

bd808 added a project: DBA.
bd808 moved this task from Research needed to In Progress on the Toolhub board.

Adding the DBA tag here as I think I'm ready to talk with the DBA team about deployment needs.

The Django backend has storage needs very similar to Striker's. Specifically, it needs a MySQL/MariaDB database with support for utf8mb4 encoding and innodb_large_prefix for indexing more than 767 bytes of varchar fields. The application has been developed using the official mariadb 10.4 docker container, which did not need any settings customization. The application may need some adjustments to deploy on an older version of MariaDB (basically adding ROW_FORMAT=DYNAMIC to all CREATE TABLE statements).
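For illustration, a minimal sketch of that older-MariaDB tweak (the example_tool table here is hypothetical, not part of Toolhub's real schema):

-- Sketch only: example_tool is a made-up table, not part of Toolhub's schema.
-- Older MariaDB (pre-10.2 InnoDB defaults) needs these settings for long utf8mb4
-- indexes; MariaDB 10.4 already behaves this way out of the box.
SET GLOBAL innodb_file_format = 'Barracuda';
SET GLOBAL innodb_large_prefix = 1;

CREATE TABLE example_tool (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    UNIQUE KEY example_tool_name (name)  -- 255 chars * 4 bytes/char > 767 bytes
) ENGINE=InnoDB
  DEFAULT CHARSET=utf8mb4
  COLLATE=utf8mb4_unicode_ci
  ROW_FORMAT=DYNAMIC;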

The current storage size of tables in the development and demo environments is tiny. The production instance will have larger storage needs, especially for growth of various tables tracking historic content revisions and audit logging. I do not have any really good estimates for the rate of growth per unit time yet, but I feel pretty safe in saying that it would take an amazing amount of activity on the application for it to surpass 1GiB of storage in its first year of operation.

The application is intended to be deployed on the production Kubernetes cluster. This may or may not make it the first service there wanting direct access to a MariaDB server.

Here's some other info dumped from my development instance:

$ make db-shell
docker-compose exec db bash
root@0d20ad68bbc7:/# mysql -u root -p
Enter password:
MariaDB [(none)]> use toolhub;
MariaDB [toolhub]> show tables;
+-----------------------------------+
| Tables_in_toolhub                 |
+-----------------------------------+
| auditlog_logentry                 |
| auth_group                        |
| auth_group_permissions            |
| auth_permission                   |
| crawler_run                       |
| crawler_runurl                    |
| crawler_runurl_tools              |
| crawler_url                       |
| django_admin_log                  |
| django_content_type               |
| django_migrations                 |
| django_session                    |
| lists_toollist                    |
| lists_toollistitem                |
| oauth2_provider_accesstoken       |
| oauth2_provider_application       |
| oauth2_provider_grant             |
| oauth2_provider_refreshtoken      |
| reversion_revision                |
| reversion_version                 |
| social_auth_association           |
| social_auth_code                  |
| social_auth_nonce                 |
| social_auth_partial               |
| social_auth_usersocialauth        |
| toolinfo_tool                     |
| user_toolhubuser                  |
| user_toolhubuser_groups           |
| user_toolhubuser_user_permissions |
| versioned_revisionmetadata        |
+-----------------------------------+
30 rows in set (0.001 sec)
root@0d20ad68bbc7:~# du -sh /var/lib/mysql/toolhub
15M     /var/lib/mysql/toolhub
LSobanski subscribed.

Hi @bd808. Here's a set of our standard questions:

  • QPS:
  • Size: Estimating ~1GB in the first year of operation
  • DB Name:
  • User:
  • Accessed from server(s): k8s
  • Backup Policy: Needed? Frequency?
  • Grants needed:

And a few more from me:

  • "well in advance of the planned June 2021 launch" - Can we assume this is no longer true? If so, what's the current timeline for the launch?
  • Will you be deploying the application in one or both DCs?
  • QPS:

Honestly, no idea. This is a green field project that is separate and distinct from the wikis, which makes guessing about adoption and traffic patterns pretty difficult. "Tiny" is a non-answer, but honestly I would be thrilled to see it hit even 1 QPS sustained.

  • Size: Estimating ~1GB in the first year of operation

This is my completely wild outside guess at the moment. I honestly expect the db to be much, much smaller than this.

  • DB Name:

toolhub

  • User:
  • toolhub_admin: user with full control of the schema. Access from mwmaint hosts.
  • toolhub: user with CRUD rights on all tables in the schema. Access from the kubernetes cluster.
  • Accessed from server(s): k8s
  • Backup Policy: Needed? Frequency?

Yes, please to backups. Whatever the "normal" frequency and retention are would probably be fine (daily incremental? weekly full? 4 fulls retained?).

  • Grants needed:

See the "user" section above.

And a few more from me:

  • "well in advance of the planned June 2021 launch" - Can we assume this is no longer true? If so, what's the current timeline for the launch?

Heh. Old task descriptions are old. We are currently hoping for a deploy on or before August 12, 2021 to be in production for Wikimania. This is pending the completion of T273020: Security Readiness Review For Toolhub and other pre-deployment approvals.

  • Will you be deploying the application in one or both DCs?

This is a great question that I do not know the answer to definitively, but multi-dc support seems desirable. I would expect an initial multi-dc deployment to be active-passive with edge routing determining which DC is active.


Multi-dc as in being read from both DCs, or even written from both DCs?
This database is likely to go to a misc cluster, which isn't ready at all to be written to from both DCs. It could be read from both DCs if needed (and using SSL, of course), as we do have the CNAMEs and proxies in place.


I was mostly thinking of a classic "warm spare": the application code deployed in both DCs and the data replicated to both, but only one DC receiving traffic at any given time. Switchover would require both reversing the direction of db replication and switching the edge route for the public service name (toolhub.wikimedia.org).

Multi-master active/active would be ideal, but that is not in any way needed at this time. The main thing I'm hoping to avoid is forgetting to think about DC switchover until the next one is about to happen. :)


Yeah, that's no problem. We have infra replicating to both DCs for misc clusters now too (like we do with MW). For what it's worth, we've never switched misc services over to codfw (we've always done MW only). At some point we'll include them too, but for now that hasn't happened.


:-)

Thanks!

  • toolhub: user with CRUD rights on all tables in the schema. Access from the kubernetes cluster.

Do you have some IPs or a range for the grants?

@bd808 one more question I thought I asked before, but I didn't (sorry!), what would be the impact if the section goes read-only? (ie: maintenance)

  • toolhub: user with CRUD rights on all tables in the schema. Access from the kubernetes cluster.

Do you have some IPs or a range for the grants?

I do not know what the IPs are for the production kubernetes cluster, but I'd imagine that folks in serviceops would be able to answer.

@bd808 one more question I thought I asked before, but I didn't (sorry!), what would be the impact if the section goes read-only? (ie: maintenance)

That is a good question. Authentication should fail because sessions are stored in the database with the current config. API edits would also fail. I'm not sure if read-based anon traffic would be impacted or not. I'll try to make some time today to do some experiments in my dev environment to see what actually happens.


Tagging serviceops as I believe the above wouldn't notify them. Could you provide a range of IPs we should grant? We normally do 10.64.% and 10.192.%, so a similar sort of range for the k8s pods would be good enough.

@bd808 I think I am going to put this database on m5, given that wikitech will be out in a month (fingers crossed) and that misc section has enough space for it. Is that cool with you?
Remember that as of today m5 doesn't use the proxy, but as soon as wikitech is moved I am going to put the proxies back in between. That should be transparent to the application, as it should keep connecting to m5-master.eqiad.wmnet anyway, but just FYI.

Thanks - 10.64.% and 10.192.% should work then

Recap - @bd808 please let me know if this looks good:

cluster: m5
db name: toolhub
entry point: m5-master.eqiad.wmnet
db users:

  • toolhub_admin Grants: ALTER ROUTINE, CREATE, CREATE ROUTINE, CREATE TEMPORARY TABLES, CREATE VIEW, DELETE, DROP, EVENT, EXECUTE, INDEX, INSERT, LOCK TABLES, REFERENCES, SELECT, SHOW VIEW, TRIGGER, UPDATE
  • toolhub Grants: SELECT, INSERT, UPDATE, DELETE

IPs to grant from: 10.64.%, 10.192.%
Backups: yes, normal frequency

Please note that I have not granted ALTER, as that is potentially something that can generate issues on big tables. Let me know if you think you really need it - I am fine granting it, but I would suggest talking to us before running it.
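For reference, the recap above translates roughly into grants along these lines (sketch only: the real statements live in puppet's production-m5.sql.erb, and the host patterns and passwords here are placeholders):

-- Sketch only; actual grants are managed in puppet (production-m5.sql.erb).
-- Admin user: full schema control, no ALTER for now (host pattern illustrative).
GRANT ALTER ROUTINE, CREATE, CREATE ROUTINE, CREATE TEMPORARY TABLES, CREATE VIEW,
      DELETE, DROP, EVENT, EXECUTE, INDEX, INSERT, LOCK TABLES, REFERENCES,
      SELECT, SHOW VIEW, TRIGGER, UPDATE
  ON toolhub.* TO 'toolhub_admin'@'10.64.%' IDENTIFIED BY '<placeholder>';

-- Application user: CRUD only, from the agreed IP ranges.
GRANT SELECT, INSERT, UPDATE, DELETE
  ON toolhub.* TO 'toolhub'@'10.64.%' IDENTIFIED BY '<placeholder>';
GRANT SELECT, INSERT, UPDATE, DELETE
  ON toolhub.* TO 'toolhub'@'10.192.%' IDENTIFIED BY '<placeholder>';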


Placement on m5, the user split, and the allowed IP range looks good to me.


ALTER will be needed in the future to update the schema for sure. Eventually I do hope this app has enough data in it that we have to think about the impact of ALTERs on performance, but in year 0 I will be pretty surprised if we grow to have any table with more than 200K-300K rows. We can figure out how to have all ALTER-containing migrations applied by a DBA, but it feels like a premature optimization at this point.

/me realizes that he has not made a plan yet for applying schema changes without using Django's manage.py migrate and will need to think about that soon.

Change 709877 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] production-m5.sql.erb: Add toolhub grants

https://gerrit.wikimedia.org/r/709877

@bd808 please review this patch when you get a chance: https://gerrit.wikimedia.org/r/709877

For now I have granted the pod IPs directly for the CRUD user and mwmaint (eqiad and codfw) for the admin one. Keep in mind that right now we don't use the proxies, but I want to revert that once wikitech is moved, so I have also included the grants for the proxies. As long as you set up the application to connect to m5-master.eqiad.wmnet, the operation of enabling the proxies should be transparent to you (famous last words!).

bd808 renamed this task from Discuss database needs with the DBA team to Setup production database for Toolhub.Aug 4 2021, 7:29 PM
bd808 reassigned this task from bd808 to Marostegui.
bd808 removed a project: User-bd808.
bd808 moved this task from In Progress to Review on the Toolhub board.

@bd808 one more question I thought I asked before, but I didn't (sorry!), what would be the impact if the section goes read-only? (ie: maintenance)


  • logins fail (can't write new session to db)
  • logouts fail (can't prune session from db)
  • most PUT/POST/PATCH/DELETE actions in API fail (can't write to db)

Read actions seem to be fine. So as expected there is service degradation when the database is in read_only mode, but the site does continue to provide some value.
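For reference, the experiments above can be reproduced against the docker-compose dev database (a sketch, assuming a root session like the earlier db-shell transcript):

-- Flip the dev database into read-only mode; writes from the application user
-- (which has no SUPER privilege) will then fail, while reads keep working.
SET GLOBAL read_only = 1;
-- ... exercise logins, logouts, and API writes from the app ...
SET GLOBAL read_only = 0;  -- back to normal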

Thanks for the investigation!
The database is now created; as soon as we sort out the GRANTs code review, we should be ready to go.

Change 709877 merged by Marostegui:

[operations/puppet@production] production-m5.sql.erb: Add toolhub grants

https://gerrit.wikimedia.org/r/709877

Change 710416 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] production-m5.sql: Add dbproxy20040's IP

https://gerrit.wikimedia.org/r/710416

Change 710416 merged by Marostegui:

[operations/puppet@production] production-m5.sql: Add dbproxy20040's IP

https://gerrit.wikimedia.org/r/710416

Change 710419 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] dumps-eqiad-m5.sql: Grants to backup toolhub database.

https://gerrit.wikimedia.org/r/710419

@bd808 the database is in place and so are the grants.
Please note that I have not been able to test the k8s grants, as I don't know which hosts you'll be connecting from, but I have tested the mwmaint admin user:

root@mwmaint1002:~# mysql --skip-ssl -hm5-master.eqiad.wmnet -u toolhub_admin -p toolhub
Enter password:
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 55499035
Server version: 10.4.18-MariaDB-log MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

toolhub_admin@m5-master.eqiad.wmnet(toolhub)>

Please note that you should avoid doing cross-DC queries (codfw mwmaint -> m5-master.eqiad.wmnet), especially if they are not using SSL. Ideally you would use mwmaint2002 -> codfw master, but the codfw master is in read-only mode.
Remember to always use "m5-master.eqiad.wmnet" as the entry point rather than db1128.eqiad.wmnet (the current m5 master).

@bd808 do not forget to puppetize the passwords for toolhub and toolhub_admin, I have left them at:

root@mwmaint1002:/home/bd808# ls -lh /home/bd808/toolhub
-rw-rw-r-- 1 bd808 wikidev 58 Aug  6 05:04 /home/bd808/toolhub

Pending: merge and add backup user grants after getting Jaime's review for: https://gerrit.wikimedia.org/r/c/operations/puppet/+/710419/

Change 710419 merged by Jcrespo:

[operations/puppet@production] dumps-*-m5.sql: Grants to backup toolhub database.

https://gerrit.wikimedia.org/r/710419

jcrespo subscribed.

@bd808 I've deployed the changes to start backing up toolhub as part of the m5 backups. Please ping me the Tuesday after you have set up your database so I can check that backups are being generated correctly.

@bd808 feel free to close this task whenever you think it is all good from your side with the initial setup

@bd808 is the original launch date of the 12th still on? Should we expect traffic on this database from tomorrow?


I do not expect a production deployment on the 12th (tomorrow as I write this). There is a small amount of work pending in response to the security review recommendations, which came in on the 9th (Monday). The helm chart and all prep in the Kubernetes cluster are still pending as well. With Wikimania starting on the 13th, I do not expect all of these remaining issues to be addressed and a release to happen before the week of August 23rd at the earliest.

@bd808 feel free to close this task whenever you think it is all good from your side with the initial setup

I have been planning to leave this open until the app is deployed into the Kubernetes staging cluster, which will be our first opportunity to end-to-end test the application's database credentials and IP origin restrictions. I do not currently have an estimated date for that work, but I will be working on establishing it.

@bd808 were you able to find any timeline for this? thanks!


I have a deployment into the staging cluster, but have not yet tried to use that to build out the database schema. I'm hoping for a fix for T290357: Maintenance environment needed for running one-off commands, but if that doesn't move soon I will work around the lack of an ideal maint solution with other magic.

Thanks for the update, if you need something from us let me know!


Everything worked as expected in the Kubernetes "staging" cluster. When trying my first deploy into the "eqiad" cluster, the Toolhub container in the pod is crashing with django.db.utils.OperationalError: (1045, "Access denied for user 'toolhub'@'10.64.66.115' (using password: YES)"). I have verified that the same password is in the configuration as worked from the staging cluster, so this is about the host restrictions. That IP is not in the grant patch from https://gerrit.wikimedia.org/r/c/operations/puppet/+/709877/3/modules/role/templates/mariadb/grants/production-m5.sql.erb.

Per https://netbox.wikimedia.org/search/?q=kubernetes+pod&obj_type= the eqiad pod range is 10.64.64.0/21, which would require grants from 10.64.64.% through 10.64.71.% if I got the math right.
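A sketch of what covering that /21 with per-octet host wildcards would look like, plus a quick query to confirm which host patterns a user currently has (illustrative only; the actual change is the gerrit patch below, and passwords are omitted here):

-- Sketch only: MariaDB host wildcards are per-octet, so 10.64.64.0/21 expands
-- to eight patterns, 10.64.64.% through 10.64.71.%.
GRANT SELECT, INSERT, UPDATE, DELETE ON toolhub.* TO 'toolhub'@'10.64.64.%';
GRANT SELECT, INSERT, UPDATE, DELETE ON toolhub.* TO 'toolhub'@'10.64.65.%';
-- ... 10.64.66.% through 10.64.70.% ...
GRANT SELECT, INSERT, UPDATE, DELETE ON toolhub.* TO 'toolhub'@'10.64.71.%';

-- Check which host patterns the application user is currently granted from:
SELECT user, host FROM mysql.user WHERE user = 'toolhub';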

Change 723329 had a related patch set uploaded (by BryanDavis; author: Bryan Davis):

[operations/puppet@production] production-m5.sql.erb: Update toolhub grants

https://gerrit.wikimedia.org/r/723329

Change 723329 merged by Marostegui:

[operations/puppet@production] production-m5.sql.erb: Update toolhub grants

https://gerrit.wikimedia.org/r/723329


Deployed the patch and applied the grants manually (and cleaned up the one that isn't needed).
Let me know if this fixes the issue.


The deployment worked this time, so I think we are good. I will file a new bug if we end up finding other issues. Thanks for your help on all this @Marostegui.

\o/ glad to hear! You are welcome!