
New database request: sockpuppet
Closed, Resolved · Public

Description

Hello! I would like to start the discussion around creating a new database for a service.

The service is the sockpuppet detection service (also known as the similar-users service). The Platform Engineering team is currently assisting with its migration to production. The service will be used infrequently to query information about similar users/activity on-wiki. Currently its data model is stored in ~1.5GB of CSV-like files, which makes shipping the service a problem; SQL storage is ideal for this kind of data. It is expected that this service will be lightly used (and only by people with CheckUser access).

One complicating factor in this process that we need to iron out, and would appreciate guidance on, is that the data used by the service will be refreshed on a monthly basis. A PySpark model currently generates the CSV files, and the application needs to be restarted to reload them. Ideally the process that creates these files would simply update the database in-place. Is this a feasible model? Currently all data is recreated rather than updated, by virtue of the use of files and the fact that the data is based on the previous month's activity. Happy to open another ticket for this if the discussion needs more space.

QPS: Not clear at present, but likely no more than a few hundred per hour.
Size: 2GB 
DB Name: sockpuppet
User: sockpuppet_import and sockpuppet_service 
Accessed from server(s): kubernetes*.<eqiad|codfw>.wmnet, kubestage*.<eqiad|codfw>.wmnet, and analytics VLAN hosts to load data (or an intermediary host if the envisioned process calls for that)
Backup Policy: Probably not needed - data for the database is generated from PySpark models and can be regenerated 
Grants needed: sockpuppet_import (INSERT, CREATE and TRUNCATE (DROP)) sockpuppet_service (SELECT)

Event Timeline

hnowlan created this task. Nov 23 2020, 4:53 PM
hnowlan added a subscriber: Isaac. Nov 23 2020, 4:56 PM
akosiaris updated the task description. (Show Details) Nov 23 2020, 5:31 PM
WDoranWMF updated the task description. (Show Details) Nov 23 2020, 5:41 PM

Apologies if this has already been discussed, but from the initial description there are three things that are not entirely clear. I have added the DBAs, as they will be the best people to help you with pure db questions.

  • Could you describe in more detail the kind of load this will receive? While it seems to be light on reads (few queries) from the description, it looks like there will be intense, even if infrequent, load processes (e.g. 1.5GB file imports in a single batch). More details on this would be important for replication or import methodology and monitoring (e.g. does it need its own separate instance to avoid overloading writes for other dbs?). How is data filled in? Will there be times when data is not "queryable"? Will you need two rotating dbs? etc.
  • Which realm does this belong to? It is unclear whether this is a sort of direct MediaWiki extension db (e.g. like reading lists or translations) or a miscellaneous db (e.g. something that, even if MW-related, is not directly used by the MW application and is more like an internal service, such as misc's testreduce). This, together with SLO and privacy (will it contain private user data?), will inform which cluster this is for ("core" vs "misc").
  • Re: "Backup Policy: Probably not needed - data for the database is generated from PySpark models and can be regenerated". Could you provide an approximate time for regeneration (minutes, hours, days?). If the size is small enough (and it looks like it is) and regeneration takes a lot of time, it is just easier to back up everything. Alternatively, we can also back up the generated CSVs.
  • Could you describe in more detail the kind of load this will receive? While it seems to be light on reads (few queries) from the description, it looks like there will be intense, even if infrequent, load processes (e.g. 1.5GB file imports in a single batch). More details on this would be important for replication or import methodology and monitoring (e.g. does it need its own separate instance to avoid overloading writes for other dbs?). How is data filled in? Will there be times when data is not "queryable"? Will you need two rotating dbs? etc.

The load process happens once a month - support for this service is currently business hours only, so having a throttled import is an option if write volume is a concern. Given this level of support, I could envisage taking the service down entirely or putting it into maintenance mode while the import is done, so having rotating DBs is not necessary.

  • Which realm does this belong to? It is unclear whether this is a sort of direct MediaWiki extension db (e.g. like reading lists or translations) or a miscellaneous db (e.g. something that, even if MW-related, is not directly used by the MW application and is more like an internal service, such as misc's testreduce). This, together with SLO and privacy (will it contain private user data?), will inform which cluster this is for ("core" vs "misc").

The DB exists somewhat in between worlds, I suppose: this database will be used by a service running in Kubernetes that offers an HTTP endpoint that MediaWiki will query. So the data is ultimately only used by MediaWiki, but the sockpuppet service itself shares no code with MediaWiki and doesn't run on appservers. I'm not sure if this is a sufficient answer in this sense.

  • Re: "Backup Policy: Probably not needed - data for the database is generated from PySpark models and can be regenerated". Could you provide an approximate time for regeneration (minutes, hours, days?). If the size is small enough (and it looks like it is) and regeneration takes a lot of time, it is just easier to back up everything. Alternatively, we can also back up the generated CSVs.

Regeneration currently takes two hours or so. If the size is not an issue, having backups would be great. Retention does not need to be extensive, as this data becomes stale on a monthly basis.

WDoranWMF updated the task description. (Show Details) Nov 30 2020, 10:59 PM
Marostegui triaged this task as Medium priority. Dec 1 2020, 11:31 AM
Marostegui moved this task from Triage to Refine on the DBA board.

@hnowlan let's throttle the writes if we can and it is not too much of a hassle. This database will live with many more, so let's make sure we don't overload the host.

Regarding the grants do you want to go for the approach of two users? One for the application (from what you said SELECT is enough) and another one for the writes/imports?
How will you load the data? INSERTs? Do you need to truncate or create tables as well?

What would be the impact if the server where this DB lives goes down?

Also, if possible I would prefer that we go for a more generic username instead of hnowlan, just in case, as this can lead to confusion if, for instance, you stop being the owner of this service :)

hnowlan updated the task description. (Show Details) Dec 1 2020, 4:44 PM

@hnowlan let's throttle the writes if we can and it is not too much of a hassle. This database will live with many more, so let's make sure we don't overload the host.

Throttled writes are just fine with me.

Regarding the grants do you want to go for the approach of two users? One for the application (from what you said SELECT is enough) and another one for the writes/imports?
How will you load the data? INSERTs? Do you need to truncate or create tables as well?

A two-user approach makes sense to me (sockpuppet-import and sockpuppet-service, maybe? I'm not fussy). For the data load process, I believe our approach will be to TRUNCATE and INSERT. In our current state the service itself will only do SELECTs.
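That TRUNCATE-and-INSERT refresh could be sketched roughly as follows. This is only an illustration: a DB-API driver such as pymysql is assumed, and the `similar_users` table name and `(user, score)` columns are hypothetical placeholders (the real schema isn't specified in this task).

```python
def refresh(cursor, rows):
    """Monthly refresh: discard last month's data, then reload from scratch.

    `cursor` is any DB-API cursor connected as the import user; the table
    name and (user, score) columns are hypothetical placeholders.
    """
    # TRUNCATE is why the import user needs the DROP grant on the database
    cursor.execute("TRUNCATE TABLE similar_users")
    # Reload the full month of model output in one pass
    cursor.executemany(
        "INSERT INTO similar_users (user, score) VALUES (%s, %s)", rows)
```

Since all data is regenerated monthly, a full wipe-and-reload keeps the import logic trivial at the cost of a window where the table is partially populated, which is acceptable here given the planned maintenance-mode imports.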

  • What would be the impact if the server where this DB lives goes down?

The app will return a 503, and it is expected that MediaWiki will respond appropriately, telling the user that lookups are temporarily unavailable. The service will be called by a sufficiently small group of users, and infrequently enough, that a temporary unavailability will not be a huge issue.

I misunderstood the username question, apologies! sockpuppet for the username would be great; updated in the ticket.

@hnowlan let's throttle the writes if we can and it is not too much of a hassle. This database will live with many more, so let's make sure we don't overload the host.

Throttled writes are just fine with me.

Excellent

Regarding the grants do you want to go for the approach of two users? One for the application (from what you said SELECT is enough) and another one for the writes/imports?
How will you load the data? INSERTs? Do you need to truncate or create tables as well?

A two-user approach makes sense to me (sockpuppet-import and sockpuppet-service, maybe? I'm not fussy). For the data load process, I believe our approach will be to TRUNCATE and INSERT. In our current state the service itself will only do SELECTs.

Let's go for sockpuppet_import and sockpuppet_service, as "-" isn't supported in usernames. I will update the ticket.

  • What would be the impact if the server where this DB lives goes down?

The app will return a 503, and it is expected that MediaWiki will respond appropriately, telling the user that lookups are temporarily unavailable. The service will be called by a sufficiently small group of users, and infrequently enough, that a temporary unavailability will not be a huge issue.

Thanks for the explanation.

I misunderstood the username question, apologies! sockpuppet for the username would be great; updated in the ticket.

Thanks!

Marostegui updated the task description. (Show Details) Dec 2 2020, 8:08 AM

@hnowlan how will you create the tables initially? Do you need CREATE?

Change 644745 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] production-m2.sql.erb: Add sockpuppet users

https://gerrit.wikimedia.org/r/644745

@hnowlan how will you create the tables initially? Do you need CREATE?

Yes please! Just for sockpuppet_import.

Marostegui updated the task description. (Show Details) Dec 2 2020, 1:38 PM

@hnowlan one more clarifying question.
On the initial task you wrote "Probably not needed - data for the database is generated from PySpark models and can be regenerated", but at T268505#6641811 you mentioned that having backups would be nice. Which one do you prefer in the end?
Thanks!

@hnowlan one more clarifying question.
On the initial task you wrote "Probably not needed - data for the database is generated from PySpark models and can be regenerated", but at T268505#6641811 you mentioned that having backups would be nice. Which one do you prefer in the end?
Thanks!

If there's no significant impact, backups would be preferred. If retention can be configured, it only needs to be 32 days; if not, whatever our default is is fine. Thank you!

Marostegui moved this task from Refine to In progress on the DBA board.

Change 644745 merged by Marostegui:
[operations/puppet@production] production-m2.sql.erb: Add sockpuppet users

https://gerrit.wikimedia.org/r/644745

Mentioned in SAL (#wikimedia-operations) [2020-12-03T06:06:47Z] <marostegui> Create sockpuppet database on m2 T268505

@hnowlan the database and users are created.
To connect, use m2-master.eqiad.wmnet.
The users are:

+-----------------------------------------------------------------------------------------------------------------------------+
| Grants for sockpuppet_service@10.192.16.9                                                                                   |
+-----------------------------------------------------------------------------------------------------------------------------+
| GRANT USAGE ON *.* TO `sockpuppet_service`@`10.192.16.9` IDENTIFIED BY PASSWORD '*x' |
| GRANT SELECT ON `sockpuppet`.* TO `sockpuppet_service`@`10.192.16.9`                                                        |
+-----------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.001 sec)

+----------------------------------------------------------------------------------------------------------------------------+
| Grants for sockpuppet_import@10.192.16.9                                                                                   |
+----------------------------------------------------------------------------------------------------------------------------+
| GRANT USAGE ON *.* TO `sockpuppet_import`@`10.192.16.9` IDENTIFIED BY PASSWORD '*x' |
| GRANT INSERT, CREATE, DROP ON `sockpuppet`.* TO `sockpuppet_import`@`10.192.16.9`                                          |
+----------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.001 sec)

+-----------------------------------------------------------------------------------------------------------------------------+
| Grants for sockpuppet_service@10.64.0.135                                                                                   |
+-----------------------------------------------------------------------------------------------------------------------------+
| GRANT USAGE ON *.* TO `sockpuppet_service`@`10.64.0.135` IDENTIFIED BY PASSWORD '*x' |
| GRANT SELECT ON `sockpuppet`.* TO `sockpuppet_service`@`10.64.0.135`                                                        |
+-----------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.000 sec)

+----------------------------------------------------------------------------------------------------------------------------+
| Grants for sockpuppet_import@10.64.0.135                                                                                   |
+----------------------------------------------------------------------------------------------------------------------------+
| GRANT USAGE ON *.* TO `sockpuppet_import`@`10.64.0.135` IDENTIFIED BY PASSWORD '*x' |
| GRANT INSERT, CREATE, DROP ON `sockpuppet`.* TO `sockpuppet_import`@`10.64.0.135`                                          |
+----------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.000 sec)

+-----------------------------------------------------------------------------------------------------------------------------+
| Grants for sockpuppet_service@10.64.16.19                                                                                   |
+-----------------------------------------------------------------------------------------------------------------------------+
| GRANT USAGE ON *.* TO `sockpuppet_service`@`10.64.16.19` IDENTIFIED BY PASSWORD '*x' |
| GRANT SELECT ON `sockpuppet`.* TO `sockpuppet_service`@`10.64.16.19`                                                        |
+-----------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.001 sec)

+----------------------------------------------------------------------------------------------------------------------------+
| Grants for sockpuppet_import@10.64.16.19                                                                                   |
+----------------------------------------------------------------------------------------------------------------------------+
| GRANT USAGE ON *.* TO `sockpuppet_import`@`10.64.16.19` IDENTIFIED BY PASSWORD '*x' |
| GRANT INSERT, CREATE, DROP ON `sockpuppet`.* TO `sockpuppet_import`@`10.64.16.19`                                          |
+----------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.001 sec)

I have left the passwords for them at mwmaint1002:/home/hnowlan/sockpuppet.sql. Please make sure to commit them to the private repo once you are ready.
I quickly tested the connection grants:

# mysql --ssl-verify-server-cert=false -h m2-master.eqiad.wmnet -u sockpuppet_import -p sockpuppet
Enter password:
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 33310364
Server version: 10.4.13-MariaDB-log MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

sockpuppet_import@m2-master.eqiad.wmnet[sockpuppet]> show tables;
Empty set (0.001 sec)


root@cumin1001:~# mysql --ssl-verify-server-cert=false -h m2-master.eqiad.wmnet -u sockpuppet_service -p sockpuppet
Enter password:
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 33310413
Server version: 10.4.13-MariaDB-log MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

sockpuppet_service@m2-master.eqiad.wmnet[sockpuppet]> show tables;
Empty set (0.001 sec)

sockpuppet_service@m2-master.eqiad.wmnet[sockpuppet]>

If you happen to need a firewall hole to connect, you can use this as an example: https://gerrit.wikimedia.org/r/c/operations/puppet/+/643239/1/modules/profile/manifests/mariadb/ferm_misc.pp

@jcrespo could you configure this database to be backed up as part of m2?

Thanks

@jcrespo could you configure this database to be backed up as part of m2?

Backup grants configured - I will check next week that all backups run as expected.

@jcrespo @Marostegui I am speechless - it's possible I'm crying a little. Thanks for all your help. As a note, @hnowlan is off till Monday, but @gmodena is about; I'll ask him to review and give any responses needed from our side.

@jcrespo @Marostegui you guys rock! Thanks a lot for helping out with this. I could not test auth myself, but I'll prepare the CREATE statements for the schemas and will sync with @hnowlan on Monday.

Marostegui closed this task as Resolved. Dec 4 2020, 6:53 AM

Going to close this as resolved, as the DB and grants are deployed. If you need anything else from DBA, please re-open.
Thanks

@hnowlan backups ran yesterday in both datacenters, and they returned an empty database. Please reopen/ping me if that is unexpected.

hey @jcrespo & @hnowlan, just chiming in to let you know that we have not ingested data yet. The empty backup was expected.

Thanks for checking. Let me know if you want me to recheck once data is introduced.

Just FYI, we're running our data ingestion at the moment - both to note that the database will have data now, and to ensure that our ingestion throttling is working acceptably. Let us know if anything stands out.

Thanks for the heads up, I am checking the graphs now - there is obviously an increase in pretty much all the graphs, but nothing too worrying as of now.

jcrespo added a comment (edited). Dec 17 2020, 2:17 PM

@hnowlan thanks indeed. Once you have done a first complete pass, it would be nice, if possible, to aim for a regular import schedule that works for you, the DBAs, and backups, so we maximize available performance for all operations.

While there is no conflict between backups and imports happening at the same time, if we can choose to do them at separate times (e.g. imports at 4am, backups at 7 am, peak db usage at around 6pm), we will have more available resources for each one (backups and imports can happen faster).

Feel free to open a communication line for questions.

@hnowlan how's the throttle being done at the moment, do you have some details about batch sizes, sleep times etc?

@hnowlan how's the throttle being done at the moment, do you have some details about batch sizes, sleep times etc?

Currently we're doing batches of 10000 entries with a 500ms pause between them. If need be, we can tune things down easily.
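The batching-plus-pause loop could look something like the sketch below. This is an illustration only: `insert_batch` is a placeholder for whatever executes one INSERT batch against m2, and the defaults mirror the 10000-row / 500ms settings described above.

```python
import itertools
import time

def batched(rows, size=10000):
    """Yield successive `size`-row chunks from any iterable of rows."""
    it = iter(rows)
    # islice consumes the iterator lazily, so huge TSVs never sit in memory
    while chunk := list(itertools.islice(it, size)):
        yield chunk

def load_throttled(insert_batch, rows, size=10000, pause_s=0.5):
    """Insert `rows` in chunks, pausing between chunks to limit write load.

    `insert_batch` stands in for the real insert call. Both knobs can be
    turned down (e.g. to 7000 rows / 1s) if the master shows contention.
    """
    for chunk in batched(rows, size):
        insert_batch(chunk)
        time.sleep(pause_s)  # throttle so the shared m2 host isn't overloaded
```

Keeping the batch size and pause as parameters makes retuning (as suggested later in this thread) a one-line change rather than a code edit.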

@hnowlan thanks indeed. Once you have done a first complete pass, it would be nice, if possible, to aim for a regular import schedule that works for you, the DBAs, and backups, so we maximize available performance for all operations.

While there is no conflict between backups and imports happening at the same time, if we can choose to do them at separate times (e.g. imports at 4am, backups at 7 am, peak db usage at around 6pm), we will have more available resources for each one (backups and imports can happen faster).

Feel free to open a communication line for questions.

Absolutely - happy to make a more concrete plan for this in the new year. Currently we're doing manual imports, which isn't really a sustainable configuration, so we'll have to rearrange how we do it, and we'll keep you looped in on decisions about the timings throughout.

@hnowlan how's the throttle being done at the moment, do you have some details about batch sizes, sleep times etc?

Currently we're doing batches of 10000 entries with a 500ms pause between them. If need be, we can tune things down easily.

Yeah, I was seeing some contention on the master - let's tune it to maybe 7000 entries and 1 second between batches to see how that goes?

Hey @Marostegui ; so that you know: we'll resume testing the ingestion process in the new year; we expect a new batch of data (similar size) at the beginning of January.
I'll touch base before firing off a new ingestion & tune throttling as you recommended.

Sounds good, thank you

Hi @Marostegui; we would like to attempt a new import today (2021-01-06) at 1700CET. Same volumes as the previous one, we'll tune throttling and batch sizes as you suggested.

@gmodena Manuel is not around today, could we postpone this?

@LSobanski would sometime tomorrow after 1300CET work? Otherwise, could you maybe suggest timeslots that would work for you?

@gmodena Manuel will be around tomorrow, we can sync up on exact timing then. Will that work?

Hey @LSobanski that should definitely work. Many thanks!

@LSobanski would sometime tomorrow after 1300CET work? Otherwise, could you maybe suggest timeslots that would work for you?

Around or a bit after 13:00 CET would work for me today - thanks

@Marostegui ack. We'll kick in ingestion at around 1300CET - or let you know if the plan changes.

@Marostegui FYI: ingestion is in progress

Thanks - let me know when it is done so we can check the whole process in our graphs

The process finished at around 1900 CET. From our end, contention and read/write stats (Grafana) looked ok throughout the process.
Some throughput stats:

Loading /home/isaacj/sockpuppet/data/2020-12/temporal.tsv: 16108838rows [46:37, 5759.06rows/s]
Loading /home/isaacj/sockpuppet/data/2020-12/metadata.tsv: 7187534rows [23:49, 5027.88rows/s]
Loading /home/isaacj/sockpuppet/data/2020-12/coedit_counts.tsv: 94243952rows [4:29:14, 5833.75rows/s]
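As a quick sanity check, the wall time and row count reported for the largest file imply roughly the same rate the loader logged:

```python
# Cross-check the logged throughput for coedit_counts.tsv:
rows = 94_243_952
wall_seconds = 4 * 3600 + 29 * 60 + 14   # 4:29:14 wall time
rate = rows / wall_seconds
# ~5834 rows/s, consistent with the logged 5833.75 rows/s
# (the small gap is down to sub-second rounding in the wall time)
print(round(rate))
```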

Thanks for the heads up.
It looks like it went a lot better this time - there was an initial big spike, but nothing too worrying and within reasonable values. I see no contention or blockages during the imports.
I am happy with these throttling values.

Thanks!