
Set up MariaDB for iPoid
Closed, ResolvedPublic

Description

As discussed in https://phabricator.wikimedia.org/T301428#7730915, we have decided to go with MariaDB for our storage needs. After reviewing the MariaDB documentation, we saw that there is a Miscellaneous cluster that seems like it would fit our needs, since this is not a MediaWiki project. It's unclear whether containerized projects can use the MariaDB clusters, and there aren't many examples of other projects doing this. We could also run our own MariaDB in a container, but for production purposes it seems better to use a production-ready cluster. However, if running our own is preferred, please let us know and send some examples our way, and we can move forward with building our own DB.

Suggested Database creation template fields

  • QPS: not sure
  • Size: ~6 GB (we're not planning to store more than 4 GB...but we might need more)
  • DB Name: ipoid
  • Accessed from server(s): Kubernetes wikikube cluster
  • Backup Policy: Last 24h (T305114#7823480)
  • Grants & Users (from T305114#8846665)
    • ipoid_rw: SELECT, DELETE, UPDATE, INSERT, CREATE, CREATE TEMPORARY TABLES
    • ipoid_ro: SELECT

Suggested Schema (from T305114#7897318)

CREATE TABLE actor_data (
   pkid BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
   ip VARBINARY(128) NOT NULL,
   org VARBINARY(128),
   client_count INT UNSIGNED,
   -- 1 = 'DESKTOP', 2 = 'HEADLESS', 3 = 'IOT', 4 = 'MOBILE'
   types SET('1','2','3','4'),
   conc_geohash VARBINARY(16),
   conc_city VARBINARY(32),
   conc_state VARBINARY(32),
   conc_country VARBINARY(32),
   conc_skew INT,
   conc_density INT,
   countries INT,
   location_country VARBINARY(32),
   -- 1 = 'CALLBACK_PROXY', 2 = 'GEO_MISMATCH', 3 = 'LOGIN_BRUTEFORCE', 4 = 'TUNNEL', 5 = 'WEB_SCRAPING'
   risks SET('1','2','3','4','5'),
   PRIMARY KEY (pkid)
);

CREATE TABLE behaviors (
   pkid BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
   behaviors VARBINARY(64),
   PRIMARY KEY (pkid)
);

CREATE TABLE actor_data_behaviors (
   actor_data_id BIGINT UNSIGNED,
   behaviors_id BIGINT UNSIGNED
);

CREATE TABLE proxies (
   pkid BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
   proxy VARBINARY(32),
   PRIMARY KEY (pkid)
);

CREATE TABLE actor_data_proxies (
   actor_data_id BIGINT UNSIGNED,
   proxies_id BIGINT UNSIGNED
);

CREATE TABLE tunnels (
   pkid BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
   tunnels VARBINARY(32),
   PRIMARY KEY (pkid)
);

CREATE TABLE actor_data_tunnels (
   actor_data_id BIGINT UNSIGNED,
   tunnels_id BIGINT UNSIGNED
);
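
For reference, the many-to-many tables above would be resolved at read time with plain joins. A hypothetical lookup (illustrative only, not part of the proposed schema; the bound value for a.ip depends on however IP addresses end up being encoded) might look like:

-- Illustrative only: fetch the behaviors linked to one actor_data row
-- through the actor_data_behaviors join table.
SELECT a.ip, b.behaviors
FROM actor_data a
JOIN actor_data_behaviors ab ON ab.actor_data_id = a.pkid
JOIN behaviors b ON b.pkid = ab.behaviors_id
WHERE a.ip = ?;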

Event Timeline

There are a very large number of changes, so older changes are hidden.

At least that's how I'd design this, though I'm certainly not an expert-level SQL DBA.

That's literally my job, to help other teams in their DB work.

Thanks for the example and the schema. From a quick look at the data, I can suggest these changes:

  • pkid should be UNSIGNED, and possibly BIGINT if you plan to insert and remove a lot (note that an upsert also consumes an auto_increment value even if it only updates or is a no-op)
  • the storage of ip should change away from varchar(12). For IPv4, PHP has native support for storing them in a much more compact way in the database; see https://stackoverflow.com/a/6427886. I don't know how that can be pulled off in nodejs :( but I can look it up if you need me to (see also the sketch after this list).
  • I think you can easily normalize the services column out to another table. Am I missing something obvious?
  • count is a reserved keyword; technically you can still create a column with that name, but to avoid headaches I suggest picking another one.
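
On the IP storage point, one database-side option (a sketch only; the thread below ends up discussing application-side packing with npm's ip-num instead) is MariaDB's built-in INET6_ATON()/INET6_NTOA() functions, which handle both IPv4 and IPv6 and fit within a VARBINARY(16) column:

-- Sketch of database-side packing; the actor_data/ip names follow the
-- schema above, but whether to pack in SQL or in the Node.js service
-- was still an open question at this point.
INSERT INTO actor_data (ip) VALUES (INET6_ATON('2001:db8::1'));
SELECT INET6_NTOA(ip) FROM actor_data WHERE ip = INET6_ATON('2001:db8::1');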

The most important thing is how/why the 4 GB update happens every day. How do these rows become completely useless after 24 hours? E.g. if you only need to update the count field, you can simply update it instead of doing a remove+insert.

Hey @Ladsgroup -

Note: after discussing this a bit more with the other folks working on this project, I've made some changes to the schema from what was originally proposed.

Thanks again for all of the help on this. We received some v2 sample data from the vendor recently. The new data looks like this single example object: P26622. So it's a bit different from our initial assumptions. After digging through several million rows of vendor-provided raw data to confirm types and values, I've arrived at this table design, which incorporates some of your suggestions above:

CREATE TABLE actor_data (
   pkid BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
   ip VARBINARY(128) NOT NULL,
   org VARCHAR(128),
   client_count INT UNSIGNED,
   types SET('DESKTOP','HEADLESS','IOT','MOBILE'),
   conc_geohash VARCHAR(16),
   conc_city VARCHAR(32),
   conc_state VARCHAR(32),
   conc_country VARCHAR(32),
   conc_skew INT,
   conc_density INT,
   countries INT,
   location_country VARCHAR(32),
   risks SET('CALLBACK_PROXY','GEO_MISMATCH','LOGIN_BRUTEFORCE','TUNNEL','WEB_SCRAPING'),
   PRIMARY KEY (pkid)
);

CREATE TABLE behaviors (
   pkid BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
   behaviors VARCHAR(64),
   PRIMARY KEY (pkid)
);

CREATE TABLE actor_data_behaviors (
   actor_data_id BIGINT UNSIGNED,
   behaviors_id BIGINT UNSIGNED
);

CREATE TABLE proxies (
   pkid BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
   proxy VARCHAR(32),
   PRIMARY KEY (pkid)
);

CREATE TABLE actor_data_proxies (
   actor_data_id BIGINT UNSIGNED,
   proxies_id BIGINT UNSIGNED
);

CREATE TABLE tunnels (
   pkid BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
   tunnels VARCHAR(32),
   PRIMARY KEY (pkid)
);

CREATE TABLE actor_data_tunnels (
   actor_data_id BIGINT UNSIGNED,
   tunnels_id BIGINT UNSIGNED
);

Some follow-up questions:

  1. I'm pretty sure all of the VARCHAR length values should be reasonable under current assumptions, and I just don't see a way around being somewhat flexible with these columns given the sample data we've been provided.
  2. Does DBA have any best practices around the SET data type? If it's "don't use them", that's fine :) They can be migrated to proper tables. Otherwise, is there a practical limit on the number and size of values they should allow?
  3. After the two SET fields, there are basically three many-to-many data types we need to break out: behaviors, proxies and tunnels. Does what I've provided here seem reasonable? FK constraints seemed a bit much in my mind since the primary data operations here are 1) wholesale import data into the db 2) wholesale delete data from the db 3) run various, simple SELECTs against the db tables. Given that we're already planning to chunk the database import over a longer duration for larger vendor data feeds (as you suggested above) do you envision any additional performance or consistency issues with this approach?
  4. We were looking into compacting both ipv4 and ipv6 data with something like npm's ip-num package.

Thanks!

Some follow-up questions:

  1. I'm pretty sure all of the VARCHAR length values should be reasonable under current assumptions, and I just don't see a way around being somewhat flexible with these columns given the sample data we've been provided.

I suggest using VARBINARY() so you don't need to deal with the mess of MySQL charsets. That's how MediaWiki does it.

As long as it's "var", the size really doesn't matter, as high as you can is okay (don't go too high) but it means we are flexible. There are some niche things like how innodb doesn't differentiate between binary and varbinary in indexes in memory but that's really too small to be of an issue here.

  1. Does DBA have any best practices around the SET data type? If it's "don't use them", that's fine :) They can be migrated to proper tables. Otherwise, is there a practical limit on the number and size of values they should allow?
  2. After the two SET fields, there are basically three many-to-many data types we need to break out: behaviors, proxies and tunnels. Does what I've provided here seem reasonable? FK constraints seemed a bit much in my mind since the primary data operations here are 1) wholesale import data into the db 2) wholesale delete data from the db 3) run various, simple SELECTs against the db tables. Given that we're already planning to chunk the database import over a longer duration for larger vendor data feeds (as you suggested above) do you envision any additional performance or consistency issues with this approach?

SET and ENUM are usually discouraged because any change to them requires a schema change. One suggestion is to always encode the values as constants in your code, e.g. TUNNEL = 1, TUNNEL_AND_FOO = 2, etc., and store an int instead. For sets it's a bit more complicated, but it should be doable.
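
A rough sketch of that idea for the types values (hypothetical column name and flag constants; the final schema below keeps SET columns instead): a set can be stored as an integer bitmask, with the flags owned by the application code.

-- Hypothetical: flags defined in the application, e.g.
-- DESKTOP = 1, HEADLESS = 2, IOT = 4, MOBILE = 8, stored as a bitmask.
ALTER TABLE actor_data ADD COLUMN types_mask TINYINT UNSIGNED NOT NULL DEFAULT 0;

-- A row that is both DESKTOP and MOBILE stores 1 | 8 = 9.
UPDATE actor_data SET types_mask = 9 WHERE pkid = 1;

-- All rows whose set contains MOBILE:
SELECT pkid FROM actor_data WHERE types_mask & 8;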

  1. We were looking into compacting both ipv4 and ipv6 data with something like npm's ip-num package.

Thanks!

Thanks again, @Ladsgroup. Posting the final version here, with minor changes suggested above, just so that it's publicly documented somewhere:

CREATE TABLE actor_data (
   pkid BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
   ip VARBINARY(128) NOT NULL,
   org VARBINARY(128),
   client_count INT UNSIGNED,
   -- 1 = 'DESKTOP', 2 = 'HEADLESS', 3 = 'IOT', 4 = 'MOBILE'
   types SET('1','2','3','4'),
   conc_geohash VARBINARY(16),
   conc_city VARBINARY(32),
   conc_state VARBINARY(32),
   conc_country VARBINARY(32),
   conc_skew INT,
   conc_density INT,
   countries INT,
   location_country VARBINARY(32),
   -- 1 = 'CALLBACK_PROXY', 2 = 'GEO_MISMATCH', 3 = 'LOGIN_BRUTEFORCE', 4 = 'TUNNEL', 5 = 'WEB_SCRAPING'
   risks SET('1','2','3','4','5'),
   PRIMARY KEY (pkid)
);

CREATE TABLE behaviors (
   pkid BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
   behaviors VARBINARY(64),
   PRIMARY KEY (pkid)
);

CREATE TABLE actor_data_behaviors (
   actor_data_id BIGINT UNSIGNED,
   behaviors_id BIGINT UNSIGNED
);

CREATE TABLE proxies (
   pkid BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
   proxy VARBINARY(32),
   PRIMARY KEY (pkid)
);

CREATE TABLE actor_data_proxies (
   actor_data_id BIGINT UNSIGNED,
   proxies_id BIGINT UNSIGNED
);

CREATE TABLE tunnels (
   pkid BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
   tunnels VARBINARY(32),
   PRIMARY KEY (pkid)
);

CREATE TABLE actor_data_tunnels (
   actor_data_id BIGINT UNSIGNED,
   tunnels_id BIGINT UNSIGNED
);
sbassett lowered the priority of this task from Medium to Low. May 4 2022, 8:34 PM

Keeping this task open and prioritizing as low for now, for when we actually need to create this database within production.

Removing the DBA tag for now. Please re-add it once you want us to proceed.

Mstyles moved this task from In Progress to Blocked on the iPoid-Service board.
sbassett moved this task from Waiting to Back Orders on the Security-Team board.

Are DBA and SRE folks aware that this entire database will essentially be wiped every one or two days and recreated from a dump? Does that complicate things in any way?

Hi, yes, it would complicate replication (and basically choke it on every update). It's always better to diff and only change what's needed. I assume this data doesn't change that often.

Thanks for replying, @Ladsgroup

@STran can say more, but I have a few questions:

  1. IIUC, the data comes in as a gzipped JSON file. How do you recommend diffing this against what's in the SQL table?
  2. What do we need to replicate this to? The second datacenter? Given that we completely recreate this whole database every other day and can easily regenerate it whenever we need, do we even need to set up replication?

On diffing, it depends. I suggest batching: query 10,000 rows and check what exists. To make things simpler, you can just drop anything that changed and re-insert it. I'd be happy to review and assist with anything possible.

It's always important to have replicas, for high availability and ease of maintenance (we even do some of it automatically), e.g. when we need to take a db out of rotation for security updates.
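
As a rough illustration of the batching suggestion above (placeholder values only; how the IPs are encoded and how "changed" is detected is up to the import script):

-- Check which of the next ~10,000 feed entries already exist:
SELECT pkid, ip, org, client_count
FROM actor_data
WHERE ip IN (?, ?, ?);  -- up to ~10,000 values from the current batch

-- For entries whose data changed, the simple route mentioned above:
DELETE FROM actor_data WHERE ip IN (?, ?, ?);  -- changed IPs only
-- ...then re-insert the fresh rows for those IPs.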

If you want to delete and recreate the entire database every 2 days, you do have to keep replication lag in mind.
For deleting, you could delete a number of rows per second and wait for replication; the same goes for inserting.

  2. What do we need to replicate this to? The second datacenter? Given that we completely recreate this whole database every other day and can easily regenerate it whenever we need, do we even need to set up replication?

You don't need to worry about it. This database will live on a set of servers that already host other databases, and it is automatically replicated to the second datacenter. It will all be transparent to your service.
Your service only needs to point to a given CNAME and that should be it.

Batching is fine and makes sense. As built, it already OOMs on dev trying to ingest the entire feed, so we would need to batch to solve that problem regardless. @Ladsgroup is there a good way to expire rows? We should confirm with the vendor, but I believe IPs can possibly drop off, so I think we'd need to be able to delete stale rows if that's the case.

The way that parsercache does it is that you introduce a new column called expiry, you add an index on it, and with a cron job (a systemd timer in our infra, but that's a detail) you delete anything past its expiry (in batches); that should be quite straightforward. Most queries should also have a condition that the expiry has not passed (unless you're okay with serving stale data). HTH, let me know if you want something different and we can think of something.
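
A minimal sketch of that parsercache-style pattern, with made-up column and index names (the real schema change went through its own review later in this thread):

-- Hypothetical expiry column plus index:
ALTER TABLE actor_data
   ADD COLUMN expiry TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
   ADD INDEX actor_data_expiry (expiry);

-- Periodic purge (systemd timer), deleting expired rows in batches:
DELETE FROM actor_data WHERE expiry < NOW() LIMIT 1000;

-- Reads skip expired rows unless stale data is acceptable:
SELECT pkid, ip FROM actor_data WHERE ip = ? AND expiry > NOW();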

The vendor replied with numbers over the last two days. Here are some numbers that might help decide on a strategy. There is a new dump of IPs every day. Between the last two successive dumps (today and yesterday) there is:

  • 10-12% daily change
  • 1.7M New entries
  • 2M Aged off entries (i.e. dropped/removed)

Total number of IPs today was around 21M. But the churn is highly variable.

The vendor replied with numbers over the last two days. Here are some numbers that might help decide on a strategy. There is a new dump of IPs every day. Between the last two successive dumps (today and yesterday) there is:

  • 10-12% daily change
  • 1.7M New entries
  • 2M Aged off entries (i.e. dropped/removed)

Total number of IPs today was around 21M. But the churn is highly variable.

Batching is fine and makes sense. As built, it already OOMs on dev trying to ingest the entire feed, so we would need to batch to solve that problem regardless. @Ladsgroup is there a good way to expire rows? We should confirm with the vendor, but I believe IPs can possibly drop off, so I think we'd need to be able to delete stale rows if that's the case.

@STran Are we able to download yesterday's data, or does the vendor only provide a "current" data dump? If we can download both, we could make sure that any IPs from yesterday's data that don't exist in today's are removed from the database when we are processing the current day's data import.

@STran Are we able to download yesterday's data, or does the vendor only provide a "current" data dump? If we can download both, we could make sure that any IPs from yesterday's data that don't exist in today's are removed from the database when we are processing the current day's data import.

@kostajh - I just added you to our internal client directory with some docs on the feed products. I'll Slack you about that in a second. I believe one can pull the compressed data from the vendor any time they'd like, but the feed updates once every 24 hours (there's another version that updates every 5 minutes, but I don't think we purchased that one). Anyhow, we could do a diff, though it is a fairly massive amount of information - 4 to 6 GB uncompressed.

I thought the technical solution was going to be another script that runs daily, checking for stale rows based on an expiry/last_updated column. I added the expiry column already, but have a patch up to change that to last_updated if you want to chat about it there (or in the tickets that discuss implementing these features).

I'll also have more information by the end of this week but it seems non-trivial to process all of these rows. I've been working on batching and it's taken about 45 minutes for my local server to process 2.5M lines (which suggests about 6 hours to finish writing everything). I don't think we want to diff, given this rough time estimate. It seems more efficient to run a query once a day deleting anything with a timestamp more than a day old.

I thought the technical solution was going to be another script that runs daily, checking for stale rows based on an expiry/last_updated column. I added the expiry column already, but have a patch up to change that to last_updated if you want to chat about it there (or in the tickets that discuss implementing these features).

I see; using the last_updated column makes sense to me.

I'll also have more information by the end of this week but it seems non-trivial to process all of these rows. I've been working on batching and it's taken about 45 minutes for my local server to process 2.5M lines (which suggests about 6 hours to finish writing everything). I don't think we want to diff, given this rough time estimate. It seems more efficient to run a query once a day deleting anything with a timestamp more than a day old.

I'd propose we continue this discussion in T305724: Investigate database data invalidation questions and chunked/timed API to MySQL/MariaDB ETL and document in that task (and on wikitech as part of T336165: Create Wikitech page documenting the iPoid service) what approaches we considered for the import/expiration and the one we've settled on. In the meantime, maybe this task should be "Stalled" until we're ready for the DBAs to create the database and tables.

For what is worth, and in parallel to the clean up discussion, if I can get the following information for the database creation:

User: Originally it was said: mstyles, stran, sbassett, but we should probably get just one global user, unless it is _really_ needed to have different users.
Further, it is strange that we only need those users and not a user for the application itself. Is this intended, or was it just missed?

Grants needed: We need to know which grants the users accessing this data will need, as well as which the application itself will need.

I'll also have more information by the end of this week but it seems non-trivial to process all of these rows. I've been working on batching and it's taken about 45 minutes for my local server to process 2.5M lines (which suggests about 6 hours to finish writing everything). I don't think we want to diff, given this rough time estimate. It seems more efficient to run a query once a day deleting anything with a timestamp more than a day old.

Unfortunately that's not possible: deleting everything and reinserting it again would choke the replication not just for this database but also for the rest of the databases in the same cluster, leading to extended outages/stale data in every service relying on databases in that cluster.

If it takes 6 hours on your PC, it might still be fast in production, given that database servers have a decent amount of memory and reads are much faster from memory. And generally database servers are optimized for I/O operations. Maybe the diffing logic can also be optimized. We can't be sure until it's deployed and we run those scripts there.

I'll also have more information by the end of this week but it seems non-trivial to process all of these rows. I've been working on batching and it's taken about 45 minutes for my local server to process 2.5M lines (which suggests about 6 hours to finish writing everything). I don't think we want to diff, given this rough time estimate. It seems more efficient to run a query once a day deleting anything with a timestamp more than a day old.

Unfortunately that's not possible: deleting everything and reinserting it again would choke the replication not just for this database but also for the rest of the databases in the same cluster, leading to extended outages/stale data in every service relying on databases in that cluster.

Hmm, what if the batching / delete process is something like:

  • use 1,000 entries at a time from the file
  • of the 1,000 entries, UPDATE the relevant row with a new last_updated value if found, otherwise INSERT
  • once the import finishes, run the delete script to get rid of rows from previous day/import

That would mean there's a period of time when there is some stale data in the DB (after every batch from the import finishes, and before the deletion script finishes), but maybe that window is small enough that it doesn't matter.

Alternatively, can the UPDATE/INSERT be done in a single transaction (does that work for 17,000,000 rows, which I see in one of the dump files), and the DELETEs in a separate transaction after the previous transaction succeeds? Then the window for stale data would be pretty much non-existent, AIUI.
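
For what it's worth, the per-batch UPDATE-or-INSERT step described above could be a single upsert per row, assuming a UNIQUE index on ip and the last_updated column discussed earlier (illustrative only, not the actual import code):

-- Assumes: UNIQUE index on actor_data.ip, plus a last_updated column.
-- Note (per the earlier comment): this still consumes an auto_increment
-- value even when it only updates an existing row.
INSERT INTO actor_data (ip, org, client_count, last_updated)
VALUES (?, ?, ?, NOW())
ON DUPLICATE KEY UPDATE
   org = VALUES(org),
   client_count = VALUES(client_count),
   last_updated = NOW();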

Doing a 17M-row operation in one transaction is going to cause lots of replication lag.
In my opinion, the best way would be to write a script that deletes in batches of, let's start with, 1k rows and checks for replication lag: if there is none, it can keep deleting; if there is lag, it stops and waits for it. We can increase those batch sizes while testing until we see some lag appear, so we know more or less which transaction size is the most efficient without causing lag.
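
One iteration of such a delete script might look roughly like this (sketch only; the replication-lag check between iterations happens in the script/tooling rather than in SQL, and last_updated and the cutoff are placeholders):

-- Delete one bounded batch of rows left over from the previous import:
DELETE FROM actor_data
WHERE last_updated < ?   -- cutoff timestamp for the previous import
ORDER BY pkid
LIMIT 1000;
-- Repeat until no rows are affected, pausing whenever replication lag appears.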

I'll also have more information by the end of this week but it seems non-trivial to process all of these rows. I've been working on batching and it's taken about 45 minutes for my local server to process 2.5M lines (which suggests about 6 hours to finish writing everything). I don't think we want to diff, given this rough time estimate. It seems more efficient to run a query once a day deleting anything with a timestamp more than a day old.

Unfortunately that's not possible: deleting everything and reinserting it again would choke the replication not just for this database but also for the rest of the databases in the same cluster, leading to extended outages/stale data in every service relying on databases in that cluster.

Hmm, what if the batching / delete process is something like:

  • use 1,000 entries at a time from the file
  • of the 1,000 entries, UPDATE the relevant row with a new last_updated value if found, otherwise INSERT
  • once the import finishes, run the delete script to get rid of rows from previous day/import

That would mean there's a period of time when there is some stale data in the DB (after every batch from the import finishes, and before the deletion script finishes), but maybe that window is small enough that it doesn't matter.

For avoiding stale data, you can add expiry > NOW() (i.e. the row has not yet expired) as a condition to your SELECT queries; that would basically mean stale rows will be automatically ignored. Would that help?

Alternatively, can the UPDATE/INSERT be done in a single transaction (does that work for 17,000,000 rows, which I see in one of the dump files), and the DELETEs in a separate transaction after the previous transaction succeeds? Then the window for stale data would be pretty much non-existent, AIUI.

It wouldn't work.

User: Originally it was said: mstyles, stran, sbassett, but we should probably get just one global user, unless it is _really_ needed to have different users.
Further, it is strange that we only need those users and not a user for the application itself. Is this intended, or was it just missed?

Grants needed: We need to know which grants the users accessing this data will need, as well as which the application itself will need.

Er, I think that's a mistake. We'd definitely need to create an application user for this. For privileges, I think we'd want these users to be granted most or all of the basic Database, Table and Column privileges as defined in MariaDB's docs. I'm not sure if DBA has general rules or preferences around these sorts of things though, so we'd obviously want to defer to your team on best practices.

User: Originally it was said: mstyles, stran, sbassett, but we should probably get just one global user, unless it is _really_ needed to have different users.
Further, it is strange that we only need those users and not a user for the application itself. Is this intended, or was it just missed?

Grants needed: We need to know which grants the users accessing this data will need, as well as which the application itself will need.

Er, I think that's a mistake. We'd definitely need to create an application user for this. For privileges, I think we'd want these users to be granted most or all of the basic Database, Table and Column privileges as defined in MariaDB's docs. I'm not sure if DBA has general rules or preferences around these sorts of things though, so we'd obviously want to defer to your team on best practices.

My advice would be to create two users: one for the application and one for reading the data.
I would guess that at the very least, for normal application usage, you'd need:
SELECT, DELETE, UPDATE, INSERT, CREATE, CREATE TEMPORARY TABLES

For the read-only user I'd go for just SELECT for now.

My advice would be to create two users: one for the application and one for reading the data.
I would guess that at the very least, for normal application usage, you'd need:
SELECT, DELETE, UPDATE, INSERT, CREATE, CREATE TEMPORARY TABLES

For the read-only user I'd go for just SELECT for now.

This sounds like a perfectly reasonable starting point. I'm not sure we care about what the usernames will be, though it's looking like the new service name will be "iPoid" (T336218).

My advice would be to create two users: one for the application and one for reading the data.
I would guess that at the very least, for normal application usage, you'd need:
SELECT, DELETE, UPDATE, INSERT, CREATE, CREATE TEMPORARY TABLES

For the read-only user I'd go for just SELECT for now.

This sounds like a perfectly reasonable starting point. I'm not sure we care about what the usernames will be, though it's looking like the new service name will be "iPoid" (T336218).

Ok, we can go for:

ipoid_rw: SELECT, DELETE, UPDATE, INSERT, CREATE, CREATE TEMPORARY TABLES
ipoid_ro: SELECT
@sbassett maybe we should also call the database ipoid, or is there any specific reason why you chose security_api for it?
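
Expressed as plain SQL, those grants would look roughly like this (illustrative only, assuming the database ends up being named ipoid; the authoritative definitions, including host scoping and passwords, live in the production-m5.sql puppet patch referenced below):

GRANT SELECT, DELETE, UPDATE, INSERT, CREATE, CREATE TEMPORARY TABLES
   ON ipoid.* TO 'ipoid_rw'@'%';
GRANT SELECT ON ipoid.* TO 'ipoid_ro'@'%';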

Change 920194 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] production-m5.sql: Add ipoid grants

https://gerrit.wikimedia.org/r/920194

I have created ipoid database on m5.
Connections will need to go to: m5-master.eqiad.wmnet

The initial agreed grants are waiting for a review at https://gerrit.wikimedia.org/r/c/operations/puppet/+/920194

@sbassett how would the application side handle this database being unavailable (either fully down or in read-only mode)? As you probably know, from time to time we do have to do maintenance on our hosts (kernel upgrades, MariaDB upgrades, etc.); we usually switch masters, which involves some read-only time (around 1-2 minutes).
But in some cases, if the server fully crashes, the database might not be available at all. How would it cope with these scenarios?

@sbassett the following tables would need a PK; or, in other words, is there any specific reason why they don't have one?

  • actor_data_behaviors
  • actor_data_proxies
  • actor_data_tunnels

@sbassett how would the application side handle this database being unavailable (either fully down or in read-only mode)? As you probably know, from time to time we do have to do maintenance on our hosts (kernel upgrades, MariaDB upgrades, etc.); we usually switch masters, which involves some read-only time (around 1-2 minutes).
But in some cases, if the server fully crashes, the database might not be available at all. How would it cope with these scenarios?

Likely rely upon an intermediate caching layer or just throw some kind of "system temporarily unavailable" error. While this system is important, I do not personally believe that it needs to guarantee anything close to the uptime of the projects or other mission-critical production systems. @STran or @Mstyles - any other thoughts here?

@sbassett the following tables would need a PK; or, in other words, is there any specific reason why they don't have one?

  • actor_data_behaviors
  • actor_data_proxies
  • actor_data_tunnels

These were initially designed to be join tables due to some quirks in the way the vendor organizes the JSON data which ultimately populates these tables. If we need auto-increment PKs for indexing or some other reason, that's fine, but our application layers currently would not make use of them.

@sbassett the following tables would need a PK; or, in other words, is there any specific reason why they don't have one?

  • actor_data_behaviors
  • actor_data_proxies
  • actor_data_tunnels

These were initially designed to be join tables due to some quirks in the way the vendor organizes the JSON data which ultimately populates these tables. If we need auto-increment PKs for indexing or some other reason, that's fine, but our application layers currently would not make use of them.

Using PKs is normally preferred in database design to ensure data integrity and prevent data duplication. I am not sure whether the data can have duplicate entries in those tables (and whether that is ok), but operations without a PK tend to be quite a pain. If duplicate data is fine from an application point of view, we can simply go for actor_data_id as the PK.
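
If it helps, one possible shape for that change (sketch only; whether to key on actor_data_id alone or on both columns is exactly the judgment call above, the composite form shown here also rules out duplicate pairs):

ALTER TABLE actor_data_behaviors
   ADD PRIMARY KEY (actor_data_id, behaviors_id);
-- ...and similarly for actor_data_proxies and actor_data_tunnels.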

Change 920194 merged by Marostegui:

[operations/puppet@production] production-m5.sql: Add ipoid grants

https://gerrit.wikimedia.org/r/920194

I have deployed the grants on m5 for the ipoid database and tested the connection for both users:

# mysql --ssl-verify-server-cert=false -uipoid_rw -p -h m5-master.eqiad.wmnet
Enter password:
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 12584271
Server version: 10.6.12-MariaDB-log MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

ipoid_rw@m5-master.eqiad.wmnet((none))> Ctrl-C -- exit!
Aborted
root@mwmaint1002:/home/jiji# mysql --ssl-verify-server-cert=false -uipoid_ro -p -h m5-master.eqiad.wmnet
Enter password:
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 12584297
Server version: 10.6.12-MariaDB-log MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

ipoid_ro@m5-master.eqiad.wmnet((none))> Ctrl-C -- exit!


# mysql --ssl-verify-server-cert=false -uipoid_rw -p -h m5-master.eqiad.wmnet
Enter password:
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 12584249
Server version: 10.6.12-MariaDB-log MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

ipoid_rw@m5-master.eqiad.wmnet((none))> Ctrl-C -- exit!

@jijiki I assume you'll be the one productionizing this and placing this in the private repo, so I have left the passwords in your `/home/` on mwmaint1002:

root@mwmaint1002:/home/jiji# pwd
/home/jiji
root@mwmaint1002:/home/jiji# ls -lh ipoid
-rw-rw-r-- 1 root root 52 May 17 06:50 ipoid
jijiki renamed this task from Set up MariaDB for Security API to Set up MariaDB for iPoid. May 17 2023, 10:36 AM

Using PKs is normally preferred in database design to ensure data integrity and prevent data duplication. I am not sure whether the data can have duplicate entries in those tables (and whether that is ok), but operations without a PK tend to be quite a pain. If duplicate data is fine from an application point of view, we can simply go for actor_data_id as the PK.

Ok, sounds good. Thanks.

Marostegui claimed this task.

Per my chat with @jijiki this can be closed.

We're preparing to do a test deploy and import soon. Given that, we were wondering if:

  • We could connect to a database to do manual updates
  • How terrible it would be if one of those manual updates was dropping the database

There's a non-zero chance we might have to drop the entire thing and start over. We can try to mitigate this by running the first import with a subset of data but if at some point things get bad (too much drift, compounding errors, db), is truncating and starting over possible?


Additionally, iPoid mimics MediaWiki's table-updating mechanism (https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/tree/main/schema?ref_type=heads) and we've been using the ipoid_rw user to run updates on the schema, so it's been using ALTER queries. Would it be possible to 1) drop the database and let the ipoid_rw user re-create it, and 2) give that user ALTER permissions so it can run the updates?

cc @Marostegui @Ladsgroup

We're preparing to do a test deploy and import soon. Given that, we were wondering if:

  • We could connect to a database to do manual updates

You can always use the user/password provided. However, I strongly recommend not doing manual operations unless you're really sure of the implications (e.g. a massive delete can create replication lag).
Right now the database is empty, so I'd say go ahead. Though we normally want to get a heads-up when things like this are going to happen, in case we see some strange pattern in the graphs.

  • How terrible it would be if one of those manual updates was dropping the database

Your user doesn't have the DROP privilege, and I'd prefer it remain that way. If you need to drop the database, it is probably better if you file a ticket.

There's a non-zero chance we might have to drop the entire thing and start over. We can try to mitigate this by running the first import with a subset of data but if at some point things get bad (too much drift, compounding errors, db), is truncating and starting over possible?

The user doesn't have TRUNCATE either, but we can do that for you. If that becomes a recurring pattern, we can re-evaluate the current grants.


Additionally, iPoid mimics MediaWiki's table-updating mechanism (https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/tree/main/schema?ref_type=heads) and we've been using the ipoid_rw user to run updates on the schema, so it's been using ALTER queries. Would it be possible to 1) drop the database and let the ipoid_rw user re-create it, and 2) give that user ALTER permissions so it can run the updates?

cc @Marostegui @Ladsgroup

We can give it ALTER, for sure. As I wrote above, you cannot truncate or drop; we can do that for you. The ipoid_rw user does have CREATE, so it can create the tables.

Change 972388 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] production-m5.sql: Add ALTER to ipoid_rw

https://gerrit.wikimedia.org/r/972388

Change 972388 merged by Marostegui:

[operations/puppet@production] production-m5.sql: Add ALTER to ipoid_rw

https://gerrit.wikimedia.org/r/972388

@Marostegui please go ahead and allow us to DROP TABLE until the app is on its feet. I will reply on this task to have this revoked as soon as we do not need it anymore.

Change 972835 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] production-m5.sql: Add DROP

https://gerrit.wikimedia.org/r/972835

Change 972835 merged by Marostegui:

[operations/puppet@production] production-m5.sql: Add DROP

https://gerrit.wikimedia.org/r/972835

@Marostegui please go ahead and allow us to DROP TABLE until the app is on its feet. I will reply on this task to have this revoked as soon as we do not need it anymore.

There's no such thing as a DROP TABLE grant; there's DROP, which also enables DROP DATABASE. Anyway, I have added it. Please reopen this task once it can be removed.

@Marostegui please go ahead and allow us to DROP TABLE until the app is on its feet. I will reply on this task to have this revoked as soon as we do not need it anymore.

There's no such thing as a DROP TABLE grant; there's DROP, which also enables DROP DATABASE. Anyway, I have added it. Please reopen this task once it can be removed.

I know a DROP is a DROP :)
I just wanted to stress that we only want to use it for DROP TABLE!

Change 973313 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/puppet@production] ipoid: Grant INDEX to ipoid_rw user for ipoid DB

https://gerrit.wikimedia.org/r/973313

Change 973313 abandoned by Kosta Harlan:

[operations/puppet@production] ipoid: Grant INDEX to ipoid_rw user for ipoid DB

Reason:

https://gerrit.wikimedia.org/r/973313

Change 973313 restored by Marostegui:

[operations/puppet@production] ipoid: Grant INDEX to ipoid_rw user for ipoid DB

https://gerrit.wikimedia.org/r/973313

Change 973313 merged by Marostegui:

[operations/puppet@production] ipoid: Grant INDEX to ipoid_rw user for ipoid DB

https://gerrit.wikimedia.org/r/973313

Given that T339331: Prepare for initial data import on production servers is done, can any of the grants be cleaned up?

@Marostegui Can we revisit that question in a week or two? We'd like to make sure that we're able to recover from errors (active work in T344941: Respond to data update and import errors). It may be that we need to drop everything, recreate, and do an initial import again sometime this week.

Absolutely, no problem! Let's make sure we don't forget, though.