Page MenuHomePhabricator

DBA audit of ipoid database
Open, MediumPublic

Description

Ahead of deploying ipoid, we should make sure that the DBA team are happy with the database schema.

The current schema is:


MariaDB [ipoid]> DESCRIBE actor_data; DESCRIBE actor_data_behaviors; DESCRIBE actor_data_proxies; DESCRIBE actor_data_tunnels; DESCRIBE behaviors; DESCRIBE proxies; DESCRIBE tunnels;
+------------------+-------------------------------------------------------------------------------------------+------+-----+------------------+----------------+
| Field            | Type                                                                                      | Null | Key | Default          | Extra          |
+------------------+-------------------------------------------------------------------------------------------+------+-----+------------------+----------------+
| pkid             | bigint(20) unsigned                                                                       | NO   | PRI | NULL             | auto_increment |
| ip               | varbinary(128)                                                                            | NO   | UNI | NULL             |                |
| org              | varbinary(1280)                                                                           | YES  |     | NULL             |                |
| client_count     | int(10) unsigned                                                                          | YES  |     | NULL             |                |
| types            | set('UNKNOWN','DESKTOP','HEADLESS','IOT','MOBILE')                                        | YES  |     | NULL             |                |
| conc_city        | varbinary(128)                                                                            | YES  |     | NULL             |                |
| conc_state       | varbinary(128)                                                                            | YES  |     | NULL             |                |
| conc_country     | varbinary(32)                                                                             | YES  |     | NULL             |                |
| countries        | int(11)                                                                                   | YES  |     | NULL             |                |
| location_country | varbinary(32)                                                                             | YES  |     | NULL             |                |
| risks            | set('UNKNOWN','CALLBACK_PROXY','GEO_MISMATCH','LOGIN_BRUTEFORCE','TUNNEL','WEB_SCRAPING') | YES  |     | NULL             |                |
| last_updated     | int(11)                                                                                   | YES  |     | unix_timestamp() |                |
+------------------+-------------------------------------------------------------------------------------------+------+-----+------------------+----------------+
12 rows in set (0.002 sec)

+---------------+---------------------+------+-----+---------+-------+
| Field         | Type                | Null | Key | Default | Extra |
+---------------+---------------------+------+-----+---------+-------+
| actor_data_id | bigint(20) unsigned | NO   |     | NULL    |       |
| behavior_id   | bigint(20) unsigned | NO   |     | NULL    |       |
+---------------+---------------------+------+-----+---------+-------+
2 rows in set (0.001 sec)

+---------------+---------------------+------+-----+---------+-------+
| Field         | Type                | Null | Key | Default | Extra |
+---------------+---------------------+------+-----+---------+-------+
| actor_data_id | bigint(20) unsigned | NO   |     | NULL    |       |
| proxy_id      | bigint(20) unsigned | NO   |     | NULL    |       |
+---------------+---------------------+------+-----+---------+-------+
2 rows in set (0.001 sec)

+---------------+---------------------+------+-----+---------+-------+
| Field         | Type                | Null | Key | Default | Extra |
+---------------+---------------------+------+-----+---------+-------+
| actor_data_id | bigint(20) unsigned | NO   |     | NULL    |       |
| tunnel_id     | bigint(20) unsigned | NO   |     | NULL    |       |
+---------------+---------------------+------+-----+---------+-------+
2 rows in set (0.001 sec)

+----------+---------------------+------+-----+---------+----------------+
| Field    | Type                | Null | Key | Default | Extra          |
+----------+---------------------+------+-----+---------+----------------+
| pkid     | bigint(20) unsigned | NO   | PRI | NULL    | auto_increment |
| behavior | varbinary(64)       | YES  | UNI | NULL    |                |
+----------+---------------------+------+-----+---------+----------------+
2 rows in set (0.001 sec)

+-------+---------------------+------+-----+---------+----------------+
| Field | Type                | Null | Key | Default | Extra          |
+-------+---------------------+------+-----+---------+----------------+
| pkid  | bigint(20) unsigned | NO   | PRI | NULL    | auto_increment |
| proxy | varbinary(128)      | YES  | UNI | NULL    |                |
+-------+---------------------+------+-----+---------+----------------+
2 rows in set (0.001 sec)

+-----------+---------------------+------+-----+---------+----------------+
| Field     | Type                | Null | Key | Default | Extra          |
+-----------+---------------------+------+-----+---------+----------------+
| pkid      | bigint(20) unsigned | NO   | PRI | NULL    | auto_increment |
| operator  | varbinary(256)      | YES  | UNI | NULL    |                |
| type      | varbinary(5)        | YES  |     | NULL    |                |
| anonymous | tinyint(1)          | YES  |     | NULL    |                |
+-----------+---------------------+------+-----+---------+----------------+
4 rows in set (0.001 sec)

Typical INSERT query:

INSERT INTO actor_data (ip,org,client_count,types,conc_city,conc_state,conc_country,countries,location_country,risks) VALUES (?,?,?,?,?,?,?,?,?,?);
INSERT INTO actor_data_proxies (actor_data_id, proxy_id) VALUES( (SELECT pkid FROM actor_data WHERE ip = ?), (SELECT pkid FROM proxies WHERE proxy = ?) );

Typical SELECT query:

SELECT * FROM actor_data WHERE ip = ? LIMIT 1;
SELECT * FROM actor_data_proxies WHERE actor_data_id IN (?);

Details

ReferenceSource BranchDest BranchAuthorTitle
repos/mediawiki/services/ipoid!78display-data-labelsmaintchandersDisplay labels for data stored as bitmasks
repos/mediawiki/services/ipoid!75prefer-int-over-setmainstranDon't use sets in schema
repos/mediawiki/services/ipoid!74add-table-pksmainstranUpdate db schemas
Customize query in GitLab

Event Timeline

Tchanders updated the task description. (Show Details)
Tchanders added subscribers: kostajh, Ladsgroup.

From conversations so far, we may need to replace the sets?

Any reason why some tables do not have a PK? I thought we discussed about them on some other tasks. We really need to have PK on all tables.

Marostegui moved this task from Triage to In progress on the DBA board.

if you're querying with ip, you'd need an index on it in actor_data.

Beside that I wouldn't be too worried about the schema, as long as update patterns have been figured out, any needed change can be done via schema changes.

After meeting with @Ladsgroup:

  • If using sets, we should check whether sets are available in Mariadb 10.4 (which we use on prod), but it's better not to
  • Alternative: if there can be multiple values, use bitmask
  • Alternative: if single value, use hard-coded integer (and some kind of small int column) - e.g. see RC types

(If JSON was available we could use that - switching to 10.6 gradually over coming months, but not in time for us...)

What makes sets unideal? We made the change awhile back for legibility: https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/merge_requests/15#note_41879. It's fine if we can't use sets but I'm not familiar enough with sql to know why they aren't preferred.

Can I assume any time we need to query something we'll need an index? There's a comment from Dom here about slow queries: https://phabricator.wikimedia.org/T344273#9177308

@TThoabala I am finding the requests to /vpns and /vpn/:label can be very slow.

I imported part of the Spur data, about 600000 rows in actor_data. Because the tunnels table did not have anything in the type column (see T346634), I set it to VPN in every row. This might not be realistic.

Calls to /vpn/:label can take anywhere from a few seconds to 30 minutes.

For example, /vpn/ABCPROXY_PROXY takes about 10 minutes (I ran the request in both curl and querying the database directly). It only returns 139 IPs.

/vpn/LUMINATI_PROXY took about 30 minutes and returned ~1300 IPs.

The call to /vpns took over 10 minutes before I cancelled it.

This is running on docker.

What makes sets unideal? We made the change awhile back for legibility: https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/merge_requests/15#note_41879. It's fine if we can't use sets but I'm not familiar enough with sql to know why they aren't preferred.

SETs and ENUMs are among anti-patterns of SQL (https://www.oreilly.com/library/view/sql-antipatterns/9781680500073/) and the biggest reason is that if you want to change values of the set (let's assume we want to add a new IP type), then you need to do a schema change and schema changes are much more complicated than code changes.

Can I assume any time we need to query something we'll need an index? There's a comment from Dom here about slow queries: https://phabricator.wikimedia.org/T344273#9177308

If you query for your stats and understanding, no. But if it's part of the service and users will cause that query needing results fast, yes. Writing good indexes is way more complicated than it looks, there is a dedicated chapter on this in High performance MySQL (https://www.oreilly.com/library/view/high-performance-mysql/9780596101718/). Feel free to drop the read queries the service makes here and I can suggest some indexes.

Thanks for the context!

The queries we expect users to run are here: https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/blob/main/routes/feed/v1.js
And are:

// All IPs on a named proxy
SELECT ad.ip FROM actor_data ad
        JOIN actor_data_proxies adp ON ad.pkid = adp.actor_data_id
        JOIN proxies p ON adp.proxy_id = p.pkid WHERE p.proxy = ?;
// All known proxies
SELECT p.proxy AS label, GROUP_CONCAT(ad.ip) AS ips
        FROM actor_data ad
        JOIN actor_data_proxies adp ON ad.pkid = adp.actor_data_id
        JOIN proxies p ON adp.proxy_id = p.pkid
        GROUP BY p.proxy;
// All IPs on a named VPN
`SELECT ad.ip
        FROM actor_data ad
        JOIN actor_data_proxies adp ON ad.pkid = adp.actor_data_id
        JOIN proxies p ON adp.proxy_id = p.pkid
        JOIN actor_data_tunnels adt ON ad.pkid = adt.actor_data_id
        JOIN tunnels t ON adt.tunnel_id = t.pkid
        WHERE t.type = 'VPN' AND p.proxy = ?;`
// All known VPNs
SELECT p.proxy AS label, GROUP_CONCAT(ad.ip) AS ips
          FROM actor_data ad
          JOIN actor_data_proxies adp ON ad.pkid = adp.actor_data_id
          JOIN proxies p ON adp.proxy_id = p.pkid
          JOIN actor_data_tunnels adt ON ad.pkid = adt.actor_data_id
          JOIN tunnels t ON adt.tunnel_id = t.pkid
          WHERE t.type = 'VPN'
          GROUP BY p.proxy;

These endpoints are meant for eventual ingest by varnish (https://gerrit.wikimedia.org/r/c/operations/puppet/+/945819/). The request was for everything to come back in one query. I did a (very very) casual grep to see how big the result might be and grep says around 100k rows.

@Ladsgroup You recommended a bitmasks if multiple values are possible but for ease of legibility and handling, would a lookup table be acceptable? For instance, an actor can have multiple types:

types SET('UNKNOWN','DESKTOP','HEADLESS','IOT','MOBILE'),

and a lookup table could show every combination of these:

CREATE TABLE actor_type (
pkid BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  unknown BOOLEAN,
  desktop BOOLEAN,
  headless BOOLEAN,
  iot BOOLEAN,
  mobile BOOLEAN
  PRIMARY KEY (pkid)
);

and the pkid would then be referred to in actor_data.

That would mean if we need to add more types, we will have to introduce a new column and schema changes which was the exact same reason using SET is discouraged (without the benefits of using SET at least)

Did some quick query analysis and think we also need an index on actor_data_proxies(proxy). I've been working on a limited subset of data to make it faster to iterate and I'm not sure about the last query, as my tunnels table might be too small atm for this EXPLAIN to be useful.


SELECT * FROM actor_data WHERE ip = ? LIMIT 1;

SELECT ad.ip FROM actor_data ad
        JOIN actor_data_proxies adp ON ad.pkid = adp.actor_data_id
        JOIN proxies p ON adp.proxy_id = p.pkid WHERE p.proxy = ?;

SELECT p.proxy AS label, GROUP_CONCAT(ad.ip) AS ips
        FROM actor_data ad
        JOIN actor_data_proxies adp ON ad.pkid = adp.actor_data_id
        JOIN proxies p ON adp.proxy_id = p.pkid
        GROUP BY p.proxy;

SELECT ad.ip
        FROM actor_data ad
        JOIN actor_data_proxies adp ON ad.pkid = adp.actor_data_id
        JOIN proxies p ON adp.proxy_id = p.pkid
        JOIN actor_data_tunnels adt ON ad.pkid = adt.actor_data_id
        JOIN tunnels t ON adt.tunnel_id = t.pkid
        WHERE t.type = 'VPN' AND p.proxy = ?;

SELECT p.proxy AS label, GROUP_CONCAT(ad.ip) AS ips
          FROM actor_data ad
          JOIN actor_data_proxies adp ON ad.pkid = adp.actor_data_id
          JOIN proxies p ON adp.proxy_id = p.pkid
          JOIN actor_data_tunnels adt ON ad.pkid = adt.actor_data_id
          JOIN tunnels t ON adt.tunnel_id = t.pkid
          WHERE t.type = 'VPN'
          GROUP BY p.proxy;

☝️ Those are all the queries we currently expect to run. I ran some EXPLAIN statements:

> EXPLAIN SELECT ad.ip FROM actor_data ad JOIN actor_data_proxies adp ON ad.pkid = adp.actor_data_id JOIN proxies p ON adp.proxy_id = p.pkid WHERE p.proxy = 'ABCPROXY_PROXY';
+------+-------------+-------+--------+---------------+---------+---------+-------------------------+------+-------------+
| id   | select_type | table | type   | possible_keys | key     | key_len | ref                     | rows | Extra       |
+------+-------------+-------+--------+---------------+---------+---------+-------------------------+------+-------------+
|    1 | SIMPLE      | p     | const  | PRIMARY,proxy | proxy   | 131     | const                   | 1    | Using index |
|    1 | SIMPLE      | adp   | ALL    | NULL          | NULL    | NULL    | NULL                    | 9775 | Using where |
|    1 | SIMPLE      | ad    | eq_ref | PRIMARY       | PRIMARY | 8       | ipoid.adp.actor_data_id | 1    |             |
+------+-------------+-------+--------+---------------+---------+---------+-------------------------+------+-------------+

> EXPLAIN SELECT p.proxy AS label, GROUP_CONCAT(ad.ip) AS ips FROM actor_data ad JOIN actor_data_proxies adp ON ad.pkid = adp.actor_data_id JOIN proxies p ON adp.proxy_id = p.pkid GROUP BY p.proxy;
+------+-------------+-------+--------+---------------+---------+---------+-------------------------+------+---------------------------------+
| id   | select_type | table | type   | possible_keys | key     | key_len | ref                     | rows | Extra                           |
+------+-------------+-------+--------+---------------+---------+---------+-------------------------+------+---------------------------------+
|    1 | SIMPLE      | adp   | ALL    | NULL          | NULL    | NULL    | NULL                    | 9775 | Using temporary; Using filesort |
|    1 | SIMPLE      | ad    | eq_ref | PRIMARY       | PRIMARY | 8       | ipoid.adp.actor_data_id | 1    |                                 |
|    1 | SIMPLE      | p     | eq_ref | PRIMARY       | PRIMARY | 8       | ipoid.adp.proxy_id      | 1    |                                 |
+------+-------------+-------+--------+---------------+---------+---------+-------------------------+------+---------------------------------+

So we need to add an index to the actor proxy bc of the ALL table scan that happened?

> CREATE INDEX actor_data_proxies_proxy ON actor_data_proxies(proxy_id);
Query OK, 0 rows affected (0.034 sec)
Records: 0  Duplicates: 0  Warnings: 0

> EXPLAIN SELECT ad.ip FROM actor_data ad JOIN actor_data_proxies adp ON ad.pkid = adp.actor_data_id JOIN proxies p ON adp.proxy_id = p.pkid WHERE p.proxy = 'ABCPROXY_PROXY';
+------+-------------+-------+--------+-----------------------------+--------------------------+---------+-------------------------+------+-------------+
| id   | select_type | table | type   | possible_keys               | key                      | key_len | ref                     | rows | Extra       |
+------+-------------+-------+--------+-----------------------------+--------------------------+---------+-------------------------+------+-------------+
|    1 | SIMPLE      | p     | const  | PRIMARY,proxy,proxies_proxy | proxy                    | 131     | const                   | 1    | Using index |
|    1 | SIMPLE      | adp   | ref    | actor_data_proxies_proxy    | actor_data_proxies_proxy | 8       | const                   | 63   |             |
|    1 | SIMPLE      | ad    | eq_ref | PRIMARY                     | PRIMARY                  | 8       | ipoid.adp.actor_data_id | 1    |             |
+------+-------------+-------+--------+-----------------------------+--------------------------+---------+-------------------------+------+-------------+
3 rows in set (0.004 sec)

> EXPLAIN SELECT p.proxy AS label, GROUP_CONCAT(ad.ip) AS ips FROM actor_data ad JOIN actor_data_proxies adp ON ad.pkid = adp.actor_data_id JOIN proxies p ON adp.proxy_id = p.pkid GROUP BY p.proxy;
+------+-------------+-------+--------+--------------------------+--------------------------+---------+-------------------------+------+-------------+
| id   | select_type | table | type   | possible_keys            | key                      | key_len | ref                     | rows | Extra       |
+------+-------------+-------+--------+--------------------------+--------------------------+---------+-------------------------+------+-------------+
|    1 | SIMPLE      | p     | index  | PRIMARY                  | proxy                    | 131     | NULL                    | 29   | Using index |
|    1 | SIMPLE      | adp   | ref    | actor_data_proxies_proxy | actor_data_proxies_proxy | 8       | ipoid.p.pkid            | 168  |             |
|    1 | SIMPLE      | ad    | eq_ref | PRIMARY                  | PRIMARY                  | 8       | ipoid.adp.actor_data_id | 1    |             |
+------+-------------+-------+--------+--------------------------+--------------------------+---------+-------------------------+------+-------------+
3 rows in set (0.001 sec)

Then for the vpn endpoints, it looks like we don’t need any more indexes after the one on actor_data_proxies?

> EXPLAIN SELECT ad.ip
    ->         FROM actor_data ad
    ->         JOIN actor_data_proxies adp ON ad.pkid = adp.actor_data_id
    ->         JOIN proxies p ON adp.proxy_id = p.pkid
    ->         JOIN actor_data_tunnels adt ON ad.pkid = adt.actor_data_id
    ->         JOIN tunnels t ON adt.tunnel_id = t.pkid
    ->         WHERE t.type = 'VPN' AND p.proxy = 'AWM_PROXY';
+------+-------------+-------+--------+----------------------------------+---------+---------+-------------------------------+------+-------------+
| id   | select_type | table | type   | possible_keys                    | key     | key_len | ref                           | rows | Extra       |
+------+-------------+-------+--------+----------------------------------+---------+---------+-------------------------------+------+-------------+
|    1 | SIMPLE      | p     | const  | PRIMARY,proxy                    | proxy   | 131     | const                         | 1    | Using index |
|    1 | SIMPLE      | adt   | index  | PRIMARY                          | PRIMARY | 16      | NULL                          | 24   | Using index |
|    1 | SIMPLE      | t     | eq_ref | PRIMARY                          | PRIMARY | 8       | ipoid.adt.tunnel_id           | 1    | Using where |
|    1 | SIMPLE      | adp   | eq_ref | PRIMARY,actor_data_proxies_proxy | PRIMARY | 16      | ipoid.adt.actor_data_id,const | 1    | Using index |
|    1 | SIMPLE      | ad    | eq_ref | PRIMARY                          | PRIMARY | 8       | ipoid.adt.actor_data_id       | 1    |             |
+------+-------------+-------+--------+----------------------------------+---------+---------+-------------------------------+------+-------------+

> EXPLAIN SELECT p.proxy AS label, GROUP_CONCAT(ad.ip) AS ips
    ->           FROM actor_data ad
    ->           JOIN actor_data_proxies adp ON ad.pkid = adp.actor_data_id
    ->           JOIN proxies p ON adp.proxy_id = p.pkid
    ->           JOIN actor_data_tunnels adt ON ad.pkid = adt.actor_data_id
    ->           JOIN tunnels t ON adt.tunnel_id = t.pkid
    ->           WHERE t.type = 'VPN'
    ->           GROUP BY p.proxy;
+------+-------------+-------+--------+----------------------------------+---------+---------+-------------------------+------+----------------------------------------------+
| id   | select_type | table | type   | possible_keys                    | key     | key_len | ref                     | rows | Extra                                        |
+------+-------------+-------+--------+----------------------------------+---------+---------+-------------------------+------+----------------------------------------------+
|    1 | SIMPLE      | adt   | index  | PRIMARY                          | PRIMARY | 16      | NULL                    | 112  | Using index; Using temporary; Using filesort |
|    1 | SIMPLE      | t     | eq_ref | PRIMARY                          | PRIMARY | 8       | ipoid.adt.tunnel_id     | 1    | Using where                                  |
|    1 | SIMPLE      | adp   | ref    | PRIMARY,actor_data_proxies_proxy | PRIMARY | 8       | ipoid.adt.actor_data_id | 1    | Using index                                  |
|    1 | SIMPLE      | p     | eq_ref | PRIMARY                          | PRIMARY | 8       | ipoid.adp.proxy_id      | 1    |                                              |
|    1 | SIMPLE      | ad    | eq_ref | PRIMARY                          | PRIMARY | 8       | ipoid.adt.actor_data_id | 1    |                                              |
+------+-------------+-------+--------+----------------------------------+---------+---------+-------------------------+------+----------------------------------------------+
5 rows in set (0.001 sec)

The temporary table and filesort in "Extra" look suspicious. There might be a struggle between grouping by proxy on the one hand but filtering by tunnel type on the other hand.

Discussed with with @STran and it looks like the query is wrong - we need tunnels with tunnels.type = 'VPN', but we shouldn't be joining on the proxies table.

So we need to add an index to the actor proxy bc of the ALL table scan that happened?
Then for the vpn endpoints, it looks like we don’t need any more indexes after the one on actor_data_proxies?

Looks like it.

So we need to add an index to the actor proxy bc of the ALL table scan that happened?
Then for the vpn endpoints, it looks like we don’t need any more indexes after the one on actor_data_proxies?

Looks like it.

Clarification: We don't really know what indexes are needed for the VPN queries until we know what those queries are.

@Marostegui @Ladsgroup The requested changes have been made to the schema and it is ready for re-review.

The tables now look like this:

MariaDB [ipoid]> DESCRIBE actor_data; DESCRIBE actor_data_behaviors; DESCRIBE actor_data_proxies; DESCRIBE actor_data_tunnels; DESCRIBE behaviors; DESCRIBE proxies; DESCRIBE tunnels; DESCRIBE import_status;
+------------------+---------------------+------+-----+------------------+----------------+
| Field            | Type                | Null | Key | Default          | Extra          |
+------------------+---------------------+------+-----+------------------+----------------+
| pkid             | bigint(20) unsigned | NO   | PRI | NULL             | auto_increment |
| ip               | varbinary(128)      | NO   | UNI | NULL             |                |
| org              | varbinary(1280)     | YES  |     | NULL             |                |
| client_count     | int(10) unsigned    | YES  |     | NULL             |                |
| types            | smallint(6)         | YES  |     | NULL             |                |
| conc_city        | varbinary(128)      | YES  |     | NULL             |                |
| conc_state       | varbinary(128)      | YES  |     | NULL             |                |
| conc_country     | varbinary(32)       | YES  |     | NULL             |                |
| countries        | int(11)             | YES  |     | NULL             |                |
| location_country | varbinary(32)       | YES  |     | NULL             |                |
| risks            | smallint(6)         | YES  |     | NULL             |                |
| last_updated     | int(11)             | YES  |     | unix_timestamp() |                |
+------------------+---------------------+------+-----+------------------+----------------+
12 rows in set (0.002 sec)

+---------------+---------------------+------+-----+---------+-------+
| Field         | Type                | Null | Key | Default | Extra |
+---------------+---------------------+------+-----+---------+-------+
| actor_data_id | bigint(20) unsigned | NO   | PRI | NULL    |       |
| behavior_id   | bigint(20) unsigned | NO   | PRI | NULL    |       |
+---------------+---------------------+------+-----+---------+-------+
2 rows in set (0.002 sec)

+---------------+---------------------+------+-----+---------+-------+
| Field         | Type                | Null | Key | Default | Extra |
+---------------+---------------------+------+-----+---------+-------+
| actor_data_id | bigint(20) unsigned | NO   | PRI | NULL    |       |
| proxy_id      | bigint(20) unsigned | NO   | PRI | NULL    |       |
+---------------+---------------------+------+-----+---------+-------+
2 rows in set (0.001 sec)

+---------------+---------------------+------+-----+---------+-------+
| Field         | Type                | Null | Key | Default | Extra |
+---------------+---------------------+------+-----+---------+-------+
| actor_data_id | bigint(20) unsigned | NO   | PRI | NULL    |       |
| tunnel_id     | bigint(20) unsigned | NO   | PRI | NULL    |       |
+---------------+---------------------+------+-----+---------+-------+
2 rows in set (0.001 sec)

+----------+---------------------+------+-----+---------+----------------+
| Field    | Type                | Null | Key | Default | Extra          |
+----------+---------------------+------+-----+---------+----------------+
| pkid     | bigint(20) unsigned | NO   | PRI | NULL    | auto_increment |
| behavior | varbinary(64)       | YES  | UNI | NULL    |                |
+----------+---------------------+------+-----+---------+----------------+
2 rows in set (0.001 sec)

+-------+---------------------+------+-----+---------+----------------+
| Field | Type                | Null | Key | Default | Extra          |
+-------+---------------------+------+-----+---------+----------------+
| pkid  | bigint(20) unsigned | NO   | PRI | NULL    | auto_increment |
| proxy | varbinary(128)      | YES  | UNI | NULL    |                |
+-------+---------------------+------+-----+---------+----------------+
2 rows in set (0.001 sec)

+-----------+---------------------+------+-----+---------+----------------+
| Field     | Type                | Null | Key | Default | Extra          |
+-----------+---------------------+------+-----+---------+----------------+
| pkid      | bigint(20) unsigned | NO   | PRI | NULL    | auto_increment |
| operator  | varbinary(256)      | YES  | UNI | NULL    |                |
| type      | tinyint(4)          | YES  | MUL | NULL    |                |
| anonymous | tinyint(1)          | YES  |     | NULL    |                |
+-----------+---------------------+------+-----+---------+----------------+
4 rows in set (0.001 sec)

+---------------------+---------------------+------+-----+---------------------+----------------+
| Field               | Type                | Null | Key | Default             | Extra          |
+---------------------+---------------------+------+-----+---------------------+----------------+
| batchid             | bigint(20) unsigned | NO   | PRI | NULL                | auto_increment |
| timestamp           | timestamp           | NO   |     | current_timestamp() |                |
| feed_file_yesterday | varbinary(128)      | YES  |     | NULL                |                |
| feed_file_today     | varbinary(128)      | YES  |     | NULL                |                |
| batch_file          | varbinary(128)      | YES  |     | NULL                |                |
| batch_count         | int(11)             | YES  |     | NULL                |                |
| batch_status        | tinyint(4)          | YES  |     | NULL                |                |
+---------------------+---------------------+------+-----+---------------------+----------------+
7 rows in set (0.005 sec)

and the following indexes:

CREATE INDEX actor_data_ip ON actor_data(ip);
CREATE INDEX actor_data_proxies_proxyid ON actor_data_proxies(proxy_id);
CREATE INDEX tunnels_type_operator ON tunnels(type, operator);
CREATE INDEX actor_data_tunnels_tunnelid ON actor_data_tunnels(tunnel_id);

We expect to run the following queries:

// Importing/Updating data
INSERT INTO actor_data (ip,org,client_count,types,conc_city,conc_state,conc_country,countries,location_country,risks) VALUES (?,?,?,?,?,?,?,?,?,?);
INSERT INTO actor_data_proxies (actor_data_id, proxy_id) VALUES( (SELECT pkid FROM actor_data WHERE ip = ?), (SELECT pkid FROM proxies WHERE proxy = ?) );

// /ip/:ip
SELECT * FROM actor_data WHERE ip = ? LIMIT 1;

// Getting and translating pkids for behaviors, proxies, tunnels
SELECT * FROM actor_data_behaviors WHERE actor_data_id IN (?);
SELECT * FROM behaviors WHERE pkid IN ( ? );


// /proxy/:label
SELECT ad.ip FROM actor_data ad
        JOIN actor_data_proxies adp ON ad.pkid = adp.actor_data_id
        JOIN proxies p ON adp.proxy_id = p.pkid WHERE p.proxy = ?;

// /proxies
SELECT p.proxy AS label, GROUP_CONCAT(ad.ip) AS ips
        FROM actor_data ad
        JOIN actor_data_proxies adp ON ad.pkid = adp.actor_data_id
        JOIN proxies p ON adp.proxy_id = p.pkid
        GROUP BY p.proxy;

// /vpn/:label
SELECT ad.ip
        FROM actor_data ad
        JOIN actor_data_tunnels adt ON ad.pkid = adt.actor_data_id
        JOIN tunnels t ON adt.tunnel_id = t.pkid
        WHERE t.type = 1 AND t.operator = ?;

// vpns
SELECT t.operator AS label, GROUP_CONCAT(ad.ip) AS ips
          FROM actor_data ad
          JOIN actor_data_tunnels adt ON ad.pkid = adt.actor_data_id
          JOIN tunnels t ON adt.tunnel_id = t.pkid
          WHERE t.type = 1
          GROUP BY t.operator;