Page MenuHomePhabricator

Ensure dependent SQL statements are all run in the same batch
Closed, ResolvedPublic

Description

The ipoid database needs to be updated daily with new data for that day. To do this we diff yesterday's data and today's data, and produce a .sql file for the changes that need to be made. The file typically contains millions of statements.

Following T344272: Use batching for daily data import, the .sql file is processed in batches, in order not to use up database resources. Batches are created by splitting the file into smaller files with n lines. However, some of the statements have dependencies. Currently, there is no guarantee that dependent statements are in the same batch, so if one batch fails, it might cause subsequent batches to fail.

We should ensure that dependent statements are in the same batch.

Details

TitleReferenceAuthorSource BranchDest Branch
output-diff.js: Ensure statements for one actor are on one linerepos/mediawiki/services/ipoid!47tchandersbatch-dependenciesmain
Customize query in GitLab

Event Timeline

Examples

I think there are two examples of dependent statements:

  1. Updating data about an actor, treated as a DELETE plus INSERT. The INSERT is dependent on the DELETE, since actor_data.ip is unique. (INSERT IGNORE was decided against due to silencing errors. Example:
DELETE FROM actor_data WHERE pkid IN (SELECT pkid FROM actor_data WHERE ip = 1.2.3.4);
INSERT INTO actor_data (ip,org,client_count,types,conc_city,conc_state,conc_country,countries,location_country,risks) VALUES ('1.2.3.4',?,?,?,?,?,?,?,?,?);
  1. Updating actor_data_behaviors, actor_data_proxies, or actor_data_tunnels, which reads IDs from the actor_data and behaviors/proxies/tunnels tables, so those entries must already have been added. Example:
INSERT IGNORE INTO proxies (proxy) VALUES ('SOME_PROXY');
INSERT INTO actor_data (ip,org,client_count,types,conc_city,conc_state,conc_country,countries,location_country,risks) VALUES ('1.2.3.4',?,?,?,?,?,?,?,?,?);
INSERT INTO actor_data_proxies (actor_data_id, proxy_id) VALUES( (SELECT pkid FROM actor_data WHERE ip = '1.2.3.4'), (SELECT pkid FROM proxies WHERE proxy = 'SOME_PROXY' ) );

Solution

We can solve this by putting the statements for the same actor all on the same line.

The dependencies on entries in the behaviors/proxies/tunnels tables will be solved by T344499: Move behaviors/proxies/tunnels processing into an independent script, after which the possible values will already exist in the tables, so we don't need to address those here.