Page MenuHomePhabricator

Duplicate database entries
Open, Needs TriagePublicBUG REPORT

Description

What is the problem?

There are circumstances where duplicate rows are written to the database tables.

So far, I have seen it with actor_data_behaviors, actor_data_proxies, behaviors and proxies.

I am not sure if this will lead to any problems. Perhaps incorrect counting of numbers of proxies?

Steps to reproduce problem
  1. Download and put it into the ipoid/tmp directory
  2. If necessary, start up docker in the ipoid directory (e.g. docker compose up -d)
  3. Initialise the database: docker compose exec web node init-db.js
  4. Run this command: docker compose exec web node import-data.js ./tmp/duplicate_rows_reprod.json.gz
  5. Access the database (e.g. docker compose exec db mysql test -u root -p)
  6. Run the queries:
SELECT * FROM (SELECT *, COUNT(*) AS duplicates FROM actor_data_behaviors GROUP BY actor_data_id,behavior_id) AS foo WHERE duplicates > 1;
SELECT * FROM (SELECT *, COUNT(*) AS duplicates FROM actor_data_proxies GROUP BY actor_data_id,proxy_id) AS foo WHERE duplicates > 1;
SELECT * FROM (SELECT *, COUNT(*) AS duplicates FROM behaviors GROUP BY behavior) AS foo WHERE duplicates > 1;
SELECT * FROM (SELECT *, COUNT(*) AS duplicates FROM proxies GROUP BY proxy) AS foo WHERE duplicates > 1;

Expected behavior: No results should be returned
Observed behavior: Output like:

+---------------+-------------+------------+
| actor_data_id | behavior_id | duplicates |
+---------------+-------------+------------+
|            52 |           2 |          2 |
|            53 |           2 |          2 |
|           106 |          14 |          2 |
|           107 |          10 |          2 |
|           107 |          14 |          2 |
+---------------+-------------+------------+
5 rows in set (0.001 sec)

+---------------+----------+------------+
| actor_data_id | proxy_id | duplicates |
+---------------+----------+------------+
|            54 |        3 |          2 |
|            54 |        7 |          5 |
|           108 |        3 |          2 |
|           108 |       14 |          5 |
+---------------+----------+------------+
4 rows in set (0.000 sec)

+------+----------+------------+
| pkid | behavior | duplicates |
+------+----------+------------+
|    2 | NULL     |          2 |
+------+----------+------------+
1 row in set (0.000 sec)

+------+-------+------------+
| pkid | proxy | duplicates |
+------+-------+------------+
|    1 | NULL  |          6 |
+------+-------+------------+
1 row in set (0.000 sec)
Environment

ipoid commit 8cb472e49f7b8297940a808a004a66a81ebbfaf4

Event Timeline

(Did a really quick skim of this so please double check what I'm saying)

I don't think there's a bug in the actor_* tables. Those are expected to duplicate actor ids and proxy ids since actors can have multiple proxies and proxies can be used by multiple actors.

The null thing might be a problem but iirc we've merged the guard patch already so it's possible that's fixed. We probably don't want null values in the behaviors/proxies/tunnels so could someone check the schema and see if we should add a NOT NULL?