
Investigate non-deterministic behaviour of import
Open, Needs Triage, Public, BUG REPORT

Description

What is the problem?

I have seen a few examples of data being imported incorrectly or differently, but it is not consistently reproducible.

I wonder if this is a symptom of a larger issue which we should investigate.

The example below shows a row not being written to the actor_data_tunnels table when it should be. It happens in roughly half of my attempts.

Here are other examples I have found so far:

Steps to reproduce problem
  1. Save the JSON from "Reproduction data" below as a .gz file (e.g. reprod.json.gz) into the ipoid/tmp directory
  2. If necessary, start up docker in the ipoid directory (e.g. docker compose up -d)
  3. Initialise the database: docker compose exec web node init-db.js
  4. Run this command: docker compose exec web node import-data.js ./tmp/reprod.json.gz
  5. Access the database (e.g. docker compose exec db mysql test -u root -p)
  6. Run the query: SELECT * FROM actor_data_tunnels;
  7. You may need to repeat steps 3 and 4 multiple times to reproduce (a scripted loop for this is sketched below, after the observed behaviour)

Expected behavior: Output something like:

MariaDB [test]> SELECT * FROM actor_data_tunnels;
+---------------+-----------+
| actor_data_id | tunnel_id |
+---------------+-----------+
|             1 |         1 |
+---------------+-----------+

Observed behavior: Output something like:

MariaDB [test]> Empty set (0.001 sec)
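Because the failure is intermittent, it can help to script the init/import cycle from steps 3 and 4 and check the table after each run. The following is a minimal sketch, not part of the repo; it assumes the docker compose stack above is already running and that the MariaDB root password is available in a DB_ROOT_PASSWORD environment variable (adjust to your local setup).

// repro-loop.js — hypothetical helper, not part of ipoid.
// Repeats init-db/import-data and checks whether any row was written
// to actor_data_tunnels. Run from the ipoid directory.
const { execSync } = require('child_process');

const RUNS = 20;
const pwd = process.env.DB_ROOT_PASSWORD || ''; // assumption: root password is set here

for (let i = 1; i <= RUNS; i++) {
  // Steps 3 and 4 from the description (-T disables TTY allocation for scripted use).
  execSync('docker compose exec -T web node init-db.js', { stdio: 'inherit' });
  execSync('docker compose exec -T web node import-data.js ./tmp/reprod.json.gz', { stdio: 'inherit' });

  // Step 6, non-interactively: count the rows in actor_data_tunnels.
  const count = execSync(
    `docker compose exec -T db mysql test -u root -p${pwd} --batch --skip-column-names ` +
    `-e "SELECT COUNT(*) FROM actor_data_tunnels;"`
  ).toString().trim();

  console.log(`run ${i}: actor_data_tunnels has ${count} row(s)`);
  if (count === '0') {
    console.log('Reproduced: no row was written to actor_data_tunnels');
    break;
  }
}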
Environment

ipoid commit 8cb472e49f7b8297940a808a004a66a81ebbfaf4

Reproduction data
{"as": {"number": 12345, "Organization": "Foobar"}, "client": {"behaviors": ["FILE_SHARING"], "concentration": {"geohash": "srwcr5ugt", "city": "Foobar", "state": "Foobar", "country": "NL", "skew": 0.5, "density": 0.5, "countries": 10}, "count": 10, "countries": 10, "proxies": ["OXYLABS_PROXY"], "spread": 12345, "types": ["MOBILE"]}, "infrastructure": "MOBILE", "location": {"city": "Baku", "state": "Baku City", "country": "AZ"}, "organization": "Foobar", "risks": ["CALLBACK_PROXY"], "services": ["IPSEC"], "tunnels": [{"operator": "TUNNELBEAR_VPN"}], "ip": "6bcaac10-88d8-4e2c-8e98-ae27d8840e26"}
Useful scripts

QA Results - iPoid

Details

Title | Reference | Author | Source Branch | Dest Branch
Document the counter in async scripts | repos/mediawiki/services/ipoid!143 | tchanders | document-counter | main
Ensure the DB connection stays open until the last statement is run | repos/mediawiki/services/ipoid!62 | tchanders | fix-early-close | main

Event Timeline

I ran this around 10 times on main and was unable to reproduce this and looking at the data, I'm not sure why this one would fail specifically. 50% is a concerning rate of failure. Have longer files failed as well? If they do they fail with similar rates?

STran mentioned this in Unknown Object (Task). Jul 15 2023, 4:21 AM

I ran this around 10 times on main and was unable to reproduce this and looking at the data, I'm not sure why this one would fail specifically. 50% is a concerning rate of failure. Have longer files failed as well? If they do they fail with similar rates?

I tried on a different computer with mariadb 10.4. In ~10 runs I reproduced it every time. I'll see if I can get a longer file to fail.

I have been testing from the updated master branch, importing the different reprod data shared across various tickets and checking the results in the DB following the steps above. I was unable to reproduce the error explained above and got the expected behavior. However, some other reprod data is still raising errors, as seen below.

With reprod data from T339204 (second comment from Dom)

select * from actor_data_tunnels;
+---------------+-----------+
| actor_data_id | tunnel_id |
+---------------+-----------+
|             2 |         1 |
|             1 |         1 |
+---------------+-----------+

With reprod data from T339204 (First comment from Dom)

select * from actor_data_tunnels;
+---------------+-----------+
| actor_data_id | tunnel_id |
+---------------+-----------+
|             2 |         1 |
+---------------+-----------+

With reprod data from T339324 = "ER_DATA_TOO_LONG"

/srv/service/node_modules/mariadb/lib/misc/errors.js:61
  return new SqlError(msg, sql, fatal, info, sqlState, errno, additionalStack, addHeader);
         ^

SqlError: (conn=409, no: 1406, SQLState: 22001) Data too long for column 'type' at row 1
sql: INSERT INTO
                        tunnels (operator, type, anonymous)
                    VALUES
                        (?, ?, ?) - parameters:['¢','𺋒',true]
    at Object.module.exports.createError (/srv/service/node_modules/mariadb/lib/misc/errors.js:61:10)
    at PacketNodeEncoded.readError (/srv/service/node_modules/mariadb/lib/io/packet.js:511:19)
    at Query.readResponsePacket (/srv/service/node_modules/mariadb/lib/cmd/resultset.js:46:28)
    at PacketInputStream.receivePacketBasic (/srv/service/node_modules/mariadb/lib/io/packet-input-stream.js:104:9)
    at PacketInputStream.onData (/srv/service/node_modules/mariadb/lib/io/packet-input-stream.js:169:20)
    at Socket.emit (node:events:513:28)
    at addChunk (node:internal/streams/readable:315:12)
    at readableAddChunk (node:internal/streams/readable:289:9)
    at Socket.Readable.push (node:internal/streams/readable:228:10)
    at TCP.onStreamRead (node:internal/stream_base_commons:190:23) {
  text: "Data too long for column 'type' at row 1",
  sql: 'INSERT INTO\n' +
    '                        tunnels (operator, type, anonymous)\n' +
    '                    VALUES\n' +
    "                        (?, ?, ?) - parameters:['¢','𺋒.\b',true]",
  fatal: false,
  errno: 1406,
  sqlState: '22001',
  code: 'ER_DATA_TOO_LONG'
}

With data from T339837 = "ER_TRUNCATED_WRONG_VALUE_FOR_FIELD"

/srv/service/node_modules/mariadb/lib/misc/errors.js:61
  return new SqlError(msg, sql, fatal, info, sqlState, errno, additionalStack, addHeader);
         ^

SqlError: (conn=425, no: 1366, SQLState: 22007) Incorrect integer value: '?\008DOêÔ' for column `test`.`tunnels`.`anonymous` at row 1
sql: INSERT INTO
                        tunnels (operator, type, anonymous)
                    VALUES
                        (?, ?, ?) - parameters:['¨',-26,'𰄛OêÔ']
    at Object.module.exports.createError (/srv/service/node_modules/mariadb/lib/misc/errors.js:61:10)
    at PacketNodeEncoded.readError (/srv/service/node_modules/mariadb/lib/io/packet.js:511:19)
    at Query.readResponsePacket (/srv/service/node_modules/mariadb/lib/cmd/resultset.js:46:28)
    at PacketInputStream.receivePacketBasic (/srv/service/node_modules/mariadb/lib/io/packet-input-stream.js:104:9)
    at PacketInputStream.onData (/srv/service/node_modules/mariadb/lib/io/packet-input-stream.js:169:20)
    at Socket.emit (node:events:513:28)
    at addChunk (node:internal/streams/readable:315:12)
    at readableAddChunk (node:internal/streams/readable:289:9)
    at Socket.Readable.push (node:internal/streams/readable:228:10)
    at TCP.onStreamRead (node:internal/stream_base_commons:190:23) {
  text: "Incorrect integer value: '?\\008DOêÔ' for column `test`.`tunnels`.`anonymous` at row 1",
  sql: 'INSERT INTO\n' +
    '                        tunnels (operator, type, anonymous)\n' +
    '                    VALUES\n' +
    "                        (?, ?, ?) - parameters:['¨',-26,'𰄛\x8DOêÔ']",
  fatal: false,
  errno: 1366,
  sqlState: '22007',
  code: 'ER_TRUNCATED_WRONG_VALUE_FOR_FIELD'
}

With reprod data from T340178 = "ER_WARN_DATA_OUT_OF_RANGE"

return new SqlError(msg, sql, fatal, info, sqlState, errno, additionalStack, addHeader);
         ^

SqlError: (conn=528, no: 1264, SQLState: 22003) Out of range value for column 'anonymous' at row 1
sql: INSERT INTO
                        tunnels (operator, type, anonymous)
                    VALUES
                        (?, ?, ?) - parameters:['',null,-6860]
    at Object.module.exports.createError (/srv/service/node_modules/mariadb/lib/misc/errors.js:61:10)
    at PacketNodeEncoded.readError (/srv/service/node_modules/mariadb/lib/io/packet.js:511:19)
    at Query.readResponsePacket (/srv/service/node_modules/mariadb/lib/cmd/resultset.js:46:28)
    at PacketInputStream.receivePacketBasic (/srv/service/node_modules/mariadb/lib/io/packet-input-stream.js:104:9)
    at PacketInputStream.onData (/srv/service/node_modules/mariadb/lib/io/packet-input-stream.js:169:20)
    at Socket.emit (node:events:513:28)
    at addChunk (node:internal/streams/readable:315:12)
    at readableAddChunk (node:internal/streams/readable:289:9)
    at Socket.Readable.push (node:internal/streams/readable:228:10)
    at TCP.onStreamRead (node:internal/stream_base_commons:190:23) {
  text: "Out of range value for column 'anonymous' at row 1",
  sql: 'INSERT INTO\n' +
    '                        tunnels (operator, type, anonymous)\n' +
    '                    VALUES\n' +
    "                        (?, ?, ?) - parameters:['\x7F',null,-6860]",
  fatal: false,
  errno: 1264,
  sqlState: '22003',
  code: 'ER_WARN_DATA_OUT_OF_RANGE'
}

With reprod data from T341660:

select * from actor_data_tunnels;
+---------------+-----------+
| actor_data_id | tunnel_id |
+---------------+-----------+
|             1 |         1 |
+---------------+-----------+

Even after reinitialising the DB and importing again, I still get the same result.
If I do not reinitialise the DB, I just get new rows added with the right ID and the same tunnel_id (1).

I get a long table of values from the reprod data shared in T341180.

Screenshot 2023-07-20 at 8.12.24 PM.png (1×1 px, 299 KB)

Screenshot 2023-07-20 at 8.12.01 PM.png (928×990 px, 220 KB)

I ran this around 10 times on main and was unable to reproduce this and looking at the data, I'm not sure why this one would fail specifically. 50% is a concerning rate of failure. Have longer files failed as well? If they do they fail with similar rates?

Importing this file sometimes does not store all the rows in behaviors. It fails in about half of my attempts.

After importing it, I run this query to check for dangling rows: SELECT * FROM behaviors LEFT JOIN actor_data_behaviors ON pkid=behavior_id WHERE actor_data_id IS NULL;

Possibly it is because the offending entry causes an uncaught exception (see T339345#9047859).

dom_walden mentioned this in Unknown Object (Task). Jul 27 2023, 1:37 PM

Took another look at my import and results

Using this reprod data:


Following the steps described in the description above (initialising the DB and re-importing the same file), the only difference I see for this specific reprod file is this one:
Actor data #127 has tunnel_id 7 in one import and tunnel_id 9 in another import.

Screenshot 2023-08-14 at 11.05.17 AM.png (632×724 px, 54 KB)
Screenshot 2023-08-14 at 11.05.08 AM.png (626×716 px, 53 KB)

Using this reprod data:


The import stops on a duplicate entry error each time (initialising the DB each time and importing the same file). The values imported before the error are not the same across runs, except for a few (actor_data 71).
The difference I am seeing is for actor_data 91: its tunnel_id is 11 and 12 in one import and 10 and 12 in the second.
This could be related to the file not being imported entirely, so values are missing.

Screenshot 2023-08-14 at 11.26.06 AM.png (746×718 px, 62 KB)
Screenshot 2023-08-14 at 11.25.53 AM.png (632×724 px, 54 KB)

Using reprod data:


This import stopped every time on a duplicate entry error.
I see an inconsistency with actor_data_id 131, which has tunnel_id 11 in one import and 9 in the second.
actor_data_id 128 has tunnel_id 9, 10, 12, 13 in the first import and 10, 11, 12, 13 in the second.

Screenshot 2023-08-14 at 11.42.47 AM.png (768×718 px, 61 KB)
Screenshot 2023-08-14 at 11.43.04 AM.png (798×732 px, 64 KB)

These inconsistencies happened on every repeated import.
I suspect the issue is the same as the one observed in T339325.

@dom_walden Thanks for the helpful task description, test data, scripts, etc.

I ran init/import with the test data provided several times, and encountered the bug every time.

To investigate, I logged the SQL statements generated in mariadb/lib/connection.js and typically saw something like this:

REPLACE INTO
            actor_data (ip,org,client_count,types,conc_city,conc_state,conc_country,countries,location_country,risks)
        VALUES
            (?,?,?,?,?,?,?,?,?,?)
SELECT pkid from behaviors where behavior = ?;
SELECT pkid from proxies where proxy = ?;
SELECT pkid from tunnels where operator = ?;
INSERT INTO behaviors (behavior) VALUES (?);
INSERT INTO proxies (proxy) VALUES (?);
INSERT INTO
                        tunnels (operator, type, anonymous)
                    VALUES
                        (?, ?, ?)
DELETE FROM actor_data where last_updated < UNIX_TIMESTAMP(NOW() - INTERVAL 1 DAY)

The final statement comes from the data stream's close event handler, which also closes the connection pool.

The bug here is that the pool is being terminated before the various async functions updating the database have finished running.
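In outline, the failure mode looks something like the sketch below. This is a simplified illustration rather than the actual ipoid code: the gunzip/readline pipeline, the insertRecord() helper and the connection details are assumptions made for the example. Statements are fired from the stream's line handler without being awaited, while the close handler runs the cleanup DELETE and ends the pool, so the pool can be gone before the pending inserts settle. The fix in !62 amounts to tracking the pending statements and only ending the pool after they have all resolved.

// Simplified sketch of the race and its fix — not the actual ipoid implementation.
const fs = require('fs');
const zlib = require('zlib');
const readline = require('readline');
const mariadb = require('mariadb');

// Assumption: connection details match the local docker compose setup.
const pool = mariadb.createPool({
  host: 'db',
  user: 'root',
  password: process.env.DB_ROOT_PASSWORD,
  database: 'test',
});

// Hypothetical stand-in for the per-record REPLACE/INSERT statements shown in the log above.
async function insertRecord(pool, record) {
  return pool.query('REPLACE INTO actor_data (ip) VALUES (?)', [record.ip]);
}

async function importFile(path, pool) {
  const rl = readline.createInterface({
    input: fs.createReadStream(path).pipe(zlib.createGunzip()),
  });

  const pending = [];
  rl.on('line', (line) => {
    // Buggy version: fire-and-forget — nothing awaits this promise, so the
    // close handler can end the pool while the insert is still in flight.
    // Fixed version: collect the promise so it can be awaited below.
    pending.push(insertRecord(pool, JSON.parse(line)));
  });

  await new Promise((resolve) => rl.on('close', resolve));

  // Fix: wait for every queued statement before cleanup and shutdown.
  await Promise.all(pending);
  await pool.query(
    'DELETE FROM actor_data WHERE last_updated < UNIX_TIMESTAMP(NOW() - INTERVAL 1 DAY)'
  );
  await pool.end();
}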

I'll re-use this task for actually fixing the problem.

tchanders updated https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/merge_requests/62

Ensure the DB connection stays open until the last statement is run

Ready for QA, but moving to stalled until we've made calling the script easier. (It has changed since the task description was written and will likely change again)

Ready for QA, but moving to stalled until we've made calling the script easier. (It has changed since the task description was written and will likely change again)

Moving back to QA since this is sorted. Presumably won't need further instructions if tested after T341122: Implement daily data update routine - but please let me know if I'm wrong about that!

@Tchanders After step 6 as seen below, I received an error.

Status: ❌ FAIL
Environment: iPoid: 9ab97d4d2d80fe3dffd80b33a135095e90a62eca, Tue Oct 17 07:55:28 2023
OS: macOS Sonoma 14.0
Browser: Chrome 117
Skins: N/A
Device: MBA M2
Emulated Device: n/a
Test Links:
Visual Studio - under iPoid folder

❌AC1: https://phabricator.wikimedia.org/T341660

STEPS

  1. Save the JSON from "Reproduction data" in the description as a .gz file (e.g. reprod1.json.gz) into the ipoid/tmp directory
  2. If necessary, start up docker in the ipoid directory (e.g. docker compose up -d)
  3. Added Users: docker compose exec web node -e "require('./create-users.js')();"
  4. Initialise the database: docker compose exec web node -e "require('./init-db.js')();"
  5. Ran this command: docker compose exec web ./diff.sh --today tmp/reprod1.json.gz
  6. Import Script: docker compose exec web ./import.sh
Importing properties generated by diff...
Resetting and generating substatement files...
Importing /tmp/sub/query_split_aa.sql...
{"name":"data-import","hostname":"8607fac2092c","pid":242,"level":50,"err":{"message":"(conn=28, no: 1048, SQLState: 23000) Column 'tunnel_id' cannot be null\nsql: INSERT INTO actor_data_tunnels (actor_data_id, tunnel_id) VALUES( (SELECT pkid FROM actor_data WHERE ip = '6bcaac10-88d8-4e2c-8e98-ae27d8840e26'), (SELECT pkid FROM tunnels WHERE operator = 'TUNNELBEAR_VPN' ) ); - parameters:[]","name":"SqlError","stack":"SqlError: (conn=28, no: 1048, SQLState: 23000) Column 'tunnel_id' cannot be null\nsql: INSERT INTO actor_data_tunnels (actor_data_id, tunnel_id) VALUES( (SELECT pkid FROM actor_data WHERE ip = '6bcaac10-88d8-4e2c-8e98-ae27d8840e26'), (SELECT pkid FROM tunnels WHERE operator = 'TUNNELBEAR_VPN' ) ); - parameters:[]\n    at Object.module.exports.createError (/srv/service/node_modules/mariadb/lib/misc/errors.js:64:10)\n    at PacketNodeEncoded.readError (/srv/service/node_modules/mariadb/lib/io/packet.js:575:19)\n    at Query.readResponsePacket (/srv/service/node_modules/mariadb/lib/cmd/parser.js:58:28)\n    at PacketInputStream.receivePacketBasic (/srv/service/node_modules/mariadb/lib/io/packet-input-stream.js:85:9)\n    at PacketInputStream.onData (/srv/service/node_modules/mariadb/lib/io/packet-input-stream.js:145:26)\n    at Socket.emit (node:events:513:28)\n    at addChunk (node:internal/streams/readable:315:12)\n    at readableAddChunk (node:internal/streams/readable:289:9)\n    at Socket.Readable.push (node:internal/streams/readable:228:10)\n    at TCP.onStreamRead (node:internal/stream_base_commons:190:23)","code":"ER_BAD_NULL_ERROR"},"yesterdayFile":"","todayFile":"tmp/reprod1.json.gz","batchFile":"/tmp/sub/query_split_aa.sql","statement":"'INSERT INTO actor_data_tunnels (actor_data_id, tunnel_id) VALUES( (SELECT pkid FROM actor_data WHERE ip = \\'6bcaac10-88d8-4e2c-8e98-ae27d8840e26\\'), (SELECT pkid FROM tunnels WHERE operator = \\'TUNNELBEAR_VPN\\' ) );'","msg":"(conn=28, no: 1048, SQLState: 23000) Column 'tunnel_id' cannot be null\nsql: INSERT INTO actor_data_tunnels (actor_data_id, tunnel_id) VALUES( (SELECT pkid FROM actor_data WHERE ip = '6bcaac10-88d8-4e2c-8e98-ae27d8840e26'), (SELECT pkid FROM tunnels WHERE operator = 'TUNNELBEAR_VPN' ) ); - parameters:[]","time":"2023-10-18T15:38:02.161Z","v":0}
node:internal/process/promises:279
            triggerUncaughtException(err, true /* fromPromise */);
            ^

SqlError: (conn=28, no: 1048, SQLState: 23000) Column 'tunnel_id' cannot be null
sql: INSERT INTO actor_data_tunnels (actor_data_id, tunnel_id) VALUES( (SELECT pkid FROM actor_data WHERE ip = '6bcaac10-88d8-4e2c-8e98-ae27d8840e26'), (SELECT pkid FROM tunnels WHERE operator = 'TUNNELBEAR_VPN' ) ); - parameters:[]
    at Object.module.exports.createError (/srv/service/node_modules/mariadb/lib/misc/errors.js:64:10)
    at PacketNodeEncoded.readError (/srv/service/node_modules/mariadb/lib/io/packet.js:575:19)
    at Query.readResponsePacket (/srv/service/node_modules/mariadb/lib/cmd/parser.js:58:28)
    at PacketInputStream.receivePacketBasic (/srv/service/node_modules/mariadb/lib/io/packet-input-stream.js:85:9)
    at PacketInputStream.onData (/srv/service/node_modules/mariadb/lib/io/packet-input-stream.js:145:26)
    at Socket.emit (node:events:513:28)
    at addChunk (node:internal/streams/readable:315:12)
    at readableAddChunk (node:internal/streams/readable:289:9)
    at Socket.Readable.push (node:internal/streams/readable:228:10)
    at TCP.onStreamRead (node:internal/stream_base_commons:190:23) {
  sqlMessage: "Column 'tunnel_id' cannot be null",
  sql: "INSERT INTO actor_data_tunnels (actor_data_id, tunnel_id) VALUES( (SELECT pkid FROM actor_data WHERE ip = '6bcaac10-88d8-4e2c-8e98-ae27d8840e26'), (SELECT pkid FROM tunnels WHERE operator = 'TUNNELBEAR_VPN' ) ); - parameters:[]",
  fatal: false,
  errno: 1048,
  sqlState: '23000',
  code: 'ER_BAD_NULL_ERROR'
}

Thanks @GMikesell-WMF.

This is now a different bug: we only store tunnels whose type is VPN or PROXY, but we still attempt to link an actor with a tunnel that has neither of these types (so we try to reference a tunnel that was never stored, hence the SQL error).

We should solve this bug as part of T348659: [S] Decide which tunnels to store.
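For illustration only, one possible shape for such a guard is sketched below. This is not the decided fix (that belongs to T348659), and the current import actually goes through generated SQL batch files rather than per-row Node calls; linkActorToTunnel() is a hypothetical helper. The idea is simply to look the tunnel up first and skip the link row when the tunnel was intentionally not stored.

// Sketch of one possible mitigation; linkActorToTunnel() is hypothetical.
async function linkActorToTunnel(pool, ip, operator) {
  const rows = await pool.query('SELECT pkid FROM tunnels WHERE operator = ?', [operator]);
  if (!rows.length) {
    // The tunnel was filtered out at import time (type neither VPN nor PROXY),
    // so there is nothing to link — avoids inserting a NULL tunnel_id and
    // hitting the ER_BAD_NULL_ERROR above.
    return;
  }
  await pool.query(
    'INSERT INTO actor_data_tunnels (actor_data_id, tunnel_id) ' +
    'VALUES ((SELECT pkid FROM actor_data WHERE ip = ?), ?)',
    [ip, rows[0].pkid]
  );
}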

The non-deterministic problem should be solved, so I'll close this in favour of the other task.

The non-deterministic problem should be solved, so I'll close this in favour of the other task.

Because we haven't been able to test whether or not the non-deterministic behaviour still exists, I might argue that we should keep this open for now until either:

  1. We find another way to test for non-determinism (perhaps we can update the script to work with more recent changes to the ipoid cli), or
  2. T348659 is resolved

@Tchanders @GMikesell-WMF

Fair enough - let's block on T348659: [S] Decide which tunnels to store and bring this into the new sprint.

kostajh removed a subscriber: AGueyte.

@STran @Tchanders is this task still relevant, or can we close it?