Catch errors when inserting data for individual IP addresses into the database
Open, Needs Triage, Public

Description

ipoid populates a database via the script init-db.js, which reads a data file that typically has tens of millions of lines.

Inserts may be unsuccessful due to the nature of the data for a particular IP address (see the examples reported in the comments below).

The script currently crashes if it encounters an error, but instead it should catch and log the error, and continue.
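
A minimal sketch of the intended behaviour, assuming a per-line loop; insertEntity and logger are placeholders rather than the actual ipoid code:

async function insertEntities(pool, entities, logger) {
    for (const entity of entities) {
        try {
            // insertEntity stands in for the real per-IP insert logic in init-db.js.
            await insertEntity(pool, entity);
        } catch (err) {
            // Log the error together with the offending IP, then carry on with the next line.
            logger.error({ err, ip: entity.ip }, err.message);
        }
    }
}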

Details

Reference: repos/mediawiki/services/ipoid!3
Source Branch: init-db-errors
Dest Branch: main
Author: tchanders
Title: init-db.js: Catch and log errors with inserting data

Event Timeline

@STran How robust do we need/want this to be?

I find that with this input data:

{"as": {}, "client": {"behaviors": [], "concentration": {}, "count": true, "countries": true, "proxies": [0], "spread": false, "types": []}, "infrastructure": false, "location": {}, "organization": false, "risks": [], "services": [], "tunnels": [], "ip": "6dda8fa6-e044-498c-b66d-bcd90f3e99c6"}
{"as": {}, "client": {"behaviors": [], "concentration": {"city": "", "country": 0}, "count": false, "countries": false, "proxies": [false], "spread": false, "types": []}, "infrastructure": false, "location": {}, "organization": "", "risks": [], "services": [], "tunnels": [], "ip": "039e2d19-8584-40cf-ad18-474f0f09c05b"}

I get:

/srv/service/node_modules/mariadb/lib/misc/errors.js:61
  return new SqlError(msg, sql, fatal, info, sqlState, errno, additionalStack, addHeader);
         ^

SqlError: (conn=1045, no: 1054, SQLState: 42S22) Unknown column 'undefined' in 'field list'
sql: INSERT INTO
                actor_data_proxies (actor_data_id,proxy_id)
            VALUES
                (2, undefined) - parameters:[]
    at Object.module.exports.createError (/srv/service/node_modules/mariadb/lib/misc/errors.js:61:10)
    at PacketNodeEncoded.readError (/srv/service/node_modules/mariadb/lib/io/packet.js:511:19)
    at Query.readResponsePacket (/srv/service/node_modules/mariadb/lib/cmd/resultset.js:46:28)
    at PacketInputStream.receivePacketBasic (/srv/service/node_modules/mariadb/lib/io/packet-input-stream.js:104:9)
    at PacketInputStream.onData (/srv/service/node_modules/mariadb/lib/io/packet-input-stream.js:169:20)
    at Socket.emit (node:events:513:28)
    at addChunk (node:internal/streams/readable:315:12)
    at readableAddChunk (node:internal/streams/readable:289:9)
    at Socket.Readable.push (node:internal/streams/readable:228:10)
    at TCP.onStreamRead (node:internal/stream_base_commons:190:23) {
  text: "Unknown column 'undefined' in 'field list'",
  sql: 'INSERT INTO\n' +
    '                actor_data_proxies (actor_data_id,proxy_id)\n' +
    '            VALUES\n' +
    '                (2, undefined) - parameters:[]',
  fatal: false,
  errno: 1054,
  sqlState: '42S22',
  code: 'ER_BAD_FIELD_ERROR'
}

I have also seen similar errors reported with the actor_data_behaviors table.

I think it is something to do with the fact that you have both "proxies": [0] and "proxies": [false], which both get changed to 0 when attempting to insert into the database, so the second INSERT will fail. Note that you can swap the order of those two lines and still get the same error.

FYI, I am using this script to generate the JSON for import: https://gitlab.wikimedia.org/dwalden/ipmasking-testing/-/blob/main/spur_random_data.py

Well, it should probably be robust enough to handle that by throwing away the false value or something. If you could file these as you see them, that'd be amazing! This task was spun out of a vaguer task that acknowledged there could be forms of the data that aren't handled, and asked that we call them out so that we can build handling for them.
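
For illustration, a hypothetical guard along those lines; the validProxies name is made up, and it assumes the proxy values are meant to be non-empty strings:

function validProxies(entity) {
    if (!Array.isArray(entity.proxies)) {
        return [];
    }
    // Keep only non-empty strings; 0, false, null and friends are thrown away.
    return entity.proxies.filter(
        (proxy) => typeof proxy === 'string' && proxy.length > 0
    );
}

// validProxies({ proxies: [0] }) and validProxies({ proxies: [false] }) both return [],
// so no (actor_data_id, undefined) row is ever attempted.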

@STran Do we want to validate every variable before entering it into the database, perhaps falling back to a default if it is not valid? That way, theoretically, there should be no exceptions, caught or uncaught. Let me know if you want me to raise a separate task.
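
As a rough sketch of what that could look like (normalizeEntity and the default values are assumptions, not existing ipoid code; the field paths follow the sample input above):

function normalizeEntity(raw) {
    const client = (raw.client && typeof raw.client === 'object') ? raw.client : {};
    return {
        // Rows without a usable IP would be skipped by the caller.
        ip: (typeof raw.ip === 'string' && raw.ip !== '') ? raw.ip : null,
        client: {
            count: Number.isInteger(client.count) ? client.count : 0,
            behaviors: Array.isArray(client.behaviors) ? client.behaviors : [],
            proxies: Array.isArray(client.proxies) ? client.proxies : [],
            concentration: (client.concentration && typeof client.concentration === 'object') ? client.concentration : {}
        },
        tunnels: Array.isArray(raw.tunnels) ? raw.tunnels : []
    };
}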

Here are all the caught exceptions I have found so far:

Cannot read properties of null (reading 'count') (TypeError)
Cannot read properties of null (reading 'country') (TypeError)
Cannot read properties of null (reading 'operator') (TypeError)
Cannot read properties of undefined (reading 'count') (TypeError)
Cannot read properties of undefined (reading 'country') (TypeError)
Column 'ip' cannot be null
Data too long for column 'conc_country' at row 1
Data too long for column 'conc_geohash' at row 1
Data truncated for column 'client_count' at row 1
Incorrect integer value: '' for column `test`.`actor_data`.`conc_skew` at row 1
Incorrect integer value: '\\0001\\009A\\001C?ç?¥£??Ù??ǽ¼\\0008蠐Ù\\0080\\000EGÒ\\000E' for column `test`.`actor_data`.`conc_density` at row 1
Incorrect integer value: '®' for column `test`.`actor_data`.`countries` at row 1
Incorrect integer value: 'Ë-¨HÆÕ?îS\\009DÑ´?N_哕 \\0083' for column `test`.`actor_data`.`client_count` at row 1
Out of range value for column 'client_count' at row 1
Out of range value for column 'conc_density' at row 1
Out of range value for column 'conc_skew' at row 1
Out of range value for column 'countries' at row 1
Parameter at position 1 is undefined
Parameter at position 2 is undefined
entity.behaviors.forEach is not a function (TypeError)
entity.proxies.forEach is not a function (TypeError)

Data file to reproduce:

I have found some circumstances where caught exceptions lead to the script recording only partial data for an IP.

For example, this data:

{"as": {"number": 12345, "Organization": "Foobar"}, "client": {"behaviors": 1, "concentration": {"geohash": "srwcr5ugt", "city": "Foobar", "state": "Foobar", "country": "NL", "skew": 0.5, "density": 0.5, "countries": 10}, "count": 10, "countries": 10, "proxies": ["OXYLABS_PROXY"], "spread": 12345, "types": ["MOBILE"]}, "infrastructure": "MOBILE", "location": {"city": "Baku", "state": "Baku City", "country": "AZ"}, "organization": "Foobar", "risks": ["CALLBACK_PROXY"], "services": ["IPSEC"], "tunnels": [{"operator": "TUNNELBEAR_VPN", "type": "VPN", "anonymous": true}], "ip": "d79b6038-e748-4b18-935e-b410454a2044"}

Output of import-data.js:

{"name":"data-import","hostname":"dca69ef4d4bc","pid":1341,"level":50,"err":{"message":"entity.behaviors.forEach is not a function","name":"TypeError","stack":"TypeError: entity.behaviors.forEach is not a function\n    at writeEntityToDb (/srv/service/import-data.js:122:22)\n    at processTicksAndRejections (node:internal/process/task_queues:96:5)\n    at async Transform.transform [as _transform] (/srv/service/import-data.js:254:13)"},"ip":"d79b6038-e748-4b18-935e-b410454a2044","msg":"entity.behaviors.forEach is not a function","time":"2023-06-27T12:04:28.790Z","v":0}

But it does record an entry in the actor_data table, while all the other tables are left empty, even though the input has legitimate entries for proxies and tunnels (see the sketch after the query output below):

SELECT * FROM actor_data;
+------+--------------------------------------+--------+--------------+--------+--------------+-----------+------------+--------------+-----------+--------------+-----------+------------------+-------+--------------+
| pkid | ip                                   | org    | client_count | types  | conc_geohash | conc_city | conc_state | conc_country | conc_skew | conc_density | countries | location_country | risks | last_updated |
+------+--------------------------------------+--------+--------------+--------+--------------+-----------+------------+--------------+-----------+--------------+-----------+------------------+-------+--------------+
|    1 | d79b6038-e748-4b18-935e-b410454a2044 | Foobar |           10 | MOBILE | srwcr5ugt    | Foobar    | Foobar     | NL           |         1 |            1 |        10 | AZ               |       |   1687867468 |
+------+--------------------------------------+--------+--------------+--------+--------------+-----------+------------+--------------+-----------+--------------+-----------+------------------+-------+--------------+
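
A hedged reconstruction of how that partial write probably happens; this is not the real writeEntityToDb, just a sketch consistent with the stack trace above:

async function writeEntitySketch(conn, entity) {
    // 1. The actor_data row is written successfully.
    await conn.query('INSERT INTO actor_data (ip) VALUES (?)', [entity.ip]);
    // 2. entity.behaviors is presumably the 1 from the sample line above rather than
    //    an array, so this throws "entity.behaviors.forEach is not a function"...
    entity.behaviors.forEach((behavior) => {
        // insert into actor_data_behaviors
    });
    // 3. ...and the proxies/tunnels inserts further down are never reached, so only
    //    the partial actor_data row is left behind once the caller catches the error.
}

Wrapping the per-entity inserts in a single transaction (conn.beginTransaction() / conn.commit() / conn.rollback() in the mariadb driver) would be one way to avoid leaving half-written rows, but that is only a suggestion, not what the script currently does.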