Page MenuHomePhabricator

"Data too long for column 'type' at row 1"
Closed, InvalidPublic2 Estimated Story PointsBUG REPORT

Description

What is the problem?

Some combinations of data will lead to the below exception. The exception does not always show (see "Example shell session" below), perhaps because node is entering the data asynchronously.

/srv/service/node_modules/mariadb/lib/misc/errors.js:61
  return new SqlError(msg, sql, fatal, info, sqlState, errno, additionalStack, addHeader);
         ^

SqlError: (conn=45, no: 1406, SQLState: 22001) Data too long for column 'type' at row 1
sql: INSERT INTO
                        tunnels (operator, type, anonymous)
                    VALUES
                        (?, ?, ?) - parameters:['¢','𺋒',true]
    at Object.module.exports.createError (/srv/service/node_modules/mariadb/lib/misc/errors.js:61:10)
    at PacketNodeEncoded.readError (/srv/service/node_modules/mariadb/lib/io/packet.js:511:19)
    at Query.readResponsePacket (/srv/service/node_modules/mariadb/lib/cmd/resultset.js:46:28)
    at PacketInputStream.receivePacketBasic (/srv/service/node_modules/mariadb/lib/io/packet-input-stream.js:104:9)
    at PacketInputStream.onData (/srv/service/node_modules/mariadb/lib/io/packet-input-stream.js:169:20)
    at Socket.emit (node:events:513:28)
    at addChunk (node:internal/streams/readable:315:12)
    at readableAddChunk (node:internal/streams/readable:289:9)
    at Socket.Readable.push (node:internal/streams/readable:228:10)
    at TCP.onStreamRead (node:internal/stream_base_commons:190:23) {
  text: "Data too long for column 'type' at row 1",
  sql: 'INSERT INTO\n' +
    '                        tunnels (operator, type, anonymous)\n' +
    '                    VALUES\n' +
    "                        (?, ?, ?) - parameters:['¢','𺋒.\b',true]",
  fatal: false,
  errno: 1406,
  sqlState: '22001',
  code: 'ER_DATA_TOO_LONG'
}
Reproduction data
{"as": {}, "client": {"behaviors": [null, -106, "\u0003", -30479], "concentration": {"country": true, "skew": -15301, "state": null}, "count": false, "countries": -23218, "proxies": [false, true, false, "\ud9ff\ude86\ud997\udd27\u00d6?\ud92f\udf71\u00f7\u00c7\u00ec\u00d0%\u00dfMX\ud9f6\uddb1+\u000bPB\u00d1\ud8b2\ude56"], "spread": "", "types": ["HEADLESS", "DESKTOP", "HEADLESS", "HEADLESS"]}, "infrastructure": null, "location": {"city": "\udb68\udf27", "country": "\u00fc\u00f0|\u00a8\udb53\ude5a\ud977\udf0c\u00ac&X\ud956\udc18\u001b\udbd1\uddda", "state": "\u0083"}, "organization": "\u00d3\u00a1", "risks": ["WEB_SCRAPING", "WEB_SCRAPING", "WEB_SCRAPING", "WEB_SCRAPING", "GEO_MISMATCH", "LOGIN_BRUTEFORCE", "CALLBACK_PROXY", "WEB_SCRAPING", "TUNNEL", "GEO_MISMATCH", "TUNNEL"], "services": [false, null], "tunnels": [{"anonymous": true, "entries": [true, null, "\udbdf\udd91\u00e90", "\u0082", "\u00a2f\uda3b\udf65\u009c\"\r", false, false, "#", 30612], "operator": "\u00a2", "type": "\ud8a8\uded2.\b"}, {"anonymous": "\ud8c2\udfda", "entries": ["\\", "\u00cb#_\u007f", "a", -8932], "operator": "\u00bf;\u00b8\u000f\u000b\u00c2\u00c8\u00de\ud84b\udcce", "type": "\u00ab"}, {"anonymous": 14808, "entries": ["\udb9b\uded9\u0013\u00de", null, false, true, -7, false, 5732, null, null], "operator": "\u009c", "type": "\u00da\u00a6\u00c2\nJ1\u00a8y\u0097\u00eaH}\u00b3\u00e6\udbf2\udd6f\u0016"}, {"anonymous": "\uda4c\udf55", "entries": [-16, "u"], "operator": 1882290103, "type": "\udb56\udec2\u0000"}, {"anonymous": "\ud8a9\udefb", "entries": [], "operator": "\u00c9\u00b0", "type": "~"}, {"anonymous": -96936237097570309219903303351460102886, "entries": ["\u009eQ\udb9f\udc14"], "operator": "\u009b", "type": null}, {"anonymous": "", "entries": ["\u001d\u00f2\udbc9\udeaa \u008c", true, "\u0005\ud8bf\udda6\uda34\udc51", false, false, null, null, false, 21388, null, 54, -28538, null], "operator": "\uda68\udd1c", "type": "\ua965"}], "ip": "182c7264-8dfc-4f78-8ff1-8bf12fca19a4"}
{"as": {"Organization": false}, "client": {"behaviors": [], "concentration": {"country": 10376}, "count": -10809, "countries": false, "proxies": [], "spread": null, "types": ["IOT", "DESKTOP", "MOBILE", "HEADLESS", "MOBILE", "HEADLESS"]}, "infrastructure": null, "location": {"state": true, "city": false, "country": true}, "organization": "\u00a8\udb49\udda8", "risks": ["CALLBACK_PROXY", "GEO_MISMATCH", "LOGIN_BRUTEFORCE"], "services": ["\udae9\udda4W", "\u00aa", false, true, "\u00c3", false, null], "tunnels": [{"anonymous": "\uda00\udcf9", "entries": ["\udbd2\udd75\u008a\u00b8\u00dc", "\u875e", true, null, "\u00b2\u00147<\u001eZ", 25457, "\u8eca"], "operator": -307550890, "type": "\u000bC\u00e5;\u0084\u0013\u00bee"}, {"anonymous": "(n\u00c4S\u00a5\ud9f8\udedf\u00ac\u00f9\u0012\u00b6", "entries": [null, "", -1162566802, null, 1966272010, false], "operator": false, "type": true}, {"anonymous": "\u00bb\uda98\udc86\ud98f\udc7d", "entries": [false, "\u00f7\u00b0\u00a8", "{Lg\u0085\udb98\udef0\u00e0", "", "\u00b8", "\u00f2", 6619], "operator": "\u0001", "type": "\udaf9\udc1c"}, {"anonymous": null, "entries": ["\ud998\ude1b", null], "operator": null, "type": "\u00be"}, {"anonymous": null, "entries": ["}", 6367086318877159199, "\ud9ed\ude85"], "operator": null, "type": false}, {"anonymous": "p", "entries": ["\u00d2", true, "\ud837\uddb5\ud942\udf7a\u00e4", "\u0088\u00f5u\u00d9\"\u00b7\u00a3\ud923\udf30\udac1\udf09\ud8e5\udd32\u00ac\u00b1\u00d1", true, null, 808646626, true, 14965], "operator": -84, "type": "\u000eU\ud8a1\ude18\udb6a\udd8a<\ud8c7\udfae\ud8f8\ude56Ti\uda8b\udf10\u00f5\udb7b\udf05"}, {"anonymous": "\u00e0\u00e5\udbb3\udc29\u00c4\udb68\udc2d\u00b2\u0096\u00da\u0091", "entries": [-103, 5152, "t", true, null, -41, "r\u00e5\u0099\u00f9", -118, "\u00f5", "\u00cd\u00d06\udb7b\udf76\u0001", "L", -7853, "\u00de", null, "\u00f3"], "operator": "p\u00b5\u0087", "type": -11008}, {"anonymous": 95, "entries": [true, "1\u0006", "\uda2d\udcc4", -15444], "operator": "\u00a2r\u0091\u00d0\u00d7\ud927\udec4\udb67\udd13}\udbb2\udf5d\udb05\udf81\u00b2", "type": null}], "ip": "93ffc461-ac49-43e1-bdc5-e7cad79851f3"}
Steps to reproduce problem
  1. Save the JSON from "Reproduction data" above as a .gz file (e.g. type_too_big.json.gz) into the ipoid/tmp directory
  2. If necessary, start up docker in the ipoid directory (e.g. docker compose up -d)
  3. Initialise the database: docker compose exec web node init-db.js
  4. Run: docker compose exec -e FEED_PATH=./tmp/type_too_big.json.gz web node import-data.js
  5. The first time you run this you might see the exception from T339325 instead, so you might need to run the command again
  6. If you still don't see the exception from the description above, keep running the command until you do
Environment

ipoid commit deed45d61626f952ce8b9365e12e8bd98703fa17.

Example shell session showing exception
$ docker-compose exec -e FEED_PATH=./tmp/type_too_big.json.gz web node import-data.js
{"name":"data-import","hostname":"4122150a58ac","pid":81,"level":50,"err":{"message":"(conn=37, no: 1264, SQLState: 22003) Out of range value for column 'client_count' at row 1\nsql: REPLACE INTO\n            actor_data (ip,org,client_count,types,conc_geohash,conc_city,conc_state,conc_country,conc_skew,conc_density,countries,location_country,risks)\n        VALUES\n            (?,?,?,?,?,?,?,?,?,?,?,?,?) - parameters:['93ffc461-ac49-43e1-...]","name":"Error","stack":"Error: (conn=37, no: 1264, SQLState: 22003) Out of range value for column 'client_count' at row 1\nsql: REPLACE INTO\n            actor_data (ip,org,client_count,types,conc_geohash,conc_city,conc_state,conc_country,conc_skew,conc_density,countries,location_country,risks)\n        VALUES\n            (?,?,?,?,?,?,?,?,?,?,?,?,?) - parameters:['93ffc461-ac49-43e1-...]\n    at Object.module.exports.createError (/srv/service/node_modules/mariadb/lib/misc/errors.js:61:10)\n    at PacketNodeEncoded.readError (/srv/service/node_modules/mariadb/lib/io/packet.js:511:19)\n    at Query.readResponsePacket (/srv/service/node_modules/mariadb/lib/cmd/resultset.js:46:28)\n    at PacketInputStream.receivePacketBasic (/srv/service/node_modules/mariadb/lib/io/packet-input-stream.js:104:9)\n    at PacketInputStream.onData (/srv/service/node_modules/mariadb/lib/io/packet-input-stream.js:169:20)\n    at Socket.emit (node:events:513:28)\n    at addChunk (node:internal/streams/readable:315:12)\n    at readableAddChunk (node:internal/streams/readable:289:9)\n    at Socket.Readable.push (node:internal/streams/readable:228:10)\n    at TCP.onStreamRead (node:internal/stream_base_commons:190:23)","code":"ER_WARN_DATA_OUT_OF_RANGE"},"ip":"93ffc461-ac49-43e1-bdc5-e7cad79851f3","msg":"(conn=37, no: 1264, SQLState: 22003) Out of range value for column 'client_count' at row 1\nsql: REPLACE INTO\n            actor_data (ip,org,client_count,types,conc_geohash,conc_city,conc_state,conc_country,conc_skew,conc_density,countries,location_country,risks)\n        VALUES\n            (?,?,?,?,?,?,?,?,?,?,?,?,?) - parameters:['93ffc461-ac49-43e1-...]","time":"2023-06-16T06:20:48.711Z","v":0}
$ docker-compose exec -e FEED_PATH=./tmp/type_too_big.json.gz web node import-data.js
{"name":"data-import","hostname":"4122150a58ac","pid":98,"level":50,"err":{"message":"(conn=46, no: 1264, SQLState: 22003) Out of range value for column 'client_count' at row 1\nsql: REPLACE INTO\n            actor_data (ip,org,client_count,types,conc_geohash,conc_city,conc_state,conc_country,conc_skew,conc_density,countries,location_country,risks)\n        VALUES\n            (?,?,?,?,?,?,?,?,?,?,?,?,?) - parameters:['93ffc461-ac49-43e1-...]","name":"Error","stack":"Error: (conn=46, no: 1264, SQLState: 22003) Out of range value for column 'client_count' at row 1\nsql: REPLACE INTO\n            actor_data (ip,org,client_count,types,conc_geohash,conc_city,conc_state,conc_country,conc_skew,conc_density,countries,location_country,risks)\n        VALUES\n            (?,?,?,?,?,?,?,?,?,?,?,?,?) - parameters:['93ffc461-ac49-43e1-...]\n    at Object.module.exports.createError (/srv/service/node_modules/mariadb/lib/misc/errors.js:61:10)\n    at PacketNodeEncoded.readError (/srv/service/node_modules/mariadb/lib/io/packet.js:511:19)\n    at Query.readResponsePacket (/srv/service/node_modules/mariadb/lib/cmd/resultset.js:46:28)\n    at PacketInputStream.receivePacketBasic (/srv/service/node_modules/mariadb/lib/io/packet-input-stream.js:104:9)\n    at PacketInputStream.onData (/srv/service/node_modules/mariadb/lib/io/packet-input-stream.js:169:20)\n    at Socket.emit (node:events:513:28)\n    at addChunk (node:internal/streams/readable:315:12)\n    at readableAddChunk (node:internal/streams/readable:289:9)\n    at Socket.Readable.push (node:internal/streams/readable:228:10)\n    at TCP.onStreamRead (node:internal/stream_base_commons:190:23)","code":"ER_WARN_DATA_OUT_OF_RANGE"},"ip":"93ffc461-ac49-43e1-bdc5-e7cad79851f3","msg":"(conn=46, no: 1264, SQLState: 22003) Out of range value for column 'client_count' at row 1\nsql: REPLACE INTO\n            actor_data (ip,org,client_count,types,conc_geohash,conc_city,conc_state,conc_country,conc_skew,conc_density,countries,location_country,risks)\n        VALUES\n            (?,?,?,?,?,?,?,?,?,?,?,?,?) - parameters:['93ffc461-ac49-43e1-...]","time":"2023-06-16T06:20:50.468Z","v":0}
/srv/service/node_modules/mariadb/lib/misc/errors.js:61
  return new SqlError(msg, sql, fatal, info, sqlState, errno, additionalStack, addHeader);
         ^

SqlError: (conn=45, no: 1406, SQLState: 22001) Data too long for column 'type' at row 1
sql: INSERT INTO
                        tunnels (operator, type, anonymous)
                    VALUES
                        (?, ?, ?) - parameters:['¢','𺋒',true]
    at Object.module.exports.createError (/srv/service/node_modules/mariadb/lib/misc/errors.js:61:10)
    at PacketNodeEncoded.readError (/srv/service/node_modules/mariadb/lib/io/packet.js:511:19)
    at Query.readResponsePacket (/srv/service/node_modules/mariadb/lib/cmd/resultset.js:46:28)
    at PacketInputStream.receivePacketBasic (/srv/service/node_modules/mariadb/lib/io/packet-input-stream.js:104:9)
    at PacketInputStream.onData (/srv/service/node_modules/mariadb/lib/io/packet-input-stream.js:169:20)
    at Socket.emit (node:events:513:28)
    at addChunk (node:internal/streams/readable:315:12)
    at readableAddChunk (node:internal/streams/readable:289:9)
    at Socket.Readable.push (node:internal/streams/readable:228:10)
    at TCP.onStreamRead (node:internal/stream_base_commons:190:23) {
  text: "Data too long for column 'type' at row 1",
  sql: 'INSERT INTO\n' +
    '                        tunnels (operator, type, anonymous)\n' +
    '                    VALUES\n' +
    "                        (?, ?, ?) - parameters:['¢','𺋒.\b',true]",
  fatal: false,
  errno: 1406,
  sqlState: '22001',
  code: 'ER_DATA_TOO_LONG'
}

QA Results - Ipoid SQL

Details

TitleReferenceAuthorSource BranchDest Branch
Update getTunnel to only validates ENUM tunnels typerepos/mediawiki/services/ipoid!40agueytetooLongmain
Update getTunnel to only validates ENUM tunnels typerepos/mediawiki/services/ipoid!39agueytetooLongmain
Update getTunnel to only validates ENUM tunnels typerepos/mediawiki/services/ipoid!36agueytedataTooLongmain
Customize query in GitLab

Event Timeline

In the Schema file, the type row was set to 5 characters

CREATE TABLE IF NOT EXISTS tunnels (
   pkid BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
   operator VARBINARY(256) UNIQUE,
   type VARBINARY(5),
   anonymous BOOLEAN,
   PRIMARY KEY (pkid)
);

which creates an error for the first entry of this reprod data:

tunnel type=  '𺋒.\b'

Screenshot 2023-07-24 at 2.15.23 PM.png (392×1 px, 55 KB)

Concerns & Questions:

  • Should we allow more tunnels.type other than "VPN", "PROXY" or "NULL" to be added to the database? Should we expect other tunnels.types?

The Spur documentation says: "This network is an explicit exit point for a VPN, proxy, or other type of service. All traffic from this network is likely associated with this tunnel."
{F37149343}

  • How likely it is to get values that aren't "VPN", "PROXY", "NULL" or 'other type of service' from Spur?
  • Do we build our import-data file with the idea that we could get untreatable data (such as the character one) from Spur?
  • What is the purpose of testing our import data file with data that won't be provided by Spur.

Solution:

  • Updating the value to 128 characters fixes the issue.

Cons: it allows special characters or strings to be saved despite the fact they won't be useable later on.

  • Adding an array of validated tunnels type and filtering the values before saving them into DB, leaving the need to update the tunnelTypes array as we know all the existing and acceptable types.

Cons: knowing the list of validated tunnels.type ahead and keep it updated.

Example: this will prevent a row of data with the non-pre-validated tunnel.type values to be saved.
Another version could also save the row but record "NULL" if the tunnel.type isn't a pre-validated value.

const tunnelTypes = [
    'VPN',
    'PROXY'
];

function getTunnels(tunnels) {
    // Filter for tunnels with associated operators
    return tunnels.length ?
        tunnels
            .filter(function (tunnel) {
                return !!tunnel.operator;
            })
            .filter(function (tunnel, i, array) {
                return tunnelTypes.indexOf(tunnel.type) !== -1;  
            })
            .map(function (tunnel) {
                return {
                    operator: tunnel.operator,
                    type: tunnel.type || null,
                    anonymous: tunnel.anonymous !== undefined ? tunnel.anonymous : null
                };
            }) :
        false;
}

cc @STran @dom_walden @TThoabala

AGueyte set the point value for this task to 2.Jul 24 2023, 6:51 PM

Should we allow more tunnels.type other than "VPN", "PROXY" or "NULL" to be added to the database? Should we expect other tunnels.types?

According to T338967: Evaluate how we store IP data, tunnels.type should be an enum but I couldn't find on the Spur docs the explicit list.

How likely it is to get values that aren't "VPN", "PROXY", "NULL" or 'other type of service' from Spur?

You could email support@spur.us or if you've imported enough rows into your db, you could run SELECT DISTINCT type from tunnels; and see what shows up and what we can start with. From what I have in my db, VPN and PROXY are the only values so far.

Do we build our import-data file with the idea that we could get untreatable data (such as the character one) from Spur?

Probably not if we consider it an enum.

What is the purpose of testing our import data file with data that won't be provided by Spur.

Stress testing. Presumably the data was generated at a time when we weren't as sure of the use of enums. It gives us an opportunity to evaluate how flexible we want our types to be.


As for the solution, I suggest treating tunnels.type as an enum (see T338967: Evaluate how we store IP data).

@AGueyte I used the latest commit and created the reproduction data that was in the description but I am still getting the code: 'ER_DATA_TOO_LONG'

Status: ❌ = FAIL
Environment: Ipoid SQL Visual Studio Code
OS: macOS Ventura
Browser: N/A
Device: MBP
Emulated Device:N/A
Test Links: N/A
New Ipoid commit: 4bc031990460a3c84a6fc3b3bd02589b3db312f9

❌AC1: https://phabricator.wikimedia.org/T339324

2023-08-28_08-16-12.png (570×1 px, 136 KB)
2023-08-28_08-19-00.png (1×2 px, 946 KB)
Aklapper added a subscriber: AGueyte.

Removing inactive task assignee (please do so as part of offboarding steps).

Tunnels are now a bitmask and the values tested here should no longer be accepted.