
Respond to data update and import errors
Closed, Resolved · Public · 2 Estimated Story Points

Description

Implement some way to deal with failures (incomplete DB imports). Failures are expected to happen; we shouldn't assume each day that yesterday's import went perfectly. We should investigate how to deal with this.

Instead of trying once a day, we'll run this script a little more frequently. Right now it's pretty catastrophic if an import fails for whatever reason: the script stops, rolls back the batch, and then exits the process completely. That batch and anything after it don't get imported, so the next day everything presumably goes pear-shaped, because entire chunks of IPs can be missing and the database has no concept of that.

Additionally, we didn't factor in problems outside of our control: Kubernetes (apparently) can decide to kill a job midway, connections can be lost, and so on. These can also lead to batches failing for non-code reasons.

This proposal aims to solve the latter:

  • Allow batches to move forward even if one fails
  • Run the script at a TBD rate (every few hours?) to check for imports with failed batches and attempt to rerun those batches. Batches are deterministic, and every bit of data needed to recreate them should be stored in import_status.
  • Only allow a new full import if there aren't any errors in batches

It also gives us a little flexibility to fix the emergent problems: we should still be dealing with them asap, but instead of everything going very badly within 24 hours we'll slowly fall behind, which seems less disastrous.
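
As a minimal sketch of the last bullet ("only allow a new full import if there aren't any errors in batches"), assuming a mariadb connection and that import_status uses a batch_status value of 2 for errored batches (as the tables later in this task suggest); the actual guard in the ipoid scripts may look different:

// Hedged sketch only: the numeric status (2 = error) and the query shape are
// assumptions, not the actual ipoid code.
const BATCH_STATUS_ERROR = 2;

async function canStartNewImport( conn ) {
	const rows = await conn.query(
		'SELECT COUNT(*) AS pending FROM import_status WHERE batch_status = ?',
		[ BATCH_STATUS_ERROR ]
	);
	// Only start a new full import when no batch is still in an error state.
	return Number( rows[ 0 ].pending ) === 0;
}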

Details

Title | Reference | Author | Source Branch | Dest Branch
blubber: Use "includes" instead of "copies" | repos/mediawiki/services/ipoid!194 | kharlan | main-I5e7aef67f734f234e632f05f4a2a7ebaee6bc74a | main
init-db: Don't assume that database exists | repos/mediawiki/services/ipoid!193 | kharlan | main-Ibadfffbb27d96ff9c4ef30bc1db22de6b39d5ab7 | main
update-db.js: Fix logging to import_status | repos/mediawiki/services/ipoid!187 | tchanders | fix-import-continue | main
Guard against already-run imports | repos/mediawiki/services/ipoid!172 | stran | main-If648356e2b62b2c310d9adcef7db361040c29dac | main
Recover from failed batches | repos/mediawiki/services/ipoid!171 | stran | main-I2dc633c649816aa02e79ee04f5a01b19ec456a8a | main
Allow imports to continue even if a batch has an error | repos/mediawiki/services/ipoid!170 | stran | main-I3a652b4673a21d5d1c9b95468859a2ca7b9fa3a8 | main
Guard against already-run imports | repos/mediawiki/services/ipoid!169 | stran | guard-rerunning-imports | main
Draft: Recover from failed batches | repos/mediawiki/services/ipoid!168 | stran | rerun-failed-batches | main
Draft: Allow imports to continue even if a batch has an error | repos/mediawiki/services/ipoid!167 | stran | implement-batch-err-recovery | main

Event Timeline

Tchanders renamed this task from "Investigate: Respond to data update and import errors" to "[M] Investigate: Respond to data update and import errors". Nov 1 2023, 6:48 PM

As discussed with @STran, there are broadly two ways that import can fail:

  1. Due to some bad data, one statement fails
  2. Something catastrophic happens externally and an import doesn't complete

If (1) happens, then running the whole pipeline again is very inefficient. However, if (2) happens, then it's necessary (until we find a way to store temporary files more permanently).

The changes we are making at the moment are targeted to dealing with (2).

  • We can detect whether the last run completed by checking whether the expected batch count for the last run (import_status.batch_count) equals the number of rows for the last run, regardless of the batch_status of each row (see the sketch after this list).
  • If these are equal and all batch_status are COMPLETED, then there's nothing more to do.
  • If these are equal but any batch_status is ERROR, then single IPs failed, and a re-run is not necessary. In the future we could store which IPs failed and have another pipeline that can target specific IPs. For now, we'd either need to leave it, respond manually, or do a re-run (the status quo in the current patches).
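
A minimal sketch of that completion check, assuming a mariadb connection and the numeric batch_status values used later in this task (0 = incomplete, 1 = completed, 2 = error); the real logic in import-status-utils.js may differ:

// Hedged sketch only. The numeric statuses are assumed from later comments in
// this task; import-status-utils.js is the authoritative implementation.
const BATCH_STATUS_ERROR = 2;

async function lastRunState( conn, todayFile ) {
	const rows = await conn.query(
		'SELECT batch_count, batch_status FROM import_status WHERE feed_file_today = ?',
		[ todayFile ]
	);
	if ( rows.length === 0 ) {
		return 'NOT_STARTED';
	}
	if ( rows.length < rows[ 0 ].batch_count ) {
		return 'INCOMPLETE'; // some batches were never attempted
	}
	if ( rows.some( ( row ) => row.batch_status === BATCH_STATUS_ERROR ) ) {
		return 'HAS_ERRORS'; // all batches ran, but some statements failed
	}
	return 'COMPLETED'; // nothing more to do
}
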
Dreamy_Jazz renamed this task from "[M] Investigate: Respond to data update and import errors" to "Investigate: Respond to data update and import errors". Nov 20 2023, 4:28 PM
Dreamy_Jazz set the point value for this task to 2.

Change 982364 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/982364

Change 982364 merged by jenkins-bot:

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/982364

Change 982375 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/982375

Change 982375 merged by jenkins-bot:

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/982375

Change 982386 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/982386

Change 982386 merged by jenkins-bot:

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/982386

Tchanders renamed this task from "Investigate: Respond to data update and import errors" to "Respond to data update and import errors". Jan 12 2024, 12:14 PM

https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/merge_requests/172 implements the following workflow:

[Workflow diagram attachment: image.png, 151 KB]

The script is expected to run at a more frequent cadence than daily and should reach one conclusion on every run. With more frequent runs, there should be enough attempts to recover from an error and continue that day's import before the next one begins.
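
Roughly, "one conclusion on every run" might look like the following dispatch (a sketch only: getImportStatus() and the NOT_STARTED status appear later in this task, but the other status names and the handler functions here are hypothetical placeholders):

// Hedged sketch of the per-run decision. startNewImport, rerunBatches and
// runRemainingBatches are hypothetical helpers, and every status name except
// NOT_STARTED is assumed rather than taken from import-status-utils.js.
async function runOnce( yesterday, today ) {
	const importStatus = await getImportStatus( yesterday, today );
	switch ( importStatus.status ) {
		case 'NOT_STARTED':
			await startNewImport( today );
			break;
		case 'FAILED_BATCHES':
			await rerunBatches( importStatus.retryBatches );
			break;
		case 'INCOMPLETE':
			await runRemainingBatches( importStatus );
			break;
		default:
			// Completed: nothing to do until the next feed is available.
			break;
	}
}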

Change 991416 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/deployment-charts@master] ipoid: Bump version and schedule multiple runs

https://gerrit.wikimedia.org/r/991416

Change 991416 merged by jenkins-bot:

[operations/deployment-charts@master] ipoid: Bump version and schedule multiple runs

https://gerrit.wikimedia.org/r/991416

Change 991417 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/deployment-charts@master] ipoid: Bump chart version

https://gerrit.wikimedia.org/r/991417

Change 991417 merged by jenkins-bot:

[operations/deployment-charts@master] ipoid: Bump chart version

https://gerrit.wikimedia.org/r/991417

@STran What kinds of errors are we guarding against?

I have been struggling to simulate realistic import errors. I have experimented with using timeout.

For example, in import.sh I have modified line 67:

timeout <seconds> node -e "require('./update-db.js')('$file');"

and then ran docker compose exec web ./main.sh --init true --today <date> --debug true

Experimenting with different values of <seconds> I have managed to get import_status into potentially unexpected states:

SELECT * FROM import_status;
+---------+---------------------+---------------------+-----------------------------+--------------------------------------+-------------+--------------+
| batchid | timestamp           | feed_file_yesterday | feed_file_today             | batch_file                           | batch_count | batch_status |
+---------+---------------------+---------------------+-----------------------------+--------------------------------------+-------------+--------------+
|       1 | 2024-01-24 16:31:14 |                     | /tmp/ipoid/20240109.json.gz | /tmp/ipoid/sub/query_split_aaaaq.sql |          20 |            0 |
+---------+---------------------+---------------------+-----------------------------+--------------------------------------+-------------+--------------+

and

+---------+---------------------+---------------------+-----------------------------+--------------------------------------+-------------+--------------+
| batchid | timestamp           | feed_file_yesterday | feed_file_today             | batch_file                           | batch_count | batch_status |
+---------+---------------------+---------------------+-----------------------------+--------------------------------------+-------------+--------------+
|       1 | 2024-01-24 16:36:30 |                     | /tmp/ipoid/20240109.json.gz | /tmp/ipoid/sub/query_split_aaaaq.sql |          20 |            1 |
+---------+---------------------+---------------------+-----------------------------+--------------------------------------+-------------+--------------+

This is because main.sh exited after its first call to node -e "require('./update-db.js')('$file');", and so only one batch was processed.

Afterwards, running docker compose exec web ./main.sh --today <date> --yesterday <date> --debug true, I have seen exceptions:

For batch_status=0:

{"log.level":"info","@timestamp":"2024-01-24T16:32:58.103Z","process.pid":8642,"host.hostname":"8e3a1bb052f6","ecs.version":"8.10.0","message":"Update init already run, skipping..."}
{"log.level":"info","@timestamp":"2024-01-24T16:32:58.105Z","process.pid":8642,"host.hostname":"8e3a1bb052f6","ecs.version":"8.10.0","message":"Update remove-tunnel-anonymous-property already run, skipping..."}
{"log.level":"info","@timestamp":"2024-01-24T16:32:58.105Z","process.pid":8642,"host.hostname":"8e3a1bb052f6","ecs.version":"8.10.0","message":"Database updated!"}
/srv/service/import-status-utils.js:101
		latestImportDates.feed_file_yesterday ? `${DATADIR}/${latestImportDates.feed_file_yesterday}.json.gz` : '',
		                  ^

TypeError: Cannot read properties of undefined (reading 'feed_file_yesterday')
    at getIncompleteImport (/srv/service/import-status-utils.js:101:21)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async Object.getImportStatus (/srv/service/import-status-utils.js:167:34)

and for batch_status=1

{"log.level":"info","@timestamp":"2024-01-24T16:36:48.497Z","process.pid":8830,"host.hostname":"8e3a1bb052f6","ecs.version":"8.10.0","message":"Update init already run, skipping..."}
{"log.level":"info","@timestamp":"2024-01-24T16:36:48.500Z","process.pid":8830,"host.hostname":"8e3a1bb052f6","ecs.version":"8.10.0","message":"Update remove-tunnel-anonymous-property already run, skipping..."}
{"log.level":"info","@timestamp":"2024-01-24T16:36:48.501Z","process.pid":8830,"host.hostname":"8e3a1bb052f6","ecs.version":"8.10.0","message":"Database updated!"}
{"log.level":"debug","@timestamp":"2024-01-24T16:36:48.880Z","process.pid":8819,"host.hostname":"8e3a1bb052f6","ecs.version":"8.10.0","message":"Found incomplete import. Attempting to restart import from null and 20240109."}
{"log.level":"info","@timestamp":"2024-01-24T16:36:49.160Z","process.pid":8880,"host.hostname":"8e3a1bb052f6","ecs.version":"8.10.0","message":"Feed for 20240109 already exists, skipping attempt to download."}
/srv/service/get-feed.js:54
		throw new Error( 'Invalid date passed. Date should be in yyyymmdd format.' );
		      ^

Error: Invalid date passed. Date should be in yyyymmdd format.
    at Object.init (/srv/service/get-feed.js:54:9)
    at [eval]:1:26
    at Script.runInThisContext (node:vm:129:12)
    at Object.runInThisContext (node:vm:313:38)
    at node:internal/process/execution:76:19
    at [eval]-wrapper:6:22
    at evalScript (node:internal/process/execution:75:60)
    at node:internal/main/eval_string:27:3

I don't know how realistic these types of errors are and whether we are concerned about them. Could you perhaps show me examples of real errors on production?

I have tried a few other ways to simulate batches failing.

SQL

In SQL, I ran the query: SET GLOBAL wait_timeout=1;

On calling main.sh I got:

...
{"log.level":"info","@timestamp":"2024-01-25T10:24:31.979Z","process.pid":11772,"host.hostname":"bdbf42f4715d","ecs.version":"8.10.0","message":"Importing /tmp/ipoid/sub/query_split_aaaaq.sql..."}
{"log.level":"info","@timestamp":"2024-01-25T10:24:35.172Z","process.pid":11823,"host.hostname":"bdbf42f4715d","ecs.version":"8.10.0","message":"All updates complete"}
/srv/service/node_modules/mariadb/lib/misc/errors.js:64
  return new SqlError(msg, sql, fatal, info, sqlState, errno, additionalStack, addHeader);
         ^

SqlError: Cannot execute new commands: connection closed
sql: UPDATE import_status SET batch_status = ? WHERE batchid = ? - parameters:[2,1]
    at Object.module.exports.createError (/srv/service/node_modules/mariadb/lib/misc/errors.js:64:10)
    at Query.throwUnexpectedError (/srv/service/node_modules/mariadb/lib/cmd/command.js:39:24)
    at Query.throwNewError (/srv/service/node_modules/mariadb/lib/cmd/command.js:69:22)
    at Connection.addCommandDisabled (/srv/service/node_modules/mariadb/lib/connection.js:1104:21)
    at Connection.query (/srv/service/node_modules/mariadb/lib/connection.js:1349:10)
    at new Promise (<anonymous>)
    at ConnectionPromise.query (/srv/service/node_modules/mariadb/lib/connection-promise.js:93:12)
    at /srv/service/update-db.js:95:29
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5) {
  sqlMessage: 'Cannot execute new commands: connection closed',
  sql: 'UPDATE import_status SET batch_status = ? WHERE batchid = ? - parameters:[2,1]',
  fatal: true,
  errno: 45013,
  sqlState: '08S01',
  code: 'ER_CMD_CONNECTION_CLOSED'
}

actor_data had the correct number of rows for that one batch, but:

SELECT * FROM import_status;
+---------+---------------------+---------------------+-----------------------------+--------------------------------------+-------------+--------------+
| batchid | timestamp           | feed_file_yesterday | feed_file_today             | batch_file                           | batch_count | batch_status |
+---------+---------------------+---------------------+-----------------------------+--------------------------------------+-------------+--------------+
|       1 | 2024-01-25 13:36:02 |                     | /tmp/ipoid/20240109.json.gz | /tmp/ipoid/sub/query_split_aaaaq.sql |          20 |            0 |
+---------+---------------------+---------------------+-----------------------------+--------------------------------------+-------------+--------------+
NodeJS

In the call to createPool in update-db.js, I added the parameter connectTimeout: 1 and got:

...
{"log.level":"info","@timestamp":"2024-01-25T11:58:32.194Z","process.pid":390,"host.hostname":"d5b18bbf7a0d","ecs.version":"8.10.0","message":"Importing /tmp/ipoid/sub/query_split_aaaaa.sql..."}
/srv/service/node_modules/mariadb/lib/misc/errors.js:64
  return new SqlError(msg, sql, fatal, info, sqlState, errno, additionalStack, addHeader);
         ^

SqlError: (conn=-1, no: 45028, SQLState: HY000) retrieve connection from pool timeout after 10002ms
    (pool connections: active=0 idle=0 limit=10)
    at Object.module.exports.createError (/srv/service/node_modules/mariadb/lib/misc/errors.js:64:10)
    at Pool._requestTimeoutHandler (/srv/service/node_modules/mariadb/lib/pool.js:349:26)
    at listOnTimeout (node:internal/timers:559:17)
    at processTimers (node:internal/timers:502:7) {
  sqlMessage: 'retrieve connection from pool timeout after 10002ms\n' +
    '    (pool connections: active=0 idle=0 limit=10)',
  sql: null,
  fatal: false,
  errno: 45028,
  sqlState: 'HY000',
  code: 'ER_GET_CONNECTION_TIMEOUT'
}

Nothing in actor_data or import_status.

Instead, I added the parameter socketTimeout: 1 and got:

...
{"log.level":"info","@timestamp":"2024-01-25T12:52:53.647Z","process.pid":1446,"host.hostname":"d5b18bbf7a0d","ecs.version":"8.10.0","message":"Importing /tmp/ipoid/sub/query_split_aaaaa.sql..."}
/srv/service/node_modules/mariadb/lib/misc/errors.js:88
  return new SqlError(msg, sql, true, info, sqlState, errno, additionalStack, addHeader);
         ^

SqlError: (conn=277, no: 45026, SQLState: 08S01) socket timeout
    at Object.module.exports.createFatalError (/srv/service/node_modules/mariadb/lib/misc/errors.js:88:10)
    at Connection.socketTimeoutReached (/srv/service/node_modules/mariadb/lib/connection.js:1042:24)
    at Object.onceWrapper (node:events:627:28)
    at Socket.emit (node:events:513:28)
    at Socket._onTimeout (node:net:526:8)
    at listOnTimeout (node:internal/timers:559:17)
    at processTimers (node:internal/timers:502:7) {
  sqlMessage: 'socket timeout',
  sql: null,
  fatal: true,
  errno: 45026,
  sqlState: '08S01',
  code: 'ER_SOCKET_TIMEOUT'
}

This time there was nothing in actor_data, but import_status contained:

+---------+---------------------+---------------------+-----------------------------+--------------------------------------+-------------+--------------+
| batchid | timestamp           | feed_file_yesterday | feed_file_today             | batch_file                           | batch_count | batch_status |
+---------+---------------------+---------------------+-----------------------------+--------------------------------------+-------------+--------------+
|       1 | 2024-01-25 12:52:53 |                     | /tmp/ipoid/20240110.json.gz | /tmp/ipoid/sub/query_split_aaaaa.sql |           1 |            0 |
+---------+---------------------+---------------------+-----------------------------+--------------------------------------+-------------+--------------+
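
For reference, a hedged sketch of how these timeout parameters were injected into the pool; everything except connectTimeout/socketTimeout is a placeholder, since the real createPool options in update-db.js aren't shown here:

// Hedged sketch only: placeholder connection settings around the two timeout
// options that were used to simulate failures.
const mariadb = require( 'mariadb' );

const pool = mariadb.createPool( {
	host: process.env.DB_HOST,
	user: process.env.DB_USER,
	password: process.env.DB_PASSWORD,
	database: process.env.DB_NAME,
	// Tiny values (milliseconds) used only to force the errors shown above:
	connectTimeout: 1, // produced ER_GET_CONNECTION_TIMEOUT
	socketTimeout: 1 // produced ER_SOCKET_TIMEOUT
} );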

I have also been doing successful imports, and then changing rows in import_status to simulate failed or incomplete batches.

If yesterday's import was from a clean DB, I get exceptions like the ones above (T344941#9485893) when it retries yesterday's failed batches.

If yesterday's import has a mix of failed (batch_status=2) and incomplete (batch_status=0) batches, when you update it will only retry the failed batches. You then need to run the update again to retry the incomplete batches.

If there are failed batches for both yesterday and today, it will retry for both days but I think it gets the batch for today wrong. For example:

SELECT * FROM import_status;
+---------+---------------------+-----------------------------+-----------------------------+--------------------------------------+-------------+--------------+
| batchid | timestamp           | feed_file_yesterday         | feed_file_today             | batch_file                           | batch_count | batch_status |
+---------+---------------------+-----------------------------+-----------------------------+--------------------------------------+-------------+--------------+
|       3 | 2024-01-25 13:54:47 | /tmp/ipoid/20240109.json.gz | /tmp/ipoid/20240110.json.gz | /tmp/ipoid/sub/query_split_aaaab.sql |           2 |            2 |
|       4 | 2024-01-25 13:54:48 | /tmp/ipoid/20240109.json.gz | /tmp/ipoid/20240110.json.gz | /tmp/ipoid/sub/query_split_aaaaa.sql |           2 |            1 |
|       5 | 2024-01-25 13:55:27 | /tmp/ipoid/20240110.json.gz | /tmp/ipoid/20240111.json.gz | /tmp/ipoid/sub/query_split_aaaab.sql |           2 |            2 |
|       6 | 2024-01-25 13:55:29 | /tmp/ipoid/20240110.json.gz | /tmp/ipoid/20240111.json.gz | /tmp/ipoid/sub/query_split_aaaaa.sql |           2 |            1 |
+---------+---------------------+-----------------------------+-----------------------------+--------------------------------------+-------------+--------------+

After re-running today's update:

+---------+---------------------+-----------------------------+-----------------------------+--------------------------------------+-------------+--------------+
| batchid | timestamp           | feed_file_yesterday         | feed_file_today             | batch_file                           | batch_count | batch_status |
+---------+---------------------+-----------------------------+-----------------------------+--------------------------------------+-------------+--------------+
|       3 | 2024-01-25 13:54:47 | /tmp/ipoid/20240109.json.gz | /tmp/ipoid/20240110.json.gz | /tmp/ipoid/sub/query_split_aaaab.sql |           2 |            3 |
|       4 | 2024-01-25 13:54:48 | /tmp/ipoid/20240109.json.gz | /tmp/ipoid/20240110.json.gz | /tmp/ipoid/sub/query_split_aaaaa.sql |           2 |            1 |
|       5 | 2024-01-25 13:55:27 | /tmp/ipoid/20240110.json.gz | /tmp/ipoid/20240111.json.gz | /tmp/ipoid/sub/query_split_aaaab.sql |           2 |            2 |
|       6 | 2024-01-25 13:55:29 | /tmp/ipoid/20240110.json.gz | /tmp/ipoid/20240111.json.gz | /tmp/ipoid/sub/query_split_aaaaa.sql |           2 |            1 |
|       7 | 2024-01-25 13:56:41 | /tmp/ipoid/20240109.json.gz | /tmp/ipoid/20240110.json.gz | /tmp/ipoid/sub/query_split_aaaab.sql |           2 |            1 |
|       8 | 2024-01-25 13:56:42 | /tmp/ipoid/20240109.json.gz | /tmp/ipoid/20240110.json.gz | /tmp/ipoid/sub/query_split_aaaab.sql |           2 |            1 |
+---------+---------------------+-----------------------------+-----------------------------+--------------------------------------+-------------+--------------+

Note that batchid=5 has not had its batch_status changed to 3. Both batchid 7 and 8 have the same values for feed_file_yesterday and feed_file_today. I am not sure what state the data in the database is in.

Would there ever be a case where yesterday's batches have failed but today's have succeeded? When you re-run the update for today it will retry yesterday's failed batches but not re-apply today's updates, leaving the data in an inconsistent state.

Batches are deterministic...

You do seem to need to make sure the batchsize parameter you used today is the same as yesterday. Perhaps we shouldn't make it a parameter.

The script is expected to run at a more frequent cadence than daily and should reach one conclusion on every run. With more frequent runs, that should be enough tries to recover from an error and continue that day's import before the next one.

What if yesterday's retries keep failing and we go into tomorrow with no rows in import_status for today (we haven't imported any data for today)?

What kinds of errors are we guarding against?

This script is less focused on guarding against errors (stopping them from ever happening) and more concerned with graceful recovery/preventing compounding errors. In the case of an error, it should stop importing new data and re-attempt old dates until it's successful. For instance:

  • If there are batches with errors, it should re-attempt those batches and only those batches until they succeed. This means if something fails consistently (e.g. we didn't guard against bad data), it should continue to attempt and fail, and not import anything new, until an engineer can resolve the bad data handling.
  • If the import is incomplete for whatever reason, it should attempt to import the batches that were never attempted and continue to do that until the import is considered complete (the number of batches with completed status equals the expected total). See the sketch after this list.
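
Since batches are deterministic and import_status stores a batch_file per batch, re-attempting only the errored batches might look roughly like this (a sketch under those assumptions; runBatchFile is a hypothetical stand-in for whatever update-db.js does with one query_split_*.sql file):

// Hedged sketch: assumes batch_status = 2 marks an errored batch, as in the
// tables earlier in this task, and that re-running the stored batch_file is
// enough to retry that batch.
const BATCH_STATUS_ERROR = 2;

async function retryErroredBatches( conn, yesterdayFile, todayFile ) {
	const failed = await conn.query(
		'SELECT batchid, batch_file FROM import_status ' +
		'WHERE feed_file_yesterday = ? AND feed_file_today = ? AND batch_status = ?',
		[ yesterdayFile, todayFile, BATCH_STATUS_ERROR ]
	);
	for ( const batch of failed ) {
		// Hypothetical helper: apply one stored batch file and update its status.
		await runBatchFile( conn, batch.batch_file, batch.batchid );
	}
}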

I don't know how realistic these types of errors are and whether we are concerned about them.

Ideally, they're states the script should recover from regardless. If you run a regular import afterwards, without any timeouts, the script should successfully recover/eventually finish importing.


Could you perhaps show me examples of real errors on production?

A real error we had was the connection dropping midway through an import. In this case, a lot of batches had been completed and no batch errors were logged. The script was expected to (and eventually did) only import the remaining batches.

A potential error we could have is an error happening on import, e.g. we didn't guard against a value coming in. The expected state would be a batch with the error status while the other batches have completed. The script should retry the errored batch.

The two could be combined, since disconnects are independent of bad data; in that case we'd see both the error status and a missing number of batches.


If yesterday's import has a mix of failed (batch_status=2) and incomplete (batch_status=0) batches, when you update it will only retry the failed batches. You then need to run the update again to retry the incomplete batches.

This is expected behavior. The script can only determine one state at a time, and incomplete is considered a different state from errors.


If there are failed batches for both yesterday and today, it will retry for both days but I think it gets the batch for today wrong. For example:

This should never happen naturally. The script should always prevent additional imports if there are errors or incompletes to resolve first.


You do seem to need to make sure the batchsize parameter you used today is the same as yesterday. Perhaps we shouldn't make it a parameter.

I don't think this is a mandatory parameter for you to enter if you use main.sh as the entrypoint. It should default to 10000, which is what we expect in production, and that doesn't change. Smaller batch sizes are very useful for development testing, and it's documented in the README as a development-only parameter, although there's nothing enforcing that in the code.


What if yesterday's retries keep failing and we go into tomorrow with no rows in import_status for today (we haven't imported any data for today)?

It should stay stopped until it's resolved. After it's resolved, it'll use the current day's date as the today date and diff accordingly. The script doesn't care if multiple days are skipped, and from a product perspective we also prefer to be stale for a few days while the import is resolved, then update to the most up-to-date data whenever the database is healthy again.

I don't think I can spend much more time on this. I think I have made some progress in simulating import errors and I have some observations below.

Adding a random socketTimeout to the pool object in update-db.js, I have been able to get errors on import of the form:

{"log.level":"info","@timestamp":"2024-02-09T14:02:19.215Z","process.pid":116297,"host.hostname":"4a2e0da12c3c","ecs.version":"8.10.0","message":"Importing /tmp/ipoid/sub/query_split_aaaag.sql..."}
{"@timestamp":"2024-02-09T14:02:20.097Z","ecs.version":"8.10.0","log.level":"info","message":"All updates complete"}
/srv/service/node_modules/mariadb/lib/misc/errors.js:64
  return new SqlError(msg, sql, fatal, info, sqlState, errno, additionalStack, addHeader);
         ^

SqlError: Cannot execute new commands: connection closed
sql: UPDATE import_status SET batch_status = ? WHERE batchid = ? - parameters:[2,2]
    at Object.module.exports.createError (/srv/service/node_modules/mariadb/lib/misc/errors.js:64:10)
    at Query.throwUnexpectedError (/srv/service/node_modules/mariadb/lib/cmd/command.js:39:24)
    at Query.throwNewError (/srv/service/node_modules/mariadb/lib/cmd/command.js:69:22)
    at Connection.addCommandDisabled (/srv/service/node_modules/mariadb/lib/connection.js:1104:21)
    at Connection.query (/srv/service/node_modules/mariadb/lib/connection.js:1349:10)
    at new Promise (<anonymous>)
    at ConnectionPromise.query (/srv/service/node_modules/mariadb/lib/connection-promise.js:93:12)
    at /srv/service/update-db.js:164:29
    at processTicksAndRejections (node:internal/process/task_queues:96:5) {
  sqlMessage: 'Cannot execute new commands: connection closed',
  sql: 'UPDATE import_status SET batch_status = ? WHERE batchid = ? - parameters:[2,2]',
  fatal: true,
  errno: 45013,
  sqlState: '08S01',
  code: 'ER_CMD_CONNECTION_CLOSED'
}

Then, on subsequent retries of the import, I see lots of errors of the form:

{"@timestamp":"2024-02-09T14:02:26.270Z","ecs.version":"8.10.0","log.level":"info","message":"Duplicate entry found for ip, attempting to delete and re-insert"}

and

{"@timestamp":"2024-02-09T14:02:32.487Z","ecs.version":"8.10.0","log.level":"error","message":{"batchFile":"/tmp/ipoid/sub/query_split_aaaag.sql","err":{"code":"ER_DUP_ENTRY","errno":1062,"fatal":false,"name":"SqlError","sql":"INSERT INTO actor_data_proxies (actor_data_id, proxy_id) VALUES( (SELECT pkid FROM actor_data WHERE ip = 'x.x.x.x'), (SELECT pkid FROM proxies WHERE proxy = 'FOO_PROXY' ) ); - parameters:[]","sqlMessage":"Duplicate entry '10679-1' for key 'PRIMARY'","sqlState":"23000"},"statement":"'INSERT INTO actor_data_proxies (actor_data_id, proxy_id) VALUES( (SELECT pkid FROM actor_data WHERE ip = \\'x.x.x.x\\'), (SELECT pkid FROM proxies WHERE proxy = \\'FOO_PROXY\\' ) );'","todayFile":"/tmp/ipoid/20240102.json.gz","yesterdayFile":"/tmp/ipoid/20240101.json.gz"}}

I am guessing this means the rollback() did not work, although I don't see it throw any errors or messages to that effect. It is difficult to tell the state of the data after this. It is possible this particular issue will be fixed with T356736, and I may do more investigation once that task is complete.

In the case that yesterday has only incomplete (batch_status=0) imports, require('./import-status-utils.js').getImportStatus() returns NOT_STARTED.

This does not appear to lead to any negative consequences when you run an update. The batches for yesterday appear to just be ignored and stay in the database with batch_status=0. However, I wonder if this behaviour will be confusing for future maintainers.

For example, with import_status:

+---------+---------------------+-----------------------------+-----------------------------+--------------------------------------+-------------+--------------+
| batchid | timestamp           | feed_file_yesterday         | feed_file_today             | batch_file                           | batch_count | batch_status |
+---------+---------------------+-----------------------------+-----------------------------+--------------------------------------+-------------+--------------+
|       1 | 2024-02-12 07:41:01 |                             | /tmp/ipoid/20240101.json.gz | /tmp/ipoid/sub/query_split_aaaaa.sql |           1 |            1 |
|       2 | 2024-02-12 07:41:13 | /tmp/ipoid/20240101.json.gz | /tmp/ipoid/20240102.json.gz | /tmp/ipoid/sub/query_split_aaaaa.sql |           1 |            0 |
+---------+---------------------+-----------------------------+-----------------------------+--------------------------------------+-------------+--------------+

Running docker-compose exec web node -e "require('./import-status-utils.js').getImportStatus('20240102', '20240103');" returns:

{"status":"NOT_STARTED","yesterday":"20240102","today":"20240103","retryBatches":null,"completedBatches":null}

Latest ipoid commit tested: ad79ee0c90b9c088d7df3242b567e7633b07d2ff