
Respond to data update and import errors
Closed, Resolved · Public · 2 Estimated Story Points

Description

Implement some way to deal with failures (incomplete DB imports). Failures are expected to happen; we shouldn't assume each day that yesterday's import went perfectly. We should investigate how to deal with this.

Instead of trying once a day, we'll run this script a little more frequently. Right now it's pretty catastrophic if an import fails for whatever reason: the script stops, rolls back the batch, and then exits the process completely. That batch and anything after it don't get imported, so the next day everything presumably goes pear-shaped, because entire chunks of IPs can be missing and the database has no concept of that.

Additionally, we didn't factor in problems outside of our control: Kubernetes (apparently) can decide to kill a job midway, connections can be lost, and so on. These can also lead to batches failing for non-code reasons.

This proposal aims to solve the latter:

  • Allow batches to move forward even if one fails
  • Run the script at a TBD rate (every few hours?) to check for imports with failed batches and attempt to rerun those batches. Batches are deterministic, and every bit of data needed to recreate them should be stored in import_status.
  • Only allow a new full import if there aren't any errors in batches

It also gives us a little flexibility to fix the emergent problems: we should still be dealing with them asap, but instead of everything going very badly within 24 hours we'll slowly fall behind, which seems less disastrous.
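
As a minimal sketch of the last bullet ("only allow a new full import if there aren't any errors in batches"), assuming a mariadb connection and that import_status uses a batch_status value of 2 for errored batches (as the tables later in this task suggest); the actual guard in the ipoid scripts may look different:

// Hedged sketch only: the numeric status (2 = error) and the query shape are
// assumptions, not the actual ipoid code.
const BATCH_STATUS_ERROR = 2;

async function canStartNewImport( conn ) {
	const rows = await conn.query(
		'SELECT COUNT(*) AS pending FROM import_status WHERE batch_status = ?',
		[ BATCH_STATUS_ERROR ]
	);
	// Only start a new full import when no batch is still in an error state.
	return Number( rows[ 0 ].pending ) === 0;
}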

Details

Title | Reference | Author | Source Branch | Dest Branch
blubber: Use "includes" instead of "copies" | repos/mediawiki/services/ipoid!194 | kharlan | main-I5e7aef67f734f234e632f05f4a2a7ebaee6bc74a | main
init-db: Don't assume that database exists | repos/mediawiki/services/ipoid!193 | kharlan | main-Ibadfffbb27d96ff9c4ef30bc1db22de6b39d5ab7 | main
update-db.js: Fix logging to import_status | repos/mediawiki/services/ipoid!187 | tchanders | fix-import-continue | main
Guard against already-run imports | repos/mediawiki/services/ipoid!172 | stran | main-If648356e2b62b2c310d9adcef7db361040c29dac | main
Recover from failed batches | repos/mediawiki/services/ipoid!171 | stran | main-I2dc633c649816aa02e79ee04f5a01b19ec456a8a | main
Allow imports to continue even if a batch has an error | repos/mediawiki/services/ipoid!170 | stran | main-I3a652b4673a21d5d1c9b95468859a2ca7b9fa3a8 | main
Guard against already-run imports | repos/mediawiki/services/ipoid!169 | stran | guard-rerunning-imports | main
Draft: Recover from failed batches | repos/mediawiki/services/ipoid!168 | stran | rerun-failed-batches | main
Draft: Allow imports to continue even if a batch has an error | repos/mediawiki/services/ipoid!167 | stran | implement-batch-err-recovery | main

Event Timeline

Tchanders renamed this task from "Investigate: Respond to data update and import errors" to "[M] Investigate: Respond to data update and import errors". Nov 1 2023, 6:48 PM

As discussed with @STran, there are broadly two ways that import can fail:

  1. Due to some bad data, one statement fails
  2. Something catastrophic happens externally and an import doesn't complete

If (1) happens, then running the whole pipeline again is very inefficient. However, if (2) happens, then it's necessary (until we find a way to store temporary files more permanently).

The changes we are making at the moment are targeted to dealing with (2).

  • We can detect whether the last run completed by checking whether the expected batch count for the last run (import_status.batch_count) equals the number of rows for the last run, regardless of the batch_status of each row (see the sketch after this list).
  • If these are equal and all batch_status are COMPLETED, then there's nothing more to do.
  • If these are equal but any batch_status is ERROR, then single IPs failed, and a re-run is not necessary. In the future we could store which IPs failed and have another pipeline that can target specific IPs. For now, we'd either need to leave it, respond manually, or do a re-run (the status quo in the current patches).
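
A minimal sketch of that completion check, assuming a mariadb connection and the numeric batch_status values used later in this task (0 = incomplete, 1 = completed, 2 = error); the real logic in import-status-utils.js may differ:

// Hedged sketch only. The numeric statuses are assumed from later comments in
// this task; import-status-utils.js is the authoritative implementation.
const BATCH_STATUS_ERROR = 2;

async function lastRunState( conn, todayFile ) {
	const rows = await conn.query(
		'SELECT batch_count, batch_status FROM import_status WHERE feed_file_today = ?',
		[ todayFile ]
	);
	if ( rows.length === 0 ) {
		return 'NOT_STARTED';
	}
	if ( rows.length < rows[ 0 ].batch_count ) {
		return 'INCOMPLETE'; // some batches were never attempted
	}
	if ( rows.some( ( row ) => row.batch_status === BATCH_STATUS_ERROR ) ) {
		return 'HAS_ERRORS'; // all batches ran, but some statements failed
	}
	return 'COMPLETED'; // nothing more to do
}
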
Dreamy_Jazz renamed this task from "[M] Investigate: Respond to data update and import errors" to "Investigate: Respond to data update and import errors". Nov 20 2023, 4:28 PM
Dreamy_Jazz set the point value for this task to 2.

Change 982364 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/982364

Change 982364 merged by jenkins-bot:

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/982364

Change 982375 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/982375

Change 982375 merged by jenkins-bot:

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/982375

Change 982386 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/982386

Change 982386 merged by jenkins-bot:

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/982386

Tchanders renamed this task from "Investigate: Respond to data update and import errors" to "Respond to data update and import errors". Jan 12 2024, 12:14 PM

https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/merge_requests/172 implements the following workflow:

[Workflow diagram attachment: image.png, 151 KB]

The script is expected to run at a more frequent cadence than daily and should reach one conclusion on every run. With more frequent runs, there should be enough attempts to recover from an error and continue that day's import before the next one begins.
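
Roughly, "one conclusion on every run" might look like the following dispatch (a sketch only: getImportStatus() and the NOT_STARTED status appear later in this task, but the other status names and the handler functions here are hypothetical placeholders):

// Hedged sketch of the per-run decision. startNewImport, rerunBatches and
// runRemainingBatches are hypothetical helpers, and every status name except
// NOT_STARTED is assumed rather than taken from import-status-utils.js.
async function runOnce( yesterday, today ) {
	const importStatus = await getImportStatus( yesterday, today );
	switch ( importStatus.status ) {
		case 'NOT_STARTED':
			await startNewImport( today );
			break;
		case 'FAILED_BATCHES':
			await rerunBatches( importStatus.retryBatches );
			break;
		case 'INCOMPLETE':
			await runRemainingBatches( importStatus );
			break;
		default:
			// Completed: nothing to do until the next feed is available.
			break;
	}
}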

Change 991416 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/deployment-charts@master] ipoid: Bump version and schedule multiple runs

https://gerrit.wikimedia.org/r/991416

Change 991416 merged by jenkins-bot:

[operations/deployment-charts@master] ipoid: Bump version and schedule multiple runs

https://gerrit.wikimedia.org/r/991416

Change 991417 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/deployment-charts@master] ipoid: Bump chart version

https://gerrit.wikimedia.org/r/991417

Change 991417 merged by jenkins-bot:

[operations/deployment-charts@master] ipoid: Bump chart version

https://gerrit.wikimedia.org/r/991417

@STran What kinds of errors are we guarding against?

I have been struggling to simulate realistic import errors. I have experimented with using timeout.

For example, in import.sh I have modified line 67:

timeout <seconds> node -e "require('./update-db.js')('$file');"

and then ran docker compose exec web ./main.sh --init true --today <date> --debug true

Experimenting with different values of <seconds> I have managed to get import_status into potentially unexpected states:

SELECT * FROM import_status;
+---------+---------------------+---------------------+-----------------------------+--------------------------------------+-------------+--------------+
| batchid | timestamp           | feed_file_yesterday | feed_file_today             | batch_file                           | batch_count | batch_status |
+---------+---------------------+---------------------+-----------------------------+--------------------------------------+-------------+--------------+
|       1 | 2024-01-24 16:31:14 |                     | /tmp/ipoid/20240109.json.gz | /tmp/ipoid/sub/query_split_aaaaq.sql |          20 |            0 |
+---------+---------------------+---------------------+-----------------------------+--------------------------------------+-------------+--------------+

and

+---------+---------------------+---------------------+-----------------------------+--------------------------------------+-------------+--------------+
| batchid | timestamp           | feed_file_yesterday | feed_file_today             | batch_file                           | batch_count | batch_status |
+---------+---------------------+---------------------+-----------------------------+--------------------------------------+-------------+--------------+
|       1 | 2024-01-24 16:36:30 |                     | /tmp/ipoid/20240109.json.gz | /tmp/ipoid/sub/query_split_aaaaq.sql |          20 |            1 |
+---------+---------------------+---------------------+-----------------------------+--------------------------------------+-------------+--------------+

This is because main.sh exited after its first call to node -e "require('./update-db.js')('$file');", and so only one batch was processed.

Afterwards, running docker compose exec web ./main.sh --today <date> --yesterday <date> --debug true, I have seen exceptions:

For batch_status=0:

{"log.level":"info","@timestamp":"2024-01-24T16:32:58.103Z","process.pid":8642,"host.hostname":"8e3a1bb052f6","ecs.version":"8.10.0","message":"Update init already run, skipping..."}
{"log.level":"info","@timestamp":"2024-01-24T16:32:58.105Z","process.pid":8642,"host.hostname":"8e3a1bb052f6","ecs.version":"8.10.0","message":"Update remove-tunnel-anonymous-property already run, skipping..."}
{"log.level":"info","@timestamp":"2024-01-24T16:32:58.105Z","process.pid":8642,"host.hostname":"8e3a1bb052f6","ecs.version":"8.10.0","message":"Database updated!"}
/srv/service/import-status-utils.js:101
		latestImportDates.feed_file_yesterday ? `${DATADIR}/${latestImportDates.feed_file_yesterday}.json.gz` : '',
		                  ^

TypeError: Cannot read properties of undefined (reading 'feed_file_yesterday')
    at getIncompleteImport (/srv/service/import-status-utils.js:101:21)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async Object.getImportStatus (/srv/service/import-status-utils.js:167:34)

and for batch_status=1

{"log.level":"info","@timestamp":"2024-01-24T16:36:48.497Z","process.pid":8830,"host.hostname":"8e3a1bb052f6","ecs.version":"8.10.0","message":"Update init already run, skipping..."}
{"log.level":"info","@timestamp":"2024-01-24T16:36:48.500Z","process.pid":8830,"host.hostname":"8e3a1bb052f6","ecs.version":"8.10.0","message":"Update remove-tunnel-anonymous-property already run, skipping..."}
{"log.level":"info","@timestamp":"2024-01-24T16:36:48.501Z","process.pid":8830,"host.hostname":"8e3a1bb052f6","ecs.version":"8.10.0","message":"Database updated!"}
{"log.level":"debug","@timestamp":"2024-01-24T16:36:48.880Z","process.pid":8819,"host.hostname":"8e3a1bb052f6","ecs.version":"8.10.0","message":"Found incomplete import. Attempting to restart import from null and 20240109."}
{"log.level":"info","@timestamp":"2024-01-24T16:36:49.160Z","process.pid":8880,"host.hostname":"8e3a1bb052f6","ecs.version":"8.10.0","message":"Feed for 20240109 already exists, skipping attempt to download."}
/srv/service/get-feed.js:54
		throw new Error( 'Invalid date passed. Date should be in yyyymmdd format.' );
		      ^

Error: Invalid date passed. Date should be in yyyymmdd format.
    at Object.init (/srv/service/get-feed.js:54:9)
    at [eval]:1:26
    at Script.runInThisContext (node:vm:129:12)
    at Object.runInThisContext (node:vm:313:38)
    at node:internal/process/execution:76:19
    at [eval]-wrapper:6:22
    at evalScript (node:internal/process/execution:75:60)
    at node:internal/main/eval_string:27:3

I don't know how realistic these types of errors are and whether we are concerned about them. Could you perhaps show me examples of real errors on production?

I have tried a few other ways to simulate batches failing.

SQL

In SQL, I ran the query: SET GLOBAL wait_timeout=1;

On calling main.sh I got:

...
{"log.level":"info","@timestamp":"2024-01-25T10:24:31.979Z","process.pid":11772,"host.hostname":"bdbf42f4715d","ecs.version":"8.10.0","message":"Importing /tmp/ipoid/sub/query_split_aaaaq.sql..."}
{"log.level":"info","@timestamp":"2024-01-25T10:24:35.172Z","process.pid":11823,"host.hostname":"bdbf42f4715d","ecs.version":"8.10.0","message":"All updates complete"}
/srv/service/node_modules/mariadb/lib/misc/errors.js:64
  return new SqlError(msg, sql, fatal, info, sqlState, errno, additionalStack, addHeader);
         ^

SqlError: Cannot execute new commands: connection closed
sql: UPDATE import_status SET batch_status = ? WHERE batchid = ? - parameters:[2,1]
    at Object.module.exports.createError (/srv/service/node_modules/mariadb/lib/misc/errors.js:64:10)
    at Query.throwUnexpectedError (/srv/service/node_modules/mariadb/lib/cmd/command.js:39:24)
    at Query.throwNewError (/srv/service/node_modules/mariadb/lib/cmd/command.js:69:22)
    at Connection.addCommandDisabled (/srv/service/node_modules/mariadb/lib/connection.js:1104:21)
    at Connection.query (/srv/service/node_modules/mariadb/lib/connection.js:1349:10)
    at new Promise (<anonymous>)
    at ConnectionPromise.query (/srv/service/node_modules/mariadb/lib/connection-promise.js:93:12)
    at /srv/service/update-db.js:95:29
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5) {
  sqlMessage: 'Cannot execute new commands: connection closed',
  sql: 'UPDATE import_status SET batch_status = ? WHERE batchid = ? - parameters:[2,1]',
  fatal: true,
  errno: 45013,
  sqlState: '08S01',
  code: 'ER_CMD_CONNECTION_CLOSED'
}

actor_data had the correct number of rows for that one batch, but:

SELECT * FROM import_status;
+---------+---------------------+---------------------+-----------------------------+--------------------------------------+-------------+--------------+
| batchid | timestamp           | feed_file_yesterday | feed_file_today             | batch_file                           | batch_count | batch_status |
+---------+---------------------+---------------------+-----------------------------+--------------------------------------+-------------+--------------+
|       1 | 2024-01-25 13:36:02 |                     | /tmp/ipoid/20240109.json.gz | /tmp/ipoid/sub/query_split_aaaaq.sql |          20 |            0 |
+---------+---------------------+---------------------+-----------------------------+--------------------------------------+-------------+--------------+
NodeJS

In the call to createPool in update-db.js, I added the parameter connectTimeout: 1 and got:

...
{"log.level":"info","@timestamp":"2024-01-25T11:58:32.194Z","process.pid":390,"host.hostname":"d5b18bbf7a0d","ecs.version":"8.10.0","message":"Importing /tmp/ipoid/sub/query_split_aaaaa.sql..."}
/srv/service/node_modules/mariadb/lib/misc/errors.js:64
  return new SqlError(msg, sql, fatal, info, sqlState, errno, additionalStack, addHeader);
         ^

SqlError: (conn=-1, no: 45028, SQLState: HY000) retrieve connection from pool timeout after 10002ms
    (pool connections: active=0 idle=0 limit=10)
    at Object.module.exports.createError (/srv/service/node_modules/mariadb/lib/misc/errors.js:64:10)
    at Pool._requestTimeoutHandler (/srv/service/node_modules/mariadb/lib/pool.js:349:26)
    at listOnTimeout (node:internal/timers:559:17)
    at processTimers (node:internal/timers:502:7) {
  sqlMessage: 'retrieve connection from pool timeout after 10002ms\n' +
    '    (pool connections: active=0 idle=0 limit=10)',
  sql: null,
  fatal: false,
  errno: 45028,
  sqlState: 'HY000',
  code: 'ER_GET_CONNECTION_TIMEOUT'
}

Nothing in actor_data or import_status.

Instead, I added the parameter socketTimeout: 1 and got:

...
{"log.level":"info","@timestamp":"2024-01-25T12:52:53.647Z","process.pid":1446,"host.hostname":"d5b18bbf7a0d","ecs.version":"8.10.0","message":"Importing /tmp/ipoid/sub/query_split_aaaaa.sql..."}
/srv/service/node_modules/mariadb/lib/misc/errors.js:88
  return new SqlError(msg, sql, true, info, sqlState, errno, additionalStack, addHeader);
         ^

SqlError: (conn=277, no: 45026, SQLState: 08S01) socket timeout
    at Object.module.exports.createFatalError (/srv/service/node_modules/mariadb/lib/misc/errors.js:88:10)
    at Connection.socketTimeoutReached (/srv/service/node_modules/mariadb/lib/connection.js:1042:24)
    at Object.onceWrapper (node:events:627:28)
    at Socket.emit (node:events:513:28)
    at Socket._onTimeout (node:net:526:8)
    at listOnTimeout (node:internal/timers:559:17)
    at processTimers (node:internal/timers:502:7) {
  sqlMessage: 'socket timeout',
  sql: null,
  fatal: true,
  errno: 45026,
  sqlState: '08S01',
  code: 'ER_SOCKET_TIMEOUT'
}

This time there was nothing in actor_data, but import_status contained:

+---------+---------------------+---------------------+-----------------------------+--------------------------------------+-------------+--------------+
| batchid | timestamp           | feed_file_yesterday | feed_file_today             | batch_file                           | batch_count | batch_status |
+---------+---------------------+---------------------+-----------------------------+--------------------------------------+-------------+--------------+
|       1 | 2024-01-25 12:52:53 |                     | /tmp/ipoid/20240110.json.gz | /tmp/ipoid/sub/query_split_aaaaa.sql |           1 |            0 |
+---------+---------------------+---------------------+-----------------------------+--------------------------------------+-------------+--------------+
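
For reference, a hedged sketch of how these timeout parameters were injected into the pool; everything except connectTimeout/socketTimeout is a placeholder, since the real createPool options in update-db.js aren't shown here:

// Hedged sketch only: placeholder connection settings around the two timeout
// options that were used to simulate failures.
const mariadb = require( 'mariadb' );

const pool = mariadb.createPool( {
	host: process.env.DB_HOST,
	user: process.env.DB_USER,
	password: process.env.DB_PASSWORD,
	database: process.env.DB_NAME,
	// Tiny values (milliseconds) used only to force the errors shown above:
	connectTimeout: 1, // produced ER_GET_CONNECTION_TIMEOUT
	socketTimeout: 1 // produced ER_SOCKET_TIMEOUT
} );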

I have also been doing successful imports, and then changing rows in import_status to simulate failed or incomplete batches.

If yesterday's import was from a clean DB, I get exceptions like the ones above (T344941#9485893) when it retries yesterday's failed batches.

If yesterday's import has a mix of failed (batch_status=2) and incomplete (batch_status=0) batches, when you update it will only retry the failed batches. You then need to run the update again to retry the incomplete batches.

If there are failed batches for both yesterday and today, it will retry for both days but I think it gets the batch for today wrong. For example:

SELECT * FROM import_status;
+---------+---------------------+-----------------------------+-----------------------------+--------------------------------------+-------------+--------------+
| batchid | timestamp           | feed_file_yesterday         | feed_file_today             | batch_file                           | batch_count | batch_status |
+---------+---------------------+-----------------------------+-----------------------------+--------------------------------------+-------------+--------------+
|       3 | 2024-01-25 13:54:47 | /tmp/ipoid/20240109.json.gz | /tmp/ipoid/20240110.json.gz | /tmp/ipoid/sub/query_split_aaaab.sql |           2 |            2 |
|       4 | 2024-01-25 13:54:48 | /tmp/ipoid/20240109.json.gz | /tmp/ipoid/20240110.json.gz | /tmp/ipoid/sub/query_split_aaaaa.sql |           2 |            1 |
|       5 | 2024-01-25 13:55:27 | /tmp/ipoid/20240110.json.gz | /tmp/ipoid/20240111.json.gz | /tmp/ipoid/sub/query_split_aaaab.sql |           2 |            2 |
|       6 | 2024-01-25 13:55:29 | /tmp/ipoid/20240110.json.gz | /tmp/ipoid/20240111.json.gz | /tmp/ipoid/sub/query_split_aaaaa.sql |           2 |            1 |
+---------+---------------------+-----------------------------+-----------------------------+--------------------------------------+-------------+--------------+

After re-running today's update:

+---------+---------------------+-----------------------------+-----------------------------+--------------------------------------+-------------+--------------+
| batchid | timestamp           | feed_file_yesterday         | feed_file_today             | batch_file                           | batch_count | batch_status |
+---------+---------------------+-----------------------------+-----------------------------+--------------------------------------+-------------+--------------+
|       3 | 2024-01-25 13:54:47 | /tmp/ipoid/20240109.json.gz | /tmp/ipoid/20240110.json.gz | /tmp/ipoid/sub/query_split_aaaab.sql |           2 |            3 |
|       4 | 2024-01-25 13:54:48 | /tmp/ipoid/20240109.json.gz | /tmp/ipoid/20240110.json.gz | /tmp/ipoid/sub/query_split_aaaaa.sql |           2 |            1 |
|       5 | 2024-01-25 13:55:27 | /tmp/ipoid/20240110.json.gz | /tmp/ipoid/20240111.json.gz | /tmp/ipoid/sub/query_split_aaaab.sql |           2 |            2 |
|       6 | 2024-01-25 13:55:29 | /tmp/ipoid/20240110.json.gz | /tmp/ipoid/20240111.json.gz | /tmp/ipoid/sub/query_split_aaaaa.sql |           2 |            1 |
|       7 | 2024-01-25 13:56:41 | /tmp/ipoid/20240109.json.gz | /tmp/ipoid/20240110.json.gz | /tmp/ipoid/sub/query_split_aaaab.sql |           2 |            1 |
|       8 | 2024-01-25 13:56:42 | /tmp/ipoid/20240109.json.gz | /tmp/ipoid/20240110.json.gz | /tmp/ipoid/sub/query_split_aaaab.sql |           2 |            1 |
+---------+---------------------+-----------------------------+-----------------------------+--------------------------------------+-------------+--------------+

Note that batchid=5 has not had its batch_status changed to 3. Both batchid 7 and 8 have the same values for feed_file_yesterday and feed_file_today. I am not sure what state the data in the database is in.

Would there ever be a case where yesterday's batches have failed but today's have succeeded? When you re-run the update for today it will retry yesterday's failed batches but not re-apply today's updates, leaving the data in an inconsistent state.

Batches are deterministic...

You do seem to need to make sure the batchsize parameter you used today is the same as yesterday. Perhaps we shouldn't make it a parameter.

The script is expected to run at a more frequent cadence than daily and should reach one conclusion on every run. With more frequent runs, that should be enough tries to recover from an error and continue that day's import before the next one.

What if yesterday's retries keep failing and we go into tomorrow with no rows in import_status for today (we haven't imported any data for today)?

What kinds of errors are we guarding against?

This script is less focused on guarding against errors (stopping them from ever happening) and more concerned with graceful recovery/preventing compounding errors. In the case of an error, it should stop importing new data and re-attempt old dates until it's successful. For instance:

  • If there are batches with errors, it should re-attempt those batches and only those batches until they succeed. This means if something fails consistently (e.g. we didn't guard against bad data), it should continue to attempt and fail, and not import anything new, until an engineer can resolve the bad data handling.
  • If the import is incomplete for whatever reason, it should attempt to import the batches that were never attempted and continue to do that until the import is considered complete (the number of batches with completed status equals the expected total). See the sketch after this list.
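
Since batches are deterministic and import_status stores a batch_file per batch, re-attempting only the errored batches might look roughly like this (a sketch under those assumptions; runBatchFile is a hypothetical stand-in for whatever update-db.js does with one query_split_*.sql file):

// Hedged sketch: assumes batch_status = 2 marks an errored batch, as in the
// tables earlier in this task, and that re-running the stored batch_file is
// enough to retry that batch.
const BATCH_STATUS_ERROR = 2;

async function retryErroredBatches( conn, yesterdayFile, todayFile ) {
	const failed = await conn.query(
		'SELECT batchid, batch_file FROM import_status ' +
		'WHERE feed_file_yesterday = ? AND feed_file_today = ? AND batch_status = ?',
		[ yesterdayFile, todayFile, BATCH_STATUS_ERROR ]
	);
	for ( const batch of failed ) {
		// Hypothetical helper: apply one stored batch file and update its status.
		await runBatchFile( conn, batch.batch_file, batch.batchid );
	}
}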

I don't know how realistic these types of errors are and whether we are concerned about them.

Ideally, they're states the script should recover from regardless. If you run a regular import afterwards, without any timeouts, the script should successfully recover/eventually finish importing.


Could you perhaps show me examples of real errors on production?

A real error we had was the connection dropping midway through an import. In this case, a lot of batches had been completed and no batch errors were logged. The script was expected to (and eventually did) only import the remaining batches.

A potential error we could have is an error happening on import, e.g. we didn't guard against a value coming in. The expected state would be a batch with the error status while the other batches have completed. The script should retry the errored batch.

The two could be combined, since disconnects are independent of bad data; in that case we'd see both the error status and a missing number of batches.


If yesterday's import has a mix of failed (batch_status=2) and incomplete (batch_status=0) batches, when you update it will only retry the failed batches. You then need to run the update again to retry the incomplete batches.

This is expected behavior. The script can only determine one state at a time, and incomplete is considered a different state from errors.


If there are failed batches for both yesterday and today, it will retry for both days but I think it gets the batch for today wrong. For example:

This should never happen naturally. The script should always prevent additional imports if there are errors or incompletes to resolve first.


You do seem to need to make sure the batchsize parameter you used today is the same as yesterday. Perhaps we shouldn't make it a parameter.

I don't think this is a mandatory parameter for you to enter if you use main.sh as the entrypoint. It should default to 10000, which is what we expect in production, and that doesn't change. Smaller batch sizes are very useful for development testing, and it's documented in the README as a development-only parameter, although there's nothing enforcing that in the code.


What if yesterday's retries keep failing and we go into tomorrow with no rows in import_status for today (we haven't imported any data for today)?

It should stay stopped until it's resolved. After it's resolved, it'll use the current day's date as the today date and diff accordingly. The script doesn't care if multiple days are skipped, and from a product perspective we also prefer to be stale for a few days while the import is resolved, then update to the most up-to-date data whenever the database is healthy again.

I don't think I can spend much more time on this. I think I have made some progress in simulating import errors and I have some observations below.

Adding a random socketTimeout to the pool object in update-db.js, I have been able to get errors on import of the form:

{"log.level":"info","@timestamp":"2024-02-09T14:02:19.215Z","process.pid":116297,"host.hostname":"4a2e0da12c3c","ecs.version":"8.10.0","message":"Importing /tmp/ipoid/sub/query_split_aaaag.sql..."}
{"@timestamp":"2024-02-09T14:02:20.097Z","ecs.version":"8.10.0","log.level":"info","message":"All updates complete"}
/srv/service/node_modules/mariadb/lib/misc/errors.js:64
  return new SqlError(msg, sql, fatal, info, sqlState, errno, additionalStack, addHeader);
         ^

SqlError: Cannot execute new commands: connection closed
sql: UPDATE import_status SET batch_status = ? WHERE batchid = ? - parameters:[2,2]
    at Object.module.exports.createError (/srv/service/node_modules/mariadb/lib/misc/errors.js:64:10)
    at Query.throwUnexpectedError (/srv/service/node_modules/mariadb/lib/cmd/command.js:39:24)
    at Query.throwNewError (/srv/service/node_modules/mariadb/lib/cmd/command.js:69:22)
    at Connection.addCommandDisabled (/srv/service/node_modules/mariadb/lib/connection.js:1104:21)
    at Connection.query (/srv/service/node_modules/mariadb/lib/connection.js:1349:10)
    at new Promise (<anonymous>)
    at ConnectionPromise.query (/srv/service/node_modules/mariadb/lib/connection-promise.js:93:12)
    at /srv/service/update-db.js:164:29
    at processTicksAndRejections (node:internal/process/task_queues:96:5) {
  sqlMessage: 'Cannot execute new commands: connection closed',
  sql: 'UPDATE import_status SET batch_status = ? WHERE batchid = ? - parameters:[2,2]',
  fatal: true,
  errno: 45013,
  sqlState: '08S01',
  code: 'ER_CMD_CONNECTION_CLOSED'
}

Then, on subsequent retries of the import, I see lots of errors of the form:

{"@timestamp":"2024-02-09T14:02:26.270Z","ecs.version":"8.10.0","log.level":"info","message":"Duplicate entry found for ip, attempting to delete and re-insert"}

and

{"@timestamp":"2024-02-09T14:02:32.487Z","ecs.version":"8.10.0","log.level":"error","message":{"batchFile":"/tmp/ipoid/sub/query_split_aaaag.sql","err":{"code":"ER_DUP_ENTRY","errno":1062,"fatal":false,"name":"SqlError","sql":"INSERT INTO actor_data_proxies (actor_data_id, proxy_id) VALUES( (SELECT pkid FROM actor_data WHERE ip = 'x.x.x.x'), (SELECT pkid FROM proxies WHERE proxy = 'FOO_PROXY' ) ); - parameters:[]","sqlMessage":"Duplicate entry '10679-1' for key 'PRIMARY'","sqlState":"23000"},"statement":"'INSERT INTO actor_data_proxies (actor_data_id, proxy_id) VALUES( (SELECT pkid FROM actor_data WHERE ip = \\'x.x.x.x\\'), (SELECT pkid FROM proxies WHERE proxy = \\'FOO_PROXY\\' ) );'","todayFile":"/tmp/ipoid/20240102.json.gz","yesterdayFile":"/tmp/ipoid/20240101.json.gz"}}

I am guessing this means the rollback() did not work, although I don't see it throw any errors or messages to that effect. It is difficult to tell the state of the data after this. It is possible this particular issue will be fixed with T356736, and I may do more investigation once that task is complete.

In the case that yesterday has only incomplete (batch_status=0) imports, require('./import-status-utils.js').getImportStatus() returns NOT_STARTED.

This does not appear to lead to any negative consequences when you run an update. The batches for yesterday appear to just be ignored and stay in the database with batch_status=0. However, I wonder if this behaviour will be confusing for future maintainers.

For example, with import_status:

+---------+---------------------+-----------------------------+-----------------------------+--------------------------------------+-------------+--------------+
| batchid | timestamp           | feed_file_yesterday         | feed_file_today             | batch_file                           | batch_count | batch_status |
+---------+---------------------+-----------------------------+-----------------------------+--------------------------------------+-------------+--------------+
|       1 | 2024-02-12 07:41:01 |                             | /tmp/ipoid/20240101.json.gz | /tmp/ipoid/sub/query_split_aaaaa.sql |           1 |            1 |
|       2 | 2024-02-12 07:41:13 | /tmp/ipoid/20240101.json.gz | /tmp/ipoid/20240102.json.gz | /tmp/ipoid/sub/query_split_aaaaa.sql |           1 |            0 |
+---------+---------------------+-----------------------------+-----------------------------+--------------------------------------+-------------+--------------+

Running docker-compose exec web node -e "require('./import-status-utils.js').getImportStatus('20240102', '20240103');" returns:

{"status":"NOT_STARTED","yesterday":"20240102","today":"20240103","retryBatches":null,"completedBatches":null}

Latest ipoid commit tested: ad79ee0c90b9c088d7df3242b567e7633b07d2ff