Page MenuHomePhabricator

Respond to 17.01.2024 data import failure
Closed, ResolvedPublic

Description

Per https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2024.01.17?id=3FbeF40BFe9MHf51sx-V, the import today failed.

Looking at the pods, I see:

NAME                                            READY   STATUS      RESTARTS   AGE
ipoid-production-655b6bd8f-cn8rz                2/2     Running     0          4d21h
ipoid-production-daily-updates-28420508-5znjp   0/1     Completed   0          3d3h
ipoid-production-daily-updates-28421948-5k9f9   0/1     Completed   0          2d3h
ipoid-production-daily-updates-28423388-llsxv   0/1     Completed   0          27h
ipoid-production-daily-updates-28424828-nqdjv   0/1     Error       0          3h38m
ipoid-production-daily-updates-28424828-t4pc8   0/1     Error       0          66m

Then the specific error that occurred:

kharlan@deploy2002:~$ kubectl logs -f ipoid-production-daily-updates-28424828-nqdjv
...
{"log.level":"info","@timestamp":"2024-01-17T13:39:13.761Z","process.pid":1074,"host.hostname":"ipoid-production-daily-updates-28424828-nqdjv","ecs.version":"8.10.0","message":"Importing /tmp/ipoid/sub/query_split_aaavd.sql..."}
/srv/service/node_modules/mariadb/lib/misc/errors.js:64
  return new SqlError(msg, sql, fatal, info, sqlState, errno, additionalStack, addHeader);
         ^

SqlError: (conn=-1, no: 45028, SQLState: HY000) retrieve connection from pool timeout after 10011ms
    (pool connections: active=0 idle=0 limit=10)
    at Object.module.exports.createError (/srv/service/node_modules/mariadb/lib/misc/errors.js:64:10)
    at Pool._requestTimeoutHandler (/srv/service/node_modules/mariadb/lib/pool.js:349:26)
    at listOnTimeout (node:internal/timers:559:17)
    at processTimers (node:internal/timers:502:7) {
  sqlMessage: 'retrieve connection from pool timeout after 10011ms\n' +
    '    (pool connections: active=0 idle=0 limit=10)',
  sql: null,
  fatal: false,
  errno: 45028,
  sqlState: 'HY000',
  code: 'ER_GET_CONNECTION_TIMEOUT'
}

Because we have backoffLimit: 2 set for daily-updates, a retry occurred:

kharlan@deploy2002:~$ kubectl logs -f ipoid-production-daily-updates-28424828-t4pc8
{"log.level":"info","@timestamp":"2024-01-17T13:39:51.695Z","process.pid":26,"host.hostname":"ipoid-production-daily-updates-28424828-t4pc8","ecs.version":"8.10.0","message":"Update init already run, skipping..."}
{"log.level":"info","@timestamp":"2024-01-17T13:39:51.698Z","process.pid":26,"host.hostname":"ipoid-production-daily-updates-28424828-t4pc8","ecs.version":"8.10.0","message":"Update remove-tunnel-anonymous-property already run, skipping..."}
{"log.level":"info","@timestamp":"2024-01-17T13:39:51.698Z","process.pid":26,"host.hostname":"ipoid-production-daily-updates-28424828-t4pc8","ecs.version":"8.10.0","message":"Database updated!"}
{"log.level":"info","@timestamp":"2024-01-17T13:39:52.927Z","process.pid":49,"host.hostname":"ipoid-production-daily-updates-28424828-t4pc8","ecs.version":"8.10.0","message":"Feed for 20240117 exists, pulling now..."}
{"log.level":"info","@timestamp":"2024-01-17T13:40:19.934Z","process.pid":49,"host.hostname":"ipoid-production-daily-updates-28424828-t4pc8","ecs.version":"8.10.0","message":"Feed downloaded to /tmp/ipoid/20240117.json.gz"}
{"log.level":"info","@timestamp":"2024-01-17T13:40:20.059Z","process.pid":60,"host.hostname":"ipoid-production-daily-updates-28424828-t4pc8","ecs.version":"8.10.0","message":"Feed for 20240117 already exists, skipping attempt to download."}
{"log.level":"info","@timestamp":"2024-01-17T13:40:20.075Z","process.pid":71,"host.hostname":"ipoid-production-daily-updates-28424828-t4pc8","ecs.version":"8.10.0","message":"Removing potential stale files..."}
{"log.level":"info","@timestamp":"2024-01-17T13:40:20.118Z","process.pid":71,"host.hostname":"ipoid-production-daily-updates-28424828-t4pc8","ecs.version":"8.10.0","message":"Unzipping yesterday's file..."}
{"log.level":"info","@timestamp":"2024-01-17T13:41:23.455Z","process.pid":71,"host.hostname":"ipoid-production-daily-updates-28424828-t4pc8","ecs.version":"8.10.0","message":"Sorting yesterday's file..."}
{"log.level":"info","@timestamp":"2024-01-17T13:50:23.463Z","process.pid":71,"host.hostname":"ipoid-production-daily-updates-28424828-t4pc8","ecs.version":"8.10.0","message":"Unzipping today's file..."}
{"log.level":"info","@timestamp":"2024-01-17T13:51:30.846Z","process.pid":71,"host.hostname":"ipoid-production-daily-updates-28424828-t4pc8","ecs.version":"8.10.0","message":"Aggregating today's properties..."}
{"log.level":"info","@timestamp":"2024-01-17T13:55:59.852Z","process.pid":71,"host.hostname":"ipoid-production-daily-updates-28424828-t4pc8","ecs.version":"8.10.0","message":"Sorting today's file..."}
{"log.level":"info","@timestamp":"2024-01-17T14:05:01.177Z","process.pid":71,"host.hostname":"ipoid-production-daily-updates-28424828-t4pc8","ecs.version":"8.10.0","message":"Finding lines unique to old file..."}
{"log.level":"info","@timestamp":"2024-01-17T14:06:06.354Z","process.pid":71,"host.hostname":"ipoid-production-daily-updates-28424828-t4pc8","ecs.version":"8.10.0","message":"Finding lines unique to new file..."}
{"log.level":"info","@timestamp":"2024-01-17T14:07:12.962Z","process.pid":71,"host.hostname":"ipoid-production-daily-updates-28424828-t4pc8","ecs.version":"8.10.0","message":"Processing unique files..."}
{"log.level":"info","@timestamp":"2024-01-17T14:07:13.005Z","process.pid":71,"host.hostname":"ipoid-production-daily-updates-28424828-t4pc8","ecs.version":"8.10.0","message":"Joining and sorting unique files..."}
sort: cannot read: '/tmp/ipoid/chunk*': No such file or directory

But it is using January 17 as the "yesterday" file.

Details

TitleReferenceAuthorSource BranchDest Branch
Use a more permissive regex in date selectionrepos/mediawiki/services/ipoid!215stranmain-I3c6c63c394a19976433b8d9c169b54e2010c936fmain
Customize query in GitLab

Event Timeline

kostajh renamed this task from Response to 17.01.2024 data import failure to Respond to 17.01.2024 data import failure.Jan 17 2024, 4:53 PM

AIUI, T344941: Respond to data update and import errors would have helped avoid this problem.

We might want to consider T354944: Ensure database is up-to-date with latest data file if we could pull it together relatively quickly, to avoid needing to drop/re-import the DB.

We're requesting a dump of the current database from @jcrespo.

I've left 2 files on the directory: /home/stran/ipoid_backup @ deploy2002:

root@deploy2002:/home/stran/ipoid_backup$ ls -lha
total 524K
drwxr-xr-x 2 stran wikidev 4.0K Jan 18 10:50 .
drwxr-xr-x 5 stran wikidev 4.0K Jan 18 10:45 ..
-rw-r--r-- 1 stran wikidev  326 Jan 18 10:52 ipoid.import_status-schema.sql.gz
-rw-r--r-- 1 stran wikidev 509K Jan 18 10:49 ipoid.import_status.sql.gz
✔️

The first one contains a gzipped text file with the CREATE TABLE statement of the table (the table definition) and the second, on the same format, the contents on the table in INSERT syntax. This can be examined directly by uncompressing it and reading it with a text editor, or can be imported directly to a mysql server by doing zcat <file> | mysql ipoid

Change 991585 had a related patch set uploaded (by STran; author: STran):

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/991585

Change 991585 merged by jenkins-bot:

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/991585

STran claimed this task.

Import for 20240117 resumed and completed and import for 20240118 finished as well. Incident seems to have been resolved.