In this task, we'll create the script for updating the ipoid database with $CURRENT_DAY data, assuming that we already have a full set of data in the DB.
See T305724: Investigate database data invalidation questions and chunked/timed API to MySQL/MariaDB ETL and T325635: Investigate: Alternatives or improvements to data import method for previous discussion. Copying from T305724#8917809:
- Initial import, get the ~25m records into the DB (all INSERT queries). Tracked in T339331: Prepare for initial data import on production servers
- Daily ingest of data (cron job in Kubernetes):
  - Download the dump file for today as $CURRENT_DAY_DUMP_FILE
  - Iterate over all lines in $CURRENT_DAY_DUMP_FILE and do a SELECT query against the DB.
    - If we find a result:
      - the metadata matches what is in $CURRENT_DAY_DUMP_FILE: do nothing, the data is current
      - the metadata doesn't match: UPDATE the record and associated metadata records as needed
    - If we don't find a result:
      - it's a new record; run an INSERT to add it to the DB
  - Download the dump file for yesterday as $YESTERDAY_DUMP_FILE
  - Iterate over all lines in $YESTERDAY_DUMP_FILE and find lines that do not exist in $CURRENT_DAY_DUMP_FILE
  - Issue DELETE queries for all lines that are in $YESTERDAY_DUMP_FILE but not in $CURRENT_DAY_DUMP_FILE, as those are stale data
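The steps above can be sketched as a diff-and-apply pass. This is a minimal illustration, not the ipoid implementation: it assumes each dump file has been parsed into a mapping of IP to metadata, and it stands in for the DB with a plain dict (the `db.get`/assignment/`pop` calls correspond to the SELECT/INSERT-UPDATE/DELETE queries). The function name `apply_daily_update` is hypothetical.

```python
def apply_daily_update(db, today_dump, yesterday_dump):
    """Apply $CURRENT_DAY_DUMP_FILE against the DB; return (inserts, updates, deletes).

    db: dict mapping IP -> metadata, standing in for the real database.
    today_dump / yesterday_dump: parsed dump files, same shape as db.
    """
    inserts = updates = deletes = 0
    for ip, metadata in today_dump.items():
        existing = db.get(ip)          # stands in for the SELECT query
        if existing is None:
            db[ip] = metadata          # new record: INSERT
            inserts += 1
        elif existing != metadata:
            db[ip] = metadata          # metadata changed: UPDATE
            updates += 1
        # else: metadata matches, the data is current, do nothing
    # Lines in yesterday's dump that are absent from today's are stale.
    for ip in yesterday_dump.keys() - today_dump.keys():
        if db.pop(ip, None) is not None:   # stale record: DELETE
            deletes += 1
    return inserts, updates, deletes
```

In the real script each branch would issue the corresponding SQL statement (and touch the associated metadata tables) rather than mutate a dict, but the control flow is the same.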
Note that the script processing the updates should add pauses so that the entire process spans at least 6 hours. See T340516: Investigate diff expectations for updating imported data for details on the expected number of new entries, updates, and deletions; tl;dr it's about 2M new entries, 6M updates, and 2M deletions.
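One simple way to spread the work over the required window is to process rows in fixed-size batches and sleep between batches, with the pause derived from the target duration. A hedged sketch; the generator name, batch size, and pacing strategy are assumptions, not something specified by the task:

```python
import time

def paced_batches(rows, batch_size, total_rows, min_duration_s=6 * 3600):
    """Yield batches of rows, sleeping between batches so that the whole
    run spans at least min_duration_s seconds (6 hours by default)."""
    n_batches = max(1, -(-total_rows // batch_size))  # ceil division
    pause = min_duration_s / n_batches
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
            time.sleep(pause)  # throttle between batches
    if batch:
        yield batch
```

With roughly 10M rows touched per day, a batch size of a few thousand keeps each pause short while still stretching the run past the 6-hour floor.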