Page MenuHomePhabricator

Duplicate entry 'X.X.X.X' for key 'ip' sql: INSERT INTO actor_data (ip,org,client_count,types,conc_city,conc_state,conc_country,countries,location_country,risks
Closed, ResolvedPublicBUG REPORT

Description

First seen on February 3 at 8:39 UTC

There are a variety of IP addresses that throw the "Duplicate entry" error. There are 19,983 instances of the error appearing since February 3.

In subsequent imports, we see "Found failed batches from a previous run; attempting to re-import batches from an update from 20240202 to 20240203" but this ends up triggering the same errors again.

In practice, I think that means the production DB contains a mix of data newer than February 3rd, and stale entries.

See also:

Details

Related Changes in Gerrit:
Related Changes in GitLab:
TitleReferenceAuthorSource BranchDest Branch
main: Use "yesterday" as the current import daterepos/mediawiki/services/ipoid!230kharlanmain-I5b39cb19101158085eb9db461c0665aace81d5ebmain
Prefere the original statement on retryrepos/mediawiki/services/ipoid!229stranmain-I17cda9998017cf572d324ff914e1f535f028ab24main
update-db: Use for statement instead of .everyrepos/mediawiki/services/ipoid!228kharlanmain-I2a1afffeaefd2ef0e3d9cfd6554a38a321cbed20main
update-db: Log error object seen in duplicate entry handlingrepos/mediawiki/services/ipoid!225kharlanmain-I0dc5702adced78c5728babf3b08cbf13fafd8ba7main
Recover from duplicate IP key SQL errorsrepos/mediawiki/services/ipoid!222stranmain-I56a6fceb221ec492067ac0d04a40ea7550ea553dmain
Customize query in GitLab

Event Timeline

kostajh changed the subtype of this task from "Task" to "Bug Report".Feb 6 2024, 8:12 AM

T354944: Ensure database is up-to-date with latest data file would help with this, in the longer term.

In the shorter term, I wonder if ON DUPLICATE KEY UPDATE would be a reasonable option. It's not an ideal solution, because in theory the SQL statement generation shouldn't result in INSERT queries that generate duplicate entries.

T354944: Ensure database is up-to-date with latest data file would help with this, in the longer term.

In the shorter term, I wonder if ON DUPLICATE KEY UPDATE would be a reasonable option. It's not an ideal solution, because in theory the SQL statement generation shouldn't result in INSERT queries that generate duplicate entries.

Another thought would be to modify generateInsertActorQueries to do a SELECT before generating the insert query. If it finds a matching IP, it would instead return an UPDATE, and would also give us the opportunity to log some more details about the stack trace in which this scenario occurs.

It seems like the feed file can be updated over the course of the day. I happened to have an older copy of the feed file and downloaded a new copy while investigating and noticed that my file was 1.01GB and the new file was 1.02GB. Checking the old file vs the new file, I saw that the IP exists in the new (and presumably now-static) file but wasn't in the feed that the db probably used:

% grep {IP} 20240202.new.json

% grep {IP} 20240202.old.json
{"ip":{IP},"Kar-Tel LLC"}

Checking the lc, the two are definitely different:

% wc -l 20240202.new.json
 36949570 20240202.new.json

% wc -l 20240202.old.json
 36598906 20240202.old.json

We discussed the following fix:


I downloaded the file for the date again today while writing this up and can see that the line count for this and the "final" file from the other day are the same:

% wc -l 20240202.new.json
 36949570 20240202.new.json

% wc -l 20240202.latest.json         
 36949570 20240202.latest.json

I'm running some diffs and will update this investigation if/when they succeed.

Change 999638 had a related patch set uploaded (by STran; author: STran):

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/999638

Change 999638 merged by jenkins-bot:

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/999638

It seems like this is still an issue. Looking at the logs from today (example) we are stuck on importing February 2 - February 3. This is occurring despite us having 38,960 log entries for "Duplicate entry found for ip, attempting to delete and re-insert" in the last week.

Change 1002924 had a related patch set uploaded (by STran; author: STran):

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/1002924

Change 1002924 merged by jenkins-bot:

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/1002924

Change 1003360 had a related patch set uploaded (by Kosta Harlan; author: STran):

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/1003360

Change 1003360 merged by jenkins-bot:

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/1003360

Change 1005457 had a related patch set uploaded (by STran; author: STran):

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/1005457

Change 1005457 merged by jenkins-bot:

[operations/deployment-charts@master] ipoid: Bump version

https://gerrit.wikimedia.org/r/1005457

@STran are there other patches that need review for this ticket? If not, can this be moved into QA?

dom_walden subscribed.

We appear to be recovering from duplicate IP errors without interrupting the import completely (i.e. other IPs in the feed are imported).

However, I have found a few puzzling issues with the accuracy of the data on update when there are duplicate IPs. Raised as T359384, T359181, T359157.

EDIT: I have found an issue that happens even without duplicate IPs: T359517.

Further testing of this should be done in those bugs.

Latest ipoid commit tested: 3fc04e78cb82ac031188446aed0aa1210d1200f0

Marking as resolved as I think there isn't anything else to do in this ticket. Please re-open if I am wrong.