There are, broadly, four classes of "duplicates" that may exist and need cleaning up:
- Cases where both a revision row and an archive row exist for the same change.
- Cases where an archive row has the same revision ID as a different change in the revision table.
- Cases where multiple archive rows exist for the same change.
- Cases where multiple archive rows use the same revision ID for different changes.
We can probably treat two rows as the "same change" when their title, sha1, timestamp, and user all match.
Fixing this should be reasonably straightforward: find the duplicates, classify them, delete the archive rows for the two "same change" cases, and assign new revision IDs (as in T182678) for the "different change" cases.
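
As a rough illustration of the find/classify step, here is a minimal Python sketch that works on in-memory rows rather than the real tables. The `Row` tuple, its field names, and the functions `change_key` and `classify_archive_rows` are all hypothetical and only stand in for whatever the actual schema and maintenance script would use. It also assumes that matching on (title, sha1, timestamp, user) is enough to call two rows the "same change", and that the first archive row seen for a given change or revision ID is the one to keep.

```python
from collections import namedtuple

# Hypothetical, simplified row shape; the real tables have more columns
# and different names. row_id is the row's own primary key.
Row = namedtuple("Row", "row_id rev_id title sha1 timestamp user")


def change_key(row):
    """Tuple used to decide whether two rows describe the same change."""
    return (row.title, row.sha1, row.timestamp, row.user)


def classify_archive_rows(revision_rows, archive_rows):
    """Map each archive row_id to a (class, action) pair.

    Classes follow the four cases listed above; action is "delete",
    "reassign" (needs a new revision ID), or "keep".
    """
    rev_keys = {change_key(r) for r in revision_rows}
    rev_ids = {r.rev_id for r in revision_rows}
    seen_keys = set()  # changes already represented by an earlier archive row
    seen_ids = set()   # revision IDs already held by a kept archive row
    result = {}

    for row in sorted(archive_rows, key=lambda r: r.row_id):
        key = change_key(row)
        if key in rev_keys:
            # Case 1: same change exists in both revision and archive.
            result[row.row_id] = ("same change in revision", "delete")
            continue
        if key in seen_keys:
            # Case 3: same change already present in archive.
            result[row.row_id] = ("same change in archive", "delete")
            continue
        if row.rev_id in rev_ids:
            # Case 2: revision ID collides with a different change in revision.
            result[row.row_id] = ("ID reused from revision", "reassign")
        elif row.rev_id in seen_ids:
            # Case 4: revision ID collides with a different change in archive.
            result[row.row_id] = ("ID reused within archive", "reassign")
        else:
            result[row.row_id] = ("unique", "keep")
            seen_ids.add(row.rev_id)
        seen_keys.add(key)

    return result
```

Rows marked "reassign" would then get fresh revision IDs by whatever mechanism T182678 settles on; the sketch deliberately stops short of that and of any actual deletion.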