Page MenuHomePhabricator

File storage uses (title,timestamp) as a unique key, but this is not unique
Open, HighPublic

Description

FileRepo::findFile appears to use the (title,timestamp) pair as a unique key to identify a particular version of a file. However, this is not necessarily unique. See for example https://commons.wikimedia.org/wiki/File:Treppe_22_22_test_upload.jpg where two versions were both uploaded at 2014-05-05T13:00:59Z.

For proper operation, the API requires some ordered unique identifier for file revisions. I imagine there are other places in the code that also require unique identifiers.

This appears to depend on bug 15441 (for the oldimage table), although I'm not familiar enough with the FileBackend code to know for sure whether that's generally the case or if it only applies to certain backends.


Version: 1.23.0
Severity: normal

Details

Reference
bz65264

Related Objects

StatusSubtypeAssignedTask
OpenNone
OpenNone
ResolvedJdlrobson
OpenFeatureNone
OpenFeatureNone
OpenNone
Resolvedcscott
Duplicatecscott
DeclinedFeatureNone
OpenFeatureNone
OpenNone
OpenNone
OpenNone
ResolvedUmherirrender
DuplicateNone
OpenNone
DeclinedNone
DuplicateBUG REPORTNone
OpenNone
OpenFeatureNone
OpenFeatureNone
OpenNone
ResolvedNone
OpenNone
OpenNone
OpenLadsgroup
ResolvedKrinkle
ResolvedLadsgroup
Resolvedmforns
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedZabe
ResolvedDreamy_Jazz
ResolvedLadsgroup
ResolvedLadsgroup
DuplicateNone
ResolvedZabe
OpenLadsgroup
OpenZabe
OpenPRODUCTION ERRORNone
ResolvedPRODUCTION ERRORZabe
ResolvedPRODUCTION ERRORZabe
OpenPRODUCTION ERRORZabe

Event Timeline

bzimport raised the priority of this task from to High.Nov 22 2014, 3:24 AM
bzimport set Reference to bz65264.
bzimport added a subscriber: Unknown Object (MLST).

Change 133178 had a related patch set uploaded by Aaron Schulz:
Made LocalFile avoid duplicate (name,timestamp) pairs

https://gerrit.wikimedia.org/r/133178

Change 133178 merged by jenkins-bot:
Made LocalFile avoid duplicate (name,timestamp) pairs

https://gerrit.wikimedia.org/r/133178

This is probably a duplicate of some older bug.

So change 133178 fixed it going forward, which is good. Although I left a comment there.

We'll also need to do database cleanup for situations where this already exists, such as https://commons.wikimedia.org/wiki/File:Treppe_22_22_test_upload.jpg, to completely fix the bug. I also wonder if it's worth making the oi_name_timestamp index unique, although that would only detect the problem after the fact since the row for the current version isn't added to oldimage until a new version becomes current.

The existing duplicate data might make the migration to the new structure a bit complicated:

mysql:research@s1-analytics-replica.eqiad.wmnet [enwiki]> select oi_name, oi_timestamp, oi_archive_name  from oldimage where oi_archive_name in (select oi_archive_name from oldimage group by oi_archive_name havi
ng count(*) > 1) limit 50;
+----------------------------------------------------------------+----------------+------------------------------------------+
| oi_name                                                        | oi_timestamp   | oi_archive_name                          |
+----------------------------------------------------------------+----------------+------------------------------------------+
| Elas.gif                                                       | 20030120080513 | 20030122141949!Elas.gif                  |
| Elas.gif                                                       | 20030122142001 | 20030122141949!Elas.gif                  |
| Elas.gif                                                       | 20030122142227 | 20030122141949!Elas.gif                  |
| Squirell.500mm.5yards.jpg                                      | 20040821201010 | 20040821201046!Squirell.500mm.5yards.jpg |
| Squirell.500mm.5yards.jpg                                      | 20040821201217 | 20040821201046!Squirell.500mm.5yards.jpg |
| LPLW.JPG                                                       | 20070502003039 | 20070502003043!LPLW.JPG                  |
| LPLW.JPG                                                       | 20070502003043 | 20070502003043!LPLW.JPG                  |
| ConfederateLincoln.png                                         | 20061103224104 |                                          |
| Nearing-scott-1918.jpg                                         | 20090115091745 |                                          |
| Joe_Brooks_wiki.jpg                                            | 20090309235754 |                                          |
| RobertHIGoddard-1909.jpg                                       | 20090716151612 |                                          |
| SiobhanOwen_Singer_Harpist_2012.jpg                            | 20120907143842 |                                          |
| Arab_League_Languages.PNG                                      | 20121023151455 |                                          |
| Hines_Booth.jpg                                                | 20080827154940 | 20080827155054!Hines_Booth.jpg           |
| Hines_Booth.jpg                                                | 20080808042730 | 20080827154940!Hines_Booth.jpg           |
| Hines_Booth.jpg                                                | 20080803223030 | 20080808042730!Hines_Booth.jpg           |
| Hines_Booth.jpg                                                | 20080827154940 | 20080827155054!Hines_Booth.jpg           |
| Hines_Booth.jpg                                                | 20080808042730 | 20080827154940!Hines_Booth.jpg           |
| Hines_Booth.jpg                                                | 20080803223030 | 20080808042730!Hines_Booth.jpg           |
| Malcolm_Glazer.jpg                                             | 20160801213343 | 20180103064117!Malcolm_Glazer.jpg        |
| Malcolm_Glazer.jpg                                             | 20160801213343 | 20180103064117!Malcolm_Glazer.jpg        |
| KKNO_logo.jpg                                                  | 20080223183936 |                                          |
| Logo_of_the_International_Practical_Shooting_Confederation.png | 20210823171756 |                                          |
| Maquinista,_diptico_2008.jpg                                   | 20220513034020 |                                          |
| Charles_Xavier_(Patrick_Stewart).jpg                           | 20220824010434 |                                          |
| BBCAN11_titlecard.jpg                                          | 20230308232159 |                                          |
| Msk_novohrad_lucenec_logo.png                                  | 20230309085314 |                                          |
| Orlen_Lietuva_logo.jpg                                         | 20230919140811 |                                          |
| Seattle_Fire_Department_Official_Logo.png                      | 20240818205749 |                                          |
+----------------------------------------------------------------+----------------+------------------------------------------+
29 rows in set (3.767 sec)

We get empty ones too and they are not fixed.

combination of oi_name and oi_timestamp is much cleaner though:

mysql:research@s1-analytics-replica.eqiad.wmnet [enwiki]> select oi_name, oi_timestamp, count(*) from oldimage group by oi_name, oi_timestamp having count(*) > 1;
+--------------------------------+----------------+----------+
| oi_name                        | oi_timestamp   | count(*) |
+--------------------------------+----------------+----------+
| Hines_Booth.jpg                | 20080803223030 |        2 |
| Hines_Booth.jpg                | 20080808042730 |        2 |
| Hines_Booth.jpg                | 20080827154940 |        2 |
| Joe_Brooks_wiki.jpg            | 20090309235754 |        2 |
| Malcolm_Glazer.jpg             | 20160801213343 |        2 |
| Me_at_Pizza_Hut.jpg            | 20070305143001 |        2 |
| New_English_Bible_cover.jpg    | 20060506015320 |        2 |
| State_police_Crawford_Erie.jpg | 20090215041007 |        2 |
+--------------------------------+----------------+----------+
8 rows in set (0.771 sec)

All cases are identical files being uploaded:

mysql:research@s1-analytics-replica.eqiad.wmnet [enwiki]> select oi_name, oi_timestamp, oi_archive_name, oi_sha1  from oldimage where concat(oi_name, '!', oi_timestamp) in (select concat(oi_name, '!', oi_timestamp) from oldimage group by oi_name, oi_timestamp having count(*) > 1) order by oi_name,oi_timestamp limit 50;
+--------------------------------+----------------+-----------------------------------------------+---------------------------------+
| oi_name                        | oi_timestamp   | oi_archive_name                               | oi_sha1                         |
+--------------------------------+----------------+-----------------------------------------------+---------------------------------+
| Hines_Booth.jpg                | 20080803223030 | 20080808042730!Hines_Booth.jpg                | dx3q3qn05m8vdjh534zoxg2t7j426bs |
| Hines_Booth.jpg                | 20080803223030 | 20080808042730!Hines_Booth.jpg                | dx3q3qn05m8vdjh534zoxg2t7j426bs |
| Hines_Booth.jpg                | 20080808042730 | 20080827154940!Hines_Booth.jpg                | 8t9bk58rrfo1trqs510n3u237tozv7d |
| Hines_Booth.jpg                | 20080808042730 | 20080827154940!Hines_Booth.jpg                | 8t9bk58rrfo1trqs510n3u237tozv7d |
| Hines_Booth.jpg                | 20080827154940 | 20080827155054!Hines_Booth.jpg                | k3axmxtrj8zvopksvf5fi8kbl5atn5y |
| Hines_Booth.jpg                | 20080827154940 | 20080827155054!Hines_Booth.jpg                | k3axmxtrj8zvopksvf5fi8kbl5atn5y |
| Joe_Brooks_wiki.jpg            | 20090309235754 |                                               | qu8ndqxgaue9u135l9sydn9s9vv8hv7 |
| Joe_Brooks_wiki.jpg            | 20090309235754 | 20090310002032!Joe_Brooks_wiki.jpg            | qu8ndqxgaue9u135l9sydn9s9vv8hv7 |
| Malcolm_Glazer.jpg             | 20160801213343 | 20180103064117!Malcolm_Glazer.jpg             | d13rs6nopm9k6q8o21rfid4zc3533l9 |
| Malcolm_Glazer.jpg             | 20160801213343 | 20180103064117!Malcolm_Glazer.jpg             | d13rs6nopm9k6q8o21rfid4zc3533l9 |
| Me_at_Pizza_Hut.jpg            | 20070305143001 | 20070305143001!Me_at_Pizza_Hut.jpg            | b79qyiw1es9cts54gqgmn493flphkxr |
| Me_at_Pizza_Hut.jpg            | 20070305143001 | 20070305143046!Me_at_Pizza_Hut.jpg            | b79qyiw1es9cts54gqgmn493flphkxr |
| New_English_Bible_cover.jpg    | 20060506015320 | 20110925002150!New_English_Bible_cover.jpg    | q8086om09p23sfk7yauyptfbvvsr890 |
| New_English_Bible_cover.jpg    | 20060506015320 | 20190628054743!New_English_Bible_cover.jpg    | q8086om09p23sfk7yauyptfbvvsr890 |
| State_police_Crawford_Erie.jpg | 20090215041007 | 20090215041007!State_police_Crawford_Erie.jpg | fap8pvagamdyfv51ozkmnptj9tf344i |
| State_police_Crawford_Erie.jpg | 20090215041007 | 20150514060412!State_police_Crawford_Erie.jpg | fap8pvagamdyfv51ozkmnptj9tf344i |
+--------------------------------+----------------+-----------------------------------------------+---------------------------------+
16 rows in set (2.687 sec)

I suggest simply dropping those rows

I've decided in the migration script just ignore inserting any row that has the same timestamp and sha1. If the files are the same and the upload time is the same, it was a duplicate entry and one of them should politely bite the dust.