querycache needs a primary key; see T17441
Description
Details
Subject | Repo | Branch | Lines +/- | |
---|---|---|---|---|
Add primary key qc_type and qc_value | mediawiki/core | master | +32 -4 |
Status | Subtype | Assigned | Task | |
---|---|---|---|---|
· · · | ||||
Resolved | LSobanski | T17441 Some tables lack unique or primary keys, may allow confusing duplicate data | ||
Stalled | None | T146571 Add a primary key to querycache table | ||
Open | None | T174513 Maintenance script to cleanup querycache | ||
· · · |
Event Timeline
Change 345968 had a related patch set (by Paladox) published:
[mediawiki/core@master] Add primary key qc_type and qc_value
We can't use qc_type, qc_value as a PK
mysql:wikiadmin@db1083 [enwiki]> select qc_type, qc_value, count(*) as cnt from querycache group by qc_type, qc_value HAVING cnt > 1 ORDER BY cnt DESC LIMIT 10; +------------------------+----------+-------+ | qc_type | qc_value | cnt | +------------------------+----------+-------+ | | 0 | 10000 | | Unwatchedpages | 0 | 5000 | | Fewestrevisions | 1 | 5000 | | Listredirects | 0 | 5000 | | Lonelypages | 0 | 4970 | | Unusedtemplates | 0 | 4602 | | Uncategorizedtemplates | 0 | 3699 | | Wantedtemplates | 2 | 2744 | | Wantedcategories | 1 | 2245 | | Wantedtemplates | 3 | 996 | +------------------------+----------+-------+ 10 rows in set (0.06 sec)
Is this intended, or makes sense?
$ mysql -h dbstore2002.codfw.wmnet -P3311 enwiki -e "SELECT qc_type, qc_namespace, qc_title, count(*) as count FROM querycache qc1 JOIN querycache qc2 USING (qc_type, qc_namespace, qc_title) GROUP BY qc_type, qc_namespace, qc_title HAVING count(*) > 1" ... | | 0 | "All-American_Boys" | 100 | | | 0 | "All_My_Trials" | 100 | | | 0 | "All_You_Zombies" | 100 | | | 0 | "Allie"_Murray | 100 | | | 0 | "Alon_Cohen" | 100 | | | 0 | "Alpha_Male''_Monty_Brown | 100 | | | 0 | "Always" | 100 | ... | Disambiguations | 0 | IDB | 9 | | Disambiguations | 0 | Inflation_(disambiguation) | 4 | | Disambiguations | 0 | Isostatic | 4 | | Disambiguations | 0 | Isotherm | 4 | | Disambiguations | 0 | Italian_dialects | 4 | | Disambiguations | 0 | Ndebele_people | 4 | | Disambiguations | 0 | Pythagorean | 4 | | Disambiguations | 0 | Red_Army_Choir | 4 | | Disambiguations | 108 | Chemical_elements | 4 | +-----------------+--------------+-------------------------------------------------------------------------------+-------+
... given the last col is the number of duplicates, not the value?
The qc_type being '' is definitely wrong, and I guess is an artefact of some historic bug, it doesn't seem to be on newer wikis. Filed T174513: Maintenance script to cleanup querycache to make MW clean this up, and then I'll look at sorting WMF production too (one way or another)
qc_value is rather arbitrary, per tables.sql Some sort of stored value. Sizes, counts... which isn't helpful
I'd presume that it makes no sense for any page to be listed under a specific qc_type multiple times
For example, for Disambiguations it's https://en.wikipedia.org/wiki/Special:DisambiguationPages - a page should only be listed once..
For some, it may make more sense, but I can't think of any at the moment...
Little bit more sane now
mysql:wikiadmin@db1055 [enwiki]> select distinct qc_type from querycache; +-------------------------+ | qc_type | +-------------------------+ | Ancientpages | | BrokenRedirects | | CrossNamespaceLinks | | Deadendpages | | DisambiguationPages | | Disambiguations | | DoubleRedirects | | Fewestrevisions | | GadgetUsage | | ListDuplicatedFiles | | Listredirects | | Lonelypages | | Longpages | | MediaStatistics | | Mostcategories | | Mostimages | | Mostinterwikis | | Mostlinked | | Mostlinkedcategories | | Mostlinkedtemplates | | Mostrevisions | | OrphanedTimedText | | Popularpages | | Shortpages | | Uncategorizedcategories | | Uncategorizedpages | | Uncategorizedtemplates | | Unusedcategories | | Unusedimages | | Unusedtemplates | | Unwatchedpages | | Wantedcategories | | Wantedfiles | | Wantedpages | | Wantedtemplates | | Withoutinterwiki | | fr_unreviewedpages | | fr_unreviewedpages_q | +-------------------------+ 38 rows in set (0.15 sec)
mysql:wikiadmin@db1055 [enwiki]> select qc_type, qc_value, count(*) as cnt from querycache group by qc_type, qc_value HAVING cnt > 1 ORDER BY cnt DESC LIMIT 10; +------------------------+----------+------+ | qc_type | qc_value | cnt | +------------------------+----------+------+ | Unwatchedpages | 0 | 5000 | | Listredirects | 0 | 5000 | | Fewestrevisions | 1 | 5000 | | Lonelypages | 0 | 4970 | | Unusedtemplates | 0 | 4601 | | Uncategorizedtemplates | 0 | 3694 | | Wantedtemplates | 2 | 2744 | | Wantedcategories | 1 | 2475 | | Wantedtemplates | 3 | 996 | | Withoutinterwiki | 0 | 854 | +------------------------+----------+------+ 10 rows in set (0.84 sec)
I'm confused what this query is doing.. It's probably just tiredness... But why are you joining querycache against itself?
lang=sqlmysql:wikiadmin@db1055 [enwiki]> SELECT * FROM querycache qc1 JOIN querycache qc2 USING (qc_type, qc_namespace, qc_title) WHERE qc_title='Alexis_(disambiguation)'; +-----------------+--------------+-------------------------+----------+----------+ | qc_type | qc_namespace | qc_title | qc_value | qc_value | +-----------------+--------------+-------------------------+----------+----------+ | Disambiguations | 0 | Alexis_(disambiguation) | 1601 | 1601 | | Disambiguations | 0 | Alexis_(disambiguation) | 1601 | 1614 | | Disambiguations | 0 | Alexis_(disambiguation) | 1601 | 1887 | | Disambiguations | 0 | Alexis_(disambiguation) | 1614 | 1601 | | Disambiguations | 0 | Alexis_(disambiguation) | 1614 | 1614 | | Disambiguations | 0 | Alexis_(disambiguation) | 1614 | 1887 | | Disambiguations | 0 | Alexis_(disambiguation) | 1887 | 1601 | | Disambiguations | 0 | Alexis_(disambiguation) | 1887 | 1614 | | Disambiguations | 0 | Alexis_(disambiguation) | 1887 | 1887 | +-----------------+--------------+-------------------------+----------+----------+ 9 rows in set (0.10 sec) mysql:wikiadmin@db1055 [enwiki]> SELECT * FROM querycache WHERE qc_title='Alexis_(disambiguation)'; +-----------------+----------+--------------+-------------------------+ | qc_type | qc_value | qc_namespace | qc_title | +-----------------+----------+--------------+-------------------------+ | Disambiguations | 1601 | 0 | Alexis_(disambiguation) | | Disambiguations | 1614 | 0 | Alexis_(disambiguation) | | Disambiguations | 1887 | 0 | Alexis_(disambiguation) | +-----------------+----------+--------------+-------------------------+ 3 rows in set (0.08 sec)
Sorry, I do not know what I had in mind with the query (I think initialy I wanted to show different values of qc_value or something, but anyway, what I meant to show, at least right now, was:
root@neodymium:~$ mysql -h dbstore2002.codfw.wmnet -P3311 enwiki -e "SELECT qc_type, qc_namespace, qc_title, count(*), GROUP_CONCAT(qc_value) FROM querycache GROUP BY qc_type, qc_namespace, qc_title HAVING count(*) > 1" +-----------------+--------------+-------------------------------------------+----------+------------------------+ | qc_type | qc_namespace | qc_title | count(*) | GROUP_CONCAT(qc_value) | +-----------------+--------------+-------------------------------------------+----------+------------------------+ | Disambiguations | 0 | A_cappella_(disambiguation) | 2 | 2411,7619 | | Disambiguations | 0 | Abba_(disambiguation) | 2 | 880,2470 | | Disambiguations | 0 | Alberta_(disambiguation) | 2 | 717,1504 | | Disambiguations | 0 | Alexis_(disambiguation) | 3 | 1887,1614,1601 | | Disambiguations | 0 | Algol_(disambiguation) | 2 | 1453,1394 | | Disambiguations | 0 | Athens_(disambiguation) | 2 | 1184,1216 | | Disambiguations | 0 | Atom_(disambiguation) | 2 | 1194,902 | | Disambiguations | 0 | Babel_fish | 2 | 3818,3824 | | Disambiguations | 0 | Baku_(disambiguation) | 2 | 4571,4566 | | Disambiguations | 0 | Bee_(disambiguation) | 2 | 3710,4654 | | Disambiguations | 0 | Beryl_(disambiguation) | 2 | 4074,4910 | | Disambiguations | 0 | Brewing_(disambiguation) | 2 | 4410,8748 | | Disambiguations | 0 | Citadel_(disambiguation) | 2 | 6695,7603 | | Disambiguations | 0 | Citation_(disambiguation) | 2 | 7603,6201 | | Disambiguations | 0 | Cloning_(disambiguation) | 2 | 6910,7800 | | Disambiguations | 0 | Court_of_Criminal_Appeal_(disambiguation) | 2 | 5257,643 | | Disambiguations | 0 | Critical_density | 2 | 5382,4116 | | Disambiguations | 0 | DES | 2 | 7927,5715 | | Disambiguations | 0 | Datum | 2 | 8495,6829 | | Disambiguations | 0 | Daugava | 2 | 3335,3536 | | Disambiguations | 0 | Diaspora_(disambiguation) | 2 | 8613,8591 | | Disambiguations | 0 | Drum_(disambiguation) | 2 | 9079,7950 | | Disambiguations | 0 | Electron_tube | 2 | 6944,2392 | | Disambiguations | 0 | England_(disambiguation) | 2 | 9322,9316 | | Disambiguations | 0 | Entropy_(disambiguation) | 2 | 7363,9891 | | Disambiguations | 0 | Europe_(disambiguation) | 2 | 9239,9240 | | Disambiguations | 0 | Existence_(disambiguation) | 2 | 9302,9593 | | Disambiguations | 0 | IDB | 3 | 6003,5451,3679 | | Disambiguations | 0 | Inflation_(disambiguation) | 2 | 5382,7958 | | Disambiguations | 0 | Isostatic | 2 | 6056,5236 | | Disambiguations | 0 | Isotherm | 2 | 4512,1418 | | Disambiguations | 0 | Italian_dialects | 2 | 8128,7578 | | Disambiguations | 0 | Ndebele_people | 2 | 3612,3464 | | Disambiguations | 0 | Pythagorean | 2 | 1485,9553 | | Disambiguations | 0 | Red_Army_Choir | 2 | 4567,8994 | | Disambiguations | 108 | Chemical_elements | 2 | 5659,904 | +-----------------+--------------+-------------------------------------------+----------+------------------------+
Haha. Glad it wasn't just me.
FWIW, the "Disambiguations" qc_type is going to be deleted. It (along with some others) is a value that's a few years old, and isn't used anymore. So I don't know if it's really representative of the rest of the tables contents
https://gerrit.wikimedia.org/r/#/c/376072 is WIP... :)
If it's not feasible to weed out duplicates, possibly even a (qc_type,qc_value,qc_namespace,qc_title) PK could be an improvement. For MySQL at least, it'd avoid bug 76252, and reduce overall storage space usage when using InnoDB.[1]
[1] https://dev.mysql.com/doc/refman/5.7/en/innodb-index-types.html
Change 345968 abandoned by Paladox:
[mediawiki/core@master] Add primary key qc_type and qc_value
Reason: