We want to run repo/maintenance/rebuildTermSqlIndex.php for www.wikidata.org.
Event Timeline
Hi,
Thanks for the heads up.
I assume this script has the proper throttling measures, i.e. it waits for replication?
I would suggest you agree on a running window for your script with Release-Engineering-Team and add it to https://wikitech.wikimedia.org/wiki/Deployments, so that we DBAs know when it will run, can plan any maintenance on s5 accordingly, and can keep it in mind for monitoring, graph checking, etc.
Thanks!
Mentioned in SAL (#wikimedia-operations) [2017-08-08T07:21:33Z] <Amir1> start of ladsgroup@terbium:~$ time mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildTermSqlIndex.php --wiki=wikidatawiki --entity-type=property --deduplicate-terms (T171460)
Mentioned in SAL (#wikimedia-operations) [2017-08-08T07:34:17Z] <Amir1> stopped the script and re-running without --deduplicate-terms (T171460)
Properties are done now. Since the number was small, I first ran the script with the "--deduplicate-terms" flag, but that caused terms on Wikidata to disappear temporarily (as was noted in the #wikidata IRC channel). So I stopped it and re-ran it without that flag.
ladsgroup@terbium:~$ time mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildTermSqlIndex.php --wiki=wikidatawiki --entity-type=property --deduplicate-terms
Processed up to page 18348389 (P1289)
^C
real 9m53.732s
user 0m47.020s
sys 0m3.364s
ladsgroup@terbium:~$ time mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildTermSqlIndex.php --wiki=wikidatawiki --entity-type=property
Processed up to page 18348389 (P1289)
Processed up to page 23894588 (P2425)
Processed up to page 30059627 (P3490)
Processed up to page 36567073 (P4153)
Done rebuilding property terms
Done.
real 15m1.195s
user 1m48.776s
sys 0m11.948s
This is also worth noting:
mysql:wikiadmin@db1092 [wikidatawiki]> select count(*) from wb_terms where term_entity_type = 'property' and term_full_entity_id is not null;
+----------+
| count(*) |
+----------+
|   138755 |
+----------+
1 row in set (36.51 sec)
Since there are 854M rows in wb_terms right now, my estimate is that the full run will take about 64 days at this rate.
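For reference, the 64-day figure can be reproduced with a simple linear extrapolation. This is only a sketch: it assumes the 138,755 rows returned by the query are the rows the 15m1.195s property run populated, and that throughput stays constant across the whole table. The function name estimate_days is made up for illustration.

```python
def estimate_days(rows_done: int, seconds_taken: float, total_rows: int) -> float:
    """Linearly extrapolate total runtime (in days) from one timed sample."""
    rows_per_second = rows_done / seconds_taken
    return total_rows / rows_per_second / 86400  # 86400 seconds per day


# Figures from the property run above: 138,755 rows in 15m1.195s,
# extrapolated to the 854M rows currently in wb_terms.
days = estimate_days(138_755, 15 * 60 + 1.195, 854_000_000)
print(f"{days:.0f} days")  # prints "64 days"
```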
Change 370626 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/puppet@production] mediawiki: Add puppetized cronjob for rebuildTermSqlIndex
Mentioned in SAL (#wikimedia-operations) [2017-08-08T11:22:12Z] <Amir1> start of ladsgroup@terbium:~$ timeout 3500s /usr/local/bin/mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildTermSqlIndex.php --wiki wikidatawiki --entity-type=item >>/tmp/rebuildTermSqlIndex.log 2>&1 (T171460)
T172776: Property labels missing on some items needs to be resolved before moving on, otherwise this will make a mess.
Mentioned in SAL (#wikimedia-operations) [2017-08-16T12:25:53Z] <Amir1> ladsgroup@terbium:~$ time /usr/local/bin/mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildTermSqlIndex.php --wiki testwikidatawiki --entity-type=property (T172776, T171460)
I confirm that after deploying wmf.14 the maintenance script is much faster and no longer removes any labels from testwikidata, i.e. we can run the script in prod starting tomorrow.
Mentioned in SAL (#wikimedia-operations) [2017-08-18T09:46:57Z] <Amir1> ladsgroup@terbium:~$ time /usr/local/bin/mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildTermSqlIndex.php --wiki wikidatawiki --entity-type=property (T171460)
All properties have labels and:
ladsgroup@terbium:~$ time /usr/local/bin/mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildTermSqlIndex.php --wiki wikidatawiki --entity-type=property
Processed up to page 18348389 (P1289)
Processed up to page 23894588 (P2425)
Processed up to page 30059627 (P3490)
Processed up to page 38094625 (P4171)
Done rebuilding property terms
Done.
real 0m5.712s
user 0m1.136s
sys 0m0.156s
So it will probably take around 8.6 hours instead of sixty days.
Mentioned in SAL (#wikimedia-operations) [2017-08-18T09:55:33Z] <Amir1> one small pass of ladsgroup@terbium:~$ time /usr/local/bin/mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildTermSqlIndex.php --wiki wikidatawiki --entity-type=entity (T171460)
Change 372533 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[mediawiki/extensions/Wikibase@master] Add sleep option to the rebuildTermSqlIndex maintenance script
Now it has been done up to Q200,000
Processed up to page 191988 (Q194046)
Processed up to page 192997 (Q195126)
Processed up to page 194002 (Q196205)
Processed up to page 195008 (Q197508)
Processed up to page 196029 (Q198703)
Processed up to page 197056 (Q200160)
I will continue working on it on Monday. It was slow and put pressure on the dispatcher. My patch might help, and as the Q numbers get bigger the average number of terms per entity decreases (the script works in batches of 1000 entities), so it will get faster.
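To illustrate the idea behind the sleep patch (pause after every batch so the dispatcher and replication can catch up), here is a minimal sketch. It is not the Wikibase implementation: rebuild_in_batches, rebuild_batch and the injectable sleep callback are hypothetical names, and the batch size of 1000 is the only figure taken from this task.

```python
import time
from typing import Callable, Iterable, Iterator, List


def batches(ids: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive lists of at most batch_size ids."""
    batch: List[int] = []
    for entity_id in ids:
        batch.append(entity_id)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter, batch
        yield batch


def rebuild_in_batches(
    ids: Iterable[int],
    batch_size: int,
    sleep_seconds: float,
    rebuild_batch: Callable[[List[int]], None],
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Process ids batch by batch, pausing after each batch; return the count."""
    done = 0
    for batch in batches(ids, batch_size):
        rebuild_batch(batch)   # hypothetical per-batch work
        done += len(batch)
        sleep(sleep_seconds)   # throttle: give other consumers time to catch up
    return done
```

With a batch size of 1000 and a sleep of a few seconds, total runtime grows by the sleep time multiplied by the number of batches, which is the trade-off for lower pressure on the dispatcher.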
Change 372533 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add sleep option to the rebuildTermSqlIndex maintenance script
Mentioned in SAL (#wikimedia-operations) [2017-08-22T10:21:38Z] <Amir1> another run of rebuildTermSqlIndex (T171460)
Mentioned in SAL (#wikimedia-operations) [2017-08-24T07:24:56Z] <Amir1> starting the run for rebuildTermIndex (T171460)
Change 370626 merged by Jcrespo:
[operations/puppet@production] mediawiki: Add puppetized cronjob for rebuildTermSqlIndex
Change 373507 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] wikidata-maintenance: Emergency stop of rebuildTermSqlIndex
So this is deployed to production; we did a test run and it seems to work as intended.
I left a "disable" patch, https://gerrit.wikimedia.org/r/373507, with deployment instructions there, in case fellow ops need to stop the job in an emergency. It is ready to go and only the instructions have to be followed.
Change 373507 merged by Jcrespo:
[operations/puppet@production] wikidata-maintenance: Emergency stop of rebuildTermSqlIndex
@Ladsgroup: This should be easy to fix:
root@terbium:/var/log/wikidata$ ls -lha rebuildTermSql*
-rw-rw-r-- 1 www-data www-data 1.7K Aug 25 06:30 rebuildTermSqlIndex.log
-rw-r--r-- 1 www-data www-data  130 Aug  8 13:15 rebuildTermSqlIndex.log-20170810.gz
-rw-rw-r-- 1 www-data www-data  71K Aug 25 06:27 rebuildTermSqlIndex.log-20170825
root@terbium:/var/log/wikidata$ cat rebuildTermSqlIndex.log
ERROR: from-id parameter needs a value after it

Rebuild the index in the wb terms table (among other things populating term_full_entity_id).

Usage: php rebuildTermSqlIndex.php [--batch-size|--conf|--dbpass|--dbuser|--deduplicate-terms|--entity-type|--from-id|--globals|--help|--memory-limit|--profiler|--quiet|--rebuild-all-terms|--server|--sleep|--wiki]

Generic maintenance parameters:
    --help (-h): Display this help message
    --quiet (-q): Whether to supress non-error output
    --conf: Location of LocalSettings.php, if not default
    --wiki: For specifying the wiki ID
    --globals: Output globals at the end of processing for debugging
    --memory-limit: Set a specific memory limit for the script, "max" for no limit or "default" to avoid changing it
    --server: The protocol and server name to use in URLs, e.g. http://en.wikipedia.org. This is sometimes necessary because server name detection may fail in command line scripts.
    --profiler: Profiler output format (usually "text")

Script dependant parameters:
    --dbuser: The DB user to use for this script
    --dbpass: The password to use for this script

Script specific parameters:
    --batch-size: Number of rows to update per batch (Default: 1000)
    --deduplicate-terms: Remove duplicate entries in the index (might slow the run down). Redundant when rebuild-all-terms option is specified.
    --entity-type: Only rebuild terms for specified entity type (e.g. 'item', 'property')
    --from-id: First row (page id) to start updating from
    --rebuild-all-terms: Rebuilds all terms of the entity (requires loading data of each processed entity)
    --sleep: Sleep time (in seconds) between every batch
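One way to avoid this class of error is to pass --from-id only when a previous progress line can actually be found. The sketch below assumes the failure happens because the cron job derives --from-id from the last "Processed up to page N" line in the log and a freshly rotated log has no such line; that cause is an assumption, not something confirmed here. The helper names are hypothetical, and resuming from the last reported page (rather than the one after it) is a deliberate, conservative choice that may reprocess one batch.

```python
import re
from typing import List, Optional

# Matches progress lines such as "Processed up to page 197056 (Q200160)".
PROGRESS = re.compile(r"Processed up to page (\d+) \(")


def last_processed_page(log_text: str) -> Optional[int]:
    """Return the page id from the last progress line, or None if there is none."""
    matches = PROGRESS.findall(log_text)
    return int(matches[-1]) if matches else None


def build_args(log_text: str) -> List[str]:
    """Build script arguments, adding --from-id only when a page id is known."""
    args = ["--wiki", "wikidatawiki", "--entity-type=item"]
    page = last_processed_page(log_text)
    if page is not None:
        args += ["--from-id", str(page)]
    return args
```

With an empty or error-only log this returns the base arguments, so the script starts from the beginning instead of failing on a valueless --from-id.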
Change 373854 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/puppet@production] mediawiki: fix logrotating in wikidata cronjob
Change 373854 merged by Volans:
[operations/puppet@production] mediawiki: fix logrotating in wikidata cronjob
Change 374342 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] mediawiki: fix logrotating in wikidata cronjob (2)
Change 374342 merged by Volans:
[operations/puppet@production] mediawiki: fix logrotating in wikidata cronjob (2)
Change 375352 had a related patch set uploaded (by Thiemo Mättig (WMDE); owner: Thiemo Mättig (WMDE)):
[mediawiki/extensions/Wikibase@master] Fix minor issues recently introduced in TermSqlIndexBuilder and test
Change 375741 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/puppet@production] mediawiki: make the wikidata wb_terms rebuild a little bit faster
Change 375352 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Fix minor issues recently introduced in TermSqlIndexBuilder and test
Change 375741 merged by Jcrespo:
[operations/puppet@production] mediawiki: make the wikidata wb_terms rebuild a little bit faster
Change 381421 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/puppet@production] mediawiki: stop rebuilding wb_terms table
Change 381421 merged by Jcrespo:
[operations/puppet@production] mediawiki: stop rebuilding wb_terms table
Not yet done:
mysql:wikiadmin@db1070 [wikidatawiki]> SELECT COUNT(*) FROM wb_terms WHERE term_full_entity_id IS NULL;
+----------+
| COUNT(*) |
+----------+
|      674 |
+----------+
1 row in set (0.00 sec)
For that, see T176852: Redirects should not be in wb_terms table. The maintenance script can't clean them up (it will probably skip over them).