
Populate term_full_entity_id on www.wikidata.org
Closed, Resolved (Public)

Description

We want to run repo/maintenance/rebuildTermSqlIndex.php for www.wikidata.org.


Event Timeline

Marostegui subscribed.

Hi,

Thanks for the heads up.
I assume this script has the proper throttling measures, i.e. it waits for replication?
I would suggest you agree on a running window for your script with Release-Engineering-Team and add it to https://wikitech.wikimedia.org/wiki/Deployments, so that we DBAs know when it will run, can plan any maintenance on s5 around it, and can keep it in mind for monitoring, graph checking, etc.

Thanks!

I assume this script has the proper throttling measures: ie, wait for replication?

It does - if anything, it waits for replication more often than necessary.

Mentioned in SAL (#wikimedia-operations) [2017-08-08T07:21:33Z] <Amir1> start of ladsgroup@terbium:~$ time mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildTermSqlIndex.php --wiki=wikidatawiki --entity-type=property --deduplicate-terms (T171460)

Mentioned in SAL (#wikimedia-operations) [2017-08-08T07:34:17Z] <Amir1> stopped the script and re-running without --deduplicate-terms (T171460)

Properties are done now. Since the number was small, I first ran it with the --deduplicate-terms flag, but that caused terms on Wikidata to disappear temporarily (as was noted in the #wikidata IRC channel). So I stopped it and re-ran it without that flag.

ladsgroup@terbium:~$ time mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildTermSqlIndex.php --wiki=wikidatawiki --entity-type=property --deduplicate-terms
Processed up to page 18348389 (P1289)
^C
real	9m53.732s
user	0m47.020s
sys	0m3.364s
ladsgroup@terbium:~$ time mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildTermSqlIndex.php --wiki=wikidatawiki --entity-type=property
Processed up to page 18348389 (P1289)
Processed up to page 23894588 (P2425)
Processed up to page 30059627 (P3490)
Processed up to page 36567073 (P4153)
Done rebuilding property terms
Done.

real	15m1.195s
user	1m48.776s
sys	0m11.948s

This is also worth noting:

mysql:wikiadmin@db1092 [wikidatawiki]> select count(*) from wb_terms where term_entity_type = 'property' and term_full_entity_id is not null;
+----------+
| count(*) |
+----------+
|   138755 |
+----------+
1 row in set (36.51 sec)

Since there are 854M rows in wb_terms right now, and the 138,755 property rows above took about 15 minutes, my estimate is that the full run will take 64 days.
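The 64-day figure can be reproduced with a simple linear extrapolation. The sketch below is illustrative only: the function name estimate_days is made up here, and it assumes that throughput per row stays constant across the whole table, which the later runs show is not guaranteed.

```python
def estimate_days(total_rows: int, sample_rows: int, sample_seconds: float) -> float:
    """Linearly extrapolate a timed sample run to the whole table, in days."""
    total_seconds = total_rows / sample_rows * sample_seconds
    return total_seconds / 86_400  # seconds per day


# Inputs taken from the comment above: 854M rows in wb_terms,
# 138,755 property rows rebuilt in roughly 15 minutes (real 15m1.195s).
days = estimate_days(854_000_000, 138_755, 15 * 60)
print(f"Estimated duration: {days:.1f} days")
```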

Change 370626 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/puppet@production] mediawiki: Add puppetized cronjob for rebuildTermSqlIndex

https://gerrit.wikimedia.org/r/370626

Mentioned in SAL (#wikimedia-operations) [2017-08-08T11:22:12Z] <Amir1> start of ladsgroup@terbium:~$ timeout 3500s /usr/local/bin/mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildTermSqlIndex.php --wiki wikidatawiki --entity-type=item >>/tmp/rebuildTermSqlIndex.log 2>&1 (T171460)

Mentioned in SAL (#wikimedia-operations) [2017-08-16T12:25:53Z] <Amir1> ladsgroup@terbium:~$ time /usr/local/bin/mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildTermSqlIndex.php --wiki testwikidatawiki --entity-type=property (T172776, T171460)

I confirm that after deploying wmf.14 the maintenance script is much faster and no longer removes any labels from testwikidata, i.e. we can run the script in prod starting tomorrow.

Mentioned in SAL (#wikimedia-operations) [2017-08-18T09:46:57Z] <Amir1> ladsgroup@terbium:~$ time /usr/local/bin/mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildTermSqlIndex.php --wiki wikidatawiki --entity-type=property (T171460)

All properties still have labels, and:

ladsgroup@terbium:~$ time /usr/local/bin/mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildTermSqlIndex.php --wiki wikidatawiki --entity-type=property
Processed up to page 18348389 (P1289)
Processed up to page 23894588 (P2425)
Processed up to page 30059627 (P3490)
Processed up to page 38094625 (P4171)
Done rebuilding property terms
Done.

real	0m5.712s
user	0m1.136s
sys	0m0.156s

So it will probably take around 8.6 hours instead of sixty days.

Mentioned in SAL (#wikimedia-operations) [2017-08-18T09:55:33Z] <Amir1> one small pass of ladsgroup@terbium:~$ time /usr/local/bin/mwscript extensions/Wikidata/extensions/Wikibase/repo/maintenance/rebuildTermSqlIndex.php --wiki wikidatawiki --entity-type=entity (T171460)

Change 372533 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[mediawiki/extensions/Wikibase@master] Add sleep option to the rebuildTermSqlIndex maintenance script

https://gerrit.wikimedia.org/r/372533

It has now been done up to about Q200,000:

Processed up to page 191988 (Q194046)
Processed up to page 192997 (Q195126)
Processed up to page 194002 (Q196205)
Processed up to page 195008 (Q197508)
Processed up to page 196029 (Q198703)
Processed up to page 197056 (Q200160)

I will continue working on it on Monday. It was slow and put pressure on the dispatcher. My patch might help, and as the Q numbers get bigger, the average number of terms per entity decreases (the script works in batches of 1000 entities), so it will get faster.
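As a rough illustration of the throttling discussed in this thread (batches of 1000 entities, waiting for replication, and an optional sleep between batches), here is a minimal sketch. It is not the Wikibase implementation: process_batch and wait_for_replication are hypothetical callbacks standing in for the real database work, and rebuild is a made-up name.

```python
import time
from typing import Callable, Iterable, Iterator, List


def batches(ids: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive lists of at most `size` ids."""
    batch: List[int] = []
    for entity_id in ids:
        batch.append(entity_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def rebuild(
    ids: Iterable[int],
    process_batch: Callable[[List[int]], None],   # hypothetical: rewrites one batch
    wait_for_replication: Callable[[], None],     # hypothetical: blocks until replicas catch up
    batch_size: int = 1000,
    sleep_seconds: float = 0.0,
) -> int:
    """Process ids batch by batch, throttling after each batch; return the batch count."""
    done = 0
    for batch in batches(ids, batch_size):
        process_batch(batch)
        wait_for_replication()
        done += 1
        if sleep_seconds > 0:
            time.sleep(sleep_seconds)  # extra pause to relieve other consumers
    return done
```

A larger sleep value trades total run time for less pressure on whatever else reads the same tables.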

Change 372533 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add sleep option to the rebuildTermSqlIndex maintenance script

https://gerrit.wikimedia.org/r/372533

Mentioned in SAL (#wikimedia-operations) [2017-08-22T10:21:38Z] <Amir1> another run of rebuildTermSqlIndex (T171460)

Mentioned in SAL (#wikimedia-operations) [2017-08-24T07:24:56Z] <Amir1> starting the run for rebuildTermIndex (T171460)

Change 370626 merged by Jcrespo:
[operations/puppet@production] mediawiki: Add puppetized cronjob for rebuildTermSqlIndex

https://gerrit.wikimedia.org/r/370626

Change 373507 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] wikidata-maintenance: Emergency stop of rebuildTermSqlIndex

https://gerrit.wikimedia.org/r/373507

So this is deployed to production; we did a test run and it seems to work as intended.

I left a "disable" patch at https://gerrit.wikimedia.org/r/373507, with deployment instructions there, in case fellow ops need to disable the cron job in an emergency. It is ready to go; only the instructions have to be followed.

Change 373507 merged by Jcrespo:
[operations/puppet@production] wikidata-maintenance: Emergency stop of rebuildTermSqlIndex

https://gerrit.wikimedia.org/r/373507

@Ladsgroup: This should be easy to fix:

root@terbium:/var/log/wikidata$ ls -lha rebuildTermSql*
-rw-rw-r-- 1 www-data www-data 1.7K Aug 25 06:30 rebuildTermSqlIndex.log
-rw-r--r-- 1 www-data www-data  130 Aug  8 13:15 rebuildTermSqlIndex.log-20170810.gz
-rw-rw-r-- 1 www-data www-data  71K Aug 25 06:27 rebuildTermSqlIndex.log-20170825
root@terbium:/var/log/wikidata$ cat rebuildTermSqlIndex.log

ERROR: from-id parameter needs a value after it


Rebuild the index in the wb terms table (among other things populating
term_full_entity_id).

Usage: php rebuildTermSqlIndex.php [--batch-size|--conf|--dbpass|--dbuser|--deduplicate-terms|--entity-type|--from-id|--globals|--help|--memory-limit|--profiler|--quiet|--rebuild-all-terms|--server|--sleep|--wiki]

Generic maintenance parameters:
    --help (-h): Display this help message
    --quiet (-q): Whether to supress non-error output
    --conf: Location of LocalSettings.php, if not default
    --wiki: For specifying the wiki ID
    --globals: Output globals at the end of processing for debugging
    --memory-limit: Set a specific memory limit for the script, "max"
        for no limit or "default" to avoid changing it
    --server: The protocol and server name to use in URLs, e.g.
        http://en.wikipedia.org. This is sometimes necessary because server name
        detection may fail in command line scripts.
    --profiler: Profiler output format (usually "text")

Script dependant parameters:
    --dbuser: The DB user to use for this script
    --dbpass: The password to use for this script

Script specific parameters:
    --batch-size: Number of rows to update per batch (Default: 1000)
    --deduplicate-terms: Remove duplicate entries in the index (might
        slow the run down).Redundant when rebuild-all-terms option is specified.
    --entity-type: Only rebuild terms for specified entity type (e.g.
        'item', 'property')
    --from-id: First row (page id) to start updating from
    --rebuild-all-terms: Rebuilds all terms of the entity (requires
        loading data of each processed entity)
    --sleep: Sleep time (in seconds) between every batch
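The error above means --from-id was passed without a value. How the cron job derives that value is not shown in this thread, so the following is only a sketch of one way to guard against it: read the last "Processed up to page N (...)" line from a previous log and add --from-id only when a page id was actually found. The helper names last_page_id and build_args are hypothetical, and resuming from the last reported page id (rather than the one after it) is an assumption.

```python
import re
from typing import List, Optional

# Matches progress lines such as: Processed up to page 197056 (Q200160)
PROGRESS = re.compile(r"^Processed up to page (\d+) \(([PQ]\d+)\)")


def last_page_id(log_text: str) -> Optional[int]:
    """Return the page id from the last progress line, or None if there is none."""
    last = None
    for line in log_text.splitlines():
        match = PROGRESS.match(line.strip())
        if match:
            last = int(match.group(1))
    return last


def build_args(log_text: str) -> List[str]:
    """Build script arguments, omitting --from-id when no progress was logged."""
    args = ["--wiki", "wikidatawiki", "--entity-type=item"]
    page_id = last_page_id(log_text)
    if page_id is not None:
        args += ["--from-id", str(page_id)]
    return args
```

With an empty or freshly rotated log, build_args returns no --from-id at all, so the script would start from the beginning instead of failing on a missing value.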

Change 373854 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/puppet@production] mediawiki: fix logrotating in wikidata cronjob

https://gerrit.wikimedia.org/r/373854

Change 373854 merged by Volans:
[operations/puppet@production] mediawiki: fix logrotating in wikidata cronjob

https://gerrit.wikimedia.org/r/373854

Change 374342 had a related patch set uploaded (by Volans; owner: Volans):
[operations/puppet@production] mediawiki: fix logrotating in wikidata cronjob (2)

https://gerrit.wikimedia.org/r/374342

Change 374342 merged by Volans:
[operations/puppet@production] mediawiki: fix logrotating in wikidata cronjob (2)

https://gerrit.wikimedia.org/r/374342

Change 375352 had a related patch set uploaded (by Thiemo Mättig (WMDE); owner: Thiemo Mättig (WMDE)):
[mediawiki/extensions/Wikibase@master] Fix minor issues recently introduced in TermSqlIndexBuilder and test

https://gerrit.wikimedia.org/r/375352

Change 375741 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/puppet@production] mediawiki: make the wikidata wb_terms rebuild a little bit faster

https://gerrit.wikimedia.org/r/375741

Change 375352 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Fix minor issues recently introduced in TermSqlIndexBuilder and test

https://gerrit.wikimedia.org/r/375352

Change 375741 merged by Jcrespo:
[operations/puppet@production] mediawiki: make the wikidata wb_terms rebuild a little bit faster

https://gerrit.wikimedia.org/r/375741

Change 381421 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/puppet@production] mediawiki: stop rebuilding wb_terms table

https://gerrit.wikimedia.org/r/381421

Change 381421 merged by Jcrespo:
[operations/puppet@production] mediawiki: stop rebuilding wb_terms table

https://gerrit.wikimedia.org/r/381421

Not yet done:

mysql:wikiadmin@db1070 [wikidatawiki]> SELECT COUNT(*) FROM wb_terms WHERE term_full_entity_id IS NULL;
+----------+
| COUNT(*) |
+----------+
|      674 |
+----------+
1 row in set (0.00 sec)

For those rows, see T176852: Redirects should not be in wb_terms table. The maintenance script can't clean them up (it will probably skip over them).