Page MenuHomePhabricator

Populate term_full_entity_id on test.wikidata.org
Closed, ResolvedPublic

Description

We want to run repo/maintenance/rebuildTermSqlIndex.php for test.wikidata.org before running it on www.wikidata.org.

Please time the run, so it give us some idea of how long a complete rebuild of the table will need on the production site.

Related Objects

StatusSubtypeAssignedTask
Declineddchen
OpenNone
OpenNone
DuplicateNone
OpenNone
OpenNone
DuplicateNone
OpenNone
OpenNone
OpenNone
DuplicateNone
InvalidLydia_Pintscher
OpenNone
OpenNone
StalledNone
OpenNone
ResolvedAddshore
Resolvedthiemowmde
ResolvedAddshore
DeclinedNone
OpenNone
Resolvedhoo
ResolvedLydia_Pintscher
ResolvedNone
DeclinedNone
InvalidLydia_Pintscher
ResolvedLadsgroup
ResolvedAddshore
ResolvedLadsgroup
DeclinedNone
ResolvedLadsgroup
Resolvedaude
ResolvedLadsgroup

Event Timeline

Marostegui triaged this task as Medium priority.Jul 24 2017, 12:22 PM
Marostegui added a subscriber: Marostegui.

Hi,

My answer here is pretty much the one I gave at: T171460#3465634

Hi,

My answer here is pretty much the one I gave at: T171460#3465634

The idea was to run this on the test site soon, and get some timing info, so we can decide whether we can use this script on the production site.

Do you think we need a maintenance window for running this on the test site?

aude added a comment.Jul 24 2017, 1:09 PM

i can maybe do this later today / this evening (US time)

Hi,

My answer here is pretty much the one I gave at: T171460#3465634

The idea was to run this on the test site soon, and get some timing info, so we can decide whether we can use this script on the production site.

Do you think we need a maintenance window for running this on the test site?

Probably not, but I would suggest you !log it on SAL, so we can know from when it started and when it ended (useful in case we have to investigate things by looking at the graphs, so we can know that something apart from usual traffic was being run)

Thanks!

Per RelEng policy, maintenance scripts that take more than one hour to finish must be reserved via a deployment window beforehand. My suggestion is to start running it and if it took more than one hour, stop it and get a window. Definitely needs to be logged in SAL though.

If it takes an hour for test.wikidata.org, I guess we know that we have to improve the script before we can run it on the live site.

i can maybe do this later today / this evening (US time)

This wasn't run in the end, no? Just to confirm :)

aude added a comment.Jul 25 2017, 12:22 PM

done. (took 35 min... test.wikidata has 74000 items and 37000 properties)

aude closed this task as Resolved.Jul 25 2017, 12:25 PM
aude claimed this task.

@aude As far as I can see, Items on the test site have only one label and description each, right? The live site has 500 times as many items, and about two labels and descriptions per item (I guess, the dashboard seems dead).

This leads to a naive estimate of factor 1000, so 35k minutes, that's about 24 days. That's probably acceptable, but if we can do better, we should...

I'm curious how this would perform with https://gerrit.wikimedia.org/r/#/c/358531/ applied and deduplication disabled.

A totally unrepresentative benchmark on my laptop indicates that populating term_full_entity_id without the full rebuild is about 20x faster. That would bring this down to less than two days for the live site. I think we should consider doing that.

But then, we still have to get rid of duplicates somehow. A specialized script is probably the best approach for that.

But then, we still have to get rid of duplicates somehow. A specialized script is probably the best approach for that.

The thing is that finding and fixing the duplicates is a very resource/time-consuming task. The select query to find them took 21 hours and this clean up needs to be done eventually, so IMO this is a good opportunity to get it done and over, even with cost of running a script for twenty days.

@Ladsgroup yey, but I think there's a middle way that is much faster than a complete rebuild, and more robust than a mega-query. I imagine an algorithm like this:

Declare an empty list of row-ids to delete.
Iterate over all entities. For each entity:
  Load all terms into an array.
  In that array, find all duplicates
    and add their row-ids to the deletion list.
  When the deletion list hits some limit:
    delete the rows that are in the deletion list
    call commitAndWaitForReplication. 
    reset the deletion list

This can be stopped and continues at any time, does batched insert and wait, and only runs small, trivial select queries.

Yeah, writing something like that won't be hard.