Populate term_full_entity_id on test.wikidata.org
Closed, Resolved (Public)

Assigned To

Authored By

	daniel
	Jul 24 2017, 11:23 AM

Description

We want to run repo/maintenance/rebuildTermSqlIndex.php for test.wikidata.org before running it on www.wikidata.org.

Please time the run, so that it gives us some idea of how long a complete rebuild of the table will take on the production site.

Related Objects

Status	Subtype	Assigned	Task
Declined		dchen	T118706 Conduct heuristic evaluation of image upload and insert flow in VisualEditor
Open		None	T115858 Design improvements for mw.ForeignStructuredUpload.BookletLayout
Open		None	T115865 Insert image in content immediately after it's uploaded, skipping the "General settings" step
Duplicate		None	T115864 Figure out if the description of the image can be used as the caption on-wiki
Open	Feature	None	T53032 When inserting an image, set its caption by default to be the Commons image description
Open	Feature	None	T39534 Wikimedia Commons should support searching by color
Duplicate		None	T39535 Wikimedia Commons should support filtering by color
Resolved		None	T19503 Provide metadata support on Wikimedia Commons
Resolved		None	T51662 VisualEditor: Use Multimedia/Wikidata's proposed rich structured meta-data in the image insertion dialog
Resolved		None	T68108 [Epic] Store media information for files on Wikimedia Commons as structured data
Duplicate		None	T66288 basic support for structured data on mediawiki files
Invalid		Lydia_Pintscher	T76012 make use of new entity type for multimedia / structured data of media files
Open		None	T109579 [Epic] Give more sister projects access to Wikidata
Open		None	T187900 There is no way to reference a specific quote on Wikiquote
Stalled		None	T71753 [Story] Wikibase / Wikidata support on Wikiquote
Open		None	T67626 [Epic] Support for queries on-wiki (automated list generation)
Resolved		Addshore	T76019 [Story] Support new types of Entities in Wikibase Client
Resolved		thiemowmde	T135650 [Task] Migrate PropertySuggester away from assuming all entities are numeric
Resolved		Addshore	T75496 [Epic] Support new types of Entities in Wikibase Repository
Declined		None	T58711 [Task] Update the wb_terms table so it does not have a numeric entity id
Open		None	T30599 Deadlock tracking bug (tracking)
Resolved		hoo	T111535 Wikibase\Repo\Store\SQL\EntityPerPageTable::{closure} creating high number of deadlocks
Resolved		Lydia_Pintscher	T146637 Wikidata 2016 Q4 goals
Resolved		None	T86530 Replace wb_terms table with more specialized mechanisms for terms (tracking)
Declined		None	T51982 Add missing wb_entity_per_page entries on LinksUpdate
Invalid		Lydia_Pintscher	T70176 EntityPerPageTable class should be usable from the client
Resolved		Ladsgroup	T67333 Wikibase\EntityPerPageTable::getItemsWithoutSitelinks slow query with large LIMIT offset
Resolved		Addshore	T114902 Remove numeric entity IDs from database schema
Resolved		Ladsgroup	T95685 Drop wb_entity_per_page table
Declined		None	T114903 Migrate wb_terms to using prefixed entity IDs instead of numeric IDs
Resolved		Ladsgroup	T171460 Populate term_full_entity_id on www.wikidata.org
Resolved		aude	T171461 Populate term_full_entity_id on test.wikidata.org
Resolved		Ladsgroup	T165197 Change configuration of test Wikidata to write term_full_entity_id

Event Timeline

daniel created this task. Jul 24 2017, 11:23 AM

daniel added a subtask: T165197: Change configuration of test Wikidata to write term_full_entity_id.

Hi,

My answer here is pretty much the one I gave at: T171460#3465634

Marostegui moved this task from Triage to Blocked external/Not db team on the DBA board. Jul 24 2017, 12:24 PM

In T171461#3465640, @Marostegui wrote:

Hi,

My answer here is pretty much the one I gave at: T171460#3465634

The idea was to run this on the test site soon, and get some timing info, so we can decide whether we can use this script on the production site.

Do you think we need a maintenance window for running this on the test site?

i can maybe do this later today / this evening (US time)

In T171461#3465813, @daniel wrote:

In T171461#3465640, @Marostegui wrote:

Hi,

My answer here is pretty much the one I gave at: T171460#3465634

The idea was to run this on the test site soon, and get some timing info, so we can decide whether we can use this script on the production site.

Do you think we need a maintenance window for running this on the test site?

Probably not, but I would suggest you !log it on SAL, so we know when it started and when it ended. That is useful if we have to investigate things by looking at the graphs, because it tells us that something other than the usual traffic was running.

Thanks!

Per RelEng policy, maintenance scripts that take more than one hour to finish must be reserved via a deployment window beforehand. My suggestion is to start running it and, if it takes more than one hour, stop it and get a window. It definitely needs to be logged in SAL though.

If it takes an hour for test.wikidata.org, I guess we know that we have to improve the script before we can run it on the live site.

In T171461#3465817, @aude wrote:

i can maybe do this later today / this evening (US time)

This wasn't run in the end, no? Just to confirm :)

done. (took 35 min... test.wikidata has 74000 items and 37000 properties)

aude closed this task as Resolved. Jul 25 2017, 12:25 PM

aude claimed this task.

@aude As far as I can see, Items on the test site have only one label and description each, right? The live site has 500 times as many items, and about two labels and descriptions per item (that's a guess; the dashboard seems dead).

This leads to a naive estimate of a factor of 1000, so 35k minutes, which is about 24 days. That's probably acceptable, but if we can do better, we should...
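The arithmetic behind that naive estimate can be written out as a quick check. This is only a sketch of the back-of-the-envelope calculation; the constant and function names are made up for illustration, and the scaling factors are the rough guesses given in the comment above, not measured values.

```python
# Rough scaling estimate, using the numbers quoted in this thread.
TEST_RUN_MINUTES = 35        # measured run time on test.wikidata.org
ITEM_FACTOR = 500            # guess: live site has ~500x as many items
TERMS_PER_ITEM_FACTOR = 2    # guess: ~2x as many labels/descriptions per item


def estimate_minutes(test_minutes: float, item_factor: float, term_factor: float) -> float:
    """Scale the test run time linearly by item count and terms per item."""
    return test_minutes * item_factor * term_factor


def minutes_to_days(minutes: float) -> float:
    """Convert minutes to days."""
    return minutes / (60 * 24)


full_rebuild_minutes = estimate_minutes(TEST_RUN_MINUTES, ITEM_FACTOR, TERMS_PER_ITEM_FACTOR)
full_rebuild_days = minutes_to_days(full_rebuild_minutes)
print(f"{full_rebuild_minutes:.0f} minutes, about {full_rebuild_days:.1f} days")
```

The estimate assumes run time grows linearly with the number of term rows, which may not hold on the production database.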

I'm curious how this would perform with https://gerrit.wikimedia.org/r/#/c/358531/ applied and deduplication disabled.

A totally unrepresentative benchmark on my laptop indicates that populating term_full_entity_id without the full rebuild is about 20x faster. That would bring this down to less than two days for the live site. I think we should consider doing that.

But then, we still have to get rid of duplicates somehow. A specialized script is probably the best approach for that.

In T171461#3470167, @daniel wrote:

But then, we still have to get rid of duplicates somehow. A specialized script is probably the best approach for that.

The thing is that finding and fixing the duplicates is a very resource- and time-consuming task. The select query to find them took 21 hours, and this cleanup needs to be done eventually, so IMO this is a good opportunity to get it over and done with, even at the cost of running a script for twenty days.

@Ladsgroup yey, but I think there's a middle way that is much faster than a complete rebuild, and more robust than a mega-query. I imagine an algorithm like this:

Declare an empty list of row-ids to delete.
Iterate over all entities. For each entity:
  Load all terms into an array.
  In that array, find all duplicates
    and add their row-ids to the deletion list.
  When the deletion list hits some limit:
    delete the rows that are in the deletion list
    call commitAndWaitForReplication. 
    reset the deletion list

This can be stopped and continued at any time, does batched deletes and waits for replication, and only runs small, trivial select queries.
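The steps above could be sketched roughly as follows. This is an illustration in Python, not the actual Wikibase maintenance script: the row shape, the function names, and the `load_terms`, `delete_rows` and `wait_for_replication` callbacks are all hypothetical stand-ins for the real database access and for commitAndWaitForReplication. The sketch also flushes the remaining deletion list at the end, which the outline does not spell out.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical row shape: (row_id, entity_id, term_type, language, text)
TermRow = Tuple[int, str, str, str, str]


def find_duplicate_row_ids(terms: Iterable[TermRow]) -> List[int]:
    """Return row ids of terms that repeat an earlier (lower row id) term."""
    seen = set()
    duplicates = []
    for row_id, entity_id, term_type, language, text in sorted(terms):
        key = (entity_id, term_type, language, text)
        if key in seen:
            duplicates.append(row_id)
        else:
            seen.add(key)
    return duplicates


def remove_duplicate_terms(
    entity_ids: Iterable[str],
    load_terms: Callable[[str], List[TermRow]],   # assumed: small per-entity select
    delete_rows: Callable[[List[int]], None],     # assumed: batched delete by row id
    wait_for_replication: Callable[[], None],     # assumed: commit and wait for replicas
    batch_size: int = 100,
) -> int:
    """Delete duplicate term rows in batches; returns the number of rows deleted."""
    pending: List[int] = []
    deleted = 0
    for entity_id in entity_ids:
        pending.extend(find_duplicate_row_ids(load_terms(entity_id)))
        if len(pending) >= batch_size:
            delete_rows(pending)
            wait_for_replication()
            deleted += len(pending)
            pending = []
    if pending:
        # Final flush for whatever is left below the batch limit.
        delete_rows(pending)
        wait_for_replication()
        deleted += len(pending)
    return deleted
```

Because each entity is processed independently, a run could be resumed from the last processed entity id, which is what makes the approach interruptible.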

Yeah, writing something like that won't be hard.

Ladsgroup closed subtask T165197: Change configuration of test Wikidata to write term_full_entity_id as Resolved. Aug 1 2017, 10:17 PM
