
Investigate orphan rows in the new terms tables after T237984
Closed, Resolved · Public

Description

We suspect that T237984 might have left quite a few orphan rows in the tables.
We should investigate how many there are and if we think it is worth removing them.
We might want to remove them in beta, test and production.
We might want to look for them using either SQL or hadoop

Event Timeline

Fun thing is that we basically can't be 100% sure whether any given row is an orphan unless we lock both tables and block all writes to them, which is not possible. Given that these tables are huge and the queries to find orphan rows take a long time, we can't rule out that writes happened in the meantime and made some rows merely look like orphans. The easiest way to mitigate most of the issue is to exclude all rows that have a high PK id.
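
A minimal sketch of that mitigation (the cutoff value below is purely illustrative; in practice it would be the maximum id read just before starting the scan):

-- record the current maximum id before the scan starts
SELECT MAX(wbtl_id) FROM wikidatawiki.wbt_term_in_lang;

-- then only treat rows below that cutoff as candidate orphans
SELECT wbtl_id
FROM wikidatawiki.wbt_term_in_lang
LEFT JOIN wikidatawiki.wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id
WHERE wbxl_id IS NULL
  AND wbtl_id < 123456789;  -- hypothetical cutoff taken from the first query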

I put this query in a screen session on stat1007: SELECT wbtl_id FROM wikidatawiki.wbt_term_in_lang LEFT JOIN wikidatawiki.wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id WHERE wbxl_id IS NULL; Let's start with this and then see how it goes.

So this query returns only 9000 results. We don't need to care about those. Let's see if there's anything else; I think for the other orphan case I need to flip the join.
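
The flipped variant was presumably something like this (a sketch reconstructed from the description above, not copied from the original session):

SELECT wbxl_id
FROM wikidatawiki.wbt_text_in_lang
LEFT JOIN wikidatawiki.wbt_term_in_lang ON wbtl_text_in_lang_id = wbxl_id
WHERE wbtl_id IS NULL;  -- text_in_lang rows that no term_in_lang row points to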

Flipping the join gave me 10K rows. I don't think this is worth doing unless we also want to check for orphans in another table (like the "property terms" table).

Let's try the outer join of the text_in_lang and text tables.

And select * from wikidatawiki.wbt_text left join wikidatawiki.wbt_text_in_lang on wbx_id = wbxl_text_id where wbxl_id is null; returns only 972 results. I think the reason is that we have a unique index enforced to make sure we don't add orphan rows, or it gets fixed when you rebuild holes.

Last but not least:

mysql -h s8-analytics-replica.eqiad.wmnet -P 3318 -A -e "select * from wikidatawiki.wbt_item_terms right join wikidatawiki.wbt_term_in_lang on wbit_term_in_lang_id = wbtl_id where wbit_id is null and wbtl_id not in (select wbpt_term_in_lang_id from wikidatawiki.wbt_property_terms);" > T241212-4.results

Gives 5K rows. I think we are good to go with this; probably the cleaner bit already fixed the rest.

Could have a look at the hadoop snapshots (as they don't get updated live ;))
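
If the term store tables are also available in the analytics Hadoop cluster as sqooped snapshots (an assumption; the table names and the snapshot value below are hypothetical), the same anti-join could be run against a fixed snapshot, which avoids the race with live writes entirely:

-- Hive/Spark SQL sketch; the wmf_raw.mediawiki_wbt_* table names and the snapshot value are assumptions
SELECT COUNT(*)
FROM wmf_raw.mediawiki_wbt_term_in_lang t
LEFT JOIN wmf_raw.mediawiki_wbt_text_in_lang x
  ON t.wbtl_text_in_lang_id = x.wbxl_id
  AND x.snapshot = '2019-12'           -- partition filters for the right-hand table go in the ON clause
  AND x.wiki_db = 'wikidatawiki'
WHERE t.snapshot = '2019-12'
  AND t.wiki_db = 'wikidatawiki'
  AND x.wbxl_id IS NULL;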