
Investigate orphan rows in the new terms tables after T237984
Closed, Resolved · Public

Description

We suspect that T237984 might have left quite a few orphan rows in the tables.
We should investigate how many there are and if we think it is worth removing them.
We might want to remove them in beta, test and production.
We might want to look for them using either SQL or hadoop

Event Timeline

Fun thing is that we basically can't be 100% sure whether any given row is an orphan unless we lock both tables and block all writes to them, which is not possible. Given that these tables are huge and the queries to find orphan rows take a long time, we can't rule out that writes happened in the meantime and made some rows merely look like orphans. The easiest way to mitigate most of the issue is to exclude all rows that have a high PK id.
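
A minimal sketch of that mitigation (the cutoff value below is purely illustrative; in practice it would be the maximum id read just before starting the scan):

-- record the current maximum id before the scan starts
SELECT MAX(wbtl_id) FROM wikidatawiki.wbt_term_in_lang;

-- then only treat rows below that cutoff as candidate orphans
SELECT wbtl_id
FROM wikidatawiki.wbt_term_in_lang
LEFT JOIN wikidatawiki.wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id
WHERE wbxl_id IS NULL
  AND wbtl_id < 123456789;  -- hypothetical cutoff taken from the first query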

I put this query in a screen session on stat1007: SELECT wbtl_id FROM wikidatawiki.wbt_term_in_lang LEFT JOIN wikidatawiki.wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id WHERE wbxl_id IS NULL; Let's start with this and then see how it goes.

So this query returns only 9000 results. We don't need to care about those. Let's see if there's anything else; I think for the other orphan case I need to flip the join.
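
The flipped variant was presumably something like this (a sketch reconstructed from the description above, not copied from the original session):

SELECT wbxl_id
FROM wikidatawiki.wbt_text_in_lang
LEFT JOIN wikidatawiki.wbt_term_in_lang ON wbtl_text_in_lang_id = wbxl_id
WHERE wbtl_id IS NULL;  -- text_in_lang rows that no term_in_lang row points to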

Flipping the join gave me 10K rows. I don't think this is worth doing unless we also want to check for orphans in another table (like the "property terms" table).

Let's try the outer join of the text_in_lang and text tables.

And select * from wikidatawiki.wbt_text left join wikidatawiki.wbt_text_in_lang on wbx_id = wbxl_text_id where wbxl_id is null; returns only 972 results. I think the reason is that we have a unique index enforced to make sure we don't add orphan rows, or it gets fixed when you rebuild holes.

Last but not least:

mysql -h s8-analytics-replica.eqiad.wmnet -P 3318 -A -e "select * from wikidatawiki.wbt_item_terms right join wikidatawiki.wbt_term_in_lang on wbit_term_in_lang_id = wbtl_id where wbit_id is null and wbtl_id not in (select wbpt_term_in_lang_id from wikidatawiki.wbt_property_terms);" > T241212-4.results

Gives 5K rows. I think we are good to go with this; probably the cleaner bit already fixed the rest.

Could have a look at the hadoop snapshots (as they don't get updated live ;))
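
If the term store tables are also available in the analytics Hadoop cluster as sqooped snapshots (an assumption; the table names and the snapshot value below are hypothetical), the same anti-join could be run against a fixed snapshot, which avoids the race with live writes entirely:

-- Hive/Spark SQL sketch; the wmf_raw.mediawiki_wbt_* table names and the snapshot value are assumptions
SELECT COUNT(*)
FROM wmf_raw.mediawiki_wbt_term_in_lang t
LEFT JOIN wmf_raw.mediawiki_wbt_text_in_lang x
  ON t.wbtl_text_in_lang_id = x.wbxl_id
  AND x.snapshot = '2019-12'           -- partition filters for the right-hand table go in the ON clause
  AND x.wiki_db = 'wikidatawiki'
WHERE t.snapshot = '2019-12'
  AND t.wiki_db = 'wikidatawiki'
  AND x.wbxl_id IS NULL;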