
Drop wb_entity_per_page table
Closed, Resolved, Public

Description

wb_entity_per_page maps entity IDs to wiki page titles. Since we have, and very likely always will have, a programmatic mapping from entity ID to page title, this table is not needed. Also, wb_entity_per_page in its current form does not work with non-numeric entity IDs.

If we remove this table, we can no longer use it to iterate over all entities. We would need to rely on iterating wiki pages by namespace and/or content model. This does not seem to be much of a problem though.
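A minimal sketch of what such a programmatic mapping could look like, assuming the Wikidata convention referred to in the discussion below (items in namespace 0, properties in namespace 120). The function name `entity_id_to_page` and the prefix table are hypothetical and are not part of Wikibase:

```python
from typing import Dict, Tuple

# Assumption: entity ID prefix -> wiki namespace number.
# 0 (items) and 120 (properties) are the values mentioned in this task;
# other installations may be configured differently.
NAMESPACE_BY_PREFIX: Dict[str, int] = {"Q": 0, "P": 120}


def entity_id_to_page(entity_id: str) -> Tuple[int, str]:
    """Return (page_namespace, page_title) for a full entity ID such as 'Q42'."""
    normalized = entity_id.strip().upper()
    prefix = normalized[:1]
    if prefix not in NAMESPACE_BY_PREFIX or len(normalized) < 2:
        raise ValueError(f"Unrecognized entity ID: {entity_id!r}")
    # The page title is the entity ID itself; only the namespace varies.
    return NAMESPACE_BY_PREFIX[prefix], normalized


if __name__ == "__main__":
    print(entity_id_to_page("Q42"))
    print(entity_id_to_page("p31"))
```

No table lookup is involved: the title is the ID, and the namespace follows from the entity type.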

Event Timeline

daniel added a subscriber: daniel. Apr 10 2015, 1:19 PM
hoo added a subscriber: hoo. Apr 19 2015, 11:45 PM

Also we need to use MediaWiki's redirect table to resolve redirects (instead of the redundant epp_redirect_target field).

Lydia_Pintscher triaged this task as High priority. Apr 22 2015, 2:20 PM
hoo added a comment. Aug 6 2015, 11:36 PM

I looked at this a bit today and this is quite doable, except for a few things which are really awry (e.g. fixing EntityPerPageTable::getEntitiesWithoutTerm is not going to be super nice).

We should start by removing the usages of the table step by step, announce that the table is going away (so that tool authors etc. can adapt their tools), and after a certain grace period stop updating the table and then drop it.

Change 230035 had a related patch set uploaded (by Hoo man):
Introduce WikiPageEntityRedirectLookup

https://gerrit.wikimedia.org/r/230035

Change 230036 had a related patch set uploaded (by Hoo man):
Remove EntityPerPageTable::getPageIdForEntityId

https://gerrit.wikimedia.org/r/230036

Magnus added a subscriber: Magnus. Aug 8 2015, 7:30 PM

This will require changing just about all of my Wikidata-related tools.

If you just drop the table without replacing it, I will have to switch to concat-based queries for the page table (item IDs in the wb_ tables, page_title text in page), which will be massively slower.

Good times.

Change 230035 merged by jenkins-bot:
Introduce WikiPageEntityRedirectLookup

https://gerrit.wikimedia.org/r/230035

Change 230036 merged by jenkins-bot:
Remove EntityPerPageTable::getPageIdForEntityId

https://gerrit.wikimedia.org/r/230036

@Magnus If we don't drop the table, we'll have to change the schema to allow for non-numeric IDs. So that would break the tools anyway.

I'd prefer to drop the table and replace the joins with a programmatic mapping from ID to title. That would generally just mean adding a prefix (namespace). Do you see any disadvantage to dropping the table, if the schema has to change anyway?

So the wb_ tables will still have numeric IDs, which will then have to be CONCAT('Q',numid) to compare with page_title, using page_namespace=0/120?

That would, obviously, be much slower than the current system. If there is a better way, please tell me now!

Or will the wb_ tables also switch to non-numeric IDs? It would be helpful to see what changes will be made across wb_ unless dropping wb_entity_per_page is the only change?

We will not switch anything *to* numeric IDs. Currently, you already have to use CONCAT, since epp_entity_id is an int. This is bad. We would either introduce epp_entity_full_id (or some such) and drop epp_entity_id, or we drop the table altogether. The point is: you don't have to emulate the table, or concatenate anything. You *know* that the page title is *always* the same as the ID (with prefix). You just need to know the correct namespace.

wb_items_per_site will probably keep the numeric IDs, since it is for items only, not for other types of entities.

We will not switch anything *to* numeric IDs. Currently, you already have to use CONCAT, since epp_entity_id is an int. This is bad.

No I don't. Say I want to get the page for the item for the German Wikipedia page about [[Wikipedia]]:

select * from wb_entity_per_page,wb_items_per_site,page WHERE page_id=epp_page_id and epp_entity_type='item' and epp_entity_id=ips_item_id and ips_site_id='dewiki' and ips_site_page='Wikipedia';

No CONCAT here. Yet, if wb_entity_per_page is dropped, I have to use concat:

select * from wb_items_per_site,page WHERE page_namespace=0 and page_title=concat('Q',ips_item_id) and ips_site_id='dewiki' and ips_site_page='Wikipedia'

It would not matter for this example, but when I have 10K sitelinks to check, performance will degrade. Worse, Labs replicas are already timing out under my feet as it is; this will only become worse with slower queries.
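To make the comparison concrete, here is a hedged Python sketch that runs against a toy in-memory SQLite database. The tables contain only the columns used in the two queries above, the single row is made up, and SQLite's `||` operator stands in for MySQL's CONCAT. The third variant, which computes page titles in client code and then queries page with an IN list, is a hypothetical illustration of the programmatic mapping and is not something proposed verbatim in this thread. The sketch only shows that the three approaches return the same page; it says nothing about their relative speed on the real replicas.

```python
import sqlite3
from typing import List


def build_toy_db() -> sqlite3.Connection:
    """Create a tiny stand-in for the three tables used in the queries above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (page_id INTEGER PRIMARY KEY,
                           page_namespace INTEGER, page_title TEXT);
        CREATE TABLE wb_items_per_site (ips_item_id INTEGER,
                                        ips_site_id TEXT, ips_site_page TEXT);
        CREATE TABLE wb_entity_per_page (epp_entity_id INTEGER,
                                         epp_entity_type TEXT, epp_page_id INTEGER);
        -- Made-up sample row: item 52 lives on page 1 and has a dewiki sitelink.
        INSERT INTO page VALUES (1, 0, 'Q52');
        INSERT INTO wb_items_per_site VALUES (52, 'dewiki', 'Wikipedia');
        INSERT INTO wb_entity_per_page VALUES (52, 'item', 1);
        """
    )
    return conn


def page_ids_via_epp(conn: sqlite3.Connection, site: str, title: str) -> List[int]:
    """Join through wb_entity_per_page (integer comparison only)."""
    rows = conn.execute(
        "SELECT page_id FROM wb_entity_per_page, wb_items_per_site, page "
        "WHERE page_id = epp_page_id AND epp_entity_type = 'item' "
        "AND epp_entity_id = ips_item_id "
        "AND ips_site_id = ? AND ips_site_page = ?",
        (site, title),
    ).fetchall()
    return [row[0] for row in rows]


def page_ids_via_concat(conn: sqlite3.Connection, site: str, title: str) -> List[int]:
    """Join page directly, building the title inside SQL."""
    rows = conn.execute(
        "SELECT page_id FROM wb_items_per_site, page "
        "WHERE page_namespace = 0 AND page_title = 'Q' || ips_item_id "
        "AND ips_site_id = ? AND ips_site_page = ?",
        (site, title),
    ).fetchall()
    return [row[0] for row in rows]


def page_ids_via_client_mapping(conn: sqlite3.Connection, site: str, title: str) -> List[int]:
    """Two steps: fetch numeric item IDs, then look up titles built in Python."""
    item_ids = [
        row[0]
        for row in conn.execute(
            "SELECT ips_item_id FROM wb_items_per_site "
            "WHERE ips_site_id = ? AND ips_site_page = ?",
            (site, title),
        )
    ]
    if not item_ids:
        return []
    titles = [f"Q{item_id}" for item_id in item_ids]
    placeholders = ",".join("?" for _ in titles)
    rows = conn.execute(
        "SELECT page_id FROM page "
        f"WHERE page_namespace = 0 AND page_title IN ({placeholders})",
        titles,
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = build_toy_db()
    for lookup in (page_ids_via_epp, page_ids_via_concat, page_ids_via_client_mapping):
        print(lookup.__name__, lookup(db, "dewiki", "Wikipedia"))
```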

I know you are not switching *to* numeric IDs. I never said you should. You are currently using numeric IDs, and you plan to switch to a mixed numeric/string system. Which is bad. It would be less bad if you were to switch all numeric IDs to strings, then I could do string comparisons without CONCAT.

We would either introduce epp_entity_full_id (or some such) and drop epp_entity_id, or we drop the table altogether. The point is: you don't have to emulate the table, or concatenate anything. You *know* that the page title is *always* the same as the ID (with prefix). You just need to know the correct namespace.

Your statement is wrong. Again. Without the wb_entity_per_page table, and other wb_* tables still using numeric IDs, please show me an example of how I can link up page and wb_* without CONCAT.

wb_items_per_site will probably keep the numeric IDs, since it is for items only, not for other types of entities.

Yes, that's what will be causing the problem. (other than me having to rewrite a dozen tools)

We agreed in sprint planning that we should investigate where this table is actually used.

Also, we need to check whether the subtasks can be removed, since they do not seem to be blockers.


Change 382694 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/puppet@production] labs: do not replicate wb_entity_per_page table

https://gerrit.wikimedia.org/r/382694

Change 382694 merged by Jcrespo:
[operations/puppet@production] labs: do not replicate wb_entity_per_page table

https://gerrit.wikimedia.org/r/382694

Lydia_Pintscher closed this task as Resolved. Oct 24 2017, 3:50 PM

Was this ever announced? I woke up today to reports that XTools' ArticleInfo tool was broken. I searched my email and the only place I saw mention of wb_entity_per_page was in the Scrum of Scrums meeting notes, not in any sort of targeted, formal announcement. Apologies if it was in fact announced and I overlooked it.

I have since commented out the code that queries wb_entity_per_page, so we're okay for now. XTools used it to see whether high-level Wikidata fields, such as the description, were not filled out, which we would then report as a "bug". I don't actually understand how that query works; it was just copied from the old XTools. Those queries have been there for years, so it seems T140758 neglected Toolforge tools (Tool Labs, at the time).

Hey, it was announced on June 1st; we then waited several months and sent a reminder in early October.

Hey, it was announced on June 1st; we then waited several months and sent a reminder in early October.

Ah I see. I was not subscribed to wikidata-tech, but I am now :) Dropping a table seems like a major change. Perhaps wikitech-l and/or labs-l deserves a note? No hard feelings from me, by the way :) There are so many mailing lists... it's difficult to know which ones I need to subscribe to in order to stay informed!