
Extend mw.wikibase.getEntity lua function to allow accessing Structured Data on Commons items
Closed, Resolved · Public

Description

Extend mw.wikibase.getEntity lua function to allow accessing Structured Data on Commons items (M-codes) in addition to Wikidata items (Q-codes).

Details

Related Gerrit Patches:
operations/mediawiki-config : master - Enable Wikibase client access on commonswiki
operations/mediawiki-config : master - Enable Wikibase client access on testcommonswiki
mediawiki/extensions/WikibaseMediaInfo : wmf/1.35.0-wmf.3 - Also use custom PrefetchingTermLookup in SingleEntitySourceServices
mediawiki/extensions/Wikibase : wmf/1.35.0-wmf.3 - Allow defining entity-type-specific PrefetchingTermLookup
mediawiki/extensions/Wikibase : master - Allow defining entity-type-specific PrefetchingTermLookup
mediawiki/extensions/WikibaseMediaInfo : master - Also use custom PrefetchingTermLookup in SingleEntitySourceServices
mediawiki/extensions/WikibaseMediaInfo : master - Provide an alternative PrefetchingTermLookup for MediaInfo entities
mediawiki/extensions/WikibaseMediaInfo : master - Start writing entities into wb_terms

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden.
Jarekt added a comment. Jul 1 2019, 2:19 PM

I do not speak PHP, but looking at the code Multichill linked to, it seems like the source code for getEntity can be found at EntityAccessor.php (line 139), which calls entityIdParser to convert the string version of the entity ID to an integer. The code for entityIdParser can be found here. Interestingly, the ItemId.php function checks that the entity ID string fits the /^Q[1-9]\d{0,9}\z/i regex pattern and then strips the "Q" from the beginning of the string to get the integer. So in this code, entity IDs starting with "Q" are hardwired.
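For illustration only, the prefix-based parsing described above could be sketched in Lua as follows (the actual implementation is PHP in Wikibase's EntityIdParser/ItemId.php; the function and field names here are hypothetical, and the Lua pattern only approximates the /^Q[1-9]\d{0,9}\z/i check):

```lua
-- Hypothetical sketch of prefix-based entity ID dispatch.
-- Real parsing lives in PHP; this just shows the idea being discussed:
-- the entity type is decided by the leading character of the ID string.
local function parseEntityId( idString )
    local prefix, digits = string.match( idString, '^([QM])([1-9]%d*)$' )
    if prefix == nil then
        return nil -- not a well-formed entity ID
    end
    return {
        entityType = (prefix == 'Q') and 'item' or 'mediainfo',
        numericId = tonumber( digits ),
    }
end
```

The point being made in the comment is that only the "Q" branch of such a dispatch exists in the code as written.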

Keegan added a comment. Jul 1 2019, 5:13 PM

Is there an opportunity for a retrospective here to figure out why it wasn't included in the original plan? It seems quite a basic component for SDC, and it's been requested for quite a while (or at least, I remember asking about it at last year's Wikimania!).

I know I'll certainly cover it when we do a project retro towards the end of the year. It's come up from time to time, we've talked about it, we're not surprised by this. At some point the work didn't get revisited after last fall when it probably should have. We'll figure it out.

I think the team is going to make a spike to see how long it will take them to do this internally.

Cparle added a comment. Jul 2 2019, 4:20 PM
This comment was removed by Cparle.

It looks like the Wikibase Lua support can mostly deal with MediaInfo already, except that it can't look up the MediaInfo entities.
Unlike Wikibase properties & items, MediaInfo entities are not currently stored in wb_terms.
If MediaInfo starts writing to wb_terms, the Lua functions should also start working.
wb_terms is in the process of being redesigned/migrated, though, and I have yet to look into that.

Change 522355 had a related patch set uploaded (by Matthias Mullie; owner: Matthias Mullie):
[mediawiki/extensions/WikibaseMediaInfo@master] Start writing entities into wb_terms

https://gerrit.wikimedia.org/r/522355

It looks like the Wikibase Lua support can mostly deal with MediaInfo already, except that it can't look up the MediaInfo entities.

What kind of lookup is this - matching the exact caption, or partially matching the caption? mw.wikibase.getEntity doesn't seem to need any lookups, so which Lua function would use that?


No, it's not like that. It's something like mw.wikibase.getEntity( 'Q20489172' ), which returns the whole entity as a Lua table (like https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q20489172 ) so you can work with the data.

So as a Lua user I want to do mw.wikibase.getEntity( 'M62798946' ) and get a Lua table like https://commons.wikimedia.org/w/api.php?action=wbgetentities&ids=M80505417 .
See also https://www.mediawiki.org/wiki/Extension:Wikibase_Client/Lua#mw.wikibase.entity
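To make the request concrete, here is a hedged sketch of what module code consuming such an entity table might look like. This only runs inside a Scribunto module on a wiki with Wikibase client access; the M-id and the use of P180 ("depicts") are purely illustrative:

```lua
-- Sketch of a Scribunto module using mw.wikibase.getEntity with an
-- M-id, exactly as modules already do with Q-ids. mw.wikibase is only
-- available inside MediaWiki's Scribunto environment.
local p = {}

function p.depicts( frame )
    -- Hypothetical M-id; on Wikidata this would be a Q-id instead.
    local entity = mw.wikibase.getEntity( 'M62798946' )
    if not entity then
        return ''
    end
    -- P180 ("depicts") statements, in the same shape as wbgetentities output.
    local statements = entity:getBestStatements( 'P180' )
    local ids = {}
    for _, statement in ipairs( statements ) do
        table.insert( ids, statement.mainsnak.datavalue.value.id )
    end
    return table.concat( ids, ', ' )
end

return p
```

The entity methods used here (getBestStatements etc.) are the ones documented at Extension:Wikibase_Client/Lua; the feature request is simply that they accept MediaInfo entities too.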

So as a Lua user I want to do mw.wikibase.getEntity( 'M62798946' ) and get a Lua table like

But for this you don't need any index at all - 62798946 is literally the page ID. Am I missing something here? Why is there talk about indexes?


I was kind of wondering the same.

@matthiasmullie why do we need an index?

So, unless access is needed via a media info caption, we do not need an index.
So if we just want lookup by media info id we can close T227847 and T227848.

Is the desire here to have client access to commons media info entities?
If so the whole client access system probably just needs hooking up.

$wgWBClientSettings['repositories'] for Wikimedia clients currently only defines item, property and lexeme for client access.
etc.
As far as I know, none of the client functionality has been turned on, hooked up, or tested in beta or on test yet?

Yann added a subscriber: Yann. Jul 22 2019, 4:32 PM

@Addshore et al., not to make your lives more difficult here, but assuming caption lookup is not desired (because I can't imagine it would be), would an index of some sort be needed in order to support something like mw.wikibase.getEntity( 'File:Blah.png' ) - i.e. using the filename instead of the M-ID?

(also FWIW I think the MediaInfo entity IDs *currently* match the file page ID, but almost none of the code assumes that, at least the last time I tinkered with things)

would an index of some sort be needed in order to support something like mw.wikibase.getEntity( 'File:Blah.png' )

You'd just have to access the page table to look up the page by title, then go to the appropriate revision & slot (AFAIR there are services in Wikibase already that do that) and load the Wikibase data from there. I don't think you'd need anything beyond the existing page-revision-slot-text tables.

MediaInfo entities are *currently* the filepage ID but almost none of the code assumes that

AFAIR the code that does lookups and title/entityId conversions does assume that. Of course that code can be changed, and with a change this large that would be natural to expect. I would tend to write such a change off as YAGNI, but I think using the standard lookup interfaces, which go from Title to EntityId and then load the data, would avoid hardcoding for this particular use case; if anything changes, only these lookups would have to be changed.

Just realized https://commons.wikimedia.org/w/api.php?action=wbgetentities&props=labels&format=json&languagefallback=1&sites=commonswiki&titles=File:Charles%20P.%20Gruppe%20-%20Meadow%20Brook%20-%201912.7.1%20-%20Smithsonian%20American%20Art%20Museum.jpg already works at the moment.

The main use case for us is to use Lua on the same file page as where the structured data is located. In that context the page ID (80505417), MediaInfo ID (M80505417) and filename (File:Charles P. Gruppe - Meadow Brook - 1912.7.1 - Smithsonian American Art Museum.jpg) are all known. So no lookup by caption - we don't want that.
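The file-page use case described above could then look roughly like this. This is a sketch, not the implemented behaviour: mw.wikibase.getEntityIdForCurrentPage and mw.wikibase.getLabel exist in the Wikibase client Lua library, but whether they resolve M-ids on file pages is exactly what this task is about:

```lua
-- Sketch of the file-page use case: the module runs on the file page
-- itself, so the M-id can be derived from the current page rather than
-- passed in. Scribunto-only; assumes M-id support works as requested.
local p = {}

function p.caption( frame )
    -- On a file page this should resolve to the page's own M-id.
    local id = mw.wikibase.getEntityIdForCurrentPage()
    if not id then
        return ''
    end
    -- Captions are stored as labels on the MediaInfo entity,
    -- so a plain label lookup (in the wiki's content language) suffices.
    return mw.wikibase.getLabel( id ) or ''
end

return p
```

Because the module always runs on the page whose data it wants, no caption index or search is needed - only direct retrieval by ID.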

Change 525794 had a related patch set uploaded (by Matthias Mullie; owner: Matthias Mullie):
[mediawiki/extensions/WikibaseMediaInfo@master] Provide an alternative TermIndex, that doesn't require wb_terms

https://gerrit.wikimedia.org/r/525794

Change 522355 abandoned by Matthias Mullie:
Start writing entities into wb_terms

Reason:
I assumed we didn't get around to implementing this because we hadn't needed it yet - was not aware there were actual reasons for killing it.

https://gerrit.wikimedia.org/r/522355

I was thinking it'd make most sense to use the existing code & structure already used by Wikidata,
but I've since learned about the reasons for not using wb_terms (I figured it was just a case of not having implemented it because there had not been a need for it yet.)

I've also been playing around with the idea of not using an index at all for now, and I get the idea most of the people involved in this ticket tend to agree?

I have a small POC patch up (https://gerrit.wikimedia.org/r/525794) that seems to work well enough for finding the entity & exposing its data to Lua.
There's still a lot of work to be done (optimize, make sure other Wikibase properties & items can still be retrieved, ...), but can someone with more knowledge of this stuff take a look and see if this is a direction worth pursuing? (Or could this be problematic for reasons I'm not yet aware of?)

(I'm going to decline the other 2 wb_terms related tickets, since it looks like there are good reasons not to pursue that right now)

Restricted Application added a project: Multimedia. Jul 26 2019, 12:08 PM
Ramsey-WMF moved this task from Untriaged to Next up on the Multimedia board.
Ramsey-WMF moved this task from To Do to Doing on the Structured Data Engineering board.

Looking at the code it looks like indeed either a new TermIndex type thing would be needed for media info, or the fetching of terms, as currently done in https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/84e2062770467eacbb42e8a55bdf77e11141834f/lib/includes/Store/Sql/TermSqlIndex.php#L638-L686, would need to be factored out in some way.

It feels like there should be a cleaner way of doing this, but I might have to sit down and stare at it all for a bit longer.

I don’t think implementing TermIndex itself, as I51bc8c9703 currently does, is a good idea. It’s not a great interface (combining lookup, search and modification), and for the new term store in Wikibase we did not write a new implementation of it, but instead implementations of several different interfaces – so I would be wary of any code that really needs a TermIndex (because that would likely be broken on Wikidata already as we migrate away from wb_terms). I think what you need to implement for WikibaseMediaInfo Lua support is PrefetchingTermLookup – implementing TermIndex gets you that (via BufferingTermLookup), but it would be better to do it directly.

@matthiasmullie I hear you’re also in Stockholm, so we can also discuss this in person if you want :)

@LucasWerkmeister Yes - I'm in Stockholm. I'll take a look at your suggestion and then come find you!

Tpt added a subscriber: Tpt. Sep 24 2019, 7:15 AM

Change 525794 merged by jenkins-bot:
[mediawiki/extensions/WikibaseMediaInfo@master] Provide an alternative PrefetchingTermLookup for MediaInfo entities

https://gerrit.wikimedia.org/r/525794

Update: while the above patches got merged, Lua still won't work by next week's deployment (at least not until some more issues are straightened out, and those may require some more work).

Beta (and the same should be true for production) doesn't seem to run the MediaInfo-specific PrefetchingTermLookup alternative (the one that doesn't look things up in wb_terms).

As far as I can tell up to now, it looks like this is because of $wgWBClientSettings['useEntitySourceBasedFederation'] = true; config, which, AFAICT, was a solution to deal with T211800.
useEntitySourceBasedFederation results in MultipleEntitySourceServices & SingleEntitySourceServices being used instead of MultipleRepositoryAwareWikibaseServices.
It appears that SingleEntitySourceServices doesn’t use the service wiring files. Instead, most of what’s in the service wiring seems to be duplicated in that class.
Unfortunately, that means the MediaInfo-specific PrefetchingTermLookup is never used, but the hardcoded original is (https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/master/data-access/src/SingleEntitySourceServices.php#L270)

Investigating further...

I should be able to help with it @matthiasmullie . Looking into it...

I was wondering if you're aware of any pre-existing solution to use an alternative PrefetchingTermLookup with SingleEntitySourceServices (extending it, injecting it, anything?)
If not: I'm wondering whether it'd be possible to make SingleEntitySourceServices read from the service wiring files, or is there a reason that it must not? (Might be a stupid question, but I'm thinking the repo code is mostly the same as what's in the service wiring files, except that MultipleEntitySourceServices will triage it per source?)

@WMDE-leszek What would be the best place to discuss this? I was looking for you on IRC just now but couldn't find you - I'm matthiasmullie on freenode, but can also do mail/slack/video chat, or here in phab if that's more convenient!

Change 544972 had a related patch set uploaded (by WMDE-leszek; owner: WMDE-leszek):
[mediawiki/extensions/Wikibase@master] Allow defining entity-type-specific PrefetchingTermLookup

https://gerrit.wikimedia.org/r/544972

Change 544973 had a related patch set uploaded (by WMDE-leszek; owner: WMDE-leszek):
[mediawiki/extensions/WikibaseMediaInfo@master] Also use custom PrefetchingTermLookup in SingleEntitySourceServices

https://gerrit.wikimedia.org/r/544973

Apologies for being hard to reach recently @matthiasmullie.
I've submitted a change pair https://gerrit.wikimedia.org/r/544972 + https://gerrit.wikimedia.org/r/544973 with a simple proof-of-concept of what would be my idea how to solve this. I am not certain how to verify that all lookup operations are working as expected, so I would appreciate if you could have a look.

Per discussing this and possible further topics: my schedule is quite packed in the next days. I should be able to have a chat e.g. tomorrow from 17:00 CEST (not sure what timezone you're in?). IRC would work then (leszek_wmde), but I'd also be happy to have a quick call if that seemed more efficient for you.

Apologies for being hard to reach recently @matthiasmullie.
I've submitted a change pair https://gerrit.wikimedia.org/r/544972 + https://gerrit.wikimedia.org/r/544973 with a simple proof-of-concept of what would be my idea how to solve this. I am not certain how to verify that all lookup operations are working as expected, so I would appreciate if you could have a look.

getLabel seems to work just fine (on my machine, with `useEntitySourceBasedFederation = true`).
Will test more thoroughly tomorrow, but I believe we should be all set - thanks!

Per discussing this and possible further topics: my schedule is quite packed in the next days. I should be able to have a chat e.g. tomorrow from 17:00 CEST (not sure what timezone you're in?). IRC would work then (leszek_wmde), but I'd also be happy to have a quick call if that seemed more efficient for you.

I'm in Belgium, so CEST is convenient. I have to leave at 17:30, though, but I'll be around at 17:00 (and also around during the rest of the CEST daytime).

AFAICT, most operations seem to be working fine.
The ones that don't either:

  • are irrelevant (e.g. resolvePropertyId - MediaInfos aren't properties; or getDescription - it works, but MediaInfos don't tend to have descriptions...)
  • use SiteLinkLookup (getEntityIdForTitle and getSitelink) - I'm not even sure we'll need these for MediaInfos; if we do, we can implement that later on!

Other lookups seem to work just fine!

jeblad added a subscriber: jeblad. Oct 21 2019, 8:00 PM

Make a new lib and new function calls if the models are dissimilar, please, please, please… Switching on some kind of magic (like the initial char) will only create a mess.

There is no magic involved.
Term lookups are performed differently for different entity types (because the data lives elsewhere), but in a non-hacky fashion (and it doesn't really matter where the data comes from or how it's retrieved anyway)

Yes, we could make an entirely different lib (currently actively being worked on by WMDE: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/544204), but I'm not convinced there's much benefit in doing so at this point (though others with more knowledge of Lua might be able to convince me otherwise)
IMO it would be yet another library to maintain (or fall behind on fixes/updates) to support something that's essentially the same data model (but applied differently)

WMDE-leszek added a comment (edited). Oct 22 2019, 3:46 PM

@matthiasmullie I think there is no need to have a direct chat at this point?

https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/544972 and co are being reviewed here at WMDE. As we missed the branch, there will be a need for a backport. Let's figure out tomorrow who (WMDE or SD) will take care of it. For now the train seems to have some issues departing, as I read on wikitech-l. Looks like we should get it through the door this week, in any case.
Which deployment group is Commons again? Sorry, I keep forgetting.

Re the new lib for different model - that'd also be my preferred approach, but I imagine that's something that you might intend to do after the initial release.

Commons is group 1. Let me know if you'd like me to deploy these patches once merged, or help in any way!

Change 544972 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Allow defining entity-type-specific PrefetchingTermLookup

https://gerrit.wikimedia.org/r/544972

Change 544973 merged by jenkins-bot:
[mediawiki/extensions/WikibaseMediaInfo@master] Also use custom PrefetchingTermLookup in SingleEntitySourceServices

https://gerrit.wikimedia.org/r/544973

@matthiasmullie repeating myself from the IRC

I believe those three patches https://gerrit.wikimedia.org/r/#/q/topic:mediainfo-pref-termlookup+(status:open+OR+status:merged) would need to be backported onto the deployment branch

Will you be able to do this? I am not sure if we can manage to do the backport before train rolls further to group1 (should start in an hour?) but maybe we will?
If you won't be able to do the backporting yourself, we (WMDE) could help with this, although we'd still need you for testing everything on the Commons end :)

@matthiasmullie repeating myself from the IRC

Mmh - didn't see anything - what channel was that in? (or DM?)

I believe those three patches https://gerrit.wikimedia.org/r/#/q/topic:mediainfo-pref-termlookup+(status:open+OR+status:merged) would need to be backported onto the deployment branch
Will you be able to do this? I am not sure if we can manage to do the backport before train rolls further to group1 (should start in an hour?) but maybe we will?
If you won't be able to do the backporting yourself, we (WMDE) could help with this, although we'd still need you for testing everything on the Commons end :)

Won't be able to get it out today, but I can do it in tomorrow's Euro mid-day SWAT window!

@matthiasmullie repeating myself from the IRC

Mmh - didn't see anything - what channel was that in? (or DM?)

#wikimedia-dev. It is a pretty busy channel, so I should perhaps have chosen a different one.

I believe those three patches https://gerrit.wikimedia.org/r/#/q/topic:mediainfo-pref-termlookup+(status:open+OR+status:merged) would need to be backported onto the deployment branch
Will you be able to do this? I am not sure if we can manage to do the backport before train rolls further to group1 (should start in an hour?) but maybe we will?
If you won't be able to do the backporting yourself, we (WMDE) could help with this, although we'd still need you for testing everything on the Commons end :)

Won't be able to get it out today, but I can do it in tomorrow's Euro mid-day SWAT window!

Sounds good! There will be someone from WMDE to assist if necessary.

Change 545572 had a related patch set uploaded (by Matthias Mullie; owner: WMDE-leszek):
[mediawiki/extensions/WikibaseMediaInfo@wmf/1.35.0-wmf.3] Also use custom PrefetchingTermLookup in SingleEntitySourceServices

https://gerrit.wikimedia.org/r/545572

Change 545800 had a related patch set uploaded (by Matthias Mullie; owner: WMDE-leszek):
[mediawiki/extensions/Wikibase@wmf/1.35.0-wmf.3] Allow defining entity-type-specific PrefetchingTermLookup

https://gerrit.wikimedia.org/r/545800

Change 545800 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@wmf/1.35.0-wmf.3] Allow defining entity-type-specific PrefetchingTermLookup

https://gerrit.wikimedia.org/r/545800

Change 545572 merged by jenkins-bot:
[mediawiki/extensions/WikibaseMediaInfo@wmf/1.35.0-wmf.3] Also use custom PrefetchingTermLookup in SingleEntitySourceServices

https://gerrit.wikimedia.org/r/545572

Change 545862 had a related patch set uploaded (by WMDE-leszek; owner: WMDE-leszek):
[operations/mediawiki-config@master] Enable Wikibase client access on testcommonswiki

https://gerrit.wikimedia.org/r/545862

Change 545862 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable Wikibase client access on testcommonswiki

https://gerrit.wikimedia.org/r/545862

Mentioned in SAL (#wikimedia-operations) [2019-10-24T15:04:02Z] <addshore@deploy1001> Synchronized wmf-config/InitialiseSettings.php: testcommonswiki, Enable Wikibase client access T223792 (duration: 00m 53s)

Change 545903 had a related patch set uploaded (by Matthias Mullie; owner: Matthias Mullie):
[operations/mediawiki-config@master] Enable Wikibase client access on commonswiki

https://gerrit.wikimedia.org/r/545903

Change 545903 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable Wikibase client access on commonswiki

https://gerrit.wikimedia.org/r/545903

Mentioned in SAL (#wikimedia-operations) [2019-10-24T18:29:58Z] <urbanecm@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: 263fd0f: Enable Wikibase client access on commonswiki (T223792) (duration: 00m 52s)

There is no magic involved.
Term lookups are performed differently for different entity types (because the data lives elsewhere), but in a non-hacky fashion (and it doesn't really matter where the data comes from or how it's retrieved anyway)
Yes, we could make an entirely different lib (currently actively being worked on by WMDE: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/544204), but I'm not convinced there's much benefit in doing so at this point (though others with more knowledge of Lua might be able to convince me otherwise)
IMO it would be yet another library to maintain (or fall behind on fixes/updates) to support something that's essentially the same data model (but applied differently)

I would say this is not only overloaded methods, but overloaded data. When even the data model is different, it should not be advertised as similar at all. Sorry, but I don't think this is wise at all.

matthiasmullie closed this task as Resolved. Oct 28 2019, 2:33 PM

Currently being used in production: https://commons.wikimedia.org/w/index.php?title=File%3AJodrell_Bank_Mark_II_5.jpg&type=revision&diff=371864746&oldid=149735587
Closing this one - let's open other more specific tickets for additional work (T236691, T236692) or issues once they arise.