Extend mw.wikibase.getEntity lua function to allow accessing Structured Data on Commons items
Open, Needs Triage, Public

Description

Extend mw.wikibase.getEntity lua function to allow accessing Structured Data on Commons items (M-codes) in addition to Wikidata items (Q-codes).

Event Timeline

Jarekt created this task. May 19 2019, 10:03 AM
Restricted Application added a project: Wikidata. May 19 2019, 10:03 AM
Restricted Application added a subscriber: Aklapper.
Cparle added a subscriber: Cparle. May 19 2019, 10:07 AM
Jheald added a subscriber: Jheald. Jun 19 2019, 4:32 PM
Keegan added a subscriber: Keegan. Jun 21 2019, 5:40 PM

I should have some sort of update on Monday relating to this.

Sorry about the delay in the update, but I'm currently meeting with the SDC team in person and we're discussing this task in real-time.

The SDC development team is going to have to look into what it takes to get this done - or if not this, some other technical solution that would allow the dynamic updating of templates from structured data. The team currently is unable to do this on their own, so it will require some external partnership to make this happen and WMDE does not have the resources to assist with this task.

So, this is going to take some time to complete. I have no idea what the timeframe will be; the team will have to talk to other teams and figure out how to make this happen. But they do know how important this is to the community's plans for the project, and the work is definitely something to do once resourcing gets worked out.

Thank you for the update. Since SDC and Wikidata are based on the same underlying software, I was (naively) imagining that this task was more like extending mw.wikibase.getEntity to go to www.wikidata.org if input string starts with "Q" and go to commons.wikimedia.org if input string starts with "M". However apparently there is much more to it than that.

@Jarekt It might not be incredibly hard, it might be as you describe. The problem right now is that the team doesn't know what it'll take, and with the rest of the SDC features coming out over the next six months someone has to find time to sit down and learn about it and scope out the work.

It'll happen, we just don't know exactly when.

Is there an opportunity for a retrospective here to figure out why it wasn't included in the original plan? It seems quite a basic component for SDC, and it's been requested for quite a while (or at least, I remember asking about it at last year's Wikimania!).

Also, T212843 (adding Lua support for lexemes) is very relevant, and perhaps the work from the two could be combined.

Jarekt added a comment. Jul 1 2019, 2:19 PM

I do not speak PHP, but looking at the code Multichill linked to, it seems that the source code for getEntity can be found in EntityAccessor.php (line 139), which calls entityIdParser to convert the string version of the entity ID to an integer. The code for entityIdParser can be found here. Interestingly, ItemId.php checks that the entity ID string fits the /^Q[1-9]\d{0,9}\z/i regex pattern and then strips the "Q" from the beginning of the string to get the integer. So in this code the entity IDs starting with "Q" are hardwired.
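To make the point about the hardwired "Q" concrete, here is a minimal Python sketch of that check. The first pattern is the one quoted above, translated to Python syntax; the second pattern and both function names are hypothetical, showing only what a prefix-aware parser could look like, not how Wikibase actually implements it.

```python
import re
from typing import Tuple

# Pattern quoted in the comment above. re.fullmatch anchors both ends,
# standing in for ^ and \z in the PHP regex.
ITEM_ID = re.compile(r"Q[1-9]\d{0,9}", re.IGNORECASE | re.ASCII)

# Hypothetical generalisation that also accepts an "M" prefix.
# This is an assumption for illustration only.
ENTITY_ID = re.compile(r"([QM])([1-9]\d{0,9})", re.IGNORECASE | re.ASCII)


def parse_item_id(serialization: str) -> int:
    """Return the numeric part of a Q-ID, or raise ValueError."""
    if not ITEM_ID.fullmatch(serialization):
        raise ValueError(f"not an item ID: {serialization!r}")
    # Strip the leading "Q" and keep the integer, as described above.
    return int(serialization[1:])


def parse_entity_id(serialization: str) -> Tuple[str, int]:
    """Return (prefix, number) for a Q-ID or M-ID, or raise ValueError."""
    match = ENTITY_ID.fullmatch(serialization)
    if match is None:
        raise ValueError(f"not a Q or M entity ID: {serialization!r}")
    return match.group(1).upper(), int(match.group(2))
```

With the first function, 'M62798946' is rejected because the prefix is part of the pattern itself; the second accepts it and returns the prefix alongside the number.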

Keegan added a comment. Jul 1 2019, 5:13 PM

Is there an opportunity for a retrospective here to figure out why it wasn't included in the original plan? It seems quite a basic component for SDC, and it's been requested for quite a while (or at least, I remember asking about it at last year's Wikimania!).

I know I'll certainly cover it when we do a project retro towards the end of the year. It's come up from time to time, we've talked about it, we're not surprised by this. At some point the work didn't get revisited after last fall when it probably should have. We'll figure it out.

I think the team is going to make a spike to see how long it will take them to do this internally.

It looks like the Wikibase Lua support can mostly deal with MediaInfo already, except that it can't look up the MediaInfo entities.
Unlike Wikibase properties & items, MediaInfo entities are not currently stored in wb_terms.
If MediaInfo starts writing to wb_terms, Lua's functions also start working.
wb_terms is in the process of being redesigned/migrated, though, and I have yet to look into that.

Change 522355 had a related patch set uploaded (by Matthias Mullie; owner: Matthias Mullie):
[mediawiki/extensions/WikibaseMediaInfo@master] Start writing entities into wb_terms

https://gerrit.wikimedia.org/r/522355

It looks like the Wikibase Lua support can mostly deal with MediaInfo already, except that it can't look up the MediaInfo entities.

What kind of look up is this - matching the exact caption? Or partial matching the caption? mw.wikibase.getEntity doesn't seem to need any lookups, so which Lua function would use that?

No, it's not like that. A call such as mw.wikibase.getEntity( 'Q20489172' ) returns the whole entity in Lua (like https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q20489172 ) so you can work with the data.

So as a Lua user I want to do mw.wikibase.getEntity( 'M62798946' ) and get it in Lua like https://commons.wikimedia.org/w/api.php?action=wbgetentities&ids=M80505417 .
See also https://www.mediawiki.org/wiki/Extension:Wikibase_Client/Lua#mw.wikibase.entity
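As a rough illustration of the request, the following Python sketch picks the API endpoint from the ID prefix and builds the matching wbgetentities URL. The two hosts are taken from the URLs quoted above; the prefix-to-host table and the function name are assumptions made for this example and do not describe how the Lua library resolves entities.

```python
from urllib.parse import urlencode

# Hosts come from the URLs in the comments above. Mapping them by
# ID prefix is an assumption made for this sketch.
API_BY_PREFIX = {
    "Q": "https://www.wikidata.org/w/api.php",
    "M": "https://commons.wikimedia.org/w/api.php",
}


def wbgetentities_url(entity_id: str) -> str:
    """Build a wbgetentities URL for a Q-ID or M-ID (hypothetical helper)."""
    prefix = entity_id[:1].upper()
    try:
        base = API_BY_PREFIX[prefix]
    except KeyError:
        raise ValueError(f"unsupported entity ID: {entity_id!r}") from None
    return base + "?" + urlencode({"action": "wbgetentities", "ids": entity_id})


print(wbgetentities_url("Q20489172"))
# https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q20489172
print(wbgetentities_url("M80505417"))
# https://commons.wikimedia.org/w/api.php?action=wbgetentities&ids=M80505417
```

No network request is made here; the sketch only shows that the ID alone is enough to decide where the entity lives.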

So as a Lua user I want to do mw.wikibase.getEntity( 'M62798946' ) and get it in Lua like

But for this you don't need any index at all, 62798946 is literally the page id. Am I missing something here? Why is there talk about indexes?

I was kind of wondering the same.

@matthiasmullie why do we need an index?

So, unless access is needed via a media info caption, we do not need an index.
So if we just want lookup by media info id we can close T227847 and T227848.

Is the desire here to have client access to commons media info entities?
If so the whole client access system probably just needs hooking up.

$wgWBClientSettings['repositories'] for Wikimedia clients currently only defines item, property and lexeme for client access.
etc.
As far as I know none of the client functionality has been turned on or hooked up, or tested in beta or on test yet?

Yann added a subscriber: Yann. Mon, Jul 22, 4:32 PM

@Addshore et al., not to make your lives more difficult here, but assuming caption lookup is not desired (because I can't imagine it would be), would an index of some sort be needed in order to support something like mw.wikibase.getEntity( 'File:Blah.png' ) - i.e. using the filename instead of the M-ID?

(also FWIW I think the MediaInfo entities are *currently* the filepage ID but almost none of the code assumes that, at least the last time I tinkered with things)

would an index of some sort be needed in order to support something like mw.wikibase.getEntity( 'File:Blah.png' )

You'd just have to access the page table to look up the page by title, then go to the appropriate revision & slot (AFAIR there are services already in Wikibase that do that) and load the Wikibase data from there. I don't think you'd need anything beyond the existing page-revision-slot-text tables.

MediaInfo entities are *currently* the filepage ID but almost none of the code assumes that

AFAIR the code that does lookups and title/entityId conversions assumes that. Of course that code can be changed, but with this large change it would be natural to expect that. I would tend to write off such a change as YAGNI, but I think using the standard lookup interfaces, which go from Title to EntityId and then load the data, would avoid hardcoding for this particular use case, and if anything changes, only these lookups would have to be changed.
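The design point above can be sketched in Python as follows. All class and function names here are hypothetical and are not Wikibase's actual PHP services; the sketch assumes only what the thread states, namely that the M-ID is currently derived from the file page ID.

```python
from typing import Dict, Optional, Protocol


class EntityIdLookup(Protocol):
    """Hypothetical interface: resolve a page title to an entity ID."""

    def entity_id_for_title(self, title: str) -> Optional[str]: ...


class PageIdBackedLookup:
    """Derives the M-ID from the page ID (the current convention per this thread)."""

    def __init__(self, page_ids_by_title: Dict[str, int]) -> None:
        # Stand-in for a lookup against the page table.
        self._page_ids = page_ids_by_title

    def entity_id_for_title(self, title: str) -> Optional[str]:
        page_id = self._page_ids.get(title)
        return None if page_id is None else f"M{page_id}"


def load_entity(
    lookup: EntityIdLookup, store: Dict[str, dict], title: str
) -> Optional[dict]:
    """Go from title to entity ID, then load the data.

    The caller never builds "M" + page ID itself, so a change to the
    ID convention only requires a different lookup implementation.
    """
    entity_id = lookup.entity_id_for_title(title)
    if entity_id is None:
        return None
    return store.get(entity_id)
```

If the convention ever changed, only PageIdBackedLookup would need replacing; load_entity and its callers would stay as they are.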

Just realized https://commons.wikimedia.org/w/api.php?action=wbgetentities&props=labels&format=json&languagefallback=1&sites=commonswiki&titles=File:Charles%20P.%20Gruppe%20-%20Meadow%20Brook%20-%201912.7.1%20-%20Smithsonian%20American%20Art%20Museum.jpg already works at the moment.

The main use case for us is to use Lua on the same file page as where the structured data is located. In that context the pageid (80505417), mediainfo id (M80505417) and filename (File:Charles P. Gruppe - Meadow Brook - 1912.7.1 - Smithsonian American Art Museum.jpg) are known. So no lookup by caption, we don't want that.

Change 525794 had a related patch set uploaded (by Matthias Mullie; owner: Matthias Mullie):
[mediawiki/extensions/WikibaseMediaInfo@master] Provide an alternative TermIndex, that doesn't require wb_terms

https://gerrit.wikimedia.org/r/525794

Change 522355 abandoned by Matthias Mullie:
Start writing entities into wb_terms

Reason:
I assumed we didn't get around to implementing this because we hadn't needed it yet - was not aware there were actual reasons for killing it.

https://gerrit.wikimedia.org/r/522355

I was thinking it'd make most sense to use the existing code & structure already used by Wikidata.
But I've since learned about the reasons for not using wb_terms (I figured it was just a case of not having implemented it because there had not been a need for it yet)

I've also been playing around with the idea of just not using an index altogether ATM, and I get the idea most of the people involved in this ticket tend to agree?

I have a small POC patch up (https://gerrit.wikimedia.org/r/525794) that seems to work well enough for finding the entity & exposing its data to Lua.
There's still a lot of work to be done (optimize, make sure other wikibase properties & items can still be retrieved, ...), but can someone with more knowledge of this stuff take a look and see if this is a direction worth pursuing? (or could this be problematic for reasons I'm not yet aware of?)

(I'm going to decline the other 2 wb_terms related tickets, since it looks like there are good reasons not to pursue that right now)

Restricted Application added a project: Multimedia. Fri, Jul 26, 12:08 PM
Ramsey-WMF moved this task from Untriaged to Next up on the Multimedia board.
Ramsey-WMF moved this task from To Do to Doing on the Structured Data Engineering board.

Looking at the code it looks like indeed either a new TermIndex type thing would be needed for media info, or the fetching of terms, as currently done in https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/84e2062770467eacbb42e8a55bdf77e11141834f/lib/includes/Store/Sql/TermSqlIndex.php#L638-L686, would need to be factored out in some way.

It feels like there should be a cleaner way of doing this but I might have to sit down and stare at it all for a bit longer.

I don’t think implementing TermIndex itself, as I51bc8c9703 currently does, is a good idea. It’s not a great interface (combining lookup, search and modification), and for the new term store in Wikibase we did not write a new implementation of it, but instead implementations of several different interfaces – so I would be wary of any code that really needs a TermIndex (because that would likely be broken on Wikidata already as we migrate away from wb_terms). I think what you need to implement for WikibaseMediaInfo Lua support is PrefetchingTermLookup – implementing TermIndex gets you that (via BufferingTermLookup), but it would be better to do it directly.
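To illustrate the difference between a narrow prefetching lookup and a combined lookup/search/modification interface, here is a small Python sketch. It is only loosely modelled on the idea of a prefetching term lookup; the class, its method names and the fetch callback are assumptions for this example and do not reproduce Wikibase's PHP interfaces.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

TermKey = Tuple[str, str, str]  # (entity_id, term_type, language)
Fetcher = Callable[[List[str], List[str], List[str]], Dict[TermKey, str]]


class PrefetchingLookupSketch:
    """Read-only term lookup with a buffer; no search, no writes."""

    def __init__(self, fetch_terms: Fetcher) -> None:
        # fetch_terms is a hypothetical backend call that loads terms in bulk.
        self._fetch = fetch_terms
        self._buffer: Dict[TermKey, Optional[str]] = {}

    def prefetch_terms(
        self,
        entity_ids: Iterable[str],
        term_types: Iterable[str],
        languages: Iterable[str],
    ) -> None:
        ids, types, langs = list(entity_ids), list(term_types), list(languages)
        found = self._fetch(ids, types, langs)
        for entity_id in ids:
            for term_type in types:
                for language in langs:
                    key = (entity_id, term_type, language)
                    # None records "looked up, nothing there".
                    self._buffer[key] = found.get(key)

    def get_label(self, entity_id: str, language: str) -> Optional[str]:
        key = (entity_id, "label", language)
        if key not in self._buffer:
            # Fall back to a single-entity fetch when nothing was prefetched.
            self.prefetch_terms([entity_id], ["label"], [language])
        return self._buffer[key]
```

A caller that knows which entities a page will need can call prefetch_terms once and then read labels from the buffer, which is the only capability the Lua integration would rely on in this sketch.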

@matthiasmullie I hear you’re also in Stockholm, so we can also discuss this in person if you want :)

@LucasWerkmeister Yes - I'm in Stockholm. I'll take a look at your suggestion and then come find you!