RDF output of an entity loads all referenced entities, it shouldn't
Closed, Duplicate (Public)

Description

Found in T243915. For example, the RDF output of Q7251 loads 252 other entities, called from Wikibase\Rdf\RdfBuilder::resolveMentionedEntities. This won't scale: most of the time spent responding with RDF goes to loading those entities, not to mention the huge memory footprint it causes. The entities are in ExternalStorage, so loading can't be batched (due to the nature of ES using consistent hashing), and the I/O needed is wild, even though the output doesn't need the full entities. It only needs labels, property info, and statements like "formatter url", which could also be cached. After talking to devs and PM, this doesn't seem to be intentional.

Event Timeline

Lucas_Werkmeister_WMDE renamed this task from "RDF output of an entity loads all of refrecned entities, it shouldn't" to "RDF output of an entity loads all referenced entities, it shouldn't". Jan 30 2020, 11:50 AM

> the entities are in ExternalStorage

One note about that: these are likely retrieved from the WAN cache, not actually from external storage.

> > the entities are in ExternalStorage
>
> One note about that: these are likely retrieved from the WAN cache, not actually from external storage.

Yeah, and that's not great either: it just moves the problem from one part of the infrastructure to another. Network, bandwidth, I/O, and other costs are still pretty high.

So I did a quick check on this in my volunteer capacity. It looks really interesting. If you remove the $this->mentionedEntityTracker->entityReferenceMentioned call from EntityIdRdfBuilder::addValue(), the RDF output stays completely the same (tried it on mwdebug in production with Q7251; even the hash is the same), but the time spent producing it drops to 1/20th and the memory used to one fifth.

I'll make a patch for this and ask @daniel and @Tpt to take a look. Maybe I'm missing something obvious here.

I also tested it with a random string in the URL to bypass Varnish.

Change 572491 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/Wikibase@master] Do not try to load the whole entity because their id is mentioned in RDF

https://gerrit.wikimedia.org/r/572491

Okay, apparently some caching was indeed keeping me from seeing the actual difference. The outputs are different now: with this patch we no longer get things like this:

> wd:Q15442776 a wikibase:Item ;
> 	rdfs:label "cryptographer"@en ;
> 	skos:prefLabel "cryptographer"@en ;
> 	schema:name "cryptographer"@en ;
> 	rdfs:label "cryptographe"@fr ;
> 	skos:prefLabel "cryptographe"@fr ;
> 	schema:name "cryptographe"@fr ;
> 	rdfs:label "Kryptograph"@de ;
> 	skos:prefLabel "Kryptograph"@de ;
> 	schema:name "Kryptograph"@de ;
> 	rdfs:label "криптограф"@ru ;
> 	skos:prefLabel "криптограф"@ru ;
> 	schema:name "криптограф"@ru ;
> 	rdfs:label "crittografo"@it ;
> 	skos:prefLabel "crittografo"@it ;
> 	schema:name "crittografo"@it ;
> 	rdfs:label "κρυπτογράφος"@el ;
> 	skos:prefLabel "κρυπτογράφος"@el ;
> 	schema:name "κρυπτογράφος"@el ;
> 	rdfs:label "criptógrafo"@es ;
> 	skos:prefLabel "criptógrafo"@es ;
> 	schema:name "criptógrafo"@es ;
> 	rdfs:label "cryptograaf"@nl ;
> 	skos:prefLabel "cryptograaf"@nl ;
> 	schema:name "cryptograaf"@nl ;
> 	rdfs:label "criptógrafo"@pt ;
> 	skos:prefLabel "criptógrafo"@pt ;
> 	schema:name "criptógrafo"@pt ;
> 	rdfs:label "криптограф"@sr ;
> 	skos:prefLabel "криптограф"@sr ;
> 	schema:name "криптограф"@sr ;
> 	rdfs:label "криптограф"@sr-ec ;
> 	skos:prefLabel "криптограф"@sr-ec ;
> 	schema:name "криптограф"@sr-ec ;
> 	rdfs:label "kriptograf"@sr-el ;
> 	skos:prefLabel "kriptograf"@sr-el ;
> 	schema:name "kriptograf"@sr-el ;
> 	rdfs:label "kryptograf"@cs ;
> 	skos:prefLabel "kryptograf"@cs ;
> 	schema:name "kryptograf"@cs ;
> 	rdfs:label "kryptograf"@da ;
> 	skos:prefLabel "kryptograf"@da ;
> 	schema:name "kryptograf"@da ;
> 	rdfs:label "kryptograf"@sv ;
> 	skos:prefLabel "kryptograf"@sv ;
> 	schema:name "kryptograf"@sv ;
> 	rdfs:label "kriptograf"@sl ;
> 	skos:prefLabel "kriptograf"@sl ;
> 	schema:name "kriptograf"@sl ;
> 	rdfs:label "գաղտնագիր"@hy ;
> 	skos:prefLabel "գաղտնագիր"@hy ;
> 	schema:name "գաղտնագիր"@hy ;
> 	rdfs:label "criptògraf"@ca ;
> 	skos:prefLabel "criptògraf"@ca ;
> 	schema:name "criptògraf"@ca ;
> 	rdfs:label "criptograf"@ro ;
> 	skos:prefLabel "criptograf"@ro ;
> 	schema:name "criptograf"@ro ;
> 	rdfs:label "kriptográfus"@hu ;
> 	skos:prefLabel "kriptográfus"@hu ;
> 	schema:name "kriptográfus"@hu ;
> 	rdfs:label "عالم تعمية"@ar ;
> 	skos:prefLabel "عالم تعمية"@ar ;
> 	schema:name "عالم تعمية"@ar ;
> 	rdfs:label "криптограф"@uk ;
> 	skos:prefLabel "криптограф"@uk ;
> 	schema:name "криптограф"@uk ;
> 	rdfs:label "密碼學家"@zh-hk ;
> 	skos:prefLabel "密碼學家"@zh-hk ;
> 	schema:name "密碼學家"@zh-hk ;
> 	rdfs:label "密碼學家"@yue ;
> 	skos:prefLabel "密碼學家"@yue ;
> 	schema:name "密碼學家"@yue ;
> 	rdfs:label "密碼學家"@zh ;
> 	skos:prefLabel "密碼學家"@zh ;
> 	schema:name "密碼學家"@zh ;
> 	rdfs:label "密码学家"@zh-cn ;
> 	skos:prefLabel "密码学家"@zh-cn ;
> 	schema:name "密码学家"@zh-cn ;
> 	rdfs:label "密码学家"@zh-hans ;
> 	skos:prefLabel "密码学家"@zh-hans ;
> 	schema:name "密码学家"@zh-hans ;
> 	rdfs:label "密碼學家"@zh-hant ;
> 	skos:prefLabel "密碼學家"@zh-hant ;
> 	schema:name "密碼學家"@zh-hant ;
> 	rdfs:label "密碼學家"@zh-mo ;
> 	skos:prefLabel "密碼學家"@zh-mo ;
> 	schema:name "密碼學家"@zh-mo ;
> 	rdfs:label "密码学家"@zh-my ;
> 	skos:prefLabel "密码学家"@zh-my ;
> 	schema:name "密码学家"@zh-my ;
> 	rdfs:label "密码学家"@zh-sg ;
> 	skos:prefLabel "密码学家"@zh-sg ;
> 	schema:name "密码学家"@zh-sg ;
> 	rdfs:label "密碼學家"@zh-tw ;
> 	skos:prefLabel "密碼學家"@zh-tw ;
> 	schema:name "密碼學家"@zh-tw ;
> 	rdfs:label "jüfavan"@vo ;
> 	skos:prefLabel "jüfavan"@vo ;
> 	schema:name "jüfavan"@vo ;
> 	rdfs:label "крыптограф"@be ;
> 	skos:prefLabel "крыптограф"@be ;
> 	schema:name "крыптограф"@be ;
> 	rdfs:label "kriptografisto"@io ;
> 	skos:prefLabel "kriptografisto"@io ;
> 	schema:name "kriptografisto"@io ;
> 	rdfs:label "kriptografo"@eu ;
> 	skos:prefLabel "kriptografo"@eu ;
> 	schema:name "kriptografo"@eu ;
> 	rdfs:label "kryptograf"@pl ;
> 	skos:prefLabel "kryptograf"@pl ;
> 	schema:name "kryptograf"@pl ;
> 	rdfs:label "קריפטוגרף"@he ;
> 	skos:prefLabel "קריפטוגרף"@he ;
> 	schema:name "קריפטוגרף"@he ;
> 	rdfs:label "kryptograf"@nb ;
> 	skos:prefLabel "kryptograf"@nb ;
> 	schema:name "kryptograf"@nb ;
> 	rdfs:label "kryptografi"@fi ;
> 	skos:prefLabel "kryptografi"@fi ;
> 	schema:name "kryptografi"@fi ;
> 	rdfs:label "criptógrafu"@ast ;
> 	skos:prefLabel "criptógrafu"@ast ;
> 	schema:name "criptógrafu"@ast ;
> 	rdfs:label "Kryptograph"@lb ;
> 	skos:prefLabel "Kryptograph"@lb ;
> 	schema:name "Kryptograph"@lb ;
> 	rdfs:label "cryptograffwr"@cy ;
> 	skos:prefLabel "cryptograffwr"@cy ;
> 	schema:name "cryptograffwr"@cy ;
> 	rdfs:label "криптограф"@mk ;
> 	skos:prefLabel "криптограф"@mk ;
> 	schema:name "криптограф"@mk ;
> 	rdfs:label "criptografiste"@lfn ;
> 	skos:prefLabel "criptografiste"@lfn ;
> 	schema:name "criptografiste"@lfn ;
> 	rdfs:label "kriptologo"@eo ;
> 	skos:prefLabel "kriptologo"@eo ;
> 	schema:name "kriptologo"@eo ;
> 	rdfs:label "cripteagrafaí"@ga ;
> 	skos:prefLabel "cripteagrafaí"@ga ;
> 	schema:name "cripteagrafaí"@ga ;
> 	schema:description "spécialiste de la cryptographie"@fr,
> 		"Beruf, der das Verschlüsseln vertrauenswürdiger Information beinhaltet"@de,
> 		"professione"@it,
> 		"persona que se especializa en la criptografía"@es,
> 		"specialist on techniques for secure communication in the presence of third parties"@en,
> 		"specialista na počítačové šifrování"@cs,
> 		"ekspert i kryptografi"@da,
> 		"специалист в области криптографии"@ru,
> 		"kriptografian aditua dena"@eu,
> 		"ekspert i kryptografi"@nb,
> 		"henkilö, joka työkseen tekee uusia salakirjoitusmenetelmiä ja suojaa viestejä salakirjoitukselta"@fi,
> 		"在第三方面前進行安全通信技術的專家"@zh .

Is this intentional? If so, I need to come up with a better solution.

The intent is to provide "stub" information for all mentioned entities, such as their type, label, and description (but maybe not in all languages?). Basically, this is the information needed to generate a minimal human-readable representation of the entity. This is skipped in "dump" mode, and could generally be made optional, though it seems like a good default behavior.
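To make the shape of such a stub concrete, here is a minimal sketch in Python (not Wikibase code) that renders the stub from nothing but an ID, a type, and per-language labels and descriptions. The function names and the dictionary inputs are assumptions for illustration; the predicate layout simply mirrors the Turtle excerpt quoted earlier in this thread, except that each description gets its own schema:description line for simplicity.

```python
def escape_literal(text: str) -> str:
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def entity_stub(entity_id: str, entity_type: str,
                labels: dict[str, str], descriptions: dict[str, str]) -> str:
    """Render a stub for one mentioned entity (hypothetical helper).

    Only the type, labels, and descriptions are needed; the full entity
    (statements, sitelinks, ...) is never touched.
    """
    lines = [f"wd:{entity_id} a wikibase:{entity_type}"]
    for lang, text in labels.items():
        literal = f'"{escape_literal(text)}"@{lang}'
        # The quoted output repeats every label under three predicates.
        for predicate in ("rdfs:label", "skos:prefLabel", "schema:name"):
            lines.append(f"\t{predicate} {literal}")
    for lang, text in descriptions.items():
        lines.append(f'\tschema:description "{escape_literal(text)}"@{lang}')
    return " ;\n".join(lines) + " .\n"


if __name__ == "__main__":
    print(entity_stub(
        "Q15442776", "Item",
        {"en": "cryptographer", "fr": "cryptographe"},
        {"fr": "spécialiste de la cryptographie"},
    ))
```

The point of the sketch is only that the inputs are small term maps, so whatever produces them does not have to deserialize whole entities.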

All info needed for the stubs should be available in database tables (like the terms table), and it should be possible to (pre-)fetch it using bulk queries. My approach would be to rewrite resolveMentionedEntities() based on EntityInfo. This is, however, complicated by the fact that stubs for some kinds of entities expose additional information by implementing EntityRdfBuilder::addEntityStub(). I think in practice this is only needed for properties, which could be a hard-coded special case.
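As a rough illustration of the bulk-query idea, the following Python sketch fetches labels and descriptions for many mentioned entities in a single query instead of loading each entity separately. It is a sketch under stated assumptions, not the Wikibase implementation: the `terms` table layout, its column names, and the helpers `fetch_stub_terms` and `demo_connection` are all hypothetical, and SQLite merely stands in for the real term store.

```python
import sqlite3
from collections import defaultdict


def fetch_stub_terms(conn, entity_ids, languages):
    """Fetch labels and descriptions for all given IDs with one query.

    Assumes a hypothetical table terms(entity_id, term_type, lang, text).
    Returns {entity_id: {"label": {lang: text}, "description": {lang: text}}}.
    """
    if not entity_ids or not languages:
        return {}
    id_marks = ",".join("?" * len(entity_ids))
    lang_marks = ",".join("?" * len(languages))
    rows = conn.execute(
        "SELECT entity_id, term_type, lang, text FROM terms "
        f"WHERE entity_id IN ({id_marks}) AND lang IN ({lang_marks}) "
        "AND term_type IN ('label', 'description')",
        [*entity_ids, *languages],
    )
    stubs = defaultdict(lambda: {"label": {}, "description": {}})
    for entity_id, term_type, lang, text in rows:
        stubs[entity_id][term_type][lang] = text
    return dict(stubs)


def demo_connection():
    """Build an in-memory table with a few terms taken from this thread."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE terms (entity_id TEXT, term_type TEXT, lang TEXT, text TEXT)"
    )
    conn.executemany(
        "INSERT INTO terms VALUES (?, ?, ?, ?)",
        [
            ("Q15442776", "label", "en", "cryptographer"),
            ("Q15442776", "label", "fr", "cryptographe"),
            ("Q15442776", "label", "de", "Kryptograph"),
            ("Q15442776", "description", "fr", "spécialiste de la cryptographie"),
        ],
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    # One round trip covers every mentioned entity, whether it is 2 or 252.
    print(fetch_stub_terms(conn, ["Q15442776", "Q7251"], ["en", "fr"]))
```

Restricting the query to a language list reflects the "maybe not in all languages" question raised above. Entities with no matching rows are simply absent from the result, so a caller would still need to decide how to emit their type.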

Change 572491 abandoned by Ladsgroup:
Do not try to load the whole entity because their id is mentioned in RDF

Reason:
Not this way.

https://gerrit.wikimedia.org/r/572491

I'm aware this task is older than T281272.
However, T281272 has more details and has already been picked up on a team workboard, so I am opting to merge this task into that ticket.