
[Task] Implement a limit for entities accessed via arbitrary access features and mark as expensive
Closed, ResolvedPublic1 Estimated Story Points

Description

Implement a limit so that users can only access a bounded number of entities during a page parse. This should apply to both Lua and the parser function, whenever either of them needs to load a new entity.

This is needed to make sure that users can't load so many entities that rendering times out or, even worse, results in a PHP fatal error. It could also serve as an emergency switch in case arbitrary access causes trouble for network traffic or the external storage.

We would also mark all such function calls as expensive (https://www.mediawiki.org/wiki/Manual:$wgExpensiveParserFunctionLimit).
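
For context, the budget such expensive calls count against is the standard MediaWiki one; a minimal sketch of how a wiki could adjust it in LocalSettings.php (the value shown is illustrative, not a recommendation):

  <?php
  // LocalSettings.php (sketch): per-parse cap on "expensive" parser
  // function calls. Once arbitrary-access entity loads are marked as
  // expensive, each newly loaded entity counts against this budget.
  $wgExpensiveParserFunctionLimit = 500; // illustrative value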


Event Timeline

hoo claimed this task.
hoo raised the priority of this task from to High.
hoo updated the task description.
hoo added subscribers: Rschen7754, Candalua, adrianheine and 56 others.

One thing I want to use Wikidata for is translating species names on pages like this one (https://commons.wikimedia.org/wiki/Fringillidae) into the language the reader uses. Of course, more than 200 species are listed there, and all of them have names that should be translated. Therefore I think

  1. the solution with an expensive parser function would be better, because it is not a good idea to translate and update the translations by hand or to leave such pages untranslated.
  2. when I save such a page, I need confirmation that Wikimedia Commons has received the data I saved and will process it later. Even now, Commons often needs a long time to save pages with many translation templates, and I get an error instead of a view of the page, only to notice some minutes later that Commons received the data and saved it correctly while the page view is still the old one. This often leads to saving the same page version more than once.

In reply to the two points above:
  1. The other solutions would also allow you to do that, as accessing labels is excluded from this (this is only about arbitrary access features, not about accessing labels or descriptions).
  2. If the pages take too long to render, you need to simplify the templates; improving page rendering speed is not within the scope of Wikidata. And accessing a lot of data from Wikidata will also add to the page rendering time... hopefully not too much, but of course it will.

We talked about this and I think we will do both: Mark the functionality as expensive and have an upper limit within Wikibase.

hoo renamed this task from "Decide on and implement a limit for entities accessed via arbitrary access features" to "Implement a limit for entities accessed via arbitrary access features and mark as expensive". Mar 26 2015, 9:52 AM
hoo updated the task description.

hoo: improving page rendering speed is not within the scope of Wikidata

I know. I simply wanted you to take this use case into account while thinking about Wikidata limits. I guessed that if pages of this size cause problems within Wikimedia Commons, they may cause problems for Wikidata use too.

Change 199969 had a related patch set uploaded (by Hoo man):
Mark accessing arbitrary items as expensive

https://gerrit.wikimedia.org/r/199969
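
As a rough illustration of what "marking as expensive" means on the MediaWiki side (a sketch assuming the standard parser counter is used, not the actual patch; loadArbitraryEntity is a hypothetical name):

  <?php
  // Sketch: a parser function or Scribunto callback is marked expensive by
  // incrementing the parser's expensive-function counter before doing the
  // costly work. incrementExpensiveFunctionCount() returns false once
  // $wgExpensiveParserFunctionLimit has been reached.
  function loadArbitraryEntity( Parser $parser, string $entityIdSerialization ) {
      if ( !$parser->incrementExpensiveFunctionCount() ) {
          // Over budget: refuse to load yet another entity.
          return '';
      }
      // ... resolve $entityIdSerialization and load the entity here ...
  }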

The most difficult use case will probably be lists. We love our lists on Wikipedia. Lists generally contain redundant information. It would be great to store at least part of this data in Wikidata items and grab it from there.

Take for example https://en.wikipedia.org/wiki/National_Register_of_Historic_Places_listings_in_North_Dakota . This list has 235 items. Wikibase should be able to support lists this size.

This is the current output from the page:

NewPP limit report
Parsed by mw1009
CPU time usage: 4.687 seconds
Real time usage: 5.003 seconds
Preprocessor visited node count: 115649/1000000
Preprocessor generated node count: 0/1500000
Post‐expand include size: 1575267/2097152 bytes
Template argument size: 143301/2097152 bytes
Highest expansion depth: 12/40
Expensive parser function count: 1/500
Lua time usage: 0.942/10.000 seconds
Lua memory usage: 3.47 MB/50 MB

Transclusion expansion time report (%,ms,calls,template)
100.00% 3425.845      1 - -total
 78.69% 2695.839    226 - Template:NRHP_row
 19.35%  662.903    418 - Template:First_word
 19.19%  657.334    270 - Template:Designation/color
 18.88%  646.960    566 - Template:NRHP_color
 14.03%  480.707    192 - Template:Coord
 11.31%  387.481    209 - Template:NRHP_Focus
 10.84%  371.255     43 - Template:NRHP_header
  9.11%  312.093    234 - Template:Dts
  5.52%  189.245    234 - Template:Dts/out0

So if each item is one expensive call, that would be about 235 out of the 500 allowed calls.

@Multichill for lists, the plan is to have pre-defined queries on Wikidata, and to cache the result (and update it periodically). So far the idea was to just cache the list of entities the query finds, but I'm coming around to believing we should allow users to specify which properties they are interested in, and cache the values of those as part of the materialized result set.

A list page on wikipedia would then not be accessing Items at all, but a single QueryResult object. The original plan of providing just the list of IDs and let the client wiki do the rest via Lua does not seem feasible, considering the performance implications.

Just another use case for your consideration, but there's been some discussion regarding using wikidata inside citation templates.

So in a worst-case scenario like Barack Obama, you might see > 200 citations there.

It gets even worse when you consider that each citation might contain fields from multiple entities. For instance, if you are citing a journal article, you might need information about

  1. The article entity
  2. The journal entity
  3. Entity data about every linked author or other creator

In the case of requesting data about the author entities, this could go bad rather quickly. Assuming an average of 10 linked entities per paper (high, but some papers have 50 authors or so), you could easily see something like 2000 (200 * 10) entity requests in a page.


For now we will also have our own limit just for entity access, and that will be well below 500.

I can see that there are some use cases that require accessing an immense number of entities in one way or another, but that's not within the scope of arbitrary access as it's implemented now. We will have a solution for that at some point (see also what Daniel said about lists, as that's related), but that solution will not involve loading whole entities (which is what arbitrary access is about right now); rather, it will be a more specialized access functionality (that would, e.g., only return Statements with a certain PropertyId from a list of Items).

@Mvolz: also note that as long as only the title and local page name of the entity are accessed, special optimizations apply. Such access would not count towards the arbitrary access limit.

@daniel Thanks for the info!

So it would be possible to do limited citations then, but not full citations. For instance, the publication name could be taken from the journal entity for free, but not the location or the journal editors. For authors, a typical citation template has separate fields for first name and last name, which wouldn't be covered.

But you could probably do something not half bad with just the publication title and the full author name, since those don't count towards the limit.

The use cases vary widely, and choosing stable limits seems very hard.

How can we adapt and reduce the load on the servers? Which part should be computed in real time? Which part should be pre-computed?
One approach could be: for each item Qnnnn and each property Pnnnn, compute the mean delay between two accesses and the mean cost of an access.
From these mean costs, select which pre-computed work for Qnnnn and Pnnnn to keep in a cache to reduce the global load on the servers.
The real-time work is then reduced to the requests that are too unpredictable.

This perpetual optimization process could itself use 1% of the server capacity, but greatly reduce the global load.
The question then becomes: which elementary cost do we measure to optimize the global process?

Then the limits can be bigger and less meaningful.

@Rical: We already have entity caching in place; this is (mostly) about other pain points, like in-process memory usage, in-process CPU time usage, used network bandwidth and such things.

Change 199969 merged by jenkins-bot:
Mark accessing arbitrary items as expensive

https://gerrit.wikimedia.org/r/199969

Is there more to do in this or can it be closed?

@Lydia_Pintscher if we want an additional, separate limit for accessing Wikibase items / properties specifically, in addition to the general expensive limit, then that is not implemented yet.

We might want both, in case there are performance or other issues with arbitrary access. Then we can adjust the Wikibase limit specifically.

Change 225474 had a related patch set uploaded (by Hoo man):
Introduce a RestrictedEntityLookup for client DataAccess

https://gerrit.wikimedia.org/r/225474

Change 225474 merged by jenkins-bot:
Introduce a RestrictedEntityLookup for client DataAccess

https://gerrit.wikimedia.org/r/225474

With Change 225474 merged, we now report the number of entities loaded, but the limit we impose is INT_MAX.
I suppose this should remain open until we have a configurable limit implemented and a good default defined.
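
For readers following along, a minimal, self-contained sketch of the decorator idea behind such a lookup (illustrative only; this is not the actual Wikibase RestrictedEntityLookup class, its interfaces, or its exception types, and CountingEntityLookup is a hypothetical name):

  <?php
  // Sketch: wrap an entity lookup, remember which distinct entity IDs have
  // been loaded during the current parse, and refuse further loads once a
  // configured limit is reached. The count can be reported in the parser
  // limit report even while the limit itself is effectively unbounded.
  class CountingEntityLookup {
      /** @var callable function ( string $entityId ): mixed */
      private $innerLookup;
      private int $limit;
      /** @var array<string,bool> entity IDs loaded so far */
      private array $seen = [];

      public function __construct( callable $innerLookup, int $limit ) {
          $this->innerLookup = $innerLookup;
          $this->limit = $limit;
      }

      public function getEntity( string $entityId ) {
          if ( !isset( $this->seen[$entityId] ) ) {
              if ( count( $this->seen ) >= $this->limit ) {
                  throw new RuntimeException( "Entity access limit of {$this->limit} exceeded" );
              }
              $this->seen[$entityId] = true;
          }
          return ( $this->innerLookup )( $entityId );
      }

      public function getEntityAccessCount(): int {
          return count( $this->seen );
      }
  }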

Jonas renamed this task from "Implement a limit for entities accessed via arbitrary access features and mark as expensive" to "[Task] Implement a limit for entities accessed via arbitrary access features and mark as expensive". Aug 19 2015, 11:00 AM

Change 234587 merged by jenkins-bot:
Introduce entityAccessLimit setting (for sane arbitrary access)

https://gerrit.wikimedia.org/r/234587

The default setting is 250 different entities (accessing the full entity object). Use of convenience functions in Lua, such as mw.wikibase.label, does not count towards this limit (they use a TermLookup instead of loading full entities).

Looking at usage tracking, I see only 2 wikis where more than this is used on a page, and they are test / user sandbox pages.

https://phabricator.wikimedia.org/P1946

If needed, we could set a higher limit for these wikis (especially Wikidata, probably) or adjust the default value.
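
For reference, a sketch of what such a per-wiki override could look like in site configuration (assuming the setting is exposed through the usual $wgWBClientSettings array; the value 400 is purely illustrative):

  <?php
  // Wiki-specific configuration (sketch): raise the number of full entity
  // objects a single page parse may load via arbitrary access. Convenience
  // lookups such as mw.wikibase.label do not count against this limit.
  $wgWBClientSettings['entityAccessLimit'] = 400; // default: 250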

Blocked by T106190, thus reopening until that's fixed.

I think this is done now: Everything that's supposed to be expensive has been marked as such and we enforce the limits everywhere.