
[Task] Implement a limit for entities accessed via arbitrary access features and mark as expensive
Closed, ResolvedPublic1 Estimated Story Points

Description

Implement a limit so that users can only access a bounded number of entities during a page parse. This should apply to both Lua and the parser function, whenever either of them needs to load a new entity.

This is needed to make sure that users can't load so many entities that rendering times out or, even worse, results in a PHP fatal error. It could also serve as an emergency switch in case arbitrary access causes trouble for network traffic or the external storage.

We would also mark all such function calls as expensive (https://www.mediawiki.org/wiki/Manual:$wgExpensiveParserFunctionLimit).
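
For context, the budget such expensive calls count against is the standard MediaWiki one; a minimal sketch of how a wiki could adjust it in LocalSettings.php (the value shown is illustrative, not a recommendation):

  <?php
  // LocalSettings.php (sketch): per-parse cap on "expensive" parser
  // function calls. Once arbitrary-access entity loads are marked as
  // expensive, each newly loaded entity counts against this budget.
  $wgExpensiveParserFunctionLimit = 500; // illustrative value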


Event Timeline

hoo claimed this task.
hoo raised the priority of this task from to High.
hoo updated the task description.
hoo added subscribers: Rschen7754, Candalua, adrianheine and 56 others.

One thing I want to use Wikidata for is translating species names on pages like this one (https://commons.wikimedia.org/wiki/Fringillidae) into the language the reader uses. Of course, more than 200 species are listed there, and all of them have names that should be translated. Therefore I think

  1. the solution with an expensive parser function would be better, because it is not a good idea to translate and update the translations by hand or to leave such pages untranslated.
  2. when I save such a page, I need confirmation that Wikimedia Commons has received the data I saved and will process it later. Even now, Commons often needs a long time to save pages with many translation templates, and I get an error instead of a view of the page, only to notice some minutes later that Commons received the data and saved it correctly while the page view is still the old one. This often leads to saving the same page version more than once.

In reply to the two points above:
  1. The other solutions would also allow you to do that, as accessing labels is excluded from this (this is only about arbitrary access features, not about accessing labels or descriptions).
  2. If the pages take too long to render, you need to simplify the templates; improving page rendering speed is not within the scope of Wikidata. And accessing a lot of data from Wikidata will also add to the page rendering time... hopefully not too much, but of course it will.

We talked about this and I think we will do both: Mark the functionality as expensive and have an upper limit within Wikibase.

hoo renamed this task from "Decide on and implement a limit for entities accessed via arbitrary access features" to "Implement a limit for entities accessed via arbitrary access features and mark as expensive". Mar 26 2015, 9:52 AM
hoo updated the task description.

hoo: improving page rendering speed is not within the scope of Wikidata

I know. I simply wanted you to take this use case into account while thinking about Wikidata limits. I guessed that if pages of this size cause problems within Wikimedia Commons, they may cause problems for Wikidata use too.

Change 199969 had a related patch set uploaded (by Hoo man):
Mark accessing arbitrary items as expensive

https://gerrit.wikimedia.org/r/199969
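
As a rough illustration of what "marking as expensive" means on the MediaWiki side (a sketch assuming the standard parser counter is used, not the actual patch; loadArbitraryEntity is a hypothetical name):

  <?php
  // Sketch: a parser function or Scribunto callback is marked expensive by
  // incrementing the parser's expensive-function counter before doing the
  // costly work. incrementExpensiveFunctionCount() returns false once
  // $wgExpensiveParserFunctionLimit has been reached.
  function loadArbitraryEntity( Parser $parser, string $entityIdSerialization ) {
      if ( !$parser->incrementExpensiveFunctionCount() ) {
          // Over budget: refuse to load yet another entity.
          return '';
      }
      // ... resolve $entityIdSerialization and load the entity here ...
  }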

The most difficult use case will probably be lists. We love our lists on Wikipedia. Lists generally contain redundant information. It would be great to store at least part of this data in Wikidata items and grab it from there.

Take for example https://en.wikipedia.org/wiki/National_Register_of_Historic_Places_listings_in_North_Dakota . This list has 235 items. Wikibase should be able to support lists this size.

This is the current output from the page:

NewPP limit report
Parsed by mw1009
CPU time usage: 4.687 seconds
Real time usage: 5.003 seconds
Preprocessor visited node count: 115649/1000000
Preprocessor generated node count: 0/1500000
Post‐expand include size: 1575267/2097152 bytes
Template argument size: 143301/2097152 bytes
Highest expansion depth: 12/40
Expensive parser function count: 1/500
Lua time usage: 0.942/10.000 seconds
Lua memory usage: 3.47 MB/50 MB

Transclusion expansion time report (%,ms,calls,template)
100.00% 3425.845      1 - -total
 78.69% 2695.839    226 - Template:NRHP_row
 19.35%  662.903    418 - Template:First_word
 19.19%  657.334    270 - Template:Designation/color
 18.88%  646.960    566 - Template:NRHP_color
 14.03%  480.707    192 - Template:Coord
 11.31%  387.481    209 - Template:NRHP_Focus
 10.84%  371.255     43 - Template:NRHP_header
  9.11%  312.093    234 - Template:Dts
  5.52%  189.245    234 - Template:Dts/out0

So if each item is one expensive call, that would be about 235 out of the 500 allowed calls.

@Multichill for lists, the plan is to have pre-defined queries on Wikidata, and to cache the result (and update it periodically). So far the idea was to just cache the list of entities the query finds, but I'm coming around to believing we should allow users to specify which properties they are interested in, and cache the values of those as part of the materialized result set.

A list page on wikipedia would then not be accessing Items at all, but a single QueryResult object. The original plan of providing just the list of IDs and let the client wiki do the rest via Lua does not seem feasible, considering the performance implications.

Just another use case for your consideration, but there's been some discussion regarding using wikidata inside citation templates.

So in a worst-case scenario like Barack Obama, you might see > 200 citations there.

It gets even worse when you consider that each citation might contain fields from multiple entities. For instance, if you are citing a journal article, you might need information about

  1. The article entity
  2. The journal entity
  3. Entity data about every linked author or other creator

In the case of requesting data about the author entities, this could go bad rather quickly. Assuming an average of 10 linked entities per paper (high, but some papers have 50 authors or so), you could easily see something like 2000 (200 * 10) entity requests in a page.


For now we will also have our own limit just for entity access, and that will be well below 500.

I can see that there are some use cases that require accessing an immense number of entities in one way or another, but that's not within the scope of arbitrary access as it's implemented now. We will have a solution for that at some point (see also what Daniel said about lists, as that's related), but that solution will not involve loading whole entities (which is what arbitrary access is about right now); rather, it will be a more specialized access functionality (that would, e.g., only return Statements with a certain PropertyId from a list of Items).

@Mvolz: also note that as long as only the title and local page name of the entity are accessed, special optimizations apply. Such access would not count towards the arbitrary access limit.

@daniel Thanks for the info!

So it would be possible to do limited citations then, but not full citations. For instance, the publication name could be taken from the journal entity for free, but not the location or the journal editors. For authors, a typical citation template has separate fields for first name and last name, which wouldn't be covered.

But you could probably do something not half bad with just the publication title and the full author name, since those don't count towards the limit.

The use cases vary widely, and choosing stable limits seems very hard.

How can we adapt and reduce the load on the servers? Which part should be computed in real time? Which part should be pre-computed?
One approach could be: for each item Qnnnn and each property Pnnnn, compute the mean delay between two accesses and the mean cost of an access.
From these mean costs, select which pre-computed work for Qnnnn and Pnnnn to keep in a cache to reduce the global load on the servers.
The real-time work is then reduced to the requests that are too unpredictable.

This perpetual optimization process could itself use 1% of the server capacity, but greatly reduce the global load.
The question then becomes: which elementary cost do we measure to optimize the global process?

Then the limits can be bigger and less meaningful.

@Rical: We already have entity caching in place; this is (mostly) about other pain points, like in-process memory usage, in-process CPU time usage, used network bandwidth and such things.

Change 199969 merged by jenkins-bot:
Mark accessing arbitrary items as expensive

https://gerrit.wikimedia.org/r/199969

Is there more to do in this or can it be closed?

@Lydia_Pintscher if we want an additional, separate limit for accessing Wikibase items / properties specifically, in addition to the general expensive limit, then that is not implemented yet.

We might want both, in case there are performance or other issues with arbitrary access. Then we can adjust the Wikibase limit specifically.

Change 225474 had a related patch set uploaded (by Hoo man):
Introduce a RestrictedEntityLookup for client DataAccess

https://gerrit.wikimedia.org/r/225474

Change 225474 merged by jenkins-bot:
Introduce a RestrictedEntityLookup for client DataAccess

https://gerrit.wikimedia.org/r/225474

With Change 225474 merged, we now report the number of entities loaded, but the limit we impose is INT_MAX.
I suppose this should remain open until we have a configurable limit implemented and a good default defined.
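
For readers following along, a minimal, self-contained sketch of the decorator idea behind such a lookup (illustrative only; this is not the actual Wikibase RestrictedEntityLookup class, its interfaces, or its exception types, and CountingEntityLookup is a hypothetical name):

  <?php
  // Sketch: wrap an entity lookup, remember which distinct entity IDs have
  // been loaded during the current parse, and refuse further loads once a
  // configured limit is reached. The count can be reported in the parser
  // limit report even while the limit itself is effectively unbounded.
  class CountingEntityLookup {
      /** @var callable function ( string $entityId ): mixed */
      private $innerLookup;
      private int $limit;
      /** @var array<string,bool> entity IDs loaded so far */
      private array $seen = [];

      public function __construct( callable $innerLookup, int $limit ) {
          $this->innerLookup = $innerLookup;
          $this->limit = $limit;
      }

      public function getEntity( string $entityId ) {
          if ( !isset( $this->seen[$entityId] ) ) {
              if ( count( $this->seen ) >= $this->limit ) {
                  throw new RuntimeException( "Entity access limit of {$this->limit} exceeded" );
              }
              $this->seen[$entityId] = true;
          }
          return ( $this->innerLookup )( $entityId );
      }

      public function getEntityAccessCount(): int {
          return count( $this->seen );
      }
  }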

Jonas renamed this task from "Implement a limit for entities accessed via arbitrary access features and mark as expensive" to "[Task] Implement a limit for entities accessed via arbitrary access features and mark as expensive". Aug 19 2015, 11:00 AM

Change 234587 merged by jenkins-bot:
Introduce entityAccessLimit setting (for sane arbitrary access)

https://gerrit.wikimedia.org/r/234587

The default setting is 250 different entities (accessing the full entity object). Use of convenience functions in Lua, such as mw.wikibase.label, does not count towards this limit (they use a TermLookup instead of loading full entities).

Looking at usage tracking, I see only 2 wikis where more than this is used on a page, and they are test / user sandbox pages.

https://phabricator.wikimedia.org/P1946

If needed, we could set a higher limit for these wikis (especially Wikidata, probably) or adjust the default value.
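
For reference, a sketch of what such a per-wiki override could look like in site configuration (assuming the setting is exposed through the usual $wgWBClientSettings array; the value 400 is purely illustrative):

  <?php
  // Wiki-specific configuration (sketch): raise the number of full entity
  // objects a single page parse may load via arbitrary access. Convenience
  // lookups such as mw.wikibase.label do not count against this limit.
  $wgWBClientSettings['entityAccessLimit'] = 400; // default: 250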

Blocked by T106190, thus reopening until that's fixed.

I think this is done now: Everything that's supposed to be expensive has been marked as such and we enforce the limits everywhere.