Performance issues with tags
Open, HighPublic

Description

Since the ListActiveTags hook was implemented, I noticed a significant increase in loading time at Special:Tags. Extensions use this hook in addition to the ListDefinedTags hook. This means they must hit their database twice for many tags.
In order to implement T89553, we need a way for extensions to specify which tags should be 'patrollable', or for T91425, which tags are considered to indicate a 'problem'. If we used the same scheme, this means we would have to create yet another hook listing these.
This would only increase the performance issues, but in that case it would have a greater impact since the list of 'patrollable' tags must be fetched when displaying recent changes to users who can patrol (it would be cached, but still).

I think we should completely change the approach and instead use a single hook "ListDefinedTagsWithParams" that provides the tag names as keys and an array of params as values. 'active' would be one such param.

To get the list of all defined tags, we use array_keys. To check if a specific tag is defined, we use isset instead of in_array. We would need to hit the database only once and could get all the relevant data in one go. Having these keys is also going to be faster when making lists, since array_intersect and array_diff are just as inefficient as in_array. I think this will significantly increase loading time at Special:Tags and make it possible to efficiently fetch patrollable tags for display in revisions.

Updated table after comments 1, 2, 3. The active status is passed as a param and the (int) param for "importance" is replaced with a (bool) needcheck. Also added (string) source for name of the extension and (array) refs for passing additional params like abuse filter ids.
Updated again after patch set upload. 'needcheck' replaced with 'problem', 'refs' with 'msgParams'. Adds 'canDelete'.

key (tag name)value (array of params)result
visualeditorarray( 'source' => 'VisualEditor', 'active' => true )active, not patrollable, not listed at Special:ProblemChanges
visualeditor-needcheckarray( 'source' => 'VisualEditor', 'active' => true, 'problem' => true )active, patrollable, listed at Special:ProblemChanges
possible vandalismarray( 'source' => 'AbuseFilter', 'active' => true, 'needcheck' => true, 'msgParams' => array( 12, 45) )active, used by abuse filters 12 and 45, patrollable, listed at Special:ProblemChanges
OAuth CID : 457array()inactive
OAuth CID : 509array( 'canDelete' => true )inactive, can be deleted by sysops

Hook name : ChangeTagsRegister
Available params : (bool) active, (bool) problem, (string) source, (array) msgParams, (bool) canDelete
The hooks ListDefinedTags, ChangeTagsListActive and ChangeTagCanDelete should be deprecated eventually.
ChangeTagCanCreate and ChangeTagAfterDelete are preserved.

Related Objects

StatusAssignedTask
ResolvedCenarium
OpenNone
OpenNone
OpenCenarium
ResolvedCenarium
ResolvedNone
OpenCenarium
OpenLadsgroup
OpenNone
OpenNone
OpenNone
ResolvedLadsgroup
ResolvedLadsgroup
ResolvedMarostegui
OpenBstorm
ResolvedLadsgroup
ResolvedLadsgroup
OpenNone
OpenNone
OpenNone
OpenNone
OpenLadsgroup
OpenLadsgroup
OpenLadsgroup
OpenNone
Cenarium created this task.Mar 4 2015, 7:34 PM
Cenarium updated the task description. (Show Details)
Cenarium raised the priority of this task from to Normal.
Cenarium added subscribers: Cenarium, TTO.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptMar 4 2015, 7:34 PM
TTO added a comment.Mar 4 2015, 8:52 PM

I agree, we need to do this. We ought to call the new hook ChangeTagsRegister, ChangeTagsDefine or something similar.

However, I would keep it simple to begin with and have only 3 keys initially:

  • active (bool)
  • defined-by (the name of the extension defining the tag, to be used in a system message as per T20670: Defined by the $1 extension)
  • defined-by-msg (a message key whose content is similar to "Defined by the edit filter")

I don't think "importance" is really necessary. Without wanting to sound harsh, I worry that in general, you would like to make the change tagging system too complex and difficult to understand. If it's too had to understand, people will be afraid of it and won't use it properly, or at all.

It indeed makes sense to have 'active' as a sub-key since we might want to still pass params for inactive tags.

In T91535#1090058, @TTO wrote:

I don't think "importance" is really necessary. Without wanting to sound harsh, I worry that in general, you would like to make the change tagging system too complex and difficult to understand. If it's too had to understand, people will be afraid of it and won't use it properly, or at all.

I'm still convinced the only way to properly implement requests like "articles expanded from redirects should appear as in Special:NewPages" is to selectively enable recent changes patrolling for some tags. It doesn't have to be an integer param, we can do with a boolean 'patrollable' and have Special:ProblemChanges use the same. Or would 'needcheck' be more understandable ?

In T91535#1090058, @TTO wrote:

However, I would keep it simple to begin with and have only 3 keys initially:

  • active (bool)
  • defined-by (the name of the extension defining the tag, to be used in a system message as per T20670: Defined by the $1 extension)
  • defined-by-msg (a message key whose content is similar to "Defined by the edit filter")

Interesting ideas, we definitely need to get the abuse filters applying a specific tag. And we should pass the params provided by the hook to the appropriate messages at SpecialTags.php.

Couldn't the message key be simply tags-description-extension-$extensionName though ? So we don't need to pass it too, just the extension's name.

We could do like this :

  • tags-source-extension uses the $extensionName (provided by the hook)
  • a default system message tags-description-extension-$extensionName is placed just below the generic tag description message
  • numeric parameters are passed to this message from a param $refs provided by the hook (an array of numbers)
  • extensions define their default tag description message and pass the $refs in the hook

For AbuseFilter, $refs would be the ids of filters applying the tags and tags-description-extension-AbuseFilter could contain Tagged by filter $1(hist · log). These would be removed from custom tag descriptions, so the final result should render exactly as it does now on enwiki.

Back on the subject of performance, I've noticed that every time the tag filter selector is built, a check !count( self::listDefinedTags() ) is made, cf line 266 of ChangeTags.php. I wonder why this is needed ? Wikis may have no currently defined tags but still have some previously applied (a check that would make more sense theoretically is of tagUsageStatistics). Getting the list of all defined tags and counting is inefficient (and tagUsageStatistics would be even worse). We shouldn't make this check, it's too expensive for what it's worth, incomplete, and we have a config variable $wgUseTagFilter for this.

Cenarium updated the task description. (Show Details)Mar 6 2015, 5:57 PM
Cenarium set Security to None.
Cenarium updated the task description. (Show Details)Mar 6 2015, 6:07 PM
Cenarium claimed this task.Mar 11 2015, 3:44 PM

I'll make a patch set for this, which includes an overhaul of the caching.
We should also merge the ChangeTagCanDelete hook in the single hook as an 'extensionDeletable' param.

The note in 188543 about allowing extensions to define tags that can be applied by users sounds like what I had in mind with 'optionally activated' extension defined tags (I may have presented it badly :).
This could be put as a param "optionally activated" in the hook. However, this would require a new vt_active field in the ChangeTag table. Does that sound OK ?

I'll keep this completely separate from tags defined in core (I'll go back to this after dealing with the perf issues).

Change 201905 had a related patch set uploaded (by Cenarium):
Use single hook to register tags and performance improvements

https://gerrit.wikimedia.org/r/201905

Cenarium updated the task description. (Show Details)Apr 4 2015, 9:24 PM

Change 218265 had a related patch set uploaded (by Cenarium):
Overhaul caching of tag usage statistics

https://gerrit.wikimedia.org/r/218265

Change 218776 had a related patch set uploaded (by Cenarium):
Don't include never-applied defined tags in tagUsageStatistics function

https://gerrit.wikimedia.org/r/218776

Change 218776 merged by jenkins-bot:
Don't include never-applied defined tags in tagUsageStatistics function

https://gerrit.wikimedia.org/r/218776

TTO added a subscriber: aaron.Oct 7 2015, 5:18 AM

@aaron might be interested in some of the patches linked above, and at Cenarium's account at Gerrit. At the very least, he can say yay or nay as to whether the patch is solving an actual problem and is taking the right approach.

aaron added a comment.Oct 7 2015, 7:44 AM

Some profiling of https://en.wikipedia.org/wiki/Special:Tags?forceprofile=1 :

100.00% 7745.827      1 - main()
 99.72% 7723.827      1 - MediaWiki::run
 99.71% 7723.675      1 - MediaWiki::main
 97.81% 7576.048      1 - MediaWiki::performRequest
 97.58% 7558.297      1 - SpecialPageFactory::executePath
 97.57% 7557.402      1 - SpecialPage::run
 97.53% 7554.324      1 - SpecialTags::execute
 97.51% 7553.103      1 - SpecialTags::showTagList
 77.53% 6005.335    559 - DatabaseBase::select
 76.86% 5953.625    559 - DatabaseBase::query
 76.32% 5911.841    560 - section.DatabaseBase::query
 74.98% 5808.055    567 - DatabaseMysqli::doQuery
 74.21% 5747.997    336 - WANObjectCache::getWithSetCallback
 71.43% 5532.785    107 - ChangeTags::tagUsageStatistics
 71.22% 5516.503      1 - ChangeTags::__invoke
 71.12% 5509.073      1 - section.query: SELECT ct_tag,count(*) AS hitcount FROM `change_tag` GROUP BY ct_tag ORDER BY hitcount DESC  [TRX#f8ff7465a544]
 24.98% 1935.126     97 - SpecialTags::doTagRow
 16.19% 1253.704   1232 - Message::toString
 13.97% 1082.083    197 - Message::parse
 13.72% 1062.707    200 - Message::parseText
 13.69% 1060.455    200 - MessageCache::parse
 13.53% 1047.728    201 - Parser::parse

If the statistics table was in ObjectCache::getMainStashInstance() instead of the WAN cache, it could be updated via the jobqueue (a simple de-duplicated job).

listExplicitlyDefinedTags/listExtensionDefinedTags need a process cache, which I'll just make a patch for.

Change 248281 had a related patch set uploaded (by Aaron Schulz):
Fix slow callbacks in getWithSetCallback() using lockTSE

https://gerrit.wikimedia.org/r/248281

Change 248281 merged by jenkins-bot:
Fix slow callbacks in getWithSetCallback() using lockTSE

https://gerrit.wikimedia.org/r/248281

In T91535#1707837, @TTO wrote:

@aaron might be interested in some of the patches linked above, and at Cenarium's account at Gerrit. At the very least, he can say yay or nay as to whether the patch is solving an actual problem and is taking the right approach.

The problem as I see it: at the moment, on enwiki Special:Tags takes about 7-8 seconds to load. It's a lot already for a special page, but it's not all.
We need a list of tags for drop down menus in recent changes, contributions, etc. Those pages load almost instantly, so we really need to increase performance.

The cache is constantly invalidated by the frequent tags mobile edit, visual editor, etc. The db queries are slow and we can't do much about that.
With 218265, tags with more than a given number of hits will no longer cause a cache purge when updated.
So the cache will be used much more often (at the expense of a slight delay in updates for tags with many hits).
For drop down menus, adding another cache further reduces the probability of a db query being required.

listExplicitlyDefinedTags/listExtensionDefinedTags need a process cache, which I'll just make a patch for.

This is done in 201905 (in the ChangeTagsContext class).

Change 218265 abandoned by Cenarium:
Overhaul caching of tag usage statistics

Reason:
merged back into https://gerrit.wikimedia.org/r/#/c/211497

https://gerrit.wikimedia.org/r/218265

Change 334337 had a related patch set uploaded (by Cenarium):
Add change_tag_statistics table to speed up tag hitcount retrieval

https://gerrit.wikimedia.org/r/334337

It takes several minutes to load Special:Tags at wikidata, the way we're retrieving hitcounts is not efficient or sustainable.
It's also a concern for the dropdown (T27909) on cache misses.
My suggestion, implemented in the above patch set, is to create a table holding tag hitcounts, updated when adding/removing tags.

Tbayer added a subscriber: Tbayer.Feb 2 2017, 10:34 PM
Cenarium raised the priority of this task from Normal to High.Feb 10 2017, 1:21 AM

I tried to load Special:Tags at Wikidata. After waiting almost 25 minutes, this is what I got: 504 Gateway Time-out nginx/1.11.6.

Looks like this is broken since January 31, see here.

daniel added a subscriber: daniel.

Adding this to the Wikidata radar, since this is a scalability problem for us. It's not super urgent since it's not critical functionality, but it would certainly be nice if this would work.

Lydia_Pintscher moved this task from incoming to monitoring on the Wikidata board.Dec 7 2017, 4:41 PM

Lydia linked me here. In my current draft for a living persons policy (https://www.wikidata.org/wiki/Wikidata:Living_persons_(draft)) I have a portion that suggests that for every bot edit that a bots that doesn't have a specific "living persons" flag, the bot is supposed to check whether the item is about a living person and if so check whether the property that gets added subclasses Q44597997 or Q44601380.

I don't want to write policy that produces major scalability problems. Do you think it's doable to change the software in a way that such a check isn't a performance problem or would you recommend to skip that part of the policy?

TTO added a comment.Dec 15 2017, 1:15 AM

I have a portion that suggests that for every bot edit that a bots that doesn't have a specific "living persons" flag, the bot is supposed to check whether the item is about a living person

It's not obvious that this has to do with change tagging. As far as I know, checking whether a particular edit has a certain change tag attached to it is a very fast operation, thanks to the indexes on the change_tag table. Is that what you are referring to?

I'm not sure why Lydia linked to this task. If you think that the proposed check causes no performance that would be great. If it produces other performance issues it would be good to know.