
When examining abuse filter hits, the variable old_links contains duplicate protocol-relative URLs
Closed, ResolvedPublicBUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • Add a protocol-relative URL to a page (do not link to your own site; such links are not stored)
  • Create an abuse filter that uses the variable old_links, and make sure the variable is actually evaluated when the filter runs, for example by making it the first condition: create a warning filter with '//test.org' in old_links | action == 'edit', which is triggered on every edit
  • Make another edit to the page containing the protocol-relative URL
  • Make sure the abuse filter warns you about a bad action
  • Go to Special:AbuseLog and open "examine" for the recent entry (the page should be linked as Special:AbuseFilter/examine/log/<new id>)

What happens?:
The protocol-relative URL is shown twice in the list for the variable old_links

What should have happened instead?:
The URL should be listed only once
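For illustration, the examine view shows the value roughly like this (formatting approximated; the actual output may differ):

old_links (actual):
0 => '//test.org'
1 => '//test.org'

old_links (expected):
0 => '//test.org'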

Software version (skip for WMF-hosted wikis like Wikipedia): master

Other information (browser name/version, screenshots, etc.): not needed

Event Timeline

Change 818610 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/extensions/AbuseFilter@master] Use DISTINCT on LazyVariableComputer::getLinksFromDB

https://gerrit.wikimedia.org/r/818610

Umherirrender added a subscriber: matej_suchanek.

@DBA, @Performance Team:

The above patch set would change:

SELECT  el_to AS `value`  FROM `externallinks`    WHERE el_from = 14

to

SELECT DISTINCT el_to AS `value`  FROM `externallinks`    WHERE el_from = 42

while evaluating abuse filters on edits.

Could that be a performance problem?
Or would it be better to use array_unique to do that deduplication in PHP? (See the sketch below.)
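For context, the two options would look roughly like this (a sketch only, assuming the query is built with MediaWiki's SelectQueryBuilder; the actual code in LazyVariableComputer::getLinksFromDB may differ in detail, and $articleId just stands in for the page id being looked up):

// Option 1 (what the patch does): let the database deduplicate via DISTINCT.
$links = $dbr->newSelectQueryBuilder()
	->select( 'el_to' )
	->distinct()
	->from( 'externallinks' )
	->where( [ 'el_from' => $articleId ] )
	->caller( __METHOD__ )
	->fetchFieldValues();

// Option 2: fetch all rows and deduplicate in PHP afterwards.
$links = $dbr->newSelectQueryBuilder()
	->select( 'el_to' )
	->from( 'externallinks' )
	->where( [ 'el_from' => $articleId ] )
	->caller( __METHOD__ )
	->fetchFieldValues();
$links = array_values( array_unique( $links ) );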

Thanks for an assessment.

The worst case for testwiki, page id 71649, with over 4000 links, stays in the same order of magnitude: only about 4000 records have to be sorted in memory, and the query is only about 3 times slower.

root@db1102.eqiad.wmnet[testwiki]> EXPLAIN SELECT el_to AS `value`  FROM `externallinks`    WHERE el_from =71649\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: externallinks
         type: ref
possible_keys: el_from,el_from_index_60
          key: el_from
      key_len: 4
          ref: const
         rows: 8488
        Extra: 
1 row in set (0.001 sec)

root@db1102.eqiad.wmnet[testwiki]> EXPLAIN SELECT DISTINCT el_to AS `value`  FROM `externallinks`    WHERE el_from =71649\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: externallinks
         type: ref
possible_keys: el_from,el_from_index_60
          key: el_from
      key_len: 4
          ref: const
         rows: 8488
        Extra: Using where; Using temporary
1 row in set (0.001 sec)

root@db1102.eqiad.wmnet[testwiki]> pager cat > /dev/null
PAGER set to 'cat > /dev/null'
root@db1102.eqiad.wmnet[testwiki]> SELECT DISTINCT el_to AS `value`  FROM `externallinks`    WHERE el_from =71649;
2615 rows in set (0.061 sec)

root@db1102.eqiad.wmnet[testwiki]> SELECT el_to AS `value`  FROM `externallinks`    WHERE el_from =71649;
4358 rows in set (0.020 sec)

Do you know if the code could potentially apply to larger datasets, e.g. pages with millions of external links? In that case the sorting would be slower, as it probably wouldn't fit in the buffer.

> Do you know if the code could potentially apply to larger datasets, e.g. pages with millions of external links?

That depends on whether a wiki page with millions of external links could really exist. There are limits on how large a wiki page can be (including templates).
(For example, if that page continued growing and reached the maximum page size of 2^21 bytes, it could hold about seven times as many links: 4358 × 7 ≈ 30,500, i.e. roughly 31,000.)

While an extreme-case analysis is valuable, I think what also concerns us is the influence on the overall save timing metric (T277788). The query in question executes when a user attempts to save a page and an edit filter needs to know the external links in the old version. Not many filters do that, and it probably does not happen often anyway, because filter maintainers are reminded to use such variables as late as possible (and to start by excluding, e.g., trusted users and bots).
I'm honestly not sure what information we need for estimating that.

Assumptions:

  • The variable is lazily computed, meaning it is only fetched if an active filter references it and evaluation reaches the relevant condition.
  • The deduplication is needed for the filter to work correctly.
  • We'll need to do that logic either in PHP or in the DB.

If we don't need it for the filter to work correctly, then it might be better to do this at the "examine" layer instead of on the raw computation side that runs during edits. Or possibly in the middle, e.g. in the part where we write the computed information to the database for later examination. This "write to examine database" step presumably can or already does run post-send.
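For illustration only, that "middle" option could look something like the following hypothetical sketch ($varDump and the exact hook point are made up; AbuseFilter's real variable-dump code differs):

// Hypothetical: deduplicate the link lists only when the computed
// variables are serialized for the abuse log, off the critical save
// path, instead of during filter evaluation on every edit.
foreach ( [ 'old_links', 'new_links', 'all_links' ] as $name ) {
	if ( isset( $varDump[$name] ) && is_array( $varDump[$name] ) ) {
		$varDump[$name] = array_values( array_unique( $varDump[$name] ) );
	}
}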

Change 818610 merged by jenkins-bot:

[mediawiki/extensions/AbuseFilter@master] Use DISTINCT on LazyVariableComputer::getLinksFromDB

https://gerrit.wikimedia.org/r/818610

Umherirrender claimed this task.
Umherirrender removed a project: Patch-For-Review.