Page MenuHomePhabricator

[Pages Created] Redirects are counted as "pages created"
Open, LowPublic8 Estimated Story Points

Description

Hello!

I'm reporting an issue on behalf of a user from German Wikipedia:

  • User A creates a redirect.
  • User B turns this redirect into an article or a disambiguation page.
  • --> The page is counted as a "page created" for user A, not for user B. This feels wrong because a redirect is not the start of an article.

Can this be changed so that the creation of redirects is not counted as a page created?

Thank you and many greetings,
Johanna

Related: T190065

Event Timeline

Can this be changed so that the creation of redirects is not counted as a page created?

It is not trivial, unfortunately. We know a page is a redirect from the page_is_redirect column in the page table. This value only pertains to the current revision of the article. XTools is able to see that you created the page, and that it isn't a redirect, so it reports it as an article you created. Meanwhile there is nothing in the logging table that says a redirect was turned into an article, so the only way to do this check is to parse the revisions of the page. This would be very inefficient.

I'll leave this ticket open but I'm not sure there is a viable solution :( Sorry!

Same for self-created redirects, that are later converted into an article: they don't show up in 'pages created' either

Matthewrbowker moved this task from Backlog to General / other on the XTools board.
Matthewrbowker subscribed.

Makes sense.

We would need to investigate whether this is technically possible. That would be the first step.

Same for self-created redirects, that are later converted into an article: they don't show up in 'pages created' either

I cannot reproduce that; either it was due to a misunderstanding with caching, or it was since fixed.

Possible solutions

For anything pre-2018, edits which made a page a redirect or removed the redirection are functionally indistinguishable from other edits.
However, around that time were introduced the change tags mw-new-redirect and mw-removed-redirect. Therefore, we can fix by adding conditions after the rev_parent_id = 0:

  • to prevent someone who just created the redir from getting the credit for the current non-redir page, something along the lines of
AND (page_is_redirect OR NOT EXISTS (
  SELECT 1
  FROM change_tag
  JOIN change_tag_def
  ON ctd_id = ct_tag_id
  WHERE ct_rev_id = rev_id
  AND ctd_name = 'mw-new-redirect'
))

which essentially boils down to "and the creating edit created a redir, but the page isn't a redir anymore".

  • to give credit to someone who de-redir'd a redir, it'd be a bit more tricky. we could do something along the lines of
OR EXISTS (
  SELECT 1
  FROM change_tag
  JOIN change_tag_def
  ON ctd_id = ct_tag_id
  WHERE ct_rev_id = rev_id
  AND ctd_name = 'mw-removed-redirect'
)

The issues with that approach would be:

  1. someone that de-redir's a page made into a redir by vandalism, they aren't really the creators
  2. we'd be counting multiple creators
  3. vice-versa: a vandal removing a redir shouldn't become the creator

The third issue can be mitigated by just AND (NOT page_is_redirect). The first/second are more tricky. In effect, doing it "correctly" would require checking properties related to being the first/last/only de-redir of a page, which requires parsing history and so is a no-no. A good approximation of the first could be just checking that the page originally was a redir at creation, i.e. adding to the previous EXISTS:

AND EXISTS (
  SELECT 1
  FROM revision as revs
  JOIN change_tag
  ON ct_rev_id = revs.rev_id
  JOIN change_tag_def
  ON ctd_id = ct_tag_id
  WHERE revs.rev_page = rev_page
  AND revs.rev_parent_id = 0
  AND ctd_name = 'mw-new-redirect'
)

I am afraid that there aren't much solutions for the second that don't require parsing the entire history, or at least an unknown number of revisions.


Is it worth it?

Now, for the cost of the above. Putting together all of the above, it would mean replacing AND rev_parent_id = 0 by:

AND (
  (
    rev_parent_id = 0
    AND (
      page_is_redirect
      OR
      NOT EXISTS (
        SELECT 1
        FROM change_tag
        JOIN change_tag_def
        ON ctd_id = ct_tag_id
        WHERE ct_rev_id = rev_id
        AND ctd_name = 'mw-new-redirect'
      )
    )
  )
  OR
  (
    rev_parent_id != 0
    AND NOT page_is_redirect
    AND EXISTS (
      SELECT 1
      FROM change_tag
      JOIN change_tag_def
      ON ctd_id = ct_tag_id
      WHERE ct_rev_id = rev_id
      AND ctd_name = 'mw-removed-redirect'
    )
    AND EXISTS (
      SELECT 1
      FROM revision as revs
      JOIN change_tag
      ON ct_rev_id = revs.rev_id
      JOIN change_tag_def
      ON ctd_id = ct_tag_id
      WHERE revs.rev_page = rev_page
      AND revs.rev_parent_id = 0
      AND ctd_name = 'mw-new-redirect'
    )
  )
)

Our worst case is a rev which a) is not a creation proper (as in rev_parent_id != 0), b) is currently not a redir, and c) has mw-removed-redirect. We'd do two simple checks on columns we have, we'd once check a tag of a rev by rev_id, and once checking the tag of the first rev of the page. Did some tests, and: tag checking by rev_id takes pretty consistently 1.7ms on average (quarry); however, checking a tag of first rev by page is much slower (quarry). Not 100% sure, but I think it takes between 100ms and 500ms (varied widely in testing). Note: for the tests, I checked, and taking the last revisions or first revisions does not alter performance noticeably.
That's the worst case. Very rough statistics:
6% of edits are page creations.
21% of pages are redirects.
0.5% of edits have mw-new-redirect.
0.04% of edits have mw-removed-redirect.
Therefore, the tree looks like this:

  • 6% go into the first branch of the OR (creations)
    • 21% of those, or 1.3% total don't do the tag check of the first branch (redirects). Here the time addition is negligible.
    • 79% of those, or 4.7% total do the tag check (non-redirects). Here, the time addition is roughly 1.7ms.
  • 94% go into the second branch of the OR (non-creations)
    • 21% of those, or 29.7% of the total, stop here (redirects). Here the time addition is negligible.
    • 79% of those, or 74.2% of the total, do the first tag check (non-redirects)
      • 99.96% of those, or 74.1% of the total, stop here (not redir removals). Here, the time addition is roughly 1.7ms.
      • 0.04% of those, or 0.03% of the total, do the last check (redir removals). Here, the time addition is very roughly 200ms.

So for 99.97% of edits, the addition is <1.7ms, and for the other 0.03%, it's ~=200ms.
This change would on average add 0.9997*1.7 + 0.0003*200 = 1.75ms per edit.
I'm checking the real performance impact right now.

Testing with this, at >330K edits: this made time grow from 68s to 139s; which is +71s; or 0.21ms/edit. That's better than my predictions, but still much too slow.
Removing the last expensive test, that brings it to 109s; which is +31s; or 0.09ms/edit. That's still very slow, and I don't think it's worth it, especially knowing that removing this last expensive test means that anyone who removed a redirect gets the page counted as created by them.
Except if we can make those tag checks significantly faster, we'll have to decline this. The root issue is that we can't just JOIN, because each tag has a separate row in change_tag.