Page MenuHomePhabricator

EditCounter's "pages created" does not match Pages tool
Closed, ResolvedPublic3 Estimated Story Points

Description

E.g. https://xtools.wmflabs.org/ec-generalstats/en.wikipedia.org/GoldenRing versus https://xtools.wmflabs.org/pages/en.wikipedia.org/GoldenRing/all/none

The discrepancy appears to be that the Pages tool (appropriately) does not count deleted pages that were redirects created because of a move. In a nutshell, we need a version of this query that gives us a single COUNT:

SELECT a.ar_namespace AS namespace,
    'arc' AS type,
    a.ar_title AS page_title,
    0 AS page_len,
    '0' AS page_is_redirect,
    MIN(a.ar_timestamp) AS rev_timestamp,
    a.ar_user AS rev_user,
    a.ar_user_text AS username,
    a.ar_len AS rev_len,
    a.ar_rev_id AS rev_id,
    NULL AS pa_class,
    NULL AS pa_page_id,
    NULL AS pa_page_revision
FROM `enwiki_p`.`archive_userindex` a
JOIN (
    SELECT b.ar_namespace, b.ar_title
    FROM `enwiki_p`.`archive_userindex` AS b
    LEFT JOIN `enwiki_p`.`logging_userindex`
        ON log_namespace = b.ar_namespace
        AND log_title = b.ar_title
        AND log_user = b.ar_user
        AND (log_action = 'move' or log_action = 'move_redir')
    WHERE ar_user = '90672'
        AND ar_timestamp > 1
        AND b.ar_parent_id = '0'
        AND log_action IS NULL
) AS c ON c.ar_namespace = a.ar_namespace AND c.ar_title = a.ar_title
GROUP BY a.ar_namespace, a.ar_title
HAVING  rev_user = '90672'

Make sure both interfaces match what is currently output by the Pages tool.

Event Timeline

kaldari edited projects, added Community-Tech-Sprint; removed Community-Tech.
kaldari set the point value for this task to 3.

So, am I correct in the following thinking? (Because none of the numbers match for this example user, live nor deleted: Edit Counter says 252 live & 13 deleted; and Pages says 250 live (+7 live redirects) & 11 deleted.)

1. Pages created and still alive:

Distinct page IDs from all revisions by this user where there's no parent revision,
but not including those that have a matching 'move' log entry
(where 'matching' is the same page ID and user ID).

This was:

SELECT 'created-live' AS `key`, COUNT(DISTINCT rev_page) AS `val`
FROM revision_userindex
WHERE rev_user = 90672 AND rev_parent_id = 0

but should be:

SELECT 'created-live' AS `key`, COUNT(DISTINCT rev_page) AS `val`
FROM revision_userindex r
	LEFT JOIN logging_userindex l
		ON (
			r.rev_user = l.log_user
			AND r.rev_page = l.log_page
			AND (log_action = 'move' OR log_action = 'move_redir')
		)
WHERE
	rev_user = 90672
	AND rev_parent_id = 0
	AND l.log_id IS NULL

But I'm not sure that's correct, because the 'matching' log entry is for the moved page rather than the redirect. ("In the case of a page move, this is set to the page_id of the moved page (since Gerrit change 157872). Formerly, it was the page_id of the redirect, or 0 if the page was moved without creating a redirect.")

2. Pages created and since deleted (by someone else):

Distinct page IDs from all archived revisions by this user where there's no parent revision,
but not including those that have a matching 'move' log entry
(where 'matching' is the same page ID and user ID).

This is currently:

SELECT 'created-deleted' AS `key`, COUNT(DISTINCT ar_page_id) AS `val`
                FROM archive_userindex
                WHERE ar_user = 90672 AND ar_parent_id = 0

but should be:

SELECT 'created-deleted' AS `key`, COUNT(DISTINCT ar_page_id) AS `val`
FROM
	archive_userindex
	LEFT JOIN logging_userindex
		ON (
			ar_user = log_user
			AND ar_page_id = log_page
			AND (log_action = 'move' OR log_action = 'move_redir')
		)
WHERE
	ar_user = 90672
	AND ar_parent_id = 0

Sorry for my confusion; I'm sure there's some simple thing I'm missing here, or I'm just not familiar enough with the relationship between logging and archiving.

So I'm thinking that the correct number is to be got by this query:

SELECT 'created-deleted' AS `key`, COUNT(DISTINCT ar_page_id) AS `val`
FROM $archiveTable LEFT JOIN $loggingTable ON (
    ar_title = log_title
    AND ar_namespace = log_namespace
    AND ar_page_id = log_page
    AND (log_action = 'move' OR log_action = 'move_redir')
)
WHERE ar_user = :userId AND ar_parent_id = 0 AND log_id IS NULL

but it really is a bit slow. (It's quicker when joining on ar_page_id, but that's not always available it seems, for older data.)

@Samwilson: Isn't the count in "Edit counter" supposed to exclude all redirects anyway? Not sure I understand why redirects created during page move would be an issue in that case.

No, this is about counting how many pages a user has created that have since been deleted. So we were just looking at the number of unique page IDs referenced in archived revisions that have no parent revision. But that includes pages created as redirects, which we don't want.

We could join to the page table instead of the logging table (and filter on page_is_redirect), but that would change it to exclude pages that were redirects at the time the page was deleted, rather than at creation.

@MusikAnimal do you have some advice?

All things considered, I'm thinking we should just show the raw pages created count as we are now, and leave a footnote explaining that all pages are included. Or we can just remove this stat altogether. Perfecting it apparently is not going to be cheap, and after all, it's not the primary use case for this tool.

Merged! Will be on xtools-dev shortly and deployed to prod with the next release