Mysterious anonymous content page creations on English Wikipedia according to stats.wikimedia.org
Closed, Resolved · Public

Description

According to https://stats.wikimedia.org/#/en.wikipedia.org/contributing/new-pages/normal|bar|1-month|editor_type-~anonymous~page_type~content|daily, anonymous users create an average of about 3 new content pages on English Wikipedia every day. This is strange, though, since anonymous page creation in the main namespace was disabled on English Wikipedia back in 2005, and English Wikipedia doesn't have any content namespaces besides the main namespace. If you look at https://en.wikipedia.org/w/index.php?title=Special:NewPages&hideredirs=0&hideliu=1, it shows no article creations by anonymous users in the past 30 days. So either stats.wikimedia.org's definition of "content pages" doesn't match MediaWiki's definition of "content pages", or something's broken.

Event Timeline

kaldari created this task. Oct 27 2020, 5:07 PM
Restricted Application added a project: Analytics. Oct 27 2020, 5:07 PM
LGoto moved this task from Triage to Tracking on the Product-Analytics board. Oct 27 2020, 5:12 PM

For a sense of scale, we get ~600 pages created a day, so about 3 a day is a small share. Currently I don't think this requires a deep dive from our end, but @fdans - let us know if you think it's indicative of a bigger problem.

Ammarpad added a subscriber: Ammarpad. Edited Oct 27 2020, 5:30 PM

If an anonymous user creates an article in the draft namespace and it's later moved (with all its history) to the content namespace, the end result is, I think, just as if they had created it there. For instance, this article https://en.wikipedia.org/wiki/Rakeem_Buckles was created by an IP today and it's now in the content namespace. You'd have to pay attention to the null entry left by the move to see what happened. The definition of 'new pages' as used in the stats does not seem to exclude such pages.
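To illustrate the mechanism described above, here is a minimal sketch in plain Python. It is a toy model, not the actual stats.wikimedia.org pipeline: the `PageEvent` class, the `classify` function, and the registered user id 12345 are all hypothetical. It assumes namespace 0 is the only content namespace and uses 118 for the Draft namespace on English Wikipedia.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

CONTENT_NAMESPACES = {0}  # assumption: only the main namespace counts as content
DRAFT_NAMESPACE = 118     # Draft namespace on English Wikipedia


@dataclass
class PageEvent:
    """Hypothetical, simplified page-history event."""
    event_type: str         # "create" or "move"
    user_id: Optional[int]  # None stands for an anonymous (IP) editor
    namespace: int          # namespace the page is in after this event


def classify(events: List[PageEvent]) -> Tuple[bool, bool, bool]:
    """Return (by_anon, is_content_now, was_content_at_creation) for one page."""
    create = next(e for e in events if e.event_type == "create")
    current = events[-1]
    return (
        create.user_id is None,
        current.namespace in CONTENT_NAMESPACES,
        create.namespace in CONTENT_NAMESPACES,
    )


# An IP creates a draft; a registered user (hypothetical id) later moves it to mainspace.
history = [
    PageEvent("create", None, DRAFT_NAMESPACE),
    PageEvent("move", 12345, 0),
]
result = classify(history)
print(result)  # (True, True, False)
```

A metric that keys on the page creator and the page's current namespace would count this page as an anonymous content-page creation, even though the IP never created anything in the main namespace.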

kaldari closed this task as Resolved. Oct 27 2020, 5:32 PM
kaldari claimed this task.

@Ammarpad - Ah, that makes sense! Thanks for clearing up the mystery!

I did a quick check for month 2020-09:

spark.sql("""
SELECT
  (caused_by_user_id IS NULL) as by_anon,
  page_namespace_is_content, -- current value of the page_namespace for the page
  page_namespace_is_content_historical, -- page_namespace at the time of page creation
  COUNT(1)
FROM wmf.mediawiki_page_history
WHERE snapshot = '2020-09'
  AND caused_by_event_type = 'create'
  AND start_timestamp >= '2020-09-01'
  AND wiki_db = 'enwiki'
  AND not page_is_deleted
GROUP BY
  (caused_by_user_id IS NULL),
  page_namespace_is_content,
  page_namespace_is_content_historical
ORDER BY
  by_anon,
  page_namespace_is_content,
  page_namespace_is_content_historical
""").show(100, false)

+-------+-------------------------+------------------------------------+--------+
|by_anon|page_namespace_is_content|page_namespace_is_content_historical|count(1)|
+-------+-------------------------+------------------------------------+--------+
|false  |false                    |false                               |210071  |
|false  |false                    |true                                |920     |
|false  |true                     |false                               |1131    |
|false  |true                     |true                                |51642   |
|true   |false                    |false                               |2552    |
|true   |false                    |true                                |1       |
|true   |true                     |false                               |64      |
|true   |true                     |true                                |3       |
+-------+-------------------------+------------------------------------+--------+

This tells us that almost all pages created by anonymous users were created in non-content namespaces (2552 + 64 = 2616 of 2620). Most of them are still in those namespaces (2552), while 64 are now in a content namespace.
There are 4 pages that are bizarre (the rows with counts 1 and 3), namely the ones for which both by_anon and page_namespace_is_content_historical are true, meaning they are reported as created by anonymous users in the main namespace.
I have investigated, and for 3 of those 4 pages the create event we use to report page creation is not the most accurate one we could use (complex issue, see T264791); the more accurate events report those pages as user-created. I have not looked further into the last one.
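As a rough cross-check, the anonymous rows of the table can be tallied in a few lines of plain Python. The counts are copied from the output above. The per-day figure assumes that the dashboard counts pages by their current namespace, which is a guess on my part, and it covers September 2020 (30 days) rather than the exact window shown on stats.wikimedia.org.

```python
# (by_anon, is_content_now, was_content_at_creation) -> page count, copied from the table
counts = {
    (False, False, False): 210071,
    (False, False, True): 920,
    (False, True, False): 1131,
    (False, True, True): 51642,
    (True, False, False): 2552,
    (True, False, True): 1,
    (True, True, False): 64,
    (True, True, True): 3,
}

# Keep only the rows for pages created by anonymous users.
anon = {key: n for key, n in counts.items() if key[0]}

anon_total = sum(anon.values())
anon_created_non_content = sum(n for key, n in anon.items() if not key[2])
anon_created_in_content = sum(n for key, n in anon.items() if key[2])
anon_now_content = sum(n for key, n in anon.items() if key[1])

# Assumption: the dashboard counts by current namespace; September has 30 days.
per_day = anon_now_content / 30

print(anon_total)                # 2620
print(anon_created_non_content)  # 2616
print(anon_created_in_content)   # 4
print(anon_now_content)          # 67
print(round(per_day, 1))         # 2.2
```

Under that assumption, about 2.2 pages per day is in the same range as the roughly 3 per day shown on the dashboard, and 64 of the 67 pages were created outside the content namespace, which matches the draft-then-move explanation.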