Page MenuHomePhabricator

Determine: What percentage of new articles are created by non-autoconfirmed editors
Closed, ResolvedPublic

Description

We believe that roughly ~1500 new mainspace pages are created each day at Enwiki (including redirects).

  • What percentage of these, are created by editors without the "autoconfirmed" usergroup?
  • Bonus: Is it possible to get a linked list of 1 day's worth of articles, split into "autoconfirmed" vs "non-autoconfirmed" groups?

It should be possible to retrieve this information from the Data Lake.
If possible, we'd like to know the answer within the next week or two, for ongoing discussions. Thanks!


(Note 1: I'm asking specifically for Enwiki stats, but hopefully we'll be able to re-use the search method for finding out details about any other wikis, later on.)
(Note 2: We know that roughly ~150 new Draft pages are submitted each day at Enwiki, per https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Articles_for_creation/Submissions - just noting for reference.)

See also: T149049: Create a chart showing percentage of new articles created each month that have not survived to the present

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
In T149021#2769737, @Neil_P._Quinn_WMF wrote:

@kaldari, yes, this comes from the recentchanges rather than the revision table, so it includes all articles that were initially created, whether or not they were later deleted, moved, or turned into a redirect.

Are you sure about that? recentchanges rows are deleted for deleted pages, see https://test.wikipedia.org/wiki/Special:RecentChanges . The deletion isn't immediate, but it only takes a few seconds.

@Neil_P._Quinn_WMF: Looks like Roan is right (from looking at the testwiki database). Can you add a note to the report mentioning that? Is there an easy way to also get the percentage for all article creations, or is that more complicated?

Is it possible,as in this research, to also track the % of these articles that remain 30 days after publication? (I know, now we're pushing it...)

@Catrope, thanks for catching that mistake!

@kaldari, I should be able to get the overall percentage by going to the archive and revision tables instead of recentchanges. I'll work on that today.

@jmatazzoni, take a look at T149049#2778616. That chart shows the survival rate of new articles over time, and suggests that in recent years it's been about 75%. There are some questions about my methodology, but so far I haven't found any mistakes.

@jmatazzoni, take a look at T149049#2778616. That chart shows the survival rate of new articles over time,

Thanks Neil. Very interesting. That must be all new articles, right? Not new articles by non-autoconfirmed users? (Would it be possible to get the latter?)

Back before Thanksgiving, I tried to get the data accounting for article deletions using the archive and revision tables. However, I ended up having to kill those queries as they had run for nearly a day without completing.

I've updated the report to mention that data caveat, and I'll try some more ways to get that data this week.

@Quiddity, I believe I can close this and T149049 as resolved. Is that correct?

I think T149049 is resolved, but we were never able to really answer the original question in this task (due to the missing data from deleted articles). I think we may need T150369 to get the final answer.

Ping @Neil_P._Quinn_WMF Looks like you tried to gather this data a while back, have you tried again using data lake?

Ping everyone on ticket again, seems that this issue has some urgency. Is anyone working on gathering the data, as we mentioned is available on edit data lake. cc @kaldari @Neil_P._Quinn_WMF

We still need to report this data back to the page patrollers on English Wikipedia and would really appreciate any assistance. Unfortunately, this is an out-of-process request that arose from an emergency situation (English Wikipedia threatening to restrict article creation to auto-confirmed users). Without good data we don't have much influence in this situation, or even a way to make an informed decision about the WMF's position on the issue.

@Neil_P._Quinn_WMF: Are you still interested in helping with this, or should we unassign you from the ticket?

kaldari updated the task description. (Show Details)

@kaldari @Neil_P._Quinn_WMF i s there a report or write up of this on wiki anywhere? I'd like to know the status of the analysis. It looks as though it hit a roadblock (or at least a timeblock on Neil's part), but I'm not 100% sure I understand what the outstanding issues are.

@Capt_Swing: The outstanding issue is getting data for deleted articles. I don't think it would be safe to just extrapolate the existing data since autoconfirmed users might have a signficantly different deletion rate than non-autoconfirmed users. According to Nuria, it should be possible to get all the data (including for deleted articles) from the Data Lake. The write-up for Neil's initial efforts can be seen at https://github.com/wikimedia-research/Editing-2016-10-enwiki-article-creation/blob/master/Analysis.ipynb. According to this data, 87% of (non-deleted) new articles on the English Wikipedia are created by autoconfirmed users.

Just to capture Kaldari, Danny and my discussion on this -- we actually need to know how many deleted articles were created by autoconfirmed users.

@Tnegrin has told me that this has some urgency, and has asked me to look back into it. My time is severely limited because I'm co-leading the New Editor Experiences research (we're flying to South Korea this coming Monday), but I'll do what I can.

Nobody has told me about any specific deadline here—let me know if I'm missing something.

Thanks Neil -- there is some urgency. Please do an initial estimate of time
and we can get some help if need be if this will go past Friday.

Thanks @kaldari. Just FYI I'm fairly sure this can be done without the data lake, too, if that's been a blocker. In the past, I've looked for deleted pages in the enwiki_p.archive table (pages deleted for normal reasons, like no notability) and RevisionDeleted pages (blatant attack pages) in the revision table on production db stat1003. If you do a random subsample of, say, 2000 pages created within a year window you should be able to calculate the autoconfirmed status of the page creator and then get an accurate estimate of the % deleted pages in that sample with a simple loop... but I've never used the data lake, so it may very well be easier that way.

Ping @Neil_P._Quinn_WMF if you need help ping us and we can help.

@Capt_Swing, @Neil_P._Quinn_WMF: the data is indeed easier to get from the data lake (https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history):

  • Page deletion: event_entity: page, event_type: delete
  • Main namespace: page_namespace_is_content
  • Users registered >= 4 days: event_user_creation_timestamp - event_timestamp >= 4 days
  • User Edit Count > 10: join to mediawiki_revision by wiki, user, event_user_creation_timestamp, and event_timestamp and count(*) as edit_count

At this point I'm wondering if other details were left out of this request, like do we care whether these users are bots or not? It'd be good to specify all the dimensions. In any case, putting it all together without making any other assumptions, the query is:

 select mediawiki_history.wiki_db,
        substr(event_timestamp, 0, 8) as day,
        sum(if(event_type = 'create', 1, 0)) pages_created,
        sum(if(event_type = 'delete', 1, 0)) pages_deleted

   from mediawiki_history
            inner join
        (select wiki_db,
                event_user_id,
                day,
                sum(edit_count) over (partition by wiki_db, event_user_id order by day) as edit_count
           from (select wiki_db,
                        event_user_id,
                        substr(event_timestamp, 0, 8) as day,
                        count(*) as edit_count
                   from mediawiki_history
                  where event_entity = 'revision'
                    and event_type = 'create'
                    and wiki_db = 'enwiki'
                    and snapshot = '2017-04'
                  group by wiki_db, event_user_id, substr(event_timestamp, 0, 8)
                ) user_edits_per_day
        ) ueatoe        ON mediawiki_history.wiki_db = ueatoe.wiki_db
                       AND mediawiki_history.event_user_id = ueatoe.event_user_id
                       AND substr(event_timestamp, 0, 8) = ueatoe.day

  where event_entity = 'page'
    and event_type in ('delete', 'create')
    and page_namespace_is_content
    and unix_timestamp(event_timestamp, 'yyyyMMddHHmmss') -
        unix_timestamp(event_user_creation_timestamp, 'yyyyMMddHHmmss')
            >= 345600
    and ueatoe.edit_count >= 10
    and mediawiki_history.wiki_db = 'enwiki'
    and snapshot = '2017-04'
    and event_timestamp > '20170101000000'

  group by mediawiki_history.wiki_db, substr(event_timestamp, 0, 8)
  order by mediawiki_history.wiki_db, day
  limit 100000
;

This is running in a screen now and I'll copy the results when it's done. Note that the query is actually quite simple except for the edit_count computation. I'm using an approximation for that, getting only the edit_count on the day of the event instead of at the exact moment (I wasn't sure how fast either would be but I'll try grouping it down to the hour and seeing if there's a big difference).

Ok, got the results for the daily approximation, running with hourly now:

enwiki 20170101 1541 518
enwiki 20170102 1403 544
enwiki 20170103 1566 679
enwiki 20170104 1617 612
enwiki 20170105 1291 528
enwiki 20170106 1576 547
enwiki 20170107 1395 716
enwiki 20170108 2111 1244
enwiki 20170109 1727 482
enwiki 20170110 1366 559
enwiki 20170111 1337 483
enwiki 20170112 1362 521
enwiki 20170113 1815 690
enwiki 20170114 1556 643
enwiki 20170115 1784 421
enwiki 20170116 1984 746
enwiki 20170117 1462 627
enwiki 20170118 1682 505
enwiki 20170119 1459 678
enwiki 20170120 1478 836
enwiki 20170121 1978 762
enwiki 20170122 1434 648
enwiki 20170123 1279 1840
enwiki 20170124 1470 4163
enwiki 20170125 1944 3937
enwiki 20170126 1407 863
enwiki 20170127 1440 582
enwiki 20170128 2025 435
enwiki 20170129 1728 605
enwiki 20170130 1882 521
enwiki 20170131 1519 602
enwiki 20170201 1439 505
enwiki 20170202 1201 665
enwiki 20170203 1373 629
enwiki 20170204 1409 497
enwiki 20170205 1567 555
enwiki 20170206 1538 676
enwiki 20170207 1358 696
enwiki 20170208 1426 647
enwiki 20170209 1446 604
enwiki 20170210 1556 513
enwiki 20170211 1477 346
enwiki 20170212 1566 730
enwiki 20170213 1519 580
enwiki 20170214 1623 580
enwiki 20170215 1676 514
enwiki 20170216 1724 656
enwiki 20170217 1329 710
enwiki 20170218 1553 540
enwiki 20170219 1599 389
enwiki 20170220 1430 543
enwiki 20170221 1856 728
enwiki 20170222 1563 607
enwiki 20170223 2098 535
enwiki 20170224 1684 594
enwiki 20170225 1715 525
enwiki 20170226 2148 418
enwiki 20170227 1667 564
enwiki 20170228 1577 770
enwiki 20170301 1610 655
enwiki 20170302 2612 511
enwiki 20170303 1491 492
enwiki 20170304 1590 445
enwiki 20170305 1707 358
enwiki 20170306 2001 496
enwiki 20170307 1571 665
enwiki 20170308 1584 555
enwiki 20170309 1571 504
enwiki 20170310 1514 417
enwiki 20170311 1558 1200
enwiki 20170312 1948 474
enwiki 20170313 1682 778
enwiki 20170314 1984 526
enwiki 20170315 1716 586
enwiki 20170316 1868 507
enwiki 20170317 1848 415
enwiki 20170318 1611 359
enwiki 20170319 2050 512
enwiki 20170320 1683 423
enwiki 20170321 1836 572
enwiki 20170322 1598 598
enwiki 20170323 1451 489
enwiki 20170324 1903 487
enwiki 20170325 1396 456
enwiki 20170326 1969 385
enwiki 20170327 2124 575
enwiki 20170328 1461 487
enwiki 20170329 2105 526
enwiki 20170330 1864 485
enwiki 20170331 1341 396
enwiki 20170401 1318 405
enwiki 20170402 1532 405
enwiki 20170403 1534 442
enwiki 20170404 1566 463
enwiki 20170405 1433 539
enwiki 20170406 1969 517
enwiki 20170407 1433 459
enwiki 20170408 1599 357
enwiki 20170409 1257 419
enwiki 20170410 1683 499
enwiki 20170411 1852 601
enwiki 20170412 1414 451
enwiki 20170413 1786 432
enwiki 20170414 1624 356
enwiki 20170415 1468 378
enwiki 20170416 1999 302
enwiki 20170417 1600 503
enwiki 20170418 1569 693
enwiki 20170419 1560 629
enwiki 20170420 1995 614
enwiki 20170421 1868 702
enwiki 20170422 1493 416
enwiki 20170423 1514 455
enwiki 20170424 1719 522
enwiki 20170425 1543 566
enwiki 20170426 1679 670
enwiki 20170427 1452 482
enwiki 20170428 1529 418
enwiki 20170429 1522 504
enwiki 20170430 1928 387

Thanks @Milimetric for swooping in and doing the hard work :)

I tweaked your query to count non-autoconfirmed vs. total creations, rather than just autoconfirmed, and also filtered out the creation of redirects (which are almost never deleted and would therefore skew the results).

It's currently running.

Thank you everyone for the hard work!

@Milimetric - reading your query, it looks like the data is <date>, <pages
created by autoconfirmed users>, <pages created that were deleted>

Did I read this right?

@Milimetric - reading your query, it looks like the data is <date>, <pages
created by autoconfirmed users>, <pages created that were deleted>

I think that's right—it's <date>, <pages created by autoconfirmed users>, <pages deleted by autoconfirmed users>.

I ran a tweaked version to get the percentage of creations by non-autoconfirmed users, but I've somehow introduced a bug. I'm working on figuring that out.

In T149021#3260579, @Neil_P._Quinn_WMF wrote:

I ran a tweaked version to get the percentage of creations by non-autoconfirmed users, but I've somehow introduced a bug. I'm working on figuring that out.

The bug turned out to be that page creation events don't have an event_comment, even though you might expect that to be the edit comment from that page creation.

I did another round of queries and got a number of non-autoconfirmed creations that was way too low (~4%). I finally figured out that I was connecting the two autoconfirmed conditions with or instead of and.

A query with that corrected is now running; it looks like it'll take an hour or two but I'll post it when done.

A quick review of Dan's data shows that articles created by autoconfirmed users are definitely deleted at about a 30-40% rate.

There is a definite increase in the articles created and a decrease in the articles deleted (with a drop in the percent as well) from Jan to May of 2017.

I made an over-busy chart which is hopefully somewhat illustrative:

autoconfirmed-user-deletes.png (635×911 px, 131 KB)

In T149021#3260830, @Neil_P._Quinn_WMF wrote:

A query with that corrected is now running; it looks like it'll take an hour or two but I'll post it when done.

This is done. However, the results say that 7–10% of article creations are by non-autoconfirmed users. This seems off to me, since my previous analysis found that 13% of non-deleted articles were created by non auto-confirmed users and I expected that including deleted articles would make that percentage go up, not down.

It's possible that there's some error in my query or (less likely) in the underlying data. Unfortunately, I'm totally out of time to work on this (since I'm flying to Seoul tomorrow). I've uploaded the query to a new section headed "Data Lake" in my notebook. There are no results there yet (I accidentally deleted my previous set while trying to copy them out of my screen session).

I'll push the notebook again when it finishes—should be in about two hours.

A quick review of Dan's data shows that articles created by autoconfirmed users are definitely deleted at about a 30-40% rate.

That's not the correct interpretation. Dan's data gives the number of articles deleted by autoconfirmed users, which is just the total number of deletions since nobody who's not autoconfirmed would have deletion rights. Those deletions include pages created by both autoconfirmed and non-autoconfirmed users.

In T149021#3261215, @Neil_P._Quinn_WMF wrote:

I'll push the notebook again when it finishes—should be in about two hours.

Done. I have to leave this here, unfortunately. I hope someone finds an error in my changes to Data Lake query, because that would definitely be the simplest solution to the conundrum.

The only other idea I have is I'm not fully filtering out the creation of redirects (which are often created by autoconfirmed users but aren't actually articles that require any meaningful patrolling).

In T149021#3261215, @Neil_P._Quinn_WMF wrote:

This is done. However, the results say that 7–10% of article creations are by non-autoconfirmed users. This seems off to me, since my previous analysis found that 13% of non-deleted articles were created by non auto-confirmed users and I expected that including deleted articles would make that percentage go up, not down.

7-10% is in line with @Halfak's [https://meta.wikimedia.org/wiki/Research:Wikipedia_article_creation#How_many_articles_are_created_by_newcomers.3F findings from 2013]. He calculated that editors with less than 1-day tenure created about 8% of all articles, including deleted articles, and editors with less than a week's tenure created about 3%. Assuming that all the editors in the 1-day group and some of the ones in the 1-week group were not autoconfirmed, assume the overall non-AC creation rate back then was around 10%.

I ran @Neil_P._Quinn_WMF's query after adding just "and not page_is_redirect_latest" to better filter out redirect pages (thinking maybe the comments weren't always accurate). The results are:

mediawiki_history.wiki_db day creations non_autoconfirmed_creati
ons
enwiki 20170101 712 38
enwiki 20170102 851 61
enwiki 20170103 708 77
enwiki 20170104 851 66
enwiki 20170105 668 68
enwiki 20170106 656 55
enwiki 20170107 674 57
enwiki 20170108 958 56
enwiki 20170109 730 74
enwiki 20170110 616 64
enwiki 20170111 650 64
enwiki 20170112 652 65
enwiki 20170113 876 59
enwiki 20170114 672 51
enwiki 20170115 733 50
enwiki 20170116 904 45
enwiki 20170117 641 67
enwiki 20170118 694 66
enwiki 20170119 600 45
enwiki 20170120 681 76
enwiki 20170121 766 41
enwiki 20170122 726 51
enwiki 20170123 650 67
enwiki 20170124 775 67
enwiki 20170125 948 66
enwiki 20170126 694 68
enwiki 20170127 776 55
enwiki 20170128 1255 54
enwiki 20170129 1017 59
enwiki 20170130 1146 68
enwiki 20170131 675 54
enwiki 20170201 637 62
enwiki 20170202 614 59
enwiki 20170203 668 63
enwiki 20170204 716 43
enwiki 20170205 783 50
enwiki 20170206 581 56
enwiki 20170207 608 70
enwiki 20170208 648 64
enwiki 20170209 708 45
enwiki 20170210 695 61
enwiki 20170211 664 44
enwiki 20170212 722 51
enwiki 20170213 739 54
enwiki 20170214 639 54
enwiki 20170215 662 63
enwiki 20170216 673 54
enwiki 20170217 617 55
enwiki 20170218 721 46
enwiki 20170219 749 57
enwiki 20170220 773 58
enwiki 20170221 732 80
enwiki 20170222 676 75
enwiki 20170223 1031 71
enwiki 20170224 705 64
enwiki 20170225 849 54
enwiki 20170226 963 59
enwiki 20170227 825 78
enwiki 20170228 741 74
enwiki 20170301 656 74
enwiki 20170302 722 86
enwiki 20170303 758 76
enwiki 20170304 879 117
enwiki 20170305 880 61
enwiki 20170306 925 78
enwiki 20170307 798 76
enwiki 20170308 738 104
enwiki 20170309 794 73
enwiki 20170310 810 77
enwiki 20170311 750 87
enwiki 20170312 1072 66
enwiki 20170313 849 98
enwiki 20170314 1224 70
enwiki 20170315 851 51
enwiki 20170316 925 59
enwiki 20170317 697 72
enwiki 20170318 862 74
enwiki 20170319 910 59
enwiki 20170320 1004 68
enwiki 20170321 877 62
enwiki 20170322 868 82
enwiki 20170323 781 75
enwiki 20170324 676 70
enwiki 20170325 656 58
enwiki 20170326 816 65
enwiki 20170327 1001 71
enwiki 20170328 763 101
enwiki 20170329 1104 95
enwiki 20170330 1018 97
enwiki 20170331 662 69
enwiki 20170401 535 63
enwiki 20170402 606 71
enwiki 20170403 679 77
enwiki 20170404 731 79
enwiki 20170405 626 73
enwiki 20170406 855 68
enwiki 20170407 674 79
enwiki 20170408 974 56
enwiki 20170409 722 70
enwiki 20170410 820 80
enwiki 20170411 840 85
enwiki 20170412 829 66
enwiki 20170413 704 63
enwiki 20170414 860 53
enwiki 20170415 631 51
enwiki 20170416 605 64
enwiki 20170417 788 71
enwiki 20170418 787 60
enwiki 20170419 701 73
enwiki 20170420 765 78
enwiki 20170421 819 99
enwiki 20170422 698 69
enwiki 20170423 665 63
enwiki 20170424 766 88
enwiki 20170425 727 103
enwiki 20170426 736 120
enwiki 20170427 752 101
enwiki 20170428 658 96
enwiki 20170429 803 85
enwiki 20170430 834 88

So, that seems closer to what Neil was looking for.

For anyone else who's staring at the long list of numbers and wondering what the bottom line is, it appears that non-autoconfirmed editors created an average (mean) of 68.175 total article creations per day (from January through April 2017), out of an average of 771.75 new articles per day. This means that about 8.8% of all new article creations come from new accounts.

Thanks Sherry. So, 9%. That raises the question: if this flow were cut off, would that even have the desired effect? I.e., would it close the gap between articles created/day and articles reviewed/day? Does anyone know by how much the backlog grows each day? Is that gap, or shortfall, greater than 68/day?

Can someone please look at Neil's query and clarify this key point: how are we defining new "articles" here? (@kaldari thinks the figure of 771 new articles per day sounds low, so we need to be clear what we're counting.)

Can someone please look at Neil's query and clarify this key point: how are we defining new "articles" here? (@kaldari thinks the figure of 771 new articles per day sounds low, so we need to be clear what we're counting.)

I haven't looked at the definition used in the query (BTW how about posting the latest version here?), but one possible sanity check is to compare with the increase in the number of undeleted articles during the same day according to [[en:Special:Statistics]], which EmausBot is recording twice daily on Meta-wiki.

E.g. it grew by 562 on April 30 (from 5,395,718 at 00:00, 30 April 2017 (UTC) to 5,396,280 at 00:00, 1 May 2017 (UTC)). Dan's result above gives 834 creations for the same day, which would correspond to a survival rate of 67% if we were to assume that all deletions happen on the day of the creation (or that the number of later deletions is matched by deletions on April 30 of articles created earlier).

(When comparing numbers from different sources one needs to keep in mind that the details of article counting are notoriously messy, but hopefully they won't impact the above comparison too much once redirects are excluded properly.)

PS: I'm wondering though whether/how the query took into account moves from other namespace into the article namespace (in particular, the "publication" of draft pages from the Draft: or User: namespaces). How should they be counted?

@Milimetric: Can you explain what page_is_redirect_latest is? We don't want to exclude pages based on their current state being a redirect, but we do want to exclude pages based on their initial state being a redirect. It could be the case that a large proportion of articles created by non-auto-confirmed editors are converted into redirects, so removing based on current article state could skew the percentage.

@Milimetric: Also can you post the actual query you are using?

@kaldari: my query:

select
mediawiki_history.wiki_db,
substr(event_timestamp, 0, 8) as day,
count(*) as creations,
sum(
        if(
            ((unix_timestamp(event_timestamp, 'yyyyMMddHHmmss') -
            unix_timestamp(coalesce(event_user_creation_timestamp, '20050101000000'), 'yyyyMMddHHmmss'))
            < 345600) or
            (user_edits_by_hour.edit_count < 10),
        1, 0)
    )
  as non_autoconfirmed_creations

from wmf.mediawiki_history
       inner join
   (select wiki_db,
           event_user_id,
           hour,
           sum(edit_count) over (partition by wiki_db, event_user_id order by hour) as edit_count
      from (select wiki_db,
                   event_user_id,
                   substr(event_timestamp, 0, 10) as hour,
                   count(*) as edit_count
              from wmf.mediawiki_history
             where event_entity = 'revision'
               and event_type = 'create'
               and wiki_db = 'enwiki'
               and snapshot = '2017-04'
             group by wiki_db, event_user_id, substr(event_timestamp, 0, 10)
           ) user_edits_per_hour
   ) user_edits_by_hour    ON mediawiki_history.wiki_db = user_edits_by_hour.wiki_db
                          AND mediawiki_history.event_user_id = user_edits_by_hour.event_user_id
                          AND substr(event_timestamp, 0, 10) = user_edits_by_hour.hour
where event_entity = 'revision'
  and event_type = 'create'
  and page_namespace = 0
  and mediawiki_history.wiki_db = 'enwiki'
  and revision_parent_id = 0
  and event_comment not regexp "[Rr]edir"
  and page_is_redirect_latest = 0
  and snapshot = '2017-04'
  and event_timestamp > '20170101000000'

 group by mediawiki_history.wiki_db, substr(event_timestamp, 0, 8)
 order by mediawiki_history.wiki_db, day
 limit 100000

In milimetric's analysis, an article is defined as a main namespace page that isn't a redirect (wasn't created with a comment including "[Rr]edir" and isn't currently a redirect). Milimetric's analysis does include deleted pages.

According to this query, which is CRAP, the average number of non-redirect articles created a day that go into the backlog (users who are not sysops or autopatrolled) is around 893:

SELECT COUNT(DISTINCT(rev_page))
FROM revision
JOIN page ON rev_page = page_id
JOIN user_groups ON ug_user = rev_user
WHERE (rev_parent_id = 0 OR rev_parent_id IS NULL)
AND page_namespace = 0
AND ug_group != 'sysop'
AND ug_group != 'autoreviewer'
AND rev_timestamp > '20170101000000'
AND rev_timestamp <= '20170501000000'
AND rev_comment NOT REGEXP "[Rr]edir";

This does NOT count pages that are currently redirects – which I don't think we want to count. E.g. One common scenario is a new article is created, is deemed by a patroller to be non-notable, and is redirected to the primary subject.

One could also create a redirect and provide a custom edit summary – those are still counted as articles here.

@Tbayer or @Milimetric: Could you re-run the query in T149021#3281771 without "page_is_redirect_latest = 0" and post the daily approximations for comparison purposes (and also because we aren't using this criteria in related queries as it may be common for newbie articles to later be turned into redirects).

If this flow were cut off, would that even have the desired effect? I.e., would it close the gap between articles created/day and articles reviewed/day? Does anyone know by how much the backlog grows each day? Is that gap, or shortfall, greater than 68/day?

I believe that the backlog is currently growing by approximately the number of articles created by brand-new editors, which suggests that actually cutting it off (and not, for example, transferring it to other backlogged review processes) could theoretically stabilize the backlog.

However, there's no reason to believe that reducing the number of articles created would actually affect the backlog, because humans are involved. When a backlog has been at a level that seems (to you, in your subjective opinion) to be a crisis, then you (i.e., the individual reviewer) increase your effort; when it declies to a familiar level, then you will likely reduce your effort. Perhaps you will find something that seemsmore urgent or more fun to you. You might only reduce your page reviewing a little bit; you might decide to take a day off. You might decide to review 9 articles each day instead of 10. But that is enough to make the difference. The current backlog-growth rate suggests that stabilization needs only about two extra volunteer-hours each day. Adding a single part-time editor could single-handedly stabilize the backlog, if everyone else remained at their current level of productivity. Spread between the 1000+ editors who are able to patrol pages, stabilization literally only requires each qualified person to review just one or two more articles per month.

But the math goes both directions: even if you cut off article creation entirely, the loss of stabilization literally only requires each qualified person to review one or two fewer articles per month. And it is very difficult to keep volunteers engaged in tedious, unrewarding work under relatively unpleasant circumstances, especially when there is no immediate crisis at hand.

I recommend this report to anyone who is interested in the problem of backlogs: https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2016-11-26/Special_report

@Tbayer or @Milimetric: Could you re-run the query in T149021#3281771 without "page_is_redirect_latest = 0" and post the daily approximations for comparison purposes (and also because we aren't using this criteria in related queries as it may be common for newbie articles to later be turned into redirects).

Result:

day	creations	non_autoconfirmed_creations
20170101	1021	43
20170102	1140	71
20170103	1063	87
20170104	1228	74
20170105	998	75
20170106	986	64
20170107	945	68
20170108	1477	65
20170109	1170	78
20170110	943	71
20170111	976	74
20170112	996	71
20170113	1203	72
20170114	1196	58
20170115	1371	55
20170116	1457	58
20170117	1089	75
20170118	1078	72
20170119	1108	51
20170120	1220	87
20170121	1107	55
20170122	1066	54
20170123	996	78
20170124	1140	85
20170125	1282	80
20170126	1051	78
20170127	1167	62
20170128	1688	62
20170129	1416	74
20170130	1553	77
20170131	1014	60
20170201	1017	65
20170202	886	68
20170203	1002	70
20170204	1030	50
20170205	1213	63
20170206	998	65
20170207	883	79
20170208	990	78
20170209	1090	55
20170210	1090	70
20170211	1017	49
20170212	1103	53
20170213	1016	58
20170214	942	61
20170215	1111	67
20170216	1118	67
20170217	955	63
20170218	1088	55
20170219	1077	71
20170220	1232	70
20170221	1111	93
20170222	1057	82
20170223	1478	75
20170224	1079	72
20170225	1449	70
20170226	1753	66
20170227	1279	89
20170228	1015	84
20170301	927	86
20170302	2099	98
20170303	1095	83
20170304	1265	124
20170305	1299	80
20170306	1233	88
20170307	1189	88
20170308	1067	110
20170309	1112	78
20170310	1165	89
20170311	1194	96
20170312	1419	78
20170313	1227	106
20170314	1612	80
20170315	1197	64
20170316	1613	77
20170317	1605	86
20170318	1209	83
20170319	1342	65
20170320	1456	80
20170321	1395	70
20170322	1228	99
20170323	1178	88
20170324	1010	85
20170325	1054	69
20170326	1115	77
20170327	1346	79
20170328	1115	111
20170329	1529	114
20170330	1490	104
20170331	979	76
20170401	1108	78
20170402	1205	76
20170403	1013	86
20170404	1038	83
20170405	968	77
20170406	1282	76
20170407	1175	85
20170408	1306	63
20170409	1016	76
20170410	1133	93
20170411	1477	98
20170412	1166	76
20170413	1096	71
20170414	1164	58
20170415	990	65
20170416	1327	72
20170417	1292	87
20170418	1092	68
20170419	1235	83
20170420	1201	88
20170421	1172	110
20170422	1028	78
20170423	953	73
20170424	1246	99
20170425	1107	108
20170426	1113	125
20170427	1166	111
20170428	1168	104
20170429	1091	90
20170430	1330	94
120 rows selected (9283.028 seconds)

select -- cf. https://phabricator.wikimedia.org/T149021#3287167
mediawiki_history.wiki_db,
substr(event_timestamp, 0, 8) as day,
count(*) as creations,
sum(
        if(
            ((unix_timestamp(event_timestamp, 'yyyyMMddHHmmss') -
            unix_timestamp(coalesce(event_user_creation_timestamp, '20050101000000'), 'yyyyMMddHHmmss'))
            < 345600) or
            (user_edits_by_hour.edit_count < 10),
        1, 0)
    )
  as non_autoconfirmed_creations

from wmf.mediawiki_history
       inner join
   (select wiki_db,
           event_user_id,
           hour,
           sum(edit_count) over (partition by wiki_db, event_user_id order by hour) as edit_count
      from (select wiki_db,
                   event_user_id,
                   substr(event_timestamp, 0, 10) as hour,
                   count(*) as edit_count
              from wmf.mediawiki_history
             where event_entity = 'revision'
               and event_type = 'create'
               and wiki_db = 'enwiki'
               and snapshot = '2017-04'
             group by wiki_db, event_user_id, substr(event_timestamp, 0, 10)
           ) user_edits_per_hour
   ) user_edits_by_hour    ON mediawiki_history.wiki_db = user_edits_by_hour.wiki_db
                          AND mediawiki_history.event_user_id = user_edits_by_hour.event_user_id
                          AND substr(event_timestamp, 0, 10) = user_edits_by_hour.hour
where event_entity = 'revision'
  and event_type = 'create'
  and page_namespace = 0
  and mediawiki_history.wiki_db = 'enwiki'
  and revision_parent_id = 0
  and event_comment not regexp '[Rr]edir'
  and snapshot = '2017-04'
  and event_timestamp > '20170101000000'

 group by mediawiki_history.wiki_db, substr(event_timestamp, 0, 8)
 order by mediawiki_history.wiki_db, day
 limit 100000

(I removed the wiki_db column since it's all enwiki. Disclaimer: I still haven't checked the query myself, except for the user_edits_per_day subquery where I spot-checked some results that looked plausible.)

Tilman's query gives us 1180.38 non-redirect articles created per day with 77.53 of those coming from non-autoconfirmed users (or 6.57%). To review, milimetric's query gave us 771.75, 68.175, and 8.8% for those same stats. Unfortunately, we don't have any way of reliably determining (via query) whether or not an article was a redirect at the time of creation. The difference between the two queries is that they use different methods to guess which articles were redirects (at the time of creation) and should be excluded from analysis. The actual numbers are going to be somewhere in between Tilman's and milimetric's numbers. In other words, they provide a high and low bound, but not an exact answer. For the purposes of reporting, it's probably better to go with Tilman's numbers for now since they are not skewed by changes to the backlog pages over time.

However, there's no reason to believe that reducing the number of articles created would actually affect the backlog

The simple alternative is to abolish the specialised backlog and let the articles follow the usual wiki process.

The simple alternative is to abolish the specialised backlog and let the articles follow the usual wiki process.

I like that idea, although I imagine it would be a hard sell to the enwiki New Page Patrollers.

. The actual numbers are going to be somewhere in between Tilman's and milimetric's numbers. In other words, they provide a high and low bound, but not an exact answer.

What prevents this data to be "actionable" if these are good bounds ? Is that not enough to answer the question posed?

@Nuria: Whether or not the data is actionable is up to the community. We are simply trying to present the most accurate data possible so that the community can make an informed decision about whether or not to restrict new article creation to auto-confirmed users (and to help the WMF decide if such an action would be reasonable).

PS: I'm wondering though whether/how the query took into account moves from other namespace into the article namespace (in particular, the "publication" of draft pages from the Draft: or User: namespaces). How should they be counted?

While we're figuring this out, a note that the second case is tracked with the "de-userfying" edit tag. Glancing at that link for the last 30 days, there are around 10 surviving articles per day created that way, so it shouldn't affect the overall result too much, but would still be good to include for accuracy (if we decide it's in scope).

In milimetric's analysis, an article is defined as a main namespace page that isn't a redirect (wasn't created with a comment including "[Rr]edir" and isn't currently a redirect). Milimetric's analysis does include deleted pages.

Another case that we may or may not want to count is a redirect becoming an article (IIRC a way to bypass NPP / Page Curation).

We are simply trying to present the most accurate data possible so that the community can make an informed decision about whether or not to restrict new article creation to auto-confirmed users (and to help the WMF decide if such an action would be >reasonable).

Given that we have not calculated this data before and we do not how much it can oscillate over time an interval of values seems very adequate as a measure. Can @Quiddity chime in?

@Tbayer or @Milimetric: Could you re-run the query in T149021#3281771 without "page_is_redirect_latest = 0" and post the daily approximations for comparison purposes (and also because we aren't using this criteria in related queries as it may be common for newbie articles to later be turned into redirects).

...

Disclaimer: I still haven't checked the query myself, except for the user_edits_per_day subquery where I spot-checked some results that looked plausible.)

In milimetric's analysis, an article is defined as a main namespace page that isn't a redirect (wasn't created with a comment including "[Rr]edir" and isn't currently a redirect). Milimetric's analysis does include deleted pages.

I have now been checking this query a bit further (in the context of T166269). It looks like the condition "wasn't created with a comment including '[Rr]edir'" may leave out a bit more than expected (as @MusikAnimal already pointed out above, redirects can also be created with a custom edit summary). In the first sample I happened to check - all page creations from April 1 between 0h and 1h that are counted in the query - no less than 38% (18 out of 47) were created as redirect. This is probably an outlier, as they are all by one editor who used AWB to mass create them. But I'm going to do some more checks, and try to add at least the edit comments for these AWB redirect creations to the WHERE clause.

SELECT CONCAT('https://en.wikipedia.org/wiki/',page_title), event_comment
from wmf.mediawiki_history
where event_entity = 'revision'
  and event_type = 'create'
  and page_namespace = 0
  and mediawiki_history.wiki_db = 'enwiki'
  and revision_parent_id = 0
  and event_comment not regexp '[Rr]edir'
  and snapshot = '2017-05'
  and event_timestamp LIKE '2017040100%';

_c0	event_comment
https://en.wikipedia.org/wiki/Joseph_Howard_Mathews	[[WP:AES|]]Created page with ''''Joseph Howard Mathews''' (1880 – April 15, 1970) was an American [[physical chemistry|physical chemist]], university professor, and expert on firearm identi...'
https://en.wikipedia.org/wiki/Earl_of_Eldon_(1830_ship)	[[WP:AES|]]Created page with '{| {{Infobox ship begin}} {{Infobox ship image | Ship image = | Ship caption = }} {{Infobox ship career | Ship country= | Ship flag={{shipboxflag|United Kingdom|...'
https://en.wikipedia.org/wiki/National_Union_of_Students_(Papua_New_Guinea)	New article.
https://en.wikipedia.org/wiki/Jenkinsville	create disambiguation page
https://en.wikipedia.org/wiki/Samuel_Toms	a partner
https://en.wikipedia.org/wiki/Tarentola_gigas	new article about an endangered animal (courtesy fr Wikipedia)
https://en.wikipedia.org/wiki/Vae	[[WP:AES|]]Created page with '{{Wiktionary|vae}} '''Vae''', '''VAE''' or '''Vaé''' may refer to *[[Vae (name)]] *[[Vae victis]], Latin for "woe to the vanquished" *''[[Vae Victis (album)]]''...'
https://en.wikipedia.org/wiki/Forrest,_Nicholas	/* top */{{R from sort name}} using [[Project:AWB|AWB]]
https://en.wikipedia.org/wiki/Ferraby,_Nicholas	/* top */{{R from sort name}} using [[Project:AWB|AWB]]
https://en.wikipedia.org/wiki/Folker,_Nicholas	/* top */{{R from sort name}} using [[Project:AWB|AWB]]
https://en.wikipedia.org/wiki/Satélite_Tecnológico	[[WP:AES|]]Created page with '{{Infobox spaceflight | name                  = SATEC | image                 =  | image_caption         =  | insignia              =  | mission_type          =...'
https://en.wikipedia.org/wiki/Alessandro_Pulisci	Created page
https://en.wikipedia.org/wiki/We_Believe_(Brian_&_Jenn_Johnson_album)	[[WP:AES|]]Created page with '{{Infobox album | Name       = We Believe | Type       = live | Artist     = [[Brian & Jenn Johnson]] | Cover      = We Believe by Brian & Jenn Johnson.jpg | Alt...'
https://en.wikipedia.org/wiki/Cary,_Indiana	Created new article for Cary, Indiana.
https://en.wikipedia.org/wiki/Tom_Keele	[[WP:AES|]]Created page with '{{Infobox college coach | name = Tom Keele | image =  | alt =  | caption =  | sport = [[American  football|Football]] | birth_date = c. 1933 | birth_place =  | d...'
https://en.wikipedia.org/wiki/2021_Summer_Universiade	[[WP:AES|]]Created page with 'The 2021 Summer Universiade Is The Multi-Sport Event Known As The XXXI Summer Universiade, The Host City Has Yet To Be Announced  Candidate Cities  Bucharest, Ro...'
https://en.wikipedia.org/wiki/List_of_awards_and_nominations_received_by_The_Crown	[[WP:AES|]]Created page with '{{DISPLAYTITLE:List of awards and nominations received by ''The Crown''}} {| class="infobox" style="width: 25em; text-align: left; font-size: 90%; vertical-align...'
https://en.wikipedia.org/wiki/Fortescue,_Nicholas	/* top */{{R from sort name}} using [[Project:AWB|AWB]]
https://en.wikipedia.org/wiki/Fortugno,_Nicholas	/* top */{{R from sort name}} using [[Project:AWB|AWB]]
https://en.wikipedia.org/wiki/John_Swift_(trade_unionist)	[[WP:AES|]]Created page with ''''John Swift''' (1896 &ndash; 1990) was an [[Irish people|Irish]] [[trade union]] leader.  Born in [[Dundalk]], Swift was educated at a Christian Brothers Sch...'
https://en.wikipedia.org/wiki/Fernicola,_Nicholas	/* top */{{R from sort name}} using [[Project:AWB|AWB]]
https://en.wikipedia.org/wiki/Spring_Fling!	[[WP:AES|]]Created page with '{{Infobox film | name           = Spring Fling! | image          =  | caption        =  | writer         = [[Bart Baker]] and Carol Starr Schneider (Story)<br />...'
https://en.wikipedia.org/wiki/Fletcher,_Nicholas	/* top */{{R from sort name}} using [[Project:AWB|AWB]]
https://en.wikipedia.org/wiki/National_Union_of_Students_of_Papua_New_Guinea	#REDIRECT [[National Union of Students (Papua New Guinea)]]
https://en.wikipedia.org/wiki/Flair,_Nicholas	/* top */{{R from sort name}} using [[Project:AWB|AWB]]
https://en.wikipedia.org/wiki/Foran,_Nicholas	/* top */{{R from sort name}} using [[Project:AWB|AWB]]
https://en.wikipedia.org/wiki/Firth,_Nicholas	/* top */{{R from sort name}} using [[Project:AWB|AWB]]
https://en.wikipedia.org/wiki/Forell,_Nicholas	/* top */{{R from sort name}} using [[Project:AWB|AWB]]
https://en.wikipedia.org/wiki/Fisk,_Nicholas	/* top */{{R from sort name}} using [[Project:AWB|AWB]]
https://en.wikipedia.org/wiki/Tautvydas_Eliošius	[[WP:AES|]]Created page with '{{Infobox football biography | name                = Tautvydas Eliošius | fullname            = Tautvydas Eliošius | image               =  | image_size...'
https://en.wikipedia.org/wiki/Jenkinsville,_Indiana	Created new article for Jenkinsville, Indiana.
https://en.wikipedia.org/wiki/Seoige_(surname)	[[WP:AES|]]Created page with ''''Seoige''' (Irish pronunciation: [ˈʃoːɟə]) is an Irish surname that may refer to *[[Gráinne Seoige]] (born 1973), Irish journalist and news anchor  *Ma...'
https://en.wikipedia.org/wiki/Westland,_Putnam_County,_Indiana	Created new article for Westland, Putnam County, Indiana.
https://en.wikipedia.org/wiki/Ferraro,_Nicholas	/* top */{{R from sort name}} using [[Project:AWB|AWB]]
https://en.wikipedia.org/wiki/Fitzherbert,_Nicholas	/* top */{{R from sort name}} using [[Project:AWB|AWB]]
https://en.wikipedia.org/wiki/Overseas_trained_doctors_in_Australia	Created Overseas Trained Doctors in Australia
https://en.wikipedia.org/wiki/Fox,_Nicholas	/* top */{{R from sort name}} using [[Project:AWB|AWB]]
https://en.wikipedia.org/wiki/Jim_Horan	[[WP:AES|]]Created page with ''''Jim Horan'''   Jim Horan (born October 29, 1952 in Spokane, WA) is an American author, speaker, and business consultant. He is best known for ''The One Page B...'
https://en.wikipedia.org/wiki/Ford,_Nicholas	/* top */{{R from sort name}} using [[Project:AWB|AWB]]
https://en.wikipedia.org/wiki/Emerson,_Lake_&_Powell_discography	[[WP:AES|]]Created page with '{{Infobox artist discography |Artist = Emerson, Lake & Powell |Image = |Image size =  |Caption = |Studio = 1 |Singles = 1 |Music videos = 1 |References = }} '''[...'
https://en.wikipedia.org/wiki/Cradick_Corner,_Indiana	Created new article for Cradick Corner, Indiana.
https://en.wikipedia.org/wiki/Field,_Nicholas	/* top */{{R from sort name}} using [[Project:AWB|AWB]]
https://en.wikipedia.org/wiki/Vae_(name)	[[WP:AES|←]]Created page with ''''Vae''', '''DeVae''' or '''Vaé''' may refer to the following people: *[[Vae Kololo]], Samoan rugby football player *[[John DeVae]] (born 1983), American assoc...'
https://en.wikipedia.org/wiki/House_of_Marcoartu	[[WP:AES|←]]Created page with 'The '''House of Marcoartu''' ({{lang-es|Casa de Marcoartu}}) was an ancient [[Spain|Spanish]] aristocratic family, descended from the Duchy of Gascony|House of...'
https://en.wikipedia.org/wiki/The_Wahoo_Bobcat	created page
https://en.wikipedia.org/wiki/Once_on_the_Rhine	[[WP:AES|←]]Created page with '{{Infobox film | name = Once on the Rhine | image = | image_size = | caption = | director = [[Helmut Weiss]] | producer =  | writer =  [[Richard Billinger]] <br>...'
https://en.wikipedia.org/wiki/Ferrar,_Nicholas	/* top */{{R from sort name}} using [[Project:AWB|AWB]]
47 rows selected (68.811 seconds)

(work log)

...

I have now been checking this query a bit further (in the context of T166269). It looks like the condition "wasn't created with a comment including '[Rr]edir'" may leave out a bit more than expected (as @MusikAnimal already pointed out above, redirects can also be created with a custom edit summary).

To find out if there are other edit summaries commonly used in creation of redirects that are missed by the condition used above (event_comment not regexp '[Rr]edir'), I looked at the most frequent summaries for pages that are a redirect today (see below).

It appears that besides the "{{R from ..." AWB edit summaries, the biggest omission was simply "#REDIRECT ..." (not caught because of capitalization).
I'm going to change the condition as follows to catch both of these cases too:

LCASE(event_comment) NOT REGEXP 'redir' AND event_comment NOT REGEXP '\\{\\{R from '

If anyone know of another edit summary variant that should be detected as well, let me know. (I'm leaving out "synonym" for now, which appears in the results below, but would probably cause too many false positives.)

Ideally, instead of these approximations via the edit summary, the property of being a redirect at the time of the edit would be a field in the table itself. I just filed T167396: Add 'page_is_redirect' field to the mediawiki_history Data Lake tables, but that won't happen in time for the present analysis.

SELECT event_comment, COUNT(*) AS pages
from wmf.mediawiki_history
where event_entity = 'revision'
  and event_type = 'create'
  and page_namespace = 0
  and mediawiki_history.wiki_db = 'enwiki'
  and revision_parent_id = 0
  and snapshot = '2017-05'
  and event_timestamp > '20170101000000'
  AND page_is_redirect_latest
GROUP BY event_comment
ORDER BY pages DESC LIMIT 100

event_comment	pages
/* top */{{R from sort name}} using [[Project:AWB|AWB]]	9348
redirect	3665
Bot: creating redirect to disambiguation page	2954
create redirect	1289
Created redirect	967
created redirect	758
rd	675
created	626
redir	617
creating redirect	615
Created redirect.	605
#REDIRECT [[Anglican Diocese of Melbourne#List of assistant bishops]]	563
#REDIRECT	535
re	424
#REDIRECT [[Anglican Diocese of Perth#Assistant bishops]]	403
Redirect	388
synonym	376
New redirect	372
create as redirect	363
#REDIRECT [[Anglican Church of Australia#Indigenous ministry]]	301
Create redirect	268
create	229
added redirect	218
/* top */redirect using [[Project:AWB|AWB]]	206
Created page	193
#REDIRECT [[Anglican Diocese of North Queensland#Bishops]]	191
update/add upon naming per M.P.C. batch 13 Apr 2017, see [[User:Rfassbind/Minor planet redirects of non-exsitent articles update11|list]]	185
new redirect	179
Creating redirect per request - Using [[User:PhantomTech/scripts/AFCRHS.js|AFC/R HS]]	170
Just for redirection and search purposes.	164
#REDIRECT [[Province of the Anglican Church of the Congo#Dioceses]]	149
[[WP:AES|]]Redirected page to [[List of Grange Hill characters]]	146
#REDIRECT [[Anglican Diocese of Brisbane#Assistant bishops]]	142
#REDIRECT [[List of parks in Portland, Oregon]]	139
[[WP:AES|]]Redirected page to [[List of roads in Kuala Lumpur]]	127
Created as redirect	119
red	117
Creating redirect	116
start	115
Creating and redirecting	113
{{R to disambiguation page}} per MOSDAB	111
Red	100
#REDIRECT [[Modern United States commemorative coins]]	97
Created redirect page	95
add	93
set up redirect	92
Created page as redirect.	89
[[WP:AES|]]Redirected page to [[United States elections, 2020]]	89
redirected to main athletics page	89
new	87
creating #REDIRECT to [[List of minor planets#Main index|List]], see [[User:Rfassbind/Minor planet redirects of non-exsitent articles|summary]]	86
Redirect created, with categories	86
create MP#R as per M.P.C. 12 Mar 2017 ([[User:Rfassbind/Minor planet redirects of non-exsitent articles update10|list]])	84
#REDIRECT [[Anglican Diocese of Adelaide#Bishop for Mission and Evangelism]]	83
#REDIRECT [[Church of the Province of Central Africa]]	81
Create	77
#REDIRECT [[Anglican Diocese of Adelaide#Bishop for Aboriginal people]]	73
#REDIRECT [[Anglican Episcopal Church of Brazil#Dioceses and mission districts]]	72
red1r	71
[[WP:AES|]]Redirected page to [[Zographus oculator]]	68
[[WP:AES|]]Redirected page to [[Tabernaemontana pandacaqui]]	68
historical synonym	68
Robot: Creating redirect to disambiguation page per [[WP:INTDABLINK]]	67
#REDIRECT [[Bishop of Berwick]]	66
[[WP:AES|]]Redirected page to [[Supercapacitor]]	64
[[WP:AES|]]Redirected page to [[Trưng Sisters]]	63
#REDIRECT [[List of Weeds episodes]]	61
redirect page	58
redirect from alternate name for battle to article under main name	57
Created page.	56
[[WP:AES|]]Redirected page to [[Multiuser DOS#REAL/32]]	55
redirecting	55
redirect created	54
Created redirect page.	54
Created Redirect page	52
fixed a broken link	51
[[WP:AES|]]Redirected page to [[WannaCry ransomware attack]]	51
created redirect page	51
[[WP:AES|]]Redirected page to [[Common bulbul]]	51
alt name	49
[[WP:AES|]]Redirected page to [[United States elections, 2019]]	49
#REDIRECT [[Diocese of Banks and Torres]]	48
redirected	48
[[WP:AES|]]Redirected page to [[Nuits de Fourvière]]	46
[[WP:AES|]]Redirected page to [[Grey shrikethrush]]	46
[[WP:AES|]]Redirected page to [[Guillotière Cemetery]]	46
Redirect a notable verse	46
R	46
[[WP:AES|]]Redirected page to [[ITU T.50]]	45
new redir	44
[[WP:AES|]]Redirected page to [[Sheffield Supertram#Tram stop list]]	43
[[WP:AES|]]Redirected page to [[African paradise flycatcher]]	43
#REDIRECT [[Heartbreak on a Full Moon]]	43
Redirect created	43
add redirect	43
#REDIRECT [[Province of the Anglican Church of Burundi#Structure]]	42
[[WP:AES|]]Redirected page to [[United States elections, 2018]]	42
RD	42
Created	42
[[WP:AES|]]Redirected page to [[Gorillaz]]	41
100 rows selected (130.269 seconds)

I adapted this query for use in gathering some statistics for the ACTRIAL project and noticed that it seemed to fail to pick up deleted articles. In my dataset gathered a week ago there is 730 article creations on 2017-01-01, and 729 of those currently exist in the revision table. What appears to be a key reason for this is that event_comment for those deleted articles is NULL leading any event_comment NOT REGEXP 'foo' to remove that row from the query result.

There are also some older deleted articles that appear to have page_namespace and revision_parent_id both NULL because these are not set in the archive table. My modified query for grabbing article creation events is here: https://github.com/nettrom/actrial/blob/master/sql/creation_history.hql

Which Phabricator project should this task be associated with, so others can actually find this task when looking at the project's workboard?

I have added analytics but likely @kaldari has a better project?

nshahquinn-wmf claimed this task.
nshahquinn-wmf edited projects, added Community-Tech; removed Analytics.