Page MenuHomePhabricator

Create a chart showing percentage of new articles created each month that have not survived to the present
Closed, ResolvedPublic

Description

To facilitate current discussions, we would like to find out how the average quality of new article submissions has changed over time on English Wikipedia. We know that roughly ~25,000 new mainspace pages are created each month on Enwiki, but we don't have a good idea of how many of those actually survive. We suspect that the percentage of new articles that are eventually deleted has increased over the years, but we don't have any actual evidence of this or idea how significant the increase has been. Note that we're mainly interested in the percentage here, not the total numbers, but total numbers would be a nice bonus :) Ideally the chart should include all years from 2001 to the present. We realize there will be a strong bias for survival within the most recent month or two (since these articles may not yet be reviewed), so you can leave off the most recent month or two if that makes more sense.

If possible, we'd like to know the answer within the next week or two, for ongoing discussions. Thanks.

Event Timeline

kaldari created this task.Oct 24 2016, 9:53 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 24 2016, 9:53 PM

If this is too complicated, just a chart comparing articles created per month with articles deleted per month would probably suffice for now.

Neil_P._Quinn_WMF triaged this task as High priority.Oct 27 2016, 10:23 PM
Neil_P._Quinn_WMF removed a project: Analytics.

@kaldari, the first issue I've encountered is that the archive table didn't start storing whether a revision was a new page creation until Mediawiki 1.14 (released early 2009). How much of a problem would it be if the chart only went back that far?

@Neil_P._Quinn_WMF: Only going back to 2009 would be fine with me. Also, have you considered getting a shorter Phabricator username? ;)

Hmm...it seems like the cutoff is not 2009 but 2012.

select max(ar_timestamp)
from archive
where ar_parent_id is null;

max(ar_timestamp)
20120229230012

And according to P4366, that wasn't just a late aberration. In 2012-02 there were more than 100 000 rows without the parent ID. So there's no way we can take this back further than 2012-03 without another technique.

@Neil_P._Quinn_WMF: Was that just for February 2012? If it's just one month of bad data, I'm fine with having a blip in the chart.

@kaldari, no, it was everything up to and including February 2012. But I've found a workaround—about to kick off the query now :)

Neil_P._Quinn_WMF added a comment.EditedNov 7 2016, 6:11 PM

It turns out that that query was too intensive and had to be killed by the DBAs (T150163). Back to the drawing board :/

@Neil_P._Quinn_WMF: Would going with the simpler suggestion at T149049#2747570 help any?

@kaldari, actually, I figured it out! Unfortunately, I have to go home and I haven't figured out how to get the chart to display correctly on Github, so in the meantime, here's an image version showing, for each month, the percentage of non-redirect new articles that still exist today (November 2016).

kaldari added a subscriber: Emijrp.Nov 8 2016, 5:32 AM

@Neil_P._Quinn_WMF: Hmm, something doesn't add up here. According to @Emijrp's charts, enwiki editors create about 800-900 new articles a day, and they delete about 600-700 articles per day. Thus it seems the survival rate (currently) should be about 25%. Your chart above, however, shows a survival rate of about 75%. Somebody's statistics are way off.

kaldari added a comment.EditedNov 8 2016, 6:33 AM

I did a couple sanity checks to see if Emijrp's numbers were in the ballpark. Here's total number of article deletions (including redirects) for 3 random days:

MariaDB [enwiki_p]> select count(*) from logging where log_type = 'delete' AND log_namespace = 0 AND log_timestamp > 20161102000000 AND log_timestamp < 20161103000000;
+----------+
| count(*) |
+----------+
|      810 |
+----------+
MariaDB [enwiki_p]> select count(*) from logging where log_type = 'delete' AND log_namespace = 0 AND log_timestamp > 20161002000000 AND log_timestamp < 20161003000000;
+----------+
| count(*) |
+----------+
|      515 |
+----------+
1 row in set (6 min 16.32 sec)
MariaDB [enwiki_p]> select count(*) from logging where log_type = 'delete' AND log_namespace = 0 AND log_timestamp > 20161028000000 AND log_timestamp < 20161029000000;
+----------+
| count(*) |
+----------+
|      595 |
+----------+
1 row in set (4 min 33.86 sec)

And here are some creation totals for 3 random days:

MariaDB [enwiki_p]> select count(*) from recentchanges where rc_new = 1 AND rc_namespace = 0 AND rc_timestamp > 20161102000000 AND rc_timestamp < 20161103000000;
+----------+
| count(*) |
+----------+
|     1852 |
+----------+
1 row in set (0.07 sec)
MariaDB [enwiki_p]> select count(*) from recentchanges where rc_new = 1 AND rc_namespace = 0 AND rc_timestamp > 20161010000000 AND rc_timestamp < 20161011000000;
+----------+
| count(*) |
+----------+
|     1532 |
+----------+
1 row in set (0.07 sec)
MariaDB [enwiki_p]> select count(*) from recentchanges where rc_new = 1 AND rc_namespace = 0 AND rc_timestamp > 20161101000000 AND rc_timestamp < 20161102000000;
+----------+
| count(*) |
+----------+
|     1629 |
+----------+
1 row in set (0.06 sec)

That gives me an average survival rate of about 62%. It looks like Emijrp's deletion numbers are about right, but their creation numbers are way off (assuming my numbers are accurate, which isn't certain).

Although https://stats.wikimedia.org/EN/TablesArticlesNewPerDay.htm suggests that my article creation numbers are wrong and Emijrp's are right. Ugh.

Emijrp added a comment.EditedNov 8 2016, 9:49 AM

@Neil_P._Quinn_WMF: Hmm, something doesn't add up here. According to @Emijrp's charts, enwiki editors create about 800-900 new articles a day, and they delete about 600-700 articles per day. Thus it seems the survival rate (currently) should be about 25%. Your chart above, however, shows a survival rate of about 75%. Somebody's statistics are way off.

The new articles chart only shows the survived articles for each day. It runs over the RC table, recalculating on a daily basis, so when a sysop deletes an article, edits related to that page are removed from the RC table and the new page disappears in the chart.

For example, on November 1st, 2016:

  • Chart 0002 says there were 844 new articles
  • Chart 0004 says there were 647 article deletions

So, 844/((844+647)/100) give us a 56.6% survival rate. The number of pages created on namespace=0 on November 1st was 844+647 = 1491.

By the way, only pages greater than 100 bytes are counted in the new pages chart. Other than that, script assumes it is a redirect or a test page, etc.

SQL queries are here https://github.com/emijrp/wmcharts

SELECT CONCAT(YEAR(rc_timestamp),'-',LPAD(MONTH(rc_timestamp),2,'0'),'-',LPAD(DAY(rc_timestamp),2,'0'),'T00:00:00Z') AS date, COUNT(*) AS count FROM recentchanges WHERE rc_timestamp>=DATE_ADD(NOW(), INTERVAL -%d DAY) AND rc_new=1 AND rc_namespace=0 AND rc_new_len>=100 GROUP BY date ORDER BY date ASC" % (lastdays)

SELECT CONCAT(YEAR(log_timestamp),'-',LPAD(MONTH(log_timestamp),2,'0'),'-',LPAD(DAY(log_timestamp),2,'0'),'T00:00:00Z') AS date, COUNT(*) AS count FROM logging WHERE log_namespace=0 AND log_timestamp>=DATE_ADD(NOW(), INTERVAL -%d DAY) AND log_action='delete' GROUP BY date ORDER BY date ASC" % (lastdays)

Though now that I analyse the queries, I see that the deletion query counts all pages in namespace 0, so it is counting deleted redirects, etc (That is a technical limitation, as in Special:Log table there is no info about the size in bytes of a page when it is deleted). So the 56.6% survival rate is conservative, the real one it is probably a bit higher.

kaldari closed this task as Resolved.Feb 2 2017, 5:35 AM
kaldari claimed this task.

@kaldari, you marked this as Resolved. Does that mean I can go someplace and see these figures? If so, thanks! Is this the final chart? Do you guys think it's right?

http://tools.wmflabs.org/wmcharts/wmchart0004.php

Now (getting greedy), is there a place where I can a) set the time period or b) download the numbers to a spreadsheet?

Oh shoot, I just realized that this chart is just deletions (and restorations). While this chart is for New pages created.

So if I want to get what the title of this task says (% of new articles that don't survive), do I have to do the math myself? fIf so, is there a place where I can download the numbers? (And what is the difference between "Deletions" and "Article deletions"?)

@jmatazzoni: The chart is...


...although it's showing the percent that have survived rather than the percent that have not survived.

If you want anything better than that, better ping @Neil_P._Quinn_WMF and see what he can do.

Neil_P._Quinn_WMF raised the priority of this task from High to Needs Triage.Mar 29 2018, 9:05 AM
Neil_P._Quinn_WMF moved this task from Blocked to Radar on the Contributors-Analysis board.