
Record an event every time a new content namespace page is created
Closed, Resolved; Public; 5 Story Points

Description

As shown by T149049 and T149021, it is surprisingly difficult to get accurate stats about article creation rates or who creates articles. There are several reasons for this:

  1. We don't log it in the logging table.
  2. The page table doesn't include information about when an article was created or by whom.
  3. The revision and recentchanges tables don't include deleted revisions.

To get around these limitations, it is usually necessary to run expensive queries across several large tables, or even aggregate data from several different queries. Considering the importance of this information, it would make sense to start logging article creation events through EventLogging.

Every time a new content-namespace page is created it should record the following information:

  • Page ID
  • Initial page title (including namespace if present)
  • Username of the page creator
  • Edit count of the page creator
  • Age of the page creator account in days
  • Does the page creator have the autopatrol right?
  • Whether or not the page is a redirect
  • Initial size of the page

This should make future research about article creation much easier.
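The field list above could be captured in a record along these lines. This is only a sketch: the class name, field names, and the use of bytes for page size are assumptions for illustration, not the schema that was eventually adopted.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PageCreateEvent:
    """Hypothetical record shape; names are illustrative, not a real schema."""
    page_id: int
    page_title: str              # initial title, including namespace if present
    user_name: str               # username of the page creator
    user_edit_count: int         # edit count of the page creator
    user_account_age_days: int   # age of the creator account in days
    user_is_autopatrolled: bool  # does the creator have the autopatrol right?
    page_is_redirect: bool
    page_size: int               # initial size; bytes is an assumed unit


# Made-up example values, only to show the shape of one event.
example = PageCreateEvent(
    page_id=12345,
    page_title="Example article",
    user_name="ExampleUser",
    user_edit_count=42,
    user_account_age_days=365,
    user_is_autopatrolled=False,
    page_is_redirect=False,
    page_size=2048,
)

# Serialize to JSON the way an event payload might be sent.
payload = json.dumps(asdict(example), sort_keys=True)
```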

Event Timeline


I'm not sure about recentchanges but going by rev_parent_id = 0 is how XTools and other editor analysis tools do it. I don't think we've ever had a complaint of articles missing from the results.

Ya @Niharika it might be worth checking the revision table instead of recentchanges. Hopefully the results would be the same, but who knows! :)

kaldari added subscribers: aaron, brion.EditedJun 15 2017, 5:37 PM

Here's a page that has 9 revisions (out of 12) with rev_parent_id = 0: https://ia.wikipedia.org/w/index.php?title=Wikipedia:A_proposito/ro&action=history

No revisions were deleted from the page. The revisions in question are by multiple users (some logged in, some anonymous) over a span of multiple years.

Maybe @aaron or @brion could shed some light on this problem.

If it turns out that rev_parent_id = 0 isn't reliable, a solution would be to add a new hook handler into EventBus.hooks.php for the PageContentInsertComplete hook and create a new schema specifically for page creation events. This would also give us a much smaller table to query against (as the revision tables from EventBus are eventually going to be nearly as unwieldy as the revision tables for the actual wikis).

Also, FYI @kaldari: edit count is being added to the data lake. When you are putting together the SQL for your metrics, please consult with @Neil_P._Quinn_WMF or @Tbayer about where you want to source your data from, the shape of the queries, etc.

Please see: https://phabricator.wikimedia.org/T161147

brion added a comment.Jun 15 2017, 6:27 PM

I have the impression rev_parent_id isn't reliable but don't offhand recall how they can break under the hood... Worth taking a look to see if we can make it reliable. :)

I ran @Milimetric's query on enwiki with an additional rev_timestamp where clause to look at the past month's records:

mysql:wikiadmin@db1080 [enwiki]> select rev_page, count(*) as duplicate_rev_parent_id_zeroes from revision where rev_timestamp >= 20170515000000 and rev_parent_id = 0 group by rev_page having count(*) > 1;
+----------+--------------------------------+
| rev_page | duplicate_rev_parent_id_zeroes |
+----------+--------------------------------+
| 54073829 |                              2 |
| 54112112 |                              2 |
| 54112566 |                              2 |
| 54166124 |                              2 |
| 54169722 |                              2 |
| 54179829 |                              2 |
| 54188344 |                              2 |
| 54193238 |                              2 |
| 54200374 |                              2 |
| 54204127 |                              2 |
| 54257308 |                              2 |
| 54280581 |                              2 |
| 54280697 |                              2 |
| 54283421 |                              2 |
+----------+--------------------------------+
14 rows in set (53 min 42.84 sec)

Maybe it helps.
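For anyone who wants to try the same check without a 53-minute production query, here is a small sketch against an in-memory SQLite table. The table has only the columns the query needs, and the rows are invented toy data, not real enwiki revisions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE revision ("
    " rev_id INTEGER PRIMARY KEY, rev_page INTEGER,"
    " rev_parent_id INTEGER, rev_timestamp TEXT)"
)

# Toy rows: page 100 has one parentless revision, page 200 has two.
rows = [
    (1, 100, 0, "20170516000000"),
    (2, 100, 1, "20170517000000"),
    (3, 200, 0, "20170518000000"),
    (4, 200, 0, "20170519000000"),
]
conn.executemany("INSERT INTO revision VALUES (?, ?, ?, ?)", rows)

# Same shape as the query above: pages with more than one
# rev_parent_id = 0 revision since the cutoff timestamp.
duplicates = conn.execute(
    "SELECT rev_page, COUNT(*) AS duplicate_rev_parent_id_zeroes"
    " FROM revision"
    " WHERE rev_timestamp >= ? AND rev_parent_id = 0"
    " GROUP BY rev_page HAVING COUNT(*) > 1",
    ("20170515000000",),
).fetchall()
```

With this toy data only page 200 is reported, with a count of 2.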

aaron added a comment.Jun 15 2017, 6:57 PM

I'm not sure what makes it diverge, but maybe the population scripts could be run again (and possibly a new mode added to them). I'd agree with brion that making it more robust is worth exploring.

@Ottomata: From brion and Niharika's comments above, it looks like rev_parent_id = 0 isn't reliable. I would like to move ahead with using the PageContentInsertComplete hook instead and having a dedicated page creation schema/table. Do you think that makes sense, and is it something that you would want to help with?

Folks -- this is for a high priority, community visible project (ACTRIAL) -- we really need this data. How can I help move this forward?

thanks,

-Toby

Nuria renamed this task from Record an EventLogging event every time a new content namespace page is created to Record an event every time a new content namespace page is created.Jun 16 2017, 4:08 PM

@Tnegrin: The data coming from these patches comes from MediaWiki events. There is no historical data, so you cannot use them to reconstruct historical information about edits, which (if I understand things right) is what is needed to calculate ACTRIAL-related metrics. See @kaldari's questions on https://meta.wikimedia.org/wiki/Research:Wikipedia_article_creation_II which require a 90 day time frame.

The data lake can be used to calculate historical metrics, and we are making changes that address @kaldari's points above. @JAllemandou is working part-time to get edit_count (https://phabricator.wikimedia.org/T161147) and historical page_is_redirect values. Edit count is almost done; the other one (redirect) requires quite a bit of effort, so it won't be done in the next week or so unless we drop a bunch of work that is tied to our quarterly goals.

I would like to move ahead with using the PageContentInsertComplete hook instead and having a dedicated page creation schema/table. Do you think that makes sense, and is it something that you would want to help with?

Yes, we can help with this. Do you want to put a changeset together that we can code review? But, to reiterate my earlier point, these events can be created but they will not have historical information, so I cannot see how you would use this work immediately for ACTRIAL metrics.


During our meeting with Toby and Victoria a few weeks ago, we decided that we needed a 2-pronged approach to dealing with ACTRIAL: a short-term plan (to deal with the immediate issues) and a longer-term plan (that includes the possibility of ACTRIAL being implemented). The dashboard that we want to build from EventBus data is mainly to address the longer-term needs, while the improvements to the Data Lake data are to address the short-term needs. Since neither of these is really going to be available in the short term, we've been working with whatever imperfect data we've been able to cobble together (with Dan and Tilman's help) in the meantime. Thanks for your continued assistance on this and prioritizing work on it. I know you guys are busy with lots of other projects and it isn't fun dealing with interruptions and context switching.

@mobrovac @Pchelolo, any objections to a new mediawiki/page/create schema and event stream triggered from PageContentInsertComplete?

Oh, this task took a long time to read.

any objections to a new mediawiki/page/create schema and event stream triggered from PageContentInsertComplete?

Obviously finding what's wrong with the rev_parent_id is a much better solution - we use this field for other stuff too, so if it's wrong it's a problem regardless of this particular discussion. So I'd prefer to fix a bug rather than create a workaround and make unnecessary topics.

As I remember, we used PageContentInsertComplete in the beginning and had some problems with it, but I don't quite remember what exactly was wrong.


I could be making this up, but I think it didn't capture all revision creates? PageContentInsertComplete is going to include a lot more than page creation anyway, so we'd have to filter for something. What would we filter by? rev_parent_id = 0? It's possible that doing this would have the same issues as revision create.

Obviously finding what's wrong with the rev_parent_id is a much better solution

I think you are right. In either case, we are going to need to know how to identify a page creation event, and if rev_parent_id = 0 is not it, then what is?

@Ottomata: PageContentInsertComplete isn't supposed to capture all revision creates. It only captures page creates, which is exactly what I need.

Oh indeed, we're using PageContentSaveComplete, not PageContentInsertComplete. The latter btw is missing from the hook list in the docs: https://www.mediawiki.org/wiki/Manual:Hooks

Ohhhh, excuse me. Man what bad hook names. PageContentSaveComplete is for page updates, which is a revision insert. Didn't realize that PageContentInsertComplete was a different hook. Sorry shoulda read the docs before commenting, too bad hook names are weird.

Ok, in that case, I'm not sure what is best to do.

Ottomata added a comment.EditedJun 20 2017, 7:04 PM

Just examined some of the pages that @Niharika found have 2 revisions with rev_parent_id == 0: https://gist.github.com/ottomata/1dd994e49894a3bad691f3594885e5b0

select rev_page, page_title, rev_comment from page, revision where rev_page in (54073829,54112112,54112566,54166124,54169722,54179829,54188344,54193238,54200374,54204127,54257308,54280581,54280697,54283421) and rev_parent_id = 0 and revision.rev_page = page.page_id;

And here's a pair of them compared:
https://en.wikipedia.org/w/index.php?title=Cythara_fasciata&type=revision&diff=780953291&oldid=780806826

It's hard to know for sure, but it looks like these are all redirect pages created as a result of a page move, possibly all also related to a histmerge?

@kaldari, if it is true that all of the pages with rev_parent_id == 0 are redirect pages, can you just ignore events with rev_parent_id == 0 and page_is_redirect == false?

I wonder if also undeletions cause the rev_parent_id to be set to 0.

Either way, I would be +1 on having a page creation event in EventBus, as that sounds like a useful one to have. Ideally, we could generate it from the EventBus proxy service itself, but that likely will not be possible until T168457: Sometimes rev_parent_id is set to 0 even if it isn't the first revision for a page is figured out. Until then, maybe we can emit events based on the PageContentInsertComplete hook?

Are we sure that PageContentInsertComplete doesn't suffer from the same problem as using rev_parent_id? I just tried to follow the paths that might end up triggering that in MW core, but got pretty lost. I think the main path is in EditPage.php

$new = !$this->page->exists();
...
$flags = EDIT_AUTOSUMMARY |
    ( $new ? EDIT_NEW : EDIT_UPDATE ) |
...
$doEditStatus = $this->page->doEditContent(
			$content,
			$this->summary,
			$flags,

And $flags will end up being what ends up telling WikiPage which hook to fire.

But, there are lots of calls to page->doEditContent throughout the code.

The only purpose of PageContentInsertComplete is to handle events related to page creation, so it should be the most reliable thing to use. rev_parent_id == 0 is just a proxy for page creation, so I don't think it makes as much sense to rely on, especially since we already know that it isn't reliable. My vote would be to create a new page creation schema for EventBus and use PageContentInsertComplete. The other advantage of using that hook is that the data is more likely to be comparable to the EventLogging data for page creation (which has been using that hook for years).

Good point. This indeed seems to be possible, as there are places in the code that set the EDIT_NEW or EDIT_UPDATE flag based on whether oldRevision is true/false, e.g. in the SpecialChangeContentModel special page. I am quite sure digging deeper would reveal this pattern elsewhere as well.

The only purpose of PageContentInsertComplete is to handle events related to page creation, so it should be the most reliable thing to use. rev_parent_id == 0 is just a proxy for page creation, so I don't think it makes as much sense to rely on, especially since we already know that it isn't reliable. My vote would be to create a new page creation schema for EventBus and use PageContentInsertComplete. The other advantage of using that hook is that the data is more likely to be comparable to the EventLogging data for page creation (which has been using that hook for years).

I like the idea of having a page-creation event, but would really really like us to be absolutely sure PageContentInsertComplete is indeed used only for new pages, and I don't think that is currently the case (that we are sure this statement holds true, that is).

On the other hand, using PageContentInsertComplete is likely to be more accurate than rev_parent_id = 0, and it is historically what we've done. I'd be fine with making a new page-create event based on this hook, and if in the future we find we need to fix what the hook does, or do a little extra filtering before the event is emitted, we can do that.

Change 360411 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Don't attempt to examine max timestamp of eventlogging table if it doesn't have a timestamp field

https://gerrit.wikimedia.org/r/360411

Change 360411 merged by Ottomata:
[operations/puppet@production] Don't attempt to examine max timestamp of eventlogging table if it doesn't have a timestamp field

https://gerrit.wikimedia.org/r/360411

@kaldari FYI I just merged eventlogging puppet change to start inserting some of the eventbus events into the EventLogging Analytics MySQL databases. mediawiki_revision_create_1 should be a table there now.

Hm I'm wondering if we need a new schema for this or we can reuse the revision-create schema for this new topic? It seems the schema is identical (except the rev_parent_id field).

I like the idea of having a page-creation event, but would really really like us to be absolutely sure PageContentInsertComplete is indeed used only for new pages, and I don't think that is currently the case (that we are sure this statement holds true, that is).

@mobrovac: After parsing through the code, I'm pretty confident that PageContentInsertComplete is only called during new page creation. It looks like there are two cases where PageContentInsertComplete is not called during page creation, however: When a page is created via import (import/WikiRevision.php) and when a redirect is automatically created during a page move (MovePage.php). These seem like sensible exceptions (for my use case at least).

@kaldari, are we also sure that PageContentInsertComplete is *not* called during weird history merges or revision deletions, page moves etc.? Otherwise it'll have the same flaw as using rev_parent_id.

In the one double rev_parent_id 0 page I checked, both revisions were 'redirect' revisions. It is possible that rev_parent_id = 0 + page_is_redirect = false would get you only real page creates (if you don't want to count auto redirect page creations).

Anyway, there's no strong objection to moving forward with making a new event based on PageContentInsertComplete (I chatted with Marko in IRC about this yesterday). I'll try to find some time to make some patches that do this by the end of the week.

@kaldari FYI yesterday I merged eventlogging puppet change to start inserting some of the eventbus event streams into the EventLogging Analytics MySQL databases. mediawiki_revision_create_1 should be a table there now. :)

In the one double rev_parent_id 0 page I checked, both revisions were 'redirect' revisions. It is possible that rev_parent_id = 0 + page_is_redirect = false would get you only real page creates (if you don't want to count auto redirect page creations).

@Ottomata: I don't think that would solve the problem as there are definitely cases of rev_parent_id = 0 that aren't redirects. For example, none of the 9 rev_parent_id = 0 revisions on this page are redirects: https://ia.wikipedia.org/wiki/Wikipedia:A_proposito/ro

Anyway, there's no strong objection to moving forward with making a new event based on PageContentInsertComplete (I chatted with Marko in IRC about this yesterday). I'll try to find some time to make some patches that do this by the end of the week.

That's great news and much appreciated!

FYI yesterday I merged eventlogging puppet change to start inserting some of the eventbus event streams into the EventLogging Analytics MySQL databases. mediawiki_revision_create_1 should be a table there now. :)

Also great news!

In the one double rev_parent_id 0 page I checked, both revisions were 'redirect' revisions. It is possible that rev_parent_id = 0 + page_is_redirect = false would get you only real page creates (if you don't want to count auto redirect page creations).

@Ottomata: I don't think that would solve the problem as there are definitely cases of rev_parent_id = 0 that aren't redirects. For example, none of the 9 rev_parent_id = 0 revisions on this page are redirects: https://ia.wikipedia.org/wiki/Wikipedia:A_proposito/ro

Those 9 revisions are all from 2003-2004, so they predate the creation of rev_parent_id in MediaWiki 1.10 (2007). Did we never backfill this field?

Also, I just realized that if you history-merge two pages, there will be two revisions with rev_parent_id=0 (one for each of the original creations). But since you seem to be talking about looking at the revision at or shortly after creation time, I think rev_parent_id=0 should be a reliable indicator of new-ness.

kaldari added a comment.EditedJun 21 2017, 6:58 PM

@Niharika, @Ottomata: I tested and confirmed that PageContentInsertComplete is not called when undeleting a page, so there shouldn't be any issues with history merges (if we're using that hook).

Change 360698 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[eventlogging@master] Build SQL table name from topic if set, else use schema name

https://gerrit.wikimedia.org/r/360698

Change 360703 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/extensions/EventBus@master] Emit mediawiki.page-create event on PageContentInsertComplete

https://gerrit.wikimedia.org/r/360703

Change 360704 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/event-schemas@master] Reuse revision/create schema for new the page-create topic

https://gerrit.wikimedia.org/r/360704

Change 360698 merged by Ottomata:
[eventlogging@master] Build SQL table name from topic if set, else use schema name

https://gerrit.wikimedia.org/r/360698

Change 360704 merged by Ottomata:
[mediawiki/event-schemas@master] Reuse revision/create schema for new the page-create topic

https://gerrit.wikimedia.org/r/360704

Change 360703 merged by Ottomata:
[mediawiki/extensions/EventBus@master] Emit mediawiki.page-create event on PageContentInsertComplete

https://gerrit.wikimedia.org/r/360703

kaldari added a comment.EditedJun 26 2017, 9:07 PM

@Ottomata: Once this starts collecting the page creation data in mySQL (hopefully starting this Thursday), how do we access that data? My assumption is:

  • Log into stat1003.eqiad.wmnet
  • Connect to dbstore1002 or analytics-store: mysql --defaults-file=/etc/mysql/conf.d/research-client.cnf --host dbstore1002.eqiad.wmnet
  • Look at the mediawiki_page_create_1 table in the log database

Does that sound right?

Nuria added a comment.Jun 26 2017, 9:18 PM

How to access data in MariaDB:
https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging#MariaDB

You can look at data at this time:

mysql:research@analytics-store.eqiad.wmnet [log]> select * from mediawiki_revision_create_1 limit 1;

kaldari closed this task as Resolved.Jun 26 2017, 9:26 PM

Thanks Nuria. I think we can mark this resolved now!

kaldari reopened this task as Open.Jun 30 2017, 12:13 AM

@Ottomata: Something seems to be wrong. It looks like it is recording an entry in the mediawiki_page_create_1 table for every revision creation rather than every page creation. Also, the topic is recorded as mediawiki.revision-create rather than mediawiki.page-create. I'm sure the hook we're using in the EventBus code is only for page creation, so I think the problem must lie elsewhere. Where does the event topic get mapped to a mysql table?

Also, unlike the regular EventLogging tables, the tables generated from EventBus have no indexes on the timestamps, making them mostly unusable for querying.

kaldari added a comment.EditedJun 30 2017, 8:41 PM

@Ottomata, @Nuria: It looks like both page-create and (some seemingly random subset) of revision.create events are being funneled into the mediawiki_page_create_1 table, while revision.create events are being funneled into the mediawiki_revision_create_1 table. I wonder if this is related to the insertion bundling that _insert_multi() is doing in jrm.py. I don't know Python, so hard for me to tell.

@Ottomata, @Nuria: Uh oh, it looks like page-create events are also going into the mediawiki_revision_create_1 table, so both tables are polluted.

Nuria added a comment.Jun 30 2017, 8:48 PM

I see, need to look at this in more detail to see if issue is with insertion or with events themselves.

Note that revision-create and page-create use the same schema, so all of the events will be in revision-create but only some of them will also be emitted as page-create.

kaldari added a comment.EditedJul 3 2017, 6:15 PM

@mobrovac: There are 2 different definitions of "event" here: an "on-wiki event" and an "EventBus event". A single on-wiki event can create multiple EventBus events. It is expected that both revision and page creation on-wiki events will end up in the mediawiki_revision_create_1 table (since page creations also create a revision). However, what's happening is that all revision and page creation EventBus events seem to be going into mediawiki_revision_create_1. In addition, all revision and page creation EventBus events related to page creation are going into mediawiki_page_create_1. Finally, a seemingly random subset of revision EventBus events unrelated to page creations are also going into mediawiki_page_create_1. So there are 2 data problems that have been created. First, for a single on-wiki page creation event, mediawiki_page_create_1 and mediawiki_revision_create_1 will both have 2 recorded events (one that is revision-create and one that is page-create). Second, mediawiki_page_create_1 is recording events that have no relation to page creation. Basically, both tables are corrupted with bad data now and need to be cleaned up, in addition to the code being fixed to stop the bugs.

Nuria added a comment.EditedJul 3 2017, 6:47 PM

Notes for @Ottomata and @Nuria .

  • Since we are consuming directly from the Kafka topics in the MySQL consumer, the events do not appear in the all-events log, which makes troubleshooting hard. I think we should change that.
  • The easy way to spot the events that do not belong in the table is:

mysql:research@analytics-store.eqiad.wmnet [log]> select distinct meta_topic from mediawiki_page_create_1 limit 20;
+---------------------------+
| meta_topic                |
+---------------------------+
| mediawiki.revision-create |
| mediawiki.page-create     |
+---------------------------+
2 rows in set (4.62 sec)

Obviously revision-create events should not be on page-create table. FYI that this issue is also present on beta so we can test there and (hopefully) fix it.

Nuria added a comment.EditedJul 3 2017, 6:50 PM

kafkacat -b kafka1012.eqiad.wmnet:9092 -t eqiad.mediawiki.page-create

This lists only correct events (ones whose topic is always page-create), so the issue is with the code pulling from Kafka and inserting into MySQL.

Nuria added a subscriber: mforns.EditedJul 3 2017, 8:24 PM

Thanks to @mforns for his insight: the issue is with https://github.com/wikimedia/eventlogging/blob/master/eventlogging/handlers.py#L481 which groups events by schema/revision. Since these distinct events share a schema, they are being inserted together.

Quick fix: create schema for page-create
Not so quick fix: update EL code so it understands that events with shared schemas might not necessarily be inserted together.

I think changes are needed here, as this code also needs to take topic into account if present: topic has precedence over schema in deciding which table events are inserted into.

https://github.com/wikimedia/eventlogging/blob/master/eventlogging/handlers.py#L481
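To make the failure mode concrete, here is a minimal sketch of the two grouping strategies. It is not the actual eventlogging code: the event dicts, the helper names, and the table-name normalisation are all assumptions chosen to mirror the table names seen in this task.

```python
import re
from collections import defaultdict


def table_for(event):
    """Hypothetical naming rule: prefer the topic, fall back to the schema."""
    name = event.get("topic") or event["schema"]
    return re.sub(r"[^A-Za-z0-9]+", "_", name) + "_" + str(event["revision"])


def group_by_scid(events):
    """Buggy grouping: key on (schema, revision) only, ignoring topic."""
    groups = defaultdict(list)
    for event in events:
        groups[(event["schema"], event["revision"])].append(event)
    return dict(groups)


def group_by_table(events):
    """Fixed grouping: key on the destination table, so topic is respected."""
    groups = defaultdict(list)
    for event in events:
        groups[table_for(event)].append(event)
    return dict(groups)


# Two topics that share one schema, as page-create and revision-create do.
SCHEMA = "mediawiki/revision/create"
events = [
    {"schema": SCHEMA, "revision": 1, "topic": "mediawiki.revision-create"},
    {"schema": SCHEMA, "revision": 1, "topic": "mediawiki.page-create"},
    {"schema": SCHEMA, "revision": 1, "topic": "mediawiki.revision-create"},
]

by_scid = group_by_scid(events)    # one batch mixing both topics
by_table = group_by_table(events)  # one batch per destination table
```

Grouping by schema/revision yields a single batch, so whichever table that batch is written to receives events from both topics. Grouping by destination table keeps them apart.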

Change 363092 had a related patch set uploaded (by Mforns; owner: Mforns):
[eventlogging@master] Fix mysql handler scid grouping

https://gerrit.wikimedia.org/r/363092

Nuria added a comment.Jul 4 2017, 4:28 AM

Tested the fix on beta and I think it looks good. Can @kaldari verify? Events flowing are from the beta cluster: https://en.wikipedia.beta.wmflabs.org/wiki/Oatmeal

https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/TestingOnBetaCluster#Database

Before deploying the fix, we should think about what we want to do with the bad records in the tables on prod. Should we drop the tables entirely or drop records with the wrong topics?

mforns added a comment.Jul 4 2017, 6:17 PM

Note that the records were not duplicated across tables; they were misdirected into the wrong table. So deleting the records with the wrong topics would not simply remove duplicates: it would ensure that the events in a table actually belong to it, but it would not ensure completeness. Theoretically, the tables would contain only around half of the events they should contain.
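A toy illustration of that point, assuming the misdirection worked as described in this comment (each event landed in exactly one table, sometimes the wrong one). The event IDs and their placement are invented.

```python
# Events actually emitted per topic (invented IDs).
emitted = {
    "mediawiki.page-create": {1, 2, 3, 4},
    "mediawiki.revision-create": {5, 6, 7, 8},
}

# Where the events ended up: each one in exactly one table, some misdirected.
tables = {
    "mediawiki_page_create_1": [
        (1, "mediawiki.page-create"),
        (2, "mediawiki.page-create"),
        (5, "mediawiki.revision-create"),
        (6, "mediawiki.revision-create"),
    ],
    "mediawiki_revision_create_1": [
        (3, "mediawiki.page-create"),
        (4, "mediawiki.page-create"),
        (7, "mediawiki.revision-create"),
        (8, "mediawiki.revision-create"),
    ],
}

expected_topic = {
    "mediawiki_page_create_1": "mediawiki.page-create",
    "mediawiki_revision_create_1": "mediawiki.revision-create",
}

# "Clean up" by deleting rows whose topic does not match the table.
cleaned = {
    table: [row for row in rows if row[1] == expected_topic[table]]
    for table, rows in tables.items()
}

# Every remaining row is correct, but events that went to the other
# table are now missing entirely.
missing = {
    table: emitted[expected_topic[table]] - {row[0] for row in cleaned[table]}
    for table in tables
}
```

After the cleanup each table holds only matching events, but half of the events it should contain are gone, which is the argument for dropping the tables and re-reading from Kafka.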

As this feature is quite new (23 June IIRC), maybe we can drop the tables entirely and try starting to read from the oldest event of the topic in Kafka, instead of continuing with the current Kafka offset. Not sure how to do that, though.

Since the records were misdirected rather than split into both tables, I think it would be best to TRUNCATE or DROP both tables and start over. Otherwise, we'll have to warn everyone who uses them for the rest of eternity. That's just my personal opinion though.

Then, if we want to drop tables ( I agree, that is the best solution) this is the path to action:

Edit:
Rather, let's deploy code changes and stop eventlogging once we get someone to do the table drop

Nuria moved this task from Next Up to In Progress on the Analytics-Kanban board.

Change 363092 merged by Nuria:
[eventlogging@master] Fix mysql handler scid grouping

https://gerrit.wikimedia.org/r/363092

Nuria added a comment.Jul 10 2017, 4:22 PM

Deployed eventlogging with fix after adding unit tests plus testing in beta.

kaldari closed this task as Resolved.Jul 11 2017, 4:49 PM

This seems to be working smoothly now! Thanks @Ottomata and @Nuria and everyone else who helped!

Nuria reopened this task as Open.Jul 12 2017, 10:09 PM

Reopening as a recent change broke insertion on the schema, see https://phabricator.wikimedia.org/T170486#3433321

Nuria added a comment.Jul 13 2017, 3:41 PM

We have fixed insertion issue and will be backfilling events today. @Ottomata or @Nuria will ping here when that is completed

Ok, events have been backfilled. However, I accidentally backfilled TOO many events, in that I did not filter out bot events during the backfill process.

Nuria added a comment.Jul 18 2017, 2:48 PM


To clarify: these events that come from EventBus are not affected by our bot filtering; @Ottomata's comment above applies to other schemas.

Closing ticket as events for page-create have been backfilled

Nuria closed this task as Resolved.Jul 18 2017, 2:48 PM
DannyH moved this task from Estimated to Archive on the Community-Tech board.Jul 18 2017, 10:46 PM
Neil_P._Quinn_WMF raised the priority of this task from Normal to Needs Triage.Mar 30 2018, 10:34 AM
Neil_P._Quinn_WMF moved this task from Backlog to Radar on the Contributors-Analysis board.