
Page creation data no longer updates
Closed, Resolved · Public

Description

One of the NPP reviewers mentioned on enwiki that the Page Creations dashboard does not have current data, which makes it difficult for them to compare changes in the NPP backlog against trends in page creations. Examining the underlying mediawiki_page_create_2 table in the log database shows that it does not have data past June 20.

I am not sure what the status of data gathering for page creations is post-ACTRIAL, so I thought I'd open this ticket to track it.

Event Timeline

Milimetric triaged this task as High priority.
Milimetric moved this task from Incoming to Operational Excellence on the Analytics board.
Milimetric added a project: Analytics-Kanban.

As of June 21st, the query started returning 0. Indeed, there's no data after that day:

mysql:research@analytics-slave.eqiad.wmnet [log]> select max(rev_timestamp) from mediawiki_page_create_2;
+---------------------+
| max(rev_timestamp)  |
+---------------------+
| 2018-06-20 19:47:25 |
+---------------------+

The Kafka topics have data, so that isn't the problem: https://grafana.wikimedia.org/dashboard/db/eventbus?refresh=1m&orgId=1

Then I thought maybe the data's not being refined, but no, that's fine:

hive (event)> select max(rev_timestamp) from mediawiki_page_create where year=2018 and month=8;
OK
_c0
2018-08-09T23:59:59Z

So for some reason the data is there, it's just not getting into MySQL, and hasn't been for the last few months. I think maybe that part of EventBus got turned off a while back and we forgot about this dataset. So the most logical thing to do is move these queries over to stat1005 and hit Hive with them. @Ottomata, am I right about the cause and the fix?

I don't know of anything that would have turned off these imports into MySQL. Will look into it!

Change 451864 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Use Kafka jumbo-eqiad cluster for eventlogging consumer mysql eventbus

https://gerrit.wikimedia.org/r/451864

Change 451864 merged by Ottomata:
[operations/puppet@production] Use Kafka jumbo-eqiad cluster for eventlogging consumer mysql eventbus

https://gerrit.wikimedia.org/r/451864

Mentioned in SAL (#wikimedia-analytics) [2018-08-10T14:52:44Z] <ottomata> restarting eventlogging-consumer@mysql-eventbus consuming from kafka jumbo-eqiad - T201420

Wow, this is 100% my fault. We stopped using the Kafka cluster that this MySQL import process was configured to use back in June. The configs for this were just never updated. The cluster and topics still exist, there is just no page-create data there anymore.

I've just fixed the config, so new data will now be saved in MySQL (and it should be backfilling the last 7 days now).

@kaldari we should have all of the missing data in Hadoop/Hive, but it won't be trivial to backfill into MySQL. I can certainly do it. Q for you: do we need to do it? :D If you don't plan on querying the missing data in MySQL, I'd prefer not to backfill it there.
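To give a sense of why the backfill isn't trivial, here is a rough sketch of what it would involve. The file paths, the elided column list, and the partition filters are all assumptions for illustration, not anything already set up:

```sql
-- Hypothetical backfill: export the missing window from Hive, then
-- load it into MySQL. Column names are assumed to match between the
-- Hive event.mediawiki_page_create table and the MySQL log table.

-- Step 1 (Hive): dump the gap to a TSV, e.g.
--   hive -e "SELECT ... FROM event.mediawiki_page_create
--            WHERE year = 2018 AND month BETWEEN 6 AND 8
--              AND rev_timestamp > '2018-06-20T19:47:25Z'" > gap.tsv

-- Step 2 (MySQL): load it; IGNORE skips rows that collide with the
-- week already re-consumed from Kafka (assuming a unique event key)
LOAD DATA LOCAL INFILE 'gap.tsv'
IGNORE INTO TABLE mediawiki_page_create_3;
```

Even in this sketch there is schema-matching and deduplication work at both ends, which is where the cost comes from.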

BTW, the latest version of page events seems to be 3, so events are in mediawiki_page_create_3.
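As a quick sanity check once the fixed consumer has caught up, the earlier query can simply be pointed at the new table version:

```sql
-- Same check as above, against the current table version
SELECT MAX(rev_timestamp) FROM mediawiki_page_create_3;
```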

Hmm, that's unfortunate. I think @Nettrom would be the best person to answer your question: How important do you think it would be to backfill the missing data? I'm not aware of any current WMF projects that are using page creation metrics (now that ACTRIAL is over). Obviously, it would be a minor problem for NPP folks, but I think they could survive without it.

I don't think backfilling all the data is very important. The only ones that appear to be affected are the NPP reviewers, and I should be able to run some queries on the Data Lake to either fill the missing data, or get reasonable estimates they can use.
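For reference, a sketch of what such a Data Lake query might look like. The wmf.mediawiki_history table and its event_entity / event_type / snapshot fields are assumptions based on the standard Data Lake schema, and the snapshot value shown is hypothetical:

```sql
-- Daily enwiki page creations over the gap, from the Data Lake
SELECT SUBSTR(event_timestamp, 1, 10) AS day,
       COUNT(*)                       AS pages_created
FROM wmf.mediawiki_history
WHERE snapshot = '2018-07'        -- hypothetical snapshot partition
  AND wiki_db = 'enwiki'
  AND event_entity = 'page'
  AND event_type = 'create'
  AND event_timestamp >= '2018-06-21'
GROUP BY SUBSTR(event_timestamp, 1, 10)
ORDER BY day;
```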

@Ottomata If the name of the table has changed, that means there are also a few SQL queries that need updating for the ReportUpdater job that exports it. Not sure if you'd like me to submit a patch to the Analytics repository for those, or if you'd rather take care of it yourself?

Thanks @Nettrom. Backfilling that data so it shows up on the dashboard is a bit of a pain, but if you think it would be useful I'll do it.

I've updated the table name in the queries so reports going forward will get data.

@Milimetric Thanks for taking care of the SQL queries! I don't see a need for backfilling the data at the moment; the benefit doesn't warrant the cost. As mentioned, I can help the NPP folks out with getting their data together. In other words, as far as I can tell, this ticket can be closed now.

And thanks also to @Ottomata for helping out with this!