Page MenuHomePhabricator

Migrate New Editors Analytics to Production
Closed, ResolvedPublic

Description

  1. Migrate the new editors analytics pipeline from local to production (stat1005).
  2. Compare the time series profile of incoming new editors on enwiki with the one regularly generated for the dewiki users. Figure out whether the drop in the number of new editors is characteristic of dewiki only (previously: https://phabricator.wikimedia.org/T171098)

Event Timeline

Ongoing. Currently experimenting with enwiki (huge datasets); the results will be used for comparisons with dewiki new editors report.

Current status: solving the R {sparklyr} connectivity to cluster issue in production, see: https://phabricator.wikimedia.org/T139487

BTW I think https://phabricator.wikimedia.org/T171098 could be closed/merged here.

October 20, 2017

  • the scaling issue is resolved by performing all ETL procedures in the Map-Reduce framework, w/o. Apache Spark;
  • the New Users Report will be receiving monthly updates in the future, however the data set will have a weekly resolution;
  • comparisons w. other Wikipedias will be included as a standard section(@Verena currently, the comparisons set encompasses enwiki and frwiki; please provide a list of all projects that would be interesting to compare with dewiki).

Timeline: new version of the Report starts in early November, once the wmf.mediawiki_history Hive table receives its October 2017 snapshot.

enwiki and frwiki are sufficient. Currently no more projects to be included.

@Verena Here's a new version of the (now) monthly de.wikipedia.org New Editors Report, based on the wmf.mediawiki_history Hadoop table.

First, the October 2017 snapshot of the wmf.mediawiki_history table is still not ready. The data that we have now end in September 2017. I've contacted the Analytics on irc, but that's more or less where my maneuvering space ends.

Now, the new version of the Report. You will notice how numbers are different compared to the previous (weekly) version of the Report. However, qualitatively, or more precisely: viewed as if measured on an *ordinal scale*, the data are the same (e.g. peaks remain in January, the ordering of monthly data points in respect to how many users reach their 10th edit is the same, etc). In general, the numbers are higher than before.

Reasons for data discrepancy between the new and the old version of the report:

  • We never count edits made on pages that are redirects; however, the wmd.mediawiki_history table has only one field, page_is_redirect_latest, that addresses the information that we need. This boolean field (true/false) signifies only whether a page is a redirect *currently*. In other words, we have no historical data in the wmd.mediawiki_history table from which we could able to learn whether and when the redirect status of a certain page has changed. As a consequence, if a page is a redirect now (page_is_redirect_latest = true), we do not count in the revisions of that page made by a particular user *ever*, while there is a possibility that the page was edited by someone in the past *while it was not a redirect* - a situation that would be counted as a valid edit in the old (weekly) version of the Report. However, this scenario implies that we would be observing lower, not higher counts of new editors in the present.
  • Take for example a page that that is currently not a redirect, but it used to be in the past. No past edits - while the page was a redirect - would be counted in by the old, weekly script, while the new HiveQL based script will count in all edits made on that page simply because the page is currently not a redirect. This scenario implies that we would observe a higher number of new edits in the present, compared to the old, SQL/MariaDB based version of the Report. Sincerely, I have no plausible hypotheses on what is more likely: for a page to live as redirect for some time and then gain independence, or vice versa, for a page to be independent first and than become a redirect. Also, I guess there are situations where a page changes its redirect status back and forth, and such situations imply only additional complications.

In relation to these dilemmas, I have opened T180310 and asked if it is possible to include a historical version of the page_is_redirect_latest field to the wmf.mediawiki_history table.

  • The third possibility is that my HiveQL query does not reproduce exactly the previous SQL query. However, from what I see this would again imply getting to lower, not higher counts of new editors (I've used several nice constraints that are pre-computed and entered to wmd.mediawiki_history to help select real, self-made users only).
  • The fourth possibility is that something in the engineering of the wmf.mediawiki_history table (schema) is not crystal clear to me, and that might be the case since I know only the table schema and do not have a full insight on how the table is produced (although its origin is certainly in the regular SQL tables that we've used previously).

Also, please note that now we have complete data as of January 2007, not January 2003 as before. This fact, and the fact that the new editor counts are higher than before for (some of) the reasons discussed above, influence the forecast (Section 4 in the Report), making it less pessimistic than it was before.

Please: (a) check whether the Report contains all analytical sections that you need and let me know if anything needs to be changed or added to it; (b) we can wait for the Analytics to let me know about the possibility to introduce a historical version of the redirect field or go with the Report as it is - please bring a decision and let me know of it. In my opinion, if you're interested to follow the trend in the number of incoming new editors and compare to enwiki and frwiki (or to other projects as well), what you have here is good enough. In comparison with other statistics, stats.wikipedia.org finds 459 new editors on de.wikipedia.org in September 2017, while my method finds 480. Looking at previous five months, the numbers at https://stats.wikimedia.org/EN/TablesWikipediaDE.htm (based on dump parsing) do not match perfectly, yet the scale seems to be in place.

If your team can make use of the Report in this form, I will put it on a crontab in production and you will start receiving regular monthly updates - I will let you know where from you will be able to download them. Of course, the reports will be as regular as the wmf.mediawiki_history table gets regularly updated.

@Verena

apparently, the Analytics are already very aware of the missing historical field in wmf.mediawiki_history, see T161146.

@GoranSMilovanovic
Thank you! I am surprised by the numbers. Suddenly the situation is much better.

I agree, it doesn't seem urgent and crucial to include the historical version of the redirect.

As for the current Report coverage: In section 2 is there an explanation for the yearly huge drops in new editor numbers for every language?

In general some sentences in the report on what is and is not included in the the edit count for the new editors would be helpful.

@Verena

Suddenly the situation is much better.

Be very, very careful when making a qualitative statement based on a complicated technical discussion encompassing the origin of numbers upon which the statement is being made.

... it doesn't seem urgent and crucial to include the historical version of the redirect.

We are missing important information by not having it indeed; the Analytics hope to have that field in the wmf.mediawiki_history table during Q4.

In section 2 is there an explanation for the yearly huge drops in new editor numbers for every language?

We could probably produce a descent Phd Thesis by attempting to provide a concise answer to this question.

In general some sentences in the report on what is and is not included in the the edit count for the new editors would be helpful.

No worries, I'll do that.

@Verena Section 0. included, explaining the definition of a New User used in this Report.

Let me know what do you think of this, and then maybe we can put it on cron and in production.

@Verena @Stefan_Schneider_WMDE @Tobi_WMDE_SW

Please let me know whether the new New Users Report as we have it now presents what you need, so that I can try to

  1. put it on cron from the statboxes and
  2. automate the Rmarkdown production process.

Thanks.

@Verena @Stefan_Schneider_WMDE @Tobi_WMDE_SW

Please, what is your take on the New Users Report: we are switching to it, or not? Thanks.

From my side that's fine, but this is mainly a decision for @Verena and @Stefan_Schneider_WMDE .

The report is totally sufficient and ready for automation.

@Verena

  • The Analytics have just announced (in an e-mail from @Milimetric) that they will be re-designing the wmf.meadiwiki_history table. In that respect,
  • please let me know whether we need an update of the New Users Report urgently, so that I can run one before they start their work there.
  • Of course this makes sense only if a fresh snapshot of the wmf.meadiwiki_history table is ready (the last one that I've seen was September or October 2017).

@GoranSMilovanovic the new mediawiki_history table is already made and loading. But it's not changed significantly at all, just some field re-names. Let me know if you want to talk it over and I can code-review your updates to your code to make sure you accounted for everything.

@Milimetric Thanks a lot, Dan. I will make sure to keep you posted.

@GoranSMilovanovic
No need to hurry. The update is not urgent.

@GoranSMilovanovic
I was notified but haven't checked in detail. Will do so tomorrow.

Does it replace the report?

@Verena presents monthly, weekly, 1 - 2 years overviews and global data, bar chart and table. Download options. Interface based on the same technology as what I use from R's {Shiny}.

I mean, if you still need a separate report, let's do it, but maybe we're just re-inventing the wheel of Wikistats...

@GoranSMilovanovic
Ich checked the new Wikistats. Beautiful!

Unfortunately missing are the numbers of editors who reached their 10th edit (or more). I couldn't find that in the Dashboard or elsewhere.

At least for this number we still need the report.

As we have to do our final annual reporting, I would need that numbers by the 3rd of January....

Let's have a chat in the first week of January what is required for reporting in 2018.

@Verena No problem, we can have our New Users report anytime. Please let me know when would you like to meet.

Hi Goran,

indeed. I meant that I need the report by that date. The hangout is not related to that. We could also that early next week. I would suggest Monday or Tuesday?

I will send you an invite.

Can you send the me the report as soon as possible?

@Verena I thought you wanted to talk before producing the report. I will run the code ASAP and send you the results.

@Verena Your report. The current snapshot of the wmf.wikimedia_history table is: November 2017.

@GoranSMilovanovic Could you clarify the most likely reason for the difference in the new editor numbers in the meantime?

@Verena (!!!) I see the differences only now!

  • I will take a closer look at everything; however
  • the code did not change since the last time this Report was ran;
  • the only hypothesis that I have is that some changes were introduced to the wmf.mediawiki_history table of which I am not aware. @Milimetric has mentioned (T171420#3824341) that some changes are being introduced there, but if I understood everything correctly those referred to naming conventions merely (for example, what used to be page_is_redirect_latest is now page_is_redirect, c.f. the [[ https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history | wmf.mediawiki_history documentation ]]). Beyond such changes that should not interfere with the results at all, I am now aware of any other changes made to the table in the meanwhile.

Since I observe this as a serious problem, I will provide my HiveQL query for dewiki here for purposes of further discussion:

SELECT event_user_id, SUBSTR(from_utc_timestamp(event_timestamp, 'CET'), 1, 10) 
                    FROM (
                      SELECT *,
                      row_number() OVER (partition by event_user_id ORDER by event_timestamp) rownum
                      FROM wmf.mediawiki_history WHERE wiki_db = 'dewiki' AND event_entity = 'revision'
                      AND event_type = 'create'
                      AND event_user_is_created_by_self = true
                      AND event_user_is_bot_by_name = false
                      AND page_namespace = 0 
                      AND page_is_redirect = false
                      AND !(event_user_id = 0) 
                      AND snapshot = '2017-11') tab1 
                    WHERE rownum = 10;

The historical page_is_redirect field is still missing from the wmf.mediawiki_history. I can imagine the engineering of the table being rather complicated. However, this fact does not explain the differences that we observe between what we have from two months ago or so and what we see now. Something's changed, and I can't see what exactly.

  • Exp1. Running the same query using page_namespace_historical = 0 instead of page_namespace = 0 (_historical fields were introduced in the meanwhile).

The result of Exp1. are here:

NOTE. The only changes in respect to what I did two months ago are:

  • I am using page_namespace_historical instead of page_namespace now; the page_namespace_historical field is newly introduced;
  • I am using page_is_redirect instead of page_is_redirect_latest (the later does not exist anymore and it really looks like the former is just a rename of it).

Results: the numbers for the previous months are now matched much more closely than in comparison to what was found in the previously sent report (T171420#3874974), but they still do not match exactly.

Explanation: I have nothing else to offer than my hypothesis that something is changing constantly in the wmf.mediawiki_history table, and I do not seem to follow the changes exactly.

The HiveQL query used:

SELECT event_user_id, SUBSTR(from_utc_timestamp(event_timestamp, 'CET'), 1, 10) 
                    FROM (
                      SELECT *,
                      row_number() OVER (partition by event_user_id ORDER by event_timestamp) rownum
                      FROM wmf.mediawiki_history WHERE wiki_db = 'dewiki' AND event_entity = 'revision'
                      AND event_type = 'create'
                      AND event_user_is_created_by_self = true
                      AND event_user_is_bot_by_name = false
                      AND page_namespace_historical = 0 
                      AND page_is_redirect = false
                      AND !(event_user_id = 0) 
                      AND snapshot = '2017-11') tab1 
                    WHERE rownum = 10;
  • the update engine is now fully migrated into production;
  • the outputs will be available from the WMDE Banner Campaigns Dashboard (see: #T209055)