
Taxonomy of new user reading patterns
Closed, Resolved, Public

Authored Sep 30 2019, 9:46 AM


Aim: Understand the reading patterns of new users and what motivates users to create accounts. Note that this is a project as part of @MGerlach's onboarding to become more familiar with WMF datasets. Here, the focus is on the webrequest logs.


  • Identify new users. For example, a new user is someone who clicked the 'Create Account' button
  • Extract the pages visited before and after registration via the webrequest logs
  • Extract features of pages, e.g. namespace, topic, length, popularity
  • Extract reading patterns
    • embedding in a lower-dimensional space
    • are there clusters?
    • can we interpret the clusters?

Begin with one (small) wiki project.

Some context from earlier discussions with Jon Katz that can be helpful in Martin's thinking and direction. (We don't need to attempt to address the questions below. They may be longer term.) --Leila


  • Which parts of the current Desktop experience are key in turning readers into contributors or deepening the reading experience, and should not be removed/hidden as part of the refresh?
  • Can we characterize account creations?
  • Can we characterize the first edit?
  • ...

Higher priority

  • Characterize the accounts who edit, with the hope of gaining a deeper understanding of the characteristics of readers. (Consider repeating Why We Read Wikipedia? as Why We Edit Wikipedia?)
  • Is there anything causal going on? For example, is there a way to train readers to become editors? (beyond editathons and ongoing initiatives by the community). We may not necessarily want to do this, but we want to understand what our options are.

Related Objects

Event Timeline

Identify new users

If we want to understand reading patterns, we want to use wmf.webrequest.

We can identify new users from visits to "Create Account" pages. In the webrequest logs, we can trace visits to individual pages (and their referers). The Create Account pages are characterized in the following way:

  • namespace_id = -1
  • pageview_info.page_title = "Special:CreateAccount" (in english)

In order to differentiate between different users, we use the hash of the fingerprint obtained from concatenating user_agent + client_ip + accept_language.
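The fingerprinting step can be sketched in plain Python (the actual query uses Spark's hash function; the separator and hash choice here are illustrative assumptions):

```python
import hashlib

def fingerprint(user_agent: str, client_ip: str, accept_language: str) -> str:
    """Hash a concatenation of request headers into a pseudonymous user id.
    Field order and separator are illustrative; the production query may differ."""
    raw = f"{user_agent}-{client_ip}-{accept_language}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

# Requests with identical headers map to the same pseudonymous id
a = fingerprint("Mozilla/5.0", "203.0.113.7", "en-US")
b = fingerprint("Mozilla/5.0", "203.0.113.7", "en-US")
assert a == b
```

Note that such a fingerprint conflates users behind shared IPs and splits users whose headers change, which is one reason the method is restricted to desktop traffic below.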

Further, we impose the following restrictions on the query:

  • is_pageview = 1; whether the request was classified as a pageview in the refinement
  • agent_type = "user"; avoid automated traffic
  • access_method = "desktop"; fingerprinting is not reliable for mobile users
  • referer_class = "internal"; only if the create-account button was clicked from within a Wikimedia project
  • year, month, day, hour; focus on a time window (also to reduce the number of partitions to query)
  • normalized_host.project_family = "wikipedia" AND normalized_host.project = "en"; focus on a specific project
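Taken together, the restrictions above amount to a row-level predicate. A minimal pure-Python sketch (field names follow the webrequest schema, flattened here for illustration):

```python
def is_create_account_view(row: dict) -> bool:
    """Illustrative predicate mirroring the query restrictions above.
    `row` is a dict with webrequest-like fields (nested structs flattened)."""
    return (
        row.get("is_pageview") == 1                      # classified as a pageview
        and row.get("agent_type") == "user"              # avoid automated traffic
        and row.get("access_method") == "desktop"        # fingerprinting unreliable on mobile
        and row.get("referer_class") == "internal"       # clicked from within a project
        and row.get("project_family") == "wikipedia"     # focus on a specific project
        and row.get("project") == "en"
        and row.get("namespace_id") == -1                # special pages
        and row.get("page_title") == "Special:CreateAccount"
    )
```

In the real query these conditions are `.where(...)` clauses on the dataframe; the year/month/day/hour partition filters are omitted here.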

Visit Special:CreateAccount != New user

When we try to identify new users by checking requests to the 'CreateAccount' page, we do not know whether they actually registered. Some exploratory research suggests that many did not:

  • there are far more requests to these pages than new users registered in the user table with a similar registration timestamp
  • some fingerprints make more than 100 requests per hour, indicating automated traffic.

Therefore, the idea is to check the status of the registration by looking at logged-in status.

  • before the request the user should not be logged in
  • after the request the user should be logged in

Joseph suggested that we can check this via x_analytics_map['loggedIn']:

  • 'If set, will be equal to "1", and indicates that the request came from a logged in user'
  • this means that if the key is not set, the user was logged out; if set, it will be 1, indicating the user was logged in
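The coalesce logic this implies is small but easy to get wrong, so here it is spelled out (a sketch; the Spark version uses F.coalesce over the map lookup):

```python
def logged_in(x_analytics_map: dict) -> int:
    """Interpret x_analytics_map['loggedIn']: the key is only present (value "1")
    for logged-in users, so a missing key means logged out (0)."""
    return int(x_analytics_map.get("loggedIn", 0))

assert logged_in({"loggedIn": "1"}) == 1
assert logged_in({}) == 0
```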

Preliminary results from 2019-10-02
Indeed, this seems to give promising results. We pursue the following pipeline:

  • extract all (unique) users that visited Special:CreateAccount
  • query the complete log of these users (in the same time frame) and check whether they were logged in after visiting the create-account page


  • parameters
    • year, month, day, hour = 2019, 7, 15, 10
    • only english wikipedia
  • raw-query:
    • 188 users; 86,356 logs
    • time: 61s
  • filtering registration-events via loggedin
    • 53 users, 792 logs
    • it seems we are filtering out lots of automated traffic
    • avg session-length: 15
    • count how many different pages before/after
leila triaged this task as High priority. Sep 30 2019, 5:29 PM

Optimizing the code for querying new users from webrequests

@JAllemandou @Ottomata
I would love to get your feedback on the spark-code I wrote to query the reading patterns of new users [notebook is attached].

While the code works and I am happy with its performance, the main challenge for me was the following aspect of the query:

  • I first query the webrequests of all users that visited a 'CreateAccount' page at least once (grouped by user and then ordered by timestamp)
  • I only want to keep the requests of those users where the user was
    • not logged in when visiting the 'CreateAccount' page
    • logged in when visiting the next page

My approach:

  • For the former, I make a simple query consisting of hashing a fingerprint and filtering to those users that visited the page at least once (df)
  • For the latter, I create an aggregate table in which I record users that show the pattern (not logged in + createaccount --> logged in + next page) (df_users)
  • I then make a left-semi join of df and df_users to keep only those sessions in which users fulfill the latter condition.

My question is whether there is a better or more efficient way to make those queries, perhaps even combining them into a single query.
Any other comments regarding the usage of spark are also very much appreciated.
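For readers less familiar with join types: the left-semi join used here keeps rows of the left dataframe whose key matches the right side, without adding any right-side columns. A plain-Python analogue (illustrative, not the Spark API):

```python
def left_semi_join(df: list, df_users: list, key: str = "user_id") -> list:
    """Plain-Python analogue of Spark's left-semi join: keep rows of `df`
    whose key appears in `df_users`; no columns from the right side are added."""
    keys = {u[key] for u in df_users}
    return [row for row in df if row[key] in keys]

sessions = [{"user_id": 1, "page": "A"}, {"user_id": 2, "page": "B"}]
registered = [{"user_id": 2}]
assert left_semi_join(sessions, registered) == [{"user_id": 2, "page": "B"}]
```

This also explains why the join never duplicates session rows even if a user appears multiple times in df_users.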

Ottomata added subscribers: mforns, Milimetric.

cc @Milimetric and @mforns too in case they have query optimization tips for Martin :)

@MGerlach Logged-in users have a different pattern through the site; it is worth checking that all their requests are served via varnish. If they are not (which might be the case), you have an incomplete dataset.

Ok, so I confirmed the data will indeed be there for all logged-in users; just pageview times will be longer on repeated pageviews because those pages are not cached.

@JAllemandou can you take a look at Martin's code and see if there is anything he can optimize?

We can identify new users from visits to "Create Account" pages

BTW, in the future if this is something we wanted to regularly measure and analyze, we could create a mediawiki.user-create stream that would contain events for every user account creation.

I looked at this and it's a clever way to get some rough information to answer the main question. But I just wanted to point out: it's not by accident that this kind of connection is hard to make. We made a conscious decision a while back that we should not optimize accessing reading patterns of specific users. Being able to get this data is problematic for privacy reasons, and more so if we can get this data quickly. So, nice work, but I would hesitate before optimizing it too much more.

Adding in a few links for reference:
Definitely want to tie this to two projects from the Growth Team: Understanding First Day and Welcome Survey. Specifically, it's worth checking out Morten's reports for these two projects:

I looked at the code and have some comments, but not that many given the complexity of the analysis :) Good job @MGerlach!

  • When using dates for partition pruning, I discourage applying functions over the partition values, as it can lead to incorrect results depending on how date formats are interpreted in functions. I'll give you an example:
spark.sql("select count(1) from pageview_hourly where concat(year, '-', month, '-', day, ' ', hour, ':00:00') = '2019-10-14 00:00:00'").show(100, false)
|0       |

spark.sql("select count(1) from pageview_hourly where concat(year, '-', month, '-', day, ' ', hour, ':00:00') = '2019-10-14 0:00:00'").show(100, false)

I'd rather use an explicit year = 2019 and month = 10 and day = 14 and hour = 0 to prevent this kind of error.
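The failure mode can be seen outside Spark too: concatenating integer-typed partition columns yields no zero padding for single-digit values, so a string comparison against a zero-padded timestamp silently matches nothing (a pure-Python sketch of the same pitfall):

```python
year, month, day, hour = 2019, 10, 14, 0

# Naive concatenation of integer partition columns, as in the SQL example above
concat = f"{year}-{month}-{day} {hour}:00:00"
assert concat == "2019-10-14 0:00:00"    # single-digit hour: no zero padding
assert concat != "2019-10-14 00:00:00"   # so an equality filter against this literal returns 0 rows
```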

  • When using windowing functions, is there a reason to have chosen Window.partitionBy().orderBy(F.col('user_id'),F.col('ts')) instead of Window.partitionBy(F.col('user_id')).orderBy(F.col('ts'))? This prevents spark from having to sort the full dataset (a group-by is cheaper than sorting) and simplifies the check at aggregation time:
.withColumn('logged_in_next', F.when( F.col('user_id') == F.lag('user_id',count=-1).over(w_user_ts), F.lag('logged_in',count=-1).over(w_user_ts)).otherwise(0))


.withColumn('logged_in_next', F.coalesce(F.lead('logged_in',1).over(w_user_ts), 0))

I also changed lag(xx, -1) to lead(xx, 1), which I find less misleading.

  • Finally, I very much like the use of windowing functions, but they are resource-expensive. Depending on how much data you need to process (time-wise and column-wise), they may fail. In that case, the solution is to process standalone dataframes and join them, as you've done with user_df (but let's keep the windows as long as they work :)

Please ping me if you have questions!

Change of formatting in webrequest logs?

When looking at the number of registration events over time, I find that there are between 100 and 200 events per hour.
However, this number drops to exactly 0 on 2019-07-23.
See plot here:

number-registration.png (348×735 px, 26 KB)

Was there a change in how the webrequest data is formatted, e.g. how projects are coded (I focus on English Wikipedia via normalized_host)?
Was there a change in the mediawiki software, e.g. the page_title of the create-account page?

These are the most important parameters

## shortcut for page-title
.withColumn('page_title', F.col('pageview_info.page_title') )
## logged_in 0/1
.withColumn('logged_in', F.coalesce(F.col('x_analytics_map.loggedIn'),F.lit(0)) )
## page create account

## select time partition via windows
.where(row_timestamp >= ts_start)
.where(row_timestamp < ts_end)

## select wiki project
.where( F.col('normalized_host.project_family') == "wikipedia" )
.where( F.col('normalized_host.project') == "en" )

## only requests marked as pageviews
.where( F.col('is_pageview') == 1 )
## user: desktop/mobile/mobile app; isaac filters != mobile app
.where( F.col('access_method') != "mobile app" )
## unclear yet; done by isaac
.where( F.col('webrequest_source') ==  'text' )


date_start = datetime.datetime(2019, 7, 20, 0)
date_end = datetime.datetime(2019, 7, 27, 0)

ts_start = calendar.timegm(date_start.timetuple())
ts_end = calendar.timegm(date_end.timetuple())

row_timestamp = F.unix_timestamp(F.concat(
    F.col('year'), F.lit('-'), F.col('month'), F.lit('-'), F.col('day'), 
    F.lit(' '), F.col('hour'), F.lit(':00:00')))
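The window bounds above are epoch seconds computed in UTC; a quick stdlib check of the conversion used for ts_start/ts_end:

```python
import calendar
import datetime

# Unix-timestamp bounds for the time window, as in the snippet above
date_start = datetime.datetime(2019, 7, 20, 0)
ts_start = calendar.timegm(date_start.timetuple())
assert ts_start == 1563580800  # 2019-07-20 00:00:00 UTC

# A request in partition (year=2019, month=7, day=21, hour=3) falls inside the window
row_ts = calendar.timegm(datetime.datetime(2019, 7, 21, 3).timetuple())
assert ts_start <= row_ts
```

calendar.timegm (unlike time.mktime) interprets the tuple as UTC, which matches the partition columns.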

@JAllemandou Any idea what is going on?

Unhelpful comment below -- (I don't know why things changed)
This is why using webrequest for this kind of thing is difficult! The HTTP request patterns you are querying for are not designed to be searched, e.g. for logins; if developers change the way logins work, the HTTP pattern might change.

+1 to @Ottomata's comment; webrequest data is very limited for this type of research. Mediawiki does not change dramatically often, but it does change every time there is a release. Keep in mind that login urls do not need to be backwards compatible.

Edit, correcting link to mediawiki releases:

I have an explanation for you @MGerlach, and will try to show you how I have found it:

This means we have on purpose removed the pageview flag from the page you're after. You can still find it, but you need to rely on the uri_path field without is_pageview being true.

That is fantastic @JAllemandou
I suspected something along these lines but was not sure where/how to track those changes.
Should be possible to fix now.
Thanks a lot.

@leila: Indeed, documentation is available, but on another page, not the webrequest page. I wonder if we shouldn't always add lines to both (I usually do, but we haven't formalized it as a team).

@JAllemandou I see. I now see that you had already linked to the Pageview_hourly page in the section I linked to. I had missed that link. sorry. No need to update the table I linked to, imo. Thanks!

Control group

Aim: Query the reading sessions of users that did not create a new account (i.e. users who did not log in at any point during the time window of focus).
Obviously, the number of these users is much larger. Therefore, we want to sample a subset of the same (or at least comparable) size as the new-user data.

Approach: Essentially, I apply a filter on the webrequest dataframe on the user_id field with a given (not very large) number of users, such that the resulting dataframe contains perhaps 5000 rows.

First, I filter only users that were never logged-in during their session:

## user-id and partition over users
user_id = F.hash(F.concat(F.col('client_ip'),F.lit('-'),F.col('user_agent'))) 
w = Window.partitionBy(F.col('user_id'))

df = (spark.table('wmf.webrequest')
    ## define some columns
    .withColumn('logged_in', F.coalesce(F.col('x_analytics_map.loggedIn'), F.lit(0)) )
    ## some selection for user-agent, project, pageviews, time, etc. [see full code attached below]
    ## filter users that were never logged in
    .withColumn('n_p_loggedin_by_user', F.max( F.col('logged_in') ).over(w) )
    .where( F.col('n_p_loggedin_by_user') == 0 )
)

Second, I get N_sample user_ids which I want to sample (in a new dataframe df_users)

N_sample = 1000
df_users = df.select('user_id').distinct().limit(N_sample)

Third, I filter the reading sessions for the 1000 users via a left-semi join (to get a much smaller dataframe df_sample)

df_sample = df.join(df_users, df['user_id'] == df_users['user_id'] , "left_semi" )
df_sample = df_sample.toPandas()

Result: For time windows of 1 day or less, this works.

However, for time windows larger than 1 day, I get the following error:

Py4JJavaError: An error occurred while calling o977.collectToPython.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange rangepartitioning(user_id#3002 ASC NULLS FIRST, ts#2942 ASC NULLS FIRST, 200)
+- *(9) BroadcastHashJoin [user_id#3002], [user_id#3261], LeftSemi, BuildRight
... more stuff

I am not sure I understand why this happens.
Is it that the original dataframe is too large, or that it spans too many partitions (in terms of number of days)?
Should I choose an alternative to the left-semi join? (My understanding, though, is that this is the join to use when one wants to keep the rows from the left df which match some row of the right df.)
Or should I apply the join for each day separately and concatenate the individual days' sampled dataframes to avoid the error?

Any thoughts/recommendations, @JAllemandou?

This is the full code from my notebook

(pyspark-yarn large)

Hi @MGerlach - There are two things I can think of to try to help:

  • Technical trick - Increase the default value of the Spark-SQL partitioner. By default, spark uses 200 partitions for datasets/dataframes, which is not problematic if the data is small (spark handles many small tasks fast), but can be if the data is big (200 partitions is too few, resulting in tasks that are too big). Here is a way to do so: spark.sql('SET spark.sql.shuffle.partitions = 1024') (to be executed before the dataframe execution, obviously :)
  • Functional discussion - Is grouping by user over a few days of webrequest really what you want? I'm asking in regard to usual session definitions, where user interactions are considered part of the same session as long as there is no inactivity period longer than X for that user. If we use this definition, then the grouping can be done not only on user-id but also on day (with some incorrect data at day boundaries), greatly facilitating the spread of the data among tasks. Please let me know if my thinking makes sense :)

Updates 2019-10


I queried the reading sessions of new users from the webrequest logs for one week (2019-09-09 to 2019-09-15) on English Wikipedia (access_method = desktop | mobile web).
I consider a registration event if the following conditions are met:

  • visit to page 'Special:CreateAccount' (via uri_query) and x_analytics_map.loggedIn == null (i.e. not logged in)
  • visit to the following page with x_analytics_map.loggedIn == 1 (i.e. logged in).
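The two conditions amount to a scan over a user's time-ordered requests, looking for the CreateAccount visit whose successor is logged in. A pure-Python sketch (field names are illustrative simplifications of the webrequest columns):

```python
def find_registration(requests: list):
    """Return the index of a registration event in a time-ordered request list,
    or None. Each request is a dict with 'page_title' and 'logged_in' (0/1),
    mirroring the two conditions above."""
    for i in range(len(requests) - 1):
        if (requests[i]["page_title"] == "Special:CreateAccount"
                and requests[i]["logged_in"] == 0
                and requests[i + 1]["logged_in"] == 1):
            return i
    return None

reqs = [
    {"page_title": "Main_Page", "logged_in": 0},
    {"page_title": "Special:CreateAccount", "logged_in": 0},
    {"page_title": "Main_Page", "logged_in": 1},
]
assert find_registration(reqs) == 1
```

In Spark this per-user successor lookup is what the lead/lag window function over (user_id, ts) computes.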

Reading sessions are defined as consecutive visits to pages ('is_pageview'==1) by the same fingerprint (hash of client_ip+user_agent) with less than 1 hour of inactivity between them.
This yields 24,909 registrations.
For comparison, I query a dataset containing a similar number of reading sessions which did not contain a registration event and whose users did not already have an account (i.e. x_analytics_map.loggedIn == null for the whole session).
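The sessionization rule (cut whenever inactivity reaches one hour) can be sketched over plain unix timestamps; this is an illustration of the definition, not the production code:

```python
def split_sessions(timestamps: list, max_gap: int = 3600) -> list:
    """Split a sorted list of unix timestamps (one user's pageviews) into
    sessions, cutting whenever the inactivity gap reaches `max_gap` seconds."""
    sessions, current = [], []
    for ts in timestamps:
        if current and ts - current[-1] >= max_gap:
            sessions.append(current)
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

# Gaps of 10 minutes keep one session; a gap over an hour starts a new one
assert split_sessions([0, 600, 1200, 9000]) == [[0, 600, 1200], [9000]]
```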

1. Reliability

I want to assess how reliable this method of identifying new users is.
Therefore, I compare the number of new users from this method with the number of newly registered users ('created_by_self'==True) from mediawiki_user_history.
The method based on webrequests captures 80% of the number of new users at any given hour in the 1-week period.
At the moment, I am not completely sure about the missing 20%, but this analysis suggests that we can reliably capture the majority of registration events.

image.png (348×735 px, 59 KB)

image.png (348×731 px, 33 KB)

2. Desktop vs Mobile

For new users, almost 2/3 use the desktop version.
In contrast, regular reading sessions preferentially take place via the mobile-web version.

image.png (306×517 px, 16 KB)

3. From where are webrequests made

Here, I compare the location (on the level of continent) of users.
Most new users come from North America and Asia in absolute numbers.
However, in comparison to the amount of 'regular' traffic, the number of new users from North America and Europe is much smaller and the number of new users from Asia and Africa is much larger.

4. Length of session

Here, I want to compare the number of (distinct) pages visited during each reading session.
When considering the full session, the new-user reading sessions tend to be much longer.
The reading sessions of 'regular' users are peaked at n=1 page visit.

image.png (287×442 px, 15 KB)

Interestingly, the two distributions become much more similar if one focuses only on the reading session *before* the registration event.

image.png (287×442 px, 15 KB)

Note that some of the new users have a session length of n=0 (no visits to pages before the registration event).

5. Namespaces

Do new-user reading sessions originate in specific namespaces?
Here, I record the namespace of the first page-visit in the reading session (if not available, I assign value=-2).
Most sessions start in ns=0 (article), but for new users the percentage is slightly smaller.

image.png (297×508 px, 17 KB)

In order to reveal differences for the other namespaces, I plot the y-axis on a log scale.
This reveals that new users have a richer spectrum of namespaces from which their reading sessions start.

image.png (297×518 px, 18 KB)

Note that this difference persists even when we separate desktop and mobile users.

Updates 2019-11-11

  • Replicated the analysis for 6 different wikis: en, de, fr, ar, cs, ko (metawiki)
  • Added a summary of the main findings (metawiki)

Most sessions start in ns=1 (article), but for new users the percentage is slightly smaller.

just a quick note that ns=1 is article talk and ns=0 is the article namespace, but I figure from the graph and your results that this was just a typo

You are right, luckily just a typo. Thanks for pointing out, I corrected the comment above.

@MGerlach Jon Katz mentioned that he will read the report and he will get back to you with any feedback he has on the Discussion page of the report or via email. I asked him to report any item missing in the analysis that is key for understanding the reader-->create account transition. He may have other questions outside of this realm that can be added to Future Directions in the page. Let's plan to resolve this task in a week unless major feedback comes in via Jon.

Thanks for all your work and documentation.

MGerlach updated the task description.