
Get Wikidata clickstream
Open, Medium, Public

Description

Motivation
We want to evaluate whether the changes we are making to the mobile interface of Wikidata are improving its usefulness. To be able to do that we need some basic info about what people usually do.

Task

  • Show how sessions on Wikidata unfold. E.g. "X people came from a Wikidata item, then went to a WD property page, then to another item, and another item", or "one item only and gone again", or "from the Wikipedia app to WD editing to a WD item view to another WD item".
  • It should be visible whether the mobile or the desktop version of the page was shown.
  • It should be separated between logged-in and anonymous users.

Notes

  • Probably the results will be too complex for Grafana, and it may make sense to do this with a Shiny dashboard
  • Maybe one can make use of the clickstream analysis done by the Search team
  • We still need to define what "session" means (one possible convention is sketched after this list)
  • It is not yet clear what exactly the result should look like; that also depends on what we can do at what cost
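As a starting point for the "session" discussion, here is a minimal sketch of one common convention: requests from the same client are grouped into a session, and a new session starts after 30 minutes of inactivity. It is written against the wmf.webrequest table in Hive; the field names, the actor key (a hash of client IP and user agent) and the 30-minute cutoff are illustrative assumptions, not a decision.

```lang=scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("wikidata-sessions").enableHiveSupport().getOrCreate()
import spark.implicits._

// One hour of human Wikidata pageviews; field names follow the wmf.webrequest
// schema as I understand it and would need to be checked.
val requests = spark.sql("""
  SELECT hash(client_ip, user_agent) AS actor,
         ts,
         pageview_info['page_title'] AS page_title,
         access_method                -- kept for the desktop/mobile split later
  FROM wmf.webrequest
  WHERE webrequest_source = 'text'
    AND uri_host IN ('www.wikidata.org', 'm.wikidata.org')
    AND is_pageview
    AND agent_type = 'user'
    AND year = 2018 AND month = 11 AND day = 1 AND hour = 12
""")

// Tentative session rule: a new session starts after 30 minutes of inactivity.
val byActor = Window.partitionBy("actor").orderBy("ts")
val sessions = requests
  .withColumn("gap_s", unix_timestamp($"ts") - unix_timestamp(lag($"ts", 1).over(byActor)))
  .withColumn("new_session", when($"gap_s".isNull || $"gap_s" > 1800, 1).otherwise(0))
  .withColumn("session_id", sum($"new_session").over(byActor))

// One row per session with the ordered sequence of pages viewed.
val sessionPaths = sessions
  .groupBy("actor", "session_id")
  .agg(sort_array(collect_list(struct($"ts", $"page_title"))).as("steps"))
```

The resulting rows (actor, session_id, ordered list of pages) are roughly the shape the task asks for, before any splitting by access method or login status.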

Event Timeline

Lea_WMDE updated the task description.

@JAllemandou Hey, I need some insight into the production code for the Clickstream dataset, but I can't find the code repository anywhere. Maybe you could help? Thanks. N.B. I am not looking for Python use cases (I've found them), nor for the SQL extraction of the monthly updates (I've seen that too), but rather for the code that feeds the tables in the clickstream database in Hadoop.

@Lea_WMDE I still need to get back to you on this one. I have studied the existing datasets and still have some thinking to do about how to get to what we need based on what is already in production; while I think it can generally be done, it is not going to be "cheap" in any respect (i.e. in time or computational resources).

In general, the prima facie structure of the Clickstream dataset is what we are looking for, except that we need some additional fields (desktop/mobile, logged-in/anonymous) and that our filtering criteria (e.g. how we define a user session) might be different.
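As a hedged illustration of where those two extra fields could come from, assuming the access_method column and the loggedIn flag in x_analytics_map on wmf.webrequest behave for Wikidata traffic the way they do for other projects:

```lang=scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("wd-extra-dimensions").enableHiveSupport().getOrCreate()

// Sketch of the two extra splits per request; whether x_analytics_map['loggedIn']
// is reliably set for Wikidata traffic is an assumption to verify.
val dimensions = spark.sql("""
  SELECT access_method,                                         -- 'desktop', 'mobile web', 'mobile app'
         x_analytics_map['loggedIn'] IS NOT NULL AS logged_in,  -- anonymous requests lack the flag
         COUNT(*) AS requests
  FROM wmf.webrequest
  WHERE uri_host IN ('www.wikidata.org', 'm.wikidata.org')
    AND is_pageview
    AND year = 2018 AND month = 11 AND day = 1 AND hour = 12
  GROUP BY access_method, x_analytics_map['loggedIn'] IS NOT NULL
""")
```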

Also, I would need to study the ua_parser library (luckily, there's an R version) and find out how Analytics Engineering makes use of it to filter out spider traffic. In other words, it can be done and we can have the dataset (someday), but it is going to be complex and take quite some time, especially if I have to produce every bit of it by myself.
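If the ua-parser-based tagging is already materialised in the refined webrequest table as agent_type (an assumption worth confirming with Analytics Engineering before relying on it), the spider filtering might not need to be reproduced in R at all; a sketch:

```lang=scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("wd-human-traffic").enableHiveSupport().getOrCreate()

// Assumes agent_type already encodes the ua-parser-based bot classification,
// so excluding spiders reduces to a WHERE clause.
val humanRequests = spark.sql("""
  SELECT *
  FROM wmf.webrequest
  WHERE uri_host IN ('www.wikidata.org', 'm.wikidata.org')
    AND is_pageview
    AND agent_type = 'user'   -- drops rows tagged as spider/bot traffic
    AND year = 2018 AND month = 11 AND day = 1 AND hour = 12
""")
```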

Hi @GoranSMilovanovic, the code we use to generate monthly data is here: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/ClickstreamBuilder.scala
As for the clickstream database in Hive, it's not used anymore; it's a left-over from Ellery's time.
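For orientation, the monthly files that job publishes are, as far as I can tell, headerless TSVs with four columns (referrer, resource, link type, count). A minimal way to load one for exploration; the path is illustrative only, since no Wikidata flavour of the dump exists at this point:

```lang=scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("clickstream-dump").getOrCreate()

// Illustrative path; the published dumps currently cover Wikipedia languages only.
val clickstream = spark.read
  .option("sep", "\t")
  .csv("/path/to/clickstream-enwiki-2018-10.tsv.gz")
  .toDF("prev", "curr", "type", "n")
```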

@Lea_WMDE Just to clarify, I will put no further effort into this until you let me know what you think with respect to my findings in T208569#4767950.

Addshore moved this task from Data Analytics to Product on the WMDE-Analytics-Engineering board.
Addshore added a subscriber: GoranSMilovanovic.