
Get Wikidata clickstream
Open, Medium, Public

Description

Motivation
We want to evaluate whether the changes we are making to the mobile interface of Wikidata are improving its usefulness. To be able to do that we need some basic info about what people usually do.

Task

  • Show how sessions on Wikidata unfold. E.g. "X people came from a Wikidata item, then went to a WD property page, then to another item, and another item", or "one item only and gone again", or "from the Wikipedia app to WD editing to a WD item view to another WD item".
  • It should be visible whether the mobile or the desktop version of the page was shown.
  • It should be separated between logged-in and anonymous users.

Notes

  • Probably the results will be too complex for Grafana, and it may make sense to do this with a Shiny dashboard
  • Maybe one can make use of the clickstream analysis done by the Search team
  • We still need to define what "session" means (one possible convention is sketched after this list)
  • It is not yet clear what exactly the result should look like; that also depends on what we can do at what cost
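As a starting point for the "session" discussion, here is a minimal sketch of one common convention: requests from the same client are grouped into a session, and a new session starts after 30 minutes of inactivity. It is written against the wmf.webrequest table in Hive; the field names, the actor key (a hash of client IP and user agent) and the 30-minute cutoff are illustrative assumptions, not a decision.

```lang=scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("wikidata-sessions").enableHiveSupport().getOrCreate()
import spark.implicits._

// One hour of human Wikidata pageviews; field names follow the wmf.webrequest
// schema as I understand it and would need to be checked.
val requests = spark.sql("""
  SELECT hash(client_ip, user_agent) AS actor,
         ts,
         pageview_info['page_title'] AS page_title,
         access_method                -- kept for the desktop/mobile split later
  FROM wmf.webrequest
  WHERE webrequest_source = 'text'
    AND uri_host IN ('www.wikidata.org', 'm.wikidata.org')
    AND is_pageview
    AND agent_type = 'user'
    AND year = 2018 AND month = 11 AND day = 1 AND hour = 12
""")

// Tentative session rule: a new session starts after 30 minutes of inactivity.
val byActor = Window.partitionBy("actor").orderBy("ts")
val sessions = requests
  .withColumn("gap_s", unix_timestamp($"ts") - unix_timestamp(lag($"ts", 1).over(byActor)))
  .withColumn("new_session", when($"gap_s".isNull || $"gap_s" > 1800, 1).otherwise(0))
  .withColumn("session_id", sum($"new_session").over(byActor))

// One row per session with the ordered sequence of pages viewed.
val sessionPaths = sessions
  .groupBy("actor", "session_id")
  .agg(sort_array(collect_list(struct($"ts", $"page_title"))).as("steps"))
```

The resulting rows (actor, session_id, ordered list of pages) are roughly the shape the task asks for, before any splitting by access method or login status.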

Event Timeline

Lea_WMDE updated the task description.

@JAllemandou Hey, I need some insight into the production code for the Clickstream dataset, but I can't find the code repository anywhere. Maybe you could help? Thanks. N.B. I am not looking for Python use cases (I've found them), nor for the SQL extraction of the monthly updates (I've seen that too), but rather for the code that feeds the tables in the clickstream database in Hadoop.

@Lea_WMDE I still need to get back to you on this one. I have studied the existing datasets and still have some thinking to do about how to get to what we need based on what is already in production; while I think it can generally be done, it is not going to be "cheap" in any respect (i.e. in time or computational resources).

In general, the prima facie structure of the Clickstream dataset is what we are looking for, except that we need some additional fields (desktop/mobile, logged-in/anonymous) and that our filtering criteria (e.g. how we define a user session) might be different.
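As a hedged illustration of where those two extra fields could come from, assuming the access_method column and the loggedIn flag in x_analytics_map on wmf.webrequest behave for Wikidata traffic the way they do for other projects:

```lang=scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("wd-extra-dimensions").enableHiveSupport().getOrCreate()

// Sketch of the two extra splits per request; whether x_analytics_map['loggedIn']
// is reliably set for Wikidata traffic is an assumption to verify.
val dimensions = spark.sql("""
  SELECT access_method,                                         -- 'desktop', 'mobile web', 'mobile app'
         x_analytics_map['loggedIn'] IS NOT NULL AS logged_in,  -- anonymous requests lack the flag
         COUNT(*) AS requests
  FROM wmf.webrequest
  WHERE uri_host IN ('www.wikidata.org', 'm.wikidata.org')
    AND is_pageview
    AND year = 2018 AND month = 11 AND day = 1 AND hour = 12
  GROUP BY access_method, x_analytics_map['loggedIn'] IS NOT NULL
""")
```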

Also, I would need to study the ua_parser library (luckily, there's an R version) and find out how Analytics Engineering makes use of it to filter out spider traffic. In other words, it can be done and we can have the dataset (someday), but it is going to be complex and take quite some time, especially if I have to produce every bit of it by myself.
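If the ua-parser-based tagging is already materialised in the refined webrequest table as agent_type (an assumption worth confirming with Analytics Engineering before relying on it), the spider filtering might not need to be reproduced in R at all; a sketch:

```lang=scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("wd-human-traffic").enableHiveSupport().getOrCreate()

// Assumes agent_type already encodes the ua-parser-based bot classification,
// so excluding spiders reduces to a WHERE clause.
val humanRequests = spark.sql("""
  SELECT *
  FROM wmf.webrequest
  WHERE uri_host IN ('www.wikidata.org', 'm.wikidata.org')
    AND is_pageview
    AND agent_type = 'user'   -- drops rows tagged as spider/bot traffic
    AND year = 2018 AND month = 11 AND day = 1 AND hour = 12
""")
```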

Hi @GoranSMilovanovic, the code we use to generate monthly data is here: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/ClickstreamBuilder.scala
As for the clickstream database in Hive, it's not used anymore; it's a left-over from Ellery's time.
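For orientation, the monthly files that job publishes are, as far as I can tell, headerless TSVs with four columns (referrer, resource, link type, count). A minimal way to load one for exploration; the path is illustrative only, since no Wikidata flavour of the dump exists at this point:

```lang=scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("clickstream-dump").getOrCreate()

// Illustrative path; the published dumps currently cover Wikipedia languages only.
val clickstream = spark.read
  .option("sep", "\t")
  .csv("/path/to/clickstream-enwiki-2018-10.tsv.gz")
  .toDF("prev", "curr", "type", "n")
```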

@Lea_WMDE Just to clarify, I will put no further effort into this until you let me know what you think with respect to my findings in T208569#4767950.

Addshore moved this task from Data Analytics to Product on the WMDE-Analytics-Engineering board.
Addshore added a subscriber: GoranSMilovanovic.