
Clickstream dataset for Persian Wikipedia only includes external values
Closed, Resolved · Public · 3 Estimated Story Points

Description

Steps to reproduce:

Expected result:

  • Some clickstream values should be internal

Actual result:

  • All are external values

Event Timeline

Thanks for reporting this @Ladsgroup. I added Analytics since the data is maintained by Joseph et al.

fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

What I think is happening: the URL-decoded Farsi text from internal referrers does not match (encoding issue?) any page by name, and thus those internal referrers are being lost.

This might indicate that similar encoding issues are happening for other languages, though perhaps less prevalently.
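To illustrate the hypothesis, here is a minimal sketch (the page title "سیب" and the referrer path are made-up examples, not values from the actual dataset): an internal referrer arrives percent-encoded, while page titles are stored as plain UTF-8 text, so a raw string comparison fails until the referrer is decoded.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class ReferrerMismatchDemo {
    public static void main(String[] args) {
        // Hypothetical fawiki page title as stored in the page table (UTF-8 text).
        String pageTitle = "سیب";
        // The same title as it appears in a referrer URL: percent-encoded.
        String referrerPath = "/wiki/%D8%B3%DB%8C%D8%A8";
        String rawTitle = referrerPath.substring("/wiki/".length());

        // Raw comparison fails, so the referrer would be misclassified as external.
        System.out.println(rawTitle.equals(pageTitle));   // false
        // After percent-decoding to UTF-8, the titles match: an internal referrer.
        String decoded = URLDecoder.decode(rawTitle, StandardCharsets.UTF_8);
        System.out.println(decoded.equals(pageTitle));    // true
    }
}
```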

Hey, six months have passed and it's still not fixed. Can I take a look at the code? Maybe I can help.

Sorry progress is so slow. The issue is one of encoding of URLs versus page titles: because there is no match, it seems no internal hits are happening for pages.

Change 472700 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[analytics/refinery/source@master] ClickstreamBuilder: Decode refferer url to utf-8

https://gerrit.wikimedia.org/r/472700

^ I made this patch, but I basically had no way to test it. Please double-check and, if possible, run it for a short period of time before merging.
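The gist of the fix is normalization along these lines (a rough sketch only; the helper name, the URL handling, and the example URL are assumptions for illustration, and the actual ClickstreamBuilder code in the patch differs): extract the article title from a referrer URL and percent-decode it to UTF-8 before joining it against page titles.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class ReferrerTitleExtractor {
    // Hypothetical helper: pull the article title out of a referrer URL and
    // percent-decode it to UTF-8 so it can be matched against page titles.
    static String titleFromReferrer(String referrer) {
        String path = URI.create(referrer).getRawPath();
        if (path == null || !path.startsWith("/wiki/")) {
            return null; // not an article link
        }
        String encoded = path.substring("/wiki/".length());
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Decodes to the Farsi title rather than leaving it percent-encoded.
        System.out.println(titleFromReferrer(
            "https://fa.wikipedia.org/wiki/%D8%B3%DB%8C%D8%A8")); // سیب
    }
}
```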

Ah, it was missing an option. Good command for reference:

spark2-submit --class org.wikimedia.analytics.refinery.job.ClickstreamBuilder --name nuria-farsi-clickstream --master yarn \
--deploy-mode cluster --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.yarn.executor.memoryOverhead=4096 \
--conf spark.dynamicAllocation.maxExecutors=32 --executor-cores 4 --executor-memory 32G --driver-memory 8G \
/home/nuria/workplace/clickstream/refinery-job-0.0.81-SNAPSHOT.jar \
--year 2018 --month 8 --day 1 --hour 1 --minimum-count 10 --snapshot 2018-08 \
--output-base-path hdfs://analytics-hadoop/tmp/nuria-clickstream-test-2018-11-09 \
--wikis fawiki \
--webrequest-table wmf.webrequest \
--project-namespace-table wmf_raw.mediawiki_project_namespace_map \
--page-table wmf_raw.mediawiki_page \
--redirect-table wmf_raw.mediawiki_redirect \
--pagelinks-table wmf_raw.mediawiki_pagelinks

It gives me "application not found" :/
Where can I submit a Spark job like you did? stat1007? And how can I download my patch to test it?
(If I can test it on my own, I will make sure it works before I bug you.)

Thanks!

@Ladsgroup I tested this and several other variations, none of which worked.

You can git clone the refinery repo on the stats machines, change the code (no need to have a Gerrit patch), and execute the task with a command similar to the one I outlined above. You will be executing the job against the jar you have built, so your changes take effect. Note that I restricted the time range to 1 hour; you do not need the whole dataset to find the issue.

See: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Spark

> It gives me "application not found" :/
> Where can I submit a Spark job like you did? stat1007? And how can I download my patch to test it?
> (If I can test it on my own, I will make sure it works before I bug you.)

Sweeping around gerrit, found this discussion. Yes, stat1007 or stat1004 can both run the spark job as Nuria points out. I'm doing this now for a different task, @Ladsgroup so if you want to follow up with me on IRC, I can walk you through any issues.

Thanks. My biggest problem right now is that I don't know how to build a fully functional jar file. The jar files I build give errors when I run them. (This is my first interaction with Java, sorry if I'm really stupid.)

It's not you, it's Java. But I can't help without details. Ping me on IRC; I'm very behind on my Phabricator pings, as you can see.

Hi @Ladsgroup - I'm extremely sorry for not having taken the time to answer you faster :(
I've quickly tested your patch and it seems to work.
I have run it on fawiki and frwiki to compare the proportions of link vs other-* link types:

wiki_db    link       other-*
frwiki     2246015    2309938
fawiki     437332     438971

It looks super good :)
Merging for a deploy next week.

Change 472700 merged by jenkins-bot:
[analytics/refinery/source@master] ClickstreamBuilder: Decode refferer url to utf-8

https://gerrit.wikimedia.org/r/472700

Thanks @JAllemandou, and sorry @Ladsgroup that I totally missed in my tests that your patch fixed the problem. @JAllemandou, can you please update the docs and document the fix? https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream

Nuria set the point value for this task to 3.