Page MenuHomePhabricator

Analyze referrer traffic to determine report format {hawk} [8 pts]
Closed, ResolvedPublic

Description

As product manager, I want to know the proportion of pageviews coming from 'X' search engine. To do this, I need a referrer column in the table. See T108895

Scope: we only need the top 20 or even 100 referrer domains [and sometimes subdomain], not all possible referrers. Also, this list does not need to be updated frequently.

  • Analyze top 1000 including domain and subdomain, and decide on format of output data
  • output data:
    • top level domain?
    • top 1000?

Event Timeline

kevinator raised the priority of this task from to High.
kevinator updated the task description. (Show Details)
kevinator moved this task to Incoming on the Analytics-Backlog board.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 12 2015, 10:39 PM
kevinator set Security to None.
DarTar moved this task from Staged to Radar on the Research board.
kevinator renamed this task from Add referrer to pageviews_hourly to Analyze referrer traffic to determine report format {hawk} [8 pts].Aug 17 2015, 4:30 PM
kevinator updated the task description. (Show Details)
kevinator moved this task from Prioritized to Tasked on the Analytics-Backlog board.
Milimetric edited projects, added Analytics-Kanban; removed Analytics-Backlog.
Milimetric moved this task from Next Up to In Progress on the Analytics-Kanban board.
ggellerman moved this task from In Code Review to In Progress on the Analytics-Kanban board.

I picked an hour I'm not going to disclose on purpose, and I did (notice the 0.01% sampling)

select referer, count(*)
  from wmf.webrequest TABLESAMPLE (0.01 PERCENT) w
 where year, month, day, hour, etc,
 group by referer

Some results from the analysis:

  • There were a little over 2 million log lines in the sample
  • There were 3277 distinct referer HOSTS
  • There were 612060 distinct referer paths. But some of the top referers, like www.google.com, didn't have a lot of distinct paths
  • Everything ran relatively fast, the dump of the hour of data was a couple of minutes, the python to run through it was 30 seconds or so
  • The first listing below is just the hostname as parsed out of the referer url
  • I think one interesting report would be grouped by the interesting part of the domain name, second listing below

First listing, just the hostname as parsed out of the referer url

en.m.wikipedia.org	741393
			438673
ja.m.wikipedia.org	117075
es.m.wikipedia.org	79691
ru.m.wikipedia.org	74862
de.m.wikipedia.org	62122
www.google.com		54847
fr.m.wikipedia.org	45334
it.m.wikipedia.org	42624
zh.m.wikipedia.org	28165
pt.m.wikipedia.org	22976
www.google.co.in	17795
ar.m.wikipedia.org	13402
id.m.wikipedia.org	12839
www.google.co.uk	12146
pl.m.wikipedia.org	12036
tr.m.wikipedia.org	11978
nl.m.wikipedia.org	11716
sv.m.wikipedia.org	10726
www.google.co.jp	10422
www.bing.com		9699
www.google.it		8417
www.google.de		8364
ko.m.wikipedia.org	7970
www.google.fr		7340
fi.m.wikipedia.org	6375
www.google.es		6258
fa.m.wikipedia.org	6229
www.google.ru		5670
th.m.wikipedia.org	5640
www.google.com.br	5315
search.yahoo.co.jp	4794
bits.wikimedia.org	4462
vi.m.wikipedia.org	4291
commons.m.wikimedia.org	4145
he.m.wikipedia.org	4093
www.google.nl		3797
www.google.com.mx	3783
www3.skypicker.com	3657
www.google.ca		3638
www.j0zf.com		3554
www.seznam.cz		3399
hi.m.wikipedia.org	3381
www.google.co.za	3330
hu.m.wikipedia.org	3273
no.m.wikipedia.org	3162
assets-cdn.github.com	3112
www.google.com.tr	2735
en.m.wiktionary.org	2559
www.google.pl		2453
cs.m.wikipedia.org	2422
www.google.co.id	2412
da.m.wikipedia.org	2412
www.google.com.co	2411
www.google.com.tw	2398
www.google.se		2322
www.google.com.ar	2155
r.search.yahoo.com	2103
yandex.ru		1858
ro.m.wikipedia.org	1755
uk.m.wikipedia.org	1659
www.google.com.ua	1607
static.lulokilaqew.com	1508
www.google.co.th	1478
static.lymacydykik.com	1477
static.deqasetasyk.com	1461
static.paguboxekef.com	1459
www.google.com.ph	1446
static.dabyfytuvot.com	1423
www.google.fi		1412
el.m.wikipedia.org	1410
www.google.com.hk	1331
www.google.cl		1316
www.google.ch		1302
www.google.co.il	1299
www.google.com.pk	1293
bn.m.wikipedia.org	1268
www.wikipedia.org	1202
bg.m.wikipedia.org	1172
www.google.com.ng	1160
www.google.at		1146
www.google.com.sg	1093
search.yahoo.com	1081
127.0.0.1:8001		1070
fr.m.wiktionary.org	1048
www.google.com.gh	1026
m.search.naver.com	1010
en.wikipedia.org	981
ru.m.wiktionary.org	969
hr.m.wikipedia.org	960
www.google.com.au	956
www.google.be		877
etherpad.wikimedia.org	866
www.reddit.com		757
www.google.ie		739
www.google.no		737
sr.m.wikipedia.org	736
www.google.com.sa	736
www.google.com.pe	731
simple.m.wikipedia.org	724

Second listing, the hostname is stripped of .m|.org|.com*|.co.*|www.

en.wikipedia	742374
		438673
google		139797
ja.wikipedia	117134
es.wikipedia	79745
ru.wikipedia	75015
de.wikipedia	62458
fr.wikipedia	45381
it.wikipedia	42679
zh.wikipedia	28247
pt.wikipedia	22995
ar.wikipedia	13428
id.wikipedia	12853
pl.wikipedia	12054
tr.wikipedia	11990
nl.wikipedia	11745
sv.wikipedia	10728
bing		9699
google.it	8417
google.de	8364
ko.wikipedia	7972
google.fr	7340
fi.wikipedia	6380
google.es	6258
fa.wikipedia	6232
search.yahoo	5875
google.ru	5670
th.wikipedia	5645
bits.wikimedia	4462
commons.wikimedia	4341
vi.wikipedia	4299
he.wikipedia	4105
google.nl	3797
www3.skypicker	3657
google.ca	3638
j0zf	3554
seznam.cz	3399
hi.wikipedia	3382
hu.wikipedia	3279
no.wikipedia	3164
assets-cdn.github	3112
en.wiktionary	2597
google.pl	2453
cs.wikipedia	2422
da.wikipedia	2412
google.se	2322
r.search.yahoo	2103
yandex.ru	1867
ro.wikipedia	1755
uk.wikipedia	1662
static.lulokilaqew	1508
static.lymacydykik	1477
static.deqasetasyk	1461
static.paguboxekef	1459
static.dabyfytuvot	1423
google.fi	1412
el.wikipedia	1410
google.cl	1316
google.ch	1302
bn.wikipedia	1268
wikipedia	1202
bg.wikipedia	1172
google.at	1146
127.0.0.1:8001	1070
fr.wiktionary	1049
m.search.naver	1010
ru.wiktionary	969
hr.wikipedia	960
google.be	877
etherpad.wikimedia	866
reddit	757
sr.wikipedia	741
google.ie	739
google.no	737
simple.wikipedia	724
google.dk	699
googleweblight	617
google.gr	609
google.hu	606
google.kz	601
google.ro	587
translate.googleusercontent	576
sk.wikipedia	550
ca.wikipedia	545
google.ae	542
m.facebook	535
ta.wikipedia	530
google.pt	487
search.smt.docomo.ne.jp	475
m.wikimediafoundation	469
google.hr	454
en.wikiquote	444
google.cz	444
en.wikibooks	439
google.bg	404
lt.wikipedia	394
de.wiktionary	369
google.rs	366
googleu		360
google.by	358

@Milimetric can you run using a whole day instead of an hour? I'm surprised to see Japan so high and I'm left wondering if this was done at an hour when Japan is awake and Europe and America are asleep.

I suggest we ping Toby/Jon and maybe Lisa/Sheree to understand what kind of data would be most useful to them. For example, I wonder if people care more about top *global* referrer or top referrers by country or language.

Ok, I used two days and downsampled more, this time 0.001 percent. Here are the results:

  • There were a little over 3.3 million log lines in the sample
  • There were 10900 distinct referer HOSTS
  • The hosts that appear in the top 100 are totally different in ranking, good call Kevin. But they're almost too different. And when I get the count(distinct referer) it's only 254075. This is lower than last sample, which was 0.01 percent but over only an hour. So... color me confused. This is why I'm not an analyst. I think maybe the analytics engineering team could really use an analyst...

Here are the top 100 referer hosts:

(none)			2283300
bits.wikimedia.org	149764
www3.skypicker.com	123455
www.j0zf.com		123296
www.google.com		120606
www.seznam.cz		112158
assets-cdn.github.com	106222
static.paguboxekef.com	39686
static.lulokilaqew.com	39339
static.pelulasylex.com	19844
static.mimygasoseb.com	19732
static.lymacydykik.com	19707
static.nutufisunov.com	19679
static.deqasetasyk.com	19645
static.dabyfytuvot.com	19572
wkp.fresheye.com	14900
etherpad.wikimedia.org	10051
www.baka-tsuki.org	8369
us.wow.com		6031
wiki.cyanogenmod.org	4417
us.when.com		4108
de.wikipedia.org	3420
nonciclopedia.wikia.com	2957
www.digplanet.com	2784
metabattle.com		2329
baike.chachaba.com	2209
wiki.openstack.org	1840
wiki.openstreetmap.org	1809
ejje.weblio.jp		1514
neolaki.net		1463
kodi.wiki		1398
inciclopedia.wikia.com	1342
o.aolcdn.com		1321
traditio.wiki		1271
ftb.gamepedia.com	1152
userbase.kde.org	1056
library.kiwix.org	1044
tcrf.net		976
docs.joomla.org		974
util.xat.com		902
howto.corel.com		868
aovivonatv.com		858
dan.wikitrans.net	798
www.cubeworldwiki.net	739
ja.wikipedia.org	721
wiki.waze.com		652
wpedia.goo.ne.jp	628
www.insidethex.co.uk	601
phabricator.wikimedia.org	591
en.wikipedia.org	566
www.progrentis.com	552
zh.wikipedia.org	527
vi.vipr.ebaydesc.com	474
he.wikipedia.org	435
mentor.progrentis.com	432
vikipediya.uz.cm	430
www.litmir.co		426
allthetropes.orain.org	420
www.weblio.jp		409
www.huatuo.org		394
wikis.zum.de		374
www.freecadweb.org	372
hue.wikimedia.org	365
corruptionofchampions.orain.org	364
cuttingsarchive.org	353
arabianwiki.in		353
ms.wikipedia.org	346
www.cactusclubcafe.com	341
www.pindula.co.zw	337
nzb.su			321
wiki.nitrado.net	315
translate.googleusercontent.com	293
solowej-slowik.ru	287
www.questpedia.org	281
www.sexuil.com		277
psychonautwiki.org	274
www.nzb.su		271
blog.wikimedia.org	269
epo.wikitrans.net	264
gadget.artega.biz	262
content.wow.com	250
www.clubfetichista.com	250
tessera.wikimedia.org	247
www.electrical-installation.org	242
cache.baiducontent.com	240
doc.wikimedia.org	236
wiki.talonro.com	229
www.omegawiki.org	226
hndeati.com		225
ethw.org		218
www.bing.com		216
integration.wikimedia.org	206
www.de.wikitotal.info	199
stats.wikimedia.org	199
dvscr.ru		197
grafana.wikimedia.org	187
www.stevemorse.org	181
wt.blog.jp		176
wiki.simplemachines.org	163
annual.wikimedia.org	159

Also, super weird data, like one of the top referers is http://www.j0zf.com/includes/plugins/blogs/images/yahoo_favicon.png

What the hell is going on? Looking through this I feel like we totally don't understand how people use our content, from both programatic / api points of view and repackaging with different interfaces.

This next one is sampled 0.5 % but filtered to only look at pageviews.

  • 8,162,386 lines analyzed
  • 10348 unique referer hosts
  • 715027 unique referer paths
  • What I get from this is that our pageviews and the stuff that we care about have sensical referers. The rest of our webrequests, and also the vast majority, seems to come from some sources that we would do well to investigate. 15 minutes might save us 15% of our traffic :)

Top 100:

(none)			2470972
www.google.com		1228404
en.m.wikipedia.org	1182142
www.google.co.jp	266107
ja.m.wikipedia.org	216533
search.yahoo.co.jp	174462
www.google.com.br	149045
www.google.co.uk	145868
es.m.wikipedia.org	129756
www.google.ca		128314
www.google.co.in	103310
www.google.com.mx	90398
www.google.de		79154
www.google.it		78884
www.google.com.au	71739
www.google.fr		71716
www.bing.com		69983
www.google.es		60830
de.m.wikipedia.org	56490
www.google.co.id	54906
fr.m.wikipedia.org	53102
it.m.wikipedia.org	51278
ru.m.wikipedia.org	50139
www.google.com.ar	47672
r.search.yahoo.com	47437
www.google.com.co	46748
pt.m.wikipedia.org	45929
www.google.ru		36325
search.yahoo.com	35798
m.search.naver.com	32772
zh.m.wikipedia.org	31624
www.google.cl		31347
www.google.com.tw	26417
www.google.com.tr	26184
www.google.com.ph	24591
ar.m.wikipedia.org	22131
www.wikipedia.org	22001
www.google.se		20196
www.reddit.com		18980
search.smt.docomo.ne.jp	18733
www.google.nl		18452
www.google.co.th	18346
www.google.pl		17056
www.google.com.sa	16230
yandex.ru		15291
www.google.com.sg	15114
www.google.co.ve	14561
www.google.com.hk	14149
id.m.wikipedia.org	14091
www.google.com.pe	14037
www.google.ie		13200
www.google.ch		12218
www.google.com.eg	10962
tr.m.wikipedia.org	10396
m.facebook.com		10299
www.google.co.nz	10109
ko.m.wikipedia.org	9890
www.google.no		9617
www.google.pt		9565
nl.m.wikipedia.org	9491
pl.m.wikipedia.org	9436
www.google.co.il	9241
commons.m.wikimedia.org	9062
www.google.at		8660
www.google.co.kr	8569
www.google.be		8320
www.google.fi		8314
sv.m.wikipedia.org	8097
www.google.com.ua	7885
www.google.com.ec	6963
www.google.dk		6951
googleweblight.com	6782
sp-search.auone.jp	6707
www.google.com.do	6573
www.google.com.gt	6322
www.google.com.uy	6104
www.google.co.za	6094
www.google.gr		6006
www.google.com.pk	5611
www.google.ae		5538
www.google.com.bo	5272
www.google.com.pr	5150
fa.m.wikipedia.org	5015
duckduckgo.com		4901
www.google.com.ng	4724
www.google.com.vn	4672
www.google.com.py	4233
fi.m.wikipedia.org	4194
www.google.ro		4022
th.m.wikipedia.org	3978
en.wikipedia.org	3917
he.m.wikipedia.org	3823
www.google.hu		3803
www.google.dz		3753
www.google.hr		3568
vi.m.wikipedia.org	3521
m.search.daum.net	3490
www.google.rs		3385
www.google.cz		2744
en.m.wiktionary.org	2736

New tasks created to track the proposed solution: T112284

JAllemandou moved this task from Paused to Done on the Analytics-Kanban board.Sep 11 2015, 4:28 PM

Should I then close this one, @kevinator?

kevinator added a comment.EditedSep 11 2015, 6:09 PM

@Milimetric I see it is in the done column, I'll close it ;-)

kevinator closed this task as Resolved.Sep 15 2015, 3:29 PM