
Virtual pageview refine should not refine data that does not come from wikimedia domains
Closed, ResolvedPublic8 Estimated Story Points

Description

Virtual pageview refine should not include non-Wikimedia domains as valid projects.

See the webhost stats below for one day, for webhosts with more than 100 hits (note the garbage values too; some of them are in the hundreds):

27888400 en.wikipedia.org
6614810 ru.wikipedia.org
4018171 es.wikipedia.org
3702105 ja.wikipedia.org
3431340 de.wikipedia.org
3351771 fr.wikipedia.org
1917446 it.wikipedia.org
1452173 zh.wikipedia.org
1416752 pt.wikipedia.org
1029820 pl.wikipedia.org
547260 nl.wikipedia.org
344682 fa.wikipedia.org
290909 cs.wikipedia.org
275475 ar.wikipedia.org
215454 sv.wikipedia.org
198683 he.wikipedia.org
193534 hu.wikipedia.org
180617 vi.wikipedia.org
159108 uk.wikipedia.org
151875 th.wikipedia.org
140160 ko.wikipedia.org
116467 fi.wikipedia.org
95148 el.wikipedia.org
89299 id.wikipedia.org
81580 sr.wikipedia.org
80440 bg.wikipedia.org
75298 no.wikipedia.org
70043 ro.wikipedia.org
65007 da.wikipedia.org
55964 ca.wikipedia.org
42713 hr.wikipedia.org
34634 sk.wikipedia.org
33133 tr.wikipedia.org
32650 simple.wikipedia.org
25516 lt.wikipedia.org
22305 hi.wikipedia.org
21013 sl.wikipedia.org
18333 ka.wikipedia.org
16117 ms.wikipedia.org
15336 et.wikipedia.org
14037 az.wikipedia.org
10626 sh.wikipedia.org
9051 lv.wikipedia.org
9045 ta.wikipedia.org
9039 hy.wikipedia.org
8575 bn.wikipedia.org
7263 eu.wikipedia.org
5425 bs.wikipedia.org
4787 mk.wikipedia.org
4708 sq.wikipedia.org
4355 kk.wikipedia.org
3994 te.wikipedia.org
3859 mr.wikipedia.org
3564 arz.wikipedia.org
3338 gl.wikipedia.org
3280 ml.wikipedia.org
2333 zh-yue.wikipedia.org
1623 nn.wikipedia.org
1599 ur.wikipedia.org
1477 af.wikipedia.org
1379 is.wikipedia.org
1100 sw.wikipedia.org
1064 be.wikipedia.org
893 la.wikipedia.org
891 tl.wikipedia.org
861 kn.wikipedia.org
815 my.wikipedia.org
768 mn.wikipedia.org
762 si.wikipedia.org
672 gu.wikipedia.org
609 eo.wikipedia.org
578 zh.wiki.dieproxy.com
464 uz.wikipedia.org
440 ne.wikipedia.org
418 ru.bywiki.com
375 lb.wikipedia.org
366 an.wikipedia.org
361 ast.wikipedia.org
334 be-tarask.wikipedia.org
305 so.wikipedia.org
301 als.wikipedia.org
293 ky.wikipedia.org
285 km.wikipedia.org
282 tt.wikipedia.org
281 cy.wikipedia.org
269 tg.wikipedia.org
264 sco.wikipedia.org
215 am.wikipedia.org
202 z5h64q92x9.net
198 bar.wikipedia.org
192 ba.wikipedia.org
184 ckb.wikipedia.org
176 fy.wikipedia.org
175 en.wikipedi0.org
173 zh-classical.wikipedia.org
145 zh.bywiki.com
141 pa.wikipedia.org
141 or.wikipedia.org
138 br.wikipedia.org
121 kbd.wikipedia.org
121 wi.sxisa.org
111 oc.wikipedia.org
109 ku.wikipedia.org
105 as.wikipedia.org
105 scn.wikipedia.org
105 ga.wikipedia.org
102 ceb.wikipedia.org

Event Timeline

Nuria renamed this task from Virtual pageview refine to Virtual pageview refine should not refine smapy domains.Jun 22 2018, 4:13 PM
Nuria updated the task description. (Show Details)
Nuria renamed this task from Virtual pageview refine should not refine smapy domains to Virtual pageview refine should not refine data that does not come from wikimedia domains.Jun 22 2018, 4:14 PM

(Context: T196904#4303671 )

How do we handle this again in the webrequest refinement for normal pageviews (cf. T188804 ) - are such requests dropped there too?

Equivalent requests do not exist in the pageview pipeline, because these come from other sites running our EventLogging code as-is and reporting their page previews to us. We do not receive the equivalent server-side pageviews from those sites.
Now, some guards that the pageview pipeline has for other purposes - like the project whitelist - can be reused here to discount spammy traffic:
https://github.com/wikimedia/analytics-refinery/blob/master/static_data/pageview/whitelist/whitelist.tsv

Vvjjkkii renamed this task from Virtual pageview refine should not refine data that does not come from wikimedia domains to 4faaaaaaaa.Jul 1 2018, 1:02 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
JJMC89 renamed this task from 4faaaaaaaa to Virtual pageview refine should not refine data that does not come from wikimedia domains.Jul 1 2018, 4:38 AM
JJMC89 raised the priority of this task from High to Needs Triage.
JJMC89 updated the task description. (Show Details)
fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.
CommunityTechBot raised the priority of this task from High to Needs Triage.Jul 5 2018, 7:04 PM
Nuria set the point value for this task to 5.Jul 16 2018, 6:53 PM

Ok, digging through the data a bit, I think the best approach would be to join the virtualpageview event table with wmf.pageview_whitelist in Hive when performing the virtualpageview query. The only caveat I see is that the authorised wikis in the whitelist are listed without the TLD (e.g. es.wikipedia instead of es.wikipedia.org), so in my change I'm assuming that all Wikimedia project hosts end in ".org". I've run the following query to check that the assumption is correct:

select distinct webhost from event.virtualpageview where not (webhost like '%.org%') and year=2018 and month =6 and day=15;
et.bywiki.com
cnwk.xsec.top
zh-wiki.eriri.ml
0s.oj2q.o5uww2lqmvsgsyjon5zgo.nblz.ru
ru.bywiki.com
0s.mvxa.o5uww2lqmvsgsyjon5zgo.cmle.ru
0s.oj2q.o5uww2lqmvsgsyjon5zgo.nblu.ru
0s.oj2q.o5uww2lqmvsgsyjon5zgo.blaim.ru
en.wiki.dieproxy.com
wk.mekaku.com
wiki.4o4.click
de.bywiki.com
uk.bywiki.com
0s.ovvq.o5uww2lqmvsgsyjon5zgo.nblz.ru
lt.bywiki.com
en-wiki.issizler.club
ja.bywiki.com
zh.wiki.dieproxy.com
ja.wiki.dieproxy.com
wiki.kfd.me
zh.100ke.info
0s.oj2q.o5uww2lqmvsgsyjon5zgo.cmle.ru
ko.wiki.dieproxy.com
zh.bywiki.com
en.bywiki.com
z5h64q92x9.net
speechpanel.readspeaker.com
wikidemo.micro.raiden.network
www.anyproxy.top
kk.bywiki.com

So we can see that no valid wikis are selected. Adding a JOIN statement like this:

JOIN wmf.pageview_whitelist whitelist
    ON (regexp_replace(dv.webhost, ".org", "") = whitelist.authorized_value)

That should get rid of all the unwanted sites. @JAllemandou, let me know if you think this is the right approach. Since webhost comes in the EventLogging capsule, we could do this filtering at the EventLogging level, but I think it's better to keep it in refinery.
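For reference, the join semantics can be sketched in Python (the whitelist excerpt below is a hypothetical sample; the real list lives in refinery's static_data, and the actual filtering runs as a Hive join):

```python
import re

# Hypothetical excerpt of authorized_value entries from
# wmf.pageview_whitelist; the real list is in refinery's static_data.
WHITELIST = {"en.wikipedia", "es.wikipedia", "ru.wikipedia"}

def is_whitelisted(webhost: str) -> bool:
    # Mirrors the join condition above:
    # regexp_replace(dv.webhost, ".org", "") = whitelist.authorized_value
    return re.sub(r".org", "", webhost) in WHITELIST

hosts = ["es.wikipedia.org", "ru.bywiki.com", "en.wikipedi0.org"]
print([h for h in hosts if is_whitelisted(h)])  # ['es.wikipedia.org']
```

Both the spoofed TLD (en.wikipedi0.org) and the non-.org proxies fail the whitelist lookup, so only genuine project hosts survive the join.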

Change 447665 had a related patch set uploaded (by Fdans; owner: Fdans):
[analytics/refinery@master] Filter out unwanted wikis from wmf.virtualpageview_hourly

https://gerrit.wikimedia.org/r/447665

Sounds good to me, maybe with a .org$ to match only at end-of-string (and make the regexp parser's life easier).

Change 447665 merged by Nuria:
[analytics/refinery@master] Filter out unwanted wikis from wmf.virtualpageview_hourly

https://gerrit.wikimedia.org/r/447665

fdans changed the point value for this task from 5 to 8.Aug 16 2018, 9:05 PM

Also these two (which I don't see mentioned in the comment/list from Tue, Jul 24, 12:10 PM):
175 en.wikipedi0.org
121 wi.sxisa.org

Resolving after confirming that the "spammy" domains are not present in the data for the 21st.