Topviews Analysis of the Hungarian Wikipedia is flooded with spam
Closed, Resolved (Public)

Description

Since October, due to some kind of spam, sex- and narcotics-related articles have been the most viewed articles on the Hungarian Wikipedia. For background, here you can read about the scandal.

Today as an example:
Home page 28k+
Cannabis 13k+
Oral sex 12k+

As you can see, the view counts for these articles are constant over time and almost identical to each other.


https://tools.wmflabs.org/pageviews/?project=hu.wikipedia.org&platform=all-access&agent=user&start=2019-10-01&end=2019-11-02&pages=Or%C3%A1lis_szex|Metil%C3%A9ndioxi-metamfetamin|Kannabisz|Kokain|Szifilisz|H%C3%ADmvessz%C5%91%7CAn%C3%A1lis_szex|LSD|Hepatitis_C|Kank%C3%B3
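For anyone who wants the raw numbers behind this chart, the same per-article series should also be available from the Wikimedia Pageviews REST API. The sketch below only builds the request URLs and does not call the network. The helper name `per_article_url` is made up for illustration, and the endpoint layout is as I understand it from the public API documentation, so please double-check it before relying on it.

```python
from urllib.parse import quote

API_ROOT = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def per_article_url(project, title, start, end,
                    access="all-access", agent="user", granularity="daily"):
    """Build a per-article pageviews URL; start and end are YYYYMMDD strings."""
    # Titles use underscores instead of spaces and must be percent-encoded.
    encoded = quote(title.replace(" ", "_"), safe="")
    return "/".join([API_ROOT, project, access, agent, encoded,
                     granularity, start, end])


# A few of the titles from the chart linked above.
titles = ["Orális_szex", "Kannabisz", "Kokain"]
urls = [per_article_url("hu.wikipedia.org", t, "20191001", "20191102")
        for t in titles]
for url in urls:
    print(url)
```

Setting `agent="user"` matches the `agent=user` filter in the chart URL, which is why traffic that is not recognised as a bot still shows up in these numbers.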

Event Timeline

This is a bot: the request counts show a symmetric pattern per user agent (I only looked at the Orális_szex page).

+------+-------+---------------------+--------------------------------------------------------------+
| c    | ip    | geocoded_data[city] | ua                                                           |
+------+-------+---------------------+--------------------------------------------------------------+
| 309  | 167.x | Frankfurt am Main   | Mozilla/5.0 (Linux; Android 7.0; SM-G930V Build/NRD90M) Appl |
| 300  | 167.x | Frankfurt am Main   | Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) App |
| 275  | 167.x | Frankfurt am Main   | Mozilla/5.0 (Linux; U; Android 4.4.2; en-us; SCH-I535 Build/ |
| 267  | 167.x | Frankfurt am Main   | Mozilla/5.0 (Android 7.0; Mobile; rv:54.0) Gecko/54.0 Firefo |
| 1661 | 167.x | Frankfurt am Main   | Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/201 |
| 1641 | 167.x | Frankfurt am Main   | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 |
| 159  | 167.x | Frankfurt am Main   | Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/201 |
| 144  | 167.x | Frankfurt am Main   | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 |
| 102  | 167.x | Frankfurt am Main   | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 |
| 86   | 167.x | Frankfurt am Main   | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 |
+------+-------+---------------------+--------------------------------------------------------------+

The numbers above cover just one hour.
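For reference, the kind of breakdown shown above can be sketched in a few lines: count one page's requests per (ip, city, ua) and list the heaviest combinations first. A single IP cycling through many user agents with similar counts is the tell. This is a minimal illustration on in-memory records and not the query that was actually run; the function name `top_clients`, the record keys and the sample values are all assumptions.

```python
from collections import Counter


def top_clients(requests, page_title, limit=10):
    """Count requests for one page per (ip, city, ua), most frequent first."""
    counts = Counter(
        (r["ip"], r["city"], r["ua"])
        for r in requests
        if r["page_title"] == page_title
    )
    return [(c, ip, city, ua)
            for (ip, city, ua), c in counts.most_common(limit)]


# Tiny made-up sample; real input would be one hour of request logs.
sample = [
    {"ip": "167.x", "city": "Frankfurt am Main", "ua": "UA-A", "page_title": "Orális_szex"},
    {"ip": "167.x", "city": "Frankfurt am Main", "ua": "UA-A", "page_title": "Orális_szex"},
    {"ip": "167.x", "city": "Frankfurt am Main", "ua": "UA-B", "page_title": "Orális_szex"},
    {"ip": "10.x", "city": "Budapest", "ua": "UA-C", "page_title": "Kokain"},
]
rows = top_clients(sample, "Orális_szex")
for c, ip, city, ua in rows:
    print(c, ip, city, ua)
```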

Also, looking at one day of pageviews from this IP, these are the titles requested:

+-------+------------------------+
| c     | page_title             |
+-------+------------------------+
| 35370 | 6521_Pina              |
| 34046 | FASZ_Pirszósz_Grevenón |
| 32473 | Hüvely                 |
| 31998 | Hímvessző              |
| 29105 | Pina_(folyó)           |
| 28810 | Pina_(település)       |
| 28529 | Pina_(film)            |
| 26318 | Anális_szex            |
| 25991 | Orális_szex            |
| 25752 | Ondó                   |
+-------+------------------------+

After running the hu.wikipedia data through bot spike detection, the top list for 2019/10/16 looks like the following. Most rogue pages (marked in red) disappear; note that for a few pages about 80% of the traffic is bot traffic.
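To make the "about 80% bot" figure concrete, here is a rough sketch of how a top list can be recomputed once traffic has been split into user and automated views: compute the automated share per page and rank by the remaining user views. It does not reproduce the detection itself. The function name `rerank_without_bots`, the page names and the counts are invented for illustration.

```python
def rerank_without_bots(page_counts):
    """page_counts maps title -> (user_views, automated_views).

    Returns (title, user_views, bot_share) tuples sorted by user_views,
    highest first.
    """
    rows = []
    for title, (user_views, automated_views) in page_counts.items():
        total = user_views + automated_views
        bot_share = automated_views / total if total else 0.0
        rows.append((title, user_views, bot_share))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up counts, chosen so that one page is 80% automated.
counts = {
    "Page_A": (2000, 8000),
    "Page_B": (5000, 0),
    "Page_C": (3000, 500),
}
ranking = rerank_without_bots(counts)
for title, user_views, bot_share in ranking:
    print(title, user_views, round(bot_share, 2))
```

In this toy input Page_A would lead a raw ranking with 10000 total views, but drops to last place once only user views are counted.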

Pinging Product-Analytics here so they are aware that the effect of bots on "small" sites like this one can be dramatic.

@Nuria Thanks for the details! Is there anything further we can do?

@Bencemac not for now. @JAllemandou and I are working out this quarter how best to deploy our bot spike detection algorithms; when we have more news we will send an update.

Ottomata triaged this task as High priority.
Ottomata moved this task from Incoming to Data Quality on the Analytics board.

Since the 19th of October, the flood has stopped (17th and 18th of October).

Removing Tool-Pageviews as this is an issue with the underlying data, not the tool itself.

Fortunately the flood has stopped, but the list of the most viewed articles of 2019 is heavily occupied by these pages (5th, 7th, 8th, 10th–16th, etc.). Would a disclaimer be possible and useful until they are gone? @MusikAnimal

Update on this: we have deployed our bot identification code to the pageview pipeline and it is running in shadow mode (meaning end users do not yet see the results of the classification).
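For readers unfamiliar with the term, shadow mode can be sketched roughly like this: the new classifier labels every request, but the label goes into a separate field that the published counts do not read yet, so the two can be compared before switching over. This is an illustration of the general idea and not the pipeline code; the field names `agent_type` and `shadow_agent_type`, the `requests_last_hour` signal and the threshold are all assumptions.

```python
def label_request(request, is_automated):
    """Attach the new label without changing what is published."""
    labelled = dict(request)
    # Shadow field: written for evaluation, not read by public reports.
    labelled["shadow_agent_type"] = (
        "automated" if is_automated(request) else "user"
    )
    return labelled


def published_user_views(requests):
    """Public count: still based only on the existing agent_type field."""
    return sum(1 for r in requests if r["agent_type"] == "user")


def shadow_user_views(requests):
    """What the count would be if the new label were live."""
    return sum(1 for r in requests if r["shadow_agent_type"] == "user")


def too_many_requests(request):
    """Hypothetical rule, for illustration only."""
    return request["requests_last_hour"] > 1000


raw = [
    {"agent_type": "user", "requests_last_hour": 3},
    {"agent_type": "user", "requests_last_hour": 35000},
]
labelled = [label_request(r, too_many_requests) for r in raw]
print(published_user_views(labelled), shadow_user_views(labelled))
```

Comparing the two counts over time gives a sense of how much the published numbers would change before anything is switched on for end users.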

Excited to see. :)

The bot detection has been deployed, see: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/BotDetection

Closing this ticket as there are no remaining action items.