
Don't accept data from automated bots in Event Logging
Closed, Resolved · Public · 8 Story Points

Description

For example, from google:

My guess is this might be somehow related to framing.

However, it uses:

webHost : window.location.hostname

which in theory should work even in a framing scenario, unless the context of the JavaScript was somehow lost. On http://jsfiddle.net/596mX/ the top web host is http://jsfiddle.net, the frame host is http://fiddle.jshell.net, and it alerts the latter's hostname.


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=55449

Details

Reference
bz65508

Event Timeline

bzimport raised the priority of this task from to Needs Triage.
bzimport set Reference to bz65508.
bzimport added a subscriber: Unknown Object (MLST).

Thanks for bringing this up and for linking the other ticket. We should take this issue very seriously: people crunching the data will rarely remember to filter by webHost (assuming that filtering by wiki is sufficient), so their results will include a lot of bogus events from test instances on labs, as well as legitimate but spurious events from users with proxies or other factors.

I think we should try and enforce stricter validation for the webHost field to only accept events from a list of known hostnames and set up monitoring of events that fail validation on the webHost field.
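The stricter webHost validation proposed above could be sketched roughly as follows. This is a minimal illustration, not the EventLogging implementation: the allow-list pattern, the function name `validate_web_host`, and the `rejected_counts` tally used for monitoring are all assumptions made for the example.

```python
import re

# Hypothetical allow-list; the real set of accepted hostnames is not given in this task.
KNOWN_HOST_PATTERN = re.compile(
    r"^([a-z0-9-]+\.)*(wikipedia|wiktionary|wikimedia|mediawiki)\.org$"
)


def validate_web_host(event, rejected_counts):
    """Return True if the event's webHost matches the allow-list.

    Hostnames that fail validation are tallied in rejected_counts so that
    they can be monitored instead of being silently dropped.
    """
    host = str(event.get("webHost", "")).lower()
    if KNOWN_HOST_PATTERN.match(host):
        return True
    rejected_counts[host] = rejected_counts.get(host, 0) + 1
    return False


rejected = {}
events = [
    {"webHost": "en.wikipedia.org"},
    {"webHost": "translate.googleusercontent.com"},
    {"webHost": "en.m.wikipedia.org"},
]
accepted = [e for e in events if validate_web_host(e, rejected)]
```

Here the event carrying translate.googleusercontent.com is rejected and counted, while the two wikipedia.org events pass.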

Steven also suggested it could be related to Chrome's automatic translation feature.

My tests indicate that it still uses the original hostname in Chromium 34.0.1847.132 Debian 7.5 (265804), but this behavior might vary by version or other factors.

kevinator set Security to None.
kevinator triaged this task as High priority.
Milimetric closed this task as Declined. Dec 14 2015, 5:44 PM
Milimetric claimed this task.
Milimetric added a subscriber: Milimetric.

We don't really know what's going on here. Please update if this is still an issue.

Milimetric renamed this task from "translate.googleusercontent.com in webHost for some client-side events" to "Don't accept data from automated bots in Event Logging". Dec 14 2015, 5:46 PM
Milimetric updated the task description.
Milimetric reopened this task as Open.
Restricted Application added a subscriber: Aklapper. Dec 14 2015, 5:46 PM
Milimetric updated the task description. Dec 14 2015, 5:46 PM
Milimetric moved this task from Incoming to Backlog on the Analytics-Engineering board.
Restricted Application added a subscriber: TerraCodes. Oct 10 2016, 3:57 PM
Nuria removed Milimetric as the assignee of this task. Feb 9 2017, 4:59 PM
Nuria lowered the priority of this task from High to Low. Mar 13 2017, 4:00 PM
Nuria moved this task from Backlog (Later) to Wikistats Production on the Analytics board.
Nuria raised the priority of this task from Low to Normal. Apr 3 2017, 4:27 PM
Nuria raised the priority of this task from Normal to High. Apr 17 2017, 3:58 PM
Nuria added a comment. (Edited) Apr 17 2017, 4:04 PM

We can do this work now that we are parsing the user agent for incoming EventLogging data.

We need to add a "self-identified" bot filter for all incoming data. We just need to use the same regex that the pageview code uses: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/Webrequest.java#L59

It is unfortunate that this regex would need to be duplicated for both EL and pageview data.

This would take care of events that are sent by, say, a Google bot crawling the Android application, but it will not take care of other issues that have to do with JavaScript. The premise of the ticket is not very clear in this regard.
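A "self-identified" bot check of this kind can be sketched in a few lines. The pattern below is only an illustrative stand-in and is not the regex from Webrequest.java, which is not reproduced in this task; the function name `is_self_identified_bot` is likewise an assumption.

```python
import re

# Illustrative pattern only: NOT the regex used by the pageview code.
# It flags user agents that contain common bot markers or a contact URL.
SELF_IDENTIFIED_BOT = re.compile(r"bot|spider|crawler|https?://", re.IGNORECASE)


def is_self_identified_bot(user_agent):
    """Return True if the User-Agent string announces itself as automated."""
    return bool(user_agent) and SELF_IDENTIFIED_BOT.search(user_agent) is not None


googlebot_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
browser_ua = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/34.0.1847.132 Safari/537.36"
)
```

As noted above, a check like this only catches clients that declare themselves in the User-Agent header; it does nothing for events whose webHost is wrong for JavaScript-related reasons.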

How do we keep track of bot-identified traffic?
Sending data to graphite
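As a rough sketch of what sending a bot counter to Graphite could look like: Graphite's plaintext protocol accepts lines of the form `<metric path> <value> <timestamp>`, by default on port 2003. The metric name, host, and helper names below are hypothetical, and a production setup might go through an aggregator such as statsd instead.

```python
import socket
import time


def graphite_line(metric, value, timestamp=None):
    """Format one datapoint in Graphite's plaintext protocol."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return "%s %s %d\n" % (metric, value, ts)


def send_to_graphite(line, host, port=2003, timeout=2.0):
    """Send one preformatted line to a Graphite plaintext listener."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(line.encode("ascii"))


# Hypothetical metric name, for illustration only.
line = graphite_line("eventlogging.bot_events.count", 3, timestamp=1492444800)
```

Calling `send_to_graphite(line, "graphite.example.org")` would then deliver the datapoint, assuming a listener is reachable at that (made-up) host.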

Nuria added a comment. Apr 17 2017, 4:16 PM

We can publish bot-identified events to a bot schema similar to EventError that is only pushed to Hadoop (not to the db): https://meta.wikimedia.org/wiki/Schema:EventError

Nuria assigned this task to fdans. Apr 17 2017, 4:23 PM
Nuria edited projects, added Analytics-Kanban; removed Analytics.
Nuria set the point value for this task to 8.
Nuria removed subscribers: kevinator, wikibugs-l-list.
Nuria moved this task from Next Up to In Progress on the Analytics-Kanban board. Apr 21 2017, 3:12 PM

Change 350234 had a related patch set uploaded (by Fdans):
[eventlogging@master] Flag requests sent by spiders/bots using AutomatedRequest schema

https://gerrit.wikimedia.org/r/350234

Change 350235 had a related patch set uploaded (by Fdans):
[operations/puppet@production] Add AutomatedRequest to schema black list

https://gerrit.wikimedia.org/r/350235

Change 350234 merged by Ottomata:
[eventlogging@master] Mark events as bots if they self-identify

https://gerrit.wikimedia.org/r/350234

Change 352579 had a related patch set uploaded (by Fdans; owner: Fdans):
[eventlogging@master] Add handler for event filters

https://gerrit.wikimedia.org/r/352579

Change 350235 abandoned by Fdans:
Add AutomatedRequest to schema black list

Reason:
This is no longer needed since we've changed the approach for the task

https://gerrit.wikimedia.org/r/350235

Change 352582 had a related patch set uploaded (by Fdans; owner: Fdans):
[operations/puppet@production] Add bot filter to mysql consumer

https://gerrit.wikimedia.org/r/352582

Change 352579 merged by Ottomata:
[eventlogging@master] Add handler for event filters

https://gerrit.wikimedia.org/r/352579

Change 352582 merged by Ottomata:
[operations/puppet@production] Add bot filter to mysql consumer

https://gerrit.wikimedia.org/r/352582

Change 355238 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Remove use of is_not_bot filter in eventlogging mysql until code is fixed and change is cleared (announced)

https://gerrit.wikimedia.org/r/355238

Change 355238 merged by Ottomata:
[operations/puppet@production] Remove use of is_not_bot filter in eventlogging mysql until code is fixed and change is cleared (announced)

https://gerrit.wikimedia.org/r/355238

fdans added a comment. May 23 2017, 7:19 PM

@Nuria @Tbayer is there anything we should announce before deploying this change?

@fdans: Yes, since this is going to affect the results of various queries (even though it's by improving their accuracy), people working with them should be notified. I think a quick note to Analytics-l would be justified.

Change 355482 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Use is_not_bot filter function for eventlogging mysql consumer

https://gerrit.wikimedia.org/r/355482

Change 355482 merged by Ottomata:
[operations/puppet@production] Use is_not_bot filter function for eventlogging mysql consumer

https://gerrit.wikimedia.org/r/355482

Tgr added a subscriber: Tgr. May 29 2017, 12:46 PM

How will this affect EventLogging calls made from PHP (which might need to be recorded whether the user used some bot framework or not)?

Nuria added a comment. May 29 2017, 3:15 PM

@Tgr: All calls go through varnish; there are no direct posts from PHP anymore (it has been a while), so they are all processed equally.

Nuria closed this task as Resolved. May 30 2017, 10:48 PM
Tgr reopened this task as Open. May 31 2017, 3:06 PM

@Tgr: All calls go through varnish; there are no direct posts from PHP anymore (it has been a while), so they are all processed equally.

Except that requests initiated from the backend will be rejected because the user agent is MediaWiki (or something similar). I feel this hasn't really been thought through.

Tgr added a comment. May 31 2017, 5:06 PM

Note that even if EventLogging::logEvent would forward the user agent (which it currently doesn't), filtering on that would still make no sense for schemas such as Pingback or CommandInvocation.

Tgr added a comment. May 31 2017, 5:10 PM
mysql:research@analytics-store.eqiad.wmnet [log]> select timestamp, count(*) from MediaWikiPingback_15781718 group by substr(timestamp, 1, 8) order by timestamp desc limit 50;
+----------------+----------+
| timestamp      | count(*) |
+----------------+----------+
| 20170524001855 |      125 |
| 20170523003847 |      148 |
| 20170522000909 |      140 |
| 20170521002601 |      131 |
| 20170520000349 |      118 |
| 20170519001419 |      179 |

Change 356423 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/puppet@production] Revert "Use is_not_bot filter function for eventlogging mysql consumer"

https://gerrit.wikimedia.org/r/356423

Nuria added a comment. May 31 2017, 6:11 PM

Note that even if EventLogging::logEvent would forward the user agent (which it currently doesn't)

To recap from IRC, server-side events DO forward the UA: https://github.com/wikimedia/mediawiki-extensions-EventLogging/blob/master/includes/EventLogging.php#L67

However, there are 2 schemas that do not log using this method: MWPingback and CommandLine. Of these, we think CommandLine is not used.

Nuria moved this task from Done to In Progress on the Analytics-Kanban board. May 31 2017, 8:25 PM

Change 356423 abandoned by Gergő Tisza:
Revert "Use is_not_bot filter function for eventlogging mysql consumer"

Reason:
I misdiagnosed the problem, it only affects a few schemas. There are more useful ways to handle it.

https://gerrit.wikimedia.org/r/356423

Tgr added a comment. Jun 1 2017, 8:09 AM

What about logging from the job queue (which could in theory happen for PageDeletion etc when some job creates/moves/deletes pages)? That will probably have a bot UA too. (It will probably be rare though and not sure whether the people owning the schemas would want to log it in the first place.)

Tgr added a comment. Jun 1 2017, 3:56 PM

Re IRC question: for the MediaWikiPingback schema, if the UA is not recorded, it won't be missed. All the information that could be possibly learned from it (MW version, or PHP version) is already included in the payload.

Change 356624 had a related patch set uploaded (by Fdans; owner: Fdans):
[eventlogging@master] Add is_mediawiki property to UA map

https://gerrit.wikimedia.org/r/356624

Change 356626 had a related patch set uploaded (by Fdans; owner: Fdans):
[operations/puppet@production] Add exception for events tagged as coming from MW

https://gerrit.wikimedia.org/r/356626

Change 356624 merged by Nuria:
[eventlogging@master] Add is_mediawiki property to UA map

https://gerrit.wikimedia.org/r/356624

Change 357243 had a related patch set uploaded (by Ottomata; owner: Nuria):
[eventlogging@master] Simpler parsing of user_agent to asses whether 'mediawiki' is present

https://gerrit.wikimedia.org/r/357243

Change 357243 merged by Ottomata:
[eventlogging@master] Simpler parsing of user_agent to asses whether 'mediawiki' is present

https://gerrit.wikimedia.org/r/357243

Change 356626 merged by Ottomata:
[operations/puppet@production] Add exception for events tagged as coming from MW

https://gerrit.wikimedia.org/r/356626
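The combined effect of the bot filter and the MediaWiki exception can be sketched as below. This is an illustration under assumptions, not the merged code: it assumes the parsed user agent is a map on the event carrying boolean `is_bot` and `is_mediawiki` properties, and the function name `keep_event` is made up (it plays the role that the is_not_bot filter has for the mysql consumer).

```python
def keep_event(event):
    """Decide whether an event should reach the mysql consumer.

    Assumes event["userAgent"] is a parsed map with optional boolean
    "is_bot" and "is_mediawiki" keys; missing keys are treated as False.
    """
    ua = event.get("userAgent") or {}
    # Events sent by MediaWiki itself (for example MediaWikiPingback) are
    # kept even though their user agent would otherwise look automated.
    if ua.get("is_mediawiki", False):
        return True
    return not ua.get("is_bot", False)


browser_event = {"schema": "Edit", "userAgent": {"is_bot": False}}
crawler_event = {"schema": "Edit", "userAgent": {"is_bot": True}}
pingback_event = {
    "schema": "MediaWikiPingback",
    "userAgent": {"is_bot": True, "is_mediawiki": True},
}
kept = [e["schema"] for e in (browser_event, crawler_event, pingback_event) if keep_event(e)]
```

With these sample events, the crawler event is dropped while the browser and pingback events are kept.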

Ottomata moved this task from In Code Review to Done on the Analytics-Kanban board. Jun 6 2017, 3:49 PM
Nuria added a comment. Jun 7 2017, 12:05 AM

Closing, events are present on MediaWikiPingback from 20170606231658.

Nuria closed this task as Resolved. Jun 7 2017, 12:05 AM