EventLogging requests we get from non-wiki* hostnames or apps should be filtered at refine time
Closed, Resolved · Public · 5 Story Points

Description

Keeping this brief on purpose, because WP:BEANS, but basically we should write a query that tells us:

  • of all webrequests to our EventLogging endpoints
  • how many are from hostnames that look like IP addresses
  • how many are from hostnames that match those on the sitematrix
  • how many "others" are there

That last one is the interesting one: if it's unexpectedly high, we can dig deeper to see if any of those validate. We can also dig deeper into the IP-looking ones to see if the User Agent is one of our apps.
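A rough sketch of the three-way bucketing described above, in case it helps when writing the real query. This is illustrative Python rather than the actual query: `KNOWN_WIKI_HOSTS` is a hypothetical stand-in for the sitematrix hostnames, and the function names are made up for this example.

```python
import ipaddress
from collections import Counter

# Hypothetical stand-in for the hostnames listed on the sitematrix.
KNOWN_WIKI_HOSTS = {"en.wikipedia.org", "commons.wikimedia.org", "www.wikidata.org"}


def looks_like_ip(hostname):
    """Return True if the hostname parses as an IPv4 or IPv6 address."""
    try:
        # Bracketed IPv6 literals such as "[::1]" are unwrapped first.
        ipaddress.ip_address(hostname.strip("[]"))
        return True
    except ValueError:
        return False


def classify_host(hostname, known_hosts=KNOWN_WIKI_HOSTS):
    """Bucket a hostname as 'ip', 'sitematrix', or 'other'."""
    host = hostname.strip().lower()
    if looks_like_ip(host):
        return "ip"
    if host in known_hosts:
        return "sitematrix"
    return "other"


def count_buckets(hostnames, known_hosts=KNOWN_WIKI_HOSTS):
    """Count how many hostnames fall into each bucket."""
    return Counter(classify_host(h, known_hosts) for h in hostnames)
```

Feeding it the hostname column of a sample of requests gives the per-bucket counts; the "other" bucket is the one to inspect by hand.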

Once quantified, we should remove this data from EventLogging, probably at refine time (with a filter function?).
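For illustration only, a refine-time filter could look something like the sketch below. It is plain Python over dict-shaped events, not the refinery code; the `webhost` field name and the allowlist argument are assumptions made for this example, and it simply drops anything whose hostname is not on the allowlist (including app or IP-looking traffic, which may or may not be what we want).

```python
def keep_event(event, known_hosts):
    """Return True when the event's hostname is on the allowlist.

    `webhost` is a hypothetical field name used for this sketch.
    """
    host = (event.get("webhost") or "").strip().lower()
    return host in known_hosts


def filter_events(events, known_hosts):
    """Keep only events whose hostname is on the allowlist."""
    return [e for e in events if keep_event(e, known_hosts)]
```

Events with a missing or empty hostname are dropped too, since an empty string never matches the allowlist.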

Putting this on kanban to get it done by q4.

Event Timeline

Restricted Application added a subscriber: Aklapper. Mar 27 2018, 4:03 PM
fdans triaged this task as Normal priority. Mar 29 2018, 5:00 PM
fdans moved this task from Incoming to Data Quality on the Analytics board.
fdans added a project: Analytics-Data-Quality.
mforns added a subscriber: mforns. Mar 11 2019, 4:03 PM

Much of this data may be coming from bots as well, see: T210006

mforns raised the priority of this task from Normal to Needs Triage. Mar 25 2019, 5:32 PM
mforns triaged this task as Normal priority.
Nuria renamed this task from "Spike: Quantify how many EventLogging requests we get from non-wiki* hostnames or apps" to "EventLogging requests we get from non-wiki* hostnames or apps should be filtered at refine time". Apr 15 2019, 11:54 PM
Nuria reassigned this task from Milimetric to mforns.
Nuria added a project: Analytics-Kanban.
Nuria updated the task description.
Nuria removed a subscriber: Tbayer.
mforns moved this task from Next Up to In Progress on the Analytics-Kanban board. Apr 17 2019, 3:45 PM
mforns moved this task from In Progress to Paused on the Analytics-Kanban board. May 13 2019, 2:11 PM
mforns moved this task from Paused to In Progress on the Analytics-Kanban board. May 20 2019, 7:20 PM

Change 511934 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery/source@master] Add refine transform function to filter our non-wiki hostnames

https://gerrit.wikimedia.org/r/511934

Change 511934 merged by Nuria:
[analytics/refinery/source@master] Add refine transform function to filter our non-wiki hostnames

https://gerrit.wikimedia.org/r/511934

mforns moved this task from Ready to Deploy to Done on the Analytics-Kanban board. Jun 6 2019, 2:29 PM
mforns moved this task from Done to Ready to Deploy on the Analytics-Kanban board. Jun 6 2019, 2:42 PM
mforns moved this task from Ready to Deploy to Done on the Analytics-Kanban board. Jun 6 2019, 3:12 PM
Nuria closed this task as Resolved. Jun 17 2019, 7:56 PM
Nuria set the point value for this task to 5.