
Security review of Analytics Query Service
Closed, Resolved · Public

Description

Needs a security review, and public deployment should be blocked on T109724 (if I'm understanding what the service does)

@Milimetric, can you link to any design / architecture docs, and the code repo?

Event Timeline

csteipp created this task. Oct 7 2015, 5:49 PM
csteipp raised the priority of this task from to Needs Triage.
csteipp updated the task description.
csteipp added subscribers: csteipp, Milimetric, mobrovac.
Restricted Application added a subscriber: Aklapper. Oct 7 2015, 5:49 PM

The main component is a RESTBase cluster on aqs100[123]. It uses the vanilla version, with an additional module - pageviews. That's all I'm aware of on the Services side that needs reviewing (wrt code).

Thanks @mobrovac. Once that code forwards the request to restbase, e.g., https://github.com/wikimedia/restbase/blob/master/mods/pageviews.js#L221, what does the request look like?

@Milimetric, I'm trying to understand this at a high level. Based on the code mobrovac pointed out, it looks like you're pushing the pageviews into restbase somewhere (link to code?), and then restbase is serving that back from the pageviews module. What's the minimum granularity? And what is agent?

Ah, I could have been more descriptive in my previous comment. Here's how it will work once all of the bits and pieces are in place:

  • The Services Team's RESTBase cluster (aka https://{domain}/api/rest_v1/) will expose the pageviews API publicly (we're still bike-shedding on the actual layout, cf. T114830). As we do for other services, headers and friends will be filtered here.
  • The requests will reach the Analytics Team's RESTBase instance on aqs100x, where the aforementioned pageviews module will handle them.
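
To illustrate the "headers and friends will be filtered" step, here is a minimal sketch of allowlist-based header filtering in Python. It is not RESTBase's actual code: the `ALLOWED_HEADERS` set and the `filter_headers` helper are hypothetical names, and the thread does not say which headers are really forwarded.

```python
# Hypothetical allowlist: the thread does not list the headers that are
# actually forwarded from the public RESTBase cluster to aqs100x.
ALLOWED_HEADERS = {"accept", "accept-encoding", "user-agent"}


def filter_headers(headers):
    """Return only allowlisted headers, with names lowercased."""
    return {
        name.lower(): value
        for name, value in headers.items()
        if name.lower() in ALLOWED_HEADERS
    }


# Example request headers (made-up values).
incoming = {
    "Accept": "application/json",
    "Cookie": "session=abc123",
    "X-Forwarded-For": "203.0.113.7",
}
forwarded = filter_headers(incoming)
```

With an allowlist, anything not explicitly named (cookies, forwarding headers) is dropped by default, which is usually the safer choice for a public front end.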

On the Analytics side (so to speak), the Cassandra cluster on aqs100x will be filled with data produced (daily?) by Hadoop, so the system itself will not produce any data, nor trigger its production; it's a strictly read-only module.

Once that code forwards the request to restbase, e.g., https://github.com/wikimedia/restbase/blob/master/mods/pageviews.js#L221, what does the request look like?

The link you point to simply gets the data out of Cassandra on aqs100x. If it hasn't been filled by Hadoop already, the client receives a 404; no data fetching takes place.
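
The behaviour described here (read what is already stored, otherwise answer 404) can be sketched roughly as follows. This is an illustration in Python, not the pageviews module itself: `STORE` stands in for the Cassandra table, and the key layout, the `get_pageviews` name, and the response bodies are all assumptions.

```python
# Stand-in for the Cassandra table that Hadoop fills; the key layout and
# the view count are made up for illustration.
STORE = {
    ("en.wikipedia", "Main_Page", "2015100100"): 1234,
}


def get_pageviews(project, article, timestamp):
    """Read-only lookup: return (status, body) without writing or computing."""
    views = STORE.get((project, article, timestamp))
    if views is None:
        # Nothing loaded for this key: report 404, do not fetch or compute.
        return 404, {"detail": "no data loaded for this key"}
    return 200, {
        "project": project,
        "article": article,
        "timestamp": timestamp,
        "views": views,
    }
```

A request for a key that Hadoop has not loaded yet gets a 404 and leaves the store untouched, which is the property relevant to the review: a client cannot cause writes or trigger computation.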

Yes to everything that Marko said, thank you for the explanation. Answers to the other questions:

  • minimum granularity: hourly
  • agent can be: spider / user / bot / all-agents, as detected by our refine process [1], [2]
  • The code that pushes these stats into restbase aggregates on top of the wmf.pageview_hourly table which we were looking at together recently. Here's the change: https://gerrit.wikimedia.org/r/#/c/236224/ and the HQL scripts in there are the ones doing the actual aggregation. They cover all the combinations of granularity (hourly, daily, monthly) and the three endpoints (top pageviews, per-article stats, and per-project stats).

[1] https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/Webrequest.java#L94
[2] https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/Webrequest.java#L114
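
As a rough illustration of the answers above, the sketch below enumerates the granularity/endpoint combinations and checks an agent value against the four listed ones. The short endpoint labels and both function names are hypothetical; the real HQL script names in the Gerrit change may differ.

```python
from itertools import product

# Values taken from the comment above; the short endpoint labels are
# illustrative and are not the real script or route names.
GRANULARITIES = ("hourly", "daily", "monthly")
ENDPOINTS = ("top", "per-article", "per-project")
AGENTS = ("spider", "user", "bot", "all-agents")


def aggregation_jobs():
    """List every endpoint/granularity combination as 'endpoint/granularity'."""
    return [
        f"{endpoint}/{granularity}"
        for endpoint, granularity in product(ENDPOINTS, GRANULARITIES)
    ]


def is_valid_agent(agent):
    """Accept only the four agent values named in the thread."""
    return agent in AGENTS
```

Three granularities times three endpoints gives nine aggregations. Checking `agent` against a fixed set means arbitrary client input is never used as part of a storage key.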

Restricted Application added a subscriber: StudiesWorld. Dec 7 2015, 6:08 PM
Milimetric closed this task as Resolved. Aug 8 2016, 4:55 PM
Milimetric claimed this task.
Milimetric added a subscriber: dpatrick.

@dpatrick if you think this is still needed, please reopen or ping us. The service has been up for a few months and we reviewed it internally.