
Estimate maximum throughput of Schema:Search (capacity) {oryx}
Closed, Duplicate · Public

Description

@DarTar and @Deskana have created a new search schema https://meta.wikimedia.org/wiki/Schema:Search
https://trello.com/c/wXgAdRgv/644-cross-platform-search-instrumentation

To do:

  • estimate throughput for schema
  • determine sampling needed

Event Timeline

kevinator raised the priority of this task from to High.
kevinator updated the task description. (Show Details)
kevinator added a subscriber: kevinator.
Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Feb 9 2015, 7:10 PM
kevinator renamed this task from Estimate maximum throughput of Schema:Search to Estimate maximum throughput of Schema:Search {oryx}. · Feb 12 2015, 2:16 AM
kevinator lowered the priority of this task from High to Normal.
kevinator set Security to None.
kevinator renamed this task from Estimate maximum throughput of Schema:Search {oryx} to Estimate maximum throughput of Schema:Search (capacity) {oryx}. · Feb 18 2015, 10:14 PM
kevinator added subscribers: DarTar, Deskana.

figuring out the throughput will be something similar to what was done for Hovercards T88173#1022338

Thanks. What's the process to figure out the throughput of EL? Feel free to point us to a guide with questions and we'll answer them to the best of our knowledge.

Nuria added a comment. · Feb 24 2015, 8:32 PM

It's a matter of figuring out how often you are logging data so we can estimate a rate; this depends entirely on the feature at hand and the particulars of its instrumentation.
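As a rough sketch of the kind of estimate involved here (all input figures below are made-up placeholders, not actual traffic numbers):

```python
# Back-of-the-envelope EventLogging throughput estimate.
# All input numbers are hypothetical placeholders.
searches_per_day = 100_000_000  # assumed daily search volume across target wikis
events_per_search = 3           # assumed events fired per search session step
sampling_rate = 0.001           # 0.1% sampling

events_per_day = searches_per_day * events_per_search * sampling_rate
events_per_second = events_per_day / 86_400  # seconds in a day

print(f"{events_per_day:,.0f} events/day ≈ {events_per_second:.1f} events/sec")
```

Swapping in real search volumes and the actual number of events fired per search gives the rate the EventLogging pipeline would need to sustain.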

Have you instrumented your code?

What user actions will trigger logging of events?

Are you logging data for logged in users only, anonymous users or both?

Have you instrumented your code?

No. We're reaching out first because we want to know what sampling rate to use.

What user actions will trigger logging of events?

I could explain here, but the schema itself documents that: https://meta.wikimedia.org/wiki/Schema:Search

Are you logging data for logged in users only, anonymous users or both?

Both, ideally.

Nuria added a comment. · Feb 27 2015, 5:18 PM

Questions.

Schema related:

I could explain here, but the schema itself documents that: https://meta.wikimedia.org/wiki/Schema:Search

Ok, I see. But is this a new feature? A search revamp available to some users only? If not, I am not clear how you are going to link client-side and server-side events in this schema (I don't know much about search, so my apologies if I am missing something here).

The current search that I can see deployed on Wikipedia has a client-side autocomplete, but results are served via a server-side page request. Your schema has the notion of a sessionId; how is that going to persist to the server side so you can tag the serving of search results as belonging to a particular session? The schema seems to assume that search is a client-side feature, when the only part I can see on the client side is the autocomplete. Search results require a full page reload. Please let me know if this makes sense. I think working through instrumenting the schema with a developer might make my comment clearer.
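One hypothetical way to carry a client-side sessionId through the full-page results request, so server-side pageviews could be tied back to the same session (parameter and function names below are illustrative, not actual CirrusSearch or EventLogging code):

```python
# Sketch: propagate a client-generated sessionId via a query parameter on the
# full-page search request, then recover it server-side for log tagging.
# The "searchSessionId" parameter name is a made-up example.
from urllib.parse import urlencode, urlparse, parse_qs

def search_results_url(base, query, session_id):
    """Client side: append the sessionId to the results-page request."""
    return f"{base}?{urlencode({'search': query, 'searchSessionId': session_id})}"

def session_from_request(url):
    """Server side: recover the sessionId so the result pageview can be tagged."""
    params = parse_qs(urlparse(url).query)
    return params.get('searchSessionId', [None])[0]

url = search_results_url("https://en.wikipedia.org/w/index.php", "oryx", "abc123")
print(session_from_request(url))
```

The same idea would work with a cookie instead of a query parameter; either way, something has to carry the session identifier across the page reload.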

Traffic Related:
Is your instrumentation going to be deployed to all wikis/projects?
How many search requests (per day) do we have on the wikis you are planning to deploy?

Nuria added a comment. · Feb 27 2015, 5:47 PM

Summing up IRC conversation:

Looks like @Deskana is interested in two things: 1) how our users use search, and 2) the number of search pageviews.

For 1) EL is a great fit, but not so for 2). We use EL to see how users use a feature; search pageviews should be available now via search logs without additional instrumentation. Just as we log pageviews via varnish to hadoop/sampled logs, elasticsearch must have its own logging. Numbers for search pageviews and search usage should come from those logs.

Nuria added a comment. · Feb 27 2015, 5:58 PM

Using EL client side to estimate search pageviews you will be missing:

  1. Bot searches
  2. Clients with JavaScript disabled or unsupported
  3. Mobile apps

.... probably more that I cannot think of ....

For 1 and 2 see: https://www.mediawiki.org/wiki/Analytics/Reports/ClientsWithoutJavascript

Nuria added a comment. · Feb 27 2015, 6:09 PM

To sum up: measuring pageviews with EL is not effective, because you would need to instrument all clients to get a somewhat accurate picture. Search pageviews should be measured on the server side (just as regular pageviews are).

Checking in to see what moved this into paused. Thanks.

Nuria added a comment. · Mar 10 2015, 1:51 AM

I think this is a mistake; moving it back into WIP.

Please see my last comments: we can use EL to see how people use search. It will not work well for pageviews; just like regular pageviews, those should come from server logs.

Nuria added a comment. · Mar 10 2015, 1:52 AM

Note that the last action item is not on the Analytics team but rather on the team that owns search.

Someone should figure out how to either:

A. Make Elasticsearch write its search logs to Kafka
B. Make CirrusSearch write its search logs to Kafka

I think A is more ideal, but B might be easier. Not sure.
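A minimal sketch of what option B might look like, assuming the frontend emits one JSON record per search to a Kafka topic. The topic name, field names, and broker address are illustrative assumptions, not actual CirrusSearch code:

```python
# Hypothetical sketch: serialize one search request as a Kafka message value.
# Field names and topic are made-up examples.
import json

def encode_search_event(query, wiki, hits, timestamp):
    """Serialize one search request as a JSON-encoded Kafka message value."""
    return json.dumps({
        "query": query,
        "wiki": wiki,
        "hits": hits,
        "ts": timestamp,
    }, sort_keys=True).encode("utf-8")

# With a broker available, this would be wired up roughly as follows
# (using the kafka-python client):
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   producer.send("search-requests", encode_search_event("oryx", "enwiki", 17, 1425000000))

msg = encode_search_event("oryx", "enwiki", 17, 1425000000)
print(msg)
```

Whether the record is produced by Elasticsearch itself (option A) or by CirrusSearch (option B), the downstream consumers counting search pageviews would see the same stream.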

ggellerman assigned this task to Nuria. · Apr 3 2015, 1:57 PM
Nuria reassigned this task from Nuria to kevinator. · Apr 22 2015, 3:45 PM

This schema is in production right now with a sampling rate of 0.1% and has
been for several weeks. Not sure how that affects this task, so I'm just
noting that here.
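For reference, at a 0.1% sampling rate each recorded event stands in for roughly 1,000 real ones (the recorded-event count below is a hypothetical placeholder):

```python
# Scaling sampled EventLogging counts back up to an estimate of total volume.
# The recorded-event count is a made-up placeholder.
sampling_rate = 0.001              # 0.1% of events are recorded
recorded_events_per_day = 250_000  # hypothetical EL events actually logged

estimated_total = recorded_events_per_day / sampling_rate
print(f"≈ {estimated_total:,.0f} search events/day")
```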

kevinator added a comment. · Edited · Jun 3 2015, 8:41 PM

Closing this and merging it into our load test of EventLogging.
The load test will reveal our maximum throughput.
We will announce the results of the load test at scrum of scrums.