
Can Search up sampling to 5%? {oryx}
Closed, ResolvedPublic


From @ksmith:
We would like to increase sampling of search to 5% (desktop, mobile web, and both mobile apps [Android and iOS]). They have been at 0.1% to 1% so far. According to Oliver: "Napkin maths based on current load projected upwards to 5% suggests it's around 71 events per second."

Event Timeline

kevinator raised the priority of this task from to High.
kevinator updated the task description.
Restricted Application added a subscriber: Aklapper. · Jun 19 2015, 9:42 PM
kevinator renamed this task from Can Search up sampling 5%? to Can Search up sampling to 5%?. · Jun 19 2015, 9:42 PM
kevinator set Security to None.
kevinator renamed this task from Can Search up sampling to 5%? to Can Search up sampling to 5%? {oryx}. · Jun 19 2015, 10:48 PM

Folks, what is the sampling rate of events right now? The average throughput is not as interesting for performance as the highest spike. Right now I'm seeing spikes of up to 3 events per second. If the current rate is 0.1%, then increasing to 5% would mean spikes of up to 150 events per second. That might be a problem, if not for the Event Logging pipeline then for the mysql tables that have to hold this data. If the current rate is 1%, then we're ok to raise the sampling to 5%.
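The scaling in this comment can be sketched as a quick back-of-envelope check, using only the figures stated above (spikes of ~3 events/sec at 0.1% sampling):

```python
# Back-of-envelope check, assuming the figures in the comment above:
# spikes of ~3 events/sec at the current 0.1% sampling rate.
spike_eps = 3   # observed spike, events per second
scale = 50      # 0.1% -> 5% is a 50x increase

projected_spike = spike_eps * scale
print(projected_spike)  # 150 events per second
```

If the current rate were 1% instead, the scale factor would only be 5x, for spikes of ~15 events/sec.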

Current (as far as we know) sampling and resulting rates can be seen on the search dashboard:

I see the dashboard says you're sampling at 0.1%. In that case, the jump to 5% can't be done right now. It would simply overwhelm mysql with too many events and the table would become unresponsive to queries.

We have prepared an alternative: we can blacklist the Search schema from the Event Logging mysql consumer and store it only in HDFS. Is this acceptable? We could migrate the existing data into a Hive table, and the queries themselves should be easily portable. But I'm not sure what this does to the data collection / rsync / etc. Oliver?

ggellerman moved this task from Next Up to Paused on the Analytics-Kanban board. · Jun 30 2015, 3:44 PM

@Milimetric: To be clear, desktop and mobile web are currently(ish) sampling at 0.1%. Mobile apps are at 1% (at least, that's how I read it). If that changes your analysis, let me know.

Otherwise, we'll have to wait for Oliver to weigh in on your proposal. That probably won't be until next week.

Thanks @ksmith, I didn't realize that. The bulk of our traffic comes from desktop and mobile web, so the 1% sampling on apps doesn't help much. I'll wait for Oliver to weigh in.

FYI, 150 events per second would mean over a billion rows in those tables, assuming we truncate them after 90 days as is usual with our other tables. Truncating earlier would help as well; I could get you a table size that works well, and we can work backwards to find the number of days that would keep us under that size in MySQL.
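The billion-row figure checks out under the thread's assumptions (a sustained 150 events/sec and 90-day retention; this is a rough upper bound, since 150/sec is the spike rate, not the average):

```python
# Rough upper bound on table size, assuming the thread's numbers:
# a sustained 150 events/sec and a 90-day retention window.
events_per_sec = 150
seconds_per_day = 86_400
retention_days = 90

rows = events_per_sec * seconds_per_day * retention_days
print(f"{rows:,}")  # 1,166,400,000 -- over a billion rows
```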

FYI, this is still blocked on @Ironholds responding to my comment above. (Apologies for not pinging him correctly last time.)

It was probably more to do with the fact that, as my OOO email made clear, I've been afk for a week and a half. I'll try to address this tomorrow (just clearing my email backlog rn).


So: in my mind, throwing it in HDFS is not the best solution. If we want to throw /all/ eventlogging data in HDFS, great. If we want to throw /no/ eventlogging data in HDFS, great. When we throw only a tiny bit in, I have to make architectural changes to dashboards that have just become ultra-stable, and that makes me sad.

Basically this is a @Deskana call; if he's happy for us to eat the time to switch over our code and nursemaid it for a while after, let's go with HDFS. If he's not, let's talk about a higher-but-not-as-high sampling rate: 0.1% transitioning up to 1%? 2%? Something like that.

@Ironholds, regarding the higher-but-not-as-high sampling: how long do you need this data for? That will affect what sampling you can do, because the limiting factor here is just that we're trying to keep the mysql table from being so big it's impossible to query. So, do you need the data for 90 days, or can you get by with 30 days or 10 days, etc.?

As far as all or nothing in HDFS, that makes sense. I think once we're happy with our new kafka pipeline, we'll have all data in kafka and therefore HDFS. We would just blacklist some schemas from ending up in mysql. So you wouldn't have to deal with querying two places.

We're setting a 90-day restriction on other tables right now, and that seems totally viable to me (we could even go for less, if Other Dan is okay with being able to backfill less in the case of some sort of irrevocable, machine-destroying error).

Yeah, 90 days is where we'd like all our EL tables to be. Less than that would increase the sampling that we could potentially do here. I'm thinking if the table ends up being in the hundreds of millions of rows, that should be ok. If you're coming to Wikimania, maybe we can sit down and do the math together.

@Milimetric: Dan has approved 2% across the board. Will that work for you?

@ksmith, yes, by my math that should be ok. Just in case, though, could I be cc-ed on the patch that makes the change? That way I can undo it if we run into any unexpected problems.
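For the record, the approved 2% works out like this (a sketch using the earlier figures in the thread: spikes of ~3 events/sec observed at 0.1% sampling, and a 90-day retention window):

```python
# Sketch of the 2% math, assuming the thread's earlier figures.
observed_spike_eps = 3   # spikes seen at 0.1% sampling
scale = 20               # 0.1% -> 2% is a 20x increase
spike_at_2pct = observed_spike_eps * scale   # 60 events/sec

seconds_per_day = 86_400
retention_days = 90
rows = spike_at_2pct * seconds_per_day * retention_days
print(f"{rows:,}")  # 466,560,000 -- hundreds of millions, not billions
```

That lands within the "hundreds of millions of rows" budget mentioned above, which is why 2% is acceptable where 5% (over a billion rows) was not.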

@Milimetric: Will do, and thanks! As far as I'm concerned, this issue can be closed/resolved.

Milimetric moved this task from Paused to Done on the Analytics-Kanban board. · Jul 28 2015, 5:04 PM
kevinator closed this task as Resolved. · Jul 30 2015, 3:22 PM