We would like to increase the sampling rate for search to 5% (desktop, mobile web, and both mobile apps [Android and iOS]). It has been between 0.1% and 1% so far. According to Oliver: "Napkin maths based on current load projected upwards to 5% suggests it's around 71 events per second."
Folks, what is the current sampling rate of events? The average throughput is less interesting for performance than the highest spike. Right now I'm seeing spikes of up to 3 events per second. If the current rate is 0.1%, then increasing to 5% would mean spikes of up to 150 events per second. That might be a problem, if not for the Event Logging pipeline then for the MySQL tables that have to hold this data. If the current rate is 1%, then we're OK to raise the sampling to 5%.
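The extrapolation above can be sketched as a small helper. This is just back-of-envelope math assuming throughput scales linearly with the sampling rate; the 3 events/sec spike and the candidate rates are the figures from this thread.

```python
def projected_spike(current_spike_eps, current_rate_pct, target_rate_pct):
    """Scale an observed spike (events/sec) linearly from one sampling rate to another.

    Rates are given in percent (e.g. 0.1 for 0.1%).
    """
    return current_spike_eps * (target_rate_pct / current_rate_pct)

# Observed spikes of 3 events/sec at 0.1% sampling, raised to 5%:
print(projected_spike(3, 0.1, 5.0))  # -> 150.0 events/sec
# If the real baseline is the 1% mobile-apps rate, the jump is only 5x:
print(projected_spike(3, 1.0, 5.0))  # -> 15.0 events/sec
```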
I see the dashboard says you're sampling at 0.1%. In that case, the jump to 5% can't be done right now; it would simply overwhelm MySQL with too many events, and the table would become unresponsive to queries.
We have prepared an alternative: we can blacklist the Search schema from the Event Logging MySQL consumer and store it only in HDFS. Is this acceptable? We could migrate the existing data into a Hive table, and the queries themselves should be easily portable. But I'm not sure what this does to the data collection / rsync / etc. Oliver?
@Milimetric: To be clear, desktop and mobile web are currently(ish) sampling at 0.1%. Mobile apps are at 1% (at least, that's how I read it). If that changes your analysis, let me know.
Otherwise, we'll have to wait for Oliver to weigh in on your proposal. That probably won't be until next week.
FYI, 150 events per second would mean over a billion rows in those tables, assuming we truncate them after 90 days as is usual with our other tables. If we can truncate earlier, that might help as well; I could get you a table size that works pretty well, and we can work backwards to find a number of days that keeps us under that size in MySQL.
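For the record, the "over a billion rows" figure checks out. A minimal sketch of the arithmetic, using the 150 events/sec spike rate and 90-day retention from this thread:

```python
SECONDS_PER_DAY = 86_400

def rows_for(events_per_sec, retention_days):
    """Projected row count if events arrive at a steady rate for the whole window."""
    return events_per_sec * SECONDS_PER_DAY * retention_days

print(rows_for(150, 90))  # -> 1166400000, i.e. roughly 1.17 billion rows
```

This treats the spike rate as sustained, so it's an upper bound; the average rate would give a smaller table.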
So: in my mind throwing it in HDFS is not the best solution. If we want to throw /all/ eventlogging data in HDFS, great. If we want to throw /no/ eventlogging data in HDFS, great. When we throw only a tiny bit in, I have to make architectural changes to dashboards that have just become ultra-stable, and that makes me sad.
Basically this is a @Deskana call; if he's happy for us to eat the time to switch over our code and nursemaid it for a while after, let's go with HDFS. If he's not, let's talk about a higher-but-not-as-high sampling rate: 0.1% transitioning up to 1%? 2%? Something like that.
@Ironholds, regarding the higher-but-not-as-high sampling: how long do you need this data for? That will affect what sampling rate you can use, because the limiting factor here is just keeping the MySQL table from getting so big it's impossible to query. So, do you need the data for 90 days, or can you get by with 30 days, or 10 days, etc.?
As for all-or-nothing in HDFS, that makes sense. I think once we're happy with our new Kafka pipeline, we'll have all data in Kafka and therefore in HDFS; we would just blacklist some schemas from ending up in MySQL. So you wouldn't have to deal with querying two places.
We're setting a 90-day restriction on other tables right now, and that seems totally viable to me (we could even go lower, if Other Dan is okay with being able to backfill less in the case of some sort of irrevocable, machine-destroying error).
Yeah, 90 days is where we'd like all our EL tables to be. Less than that would increase the sampling rate we could potentially use here. I'm thinking that if the table ends up in the hundreds of millions of rows, that should be OK. If you're coming to Wikimania, maybe we can sit down and do the math together.
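The "work backwards" idea above can be sketched as follows. This is a hypothetical helper, not anything deployed: it takes a target maximum row count (the 500 million cap below is an illustrative figure, not one agreed in this thread) and asks how many days of retention each candidate sampling rate would allow, scaling the observed 3 events/sec spike at 0.1% linearly.

```python
SECONDS_PER_DAY = 86_400
BASE_EPS, BASE_RATE_PCT = 3, 0.1  # observed spike and current sampling from the thread

def max_retention_days(target_rate_pct, max_rows):
    """Whole days of data we can keep at a given sampling rate without exceeding max_rows."""
    eps = BASE_EPS * (target_rate_pct / BASE_RATE_PCT)  # projected spike at the new rate
    return max_rows // (eps * SECONDS_PER_DAY)

# Example: keeping the table under ~500 million rows at various rates.
for rate in (1.0, 2.0, 5.0):
    print(f"{rate}% sampling -> {max_retention_days(rate, 500_000_000):.0f} days")
```

At 5% this lands well under the 90-day target, which matches the conclusion in the thread that 5% plus 90 days in MySQL doesn't fit.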