Page MenuHomePhabricator

Determine if poolcounter hits in CirrusSearch-Search:_elasticsearch_enwiki are due to increased number of queries, or queries taking longer to process.
Closed, ResolvedPublic

Description

The number of requests on enwiki hitting the concurrent search request limit has been increasing lately:

/a/mw-log/archive/poolcounter.log-20150218.gz:3057
/a/mw-log/archive/poolcounter.log-20150219.gz:1728
/a/mw-log/archive/poolcounter.log-20150220.gz:3189
/a/mw-log/archive/poolcounter.log-20150221.gz:3151
/a/mw-log/archive/poolcounter.log-20150222.gz:3034
/a/mw-log/archive/poolcounter.log-20150223.gz:2372
/a/mw-log/archive/poolcounter.log-20150224.gz:2840
/a/mw-log/archive/poolcounter.log-20150225.gz:828
/a/mw-log/archive/poolcounter.log-20150226.gz:598
/a/mw-log/archive/poolcounter.log-20150227.gz:0
/a/mw-log/archive/poolcounter.log-20150228.gz:0
/a/mw-log/archive/poolcounter.log-20150301.gz:1186
/a/mw-log/archive/poolcounter.log-20150302.gz:291
/a/mw-log/archive/poolcounter.log-20150303.gz:13
/a/mw-log/archive/poolcounter.log-20150304.gz:146
/a/mw-log/archive/poolcounter.log-20150305.gz:0
/a/mw-log/archive/poolcounter.log-20150306.gz:228
/a/mw-log/archive/poolcounter.log-20150307.gz:101
/a/mw-log/archive/poolcounter.log-20150308.gz:767
/a/mw-log/archive/poolcounter.log-20150309.gz:332
/a/mw-log/archive/poolcounter.log-20150310.gz:644
/a/mw-log/archive/poolcounter.log-20150311.gz:635
/a/mw-log/archive/poolcounter.log-20150312.gz:348
/a/mw-log/archive/poolcounter.log-20150313.gz:193
/a/mw-log/archive/poolcounter.log-20150314.gz:1527
/a/mw-log/archive/poolcounter.log-20150315.gz:504
/a/mw-log/archive/poolcounter.log-20150316.gz:1696
/a/mw-log/archive/poolcounter.log-20150317.gz:492
/a/mw-log/archive/poolcounter.log-20150318.gz:178
/a/mw-log/archive/poolcounter.log-20150319.gz:185
/a/mw-log/archive/poolcounter.log-20150320.gz:41
/a/mw-log/archive/poolcounter.log-20150321.gz:1
/a/mw-log/archive/poolcounter.log-20150322.gz:200
/a/mw-log/archive/poolcounter.log-20150323.gz:13
/a/mw-log/archive/poolcounter.log-20150324.gz:0
/a/mw-log/archive/poolcounter.log-20150325.gz:0
/a/mw-log/archive/poolcounter.log-20150326.gz:230
/a/mw-log/archive/poolcounter.log-20150327.gz:58
/a/mw-log/archive/poolcounter.log-20150328.gz:534
/a/mw-log/archive/poolcounter.log-20150329.gz:513
/a/mw-log/archive/poolcounter.log-20150330.gz:43
/a/mw-log/archive/poolcounter.log-20150331.gz:1421
/a/mw-log/archive/poolcounter.log-20150401.gz:0
/a/mw-log/archive/poolcounter.log-20150402.gz:102
/a/mw-log/archive/poolcounter.log-20150403.gz:989
/a/mw-log/archive/poolcounter.log-20150404.gz:1222
/a/mw-log/archive/poolcounter.log-20150405.gz:1641
/a/mw-log/archive/poolcounter.log-20150406.gz:809
/a/mw-log/archive/poolcounter.log-20150407.gz:694
/a/mw-log/archive/poolcounter.log-20150408.gz:1053
/a/mw-log/archive/poolcounter.log-20150409.gz:722
/a/mw-log/archive/poolcounter.log-20150410.gz:1682
/a/mw-log/archive/poolcounter.log-20150411.gz:20
/a/mw-log/archive/poolcounter.log-20150412.gz:463
/a/mw-log/archive/poolcounter.log-20150413.gz:55
/a/mw-log/archive/poolcounter.log-20150414.gz:82
/a/mw-log/archive/poolcounter.log-20150415.gz:1
/a/mw-log/archive/poolcounter.log-20150416.gz:16
/a/mw-log/archive/poolcounter.log-20150417.gz:1472
/a/mw-log/archive/poolcounter.log-20150418.gz:6213
/a/mw-log/archive/poolcounter.log-20150419.gz:399
/a/mw-log/archive/poolcounter.log-20150420.gz:974
/a/mw-log/archive/poolcounter.log-20150421.gz:595
/a/mw-log/archive/poolcounter.log-20150422.gz:1642
/a/mw-log/archive/poolcounter.log-20150423.gz:917
/a/mw-log/archive/poolcounter.log-20150424.gz:4246
/a/mw-log/archive/poolcounter.log-20150425.gz:1279
/a/mw-log/archive/poolcounter.log-20150426.gz:649
/a/mw-log/archive/poolcounter.log-20150427.gz:11
/a/mw-log/archive/poolcounter.log-20150428.gz:1171
/a/mw-log/archive/poolcounter.log-20150429.gz:2644
/a/mw-log/archive/poolcounter.log-20150430.gz:16580
/a/mw-log/archive/poolcounter.log-20150501.gz:15208
/a/mw-log/archive/poolcounter.log-20150502.gz:5246
/a/mw-log/archive/poolcounter.log-20150503.gz:269
/a/mw-log/archive/poolcounter.log-20150504.gz:3013
/a/mw-log/archive/poolcounter.log-20150506.gz:8874
/a/mw-log/archive/poolcounter.log-20150507.gz:5810
/a/mw-log/archive/poolcounter.log-20150508.gz:417
/a/mw-log/archive/poolcounter.log-20150509.gz:9471
/a/mw-log/archive/poolcounter.log-20150510.gz:23448
/a/mw-log/archive/poolcounter.log-20150511.gz:3977
/a/mw-log/archive/poolcounter.log-20150512.gz:12287
/a/mw-log/archive/poolcounter.log-20150513.gz:5904
/a/mw-log/archive/poolcounter.log-20150514.gz:15536
/a/mw-log/archive/poolcounter.log-20150515.gz:217899
/a/mw-log/archive/poolcounter.log-20150516.gz:7535
/a/mw-log/archive/poolcounter.log-20150517.gz:5576
/a/mw-log/archive/poolcounter.log-20150518.gz:31688
/a/mw-log/archive/poolcounter.log-20150519.gz:8014

What we are wondering is if this is due to a larger volume of requests overall, or are the search requests taking longer to process? It would be nice to take into consideration not just the day the request was made, but how many requests are processing around the same moments in time.

All of this information is available in the CirrusSearchRequests logs, in the form of

2015-05-20 04:03:24 mw1146 enwiki CirrusSearchRequests DEBUG: full_text search for 'canno' against enwiki_content took 14 millis and 12 Elasticsearch millis. Found 20 total results and returned 2 of them starting at 0. Requested via api.

I tried naively doing this with a a bit of sed/awk, but the request logs are huge. Each day is >3GB compressed and contain 150 to 200 million rows each. Essentially pulling out the 'took 14 millis' portion, and then summing those up. Just regexing out that 'took 14 millis' portion takes a couple hours per file, so a more appropriate parsing method is needed.

Event Timeline

EBernhardson raised the priority of this task from to Needs Triage.
EBernhardson updated the task description. (Show Details)
EBernhardson subscribed.

So just ended up running awk against this overnight, first collected the timing with:

ls /a/mw-log/archive/CirrusSearchRequests.log-20150* | xargs -I '{}' -P 4 bash -c 'zcat {} | awk -F "'\''" '\''/DEBUG: (full_text|prefix|regex|near_match|degraded_full_text|more_like) search for/ { split( $NF, a, " "); if (a[1]=="took" && a[3] == "millis") { print a[2] } else if (a[3]=="took" && a[5] == "millis") { print a[4] } }'\'' |pv -crN {} > /a/tmp/cirrus/$(basename {} | sed "s/.gz$//")'

Then ran some quick sum/avg over that info with

^[[Aebernhardson@fluorine:/a/tmp/cirrus$ sleep 5h; awk 'function report() {printf("%-35s %15dhrs %15d %10dms\n",v, sum/1000/60/60, lines, sum/lines)} BEGIN {printf("%-35s %15s    %15s %10s\n", "filename", "sum", "num", "avg")} FNR==1{if (lines>0) {report();sum=0;lines=0;}; prev=FILENAME} {if($1==$1+0) {sum += $1; lines += 1}} END {report()}' /a/tmp/cirrus/CirrusSearchRequests.log-20150*

This gives us the final results of:

filename                                        sum                num        avg
CirrusSearchRequests.log-20150425              2334hrs       196956146         42ms
CirrusSearchRequests.log-20150426              1886hrs       156146837         43ms
CirrusSearchRequests.log-20150427              2153hrs       171802020         45ms
CirrusSearchRequests.log-20150428              1905hrs       166531071         41ms
CirrusSearchRequests.log-20150429              1811hrs       161704978         40ms
CirrusSearchRequests.log-20150430              1895hrs       164989240         41ms
CirrusSearchRequests.log-20150501              1666hrs       148732125         40ms
CirrusSearchRequests.log-20150502              1750hrs       150594460         41ms
CirrusSearchRequests.log-20150503              1707hrs       149098832         41ms
CirrusSearchRequests.log-20150504              2011hrs       167780428         43ms
CirrusSearchRequests.log-20150506              4376hrs       327104730         48ms
CirrusSearchRequests.log-20150507              2169hrs       166197443         46ms
CirrusSearchRequests.log-20150508              2200hrs       166647516         47ms
CirrusSearchRequests.log-20150509              2099hrs       154751001         48ms
CirrusSearchRequests.log-20150510              2462hrs       166472970         53ms
CirrusSearchRequests.log-20150511              2560hrs       175685880         52ms
CirrusSearchRequests.log-20150512              2437hrs       173959097         50ms
CirrusSearchRequests.log-20150513              2389hrs       172233043         49ms
CirrusSearchRequests.log-20150514              2360hrs       167082990         50ms
CirrusSearchRequests.log-20150515              2733hrs       167550790         58ms
CirrusSearchRequests.log-20150516              2312hrs       160431934         51ms
CirrusSearchRequests.log-20150517              2500hrs       168765657         53ms
CirrusSearchRequests.log-20150518              3015hrs       188929420         57ms
CirrusSearchRequests.log-20150519              2350hrs       174381804         48ms

@EBernhardson so does the above mean you've...kinda done this? ;p

Kinda/sorta done it, but still don't have any good answers :( I think this ticket itself can be closed. Moving forward I think collecting these stats will fall under statsd integration (T90889) and will allow us to run much more detailed queries against the data.