I was curious about query usage, so I took a look at the nginx access logs of the tool, using arbitrary data range from 20 May to today 14 June 2018.
$ cat access-.log |grep "GoogleDocs;" |wc -l 142410
$ cat access-.log |grep "GoogleDocs;" |grep "/csv" |wc -l 125333
$ cat access-.log |grep --binary-files=text "GoogleDocs;" |tail 10.68.21.68 - - [28/May/2018:06:35:31 +0000] "GET /run/xxxxxx/output/1/csv HTTP/1.1" 200 57488 "-" "Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; +http://docs.google.com)" 10.68.21.68 - - [28/May/2018:06:35:32 +0000] "GET /run/xxxxxx/output/0/csv HTTP/1.1" 200 47 "-" "Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; +http://docs.google.com)" 10.68.21.68 - - [28/May/2018:06:35:32 +0000] "GET /query/xxxxxx/result/latest/4/csv?qd HTTP/1.1" 302 307 "-" "Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; +http://docs.google.com)" 10.68.21.68 - - [28/May/2018:06:35:33 +0000] "GET /run/xxxxxx/output/2/csv HTTP/1.1" 200 90 "-" "Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; +http://docs.google.com)" 10.68.21.68 - - [28/May/2018:06:35:34 +0000] "GET /run/xxxxxx/output/3/csv HTTP/1.1" 200 90 "-" "Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; +http://docs.google.com)" 10.68.21.68 - - [28/May/2018:06:35:34 +0000] "GET /query/xxxxxx/result/latest/6/csv?qd HTTP/1.1" 302 307 "-" "Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; +http://docs.google.com)" 10.68.21.68 - - [28/May/2018:06:35:35 +0000] "GET /query/xxxxxx/result/latest/5/csv?qd HTTP/1.1" 302 307 "-" "Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; +http://docs.google.com)" 10.68.21.68 - - [28/May/2018:06:35:36 +0000] "GET /run/xxxxxx/output/4/csv HTTP/1.1" 200 90 "-" "Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; +http://docs.google.com)" 10.68.21.68 - - [28/May/2018:06:35:38 +0000] "GET /run/xxxxxx/output/6/csv HTTP/1.1" 200 339341 "-" "Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; +http://docs.google.com)" 10.68.21.68 - - [28/May/2018:06:35:38 +0000] "GET /run/xxxxxx/output/5/csv HTTP/1.1" 200 15816 "-" "Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; +http://docs.google.com)"
As I understand it Google has requested over 3.56 GiB in almost a month for its spreadsheet app, over 20% of the requests. I suppose it tries to fetch data when an url is added to a link spreadsheet ? Or do they use some kind of import to their system ? I can identify at least one url for witch its quarry user page was reached 6 656 times by google bot, witch is the top 5 most downloaded page of the last month. I'll sent him an email to get more information, but I don't want to expose it username without his consent, and stripped query run numbers from the output above.
Here is the distribution of the bot requests by day. It looks pretty constant - the data are similar day by day, no major leaps.
20180614 3827 2.69% 20180613 9330 6.55% 20180612 11797 8.28% 20180611 9780 6.87% 20180610 11224 7.88% 20180609 6867 4.82% 20180608 3905 2.74% 20180607 2564 1.80% 20180606 3792 2.66% 20180605 4916 3.45% 20180604 4153 2.92% 20180603 4555 3.20% 20180602 4312 3.03% 20180601 4726 3.32% 20180531 4716 3.31% 20180530 4171 2.93% 20180529 3794 2.66% 20180528 5938 4.17% 20180527 4184 2.94% 20180526 4374 3.07% 20180525 6059 4.25% 20180524 5894 4.14% 20180523 3956 2.78% 20180522 5152 3.62% 20180521 4724 3.32% 20180520 3700 2.60%
And FYI here is the stats I obtain about all crawlers :
GoogleDocs 142410 23.18% Python-urllib/2.7 91390 14.88% Python-urllib/2.6 50336 8.19% SemrushBot/2~bl 29813 4.85% Google-Apps-Script 25964 4.23% DotBot/1.1 23907 3.89% bingbot/2.0 6626 1.08% YandexBot/3.0 5352 0.87% Baiduspider/2.0 4942 0.80% Googlebot/2.1 2551 0.42% SeznamBot/3.2 657 0.11% Googlebot 501 0.08% ...
IDK if this bot is useful for someone, perhaps it's just a good quarry user that has the habit to import the result into G docs ? but several thousand imports a day seem a little exaggerated :)
Is only quarry concerned in WMCS ? What should we do ?