Page MenuHomePhabricator

GoogleDocs bot has download 125 000 csv exports in the last month
Open, Needs TriagePublic

Description

I was curious about query usage, so I took a look at the nginx access logs of the tool, using arbitrary data range from 20 May to today 14 June 2018.

$ cat access-.log |grep "GoogleDocs;"  |wc -l
142410
$ cat access-.log |grep "GoogleDocs;" |grep "/csv" |wc -l
125333
$ cat access-.log |grep --binary-files=text "GoogleDocs;" |tail
10.68.21.68 - - [28/May/2018:06:35:31 +0000] "GET /run/xxxxxx/output/1/csv HTTP/1.1" 200 57488 "-" "Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; +http://docs.google.com)"
10.68.21.68 - - [28/May/2018:06:35:32 +0000] "GET /run/xxxxxx/output/0/csv HTTP/1.1" 200 47 "-" "Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; +http://docs.google.com)"
10.68.21.68 - - [28/May/2018:06:35:32 +0000] "GET /query/xxxxxx/result/latest/4/csv?qd HTTP/1.1" 302 307 "-" "Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; +http://docs.google.com)"
10.68.21.68 - - [28/May/2018:06:35:33 +0000] "GET /run/xxxxxx/output/2/csv HTTP/1.1" 200 90 "-" "Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; +http://docs.google.com)"
10.68.21.68 - - [28/May/2018:06:35:34 +0000] "GET /run/xxxxxx/output/3/csv HTTP/1.1" 200 90 "-" "Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; +http://docs.google.com)"
10.68.21.68 - - [28/May/2018:06:35:34 +0000] "GET /query/xxxxxx/result/latest/6/csv?qd HTTP/1.1" 302 307 "-" "Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; +http://docs.google.com)"
10.68.21.68 - - [28/May/2018:06:35:35 +0000] "GET /query/xxxxxx/result/latest/5/csv?qd HTTP/1.1" 302 307 "-" "Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; +http://docs.google.com)"
10.68.21.68 - - [28/May/2018:06:35:36 +0000] "GET /run/xxxxxx/output/4/csv HTTP/1.1" 200 90 "-" "Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; +http://docs.google.com)"
10.68.21.68 - - [28/May/2018:06:35:38 +0000] "GET /run/xxxxxx/output/6/csv HTTP/1.1" 200 339341 "-" "Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; +http://docs.google.com)"
10.68.21.68 - - [28/May/2018:06:35:38 +0000] "GET /run/xxxxxx/output/5/csv HTTP/1.1" 200 15816 "-" "Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; +http://docs.google.com)"

As I understand it Google has requested over 3.56 GiB in almost a month for its spreadsheet app, over 20% of the requests. I suppose it tries to fetch data when an url is added to a link spreadsheet ? Or do they use some kind of import to their system ? I can identify at least one url for witch its quarry user page was reached 6 656 times by google bot, witch is the top 5 most downloaded page of the last month. I'll sent him an email to get more information, but I don't want to expose it username without his consent, and stripped query run numbers from the output above.

Here is the distribution of the bot requests by day. It looks pretty constant - the data are similar day by day, no major leaps.

20180614	3827	2.69%
20180613	9330	6.55%
20180612	11797	8.28%
20180611	9780	6.87%
20180610	11224	7.88%
20180609	6867	4.82%
20180608	3905	2.74%
20180607	2564	1.80%
20180606	3792	2.66%
20180605	4916	3.45%
20180604	4153	2.92%
20180603	4555	3.20%
20180602	4312	3.03%
20180601	4726	3.32%
20180531	4716	3.31%
20180530	4171	2.93%
20180529	3794	2.66%
20180528	5938	4.17%
20180527	4184	2.94%
20180526	4374	3.07%
20180525	6059	4.25%
20180524	5894	4.14%
20180523	3956	2.78%
20180522	5152	3.62%
20180521	4724	3.32%
20180520	3700	2.60%

And FYI here is the stats I obtain about all crawlers :

GoogleDocs		142410	23.18%
Python-urllib/2.7	91390	14.88%
Python-urllib/2.6	50336	8.19%
SemrushBot/2~bl		29813	4.85%
Google-Apps-Script	25964	4.23%
DotBot/1.1		23907	3.89%
bingbot/2.0		6626	1.08%
YandexBot/3.0		5352	0.87%
Baiduspider/2.0		4942	0.80%
Googlebot/2.1		2551	0.42%
SeznamBot/3.2		657	0.11%
Googlebot		501	0.08%
...

IDK if this bot is useful for someone, perhaps it's just a good quarry user that has the habit to import the result into G docs ? but several thousand imports a day seem a little exaggerated :)
Is only quarry concerned in WMCS ? What should we do ?

Event Timeline

I've found this page from google help center that describe their import function It can match our problem. https://support.google.com/docs/answer/3093335?hl=en&ref_topic=3105411

Vvjjkkii renamed this task from GoogleDocs bot has download 125 000 csv exports in the last month to zzaaaaaaaa.Jul 1 2018, 1:03 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from zzaaaaaaaa to GoogleDocs bot has download 125 000 csv exports in the last month.Jul 2 2018, 6:40 AM
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot added a subscriber: Aklapper.
Stashbot added a subscriber: Stashbot.

Mentioned in SAL (#wikimedia-cloud) [2018-09-13T19:10:07Z] <framawiki> copy /var/log/nginx from legacy main-01 to /data/project/nginx-logs-legacy-20180913-framawiki for further analysis T202588 T197256