Provide web crawler data logs to Go Fish Digital
Closed, Resolved · Public

Description

I'm working with Go Fish Digital to help me understand more about what we can do to improve our SEO. As part of their work, they need some of our data.

They need crawl logs from the following bots:

  • Googlebot
  • Googlebot mobile
  • Googlebot smartphone
  • Bingbot
  • Baiduspider
  • Yandexbot

Requirements:

  • They want about four days' worth of raw data on the activity of these bots on our sites.
  • The data shouldn't be older than, say, one month, but it doesn't necessarily need to be the four days immediately preceding today.
  • Data format should be gzipped TSV.

Notes:

  • I'll get a public key using 2048-bit RSA to transmit the data to them securely.
  • It's possible that they'll come back to us and ask for more or less data, so it'd be good if you could be prepared for that.
  • I believe these bots use specific user agents which should make your job a bit easier.
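As a rough illustration of the user-agent matching mentioned in the notes, here is a minimal sketch that keeps only rows from a gzipped TSV whose user agent names one of the requested crawlers. The column name `user_agent`, the file layout (one header row), and the helper names are assumptions made for this example, not the actual log schema. The substrings are the tokens these crawlers are generally known to send and should be checked against each vendor's documentation.

```python
import gzip

# Substrings assumed to identify each crawler (matched case-insensitively).
BOT_TOKENS = {
    "googlebot": "Googlebot",
    "bingbot": "Bingbot",
    "baiduspider": "Baiduspider",
    "yandexbot": "Yandexbot",
}


def classify_user_agent(user_agent):
    """Return the crawler name claimed by a user agent, or None."""
    ua = user_agent.lower()
    for token, name in BOT_TOKENS.items():
        if token in ua:
            return name
    return None


def filter_bot_rows(in_path, out_path, ua_column="user_agent"):
    """Copy the header and every crawler row from one gzipped TSV to another.

    `ua_column` is a hypothetical column name; returns the number of rows kept.
    """
    kept = 0
    with gzip.open(in_path, "rt", encoding="utf-8") as src, \
            gzip.open(out_path, "wt", encoding="utf-8") as dst:
        header = src.readline()
        ua_index = header.rstrip("\n").split("\t").index(ua_column)
        dst.write(header)
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if classify_user_agent(fields[ua_index]) is not None:
                dst.write(line)
                kept += 1
    return kept
```

Matching on the user agent alone only shows what a client claims to be, so rows kept this way would still need IP verification before being treated as genuine crawler traffic.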

They have signed a master service agreement which fully covers our privacy policy, data retention, and data security requirements, and the agreement received signoff from Jim Buatti (in Legal) and Toby (the Chief Product Officer), amongst others.

Event Timeline

Very time sensitive.

mpopov moved this task from Triage to Doing on the Product-Analytics board.

@Deskana: Googlebot Mobile does not seem to be a thing anymore according to its lack of presence on https://support.google.com/webmasters/answer/1061943

There was an update to https://webmasters.googleblog.com/2011/12/introducing-smartphone-googlebot-mobile.html …in 2015, so I doubt it's still operational. If Go Fish Digital knows that Googlebot Mobile is still operational, can they please provide the UA?

Also, how rigorous does this need to be? Specifically, should I verify every crawler's IP address with reverse DNS lookup to make sure the hostnames belong to Yandex, Bing, Google, Baidu? (I suspect the answer is yes but just wanted to make sure because that part isn't trivial.)
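For reference, the check described here is usually done as a forward-confirmed reverse DNS lookup: resolve the IP to a hostname, confirm the hostname sits under a domain the crawler's operator documents, then resolve that hostname forward and confirm it maps back to the same IP. The sketch below is one way to do that, not the script used for this task. The domain suffixes are assumptions based on what each operator has published and should be confirmed against their documentation. The resolver functions are injectable so the logic can be exercised without network access.

```python
import socket

# Hostname suffixes assumed to belong to each crawler's operator.
CRAWLER_DOMAINS = {
    "Googlebot": (".googlebot.com", ".google.com"),
    "Bingbot": (".search.msn.com",),
    "Yandexbot": (".yandex.ru", ".yandex.net", ".yandex.com"),
    "Baiduspider": (".baidu.com", ".baidu.jp"),
}


def reverse_lookup(ip):
    """Reverse DNS: IP address to primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname):
    """Forward DNS: hostname to a list of IPv4 addresses."""
    return socket.gethostbyname_ex(hostname)[2]


def verify_crawler_ip(ip, bot, reverse=reverse_lookup, forward=forward_lookup):
    """Return True only if `ip` passes both the reverse and forward checks."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return False
    if not hostname.endswith(CRAWLER_DOMAINS[bot]):
        return False
    try:
        return ip in forward(hostname)
    except OSError:
        return False
```

The forward step matters because anyone who controls the reverse DNS zone for their own IP range can make it return an arbitrary hostname. Note that `socket.gethostbyname_ex` only handles IPv4, so IPv6 clients would need `socket.getaddrinfo` instead.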

@chelsyx Can you please help me? I've been able to find documentation (user agent strings and instructions for verifying) on Google's, Bing's, and Yandex's crawlers. The best I've been able to find for Baidu (in English) is this 2011 blog post by a third party: https://chineseseoshifu.com/blog/new-baidu-user-agent-baiduspider.html so I suspect any official documentation about what Baiduspider's UA looks like these days (or how to verify it) would be on Baidu's website and in Chinese.

Here's an example of the documentation I'm looking for: https://yandex.com/support/webmaster/robot-workings/check-yandex-robots.html

Baiduspider FAQ (in English, includes the UA): http://help.baidu.com/question?prod_id=99&class=0&id=3001
How to identify Baiduspider (in Chinese; let me know if you can't understand it with Google Translate): https://ziyuan.baidu.com/college/articleinfo?id=1002

Thank you so much!!!

@Deskana: Googlebot Mobile does not seem to be a thing anymore according to its lack of presence on https://support.google.com/webmasters/answer/1061943

There was an update to https://webmasters.googleblog.com/2011/12/introducing-smartphone-googlebot-mobile.html …in 2015, so I doubt it's still operational. If Go Fish Digital knows that Googlebot Mobile is still operational, can they please provide the UA?

I'll check with them.

Also, how rigorous does this need to be? Specifically, should I verify every crawler's IP address with reverse DNS lookup to make sure the hostnames belong to Yandex, Bing, Google, Baidu? (I suspect the answer is yes but just wanted to make sure because that part isn't trivial.)

Yeah, the extra check makes sense.

@Deskana: Progress update: I have 4 days of data (~12GB gzipped) and right now I have a script that's verifying ~20K IP addresses to determine which ones are legit and which ones spoofed the UA and pretended to be one of those crawlers. As you might expect, that part is taking some time.

A side benefit of doing the verification is that I will have an extra deliverable for the traffic team of all the IP addresses from those days that misrepresented themselves.
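Since the slow part is the DNS lookups, one way to speed up this kind of verification is to check each distinct (IP, claimed crawler) pair only once and run the lookups concurrently. The sketch below is illustrative rather than the script used here: `partition_ips` and its `verify` parameter are hypothetical names, and `verify` stands for any function that returns True for a genuine crawler IP.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_ips(claims, verify, max_workers=32):
    """Split (ip, bot) claims into verified and spoofed lists.

    claims: iterable of (ip, bot) pairs, duplicates allowed.
    verify: callable taking (ip, bot) and returning a bool.
    """
    unique = sorted(set(claims))  # verify each distinct pair only once
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda claim: verify(*claim), unique))
    legit = [claim for claim, ok in zip(unique, results) if ok]
    spoofed = [claim for claim, ok in zip(unique, results) if not ok]
    return legit, spoofed
```

Threads suit this workload because the time is spent waiting on the network rather than on computation. The `spoofed` list corresponds to the extra deliverable of IP addresses that misrepresented themselves.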

Once that's done, I'll just be waiting for upload instructions and a public encryption key.

@mpopov Thanks! I asked them for the public key and upload instructions last Friday, and I should hear back from them soon.

Just uploaded the data to Go Fish Digital.

Vvjjkkii renamed this task from "Provide web crawler data logs to Go Fish Digital" to "r8daaaaaaa". Jul 1 2018, 1:13 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed mpopov as the assignee of this task.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
Mainframe98 renamed this task from "r8daaaaaaa" to "Provide web crawler data logs to Go Fish Digital". Jul 1 2018, 10:02 AM
Mainframe98 closed this task as Resolved.
Mainframe98 assigned this task to mpopov.
Mainframe98 updated the task description.
Mainframe98 added a subscriber: Aklapper.