
[EPIC] Set up mechanism for archiving Google Search Console data
Closed, ResolvedPublic

Description

Per some recent discussions around our presence in search results and SEO in general, there is renewed interest in better understanding and monitoring the Google Search Console (GSC) data about our sites (example - special access required). Google only makes this available for the last 90 days at any given point in time (16 months of data is slowly becoming available via the new GSC interface, although not yet available through the API), and this limit prevents us from getting a better understanding of longer-term trends in this data (and separating them from short-term changes).

This task is about setting up an automated mechanism to regularly download the data of interest via the Search Console API and store it on our own servers, in a form suitable for analysis (e.g. CSVs and/or Druid with a Turnilo/Superset visualization front-end). Parts of the data to prioritize for storing:

Store these daily numbers for each site:

  • Clicks
  • Impressions
  • CTR
  • (average) Position

Do NOT store for now:

  • top keywords list

Filtered/split by:

  • Country
  • Device type (Desktop/Mobile/Tablet)
  • Search appearance: rich results vs all results (note: GSC does not allow stats for non-rich results)

Note: back in 2015 the then Discovery team made related attempts (e.g. T116822, T101158), which were abandoned after it was decided that SEO was not in the team's core scope at the time.

Event Timeline

JKatzWMF updated the task description. (Show Details)
JKatzWMF awarded a token.

Status update: JK will look into giving Chelsy and/or me some kind of access so we can take a look at it. I have previous experience with Google APIs and building R bindings to web APIs, so that will be helpful here :D

debt renamed this task from Set up mechanism for archiving Search Console data to Set up mechanism for archiving Google Search Console data.Aug 24 2017, 8:23 PM
debt triaged this task as Medium priority.
debt added a project: Discovery-ARCHIVED.

Okay, I've got something working for fetching this data, which means it's time to consider some aspects we haven't before.
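
For reference, the core of what I'm doing is a Search Analytics query against the webmasters v3 API. Here's a minimal sketch (not the actual script; the credential cache path, dates, and site URL are placeholders):

import httplib2
from googleapiclient.discovery import build
from oauth2client.file import Storage

# Sketch only: load previously cached OAuth2 credentials and run one Search Analytics
# query for a single property.
credentials = Storage('cache/creds.json').get()           # placeholder path
http = credentials.authorize(httplib2.Http())
webmasters_service = build('webmasters', 'v3', http=http)

request = {
    'startDate': '2018-01-01',                            # placeholder date range
    'endDate': '2018-03-31',
    'dimensions': ['date'],                               # add 'country'/'device' for the splits
    'rowLimit': 5000,
}
response = webmasters_service.searchanalytics().query(
    siteUrl='https://en.wikipedia.org/', body=request).execute()

for row in response.get('rows', []):
    # each row has keys (the dimension values), clicks, impressions, ctr, and position
    print(row['keys'][0], row['clicks'], row['impressions'], row['ctr'], row['position'])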

According to this documentation, if you group, filter, or compare by page or search appearance, all metrics in the report are aggregated by page; otherwise, all metrics are aggregated by site. Because of the different accounting methods, the click-through rate and average position are higher when aggregating by site if multiple pages from the same site appear in the search results.

  • For impressions, if a site appears twice on a search results page when aggregating by site, it counts as a single impression; if grouping by page or search appearance, each unique page is counted separately.
  • For clicks, if a site appears twice in search results when grouped by site, and the user clicks on one link, backs up, then clicks the other link, it counts as a single click, since the final destination is the same site.
  • For position, when aggregating by site, the topmost position of your property in search results is reported; when grouped by page or search appearance, the topmost position of the page in search results is reported.

Mostly this affects the data where we're filtering by search appearance. No issues there but something to make a note of and keep in mind.

Since metrics like position are calculated based on the grouping (e.g. position by country+device is different from position by country and position by device), how fine-grained do we want to go here? Do we want to have clicks, impressions, CTR, and average position for each site URL in 8 datasets (the 4 splits below × {all results, rich results}, as sketched after this list):

  • overall
  • split by country
  • split by device
  • split by country and device
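
To make the question concrete, here's a rough sketch of how those 8 per-site request bodies could be enumerated (dimension names are from the Search Analytics API; the searchAppearance expression for rich results is a placeholder I'd still need to verify):

from itertools import product

# Sketch only: the 4 dimension splits x 2 result-appearance variants listed above.
dimension_splits = [
    ['date'],                        # overall
    ['date', 'country'],             # split by country
    ['date', 'device'],              # split by device
    ['date', 'country', 'device'],   # split by country and device
]
appearance_filters = [
    None,                            # all search results
    # rich results only; 'RICHCARD' is a placeholder expression, not verified
    [{'filters': [{'dimension': 'searchAppearance',
                   'operator': 'equals',
                   'expression': 'RICHCARD'}]}],
]

request_bodies = []
for dims, appearance in product(dimension_splits, appearance_filters):
    body = {'startDate': '2018-01-01', 'endDate': '2018-03-31', 'dimensions': dims}
    if appearance is not None:
        body['dimensionFilterGroups'] = appearance
    request_bodies.append(body)

print(len(request_bodies))  # 8 request bodies per site per date range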

Do we actually want to release this data publicly? If not, do we want to store the data as internal-only CSVs? A MySQL DB?

I'm using OAuth2 to authenticate and create credentials for using the API, which required me to sign in with my WMF Google account. As far as I can tell there is no other way to do this. (I've looked thoroughly.) Do we want to use my account for this in production? If so, I would need to be given access to every property. So…do we want to authenticate as the Great Webmaster account in production instead?
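
For context, the one-time interactive authorization looks roughly like this (a sketch using oauth2client; the client secrets and cache paths are placeholders):

import httplib2
from oauth2client import client, file, tools

SCOPE = 'https://www.googleapis.com/auth/webmasters.readonly'

# Sketch only: opens a browser so whichever Google account is signed in (mine, or e.g.
# the Great Webmaster account) can grant read-only Search Console access, then caches
# the resulting token so subsequent runs are non-interactive.
flow = client.flow_from_clientsecrets('client_secrets.json', scope=SCOPE)  # placeholder
storage = file.Storage('cache/creds.json')                                 # placeholder
credentials = storage.get()
if credentials is None or credentials.invalid:
    credentials = tools.run_flow(flow, storage)

# this authorized http object is what gets passed to build('webmasters', 'v3', http=http)
http = credentials.authorize(httplib2.Http())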

Furthermore, it occurred to me to check the Google APIs Terms of Service. Specifically, the section on content:

c. Retrieval of content

When a user's non-public content is obtained through the APIs, you may not expose that content to other users or to third parties without explicit opt-in consent from that user.

d. Data Portability

Google supports data portability. For as long as you use or store any user data that you obtained through the APIs, you agree to enable your users to export their equivalent data to other services or applications of their choice in a way that's substantially as fast and easy as exporting such data from Google products and services, subject to applicable laws, and you agree that you will not make that data available to third parties who do not also abide by this obligation.

e. Prohibitions on Content

Unless expressly permitted by the content owner or by applicable law, you will not, and will not permit your end users or others acting on your behalf to, do the following with content returned from the APIs:

  1. Scrape, build databases, or otherwise create permanent copies of such content, or keep cached copies longer than permitted by the cache header;
  2. Copy, translate, modify, create a derivative work of, sell, lease, lend, convey, distribute, publicly display, or sublicense to any third party;
  3. Misrepresent the source or ownership; or
  4. Remove, obscure, or alter any copyright, trademark, or other proprietary rights notices; or falsify or delete any author attributions, legal notices, or other labels of the origin or source of material.

I am concerned about the point the ToS makes about building a database or permanent copy from content returned from the APIs. We should seek advice from Legal before we proceed with this endeavor.

Call me crazy, but I bet if we ask Google for this data they will be happy to give it to us without having to set up web scraping/downloads.

Again, call me crazy, but I bet this data could be made 100% public by Google, such that you would not need authentication to query it; we would be able to do it, and so would any interested party. It seems that it would require a few conversations but little actual hands-on work.

There are a couple of initiatives where we are meeting with Google; pinging @DFoy on this ticket in case it is of interest.

Call me crazy, but I bet if we ask Google for this data they will be happy to give it to us without having to set up web scraping/downloads.
Again, call me crazy, but I bet this data could be made 100% public by Google, such that you would not need authentication to query it; we would be able to do it, and so would any interested party. It seems that it would require a few conversations but little actual hands-on work.

I actually already have a Python script that just gets the data via their API. We're just waiting to hear back from @Slaporte about the validity of storing it and also making it public (because Google's APIs have a ToS).

@DarTar This is the ticket I was talking about. I hope this data can help answer questions that the Research team is interested in.

FWIW, Google is currently rolling out a new version of the Search Console that (among other changes) makes some data available for a longer timespan - 16 months: https://webmasters.googleblog.com/2018/01/introducing-new-search-console.html
(E.g. right now we can look up clicks, impressions, CTR and position for https://es.wikipedia.org since September 30, 2016.)
According to the announcement post, "In the near future, this data will also be available via the Search Console API", so perhaps we should extend/modify Mikhail's tool to ingest that data too.

@mpopov do you need help accelerating the conversation?

@mpopov now that you have NOC credentials, can you use them to access the API and get all the properties? Also, the new console now shares 16 months of data, so you might be able to go back much further than you could have before.

Code is currently in a private repo under my GitHub account. @dr0ptp4kt @Tbayer @chelsyx: I added you as collaborators so you should be able to see it.

While it works completely fine locally, I'm having problems getting it to run on stat1005, so I've also added @mforns @Nuria @Ottomata to the repo, where I documented the output in an issue. Specifically, I get the following:

(myenv) bearloga@stat1005:~/gsc$ python fetch_data.py -v cache/creds-mpopov.json en.wikipedia.org none output/enwiki 2018-03-24 90
Loading credentials from cache/creds-mpopov.json
Traceback (most recent call last):
  File "fetch_data.py", line 61, in <module>
    webmasters_service = build('webmasters', 'v3', http=http)
  File "/srv/home/bearloga/myenv/lib/python3.5/site-packages/oauth2client/_helpers.py", line 133, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/srv/home/bearloga/myenv/lib/python3.5/site-packages/googleapiclient/discovery.py", line 229, in build
    requested_url, discovery_http, cache_discovery, cache)
  File "/srv/home/bearloga/myenv/lib/python3.5/site-packages/googleapiclient/discovery.py", line 276, in _retrieve_discovery_doc
    resp, content = http.request(actual_url)
  File "/srv/home/bearloga/myenv/lib/python3.5/site-packages/oauth2client/transport.py", line 175, in new_request
    redirections, connection_type)
  File "/srv/home/bearloga/myenv/lib/python3.5/site-packages/oauth2client/transport.py", line 282, in request
    connection_type=connection_type)
  File "/srv/home/bearloga/myenv/lib/python3.5/site-packages/httplib2/__init__.py", line 1509, in request
    (response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
  File "/srv/home/bearloga/myenv/lib/python3.5/site-packages/httplib2/__init__.py", line 1259, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/srv/home/bearloga/myenv/lib/python3.5/site-packages/httplib2/__init__.py", line 1182, in _conn_request
    conn.connect()
  File "/srv/home/bearloga/myenv/lib/python3.5/site-packages/httplib2/__init__.py", line 1006, in connect
    sock.connect((self.host, self.port))
  File "/srv/home/bearloga/myenv/lib/python3.5/site-packages/httplib2/socks.py", line 412, in connect
    if (not type(destpair) in (list,tuple)) or (len(destpair) < 2) or (not isinstance(destpair[0], basestring)) or (type(destpair[1]) != int):
NameError: name 'basestring' is not defined

which may be an issue with how httplib2 is installed by pip on that machine? For consistency, I am using Python 3.5 just like stat1005 does.
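
For what it's worth, the name in that error is a Python 2/3 compatibility issue: basestring only exists in Python 2, so a socks.py written against Python 2 will fail like this under Python 3.5. A minimal illustration (not a fix, just showing the incompatibility):

import sys

# `basestring` was removed in Python 3, so Python-2-era code that still references it
# (as the httplib2-bundled socks.py does in the traceback above) raises NameError.
if sys.version_info[0] >= 3:
    string_types = (str,)           # the Python 3 replacement
else:
    string_types = (basestring,)    # noqa: F821 -- only defined on Python 2

print(isinstance("en.wikipedia.org", string_types))  # True on either version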

Please follow up with me on helping fix the problem since there are credentials/access limitations with actually using the script.

Code is currently in a private repo under my GitHub account. @dr0ptp4kt @Tbayer @chelsyx: I added you as collaborators so you should be able to see it.

Errr, why are we running private code in production? Everything we do is supposed to be open source and free software.

"All software should be installed via puppetized packages only. Do not install software via pip, gem, pulling down a git repo from an external source, etc. If software that you need is not available, please work with Ops to get the package added to our repository." (L3)

Errr, why are we running private code in production? Everything we do is supposed to be open source and free software.

I was just testing it out to see if it would work. It doesn't. I'm still finalizing it and will open it up when it's ready to be used for real.

"All software should be installed via puppetized packages only. Do not install software via pip, gem, pulling down a git repo from an external source, etc. If software that you need is not available, please work with Ops to get the package added to our repository." (L3)

Will do.

Errr, why are we running private code in production? Everything we do is supposed to be open source and free software.

I was just testing it out to see if it would work. It doesn't. I'm still finalizing it and will open it up when it's ready to be used for real.

That doesn't matter at all? There's no exception for closed source code "until it's ready".

Errr, why are we running private code in production? Everything we do is supposed to be open source and free software.

I was just testing it out to see if it would work. It doesn't. I'm still finalizing it and will open it up when it's ready to be used for real.

That doesn't matter at all? There's no exception for closed source code "until it's ready".

Could you be a bit more precise about what you mean by "in production" here?

Errr, why are we running private code in production? Everything we do is supposed to be open source and free software.

I was just testing it out to see if it would work. It doesn't. I'm still finalizing it and will open it up when it's ready to be used for real.

That doesn't matter at all? There's no exception for closed source code "until it's ready".

Could you be a bit more precise about what you mean by "in production" here?

The bash prompt showed that the code was being run on stat1005. I was using "production" in the sense where pages like https://wikitech.wikimedia.org/wiki/Production_shell_access, https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Production_access, etc. apply. Though to be clear, closed source code isn't allowed in Cloud Services either, so it wouldn't really matter how production was being defined.

Not sure which closed-source stuff you are referring to. The library @mpopov is asking for is Apache licensed: https://github.com/google/google-api-python-client/blob/master/LICENSE

I think @Legoktm refers to the fact that @mpopov 's code lives in a private repo in his Github account.

I think @Legoktm refers to the fact that @mpopov 's code lives in a private repo in his Github account.

Not anymore :) it's open now: https://github.com/bearloga/wmf-gsc

With respect to productionizing it and quotas, we're going to be fine. The only thing is that there's a delay of about 2 days before a given day's data becomes available in the Search Console, so we'd have to account for that.

The daily stats are:

  • clicks
  • impressions
  • clickthrough rate (clicks/impressions)
  • average position (e.g. 4.5 means on average the wiki shows up in the top 5 results)

For each site, the utility fetches:

  • overall stats
  • stats split by country
  • stats split by device
  • stats split by country & device

(The reason we split by country and by device separately, in addition to country & device together, is that metrics like average position depend on the grouping.)

There are two sets of results for which we get stats: all search results and results that appear as rich cards (not all sites have this enabled or support this).

The NOC account has 994 verified domains (274 that are HTTP-only, 10 that are HTTPS-only, and 355 for which we have both variants).

So that's 994 sites × 4 sets of stats × 2 types of search result appearances = 7,952 API calls per day. There's a limit of 100,000,000 queries to the API per day, so we're good :P

Once we have the Google API Python library available (T190767) and @dr0ptp4kt reviews my code, we should be good to go. In the meantime, there's still the question of how to store and access this data. I just finished backfilling (stats for the most recent 90 days, which is the maximum possible) and have a bunch of CSVs that I wouldn't mind putting into a queryable database. @Nuria suggested Druid, which I like because it would make the data available in Pivot & Superset.
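
In case it helps that discussion: combining the backfilled CSVs for loading into whatever store we pick is straightforward; here's a sketch with a hypothetical file layout and column names:

import glob
import pandas as pd

# Hypothetical layout: one CSV per site/split, e.g. output/enwiki/overall.csv with
# columns like date, clicks, impressions, ctr, position (plus country/device where split).
frames = []
for path in glob.glob('output/*/*.csv'):          # placeholder paths
    df = pd.read_csv(path)
    df['source_file'] = path                      # keep track of which site/split it came from
    frames.append(df)

combined = pd.concat(frames, ignore_index=True, sort=False)
combined.to_csv('gsc_backfill_combined.csv', index=False)  # or load into Druid/MySQL from here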

I think @Legoktm refers to the fact that @mpopov 's code lives in a private repo in his Github account.

Not anymore :) it's open now: https://github.com/bearloga/wmf-gsc

Thanks, I think we're pretty close, it just needs a license now: https://github.com/bearloga/wmf-gsc/issues/4

I think @Legoktm refers to the fact that @mpopov 's code lives in a private repo in his Github account.

Not anymore :) it's open now: https://github.com/bearloga/wmf-gsc

Thanks, I think we're pretty close, it just needs a license now: https://github.com/bearloga/wmf-gsc/issues/4

Done! :) Available under Apache License 2.0 https://github.com/bearloga/wmf-gsc/blob/master/LICENSE.md

mpopov moved this task from Backlog to In progress on the Discovery-Analysis (Current work) board.
mpopov moved this task from Triage to Doing on the Product-Analytics board.
mpopov renamed this task from Set up mechanism for archiving Google Search Console data to [EPIC] Set up mechanism for archiving Google Search Console data.Jun 21 2018, 8:50 PM
mpopov updated the task description. (Show Details)
mpopov updated the task description. (Show Details)
mpopov removed subscribers: Ottomata, mforns, DarTar and 5 others.

Proof of concept dashboard up using the new test_gsc_all datasource in Druid: https://superset.wikimedia.org/superset/dashboard/wikipediagoogledemo/

Need to:

  • figure out how to compute clickthrough rate (CTR) in Superset
  • add desktop vs mobile as a dimension (right now we have to split by subdomain -- e.g. en vs en.m)
  • add language as a dimension
  • add CTR from the rich card results (separate datasource test_gsc_rich also up)

kzimmerman moved this task from Blocked to Doing on the Product-Analytics board.
kzimmerman subscribed.

GSC data is now scheduled to update weekly under Mikhail's account; we may consider a cron job in the future, but that would be separate work (which we should align with upcoming data warehousing best practices).