
[EPIC] Set up mechanism for archiving Google Search Console data
Closed, ResolvedPublic

Description

Per some recent discussions around our presence in search results and SEO in general, there is renewed interest in better understanding and monitoring the Google Search Console (GSC) data about our sites (example - special access required). Google only makes this available for the last 90 days at any given point in time (16 months of data is slowly becoming available via the new GSC interface, although not yet available through the API), and this limit prevents us from getting a better understanding of longer-term trends in this data (and separating them from short-term changes).

This task is about setting up an automated mechanism to regularly download the data of interest via the Search Console API and store it on our own servers, in a form suitable for analysis (e.g. CSVs and/or Druid with a Turnilo/Superset visualization front-end). Parts of the data to prioritize for storing:

Store these daily numbers for each site:

  • Clicks
  • Impressions
  • CTR
  • (average) Position

Do NOT store for now:

  • top keywords list

Filtered/split by:

  • Country
  • Device type (Desktop/Mobile/Tablet)
  • Search appearance: rich results vs all results (note: GSC does not allow stats for non-rich results)

Note: back in 2015 the then Discovery team made related attempts (e.g. T116822, T101158), which were abandoned after it was decided that SEO was not in the team's core scope at the time.

Event Timeline

JKatzWMF updated the task description. (Show Details)
JKatzWMF awarded a token.

Status update: JK will look into giving Chelsy and/or me some kind of access so we can take a look at it. I have previous experience with Google APIs and building R bindings to web APIs, so that will be helpful here :D

debt renamed this task from Set up mechanism for archiving Search Console data to Set up mechanism for archiving Google Search Console data.Aug 24 2017, 8:23 PM
debt triaged this task as Medium priority.
debt added a project: Discovery-ARCHIVED.

Okay, I've got something working for fetching this data, which means it's time to consider some aspects we haven't before.
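
For reference, the core of what I'm doing is a Search Analytics query against the webmasters v3 API. Here's a minimal sketch (not the actual script; the credential cache path, dates, and site URL are placeholders):

import httplib2
from googleapiclient.discovery import build
from oauth2client.file import Storage

# Sketch only: load previously cached OAuth2 credentials and run one Search Analytics
# query for a single property.
credentials = Storage('cache/creds.json').get()           # placeholder path
http = credentials.authorize(httplib2.Http())
webmasters_service = build('webmasters', 'v3', http=http)

request = {
    'startDate': '2018-01-01',                            # placeholder date range
    'endDate': '2018-03-31',
    'dimensions': ['date'],                               # add 'country'/'device' for the splits
    'rowLimit': 5000,
}
response = webmasters_service.searchanalytics().query(
    siteUrl='https://en.wikipedia.org/', body=request).execute()

for row in response.get('rows', []):
    # each row has keys (the dimension values), clicks, impressions, ctr, and position
    print(row['keys'][0], row['clicks'], row['impressions'], row['ctr'], row['position'])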

According to this documentation, if you group, filter, or compare by page or search appearance, all metrics in the report are aggregated by page; otherwise, all metrics are aggregated by site. Because of the different accounting methods, the click-through rate and average position are higher when aggregating by site if multiple pages from the same site appear in the search results.

  • For impressions, if a site appears twice on a search results page when aggregating by site, it counts as a single impression; if grouping by page or search appearance, each unique page is counted separately.
  • For clicks, if a site appears twice in search results when grouped by site, and the user clicks on one link, backs up, then clicks the other link, it counts as a single click, since the final destination is the same site.
  • For position, when aggregating by site, the topmost position of your property in search results is reported; when grouped by page or search appearance, the topmost position of the page in search results is reported.

Mostly this affects the data where we're filtering by search appearance. No issues there but something to make a note of and keep in mind.

Since metrics like position are calculated based on the grouping (e.g. position by country+device is different from position by country and position by device), how fine-grained do we want to go here? Do we want to have clicks, impressions, CTR, and average position for each site URL in 8 datasets (the 4 splits below × {all results, rich results}, as sketched after this list):

  • overall
  • split by country
  • split by device
  • split by country and device
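
To make the question concrete, here's a rough sketch of how those 8 per-site request bodies could be enumerated (dimension names are from the Search Analytics API; the searchAppearance expression for rich results is a placeholder I'd still need to verify):

from itertools import product

# Sketch only: the 4 dimension splits x 2 result-appearance variants listed above.
dimension_splits = [
    ['date'],                        # overall
    ['date', 'country'],             # split by country
    ['date', 'device'],              # split by device
    ['date', 'country', 'device'],   # split by country and device
]
appearance_filters = [
    None,                            # all search results
    # rich results only; 'RICHCARD' is a placeholder expression, not verified
    [{'filters': [{'dimension': 'searchAppearance',
                   'operator': 'equals',
                   'expression': 'RICHCARD'}]}],
]

request_bodies = []
for dims, appearance in product(dimension_splits, appearance_filters):
    body = {'startDate': '2018-01-01', 'endDate': '2018-03-31', 'dimensions': dims}
    if appearance is not None:
        body['dimensionFilterGroups'] = appearance
    request_bodies.append(body)

print(len(request_bodies))  # 8 request bodies per site per date range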

Do we actually want to release this data publicly? If not, do we want to store the data as internal-only CSVs? A MySQL DB?

I'm using OAuth2 to authenticate and create credentials for using the API, which required me to sign in with my WMF Google account. As far as I can tell there is no other way to do this. (I've looked thoroughly.) Do we want to use my account for this in production? If so, I would need to be given access to every property. So…do we want to authenticate as the Great Webmaster account in production instead?
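
For context, the one-time interactive authorization looks roughly like this (a sketch using oauth2client; the client secrets and cache paths are placeholders):

import httplib2
from oauth2client import client, file, tools

SCOPE = 'https://www.googleapis.com/auth/webmasters.readonly'

# Sketch only: opens a browser so whichever Google account is signed in (mine, or e.g.
# the Great Webmaster account) can grant read-only Search Console access, then caches
# the resulting token so subsequent runs are non-interactive.
flow = client.flow_from_clientsecrets('client_secrets.json', scope=SCOPE)  # placeholder
storage = file.Storage('cache/creds.json')                                 # placeholder
credentials = storage.get()
if credentials is None or credentials.invalid:
    credentials = tools.run_flow(flow, storage)

# this authorized http object is what gets passed to build('webmasters', 'v3', http=http)
http = credentials.authorize(httplib2.Http())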

Furthermore, it occurred to me to check the Google APIs Terms of Service. Specifically, the section on content:

c. Retrieval of content

When a user's non-public content is obtained through the APIs, you may not expose that content to other users or to third parties without explicit opt-in consent from that user.

d. Data Portability

Google supports data portability. For as long as you use or store any user data that you obtained through the APIs, you agree to enable your users to export their equivalent data to other services or applications of their choice in a way that's substantially as fast and easy as exporting such data from Google products and services, subject to applicable laws, and you agree that you will not make that data available to third parties who do not also abide by this obligation.

e. Prohibitions on Content

Unless expressly permitted by the content owner or by applicable law, you will not, and will not permit your end users or others acting on your behalf to, do the following with content returned from the APIs:

  1. Scrape, build databases, or otherwise create permanent copies of such content, or keep cached copies longer than permitted by the cache header;
  2. Copy, translate, modify, create a derivative work of, sell, lease, lend, convey, distribute, publicly display, or sublicense to any third party;
  3. Misrepresent the source or ownership; or
  4. Remove, obscure, or alter any copyright, trademark, or other proprietary rights notices; or falsify or delete any author attributions, legal notices, or other labels of the origin or source of material.

I am concerned about the point the ToS makes about building a database or permanent copy from content returned from the APIs. We should seek advice from Legal before we proceed with this endeavor.

Call me crazy, but I bet if we ask Google for this data they will be happy to give it to us without having to set up web scraping/downloads.

Again, call me crazy, but I bet this data could be made 100% public by Google, such that you would not need authentication to query it; we would be able to do it, and so would any interested party. It seems that it would require a few conversations but little actual hands-on work.

There are a couple of initiatives where we are meeting with Google; pinging @DFoy on this ticket in case it is of interest.

Call me crazy, but I bet if we ask Google for this data they will be happy to give it to us without having to set up web scraping/downloads.
Again, call me crazy, but I bet this data could be made 100% public by Google, such that you would not need authentication to query it; we would be able to do it, and so would any interested party. It seems that it would require a few conversations but little actual hands-on work.

I actually already have a Python script that just gets the data via their API. We're just waiting to hear back from @Slaporte about the validity of storing it and also making it public (because Google's APIs have a ToS).

@DarTar This is the ticket I was talking about. I hope this data can help answer questions that the Research team is interested in.

FWIW, Google is currently rolling out a new version of the Search Console that (among other changes) makes some data available for a longer timespan - 16 months: https://webmasters.googleblog.com/2018/01/introducing-new-search-console.html
(E.g. right now we can look up clicks, impressions, CTR and position for https://es.wikipedia.org since September 30, 2016.)
According to the announcement post, "In the near future, this data will also be available via the Search Console API", so perhaps we should extend/modify Mikhail's tool to ingest that data too.

@mpopov do you need help accelerating the conversation?

@mpopov now that you have NOC credentials, can you use them to access the API and get all the properties? Also, the new console now shares 16 months of data, so you might be able to go back much further than you could have before.

Code is currently in a private repo under my GitHub account. @dr0ptp4kt @Tbayer @chelsyx: I added you as collaborators so you should be able to see it.

While it works completely fine locally, I'm having problems getting it to run on stat1005, so I've also added @mforns @Nuria @Ottomata to the repo, where I documented the output in an issue. Specifically, I get the following:

(myenv) bearloga@stat1005:~/gsc$ python fetch_data.py -v cache/creds-mpopov.json en.wikipedia.org none output/enwiki 2018-03-24 90
Loading credentials from cache/creds-mpopov.json
Traceback (most recent call last):
  File "fetch_data.py", line 61, in <module>
    webmasters_service = build('webmasters', 'v3', http=http)
  File "/srv/home/bearloga/myenv/lib/python3.5/site-packages/oauth2client/_helpers.py", line 133, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/srv/home/bearloga/myenv/lib/python3.5/site-packages/googleapiclient/discovery.py", line 229, in build
    requested_url, discovery_http, cache_discovery, cache)
  File "/srv/home/bearloga/myenv/lib/python3.5/site-packages/googleapiclient/discovery.py", line 276, in _retrieve_discovery_doc
    resp, content = http.request(actual_url)
  File "/srv/home/bearloga/myenv/lib/python3.5/site-packages/oauth2client/transport.py", line 175, in new_request
    redirections, connection_type)
  File "/srv/home/bearloga/myenv/lib/python3.5/site-packages/oauth2client/transport.py", line 282, in request
    connection_type=connection_type)
  File "/srv/home/bearloga/myenv/lib/python3.5/site-packages/httplib2/__init__.py", line 1509, in request
    (response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
  File "/srv/home/bearloga/myenv/lib/python3.5/site-packages/httplib2/__init__.py", line 1259, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/srv/home/bearloga/myenv/lib/python3.5/site-packages/httplib2/__init__.py", line 1182, in _conn_request
    conn.connect()
  File "/srv/home/bearloga/myenv/lib/python3.5/site-packages/httplib2/__init__.py", line 1006, in connect
    sock.connect((self.host, self.port))
  File "/srv/home/bearloga/myenv/lib/python3.5/site-packages/httplib2/socks.py", line 412, in connect
    if (not type(destpair) in (list,tuple)) or (len(destpair) < 2) or (not isinstance(destpair[0], basestring)) or (type(destpair[1]) != int):
NameError: name 'basestring' is not defined

which may be an issue with how httplib2 is installed by pip on that machine? For consistency, I am using Python 3.5 just like stat1005 does.
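
For what it's worth, the name in that error is a Python 2/3 compatibility issue: basestring only exists in Python 2, so a socks.py written against Python 2 will fail like this under Python 3.5. A minimal illustration (not a fix, just showing the incompatibility):

import sys

# `basestring` was removed in Python 3, so Python-2-era code that still references it
# (as the httplib2-bundled socks.py does in the traceback above) raises NameError.
if sys.version_info[0] >= 3:
    string_types = (str,)           # the Python 3 replacement
else:
    string_types = (basestring,)    # noqa: F821 -- only defined on Python 2

print(isinstance("en.wikipedia.org", string_types))  # True on either version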

Please follow up with me on helping fix the problem since there are credentials/access limitations with actually using the script.

Code is currently in a private repo under my GitHub account. @dr0ptp4kt @Tbayer @chelsyx: I added you as collaborators so you should be able to see it.

Errr, why are we running private code in production? Everything we do is supposed to be open source and free software.

"All software should be installed via puppetized packages only. Do not install software via pip, gem, pulling down a git repo from an external source, etc. If software that you need is not available, please work with Ops to get the package added to our repository." (L3)

Errr, why are we running private code in production? Everything we do is supposed to be open source and free software.

I was just testing it out to see if it would work. It doesn't. I'm still finalizing it and will open it up when it's ready to be used for real.

"All software should be installed via puppetized packages only. Do not install software via pip, gem, pulling down a git repo from an external source, etc. If software that you need is not available, please work with Ops to get the package added to our repository." (L3)

Will do.

Errr, why are we running private code in production? Everything we do is supposed to be open source and free software.

I was just testing it out to see if it would work. It doesn't. I'm still finalizing it and will open it up when it's ready to be used for real.

That doesn't matter at all? There's no exception for closed source code "until it's ready".

Errr, why are we running private code in production? Everything we do is supposed to be open source and free software.

I was just testing it out to see if it would work. It doesn't. I'm still finalizing it and will open it up when it's ready to be used for real.

That doesn't matter at all? There's no exception for closed source code "until it's ready".

Could you be a bit more precise about what you mean by "in production" here?

Errr, why are we running private code in production? Everything we do is supposed to be open source and free software.

I was just testing it out to see if it would work. It doesn't. I'm still finalizing it and will open it up when it's ready to be used for real.

That doesn't matter at all? There's no exception for closed source code "until it's ready".

Could you be a bit more precise about what you mean by "in production" here?

The bash prompt showed that the code was being run on stat1005. I was using "production" in the sense where pages like https://wikitech.wikimedia.org/wiki/Production_shell_access, https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Production_access, etc. apply. Though to be clear, closed source code isn't allowed in Cloud Services either, so it wouldn't really matter how production was being defined.

Not sure which closed-source stuff you are referring to. The library @mpopov is asking for is Apache licensed: https://github.com/google/google-api-python-client/blob/master/LICENSE

I think @Legoktm refers to the fact that @mpopov 's code lives in a private repo in his Github account.

I think @Legoktm refers to the fact that @mpopov 's code lives in a private repo in his Github account.

Not anymore :) it's open now: https://github.com/bearloga/wmf-gsc

With respect to productionizing it and quotas, we're going to be fine. The only thing is that there's a delay of about 2 days before a given day's data becomes available in the Search Console, so we'd have to account for that.

The daily stats are:

  • clicks
  • impressions
  • clickthrough rate (clicks/impressions)
  • average position (e.g. 4.5 means on average the wiki shows up in the top 5 results)

For each site, the utility fetches:

  • overall stats
  • stats split by country
  • stats split by device
  • stats split by country & device

(The reason we split by country and by device separately, in addition to country & device together, is that metrics like average position depend on the grouping.)

There are two sets of results for which we get stats: all search results and results that appear as rich cards (not all sites have this enabled or support this).

The NOC account has 994 verified domains (274 that are HTTP-only, 10 that are HTTPS-only, and 355 for which we have both variants).

So that's 994 sites × 4 sets of stats × 2 types of search result appearances = 7,952 API calls per day. There's a limit of 100,000,000 queries to the API per day, so we're good :P

Once we have the Google API Python library available (T190767) and @dr0ptp4kt reviews my code, we should be good to go. In the meantime, there's still the question of how to store and access this data. I just finished backfilling (stats for the most recent 90 days, which is the maximum possible) and have a bunch of CSVs that I wouldn't mind putting into a queryable database. @Nuria suggested Druid, which I like because it would make the data available in Pivot & Superset.
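
In case it helps that discussion: combining the backfilled CSVs for loading into whatever store we pick is straightforward; here's a sketch with a hypothetical file layout and column names:

import glob
import pandas as pd

# Hypothetical layout: one CSV per site/split, e.g. output/enwiki/overall.csv with
# columns like date, clicks, impressions, ctr, position (plus country/device where split).
frames = []
for path in glob.glob('output/*/*.csv'):          # placeholder paths
    df = pd.read_csv(path)
    df['source_file'] = path                      # keep track of which site/split it came from
    frames.append(df)

combined = pd.concat(frames, ignore_index=True, sort=False)
combined.to_csv('gsc_backfill_combined.csv', index=False)  # or load into Druid/MySQL from here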

I think @Legoktm refers to the fact that @mpopov 's code lives in a private repo in his Github account.

Not anymore :) it's open now: https://github.com/bearloga/wmf-gsc

Thanks, I think we're pretty close, it just needs a license now: https://github.com/bearloga/wmf-gsc/issues/4

I think @Legoktm refers to the fact that @mpopov 's code lives in a private repo in his Github account.

Not anymore :) it's open now: https://github.com/bearloga/wmf-gsc

Thanks, I think we're pretty close, it just needs a license now: https://github.com/bearloga/wmf-gsc/issues/4

Done! :) Available under Apache License 2.0 https://github.com/bearloga/wmf-gsc/blob/master/LICENSE.md

mpopov moved this task from Backlog to In progress on the Discovery-Analysis (Current work) board.
mpopov moved this task from Triage to Doing on the Product-Analytics board.
mpopov renamed this task from Set up mechanism for archiving Google Search Console data to [EPIC] Set up mechanism for archiving Google Search Console data.Jun 21 2018, 8:50 PM
mpopov updated the task description. (Show Details)
mpopov updated the task description. (Show Details)
mpopov removed subscribers: Ottomata, mforns, DarTar and 5 others.

Proof of concept dashboard up using the new test_gsc_all datasource in Druid: https://superset.wikimedia.org/superset/dashboard/wikipediagoogledemo/

Need to:

  • figure out how to compute clickthrough rate (CTR) in Superset
  • add desktop vs mobile as a dimension (right now we have to split by subdomain -- e.g. en vs en.m)
  • add language as a dimension
  • add CTR from the rich card results (separate datasource test_gsc_rich also up)

kzimmerman moved this task from Blocked to Doing on the Product-Analytics board.
kzimmerman subscribed.

GSC data is now scheduled to update weekly under Mikhail's account; we may consider a cron job in the future, but that would be separate work (which we should align with upcoming data warehousing best practices).