
Domain Ownership Verification on Various Search Properties
Closed, Resolved · Public

Description

Dear SRE Staff,

As we're trying to become more aware and conscious of our ecosystem, we'd like to keep an eye on how we do on search platforms outside of Google. To that end, there are folks in the Foundation who'd like to have access to the Bing and Yandex search/webmaster consoles, analogous to the Google Search Console, whose data we already use heavily.

This ticket is about how we prove to these various webmaster consoles that we indeed own the various Wikis.

Google

This is already solved. I'm not sure what we did there, but we already have this verified in Google Search Console.
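
For reference, the verification is a TXT record at the apex of the domain, visible with a quick dig (the output shown matches what's in our zones at the time of writing):

$ dig +short TXT wikipedia.org | grep google-site-verification
"google-site-verification=AMHkgs-4ViEvIJf5znZle-BSE2EPNFqM1nDJGRyn2qk"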

Bing

There are essentially two ways to verify this. One is to use OAuth to allow Bing to read, via your Google Search Console access, the list of sites that Google has verified for you; Microsoft will then just trust that implicitly. The problem is that the API scope for querying Google Search Console analytics data and for listing owned sites is the same, so theoretically Microsoft could use the OAuth token to read all analytics data for the Wikis into Bing. That would be extraordinarily bad form, but while we should probably be cautious, I'm putting this option out there for completeness.
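
To make the scope overlap concrete, here's a sketch against the (legacy v3) Search Console API; the bearer token is hypothetical, and both calls are gated by the same https://www.googleapis.com/auth/webmasters scope:

# Listing the sites verified for the token's account (all Bing needs):
curl -s -H "Authorization: Bearer $OAUTH_TOKEN" \
  "https://www.googleapis.com/webmasters/v3/sites"

# ...but the same scope also permits reading search analytics:
curl -s -X POST -H "Authorization: Bearer $OAUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"startDate":"2022-01-01","endDate":"2022-01-31","dimensions":["query"]}' \
  "https://www.googleapis.com/webmasters/v3/sites/https%3A%2F%2Fen.wikipedia.org%2F/searchAnalytics/query"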

The other way to do this is to add a CNAME entry. Example (fudged):
b3506af62206ef714873baca9cadec92 with value verify.bing.com.
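
In zone-file terms, that record would look roughly like this (same fudged token; TTL illustrative):

b3506af62206ef714873baca9cadec92.wikipedia.org. 600 IN CNAME verify.bing.com.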

There are two other methods, namely a) putting a specified XML file in the web root, and b) putting a meta tag in the index page for a site. I'm going to assume those are going to be too much of a hassle to do.

Yandex

There's a Yandex webmaster console as well, and they have, in addition to the XML file and meta tag methods, a DNS verification method. This time it's a TXT entry that contains this (fudged):

yandex-verification: c9c9d3ba2d2f5273
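
As a zone-file record, again roughly (fudged token; TTL illustrative):

wikipedia.org. 600 IN TXT "yandex-verification: c9c9d3ba2d2f5273"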

Questions
  1. How would SRE prefer to do this? DNS, XML files, or meta tags?
  2. How would this scale to the 100-odd domains that host Wikipedias? One way that comes to mind: we could start with, say, the first 20 Wikipedias by language and then add the rest on demand.

If SRE is OK with proceeding, I could basically generate all of the data (XML files, DNS entries, or meta tags) in a set of text files and send it to you. Open to other ideas.
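
As a rough sketch of what that generation could look like (assuming a langs.txt with one language code per line, and a hypothetical get_token helper that looks up the per-domain token exported from each console):

#!/bin/bash
# Emit one Yandex TXT record and one Bing CNAME record per wiki domain.
# get_token is hypothetical; the real tokens come from each webmaster console.
while read -r lang; do
  domain="${lang}.wikipedia.org"
  echo "${domain}. 600 IN TXT \"yandex-verification: $(get_token yandex "${domain}")\""
  echo "$(get_token bing "${domain}").${domain}. 600 IN CNAME verify.bing.com."
done < langs.txt > verification-records.txt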

CC'd my manager Adam Baso given that this is somewhat sensitive.

Event Timeline

In case someone's wondering, DuckDuckGo doesn't actually have a webmaster console. Strange.

The answer to most of this was established at T298723: verification will be done through DNS in all cases, as was done for Google (for centralization, uniformity, and separation of concerns). DNS is handled through this repository: https://gerrit.wikimedia.org/g/operations/dns. That part is handled by SREs, so it's a non-blocker; I can do it myself (with oversight from Traffic, which owns the DNS servers).

What we need help with is making sure that traffic from search engines is not affected after setting this up: that the different consoles are set up properly, that errors are followed up on, that crawling happens normally afterwards, etc., something that people on Reading may have more control over and awareness of. 0:-) Given that a large amount of Wikipedia traffic comes from search engines, we need someone with SEO knowledge owning this, making sure everything looks OK afterwards, and basically supporting us (and we'll support you) during the process. :-) We should also organize/document the process for granting access the same way it is documented for Google: https://wikitech.wikimedia.org/wiki/Google_Search_Console_access so we can process requests like T298723 for Andrew.

How would this scale to the 100-odd domains that host wikipedias

That's the part we also need help with: Google was set up a long time ago, so long ago that those who did it may not be working here anymore, so you may have to help us search through git to see how it was handled back then. Once you get access at T302625, hopefully we can work hand in hand on this. I (or any other SRE) can handle the infra side (DNS updates, access control), no problem there. You can ping me on IRC to sync up further; I believe we may be in compatible timezones. :-) Thank you for your help!

Checking the DNS records, there seem to be entries only for the 7 or so top-level domains (e.g. wikipedia.org), and maybe that was enough for all subdomains? Do you know if that would work for the other search engines?
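
A quick way to check what's currently there (a sketch; +short keeps only the record data):

# Does a language subdomain carry its own verification TXT, or only the apex?
dig +short TXT wikipedia.org
dig +short TXT en.wikipedia.org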

You're absolutely right to be concerned about traffic from search engines. That said, I'm familiar enough with how this works to be comfortable owning it, and my PM counterpart and I (and a half dozen or so other people at the Foundation, including @AndyRussG) gaze at this data often enough that we'd know if something were amiss, and we'll obviously be watchful if we decide to make any changes at all.

Apropos DNS records: I'll do a bit of poking around and get back to you on IRC with the exact specifics of how we did it for Google and let's take it from there. Thanks for the prompt reply!

Thanks to you for working on this!

Thanks so much once again @SCherukuwada and @jcrespo for your careful attention to all these important details!!!! :) :)

Thank you for stepping up to own this! SRE has been safeguarding access to the Google console as the "gateway of last resort" with regard to access controls, but we have no real attachment and would absolutely welcome your involvement here. I see you're already in @jcrespo's good hands, but let me know if you'd like any additional help from SRE. Happy to help in any way!

JMeybohm triaged this task as Medium priority. Mar 2 2022, 10:43 AM

Just had a discussion with @jcrespo. To understand what each of these webmaster consoles provides and what ACLs and such they support (so that we can have a process around giving access to it), we want to first add a few domains to each and learn more about the platforms. Stand by for a DNS patch.

Mentioned in SAL (#wikimedia-operations) [2022-03-07T09:36:18Z] <jynus> updated non-A wikipedia.org DNS records T302617

Looking good:

root@authdns1001:~$ for i in 0 1 2 ; do dig @ns${i}.wikimedia.org -t txt wikipedia.org ; done 
; <<>> DiG 9.11.5-P4-5.1+deb10u6-Debian <<>> @ns0.wikimedia.org -t txt wikipedia.org
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1490
;; flags: qr aa rd; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1024
; COOKIE: 1ce54dc4ac55eccc09a389e901194e77 (good)
;; QUESTION SECTION:
;wikipedia.org.                 IN      TXT

;; ANSWER SECTION:
wikipedia.org.          600     IN      TXT     "google-site-verification=AMHkgs-4ViEvIJf5znZle-BSE2EPNFqM1nDJGRyn2qk"
wikipedia.org.          600     IN      TXT     "yandex-verification: 35c08d23099dc863"
wikipedia.org.          600     IN      TXT     "v=spf1 include:wikimedia.org ~all"

;; Query time: 0 msec
;; SERVER: 208.80.154.238#53(208.80.154.238)
;; WHEN: Mon Mar 07 09:34:40 UTC 2022
;; MSG SIZE  rcvd: 239


; <<>> DiG 9.11.5-P4-5.1+deb10u6-Debian <<>> @ns1.wikimedia.org -t txt wikipedia.org
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 9809
;; flags: qr aa rd; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1024
; COOKIE: 05949551fb0b5312859b4e307693af9a (good)
;; QUESTION SECTION:
;wikipedia.org.                 IN      TXT

;; ANSWER SECTION:
wikipedia.org.          600     IN      TXT     "google-site-verification=AMHkgs-4ViEvIJf5znZle-BSE2EPNFqM1nDJGRyn2qk"
wikipedia.org.          600     IN      TXT     "yandex-verification: 35c08d23099dc863"
wikipedia.org.          600     IN      TXT     "v=spf1 include:wikimedia.org ~all"

;; Query time: 0 msec
;; SERVER: 208.80.153.231#53(208.80.153.231)
;; WHEN: Mon Mar 07 09:34:40 UTC 2022
;; MSG SIZE  rcvd: 239


; <<>> DiG 9.11.5-P4-5.1+deb10u6-Debian <<>> @ns2.wikimedia.org -t txt wikipedia.org
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 11588
;; flags: qr aa rd; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1024
; COOKIE: 3f3a979d5e4655fbc74a69678e600d68 (good)
;; QUESTION SECTION:
;wikipedia.org.                 IN      TXT

;; ANSWER SECTION:
wikipedia.org.          600     IN      TXT     "google-site-verification=AMHkgs-4ViEvIJf5znZle-BSE2EPNFqM1nDJGRyn2qk"
wikipedia.org.          600     IN      TXT     "yandex-verification: 35c08d23099dc863"
wikipedia.org.          600     IN      TXT     "v=spf1 include:wikimedia.org ~all"

;; Query time: 0 msec
;; SERVER: 91.198.174.239#53(91.198.174.239)
;; WHEN: Mon Mar 07 09:34:40 UTC 2022
;; MSG SIZE  rcvd: 239
✔️ root@authdns1001:~$ for i in 0 1 2 ; do dig @ns${i}.wikimedia.org -t cname 57d67d15e4e75c82ea6260c959068739.wikipedia.org ; done

; <<>> DiG 9.11.5-P4-5.1+deb10u6-Debian <<>> @ns0.wikimedia.org -t cname 57d67d15e4e75c82ea6260c959068739.wikipedia.org
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 17724
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1024
; COOKIE: b9334400f009df05adab7e508cb5e126 (good)
;; QUESTION SECTION:
;57d67d15e4e75c82ea6260c959068739.wikipedia.org.        IN CNAME

;; ANSWER SECTION:
57d67d15e4e75c82ea6260c959068739.wikipedia.org. 600 IN CNAME verify.bing.com.

;; Query time: 0 msec
;; SERVER: 208.80.154.238#53(208.80.154.238)
;; WHEN: Mon Mar 07 09:38:18 UTC 2022
;; MSG SIZE  rcvd: 124


; <<>> DiG 9.11.5-P4-5.1+deb10u6-Debian <<>> @ns1.wikimedia.org -t cname 57d67d15e4e75c82ea6260c959068739.wikipedia.org
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12801
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1024
; COOKIE: 8e175d0f4ef6456b3948402c3ae079d6 (good)
;; QUESTION SECTION:
;57d67d15e4e75c82ea6260c959068739.wikipedia.org.        IN CNAME

;; ANSWER SECTION:
57d67d15e4e75c82ea6260c959068739.wikipedia.org. 600 IN CNAME verify.bing.com.

;; Query time: 0 msec
;; SERVER: 208.80.153.231#53(208.80.153.231)
;; WHEN: Mon Mar 07 09:38:18 UTC 2022
;; MSG SIZE  rcvd: 124


; <<>> DiG 9.11.5-P4-5.1+deb10u6-Debian <<>> @ns2.wikimedia.org -t cname 57d67d15e4e75c82ea6260c959068739.wikipedia.org
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50989
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1024
; COOKIE: b25d91ca9bd39fe64cc0fdfd6b54ce17 (good)
;; QUESTION SECTION:
;57d67d15e4e75c82ea6260c959068739.wikipedia.org.        IN CNAME

;; ANSWER SECTION:
57d67d15e4e75c82ea6260c959068739.wikipedia.org. 600 IN CNAME verify.bing.com.

;; Query time: 0 msec
;; SERVER: 91.198.174.239#53(91.198.174.239)
;; WHEN: Mon Mar 07 09:38:18 UTC 2022
;; MSG SIZE  rcvd: 124

I confirm that Bing.com verification has worked properly. However, for Yandex it seems they need the TXT entry to be under www.wikipedia.org and not wikipedia.org. Sent out patch https://gerrit.wikimedia.org/r/c/operations/dns/+/768664 to that effect.

It got a bit trickier.

You can't add a TXT entry for www when www already exists as a CNAME. And even if we could, it might not help in this case. Here's why:

On Yandex, when you request verification for wikipedia.org, they automatically change it to www.wikipedia.org with this justification:

"We replaced the site URL in the line with the main mirror, because only the main mirror of the site can participate in search results. The wikipedia.org URL is a secondary mirror. You can add it to Yandex.Webmaster after you add and verify the rights for https://www.wikipedia.org/"

and then they ask you to add a TXT entry for that domain specifically. I think they basically read www.wikipedia.org off the 301 response to wikipedia.org. None of their other verification methods are DNS-based; they involve uploading a file under www.wikipedia.org/ or including a meta tag on the page served by www.wikipedia.org.
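
That theory is easy to sanity-check from a shell (a sketch; the Location header is what I'd expect them to read):

curl -sI https://wikipedia.org/ | grep -i '^location'
# expected, at the time of writing: Location: https://www.wikipedia.org/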

This leads me to believe that they have no so-called "Domain Property" ownership mechanism, and that we will need to verify each [language].wikipedia.org domain separately. That makes this a bigger hassle, and it also doesn't solve the fundamental problem, given that each language subdomain is also a CNAME entry.

Investigating.

jbond added subscribers: BBlack, jbond.

You can't add a TXT entry for www when www exists as a CNAME.

Indeed, as per RFC 1034 §3.6.2:

If a CNAME RR is present at a node, no other data should be present

This is quite fundamental to how we manage the CDN, so I don't think there will be a way around this, but @BBlack should be able to give an authoritative answer.
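
For illustration, the CNAME at the www node is easy to see (a sketch; the target shown is what I'd expect from our CDN setup at the time of writing):

dig +short CNAME www.wikipedia.org
# dyna.wikimedia.org.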

I think we can close this task for now.

In the time since filing it, there hasn't been a real need to understand how we're doing on Yandex or Bing. Google has already been set up appropriately.

SCherukuwada claimed this task.