
Consider using the alternate chain of Google Trust Services certificates
Stalled, Medium, Public

Description

We are currently using the main chain provided by GTS via the ACME API:

Certificate chain                                                                       
 0 s:CN = upload.wikimedia.org
   i:C = US, O = Google Trust Services, CN = WR1
   a:PKEY: id-ecPublicKey, 256 (bit); sigalg: RSA-SHA256
   v:NotBefore: Jun 17 15:07:47 2025 GMT; NotAfter: Sep 15 15:07:46 2025 GMT
 1 s:C = US, O = Google Trust Services, CN = WR1
   i:C = US, O = Google Trust Services LLC, CN = GTS Root R1
   a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
   v:NotBefore: Dec 13 09:00:00 2023 GMT; NotAfter: Feb 20 14:00:00 2029 GMT
 2 s:C = US, O = Google Trust Services LLC, CN = GTS Root R1
   i:C = BE, O = GlobalSign nv-sa, OU = Root CA, CN = GlobalSign Root CA
   a:PKEY: rsaEncryption, 4096 (bit); sigalg: RSA-SHA256
   v:NotBefore: Jun 19 00:00:42 2020 GMT; NotAfter: Jan 28 00:00:42 2028 GMT
---

GTS also offers a shorter chain that assumes GTS Root R1 is trusted by the user agent; we need to check its availability via the ACME API before considering the change.
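Checking availability should be straightforward: ACME (RFC 8555 §7.4.2) lets the server advertise alternate chains as Link headers with rel="alternate" on the certificate-download response, which is what clients like certbot's --preferred-chain build on. A minimal sketch of parsing such a header (the header value and URLs below are made up for illustration):

```python
# Sketch: discover alternate chains offered by an ACME server (RFC 8555 §7.4.2).
# The server lists them as Link headers with rel="alternate" on the
# certificate-download response. The header value here is illustrative only.
import re

def alternate_chain_urls(link_header: str) -> list[str]:
    """Extract URLs carrying rel="alternate" from an HTTP Link header value."""
    urls = []
    for url, params in re.findall(r'<([^>]+)>([^,]*)', link_header):
        if re.search(r'rel\s*=\s*"?alternate"?', params):
            urls.append(url)
    return urls

# Hypothetical Link header as it might appear on a certificate download:
link = ('<https://acme.example/cert/abc123/1>;rel="alternate", '
        '<https://acme.example/directory>;rel="index"')
print(alternate_chain_urls(link))  # ['https://acme.example/cert/abc123/1']
```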

Event Timeline

Vgutierrez triaged this task as Medium priority. Jul 3 2025, 9:27 AM

We’re evaluating switching from the current GTS certificate chain (which includes a cross-signed GlobalSign root) to the shorter chain that assumes GTS Root R1 is directly trusted.

GTS states:

“Currently Google Trust Services is trusted by Microsoft, Mozilla, Safari, Cisco, Oracle Java, Qihoo's 360 browser and Chrome. All browsers or operating systems that depend on these root programs are covered.” — source

Quick confirmations:

The Google Maps FAQ gives additional insight into mobile platform support:

  • Android: Support for GTS Root R1 may be limited in versions <10 (released 2019)
  • iOS: GTS Root R1 supported since iOS 12.1.3 (released Jan 2019)

This seems acceptable when aligned with our current TLS requirements:

  • TLSv1.2+
  • ECDSA-only certificates
  • Trust store includes Let’s Encrypt roots

⚠️ We’ll want to double-check real client compatibility, especially for legacy Android/iOS traffic, and make sure this doesn’t impact known edge cases.

Next step: audit real-world TLS client versions/user agents before rollout.

For Let's Encrypt-issued certificates, the project self-reports the following compatibility for ISRG Root X1:

  • Windows >= XP SP3, Server 2008 (unless Automatic Root Certificate Updates have been disabled)
  • macOS >= 10.12.1 Sierra
  • iOS >= 10
  • Android >= 7.1.1
  • Firefox >= 50.0
  • Ubuntu >= 12.04 Precise Pangolin (with updates applied)
  • Debian >= 8 / Jessie (with updates applied)
  • RHEL >= 6.10, 7.4 (with updates applied), 8+
  • Java >= 7u151, 8u141, 9+
  • NSS >= 3.26
  • Chrome >= 105 (earlier versions use the operating system trust store)
  • PlayStation >= PS4 v8.0.0

I’m exploring a way to quantify the real-world impact of switching certificate chains for our Google Trust Services–issued TLS certificates. My proposal is to use probenet to collect meaningful, production-grade data from real clients. Here’s the rough idea:

  1. Issue a dedicated certificate for measure-$site.wm.o using GTS.
  2. Split a production CDN cluster (e.g. text@magru) 50/50 across its cp nodes:
    • Half of the nodes use the standard/full GTS certificate (including the cross-sign)
    • The other half use the shortened/alternate chain (assuming GTS Root R1 is trusted directly)
  3. Use probenet and our analytics pipeline to compare:
    • Success rates and handshake failures between the two chains
    • Breakdown by User-Agent
    • Increase in NEL errors

The goal is to see if certain clients (e.g., older Android or embedded stacks) have a higher failure rate with the shorter chain, and, if so, how significant that group is in practice.
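For the comparison step, one option is a two-proportion z-test on handshake failure counts between the cohorts; a large |z| would indicate the difference is unlikely to be noise. A sketch with entirely hypothetical failure counts (real numbers would come from the analytics pipeline):

```python
# Sketch: compare handshake failure rates between the two cert-chain cohorts
# with a two-proportion z-test. All counts below are hypothetical.
from math import sqrt

def two_proportion_z(fail_a: int, total_a: int, fail_b: int, total_b: int) -> float:
    p_a, p_b = fail_a / total_a, fail_b / total_b
    p = (fail_a + fail_b) / (total_a + total_b)        # pooled failure rate
    se = sqrt(p * (1 - p) * (1 / total_a + 1 / total_b))
    return (p_a - p_b) / se

# Hypothetical: full chain (a) vs. alternate chain (b)
z = two_proportion_z(fail_a=1200, total_a=1_222_703,
                     fail_b=1900, total_b=1_225_825)
print(round(z, 2))
```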

@CDanis, @ssingh do you think this could work?

I think overall this sounds quite reasonable! A few small notes:

  • NEL is still only Chromium-family browsers, so it won't tell us anything about Safari, Firefox, etc.
  • probenet works for all JS-enabled browsers, but we should double-check if/how it handles & reports cert errors specifically
  • Are you also planning to measure the latency impact?

This would also be relevant data for T394484; the impact should be pretty obvious given that the long chain triggers an additional RTT per TLS handshake.

Change #1167143 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] hiera: Issue dedicated certs for probenet endpoints

https://gerrit.wikimedia.org/r/1167143

Change #1167143 merged by Vgutierrez:

[operations/puppet@production] hiera: Issue dedicated certs for probenet endpoints

https://gerrit.wikimedia.org/r/1167143

Vgutierrez renamed this task from Consider using the alternative chain of Google Trust Services certificates to Consider using the alternate chain of Google Trust Services certificates. Jul 9 2025, 12:56 PM

Change #1167630 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] hiera: Switch esams/eqsin/drmrs to Let's Encrypt certs

https://gerrit.wikimedia.org/r/1167630

Change #1167630 merged by Vgutierrez:

[operations/puppet@production] hiera: Switch esams/eqsin/drmrs to Let's Encrypt certs

https://gerrit.wikimedia.org/r/1167630

Mentioned in SAL (#wikimedia-operations) [2025-07-09T16:02:17Z] <vgutierrez> switching esams, eqsin and drmrs to Let's Encrypt unified/upload certs - T398596

@CDanis our current measurement payload doesn't provide the hostname of the cp host that handled the measure requests, but it is feasible to fetch it from the x-cache / server-timing headers:

$ curl -v https://measure-eqsin.wikimedia.org/measure -s 2>&1 |grep cp5026
< x-cache: cp5026 int
< server-timing: cache;desc="int-front", host;desc="cp5026"

A quick check on MDN suggests that the Performance API already exposes server-timing data, https://developer.mozilla.org/en-US/docs/Web/API/PerformanceResourceTiming/serverTiming:

performance.getEntriesByType("navigation")[0].serverTiming[1].description
'cp6013'

Alternatively, we could rely on Turnilo data for successful requests to measure-magru.wm.o/measure, but the sampling there could impact the quality of the data. Is it OK if we modify the JS code to report the involved cp host?

Change #1169149 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[mediawiki/extensions/WikimediaEvents@master] probenet: Report CDN host handling each measure request

https://gerrit.wikimedia.org/r/1169149

Change #1169200 had a related patch set uploaded (by Vgutierrez; author: Vgutierrez):

[operations/puppet@production] hiera: use the alt chain on half upload@magru for measure cert

https://gerrit.wikimedia.org/r/1169200

Change #1169200 merged by Vgutierrez:

[operations/puppet@production] hiera: use the alt chain on half upload@magru for measure cert

https://gerrit.wikimedia.org/r/1169200

Mentioned in SAL (#wikimedia-operations) [2025-07-15T07:38:22Z] <vgutierrez> use GTS alt chain for the measure cert on cp[7013-7016] - T398596

SELECT 
    CASE 
        WHEN hostname IN ('cp7009.magru.wmnet', 'cp7010.magru.wmnet', 'cp7011.magru.wmnet', 'cp7012.magru.wmnet') THEN 0
        WHEN hostname IN ('cp7013.magru.wmnet', 'cp7014.magru.wmnet', 'cp7015.magru.wmnet', 'cp7016.magru.wmnet') THEN 1
        ELSE -1
    END as short_chain,
    COUNT(*) as request_count
FROM wmf.webrequest
WHERE 
    webrequest_source = 'upload'
    AND year = 2025
    AND month = 7
    AND day >= 16
    AND (tls_map['vers'] = 'TLSv1.2' OR tls_map['vers'] = 'TLSv1.3')
    AND uri_host = 'measure-magru.wikimedia.org'
    AND hostname IN ('cp7009.magru.wmnet', 'cp7010.magru.wmnet', 'cp7011.magru.wmnet', 'cp7012.magru.wmnet',
                     'cp7013.magru.wmnet', 'cp7014.magru.wmnet', 'cp7015.magru.wmnet', 'cp7016.magru.wmnet')
GROUP BY 
    CASE 
        WHEN hostname IN ('cp7009.magru.wmnet', 'cp7010.magru.wmnet', 'cp7011.magru.wmnet', 'cp7012.magru.wmnet') THEN 0
        WHEN hostname IN ('cp7013.magru.wmnet', 'cp7014.magru.wmnet', 'cp7015.magru.wmnet', 'cp7016.magru.wmnet') THEN 1
        ELSE -1
    END;
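The same hostname-to-cohort mapping as the query's CASE expression, as a small Python helper for post-processing query results or webrequest samples locally (a sketch, not production code):

```python
# Sketch: hostname -> cohort mapping matching the Hive query's CASE expression.
LONG_CHAIN = {f"cp70{n:02d}.magru.wmnet" for n in range(9, 13)}    # cp7009-cp7012
SHORT_CHAIN = {f"cp70{n:02d}.magru.wmnet" for n in range(13, 17)}  # cp7013-cp7016

def chain_cohort(hostname: str) -> int:
    """0 = full/long chain, 1 = alternate/short chain, -1 = not in the test."""
    if hostname in LONG_CHAIN:
        return 0
    if hostname in SHORT_CHAIN:
        return 1
    return -1

print(chain_cohort("cp7010.magru.wmnet"), chain_cohort("cp7015.magru.wmnet"))  # 0 1
```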

A quick check grouping requests by hostname (long chain vs. short chain) shows that since 2025-07-16 we have handled slightly more requests with the short chain:

short_chain  request_count
0            1222703
1            1225825

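A quick sanity check on those counts confirms the split is close to 50/50, so the cohorts are comparable in size:

```python
# Sketch: sanity-check the request split using the counts reported above.
long_chain, short_chain = 1_222_703, 1_225_825
share = short_chain / (long_chain + short_chain)
print(f"{share:.2%}")  # 50.06%
```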
FYI:

This change encountered this problem: https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines#Complex_array_element_and_map_value_type_evolution_is_not_well_supported

If we had done T366487: Event Platform schemas should not support type changes to structs as array element or map value types, this would have been caught by CI.


To resolve this and make sure the Hive event.development_network_probe table will continue to work, we need to make some manual changes to the table. The exact procedure will need some investigation. Perhaps we can just drop it and allow the Refine system to recover? Or rename the old table and allow Refine to create the new one? Or manually alter the new table correctly? Or allow the system to create the new table and then insert all the old data?

Change #1169149 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] probenet: Report CDN host handling each measure request

https://gerrit.wikimedia.org/r/1169149

Change #1174016 had a related patch set uploaded (by CDanis; author: Vgutierrez):

[mediawiki/extensions/WikimediaEvents@wmf/1.45.0-wmf.12] probenet: Report CDN host handling each measure request

https://gerrit.wikimedia.org/r/1174016

Change #1174016 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@wmf/1.45.0-wmf.12] probenet: Report CDN host handling each measure request

https://gerrit.wikimedia.org/r/1174016

Mentioned in SAL (#wikimedia-operations) [2025-07-29T20:32:19Z] <cdanis@deploy1003> Started scap sync-world: Backport for [[gerrit:1174016|probenet: Report CDN host handling each measure request (T398596)]]

Mentioned in SAL (#wikimedia-operations) [2025-07-29T20:34:27Z] <cdanis@deploy1003> cdanis: Backport for [[gerrit:1174016|probenet: Report CDN host handling each measure request (T398596)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-07-29T20:42:47Z] <cdanis@deploy1003> Finished scap sync-world: Backport for [[gerrit:1174016|probenet: Report CDN host handling each measure request (T398596)]] (duration: 10m 27s)

image.png (249×535 px, 42 KB)

Works! Backported to wmf12, so it's currently live on group0, and will roll out with the train.

ssingh changed the task status from Open to Stalled. Oct 3 2025, 2:44 PM

We will be resuming work on this in Q3 2025-2026.