Configuring Anycast services on the new L3-to-the-ToR network design exposed a limitation, which is easily seen in the following output:
cr1-drmrs> show route 10.3.0.1

10.3.0.1/32 *[BGP/170] 5d 02:15:08, MED 0, localpref 100
               AS path: (65001) 64605 I, validation-state: unverified
             > to 184.108.40.206 via xe-0/1/2.0
             [BGP/170] 06:53:39, localpref 100
               AS path: 4265006001 64605 I, validation-state: unverified
             > to 220.127.116.11 via et-0/0/1.0
As the servers now peer with the ToR, they are 1 hop further away from the core routers.
Furthermore, the core routers are in a confederation where, all else being equal, prefixes learned from a confederation peer are preferred over those learned from "external" peers.
This means that, as things stand, the drmrs core routers prefer to send traffic to eqiad rather than keep it local (the same will happen with the eqiad expansion).
See Juniper's BGP path selection.
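To make the comparison concrete, here is a minimal Python sketch of the AS path length step only. It assumes that confederation segments (shown in parentheses in the Junos output) contribute nothing to the path length. The function name is made up for illustration, AS sets in braces are not handled, and the labels in the comments reflect my reading of the two paths above.

```python
def as_path_length(as_path: str) -> int:
    """Count the ASNs in a Junos-style AS path string.

    Segments in parentheses, e.g. "(65001)", are confederation sequences
    and are skipped. The trailing origin code (I, E or ?) is ignored.
    """
    length = 0
    in_confed = False
    for token in as_path.split():
        if token.startswith("("):
            in_confed = True
        if not in_confed and token.isdigit():
            length += 1
        if token.endswith(")"):
            in_confed = False
    return length


via_confed_peer = "(65001) 64605 I"     # active path in the output above
via_local_path = "4265006001 64605 I"   # second path in the output above

print(as_path_length(via_confed_peer))  # 1
print(as_path_length(via_local_path))   # 2
```

With equal local-pref on both paths, the shorter AS path wins, which matches the active route shown by the router.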
Two fixes are being considered to solve that issue:
1/ Set a local-pref on prefixes learned from the local switches, then remove that local-pref when re-advertising the prefixes to the other sites (making it local to drmrs).
Removing it is required because local-pref is carried across confederation peers. Setting it in drmrs without removing it would attract all traffic for the given prefix.
This solution is the easiest to implement as it requires a config change in drmrs only.
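The set-then-reset logic of option 1 can be sketched as a toy model in Python. This is not router configuration: the `Route` class, the function names and the value 200 are all hypothetical, and only the local-pref comparison is modelled.

```python
from dataclasses import dataclass, replace

DEFAULT_LOCAL_PREF = 100   # value shown in the route output above
LOCAL_SITE_PREF = 200      # hypothetical; any value above the default would do


@dataclass(frozen=True)
class Route:
    prefix: str
    learned_from: str
    local_pref: int = DEFAULT_LOCAL_PREF


def import_from_local_switch(route: Route) -> Route:
    """Raise local-pref on prefixes learned from the site's own switches."""
    return replace(route, local_pref=LOCAL_SITE_PREF)


def export_to_other_site(route: Route, reset: bool = True) -> Route:
    """Re-advertise to another site, optionally resetting local-pref."""
    if reset:
        return replace(route, local_pref=DEFAULT_LOCAL_PREF)
    return route


# In drmrs: the locally learned path now beats the one coming from eqiad.
local = import_from_local_switch(Route("10.3.0.1/32", "drmrs switch"))
from_eqiad = Route("10.3.0.1/32", "eqiad")
print(max([local, from_eqiad], key=lambda r: r.local_pref).learned_from)

# In eqiad: what arrives from drmrs, with and without the reset.
print(export_to_other_site(local).local_pref)               # 100
print(export_to_other_site(local, reset=False).local_pref)  # 200
```

Without the reset, the drmrs path would reach the other sites with the higher local-pref and win there too, which is the traffic-attraction problem described above.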
2/ Prepend the AS path of anycast prefixes learned directly by the core routers, to match the AS path length seen on the new design infra.
So 10.3.0.1 on cr1-eqiad, will be seen with AS path "64605 64605 I"
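The same prefix will be received in drmrs with AS path "65001 64605 64605 I", which is no longer shorter than the local path (the confederation AS 65001 does not count toward the path length), so it loses its advantage over the local path.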
This 2nd option (proposed by Cathal, thanks!) seems cleaner to me in the long run. Its downside is a more complex rollout: traffic will briefly shift to a different site as each site is changed, except for the last site to tackle (as all the other paths will already be longer).
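The effect of the prepend on path lengths can be sketched in Python as follows. This assumes, as Junos does, that the confederation segment adds nothing to the length, so it is left out of the lists. The `prepend` helper and variable names are illustrative only.

```python
def prepend(as_path: list[int], asn: int, count: int = 1) -> list[int]:
    """Return a new AS path with `asn` prepended `count` times."""
    return [asn] * count + as_path


# Anycast prefix learned directly by a core router, before and after prepending.
direct = [64605]
on_cr1_eqiad = prepend(direct, 64605)       # "64605 64605 I"

# Same prefix learned in drmrs through the local path of the new design.
local_in_drmrs = [4265006001, 64605]        # "4265006001 64605 I"

print(len(direct), len(local_in_drmrs))        # 1 2  (today: remote path is shorter)
print(len(on_cr1_eqiad), len(local_in_drmrs))  # 2 2  (after prepending: lengths match)
```

Under that assumption the two paths tie on length after the change, and the choice falls to the later tie-breakers in the path selection, which this sketch does not model.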