
Expose 3 new dedicated WDQS endpoints
Open, High · Public · 5 Estimated Story Points


We'll be launching three net-new services: wdqs-main-graph, wdqs-full-graph and wdqs-scholarly-articles (real names TBD).

Each service will be served from a single host, so we shouldn't need LVS/pybal.

Servers are already available. Data load isn't part of this ticket.

We will need additional static websites for the UI, along with their configuration.


  • Talk to Traffic team and come up with a concrete list of steps to get these services ready from a routing point of view
  • Deploy the UI for each endpoint
  • Make the SPARQL endpoints and UI accessible from the internet

Event Timeline

Gehel renamed this task from Launch two new dedicated WDQS endpoints to Launch 3 new dedicated WDQS endpoints.Mon, Nov 20, 4:34 PM
Gehel updated the task description.
Gehel set the point value for this task to 5.
Gehel triaged this task as High priority.Wed, Nov 22, 9:23 AM
Gehel moved this task from Incoming to Quarterly Goals on the Data-Platform-SRE board.

Alright, I had an initial meeting with Traffic team (Brandon & Valentin).

Traffic team meeting summary

The primary concern they had was the potential impact on the ATS side of things: in a scenario where Blazegraph consistently takes a long time to respond, ATS would accumulate a large number of dangling sockets, which could theoretically impact the rest of the production infrastructure (MediaWiki, etc.).

This isn't necessarily a new problem; it sounds like this has been a potential concern with WDQS in general for some time now. One possibility we discussed was to bypass the caching layer entirely and just use LVS: each of these net-new services would be a single backend host behind LVS, avoiding the ATS/caching layer entirely. That eliminates the concern around ATS but does introduce a few drawbacks:

  • (primary drawback) We lose requestctl which is a tremendously useful tool when managing WDQS outages. We'd presumably be going back to the old way of doing things where we'd manually ban at the nginx level when necessary.
  • (not discussed in meeting but realized after-the-fact) We currently rely on trafficserver to handle routing to the miscweb hosts; as such we'd need to spin off separate domains to replicate this functionality
  • Some extra latency would be introduced since we wouldn't be terminating TLS as close to the user. This probably isn't a huge deal; adding up to 100 ms of latency on the user end likely wouldn't break existing use cases.
  • There are some changes to Puppet automation, etc., that we'd have to make. The main one is that TLS certs would have to go through acme-chief rather than relying on the CDN. This generates some work on our (Search team) end in creating the corresponding Puppet patch(es), but wouldn't be a showstopper.
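For the acme-chief point above, the Puppet change might look roughly like this sketch. The resource name and parameters are assumptions about the wmf-puppet interface, not a verified patch:

```puppet
# Sketch only: request a cert via acme-chief instead of relying on the CDN's
# TLS termination. Hostname and parameter names are illustrative.
acme_chief::cert { 'wdqs-main-graph':
    puppet_svc => 'nginx',   # reload nginx when the cert is renewed
}
```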

Of the above four drawbacks, the most painful is losing requestctl; it's a really great tool. But it might be a worthwhile tradeoff to entirely avoid the possibility of a misbehaving query service impacting non-WDQS production infrastructure like MediaWiki itself. I'd note that I'm not aware of us having specifically encountered that problem (ATS backing up and impacting the rest of prod) in previous WDQS outages, but it's also not something we were going out of our way to look for.
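Without requestctl, the fallback is the manual nginx-level ban mentioned earlier. As a sketch (the ban pattern, hostname, and backend port are illustrative, not the actual WDQS configuration):

```nginx
# Fallback ban at the nginx level, as we'd do without requestctl.
map $http_user_agent $wdqs_banned {
    default            0;
    "~*MisbehavingBot" 1;   # placeholder for the offending client signature
}
server {
    listen 443 ssl;
    server_name query.example.org;                    # hypothetical endpoint name
    if ($wdqs_banned) { return 429; }
    location / { proxy_pass http://127.0.0.1:9999; }  # Blazegraph behind nginx
}
```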

So, we'll need to discuss amongst Search team and see what the consensus is, then bring the discussion back to Traffic team for further feedback.

Other context
  • The existing request flow for WDQS is: haproxy [Traffic team manages certs] -> varnish -> ats -> envoy -> nginx -> blazegraph. (This is from hastily transcribed notes, and I filled in the missing gaps on the righthand side [nginx -> blazegraph], so I'll want to follow up and validate that the flow is correct.)
  • As for how things would look after spinning up the new endpoints (sidestepping the question of whether to bypass the caching layer), the existing WDQS endpoint would still get the vast majority of the traffic, with wdqs-scholarly-articles getting at most a few % of the total. We expect actual usage of these new endpoints to be quite low - basically only WDQS power users will try them out, at least initially - but since they'd still be production services exposed to the outside world, there is of course always the potential for a malicious attacker, with respect to the concerns about ATS getting backed up.
  • There's an open question around how much benefit from the caching layer WDQS enjoys currently. Our assumption is that WDQS is very impractical to cache and thus we're getting very little contribution from the caching layer as far as performance is concerned currently, but I'll need to glance at Grafana etc and validate that hunch.
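One cheap way to validate that hunch, besides Grafana, is to spot-check the CDN cache headers on real responses. A minimal sketch: classify responses by their x-cache-status header (the header name and "hit-*" value convention follow what Wikimedia's CDN emits today, but should be verified against live responses):

```shell
#!/bin/sh
# Classify a WDQS response as a cache hit or a backend fetch from its
# x-cache-status header. In practice the header would come from e.g.:
#   curl -sI 'https://query.wikidata.org/sparql?query=...' | grep -i x-cache-status
classify() {
  case "$1" in
    *hit*)         echo "cache hit" ;;
    *miss*|*pass*) echo "backend"   ;;
    *)             echo "unknown"   ;;
  esac
}
classify "x-cache-status: hit-front"   # → cache hit
classify "x-cache-status: pass"        # → backend
```

Sampling a handful of typical SPARQL queries this way would give a quick sanity check on the hit rate before digging into dashboards.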

A few notes in reply to the comment above:

  • This ticket is specifically for 3 experimental endpoints that are temporary and will be used to gather feedback from our communities. We expect a low amount of traffic, which reduces the risk. We should still think about the long-term solution if it needs to differ from what we currently have for WDQS / WCQS, but that can be done later.
  • We have a fairly low TTL on SPARQL responses (5 minutes). Not having caching at all is probably not a big deal (but we need to check the numbers - we might have a few hot spots where the cache helps). For the experimental phase, a lack of caching is very unlikely to be an issue. The UI is a bunch of static HTML / CSS / JS files with a longer TTL, but they are also super cheap to serve.
  • In the long term, losing requestctl seems to be a big deal. WDQS is one of the services where requestctl is regularly useful.
  • Increased latency due to SSL termination is not an issue: SPARQL queries tend to take at least an order of magnitude longer, so the user impact is negligible.
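The two TTL regimes described above can be sketched as an nginx fragment (paths, port, and exact TTL values beyond the stated 5 minutes are hypothetical):

```nginx
# Sketch: short TTL on SPARQL responses, longer TTL on the static UI bundle.
location /sparql {
    proxy_pass http://127.0.0.1:9999;                  # Blazegraph (port is illustrative)
    add_header Cache-Control "public, max-age=300";    # 5-minute TTL on SPARQL responses
}
location / {
    root /srv/wdqs-gui;                                # static HTML / CSS / JS bundle
    add_header Cache-Control "public, max-age=86400";  # longer TTL; cheap to serve anyway
}
```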

Additional notes:

  • Timeline: we'd like the experimental servers to be publicly accessible by mid-January
  • The short-term solution for the experimental servers and the long-term solution for WDQS as a whole should probably be treated separately