
Expose 3 new dedicated WDQS endpoints
Open, High · Public · 5 Estimated Story Points


We'll be launching three net-new services: wdqs-main-graph, wdqs-full-graph and wdqs-scholarly-articles (real names TBD).

Each service will be served from a single host, so we shouldn't need LVS/pybal.

Servers are already available. Data load isn't part of this ticket.

We will need additional static websites for the UI, along with their configuration.


  • Talk to Traffic team and come up with a concrete list of steps to get these services ready from a routing point of view
  • Deploy the UI for each endpoint
  • Make the SPARQL endpoints and UI accessible from the internet

Event Timeline

Gehel renamed this task from Launch two new dedicated WDQS endpoints to Launch 3 new dedicated WDQS endpoints.Mon, Nov 20, 4:34 PM
Gehel updated the task description.
Gehel set the point value for this task to 5.
Gehel triaged this task as High priority.Wed, Nov 22, 9:23 AM
Gehel moved this task from Incoming to Quarterly Goals on the Data-Platform-SRE board.

Alright, I had an initial meeting with Traffic team (Brandon & Valentin).

Traffic team meeting summary

The primary concern they had was the potential impact on the ATS side of things: in a scenario where Blazegraph consistently takes a long time to respond, ATS would accumulate a large number of dangling sockets, which could theoretically impact the rest of the production infrastructure (MediaWiki, etc.).

This isn't necessarily a new problem; it sounds like this has been a potential concern with WDQS in general for some time now. One possibility we discussed was to bypass the caching layer entirely and just use LVS: each of these net-new services would be a single backend host behind LVS, avoiding the ATS/caching layer entirely. That eliminates the concern around ATS but does introduce a few drawbacks:

  • (primary drawback) We lose requestctl which is a tremendously useful tool when managing WDQS outages. We'd presumably be going back to the old way of doing things where we'd manually ban at the nginx level when necessary.
  • (not discussed in meeting but realized after-the-fact) We currently rely on trafficserver to handle routing to the miscweb hosts; as such we'd need to spin off separate domains to replicate this functionality
  • Some extra latency would be introduced since we wouldn't be terminating TLS as close to the user. This probably isn't a huge deal; adding up to 100 ms of latency on the user end likely wouldn't break existing use cases.
  • There are some changes to Puppet automation, etc., that we'd have to make. The main one is that TLS certs would have to go through acme-chief rather than relying on the CDN. This generates some work on our (Search team) end in creating the corresponding Puppet patch(es), but wouldn't be a showstopper.
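For the acme-chief point above, the Puppet change might look roughly like this sketch. The resource name and parameters are assumptions about the wmf-puppet interface, not a verified patch:

```puppet
# Sketch only: request a cert via acme-chief instead of relying on the CDN's
# TLS termination. Hostname and parameter names are illustrative.
acme_chief::cert { 'wdqs-main-graph':
    puppet_svc => 'nginx',   # reload nginx when the cert is renewed
}
```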

Of the above four drawbacks, the most painful is losing requestctl; it's a really great tool. But it might be a worthwhile tradeoff to entirely avoid the possibility of a misbehaving query service impacting non-WDQS production infrastructure like MediaWiki itself. I'd note that I'm not aware of us having specifically encountered that problem (ATS backing up and impacting the rest of prod) in previous WDQS outages, but it's also not something we were going out of our way to look for.
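Without requestctl, the fallback is the manual nginx-level ban mentioned earlier. As a sketch (the ban pattern, hostname, and backend port are illustrative, not the actual WDQS configuration):

```nginx
# Fallback ban at the nginx level, as we'd do without requestctl.
map $http_user_agent $wdqs_banned {
    default            0;
    "~*MisbehavingBot" 1;   # placeholder for the offending client signature
}
server {
    listen 443 ssl;
    server_name query.example.org;                    # hypothetical endpoint name
    if ($wdqs_banned) { return 429; }
    location / { proxy_pass http://127.0.0.1:9999; }  # Blazegraph behind nginx
}
```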

So, we'll need to discuss amongst Search team and see what the consensus is, then bring the discussion back to Traffic team for further feedback.

Other context
  • The existing request flow for WDQS is: haproxy [Traffic team manages certs] -> varnish -> ats -> envoy -> nginx -> blazegraph. (This is from hastily transcribed notes, and I filled in the missing gaps on the righthand side [nginx -> blazegraph], so I'll want to follow up and validate that the flow is correct.)
  • As for how things would look after spinning up the new endpoints (sidestepping the question of whether to bypass the caching layer), the existing WDQS endpoint would still get the vast majority of the traffic, with wdqs-scholarly-articles getting at most a few % of the total. We expect actual usage of these new endpoints to be quite low - basically only WDQS power users will try them out, at least initially - but since they'd still be production services exposed to the outside world, there is of course always the potential for a malicious attacker, with respect to the concerns about ATS getting backed up.
  • There's an open question around how much benefit from the caching layer WDQS enjoys currently. Our assumption is that WDQS is very impractical to cache and thus we're getting very little contribution from the caching layer as far as performance is concerned currently, but I'll need to glance at Grafana etc and validate that hunch.
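One cheap way to validate that hunch, besides Grafana, is to spot-check the CDN cache headers on real responses. A minimal sketch: classify responses by their x-cache-status header (the header name and "hit-*" value convention follow what Wikimedia's CDN emits today, but should be verified against live responses):

```shell
#!/bin/sh
# Classify a WDQS response as a cache hit or a backend fetch from its
# x-cache-status header. In practice the header would come from e.g.:
#   curl -sI 'https://query.wikidata.org/sparql?query=...' | grep -i x-cache-status
classify() {
  case "$1" in
    *hit*)         echo "cache hit" ;;
    *miss*|*pass*) echo "backend"   ;;
    *)             echo "unknown"   ;;
  esac
}
classify "x-cache-status: hit-front"   # → cache hit
classify "x-cache-status: pass"        # → backend
```

Sampling a handful of typical SPARQL queries this way would give a quick sanity check on the hit rate before digging into dashboards.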

A few notes in reply to the comment above:

  • This ticket is specifically for 3 experimental endpoints that are temporary and will be used to gather feedback from our communities. We expect a low amount of traffic, which reduces the risk. We should still think about the long-term solution if it needs to differ from what we currently have for WDQS / WCQS, but that can be done later.
  • We have a fairly low TTL on SPARQL responses (5 minutes). Not having caching at all is probably not a big deal (but we need to check the numbers - we might have a few hot spots where the cache helps). For the experimental phase, a lack of caching is very unlikely to be an issue. The UI is a bunch of static HTML / CSS / JS files with a longer TTL, but they are also super cheap to serve.
  • In the long term, losing requestctl seems to be a big deal. WDQS is one of the services where requestctl is regularly useful.
  • Increased latency due to SSL termination is not an issue: SPARQL queries tend to take at least an order of magnitude longer, so the user impact is negligible.
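The two TTL regimes described above can be sketched as an nginx fragment (paths, port, and exact TTL values beyond the stated 5 minutes are hypothetical):

```nginx
# Sketch: short TTL on SPARQL responses, longer TTL on the static UI bundle.
location /sparql {
    proxy_pass http://127.0.0.1:9999;                  # Blazegraph (port is illustrative)
    add_header Cache-Control "public, max-age=300";    # 5-minute TTL on SPARQL responses
}
location / {
    root /srv/wdqs-gui;                                # static HTML / CSS / JS bundle
    add_header Cache-Control "public, max-age=86400";  # longer TTL; cheap to serve anyway
}
```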

Additional notes:

  • Timeline: we'd like the experimental servers to be publicly accessible by mid-January
  • The short-term solution for the experimental servers and the long-term solution for WDQS as a whole should probably be treated separately