
Consider separating wdqs-categories from the rest of the wdqs stack
Closed, Resolved, Public

Description

wdqs-categories, a separate Blazegraph instance that powers the deepcat service, is currently deployed alongside the primary Blazegraph instance (wdqs-main or wdqs-scholarly) on all WDQS hosts. We suspect this service may have contributed to an outage (see T374009 for more details).

Since we split the primary Blazegraph instance itself in T337013, we've started wondering how to handle wdqs-categories. Do we...

  • continue to run it on all hosts?
  • migrate it to its own environment?
    • If we migrate, would it be VMs or Kubernetes?

Creating this ticket to discuss why we might want to move this service (easier to operate? fewer resources?), the level of effort involved with the status quo vs. migration, etc.

If we do decide to migrate, we'll handle the work in a subsequent ticket.

Event Timeline

Change #1070942 had a related patch set uploaded (by DCausse; author: DCausse):

[wikidata/query/rdf@master] categories: source CATEGORIES_ENDPOINT from /etc/default/categories-endpoint

https://gerrit.wikimedia.org/r/1070942
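For context, the patch above moves the endpoint out of the code and into an environment file under /etc/default, the conventional place for service configuration on Debian-family hosts. A minimal sketch of that pattern, using a temp file in place of /etc/default/categories-endpoint and an illustrative URL (the real endpoint value is not shown in this task):

```shell
# Sketch only: the env file name and endpoint URL here are stand-ins,
# not the actual values deployed by the patch.
envfile=$(mktemp)
cat > "$envfile" <<'EOF'
CATEGORIES_ENDPOINT=http://localhost:9990/bigdata/namespace/categories/sparql
EOF

# The service wrapper sources the file instead of hard-coding the endpoint:
. "$envfile"
echo "$CATEGORIES_ENDPOINT"
```

Keeping the endpoint in an env file means Puppet can repoint the service (e.g. to a dedicated categories host) without touching the rdf codebase.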

Gehel triaged this task as High priority. Sep 6 2024, 12:23 PM

Change #1071877 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] wdqs: fix CATEGORY_ENDPOINT env var

https://gerrit.wikimedia.org/r/1071877

Change #1071877 merged by Bking:

[operations/puppet@production] wdqs: fix CATEGORY_ENDPOINT env var

https://gerrit.wikimedia.org/r/1071877

If we migrate, would it be VMs or Kubernetes?

I prefer the idea of migrating to k8s. With VMs it's easy to forget that the service exists, but with k8s it's more "in your face", and there are more obvious ways to troubleshoot the service (problem with categories? try restarting the pods and see if that fixes it). Of course k8s adds some complexity, but it's something we're going to be relying upon more frequently across the foundation.

It makes a lot of sense to me to move categories to its own servers. This would greatly simplify how we manage those systems and would also simplify our Puppet code. I think the question is less "should we do it" and more "how do we do it".

If we migrate, would it be VMs or Kubernetes?

I don't have strong opinions there, but a few questions:

  • Where do we store the journal? Locally, Ceph, ...? (As a random data point, the categories journal on wdqs1011 is 38G.)
  • I suspect that disk latency is critical to Blazegraph's performance. Would the storage solution above have an impact on response times?
  • If we move to k8s, should we review how we update the categories? We know that Blazegraph is not very good at reclaiming disk space, so there might be an opportunity to start from scratch when an update is needed instead of appending to the existing journal. I'm not sure how that would fit into the k8s update process (reloading the categories is still a fairly long process).
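The "start from scratch" idea in the last bullet amounts to building a fresh journal alongside the live one and swapping it into place, rather than appending to a file that Blazegraph won't compact. A sketch of that swap, using plain files to stand in for the journal (the paths are hypothetical; the actual reload tooling is not shown here):

```shell
# Sketch only: plain files stand in for the Blazegraph journal, and the
# data directory is a temp dir rather than a real wdqs host path.
set -e
datadir=$(mktemp -d)
echo "old categories journal" > "$datadir/categories.jnl"

# Reload builds a brand-new journal next to the live one...
echo "freshly loaded categories journal" > "$datadir/categories.jnl.new"

# ...then swaps it in atomically. The old file's unreclaimed space is
# simply discarded instead of accumulating across updates.
mv "$datadir/categories.jnl.new" "$datadir/categories.jnl"
cat "$datadir/categories.jnl"
```

In a k8s setting this could map to loading into a fresh volume and rolling the pod over to it, though as noted above the reload itself is still a fairly long process.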

Per today's conversation with @Gehel, we are retiring the full-graph hosts in roughly 6 months.

That means we don't really have a choice here; we will be migrating the service somewhere.

As such, I'm closing out this ticket. The discussion on where to migrate the service (VMs or Kubernetes) continues in T374967.

bking claimed this task.

Change #1070942 merged by jenkins-bot:

[wikidata/query/rdf@master] categories: source CATEGORY_ENDPOINT from /etc/default/category-endpoint

https://gerrit.wikimedia.org/r/1070942