wdqs: database node logs should be pushed to logstash
Open, Needs Triage, Public

Description

QLever should be able to push to our logstash endpoint. This is currently done directly in Blazegraph using Logback (configuration).

Wikitech reference on Logstash is here

We should consult with observability-platform SRE about deploying this, either as part of the triplestore or a separate Debian package (or something else).

In fact, being on Kubernetes might make this very straightforward.

AC

  • Logs meet Common Logging Schema requirements.
  • Logs from the chosen triplestore are pushed to logstash
  • Observability-platform SRE consulted re. deployment, and a deployment exists

Event Timeline

gmodena renamed this task from Triplestore: logs pushed to logstash to database node logs should be pushed to logstash. May 13 2026, 12:05 PM
gmodena updated the task description.

We serve about 150 qps at peak. From May 13 to May 20 I see about 7 million log entries in https://logstash.wikimedia.org/app/discover#/?_g=h@d7d2a59&_a=h@900612c. This figure is lower than it should be. We also forward query logs to event platform, and for the same period we logged 72,774,019 events for the external endpoint alone.
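To put the two figures side by side, here is a quick back-of-the-envelope calculation. It only uses the numbers quoted in this comment; the 7 million figure is approximate, and the two sources may not cover exactly the same set of logs, so treat the ratio as a rough indication rather than a measurement.

```python
# Figures quoted in the comment; the window is the 7 days from May 13 to May 20.
SECONDS_IN_WINDOW = 7 * 24 * 60 * 60  # 604,800 s

logstash_entries = 7_000_000         # "about 7 million", an approximate figure
event_platform_events = 72_774_019   # external endpoint only

# Average per-second rates over the whole window (not peak rates).
logstash_rate = logstash_entries / SECONDS_IN_WINDOW
event_rate = event_platform_events / SECONDS_IN_WINDOW
ratio = event_platform_events / logstash_entries

print(f"logstash:       {logstash_rate:.1f} entries/s on average")
print(f"event platform: {event_rate:.1f} events/s on average")
print(f"ratio:          {ratio:.1f}x")
```

On these numbers the event platform count is roughly ten times the logstash count, and the event platform average (about 120 events/s) is in line with a service that peaks around 150 qps.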

Do we sample at ingestion time? I wonder if we hit some logstash rate limit? This tracks with behavior seen recently, when logs we wanted to inspect (during an incident) were not available.

Currently logs land in the logstash-* index. In wdqs v2 we should log in ECS format, but the payload should be similar. With caveats.

Logstash would be useful for real-time troubleshooting. However, if the ingest volume is too high for logstash, maybe we should consider running real-time analytics in Kafka. See T425989: [NEEDS GROOMING] wdqs real-time monitoring and analytics.

The risk with only logging via EventPlatform in WDQS v2 is that we'd do it at wdqs-proxy level, likely losing database host info.

gmodena renamed this task from database node logs should be pushed to logstash to wdqs: database node logs should be pushed to logstash. Wed, May 20, 2:10 PM
gmodena added a project: SRE Observability.

In fact, being on Kubernetes might make this very straightforward.

If the service will be running in Kubernetes, then the logs will go to logstash for free!

We serve about 150 qps at peak. From May 13 to May 20 I see about 7 million log entries in https://logstash.wikimedia.org/app/discover#/?_g=h@d7d2a59&_a=h@900612c. This figure is lower than it should be. We also forward query logs to event platform, and for the same period we logged 72,774,019 events for the external endpoint alone.

That link has expired, could you give a shortlink for that query please? There are some specific rules around how logs are managed for wdqs but I don't think they should result in sampling. They're also around 10 years old so I'm not even sure if they're used at this point 😅

In fact, being on Kubernetes might make this very straightforward.

If the service will be running in Kubernetes, then the logs will go to logstash for free!

When we drafted this task the database deployment was TBD, but we will deploy on k8s in the end.
In terms of schema changes, is it enough to format whatever our pod generates in ECS?
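For illustration, this is roughly what "formatting what the pod generates in ECS" could look like: one JSON object per line on stdout, using ECS field names. This is a minimal sketch and not the WDQS or QLever logging setup. The field names (`@timestamp`, `message`, `log.level`, `log.logger`, `service.name`, `ecs.version`) are standard ECS, but the service name, logger name and ECS version value are placeholders, and the Common Logging Schema may require fields that are not shown here.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class EcsJsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, using ECS field names."""

    def format(self, record: logging.LogRecord) -> str:
        doc = {
            "@timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "message": record.getMessage(),
            "log.level": record.levelname,
            "log.logger": record.name,
            # Hypothetical values, chosen for illustration only.
            "service.name": "wdqs-database",
            "ecs.version": "1.11.0",
        }
        return json.dumps(doc)


def build_logger(stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes ECS-style JSON lines to `stream`."""
    logger = logging.getLogger("wdqs.example")  # hypothetical logger name
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(EcsJsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


if __name__ == "__main__":
    build_logger().error("query timed out")
```

Writing to stdout rather than to a network appender is a deliberate choice in this sketch: it assumes the cluster's log collection picks up container output, as described in the reply quoted above.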

We serve about 150 qps at peak. From May 13 to May 20 I see about 7 million log entries in https://logstash.wikimedia.org/app/discover#/?_g=h@d7d2a59&_a=h@900612c. This figure is lower than it should be. We also forward query logs to event platform, and for the same period we logged 72,774,019 events for the external endpoint alone.

That link has expired, could you give a shortlink for that query please? There are some specific rules around how logs are managed for wdqs but I don't think they should result in sampling. They're also around 10 years old so I'm not even sure if they're used at this point 😅

Err... thanks for the pointer. TIL.

For counting logs, I did a dummy query in OpenSearch with HOSTNAME : wdqs* (https://logstash.wikimedia.org/goto/28f8acb76ce48b797b395786dd2618f5). But this consistently reports an order of magnitude fewer (query) logs than what we push to EventGate. Am I reading the Hits count wrong? Is there a more appropriate (and reliable) way to query the OpenSearch index?

Cc @bking, who has more experience than me with these logs.

For counting logs, I did a dummy query in OpenSearch with HOSTNAME : wdqs* (https://logstash.wikimedia.org/goto/28f8acb76ce48b797b395786dd2618f5). But this consistently reports an order of magnitude fewer (query) logs

I can see in the query service logback configuration that the logstash appender rate-limits itself to 100 msg/sec.

For counting logs, I did a dummy query in OpenSearch with HOSTNAME : wdqs* (https://logstash.wikimedia.org/goto/28f8acb76ce48b797b395786dd2618f5). But this consistently reports an order of magnitude fewer (query) logs than what we push to EventGate. Am I reading the Hits count wrong?

Maybe the org.wikidata.query.rdf.common.log.RateLimitFilter could explain the discrepancy? Can you share a link to the Hits counter?

Is there a more appropriate (and reliable) way to query the OpenSearch index?

The logs in Logstash don't appear to be very enriched, i.e. I can't find a way to filter between wdqs instances (blazegraph, updater, categories). This is probably because the instances are not emitting this information to rsyslog.

For counting logs, I did a dummy query in OpenSearch with HOSTNAME : wdqs* (https://logstash.wikimedia.org/goto/28f8acb76ce48b797b395786dd2618f5). But this consistently reports an order of magnitude fewer (query) logs

Errr... major clarification: my original query was borked, because I was comparing logs from all WDQS services vs EventPlatform query logs.
Plus, in logstash it looks like wdqs only ships ERRORed queries.

There is still a discrepancy in number of hits, but lower than I initially reported (sorry!).

I can see in the query service logback configuration that the logstash appender ratelimits itself to 100msg/sec.

Ah. Thanks for the pointer. This does explain the delta, since during outages we see ERROR log volumes spike above that threshold.
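To make the effect concrete, here is a toy model of a per-second cap on a log stream. It is not the implementation of the logback appender or of org.wikidata.query.rdf.common.log.RateLimitFilter; the fixed one-second window and the class and function names are assumptions made for the example. Only the limit of 100 messages per second comes from the comment above.

```python
import logging


class FixedWindowRateLimit(logging.Filter):
    """Let at most `limit` records through per one-second window; drop the rest."""

    def __init__(self, limit: int = 100):
        super().__init__()
        self.limit = limit
        self.window = None  # the second (as an int) of the current window
        self.count = 0      # records passed in the current window
        self.dropped = 0    # records dropped since the filter was created

    def filter(self, record: logging.LogRecord) -> bool:
        second = int(record.created)
        if second != self.window:
            # A new second has started: reset the per-window counter.
            self.window, self.count = second, 0
        if self.count < self.limit:
            self.count += 1
            return True
        self.dropped += 1
        return False


def simulate(messages_in_one_second: int, limit: int = 100) -> tuple[int, int]:
    """Return (passed, dropped) for a burst of records that share one second."""
    flt = FixedWindowRateLimit(limit)
    passed = 0
    for _ in range(messages_in_one_second):
        record = logging.LogRecord(
            "demo", logging.ERROR, "demo.py", 0, "boom", None, None
        )
        record.created = 1_000_000.0  # pin every record to the same second
        if flt.filter(record):
            passed += 1
    return passed, flt.dropped
```

With this model, `simulate(150)` passes 100 records and drops 50, while `simulate(80)` drops nothing. In other words, a cap like this is invisible in normal operation and only loses messages during spikes, which is when the logs are most needed.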

Is this a hard limit or could we lift it? If we wanted to target higher qps (say, > 1k qps), would logstash/opensearch be a suitable target?

For counting logs, I did a dummy query in OpenSearch with HOSTNAME : wdqs* (https://logstash.wikimedia.org/goto/28f8acb76ce48b797b395786dd2618f5). But this consistently reports an order of magnitude fewer (query) logs than what we push to EventGate. Am I reading the Hits count wrong?

Maybe the org.wikidata.query.rdf.common.log.RateLimitFilter could explain the discrepancy?

I was able to validate that there is no data lost on the hosts (e.g. the logs in /var/logs/wdqs match what we ship to eventplatform), so I think the org.wikidata.query.rdf.common.log.RateLimitFilter could also definitely contribute.

Can you share a link to the Hits counter?

For the hits counter I looked at what was reported for the query:

HOSTNAME : wdqs* AND logger_name: "com.bigdata.rdf.sail.webapp.BigdataRDFServlet" AND level: ERROR:

2,185,848 hits
May 28, 2026 @ 09:07:09.651 - Jun 4, 2026 @ 09:07:09.651
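For anyone who wants to reproduce the count outside the Discover UI, the same filter can in principle be expressed as a request body for the OpenSearch `_count` API. The sketch below only builds the body and sends nothing. Several things in it are assumptions: that the time field is `@timestamp`, that the Discover query translates to the Lucene `query_string` syntax as written, and that the displayed times are UTC. Whether direct API access to the index is available is not covered here.

```python
import json


def build_count_body(query: str, start: str, end: str) -> dict:
    """Build a request body for the OpenSearch _count API.

    `query` uses Lucene query_string syntax; `start` and `end` are
    ISO 8601 timestamps bounding the (assumed) @timestamp field.
    """
    return {
        "query": {
            "bool": {
                "must": [{"query_string": {"query": query}}],
                "filter": [
                    {"range": {"@timestamp": {"gte": start, "lte": end}}}
                ],
            }
        }
    }


if __name__ == "__main__":
    body = build_count_body(
        'HOSTNAME:wdqs* AND '
        'logger_name:"com.bigdata.rdf.sail.webapp.BigdataRDFServlet" AND '
        "level:ERROR",
        "2026-05-28T09:07:09.651Z",  # timezone assumed to be UTC
        "2026-06-04T09:07:09.651Z",
    )
    print(json.dumps(body, indent=2))
```

The body would be sent to the `_count` endpoint of the relevant index pattern (the thread mentions `logstash-*`). A count obtained this way does not depend on how the Hits counter in the UI is rendered, which makes comparisons over time easier to repeat.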

Is there a more appropriate (and reliable) way to query the OpenSearch index?

The logs in Logstash don't appear to be very enriched, i.e. I can't find a way to filter between wdqs instances (blazegraph, updater, categories). This is probably because the instances are not emitting this information to rsyslog.

Yep, that's something we are aware of and would like to improve in WDQSv2 (starting with logging in ECS format).

Is this a hard limit or could we lift it? If we wanted to target higher qps (say, > 1k qps), would logstash/opensearch be a suitable target?

Ideally we should be able to support that volume, but not at this time: see T390215: Logstash is still overwhelmed (since March 2025). We're still working through capacity issues since the k8s migration, and ECS adoption has been slow.