
ApiFeatureUsage data is not being populated in the Beta Cluster
Closed, Resolved (Public)

Description

https://en.wikipedia.beta.wmflabs.org/wiki/Special:ApiFeatureUsage reports no data is available for any date range.

Also, I note that going to deployment-elastic05.eqiad.wmflabs (or 06 or 07) and executing curl -v localhost:9200/_cat/indices?v does not list any apifeatureusage indexes.

Event Timeline

I ran into this again when trying to test ApiFeatureUsage. :(

@Anomie, who would you ask if this were broken in production, and which tags would you use? I'm not sure myself, which is why I'm asking you.

In production I'd probably tag with Wikimedia-Logstash and/or Elasticsearch. Team-wise, probably SRE (for logstash) and Search (for ES), assuming the component tags didn't get a response.

Here's what I know:

  • Logs from MediaWiki are written to the 'api-feature-usage' channel.
  • Logstash is supposed to munge these messages to an 'api-feature-usage-sanitized' channel.
    • I see what looks like the right configuration for that in deployment-logstash02:/etc/logstash/conf.d/55-filter_apifeatureusage.conf
  • Logstash is supposed to write that 'api-feature-usage-sanitized' channel to the daily apifeatureusage index on the "search" ES instances, deployment-elastic05 through 07.
    • I see deployment-logstash02:/etc/logstash/conf.d/95-output-elasticsearch-apifeatureusage-deployment-logstash2-deployment-prep-eqiad-wmflabs.conf looks like it's supposed to do that, but the fact that it has hosts => ["deployment-logstash2.deployment-prep.eqiad.wmflabs:9200"] seems off. Unfortunately I don't seem to have access to production logstash hosts to compare the configs there.
    • At any rate, I don't see the apifeatureusage indexes in ES on deployment-elastic05 or on deployment-logstash2. No errors in deployment-logstash02:/var/log/logstash/ either.
    • You can see these indexes in production with curl https://search.svc.eqiad.wmnet:9243/_cat/indices?v | grep apifeatureusage from mwmaint1002.
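The index check described above can be scripted. What follows is only a rough sketch: it assumes the output of `_cat/indices?v` has already been captured as text, that the header row contains an `index` column, and that the indices are named with an `apifeatureusage` prefix. The helper name and the sample rows are hypothetical and were not taken from any real cluster.

```python
from __future__ import annotations


def apifeatureusage_indices(cat_indices_output: str) -> list[str]:
    """Return the apifeatureusage index names found in `_cat/indices?v` text."""
    lines = cat_indices_output.strip().splitlines()
    if not lines:
        return []
    header = lines[0].split()
    if "index" not in header:
        return []
    col = header.index("index")
    names = []
    for line in lines[1:]:
        fields = line.split()
        # Rows with blank leading columns would shift the fields; this
        # sketch simply assumes every column is populated.
        if len(fields) > col and fields[col].startswith("apifeatureusage"):
            names.append(fields[col])
    return sorted(names)


# Hypothetical sample rows, for illustration only.
SAMPLE = """\
health status index                      uuid pri rep docs.count docs.deleted store.size pri.store.size
green  open   apifeatureusage-2019.03.14 aaaa 1   1   10         0            1kb        1kb
green  open   some_other_index           bbbb 1   1   5          0            1kb        1kb
"""

print(apifeatureusage_indices(SAMPLE))
```

An empty list corresponds to the symptom reported in this task.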

I poked at this a little today. The logstash API claims it is outputting events; the numbers are low because I restarted logstash in the beta cluster very recently to make sure I was reading the config it was actually using.

ebernhardson@deployment-logstash2:/etc/logstash/conf.d$ curl -s localhost:9600/_node/stats/pipeline | jq '.pipeline.plugins.outputs | map(if (.id == "output/elasticsearch/apifeatureusage-deployment-logstash2.deployment-prep.eqiad.wmflabs") then . else empty end)'
[
  {
    "id": "output/elasticsearch/apifeatureusage-deployment-logstash2.deployment-prep.eqiad.wmflabs",
    "events": {
      "duration_in_millis": 33,
      "in": 14,
      "out": 14
    },
    "name": "elasticsearch"
  }
]
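For anyone repeating this check, the same comparison can be done on the JSON directly. This is a minimal sketch that reuses the plugin entry shown above as literal input; the function name is hypothetical. Equal `in` and `out` counters only show that logstash considers the events emitted, not that elasticsearch accepted them.

```python
import json

# The plugin stats entry quoted above, pasted in as a literal.
STATS_JSON = """
[
  {
    "id": "output/elasticsearch/apifeatureusage-deployment-logstash2.deployment-prep.eqiad.wmflabs",
    "events": {"duration_in_millis": 33, "in": 14, "out": 14},
    "name": "elasticsearch"
  }
]
"""


def events_not_emitted(plugin_stats: dict) -> int:
    """Return how many events entered the output plugin but were not emitted."""
    events = plugin_stats["events"]
    return events["in"] - events["out"]


plugin = json.loads(STATS_JSON)[0]
print(plugin["name"], events_not_emitted(plugin))
```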

My best guess for what's happening here is that elasticsearch on the logstash host is configured to only auto-create logstash indices, though I'm not sure why the indexing failure is not logged by logstash anywhere. This is the standard production configuration, where apifeatureusage logs are expected to go to the search cluster rather than the logging cluster, since we consider the logging cluster private and not queryable from MediaWiki.

action.auto_create_index: +logstash-*,-*

For whatever reason, the search cluster on beta does not seem to accept apifeatureusage indices either: it has auto_create_index completely disabled.
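To illustrate why both clusters would refuse to create an apifeatureusage index, here is a rough sketch of how such a setting is evaluated. It assumes the patterns are checked in order, that the first match decides, and that an index matching no pattern is not auto-created. The function name, the example index names, and the last setting string are hypothetical; this is not elasticsearch's actual implementation and not necessarily what the eventual fix used.

```python
from fnmatch import fnmatchcase


def may_auto_create(index: str, setting: str) -> bool:
    """Approximate whether `index` may be auto-created under `setting`."""
    value = setting.strip()
    if value.lower() in ("true", "false"):
        return value.lower() == "true"
    for expr in value.split(","):
        expr = expr.strip()
        if not expr:
            continue
        allow = not expr.startswith("-")
        pattern = expr[1:] if expr[0] in "+-" else expr
        # fnmatchcase also treats ? and [ ] as wildcards; that is close
        # enough for this illustration, where only * is used.
        if fnmatchcase(index, pattern):
            return allow
    return False


# Setting quoted above, from the logstash-side elasticsearch.
print(may_auto_create("logstash-2019.03.13", "+logstash-*,-*"))
print(may_auto_create("apifeatureusage-2019.03.13", "+logstash-*,-*"))
# Auto-creation completely disabled, as on the beta search cluster.
print(may_auto_create("apifeatureusage-2019.03.13", "false"))
# Hypothetical setting that would also admit apifeatureusage indices.
print(may_auto_create("apifeatureusage-2019.03.13", "+apifeatureusage-*,-*"))
```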

Steps to fix:

  • puppet (hiera) in deployment-prep needs to provide appropriate auto_create_index config to elasticsearch instances
  • logstash needs to be changed to send logs to deployment-elastic??.deployment-prep.eqiad.wmflabs

I was trying to do that, but I can't seem to log in to horizon, which is preventing me from fixing the deployment-prep hiera.

Did you figure out logging in to horizon?

After looking in horizon, I realized this is actually configured from puppet. I've submitted an appropriate puppet patch which should get this working, or at least get us to the next step of debugging. Both the elasticsearch and logstash daemons will need to be restarted after puppet is deployed; puppet will not restart the daemons automagically.

Change 496093 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/puppet@production] Allow apifeatureusage to work in beta cluster

https://gerrit.wikimedia.org/r/496093

Change 496093 merged by Gehel:
[operations/puppet@production] Allow apifeatureusage to work in beta cluster

https://gerrit.wikimedia.org/r/496093

Mentioned in SAL (#wikimedia-cloud) [2019-03-13T18:09:42Z] <ebernhardson> restart elasticsearch on deployment-elastic* to deploy apifeature usage fix (T183156)

Due to unrelated issues, the restart to pick up the config change is still pending.

Change 496557 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[operations/puppet@production] Update apifeatureusage es template to match 5.6.x+

https://gerrit.wikimedia.org/r/496557

This looks to be working again and ingesting apifeatureusage logs into the beta-search cluster. I manually updated the template to match the above patch; it will need to be merged before we can call this complete.

We might also need to check back in a month and make sure the pruning process is also removing the old indices.

Change 496557 merged by Gehel:
[operations/puppet@production] Update apifeatureusage es template to match 5.6.x+

https://gerrit.wikimedia.org/r/496557

Confirmed that deployment-elastic05 now contains indexes for apifeatureusage for the past two days, and seems to be receiving new entries. Thanks!

Let's call this resolved then, unless you want to keep it open for your check in 30 days?