Error in apifeatureusage curator "forcemerge" step
Open, Needs Triage, Public

Description

2021-04-21 00:42:02,219 INFO      Preparing Action ID: 1, "delete_indices"
2021-04-21 00:42:02,234 INFO      Trying Action ID: 1, "delete_indices": apifeatureusage: delete older than 91 days
2021-04-21 00:42:06,824 INFO      Deleting selected indices: ['apifeatureusage-2021.01.20']
2021-04-21 00:42:06,824 INFO      ---deleting index apifeatureusage-2021.01.20
2021-04-21 00:42:07,920 INFO      Action ID: 1, "delete_indices" completed.
2021-04-21 00:42:07,920 INFO      Preparing Action ID: 2, "replicas"
2021-04-21 00:42:07,928 INFO      Trying Action ID: 2, "replicas": apifeatureusage: set replicas to 1 after 31 days
2021-04-21 00:42:12,161 INFO      Setting the replica count to 1 for indices: ['apifeatureusage-2021.01.30', 'apifeatureusage-2021.02.14', 'apifeatureusage-2021.02.18', 'apifeatureusage-2021.02.24', 'apifeatureusage-2021.02.22', 'apifeatureusage-2021.01.27', 'apifeatureusage-2021.01.23', 'apifeatureusage-2021.02.19', 'apifeatureusage-2021.03.01', 'apifeatureusage-2021.02.21', 'apifeatureusage-2021.01.21', 'apifeatureusage-2021.02.26', 'apifeatureusage-2021.02.08', 'apifeatureusage-2021.01.28', 'apifeatureusage-2021.01.29', 'apifeatureusage-2021.02.28', 'apifeatureusage-2021.02.27', 'apifeatureusage-2021.02.13', 'apifeatureusage-2021.02.04', 'apifeatureusage-2021.02.15', 'apifeatureusage-2021.01.31', 'apifeatureusage-2021.01.22', 'apifeatureusage-2021.02.10', 'apifeatureusage-2021.02.09', 'apifeatureusage-2021.02.05', 'apifeatureusage-2021.02.07', 'apifeatureusage-2021.01.24', 'apifeatureusage-2021.02.25', 'apifeatureusage-2021.02.12', 'apifeatureusage-2021.02.11', 'apifeatureusage-2021.02.16', 'apifeatureusage-2021.01.26', 'apifeatureusage-2021.02.02', 'apifeatureusage-2021.02.23', 'apifeatureusage-2021.02.03', 'apifeatureusage-2021.01.25', 'apifeatureusage-2021.02.01', 'apifeatureusage-2021.02.06', 'apifeatureusage-2021.02.17', 'apifeatureusage-2021.02.20']
2021-04-21 00:42:12,673 INFO      Action ID: 2, "replicas" completed.
2021-04-21 00:42:12,674 INFO      Preparing Action ID: 5, "forcemerge"
2021-04-21 00:42:12,686 INFO      Trying Action ID: 5, "forcemerge": forcemerge indexes older than 2 days
2021-04-21 00:42:17,019 ERROR     Unable to complete action "forcemerge".  No actionable items in list: <class 'curator.exceptions.NoIndices'>

Event Timeline

Any idea what these logs look like when it works?

Poking around a bit more, it looks like we don't have any apifeatureusage indices after 2021.03.01 for eqiad or codfw. It seems plausible that curator is complaining because there are no new indices. These are supposed to be created automatically by logstash shipping data to the cirrussearch clusters. I poked through the SAL and puppet repos, but have not yet found any reason why these stopped flowing.
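The size of the gap can be read off the index names in the log above. A minimal sketch, assuming daily indices named `apifeatureusage-YYYY.MM.DD` as they appear in the log; the helper names `index_date` and `newest_index_age` are illustrative and not part of curator, and the list is only a sample of the names from the "replicas" step:

```python
from datetime import date, datetime
from typing import List, Tuple

PREFIX = "apifeatureusage-"  # naming pattern taken from the curator log


def index_date(name: str) -> date:
    """Parse the date suffix of a daily index name."""
    return datetime.strptime(name[len(PREFIX):], "%Y.%m.%d").date()


def newest_index_age(names: List[str], today: date) -> Tuple[str, int]:
    """Return the newest index name and its age in days relative to `today`."""
    newest = max(names, key=index_date)
    return newest, (today - index_date(newest)).days


# A few of the names from the "replicas" step, not the full list.
sample = [
    "apifeatureusage-2021.01.30",
    "apifeatureusage-2021.02.14",
    "apifeatureusage-2021.03.01",
    "apifeatureusage-2021.02.21",
]
newest, age = newest_index_age(sample, date(2021, 4, 21))
print(newest, age)
```

Run against the date of the failing curator run (2021-04-21), this reports how long it has been since the last index was created.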

It is likely (though not yet tested) that this breakage is related to the upgrade of Logstash to version 7, which could have broken compatibility with Elasticsearch 6 and left Logstash unable to send data to the Search cluster. If that's the case, there is probably no easy fix other than upgrading the Search cluster to Elasticsearch 7.

This strong coupling between Logstash and Search is problematic and should be broken. The ApiFeatureUsage feature could be simplified a lot by removing the dependency on Logstash completely and relying on the usual event platform instead.

It is also unclear what the value of this feature is and how much effort should be put into fixing it.

The ApiFeatureUsage extension sends data to Elastic, and it also actively queries this data back out from a public MediaWiki-powered user interface (Special:ApiFeatureUsage). Example query: https://meta.wikimedia.org/wiki/Special:ApiFeatureUsage?wpagent=Twisted+PageGetter&wpstartdate=2021-02-01&wpenddate=2021-02-28. These are public so as to allow the community to find and work through these deprecations on their own.
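For reference, the example query above is just the special page plus three form parameters. A minimal sketch of how such a URL could be assembled, assuming only the parameter names visible in that URL (`wpagent`, `wpstartdate`, `wpenddate`); the function name `feature_usage_url` is hypothetical:

```python
from urllib.parse import urlencode

BASE = "https://meta.wikimedia.org/wiki/Special:ApiFeatureUsage"


def feature_usage_url(agent: str, start: str, end: str) -> str:
    """Build a Special:ApiFeatureUsage query URL for one user agent and date range."""
    # Parameter names are copied from the example URL in the thread.
    params = {"wpagent": agent, "wpstartdate": start, "wpenddate": end}
    return BASE + "?" + urlencode(params)


url = feature_usage_url("Twisted PageGetter", "2021-02-01", "2021-02-28")
print(url)
```

This only builds the link; it does not fetch or parse the page.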

As such, I would say it is more similar in nature to CirrusSearch than to Logstash or EventGate.

Logstash is mentioned because that's how the logs get to elasticsearch. api-feature-usage goes through the standard mediawiki logging pipeline, and Logstash is expected to forward the api-feature-usage logs on to the cirrussearch clusters. The logs currently make it to the production logging clusters, but they are not being repeated to the cirrussearch clusters. Something changed, potentially with logstash, late on March 1st that stopped data from being sent to our clusters.

The separate question seems to be, this has been broken since March 1st and no-one noticed. Is a feature that breaks for months without being noticed something we need to fix, or is this a sign that it's almost unused?

While we're on the topic (ah!) of apifeatureusage, with mediawiki logs on kafka we don't strictly need logstash anymore to ingest kafka -> cirrussearch if the feature stays based on mw logs (as opposed to event platform).

Change 684969 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] profile: restore rsyslog-udp-localhost inputs on legacy logstash cluster

https://gerrit.wikimedia.org/r/684969

It was unintentionally broken but it can be restored.

The separate question seems to be, this has been broken since March 1st and no-one noticed. Is a feature that breaks for months without being noticed something we need to fix, or is this a sign that it's almost unused?

ApiFeatureUsage may be seldom used. Logstash says (as of this writing) 77 hits in the last 90 days.

Is it possible users did not notice because an empty result indicates an acceptable answer?

I'm not sure who can determine whether or not the feature can be deprecated, but from our perspective it would be great if possible to remove it.

It was unintentionally broken but it can be restored.

Thanks for looking. If this is mostly about keeping some tech debt without any significant reworking, the appropriate course of action seems to be to restore the functionality. The data that was not indexed to the CirrusSearch clusters appears to still be in the logging clusters, so we could in theory backfill. I expect that to be somewhat time-consuming, though: we don't have an existing process, so it would be fairly manual. I'd be tempted to restore but not backfill unless there is a strong argument for backfilling.
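To get a feel for how much a backfill would cover, the daily indices that were never created can be enumerated. A rough sketch, assuming one index per day with the `apifeatureusage-YYYY.MM.DD` naming seen in the curator log; the function name `missing_daily_indices` is hypothetical, and the restore date used below is a placeholder rather than the actual date the forwarding was fixed:

```python
from datetime import date, timedelta
from typing import List


def missing_daily_indices(last_present: date, restored_on: date,
                          prefix: str = "apifeatureusage-") -> List[str]:
    """Names of daily indices strictly between the last one present and the restore date."""
    names = []
    day = last_present + timedelta(days=1)
    while day < restored_on:
        names.append(prefix + day.strftime("%Y.%m.%d"))
        day += timedelta(days=1)
    return names


# 2021-03-01 is the last index mentioned in the thread.
# The restore date is a placeholder assumption, not a known value.
gap = missing_daily_indices(date(2021, 3, 1), date(2021, 5, 1))
print(len(gap), gap[0], gap[-1])
```

This only lists index names. Actually copying documents from the logging clusters is the manual part and is not sketched here.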

The separate question seems to be, this has been broken since March 1st and no-one noticed. Is a feature that breaks for months without being noticed something we need to fix, or is this a sign that it's almost unused?

ApiFeatureUsage may be seldom used. Logstash says (as of this writing) 77 hits in the last 90 days.

Is it possible users did not notice because an empty result indicates an acceptable answer?

I'm not sure who can determine whether or not the feature can be deprecated, but from our perspective it would be great if possible to remove it.

I'm not sure either. From Krinkle's post above, the use case is helping bots find when they are using deprecated features? If so, then it seems possible that this has value while also having extremely low traffic. That would also support the idea that an empty result is a desirable response. Overall, though, I don't know enough about this feature to say much.

Change 684969 merged by Cwhite:

[operations/puppet@production] profile: restore rsyslog-udp-localhost inputs on legacy logstash cluster

https://gerrit.wikimedia.org/r/684969

ApiFeatureUsage logs should be forwarding again. I'm hopeful we'll see this error clear up in ~2 days.

The separate question seems to be, this has been broken since March 1st and no-one noticed. Is a feature that breaks for months without being noticed something we need to fix, or is this a sign that it's almost unused? […] I'm not sure who can determine whether or not the feature can be deprecated, but from our perspective it would be great if possible to remove it.

I'm not sure either. From Krinkle's post above, the use case is helping bots find when they are using deprecated features? If so, then it seems possible that this has value while also having extremely low traffic. That would also support the idea that an empty result is a desirable response.

I believe this is indeed correct. I don't think the ApiFeatureUsage system should be deprecated, in part because it is what allows us to do deprecations in the API.

If it has very few entries at the moment, that's most likely because 1) we haven't been making as many breaking changes in the API recently, which is nice I suppose, and 2) the API clients have caught up with most of our ongoing deprecations, or got cut off when we removed whatever deprecated thing they were using, so those are now errors instead of ApiFeatureUsage-logged warnings.

I would expect that, come next time we plan to deprecate an API feature, these logs will fill up again for a period of time, etc.

Tagging per mw:Maintainers. It looks like the issue has been resolved, and I've tentatively answered the above question on your behalf. It might be good to be aware that this happened, and to elaborate on or correct anything we said as you see fit!