Sunset ApiFeatureUsage (TDMP)
Closed, Declined (Public)

Description

Instructions

  1. Define the problem or opportunity (WHAT).
  2. Outline the importance of addressing the problem or opportunity (WHY).

What?

Write your problem statement using layperson's terminology.
In one sentence, what is the problem or opportunity?

Netflix Problem Statement Example: Going to the video store requires fighting traffic, wandering the aisles, and waiting in long lines just to get a single movie.

  • ApiFeatureUsage is a rarely used feature that requires effort to maintain, especially when Search needs to update ElasticSearch, as we are currently doing.
  • related ticket: https://phabricator.wikimedia.org/T301724

What does the future look like if this is achieved?

  • The Search team invests a little bit more time up front to sunset this feature, reducing drag in search development and/or software incompatibilities.
  • Users must rely on sysadmins to inform them if their bot/framework is using deprecated API queries, rather than being empowered and proactive, slowing down API development, requiring further time spent on maintaining already deprecated features.

What happens if we do nothing?

  • Nothing fails.
  • Search spends slightly less effort in the short term updating ApiFeatureUsage during our process of updating ElasticSearch to 7.10.
  • Search continues to spend more effort in the long term continuing to maintain this feature

Why?

Identify the value(s) this problem/opportunity provides. Add links to relevant OKRs.
Rank values in order of importance and be explicit about who this benefits and where the value is.

User Value/Organization Value AND Objective it supports and How

  1. Improves our code health by removing low/no-use features that cause software incompatibilities and/or development drag (i.e. tech debt)

Why are you bringing this decision to the Technical Forum?
What about the scope of this problem led you and your team to seek input across departments/organizations?

  • We are not sure who might be using this feature (internally and externally) currently (there seems to be some small amount of usage), and want to make sure we go through the right channels to make sure everybody is aware of this decision
  • There doesn't seem to be a clear process for how to sunset/undeploy features/code -

Event Timeline

MPhamWMF renamed this task from Sunset to Sunset ApiFeatureUsage. Feb 25 2022, 10:18 PM
MPhamWMF updated the task description.
MPhamWMF added a subscriber: Gehel.

(Please use subtasks and parent tasks for task dependencies - thanks!)

ApiFeatureUsage is pretty useful when helping bot and framework authors figure out what deprecated calls they're still using. If the integration with Elasticsearch is problematic, I think we could back it with MariaDB instead, as I had once suggested in T148484: Make ApiFeatureUsage extension easier to set up. Not sure we can keep all the same featureset, but pretty sure the core functionality would still work.

Making MariaDB its backend would solve that problem, but it would also open up a lot of nasty issues, such as locking and DB writes on GET requests. It's not impossible to fix (for example, you could hold the counts in memcached and flush them out once in a while), but it's a bit of work.
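To make the hold-and-flush idea concrete, here is a minimal sketch. It is only an illustration under stated assumptions: an in-process Counter stands in for memcached, the class name, the `sink` callable, and the feature and agent strings are all hypothetical, and a real deployment would also need to handle concurrency and lost buffers.

```python
from collections import Counter


class BufferedUsageCounter:
    """Accumulate hit counts in memory and write them out in batches."""

    def __init__(self, flush_every, sink):
        self.flush_every = flush_every  # flush after this many recorded hits
        self.sink = sink                # callable receiving {(feature, agent): count}
        self.pending = Counter()
        self.hits_since_flush = 0

    def record(self, feature, agent):
        # Cheap in-memory increment; no database write happens here.
        self.pending[(feature, agent)] += 1
        self.hits_since_flush += 1
        if self.hits_since_flush >= self.flush_every:
            self.flush()

    def flush(self):
        # One batched write replaces many per-request writes.
        if self.pending:
            self.sink(dict(self.pending))
        self.pending.clear()
        self.hits_since_flush = 0


# Hypothetical usage: the "sink" just collects batches in a list.
batches = []
counter = BufferedUsageCounter(flush_every=3, sink=batches.append)
for _ in range(3):
    counter.record("example-deprecated-feature", "ExampleBot/1.0")
counter.record("example-deprecated-feature", "OtherBot/2.0")
counter.flush()  # flush whatever is left, e.g. on a timer or at shutdown
```

The trade-off is that any counts still sitting in the buffer are lost if the buffer is evicted or the process dies before a flush, which is probably tolerable for approximate usage statistics.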

Another idea would be to use Hadoop. Hadoop already has this data, and I have used it in previous cases of deprecating API features in Wikidata. It could use some love, though, and to make it usable by outsiders, the data could be properly anonymized and aggregated in AQS so that the public could use it. I'm not sure public data would give much value, as most devs already have access to Hadoop, and if not, we can sort it out (and it's a good reason to ask for the access).

A hybrid solution could be to have a simple MariaDB backend for third parties, but not deploy it in production and use Hadoop there instead. Just a thought.

I'm not sure public data would give much value, as most devs already have access to Hadoop, and if not, we can sort it out (and it's a good reason to ask for the access).

The point of the current extension is to allow arbitrary API users (for example bot authors) to see the deprecation errors they are tripping on.

Making MariaDB its backend would solve that problem, but it would also open up a lot of nasty issues, such as locking and DB writes on GET requests. It's not impossible to fix (for example, you could hold the counts in memcached and flush them out once in a while), but it's a bit of work.

It can all be done postsend, where writes on GET requests are OK, AIUI. I imagine the table would be like date, feature (normalized), sanitized_user_agent, counter. The query would just be an upsert of counter=1 or counter=counter+1. If that causes too much locking contention, then instead of counting, we could just track if it was hit once or not (INSERT IGNORE ...). Less useful to users but still better than losing it entirely.
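As a rough sketch of the table and upsert described above, the following uses Python's bundled sqlite3 module rather than MariaDB, purely so that it runs standalone. The table name `api_feature_usage` and the sample values are hypothetical, the `date` column is called `day` here, and the `ON CONFLICT ... DO UPDATE` clause is SQLite's upsert syntax (SQLite 3.24 or later); in MariaDB the equivalent would be `INSERT ... ON DUPLICATE KEY UPDATE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE api_feature_usage (
        day TEXT NOT NULL,
        feature TEXT NOT NULL,
        sanitized_user_agent TEXT NOT NULL,
        counter INTEGER NOT NULL,
        PRIMARY KEY (day, feature, sanitized_user_agent)
    )
    """
)


def record_hit(day, feature, agent):
    # Insert counter=1 for a new (day, feature, agent) row,
    # otherwise increment the existing row's counter.
    conn.execute(
        """
        INSERT INTO api_feature_usage (day, feature, sanitized_user_agent, counter)
        VALUES (?, ?, ?, 1)
        ON CONFLICT (day, feature, sanitized_user_agent)
        DO UPDATE SET counter = counter + 1
        """,
        (day, feature, agent),
    )


# Hypothetical sample data.
for _ in range(3):
    record_hit("2022-02-26", "example-deprecated-feature", "ExampleBot")
record_hit("2022-02-26", "example-deprecated-feature", "OtherBot")

rows = conn.execute(
    "SELECT sanitized_user_agent, counter FROM api_feature_usage "
    "ORDER BY sanitized_user_agent"
).fetchall()
```

The hit-once-or-not fallback would drop the `counter` column and use an insert that ignores duplicates (`INSERT IGNORE` in MariaDB, `INSERT OR IGNORE` in SQLite), so a repeated hit never updates an existing row.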

Krinkle renamed this task from Sunset ApiFeatureUsage to Sunset ApiFeatureUsage (TDMP). Feb 26 2022, 1:25 PM

ApiFeatureUsage is a rarely used feature that requires effort to maintain, especially when Search needs to update ElasticSearch, as we are currently doing.
related ticket: https://phabricator.wikimedia.org/T301724

I assume this is meant to refer to T263142: [EPIC] Upgrade Elasticsearch to version 7.10 instead of T301724: Sunset ApiFeatureUsage.

What is it about ApiFeatureUsage that complicates the Elastic upgrade? Is it its mere existence, and thus the motivation is to reduce the number of distinct Elastic indexes that we have? Or is there something specific about ApiFeatureUsage that makes it unreasonably difficult to upgrade?

If we know what those aspects or options are, perhaps something can be simplified so as to still retain the basic concept of the feature usage messages being indexed and queryable from a public interface.

I'm not sure public data would give much value, as most devs already have access to Hadoop, and if not, we can sort it out (and it's a good reason to ask for the access).

The point of the current extension is to allow arbitrary API users (for example bot authors) to see the deprecation errors they are tripping on.

Thanks. I misunderstood it. So maybe having it in AQS would work?

Making MariaDB its backend would solve that problem, but it would also open up a lot of nasty issues, such as locking and DB writes on GET requests. It's not impossible to fix (for example, you could hold the counts in memcached and flush them out once in a while), but it's a bit of work.

It can all be done postsend, where writes on GET requests are OK, AIUI. I imagine the table would be like date, feature (normalized), sanitized_user_agent, counter. The query would just be an upsert of counter=1 or counter=counter+1. If that causes too much locking contention, then instead of counting, we could just track if it was hit once or not (INSERT IGNORE ...). Less useful to users but still better than losing it entirely.

That would solve that problem, but it adds a massive amount of writes to our databases, which would drastically slow down replication (possibly to the point that replicas won't be able to catch up, in s8 for example), given that replication is serial, plus more complexities around that. This is not to say we absolutely must not move it to MariaDB, but it's not acceptable to move it to MariaDB just as-is. For example, a dedicated section would work, but I don't think that can happen any time soon.

Random note: Have people here considered Cassandra?

I don't know exactly how/why it's hard to maintain in Elastic (and I think in its current state, it'll be a nightmare to maintain anywhere it goes), but I think the problem statement is missing the underlying problem, which is the state of our action APIs. Why do we have around 1M hits on deprecated API endpoints every hour? Why is no one pushing to clear that? Community engagement, reaching out to the biggest offenders, making sure we eventually pull the plug, etc. can ease the maintenance burden drastically, but this is not in the official capacity of any team.

Making MariaDB its backend would solve that problem, but it would also open up a lot of nasty issues, such as locking and DB writes on GET requests. It's not impossible to fix (for example, you could hold the counts in memcached and flush them out once in a while), but it's a bit of work.

It can all be done postsend, where writes on GET requests are OK, AIUI.

(Side note to this conversation: my understanding from T92357: Fix database master queries from HTTP GET/HEAD before active-active multi-dc is that writes on GET requests should not be done.)

ApiFeatureUsage is pretty useful when helping bot and framework authors figure out what deprecated calls they're still using.

While it definitely can be useful, it is not clear that it is actually used. Or used enough to justify keeping it up. Do you know of actual use cases? Or is this a guess that some people might find it useful?

The question I'd like to answer is whether ApiFeatureUsage is useful enough to justify the cost of keeping it. The cost isn't high, but my intuition is that the usefulness isn't high either.

What is it about ApiFeatureUsage that complicates the Elastic upgrade? Is it its mere existence, and thus the motivation is to reduce the number of distinct Elastic indexes that we have? Or is there something specific about ApiFeatureUsage that makes it unreasonably difficult to upgrade?

The current implementation of ApiFeatureUsage is via MediaWiki logging, which is aggregated by Logstash before being propagated to the Search Elasticsearch cluster. This brings a few technical problems:

  • It creates a dependency between our ELK cluster and our Search cluster
    • This dependency has caused trouble for upgrades already, as the versions of Logstash and Elasticsearch need to match (not a completely strict match, but an additional constraint nonetheless). This adds additional constraints on how we can upgrade our different clusters and how much we need to keep them in sync.
    • It creates a runtime dependency, where a failure of the Search cluster might stall the ELK pipeline. I think isolation has been improved semi-recently, but this has caused a full failure of our ELK stack in the past during planned maintenance of the Search cluster.
  • It creates some minor additional maintenance issues. For example, we recently had T300062, which wasn't strictly speaking triggered by ApiFeatureUsage, but since ApiFeatureUsage is less used and less understood, that part of the problem took longer to identify and fix. I could probably find a few other examples.
  • Since we are upgrading to Elasticsearch 7, we will need to validate that ApiFeatureUsage still works on that Elasticsearch version. It is unlikely that it will require major rework, but just testing is going to take some time.

To summarize, there aren't major blocking issues around ApiFeatureUsage, but there is an ongoing level of small annoyances and additional work required to support it (and with no clear ownership, this falls to the Search Platform & Observability teams by default).

For additional context, a quick look at Turnilo indicates that there are around 100 requests per day to ApiFeatureUsage across all wikis. This seems to be an incredibly small number in our context (which would argue for dropping the feature entirely). Given the nature of ApiFeatureUsage, those very few requests might or might not be super important for some use cases (which would argue for keeping the feature).

This is expected given there are no active API deprecations/removals going on right now, so there's no reason for people to be checking the page.

ApiFeatureUsage is pretty useful when helping bot and framework authors figure out what deprecated calls they're still using.

While it definitely can be useful, it is not clear that it is actually used. Or used enough to justify keeping it up. Do you know of actual use cases? Or is this a guess that some people might find it useful?

The question I'd like to answer is whether ApiFeatureUsage is useful enough to justify the cost of keeping it. The cost isn't high, but my intuition is that the usefulness isn't high either.

I tend to sit on all three sides of the API development process: 1) MediaWiki API developer and former asst. maintainer 2) framework author, formerly Pywikibot and now mwbot-rs, and 3) bot operator.

ApiFeatureUsage is especially useful for audience #3. I just type in my bot's username, "Legobot", which is in the user-agent of most frameworks, and can very easily see that I'm using no deprecated features/parameters/endpoints, or that I am using some deprecated functionality that I need to fix, or that the hits are all in the past and my bot is no longer using those.

#3 is also the audience we have the toughest time reaching because it is very diverse, spread out, many API users don't subscribe to the mediawiki-api-announce mailing list, etc. So IMO the more proactive we can make it for users the better. If this tool didn't exist, we would likely see more breakage among bots and tools, plus increased work for developers to proactively notify users based on logs (which happens somewhat already).

For audience #2 it is not that useful, most framework authors subscribe to the -announce mailing list, and searching for e.g. "Pywikibot" isn't useful if all the hits are from old versions.

It provides no value to audience #1 because when removing deprecated functionality, we look at overall stats in api.log, not necessarily the api-feature-usage.log. But like I said it reduces the amount of work needed, since this tool is self-service and doesn't require sysadmins to grep in logs upon request.

All this said, there really is no one working on MediaWiki Action API development these days, which is a real shame, so this tool doesn't get used as often as it should, because no one is deprecating things or proposing removals of bad functionality.

I want to reframe this conversation slightly.

It sounds like this feature may be useful to a subset of users at least anecdotally, if not supported by more investigation into the usage data. It sounds like bot operators are an overlooked community of users who could stand to benefit from this feature the most.

The strategy around bot operators and MediaWiki APIs is out of scope of the Search team. We are currently overcommitted to too many priorities, and experiencing too much drag from tech debt and maintaining code that is not central to the Search team's goals and mandate -- I am looking to reduce commitments and drag on this team. This may not be a huge feature in the scheme of things, but it would still help in the long term.

As API Feature Usage is not a search feature, I have little insight into its value. I think that if it is important, and/or we want bot operators and MWAPI developers to get more support, this belongs in another product/tech portfolio, and I encourage another team to continue developing with these goals in mind. My immediate goal is to streamline the Search team's product portfolio and focus, and this feature does not fall within that scope.

To summarize, there aren't major blocking issues around ApiFeatureUsage, but there is an ongoing level of small annoyances and additional work required to support it (and with no clear ownership, this falls to the Search Platform & Observability teams by default).

My immediate goal is to streamline the Search team's product portfolio and focus, and this feature does not fall within that scope.

I appreciate the desire by both of you to protect the Search team from toil and ownership of abandonware. I also think that this ticket was a successful tactic for finding out if anyone at all cares about Extension:ApiFeatureUsage.

Some additional guidance is needed to help all of us who might want to try and keep this functionality from disappearing even if we cannot convince the Foundation that the Action API is part of their project mandate. Specifically we need to know what if any conditions must be met to continue to use Elasticsearch as the backing store for the (timestamp, feature, agent) tuples that are collected. Would breaking the (admittedly hacky and my fault as the original implementor) inter-dependency between the ELK cluster and the indices be enough? If not, what if any other technical and social corrections would be needed to continue to use consolidated full-text search infrastructure for this project?

The current implementation of ApiFeatureUsage is via MediaWiki logging, which is aggregated by Logstash before being propagated to the Search Elasticsearch cluster. This brings a few technical problems:

  • It creates a dependency between our ELK cluster and our Search cluster
    • This dependency has caused trouble for upgrades already, as the versions of Logstash and Elasticsearch need to match (not a completely strict match, but an additional constraint nonetheless). This adds additional constraints on how we can upgrade our different clusters and how much we need to keep them in sync.
    • It creates a runtime dependency, where a failure of the Search cluster might stall the ELK pipeline. I think isolation has been improved semi-recently, but this has caused a full failure of our ELK stack in the past during planned maintenance of the Search cluster.

A Search cluster outage should no longer affect Logstash log ingest since T297239. This was solved by creating a separate apifeatureusage-specific Logstash instance and consumer group which can be upgraded independently of the Logstash instances supporting centralized logging.

For what it's worth, I went around and asked the biggest offenders to fix their deprecated API calls. It's already a bit smaller. Will push more later.

After fixing cyberbot's deprecated calls, we went from ~20K rows/min to 15K rows/min:

(Attached graph: image.png)

I assume that's about 7M fewer records a day, or 216M fewer records a month. HTH.
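For anyone checking the estimate, the arithmetic can be spelled out as below. This is only a back-of-the-envelope sketch: it treats the approximate 20K and 15K rows/min figures as exact and assumes a 30-day month.

```python
rows_per_min_before = 20_000  # approximate rate before the fix
rows_per_min_after = 15_000   # approximate rate after the fix

saved_per_min = rows_per_min_before - rows_per_min_after  # 5,000
saved_per_day = saved_per_min * 60 * 24                   # 7,200,000
saved_per_30_days = saved_per_day * 30                    # 216,000,000
```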

Smaller use cases are things like T304870: Update usages of deprecated action=oathvalidate&totp=... Previously, it was used for much greater-impact tasks like T280806: Remove old action api token parameters

This is a good point. It probably could use some scoping:

  1. Is this a request to sunset MediaWiki's production of ApiFeatureUsage logs into the logging pipeline?
  2. Is this a request to sunset the Search team's ingestion and presentation of these logs via the ApiFeatureUsage extension?

If the former is proposed, these cases would be affected. If the latter, the logs would continue to flow into Logstash.

bd808 changed the task status from Resolved to Declined. May 13 2022, 12:52 AM

LNguyen moved this task from Problem Statement Review to Withdrawn from Process on the tech-decision-forum board.

Changing state from resolved to declined to more clearly indicate that this was not implemented per the movement to the "Withdrawn from Process" column.

FYI, this discussion is continuing at T301724: Sunset ApiFeatureUsage:

The ticket was declined because @SWakiyama cleared my team to move forward with this work without pushing it through the rest of the Tech Decision process.
I was under the understanding that this unblocked this work for the search team.

So the withdrawal is "we're going to do it anyways".

For more clarity, at the point of withdrawal from the process, this ticket was at the stage where Search was unblocked from moving forward. We have a RACI card filled out based on feedback, which we still intend to use to communicate our plans as we move forward. There were multiple upcoming phases of shopping ideas around for feedback, but I was told that this feedback would not be binding or a blocker to continue our work, but as additional points of perspective to consider.
Given that this is a minor component that was holding up our larger Elasticsearch version upgrade, I expressed concern that these additional stages were creating busy work that put our timelines at risk. Based on the size and impact of removing API Feature Usage's dependency on Elasticsearch, I was given the go ahead to move forward with this work without moving through the final steps of the Tech Decision process, which were understood to be more of a formality at this stage.

Our initial decision to use this process was meant to make this breaking change visible to other teams who may be affected by it, rather than to seek permission for it. If that is not what this process was intended for, it was a mistake on my part in misunderstanding exactly what this process was set up to do, and once again, apologies for causing confusion.

@LNguyen: Please explain actions in an additional comment. "Resolving" this task led to confusion in T301724. Thanks.