
Long term plan for reducing maintenance workload on the Search Platform team of supporting APIFeatureUsage
Open, In Progress, Medium, Public

Description

As discussed in T302638 and T301724, the Search Platform team would like APIFeatureUsage to not slow down the work on Search. This feature seems to be critical, so decommissioning it does not seem like the path forward. The Search Platform team would like the following points to be addressed:

  • upgrades to the Elasticsearch Search cluster should not be impacted by APIFeatureUsage
  • operation of the Elasticsearch Search cluster should not be impacted by APIFeatureUsage

More concretely, the pain points experienced so far are:

  • operational issues, for example curator scripts failing on indices specific to APIFeatureUsage
  • upgrades of the Elasticsearch Search cluster being delayed to address compatibility of APIFeatureUsage with ES 7.x
  • stalls of the Logstash ingestion pipeline related to APIFeatureUsage

Note that all the above are past examples that have individually been addressed, but the complexity of the current solution leads us to think that similar issues are likely to arise again in the future. A more robust / simpler architecture is probably needed.

Event Timeline

@bd808 and I discussed this a bit. Technical Engagement can offer to help with simplifying the ingestion pipeline of APIFeatureUsage, and could likely work on it early in the next quarter.

@Gehel and @EBernhardson: do either of you have concrete ideas for what could/should be changed here to reduce toil and backpressure on your team?

Some additional guidance is needed to help all of us who might want to try and keep this functionality from disappearing even if we cannot convince the Foundation that the Action API is part of their project mandate. Specifically, we need to know what, if any, conditions must be met to continue to use Elasticsearch as the backing store for the (timestamp, feature, agent) tuples that are collected. Would breaking the (admittedly hacky and my fault as the original implementor) inter-dependency between the ELK cluster and the indices be enough? If not, what, if any, other technical and social corrections would be needed to continue to use consolidated full-text search infrastructure for this project?

This might not fully answer your question but I'll try :)
Based on the assumption that this feature is not vital to bot owners but rather to engineers working on the WMF infra who want to help bot owners change their tools because of performance issues or parameter deprecations:
Would using the Hive table event.mediawiki_api_request be sufficient for this purpose?

I don't think this assumption is a good one. The reason this feature was built was to make access to the deprecation reports self-service for the Action API using community.

Stepping back, I wonder what this feature would look like if we had to design it today:

  • First, I'm not entirely convinced that a search backend is the right option for this: none of the fields seem tokenized, and from a quick look at the code the only features it requires are a range query on a date field and an aggregation on a string field

This is a fair argument. At the time our thinking was basically that Elasticsearch could keep up with the rate of ingestion and that by creating an index per day we could "cheaply" drop older data. I agree that nothing here is actually using the full text search features of Elasticsearch.

  • It also seems to bypass the modern event platform, which enforces a schema, by using the logging platform, which currently allows schema-less data ingestion

I think it predates any of the event platform work and has never been retrofitted to use it. I'm not sure what real benefit there is to enforced schemas for a data pipeline with one producer and one consumer, but the (timestamp, feature, agent) tuple certainly could be described formally if this makes something better or easier.

So perhaps I'd consider using the existing mediawiki_api_request and having a job that transforms this data and ingests it into a Cassandra DB with a small serving layer on top.

This could be a workable solution. One downside of moving to an ETL process from Hadoop to Cassandra to produce the dataset would be making it a batch reporting system rather than a real-time reporting system. This would not make the feature useless in any sense, but it would require users to be aware of the delays introduced by the data pipeline.

If we break the inter-dependency between the ELK cluster and the search cluster we would be in better shape already, but this might not be enough.
Constraints on the search side relate to version upgrades, maintenance operations, and the sensitivity of the data:

  • all writes must be compatible with both versions (as we upgrade one DC at a time)
  • the MW extension code must be maintained when we upgrade the Elastica library
  • the indices must be monitored to make sure that the tools that purge them are properly running
  • size of the indices must be monitored to ensure they do not grow out of control
  • we should probably switch to non-dynamic templates to avoid mapping explosion
  • having sensitive traffic data like the user-agent hosted on the search cluster without going through data-engineering checks is not something I'm very comfortable with

The User-Agent part is a hard requirement of the system. I do understand that UA data is currently considered PII. This is at least part of why Special:APIFeatureUsage does not have any canned reports showing the N biggest violators. Because the Action API does not require callers to be authenticated, User-Agent is close to our only guaranteed signal for aggregating traffic by caller (IP being the other guaranteed signal). UA was chosen as the aggregation value because of https://meta.wikimedia.org/wiki/User-Agent_policy. The UA policy is not rigorously enforced, but does exist as strongly recommended advice for API consumers.

The ApiFeatureUsage extension, or more specifically the SpecialPage frontend and Action API module, ultimately just need a table with incrementing counters.

I'll analyze below:

  • How simple or rich does this dataset need to be?
  • How large would this dataset be?
  • What is the fundamental intake and write rate?
  • How do the needs compare to the current infrastructure and storage format?
  • What would it look like if we followed a "Use the platform" approach? Such an approach would offer the simplest possible stack at the lowest possible maintenance cost, and yet provide the maximum possible testability and ease of contributing and developing locally (zero config, zero dependencies, zero setup). The big question is: is such a simple setup sufficiently fast and scalable?

Status quo

  1. MediaWiki core currently submits the following to Logstash via the api-feature-usage log channel:
    • feature
    • agent
    • username (User->getName), optional.
    • clientip (IP address).
    • referer (HTTP Referer header value).
  2. This message, like all other MW debug logs, is sent to rsyslog locally on each of the ~400 MediaWiki servers.
  3. rsyslog then produces it to Kafka for buffering.
  4. Logstash then consumes it from Kafka (asynchronously), where it goes through the "apifeatureusage" Logstash filter (puppet: 50-filter-apifeatureusage.conf), which:
    • adds "type" => "api-feature-usage-sanitized" to ensure it ends up in the correct Elastic index later.
    • parses the user agent into a key value map, using the useragent Logstash plugin.
    • discards all fields except type, @timestamp, feature, and agent. This means we don't store "username", "clientip" or "referer" in Elastic today.
  5. Logstash outputs messages to Elastic, based on matching [type] == "api-feature-usage-sanitized", and creating indices named after the pattern apifeatureusage-%{+YYYY.MM.dd}.
  6. When developers interact with Special:ApiFeatureUsage, the ApiFeatureUsage extension queries Elastic. In pseudo code, the current Elastic query looks like:
select count()
from apifeatureusage-* indexes,
where feature=A,
and date > B,
and date < C,
and agent prefix D,
group by date, feature
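
For reference, the above roughly corresponds to an Elasticsearch query body like the following, expressed as a PHP array (the field names come from the Logstash filter described above; the exact query the Elastica-based extension builds may differ, so treat this purely as an illustrative sketch):

<?php
// Hypothetical sketch of the Elastic query body behind Special:ApiFeatureUsage.
// Field names (@timestamp, feature, agent) come from the Logstash filter above;
// the date range and agent prefix values are example placeholders.
$start = '2024-01-01';
$end = '2024-01-07';
$agentPrefix = 'WikipediaApp';

$queryBody = [
    'size' => 0, // only the aggregations are needed, not the matching documents
    'query' => [
        'bool' => [
            'filter' => [
                [ 'range' => [ '@timestamp' => [ 'gte' => $start, 'lte' => $end ] ] ],
                [ 'prefix' => [ 'agent' => $agentPrefix ] ],
            ],
        ],
    ],
    'aggs' => [
        'by_day' => [
            // 'interval' on older Elasticsearch, 'calendar_interval' on 7.x+
            'date_histogram' => [ 'field' => '@timestamp', 'calendar_interval' => 'day' ],
            'aggs' => [
                'by_feature' => [ 'terms' => [ 'field' => 'feature' ] ],
            ],
        ],
    ],
];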

@EBernhardson helped me on IRC in #wikimedia-search to understand the size of the current dataset. We currently store about 1.7GB per day in Elastic. This is the index size, as reported by Elastic internally:

curl https://search.svc.eqiad.wmnet:9243/_cat/indices?bytes=b | grep apifeatureusage | awk 'BEGIN { cnt = 0; sum = 0} { cnt += 1; sum += $10/1024/1024/1024} END { print sum/cnt }'

Dataset

The only database query this extension performs is a count over the tuple (date, feature, agent).

Wherein:

  • date is YYYY-MM-DD (one value per day).
  • feature is a short and well-known identifier hardcoded in the software for each deprecated API feature that we track usage of (low cardinality).
  • agent is a User-Agent string that can be matched or prefix-matched (to whole words at least), e.g. WikipediaApp and WikipediaApp/2.7.165 should match WikipediaApp/2.7.165-r-2021-06-08 (Android 4.1.2; Phone). We can probably limit searches to 64 chars to ensure reasonable performance.

What we actually store is currently significantly more than what we actually use or need.

I analyzed the data on mwlog1002, where we keep a copy of the same raw data that we submit to the Logstash/Elastic pipeline for a few days. I used /srv/mw-log/archive/api-feature-usage.log-20240101.gz as a sample.

The raw size after gzip decompression of this data is currently: 12 GB over 27 million events.

cat api-feature-usage.log-20240101.gz |  gunzip - | cut -d' ' -f9- | cut -d'{' -f1 > api-feature-usage.log-20240101.ssv
cat api-feature-usage.log-20240101.ssv | cut -d'"' -f2,10- --output-delimiter=' ' > api-feature-usage.log-20240101.index.ssv

After reducing it to only the feature and agent fields, we are left with 3.6 GB (no compression).

cat api-feature-usage.log-20240101.index.ssv | sort | uniq -c > api-feature-usage.log-20240101.uniq

After storing it as a count table of (count, feature, agent), we are left with 62,444 rows occupying about 11 MB (or 0.011 GB) of space.

Note that this is without any compression, and without any truncation or other length restriction on user agent strings. Any optimization or restriction would reduce it significantly further, given that much of the space is occupied by noise from long/fake one-off user agents.

Misc data:
It occurred to me that we haven't yet reviewed the cardinality and size of the feature field. While the overall database size already demonstrates that its cardinality or length can't be a storage concern, we do need to know it in order to draft a database schema.

cat api-feature-usage.log-20240101.ssv | cut -d' ' -f1 | sort -u

This yields 41 unique feature identifiers, where the longest identifier is 45 bytes in length ("action=query&prop=revisions&rvexpandtemplates").

$ cat api-feature-usage.log-20240101.ssv | cut -d' ' -f5- | sort -u > ua.log
$ node
node> const log = require('fs').readFileSync('ua.log', 'utf8');
node> log.split('\n').map(x=>x.length).sort((a,b)=>a-b)[Math.round(60371*0.10)]

For the user agent, the 62K rows contain 60K distinct user agent values. Indeed, as suspected, the vast majority are noise and fake user agents. After consolidating by count, the user agent lengths distribute as follows. Note that this is how the unique values distribute; the events themselves distribute differently.

  • p10: 68 chars
  • p25: 119 chars
  • p50 median: 135 chars
  • p75: 160 chars
  • p99: 236 chars
  • p99.9: 468 chars
  • p99.99: 488 chars
  • p100: 1032 chars
$ cat api-feature-usage.log-20240101.uniq | grep -E '^\s+1 ' | wc -l
31757

32K of the 62K rows would have a count of 1, further showing the presence of fake and one-off user agents.

Conclusion:

Using Jan 1st as an example, we stored 27 million events during that 24 hour window. This translates to an average of roughly 300 writes per second. Formatted as (feature, agent) pairs, we only need 62,000 rows per day, taking about 11 megabytes of space. Multiplying that by 90 days would give us perhaps ~6M rows and 1 GB of space.

This seems trivial to store in a simple MySQL table, perhaps on x1.

Concrete proposal

  1. MediaWiki core emits a hook from which the extension queues a DeferredUpdate.
  2. DeferredUpdate creates or increments the count row in the table for (date, feature, agent).
  3. The ApiFeatureUsage extension queries the table, akin to SELECT count WHERE date, feature, agent.
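
To make the shape of steps 1 and 2 concrete, here is a minimal sketch of a hook handler plus the deferred write. The hook name matches the core patch referenced at the end of this task; its parameters, the table name and the afu_* column names (matching the prefix used later in this discussion) are assumptions for illustration only.

<?php
// Hypothetical sketch only: hook parameters, table name (api_feature_usage)
// and column names (afu_day, afu_feature, afu_agent, afu_count) are assumed.
use MediaWiki\MediaWikiServices;

class ApiFeatureUsageHooks {
    /** Handler for the ApiLogFeatureUsage hook (signature assumed). */
    public static function onApiLogFeatureUsage( string $feature, string $agent ): void {
        \DeferredUpdates::addCallableUpdate( static function () use ( $feature, $agent ) {
            $dbw = MediaWikiServices::getInstance()
                ->getDBLoadBalancer()
                ->getConnection( DB_PRIMARY );
            $dbw->upsert(
                'api_feature_usage',
                [
                    'afu_day' => gmdate( 'Ymd' ),
                    'afu_feature' => $feature,
                    'afu_agent' => mb_substr( $agent, 0, 255 ),
                    'afu_count' => 1,
                ],
                [ [ 'afu_day', 'afu_feature', 'afu_agent' ] ],
                // Becomes INSERT ... ON DUPLICATE KEY UPDATE on MariaDB
                [ 'afu_count = afu_count + 1' ],
                __METHOD__
            );
        } );
    }
}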

Database schema:

  • id: auto increment int
  • date: varbinary(8)
  • feature: varbinary(64)
  • useragent: varbinary(255)
  • count: integer

Index:

  • On the fields (date, feature, useragent). Or, for improved efficiency, we could limit it to (date, feature, useragent(60)).
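
For the read side (step 3 above), the Special:ApiFeatureUsage lookup could then be a plain range + prefix select against that index. A sketch with the same assumed names:

<?php
// Hypothetical sketch of the read-side query: per-day, per-feature counts
// for a given agent prefix over a date range. Table/column names assumed.
use MediaWiki\MediaWikiServices;

$dbr = MediaWikiServices::getInstance()->getDBLoadBalancer()->getConnection( DB_REPLICA );
$agentPrefix = 'WikipediaApp'; // example value

$res = $dbr->select(
    'api_feature_usage',
    [ 'afu_day', 'afu_feature', 'hits' => 'SUM(afu_count)' ],
    [
        'afu_day >= ' . $dbr->addQuotes( '20240101' ),
        'afu_day <= ' . $dbr->addQuotes( '20240107' ),
        'afu_agent ' . $dbr->buildLike( $agentPrefix, $dbr->anyString() ),
    ],
    __METHOD__,
    [ 'GROUP BY' => [ 'afu_day', 'afu_feature' ] ]
);
foreach ( $res as $row ) {
    // $row->afu_day, $row->afu_feature, $row->hits
}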

DeferredUpdate:

  • Insert or increment rows in this table from a post-send DeferredUpdate.
  • As a potential mitigation against abuse, the first measure I would suggest, if it comes up, is to use core's PingLimiter such that a given client IP cannot record more than one new row per minute.
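
A sketch of how that limit could be wired in, assuming core's pingLimiter() with a hypothetical $wgRateLimits action name and the same assumed table/columns as above; the upsert from the earlier sketch would then only run when this returns true:

<?php
// Hypothetical sketch: decide whether a write that would create a *new*
// (day, feature, agent) row should be allowed. The 'apifeatureusage-newrow'
// action name is an assumption and would need a matching $wgRateLimits entry.
use Wikimedia\Rdbms\IDatabase;

function allowFeatureUsageWrite( IDatabase $dbw, string $day, string $feature, string $agent ): bool {
    $rowExists = (bool)$dbw->selectField(
        'api_feature_usage',
        '1',
        [ 'afu_day' => $day, 'afu_feature' => $feature, 'afu_agent' => $agent ],
        __METHOD__
    );
    if ( $rowExists ) {
        return true; // incrementing an existing counter is always allowed
    }
    // pingLimiter() returns true when the caller is *over* the limit
    return !\RequestContext::getMain()->getUser()->pingLimiter( 'apifeatureusage-newrow' );
}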

Job:

  • On form submission of Special:ApiFeatureUsage, randomly queue a job to deterministically prune old data. The job selects up to N rows ($wgUpdateRowsPerQuery) that are older than the max age, and deletes them by primary key. If there are more than X*N rows to clean up, the job re-queues itself to perform another iteration later.
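
A sketch of one pruning iteration, with the batch size coming from $wgUpdateRowsPerQuery as described above and the same assumed table/columns:

<?php
// Hypothetical sketch: delete up to $batchSize rows older than the retention
// window, by primary key. Returns true if a full batch was deleted, i.e. the
// calling job should re-queue itself for another iteration.
use Wikimedia\Rdbms\IDatabase;

function pruneApiFeatureUsage( IDatabase $dbw, int $batchSize, int $maxAgeDays ): bool {
    $cutoffDay = gmdate( 'Ymd', time() - $maxAgeDays * 86400 );
    $ids = $dbw->selectFieldValues(
        'api_feature_usage',
        'afu_id',
        [ 'afu_day < ' . $dbw->addQuotes( $cutoffDay ) ],
        __METHOD__,
        [ 'LIMIT' => $batchSize ]
    );
    if ( !$ids ) {
        return false; // nothing old enough to prune
    }
    $dbw->delete( 'api_feature_usage', [ 'afu_id' => array_map( 'intval', $ids ) ], __METHOD__ );
    return count( $ids ) === $batchSize;
}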

This would mean you can:

  • Remove the apifeatureusage Logstash filter from the rsyslog intake.
  • Remove the apifeatureusage pipeline from the Logstash cluster.
  • Remove the apifeatureusage indices from the Elastic cluster.
  • Remove the software dependency on Elastica, making ApiFeatureUsage a standalone zero-config extension that is easy to develop, test, and maintain.
  • Remove the hardware dependency from SRE, instead building on the shared platform with no dedicated maintenance or upgrades required just for this feature.

I'd rather not have the writes be cross-datacenter, tying up the working thread in the post-send stage of web requests, especially given the potential for spikes. If there were some aggregation service in the middle (batching/flushing counter updates), that would be less risky, though more complex.

Maybe this could use circular replication similar to x2? It probably could just use x2, given the modest traffic there and the similar needs. The AUTOINCREMENT column would have to be removed. The write queries would use an "INSERT ON DUPLICATE KEY UPDATE" that would naturally merge any conflicting updates from different replication sources.

Rate limiting the insertion of new rows makes sense (maybe even rate limiting of count updates). The insertion limiting could be done using an initial existence check via SELECT and local-datacenter memcached counters. Alternatively, it could entirely use memcached by keeping a per-minute key holding the set of 3-byte agent hashes used by that IP, disallowing table insertion if it has more than X agents.

I'd rather have randomized data pruning tie into the writes rather than the reads. It makes the effectiveness a bit easier to reason about. Also, it would avoid extra job classes. Using the oldest date of rows, a purge probability that ramps up as a soft deadline approaches (e.g. something like the WANCache refresh logic), and memcached ADD or mysql GET_LOCK(), the pruning can happen within the deferred updates. I also wonder if CREATE EVENT would be useful for pruning old rows.

> I'd rather not have the writes be cross-datacenter, tying up the working thread in the post-send stage of web requests, especially given the potential for spikes. If there were some aggregation service in the middle (batching/flushing counter updates), that would be less risky, though more complex.

> Maybe this could use circular replication similar to x2? It probably could just use x2, given the modest traffic there and the similar needs. The AUTOINCREMENT column would have to be removed. The write queries would use an "INSERT ON DUPLICATE KEY UPDATE" that would naturally merge any conflicting updates from different replication sources.

The circular replication is quite a snowflake and has caused issues before; I'd avoid it if possible.

> Rate limiting the insertion of new rows makes sense (maybe even rate limiting of count updates). The insertion limiting could be done using an initial existence check via SELECT and local-datacenter memcached counters. Alternatively, it could entirely use memcached by keeping a per-minute key holding the set of 3-byte agent hashes used by that IP, disallowing table insertion if it has more than X agents.

Since the exact number is not really the concern for a bot operator, you can have a much simpler setup:

  • Check memcached for a key made via hashing of the UA and other stuff
  • If it exists, flip a coin and only write to the database one in ten times (and bump the timestamp)
  • If not, write it to MariaDB and save the memcached key for 24 hours (or more), bumping the timestamp
    • Of course, you should do this either in a job or in a deferred update in the main DC.
  • When showing it to the user, just multiply the value retrieved by 10 and call it a rough estimate.

This makes sure that if a call to deprecated functionality happens, it gets written to the database, while also avoiding a firehose of writes that could bring down MariaDB and explode the binlogs.
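
A minimal sketch of that scheme, assuming a memcached-backed local-cluster cache; the 24h TTL and the 1-in-10 rate come from the bullets above, the key naming is illustrative:

<?php
// Hypothetical sketch of the proposed sampled write path: always record the
// first (day, feature, agent) hit, then only ~1 in 10 of the subsequent ones.
function shouldWriteFeatureUse( string $day, string $feature, string $agent ): bool {
    $cache = \ObjectCache::getLocalClusterInstance();
    $key = $cache->makeKey( 'apifeatureusage-seen', sha1( "$day|$feature|$agent" ) );

    $seen = $cache->get( $key ) !== false;
    // "Bump the timestamp": refresh the 24h TTL on every hit.
    $cache->set( $key, 1, 86400 );
    if ( $seen ) {
        // Already recorded recently: sample 1 in 10 writes.
        return mt_rand( 1, 10 ) === 1;
    }
    return true; // first sighting: always write it
}

On the read side, the special page would then multiply the retrieved value by 10 and label it a rough estimate, as suggested above.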

> I'd rather have randomized data pruning tie into the writes rather than the reads. It makes the effectiveness a bit easier to reason about. Also, it would avoid extra job classes. Using the oldest date of rows, a purge probability that ramps up as a soft deadline approaches (e.g. something like the WANCache refresh logic), and memcached ADD or mysql GET_LOCK(), the pruning can happen within the deferred updates. I also wonder if CREATE EVENT would be useful for pruning old rows.

Events and stored procedures work well in a small single-DB setup but not really in an HA environment with a lot of moving parts. I'd recommend against it.

> The circular replication is quite a snowflake and has caused issues before; I'd avoid it if possible.

I'm curious what those issues are, since we use it for the mainstash DB on x2. Are there tasks that mention them?

> Since the exact number is not really the concern for a bot operator, you can have a much simpler setup:

>   • Check memcached for a key made via hashing of the UA and other stuff
>   • If it exists, flip a coin and only write to the database one in ten times (and bump the timestamp)
>   • If not, write it to MariaDB and save the memcached key for 24 hours (or more), bumping the timestamp
>     • Of course, you should do this either in a job or in a deferred update in the main DC.
>   • When showing it to the user, just multiply the value retrieved by 10 and call it a rough estimate.

I think quick-and-dirty sampling like that is OK, though I don't see how that scheme would limit a flood of different UAs from an IP causing new row spam. I think the sampling would still need to be combined with the "new rows/min" rate limiting, which is easy enough to implement.

I was thinking about sample rates that decrease as the number of "hits" enters higher range tiers, but I'm unsure how to keep it reasonably statistical, presentable, and simple. The 1/10 sampling idea is basically a two-tier version of that concept. The sample rate = {1/1 for first agent/feature use per day; 1/10 for second+ use}. One way to do that would involve having the row increment delta increase (e.g. init to 1, add 10 if >= 1), or there could be some formula to scale the counters (e.g. 5 = 1 + 4 estimates 1 + 10(4) = 41). Either way would have to either risk or mitigate race conditions. For example, the query could randomly be either "INSERT 1 ON DUPLICATE KEY UPDATE afu_count=afu_count" (with p=0.9) or "INSERT 1 ON DUPLICATE KEY UPDATE afu_count=afu_count+10" (with p=0.1). Seems doable.
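
Expressed against the same assumed table and columns as the earlier sketches, those randomized queries could look roughly like:

<?php
// Hypothetical sketch of the two-tier sampled counter update described above:
// a new row always lands with afu_count = 1; an existing row is bumped by 10
// with probability 0.1 and left unchanged with probability 0.9.
use MediaWiki\MediaWikiServices;

$dbw = MediaWikiServices::getInstance()->getDBLoadBalancer()->getConnection( DB_PRIMARY );
$day = gmdate( 'Ymd' );
$feature = 'action=query&prop=revisions&rvexpandtemplates'; // example value from above
$agent = 'WikipediaApp/2.7.165'; // example value from above

$delta = ( mt_rand( 1, 10 ) === 1 ) ? 10 : 0;
$dbw->upsert(
    'api_feature_usage',
    [ 'afu_day' => $day, 'afu_feature' => $feature, 'afu_agent' => $agent, 'afu_count' => 1 ],
    [ [ 'afu_day', 'afu_feature', 'afu_agent' ] ],
    // ON DUPLICATE KEY UPDATE afu_count = afu_count + 0 (p=0.9) or + 10 (p=0.1)
    [ 'afu_count = afu_count + ' . $delta ],
    __METHOD__
);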

The special page should probably present the afu_count=1 entries with some kind of asterisk or just say something like "~1-9 per day".

If we don't use master-master replication, then some kind of sampling like this becomes a stronger requirement.

> Events and stored procedures work well in a small single-DB setup but not really in an HA environment with a lot of moving parts. I'd recommend against it.

Even for simple idempotent events like pruning some oldest X expired rows with a limit, where even some divergence would be temporary and not matter in terms of data-integrity? I think stored procedures in their *fullest* generality are a mess, and I pushed for removing much of that from our Postgres code. The pitfalls aren't obvious for a "DELETE rows with afu_day < ..." event.

In any case, I'm fine with having periodic pruning on the INSERT/UPDATE. My main preference is to not have it tied to the special page SELECT.

>> The circular replication is quite a snowflake and has caused issues before; I'd avoid it if possible.

> I'm curious what those issues are, since we use it for the mainstash DB on x2. Are there tasks that mention them?

First of all, our tooling doesn't work with it, specifically the switchover script, so all the tests need to be done manually.
I just had to do an x2 switchover to accommodate a network maintenance and it took me 1 hour, while a normal switchover (the switch itself) takes about 1 minute. It is extremely complex, as we need to gather the replication position from both masters, stop replication on the future master, and unconfigure replication on all of its replicas and on the other DC master. Then we need to reconfigure all of those while dealing with pt-heartbeat at the same time. Doing all this manually is extremely dangerous and I am sure I've made user errors along the way, but that was the fastest I could do it: 1 hour.

x2 might be good on paper, but from an administration point of view it is extremely complex, error-prone and dangerous. Thankfully the data is volatile, which makes it a bit less of an issue; if the data needed to be persistent, this would be a total no-go from my side.

Not to mention we just had a page due to the switchover.

> Even for simple idempotent events like pruning some oldest X expired rows with a limit, where even some divergence would be temporary and not matter in terms of data-integrity?

Yes. It's not just replication, there are many more pieces in our infra.

> In any case, I'm fine with having periodic pruning on the INSERT/UPDATE. My main preference is to not have it tied to the special page SELECT.

Yes, that's how other pruning of old rows works in MediaWiki (RC, CU, etc.), and for the sake of architectural simplicity it should just reuse the same concepts, if not the same code.

> I think quick-and-dirty sampling like that is OK, though I don't see how that scheme would limit a flood of different UAs from an IP causing new row spam. I think the sampling would still need to be combined with the "new rows/min" rate limiting, which is easy enough to implement.

YAGNI: if we see the issue of a flood of different UAs hitting the database, we will find a solution for it, probably via the existing ratelimit infra; no need to re-invent the wheel.

> I was thinking about sample rates that decrease as the number of "hits" enters higher range tiers, but I'm unsure how to keep it reasonably statistical, presentable, and simple. The 1/10 sampling idea is basically a two-tier version of that concept. The sample rate = {1/1 for first agent/feature use per day; 1/10 for second+ use}. One way to do that would involve having the row increment delta increase (e.g. init to 1, add 10 if >= 1), or there could be some formula to scale the counters (e.g. 5 = 1 + 4 estimates 1 + 10(4) = 41). Either way would have to either risk or mitigate race conditions. For example, the query could randomly be either "INSERT 1 ON DUPLICATE KEY UPDATE afu_count=afu_count" (with p=0.9) or "INSERT 1 ON DUPLICATE KEY UPDATE afu_count=afu_count+10" (with p=0.1). Seems doable.

That's adding way too much complexity for what doesn't need it. The user needs to know whether their bot is making a lot of deprecated API calls or not. Adding anything more is just extra complexity without any gain.

> Yes, that's how other pruning of old rows works in MediaWiki (RC, CU, etc.), and for the sake of architectural simplicity it should just reuse the same concepts, if not the same code.

I only mentioned that I preferred pruning on the write path in response to https://phabricator.wikimedia.org/T313731#9443313, which proposed doing it via jobs triggered from Special:ApiFeatureUsage.

> YAGNI: if we see the issue of a flood of different UAs hitting the database, we will find a solution for it, probably via the existing ratelimit infra; no need to re-invent the wheel.

The existing pingLimiter() is how I'd imagine that a row-insertion or row-update rate limit would work for this table.

> That's adding way too much complexity for what doesn't need it. The user needs to know whether their bot is making a lot of deprecated API calls or not. Adding anything more is just extra complexity without any gain.

I'm just describing the kind of things that implementing that kind of "first use" + sampling approach would entail. It could use MediaWikiServices::getMicroStash() for the memcached timestamp key checks and use "INSERT IGNORE" / "UPDATE" instead (to at least not error out on races). I suppose using ObjectCache::getLocalClusterInstance() would be faster, though it has more race conditions. In any case, it's similar complexity either way, and seems manageable.
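
A sketch of that INSERT IGNORE / UPDATE variant, reusing the assumed table and column names from the sketches earlier in this task:

<?php
// Hypothetical sketch: tolerate races between concurrent requests without
// relying on upsert; a losing racer simply falls through to the UPDATE.
use MediaWiki\MediaWikiServices;

$dbw = MediaWikiServices::getInstance()->getDBLoadBalancer()->getConnection( DB_PRIMARY );
$row = [
    'afu_day' => gmdate( 'Ymd' ),
    'afu_feature' => 'action=query&prop=revisions&rvexpandtemplates', // example value
    'afu_agent' => 'WikipediaApp/2.7.165', // example value
    'afu_count' => 1,
];
$dbw->insert( 'api_feature_usage', $row, __METHOD__, [ 'IGNORE' ] );
if ( !$dbw->affectedRows() ) {
    // Row already existed: bump the counter instead.
    $dbw->update(
        'api_feature_usage',
        [ 'afu_count = afu_count + 1' ],
        [
            'afu_day' => $row['afu_day'],
            'afu_feature' => $row['afu_feature'],
            'afu_agent' => $row['afu_agent'],
        ],
        __METHOD__
    );
}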

daniel triaged this task as Medium priority. Feb 29 2024, 4:46 PM
FJoseph-WMF changed the task status from Open to In Progress. Apr 9 2024, 1:55 PM
FJoseph-WMF assigned this task to aaron.

Change #1020383 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] [WIP] Add ApiLogFeatureUsage hook

https://gerrit.wikimedia.org/r/1020383

Change #1020385 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/extensions/ApiFeatureUsage@master] WIP: add option to use a SQL table instead of ElasticSearch

https://gerrit.wikimedia.org/r/1020385

Change #1020383 merged by jenkins-bot:

[mediawiki/core@master] Add ApiLogFeatureUsage hook

https://gerrit.wikimedia.org/r/1020383