Error
- mwversion: 1.44.0-wmf.20
- reqId: a77f11f1-b24a-9b0a-a5f0-8b89ad8d3cba
- Find reqId in Logstash
Expectation (readQueryRows <= 10000) … not met (actual: 12307) in …
SELECT rt_page FROM `revtag` JOIN `page` ON ((rt_page = page_id) AND (rt_revision = page_latest) AND rt_type IN ('X') ) GROUP BY rt_page
Impact
When simply viewing https://meta.wikimedia.org/wiki/Wikimedia_servers, the Translate extension sometimes fetches 12,307 rows from its revtag database table during the synchronous portion of a GET request.
This is above the generous threshold of 10K rows and thus triggers a DBPerformance warning in Logstash.
The above is a sample from when both getWithSet calls are a cache miss.
Notes
See also " Design for cache miss" at https://wikitech.wikimedia.org/wiki/MediaWiki_Engineering/Guides/Backend_performance_practices.
Depending on the usage pattern for this data, there may be a more scalable solution.
The following are my assumptions, which I am asking Language and Product Localization to confirm/reject:
- on wikis where the Translate extension is installed, the TranslatablePage::isSourcePage function is called on most or all page views.
- the TranslatablePage::isSourcePage function needs this dataset on most or all such calls.
- the TranslatablePage::isSourcePage function is generally called only once, or only for one specific page, on a given pageview (i.e. it is not called in a loop with 20 different pages).
- the TranslatablePage::isSourcePage function requires information from the revtag table. That is, any page could be considered a source page. There is nothing we can guess/compute in PHP from the Title object, PageProps, or LinkCache information. Instead, the only way to know is to check whether the page has an entry in this table.
Exploring some naive ideas:
- Query just this one row, with a dedicated cache key.
- Downside: on the long tail, this would usually cache-miss, and thus add 1 extra database query to each pageview, T347123. Currently the query is avoided on most requests at the cost of 1) a massive query in the critical path on a miss, and 2) a ton of bandwidth and memory, since it loads the entire revtag dataset into memory, T366455.
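For the dedicated-cache-key idea above, a rough sketch of what the lookup could look like (the cache key name, TTL, and exact query shape are illustrative assumptions, not existing Translate code; $wanCache is assumed to be a WANObjectCache and $dbr a replica database handle):

```php
// Sketch only: per-page lookup behind a dedicated WANObjectCache key.
$isSource = (bool)$wanCache->getWithSetCallback(
	$wanCache->makeKey( 'translate-is-source-page', $pageId ),
	WANObjectCache::TTL_DAY,
	static function () use ( $dbr, $pageId ) {
		// One indexed row for this page instead of the whole revtag table.
		$rtPage = $dbr->newSelectQueryBuilder()
			->select( 'rt_page' )
			->from( 'revtag' )
			->join( 'page', null, 'rt_page = page_id' )
			->where( [
				'rt_page' => $pageId,
				'rt_type' => 'X', // same tag type as in the logged query above
				'rt_revision = page_latest',
			] )
			->caller( __METHOD__ )
			->fetchField();

		// Return an int rather than a bool: WANObjectCache treats a literal
		// `false` callback result as "no value" and would not cache it.
		return $rtPage !== false ? 1 : 0;
	}
);
```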
- Consider storing this information as a page property instead.
- This means it can be pre-fetched for the current page in a batch and would thus already be in memory. This data is fetched on page views regardless and thus would not cost extra. (This assumes isSourcePage is mostly called for the current page.)
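A minimal sketch of that page-property variant, assuming a new property is written during parsing and read back through the PageProps service (the 'translate-is-source-page' property name is made up here):

```php
use MediaWiki\MediaWikiServices;

// Writing side: when the latest revision of a marked source page is parsed,
// record a (hypothetical) page property on its ParserOutput.
$parserOutput->setPageProperty( 'translate-is-source-page', '1' );

// Reading side: page props can be batch-loaded and, per the assumption above,
// are fetched for the current page on page views anyway, so this lookup is
// typically an in-memory hit rather than an extra query.
$pageProps = MediaWikiServices::getInstance()->getPageProps();
$isSource = $pageProps->getProperties( $title, 'translate-is-source-page' ) !== [];
```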
- Consider caching this information inside the ParserOutput.
- This means the Translate onBeforePageDisplay hook can consult that in memory, as the page view will have already computed or retrieved this from the ParserCache. (This assumes isSourcePage is mostly called for the current page.)
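Similarly, a sketch of the ParserOutput variant, assuming the flag is stored as extension data while parsing and read back when the parsed output is attached to the page view (the key name is illustrative):

```php
// While the source page is parsed (from whatever hook Translate already uses):
$parserOutput->setExtensionData( 'translate-is-source-page', true );

// On the page view, the same ParserOutput (freshly parsed or loaded from
// ParserCache) is handed to e.g. the OutputPageParserOutput hook, so the
// flag can be read back without any database query:
$isSource = (bool)$parserOutput->getExtensionData( 'translate-is-source-page' );
```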
- Consider a deterministic split such that only one slice is needed.
- This is different from the ideas in T366455, which depended on a source-side split where all slices are needed by all callers, and which was reverted as equal to or worse than the status quo. If the data is split such that the caller can be satisfied with a single slice at both the cache and DB layer, then it would make a big difference. Mapping a page title or page ID to a small distributed set is fairly easy (e.g. mod 2 or mod 3 would already cut MySQL rows, network bandwidth, and PHP memory by 50-60%, bias notwithstanding; or e.g. the first hex character of an MD5 hash of the title or ID for a ~90% cut to 1/16th).
- The difficulty will be in how to formulate this as a performant database query on the revtag table.
- One way to get there would be to take the query out of the critical path entirely and use a JobQueue job instead. For example, you could have a job that queries this table (in jobs we tolerate reads up to 100K rows, and either way the job could paginate since it isn't blocking user latency). The job would then split the result into e.g. 10-20 different keys in the MainStash DB (instead of Memcached, so that it is more strongly persisted). The function could then be rewritten as LocalServerCache->getWithSet(one slice), fetching 1 pre-computed slice via MainStash. And if the MainStash row is missing, fall back to a cheap 1-row query on revtag, and queue a job to compute MainStash for next time. A rough sketch of this flow follows below.
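Roughly, under the assumption of 16 modulo-based slices and a hypothetical job name, that flow could look like this (key names, the TTL, and the job are all illustrative, not existing code):

```php
use MediaWiki\MediaWikiServices;

$services = MediaWikiServices::getInstance();
$srvCache = $services->getLocalServerObjectCache();
$stash = $services->getMainObjectStash();

// Deterministic split: a given page always maps to the same slice,
// so only 1/16th of the dataset is ever needed for one lookup.
$slice = $pageId % 16;

$sliceData = $srvCache->getWithSetCallback(
	$srvCache->makeKey( 'translate-source-pages', $slice ),
	300, // 5 minutes in the local server cache
	static function () use ( $stash, $slice ) {
		// The slice is pre-computed by a job and stored in MainStash;
		// `false` simply means the job has not populated it yet.
		return $stash->get( $stash->makeKey( 'translate-source-pages', $slice ) );
	}
);

if ( is_array( $sliceData ) ) {
	$isSource = in_array( $pageId, $sliceData, true );
} else {
	// MainStash slice missing: fall back to a cheap 1-row revtag query
	// (as in the first sketch) and queue a hypothetical job that re-runs
	// the full query out of band and rewrites all MainStash slices.
	$isSource = false; // placeholder for the 1-row fallback lookup
	$services->getJobQueueGroup()->lazyPush(
		new JobSpecification( 'translateRebuildSourcePageSlices', [] )
	);
}
```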
- If there is a latency-sensitive use case for answering isSourcePage for many different pages in the same web request, then it may be worth providing a batch lookup method for this purpose. Assuming that this will not be on article pageviews, it might not even warrant caching, since a single batch query for 10-20 random-access titles is likely fine and not a notable addition to the queries that use case already makes.
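If such a batch method were added, the query could stay close to the logged one but constrained to the requested pages. A sketch (the helper name and signature are made up for illustration):

```php
use Wikimedia\Rdbms\IReadableDatabase;

/**
 * Hypothetical batch helper: which of the given pages are translatable
 * source pages? Returns the subset of $pageIds that have a matching
 * revtag row for their latest revision.
 */
function findSourcePages( IReadableDatabase $dbr, array $pageIds ): array {
	if ( $pageIds === [] ) {
		return [];
	}
	$ids = $dbr->newSelectQueryBuilder()
		->select( 'rt_page' )
		->distinct() // stands in for the GROUP BY in the logged query
		->from( 'revtag' )
		->join( 'page', null, 'rt_page = page_id' )
		->where( [
			'rt_page' => $pageIds, // restrict to the requested pages
			'rt_type' => 'X', // same tag type as in the logged query above
			'rt_revision = page_latest',
		] )
		->caller( __METHOD__ )
		->fetchFieldValues();

	return array_map( 'intval', $ids );
}
```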
See also: