
Keep edit counts in separate database table and update on edit
Open, Low, Public

Description

As part of the discussion of T235572, @jcrespo pointed out that making big database queries to get counts of edits is inefficient. Instead, we should store the counts in a separate database table, and update that table when a new revision is created.

We've currently got API endpoints for the following counts:

  • Bot-authored revisions
  • Anonymous-authored revisions
  • Minor revisions
  • Total revisions

A revision_counts table could contain the page ID as the primary key, and each of these counts as a column (bot_revisions, anon_revisions, minor_revisions, total_revisions).
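For concreteness, a minimal sketch of that table in MySQL/MariaDB syntax (the table and column names are just the working names used above, not a settled schema):

  CREATE TABLE revision_counts (
    -- One row per page, so the table is bounded by the number of pages.
    page_id INT UNSIGNED NOT NULL PRIMARY KEY,
    bot_revisions INT UNSIGNED NOT NULL DEFAULT 0,
    anon_revisions INT UNSIGNED NOT NULL DEFAULT 0,
    minor_revisions INT UNSIGNED NOT NULL DEFAULT 0,
    total_revisions INT UNSIGNED NOT NULL DEFAULT 0
  );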

When a page is saved and a new revision is created, we should increment the total edits for that page, and also the applicable edits (for example, if it's a minor revision, increment the minor revisions count).
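The save path would then be a single-row update; which of the per-type columns gets bumped depends on the properties of the new revision. A sketch, reusing the hypothetical schema above:

  -- New minor revision by a logged-in, non-bot user:
  UPDATE revision_counts
  SET total_revisions = total_revisions + 1,
      minor_revisions = minor_revisions + 1
  WHERE page_id = 12345;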

To backfill counts that don't exist, we can do the necessary queries to determine the counts either at edit time or at API call time. Since edit volume is much, much bigger than the expected API traffic, it probably makes sense to do it at API call time, and only increment at edit time if the revision_counts row for the page already exists.
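A sketch of that lazy backfill, assuming the post-actor-migration core schema (revision.rev_actor, actor.actor_user, user_groups.ug_group) and treating "bot" as current membership in the bot group; all of these details are open to change:

  -- Run once per page, only when no revision_counts row exists yet.
  -- IGNORE guards against a concurrent backfill of the same page.
  INSERT IGNORE INTO revision_counts
    (page_id, bot_revisions, anon_revisions, minor_revisions, total_revisions)
  SELECT
    rev_page,
    SUM(actor_user IN (SELECT ug_user
                       FROM user_groups
                       WHERE ug_group = 'bot')),  -- bot-authored
    SUM(actor_user IS NULL),                      -- anonymous-authored
    SUM(rev_minor_edit),                          -- minor
    COUNT(*)                                      -- total
  FROM revision
  JOIN actor ON actor_id = rev_actor
  WHERE rev_page = 12345
  GROUP BY rev_page;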

We'll also need code to decrement the counts when a revision is deleted, and increment the counts when a revision is undeleted.
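The deletion path is the mirror image: decrement only the columns the removed revision contributed to, e.g. for a minor edit by an anonymous user:

  UPDATE revision_counts
  SET total_revisions = total_revisions - 1,
      minor_revisions = minor_revisions - 1,
      anon_revisions = anon_revisions - 1
  WHERE page_id = 12345;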

Event Timeline

A few thoughts for (eventual) discussion:

  • Do we want this to be more generic than revision_counts? Do we want a more generic approach to statistics (and is there any existing statistics storage we could leverage)?
  • A column for each count type could lead to a very wide table if the number of counts continues to grow.
  • What is the expected eventual size of this table once it backfills, and does that cause us any concern? (I realize that by using the page_id as the primary key, it is bounded by the number of pages.)
  • Is having this in mediawiki core worthwhile for mediawiki installations in general, or are we working around an issue specific to WMF's production data size? If this is specific to WMF, should we consider a caching/storage mechanism outside core (which opens up storage possibilities other than a relational db)? The API, of course, would have to work on non-WMF wikis, but perhaps could just do a query in those cases.
  • Can we get away with backfilling via API calls? That'd still involve some expensive queries, but maybe on a once-per-page basis it'd be okay? That's probably a DBA question.

I'm not suggesting we figure all that out right now. I just wanted to record those thoughts for when we eventually do talk about it.

Okay, one more thought: we might be able to backfill the pages with the largest number of revisions asynchronously before we activate the new system, so that the real-time backfill stays acceptable. This concern, and the potential need to remediate it, might affect wikis other than enwiki; I'm not sure what the data sizes on the various other wikis look like.
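Those candidate pages could be identified offline, against an analytics replica rather than production, with something along these lines (the threshold is arbitrary):

  SELECT rev_page, COUNT(*) AS revision_count
  FROM revision
  GROUP BY rev_page
  HAVING COUNT(*) > 100000
  ORDER BY revision_count DESC;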

Is having this in mediawiki core worthwhile for mediawiki installations in general, or are we working around an issue specific to WMF's production data size? If this is specific to WMF, should we consider a caching/storage mechanism outside core (which opens up storage possibilities other than a relational db)? The API, of course, would have to work on non-WMF wikis, but perhaps could just do a query in those cases.

Sorry, I should've thought about this way earlier. We should talk to Analytics, and @JAllemandou in particular, regarding what's possible with Druid: they already have an edits dataset and have put very significant effort into cleaning it up (history is not as easy as we assume, so our counts might be quite off). They already provide endpoints for edit and editor counts, per month/year for selected time periods. I'm wondering if we could piggy-back on their infrastructure to get what we want here?

We could keep the code we have for third parties and consult Druid for WMF.

Do we want this to be more generic than revision_counts? Do we want a more generic approach to statistics (and is there any existing statistics storage we could leverage)?

Just from some random code lookups: there is a SiteStats class and table. The table contains one row with all of the site statistics (total number of pages, number of edits, etc.). It's updated via deferred updates in multiple places, including from some extensions (Flow, Translate, CentralAuth, etc.). I guess we could take this as an opportunity to generalize that and convert it into a MediaWiki service.
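For context, the write side of SiteStats boils down to bumping columns on that single row from a deferred update; conceptually something like this (column names from the current site_stats schema, shown only to illustrate the single-row design):

  -- One global row of counters, keyed by ss_row_id = 1:
  UPDATE site_stats
  SET ss_total_edits = ss_total_edits + 1,
      ss_total_pages = ss_total_pages + 1
  WHERE ss_row_id = 1;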

I'm wondering if we could piggy-back on their infrastructure to get what we want here?

SRE tip: Please note that the analytics and production networks are separate logically, physically, and in ownership. That doesn't mean you are wrong about potentially leveraging it, but it could imply duplicating infrastructure in production to resource it and handle the load adequately (millions of requests per second vs. thousands). Also, someone would have to be responsible for maintaining it there (would Platform be willing to maintain it? Or Platform + Analytics?). It is my understanding that analytics services have a much lower SLA due to even more constrained resources (based on my understanding and the "experimental" API claims, but I may be wrong), and you cannot make a service with a high SLA depend on another with a lower SLA.

Note also that analytics services (and thus the technology) are generally okay with off-by-one errors and eventual consistency (e.g. they can be anywhere from a few seconds to a month delayed), while production counters are generally more constrained because they tend to be "more real time" and closer to canonical data (with exceptions due to performance). For example, whether a page has 15M or 15M+1 views is not that relevant when the measuring device already has intrinsic errors. This is not a criticism, just an attempt to lay out the original scope of analytics vs. production, so we understand each environment's limitations and original purpose and don't go out of scope.

Ideally these would be bound to the page information, assuming there are only 1 or 2 counters. If one intends to have an arbitrarily large number of counters in the future that are not normally used, then a separate table would indeed be ideal. Note there is already a page_properties table that allows arbitrary properties for pages, although at the moment that is only supposed to be used for special properties not found everywhere (locking). Without knowing much, a dedicated page_counters table could probably do the trick; one could even use it only for pages with, e.g., more than 500 edits, and count in real time for smaller pages (?). Again, there are lots of design decisions to think about, depending on the original API expectations (consistency, update expectancy, accuracy, latency, volume).
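Purely as a sketch (all names hypothetical), such a page_counters table could use a narrow key/value layout, one row per page/counter pair, instead of one wide column per counter:

  CREATE TABLE page_counters (
    pc_page INT UNSIGNED NOT NULL,      -- FK to page.page_id
    pc_name VARBINARY(64) NOT NULL,     -- e.g. 'revisions.minor'
    pc_value BIGINT UNSIGNED NOT NULL DEFAULT 0,
    PRIMARY KEY (pc_page, pc_name)
  );

  -- An increment then touches exactly one row:
  UPDATE page_counters
  SET pc_value = pc_value + 1
  WHERE pc_page = 12345 AND pc_name = 'revisions.total';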

@jcrespo pointed out that making big database queries to get counts of edits is inefficient

One last correction: making big database queries to get counts of edits is inefficient if an index hasn't been created in advance/designed for the new functionality. Speaking quickly, it is usually GROUP BYs and multi-column/fuzzy filtering that MySQL is not great at, and either an alternative analytics engine/column store (e.g. Druid) or a search engine (Elastic) is preferred in those cases. However, please be aware of the overhead of maintaining multiple storage backends when the requirements don't demand it.
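To illustrate the distinction with the queries at hand, assuming the existing rev_page-leading indexes on the revision table:

  -- Cheap: can be answered from the (rev_page, rev_id) index alone
  SELECT COUNT(*) FROM revision WHERE rev_page = 12345;

  -- Not covered: rev_minor_edit is not part of that index, so every revision
  -- row of the page has to be examined, which hurts on pages with millions
  -- of revisions
  SELECT COUNT(*)
  FROM revision
  WHERE rev_page = 12345 AND rev_minor_edit = 1;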

Thank you for the comments, @jcrespo.

Please note that the analytics and production networks are separate logically, physically, and in ownership.

This part already lives in the production network. AQS is deliberately placed in the prod network, is already exposed publicly via an API through RESTBase, and is already hit by production apps. All the counts are asynchronously calculated and exported into a basically static data store, so in theory it should scale with no problems. The only difference here vs. what is already being done is that it would be proxied not by RESTBase, but by MW core.

Ah, cool. But please note the other scope/SLA issues I mentioned (potentially) still stand.

T237430: Add object caching and PoolCounter to minor edit count endpoint proposes adding object caching and PoolCounter, at least for the minor edit count endpoint. We should consider our overall plan for the counts before implementing either that task or this one, so that we don't do unnecessary work.

The main question I have around using analytics infra for serving page counts is the delay in updates. We update the data once every month, so for many pages the data will drift. Depending on the purpose of the count, maybe that's not critical, but I can't speak to that. If we decide to try to use Druid, let's discuss this in a meeting :)

Yes, that was my point: not a criticism of the great work you do, but the real-time-but-limited aspect vs. the delayed-but-more-featureful scope.