Page MenuHomePhabricator

Add datetime versions of timestamp fields to Wikireplica databases
Closed, DeclinedPublicFeature

Description

Feature summary (what you would like to be able to do and where):
For timestamp fields of MediaWiki tables that are replicated into Wikireplica databases, add a column that shows the timestamp in datetime format. For example, we could have an rc_timestamp and an rc_timestamp_datetime field for the recentchanges table.

Use case(s) (list the steps that you performed to discover that problem, and describe the actual underlying problem which you want to solve. Do not describe only a solution):
Wikireplica databases are frequently used for analytical queries. These queries often include a component in which a data or time comparison or aggregation is performed. Examples include counting edits by a user or on a page after a certain date, or counting the number of actions on the wiki broken down by the hours of the day, etc.

Benefits (why should this be implemented?):

  • Instead of wasting compute on casting the timestamp string into datetime fields many times across many tools and queries, we can do it once at the time data is added or updated on Wikireplica databases
  • Features such as timezone conversion or date/time diff can be easily used via SQL.
  • Specific indexes can be created on the datetime field to accelerate queries that use it for joins or filters.

Downsides (what are the barriers to implementation?)
There will be a difference between the schema of the tables on Wikireplica and on MediaWiki itself. This already happens to some degree, but with this proposal it becomes more common.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Volans triaged this task as Medium priority.Jan 12 2026, 11:13 AM
Volans subscribed.

Adding Data-Persistence and Data-Engineering for visibility and feasibility feedbacks.

I believe at the moment the only schema differences are indexes but for columns these would become a lot more problematic. Starting with the fact that we use RBR on the wikireplicas, which means we'd need to add this column to their master (sanitarium) and to their master's master (which is a production database serving MW traffic). Otherwise replication would break when inserting data from production via the replication thread.
Having a dedicated replica with different columns from MW schemas is a no-go unfortunately.

Maybe you can explore virtual columns? https://mariadb.com/docs/server/reference/sql-statements/data-definition/create/generated-columns

FWIW, having non-string datetime types would be quite helpful for Data Platform usages as well. T372677: Figure a performant way to read all data from revision table via Spark

normally, we would be able to use, say, 64 executors and split the revision table to be read by partitioning on the rev_timestamp column
[...]
This Spark mechanism only works with TIMESTAMPs, DATEs and the family of INTEGERs, because Spark needs to know how to define the partition boundaries automatically

There will be a difference between the schema of the tables on Wikireplica and on MediaWiki itself

I wonder if there has ever been a discussion of just fixing this in MediaWiki itself? Sounds like a big project but maybe there is a way?

Virtual columns are calculated at runtime, so they don't provide performance improvements and cannot be indexed.

I wonder if there has ever been a discussion of just fixing this in MediaWiki itself? Sounds like a big project but maybe there is a way?

I think the choice of a string timestamp instead of an actual DATETIME field in the DB was deliberate when MediaWiki code was first written; the goal was to not be at the mercy of the different RDBMS providers' implementation of datetime and maximize the number of database engines on which MediaWiki can run. This may or may not be relevant today (over the last 23 years since MediaWiki started, RDBMS engines have come a significant way in conformity to standards like ANSI-SQL though there are still major obstacles), but I agree it is going to be such a major change to the code that it would probably be a no-go. Especially given that the value of searching by timestamps is really significant for analytical use cases, but not so for transactional use cases. MediaWiki's prod DBs are designed for OLTP, not OLAP.

The real solution here is to have OLAP databases provided on Data-Services which are not just a close replica of production databases, but a near-real-time version with data models optimized for analytics.

The real solution here is to have OLAP databases provided on Data-Services which are not just a close replica of production databases, but a near-real-time version with data models optimized for analytics.

That would be T215858: Plan a replacement for wiki replicas that is better suited to typical OLAP use cases than the MediaWiki OLTP schema.

Yeah, to me this feels a bit of x/y problem. A proper solution would be having an OLAP infra which would also include cross-wiki queries and historical queries (going back to snapshots in time) and this too.

+1 to Data-Services OLAP infra.

BTW, We started a project in 2018 to do this, but canned it due to lack of operational (monitoring, alerting, etc.) support in Cloud Services. T204950: Public Edit Data Lake: Mediawiki history snapshots available in SQL data store to cloud (labs) users

At the time, we did not consider hosting the infra in WMF prod and exposing it to Cloud IPs via firewall rules (this is how Cloud Elastic works). We were misinformed at the time and thought this was not allowed for security reasons.

We know generally how to do this, it would only take WMF priority to do so (which given org goals, may be unlikely any time soon :/ )

CC @Milimetric

taavi subscribed.

OLAP infrastructure is tracked elsewhere as indicated above. It seems like there isn't much interest in adding this amount complexity to the existing replicas infrastructure, so declining.

Suggestion: If OLAP infra in Cloud Services is something a significant portion of the developer community is excited about, it may be more easy to prioritize via WMF's Community WishList.

It is very difficult to convince regular managers of the value of this kind of stuff, because no one inside of WMF is explicitly asking for it. If community was loud enough about it, we might have more luck! :)

I'll post this on the OLAP infra ticket too.