
Special:ContentTranslationStats is slow and getting crowded
Open, Medium, Public

Description

The Special:ContentTranslationStats page is getting increasingly slow because of the huge amount of historic data it needs to query and fetch. As per the current design, it displays data since the launch of CX in 2015. Occasional spikes in translation activity have also made the weekly translations display less useful.

As of 22 Dec 2022, the whole page took 22 seconds to load for me.


At the same time, all of this data is important and interesting to the team. Real-time trends in translations per language are important for many aspects of decision making. Cumulative translation and translator stats are also helpful.

Some ideas:

  1. Only fetch and display data for a recent period, for example the last year
  2. Remove this special page completely and use dashboards that are not part of the MediaWiki extension, for example a Superset dashboard or any other visualization that allows date range and language selection
  3. Improve Special:CXStats to support date range and language selection

I won't recommend more work on Special:CXStats because a MW extension and special page is not a good place to display such information (historically, we had reasons to have this special page). There are analytics dashboards for this purpose. Moving this outside of an extension that is deployed on almost all wikis makes its maintenance and design constraints more manageable.
There are many open tickets to improve this page, all requiring development effort on enhancing data visualization and processing on top of the CX MediaWiki extension. It is better to reuse our existing analytics infrastructure for these needs instead of building it into a MediaWiki extension.

Event Timeline

Thanks @santhosh for creating this task. I agree that reusing existing analytics infrastructure instead of a MW extension would be more manageable if possible.

I'm currently investigating some possible options that could be used as a replacement for Special:CXStats. Some initial notes are summarized below.

Private Superset Instance

  • Many of the stats tracked on Special:CXStats could be moved to our existing Content Translation Unified dashboard. This dashboard already tracks some similar metrics, such as cumulative translations, and this would be an opportunity to remove the current duplication of dashboards by moving all metrics to a single place for tracking.
  • The Superset dashboard is only accessible internally to WMF. We could provide a snapshot or summary of data trends to the community via a MediaWiki page, but they would not be able to filter and explore the data themselves.
  • Currently, the superset dashboard relies on data from edit_hourly, which is only available monthly. To obtain real-time updates, we would need to create a job to aggregate CX metrics for dashboarding. There's currently a task open to do this: T287306

Public Superset Instance

  • There's a new public Superset instance currently being tested on Wikimedia Cloud Services. See the wikitech page for more info.
  • This would have many of the same features as the private instance so we could potentially recreate the existing Unified Content Translation Dashboard using the public instance.
  • Currently this public instance only includes access to data on redacted replicas (similar to Quarry), which does not include translations data.
  • We could investigate getting a version of CX translations data into the redacted replicas using this process. This would require some engineering resources and time, but if feasible it would provide an opportunity to have a single dashboard that we could use internally and share with the community.

cc @Pginer-WMF

@MNeisler Thanks for listing these options. There are a few "must have" requirements from my perspective:

  1. We should be able to see near real-time data related to translations. This is crucial for detecting sudden spikes like the ones we have observed in the past, and for watching those wikis closely. Communities running campaigns would also love to see the stats in real time. Our current dashboard shows real-time data.
  2. The data/graph/visualization should be publicly available for any community members.
  3. Currently, the graphs we show in Special:CXStats are not the same on every wiki. For example, eswiki shows translation stats for all languages plus stats for the Spanish Wikipedia; similarly, hiwiki shows global stats plus hiwiki stats. If we replace this with a central stats system, say a public Superset, it is important that we can filter to get wiki- or language-specific stats.
  4. Filtering by date would be required, because the lack of that feature has made our current dashboard very crowded with historic data going back to 2015

Creating custom graphs or visualizations would be nice. But if that is difficult to provide in a public Superset, exposing the data as JSON would allow people to create their own visualizations on other platforms.

Currently, we have different APIs in CX that expose every part of the database, and I don't think there is any private data in our database tables to redact.

There is a feature in Superset that lets us embed any dashboard in any web page. That seems like the easiest approach here. https://github.com/apache/superset/tree/master/superset-embedded-sdk

Requesting @KCVelaga to investigate the feasibility of this option.

Update:

I had a conversation with the Cloud Services team about the feasibility of this for https://superset.wmcloud.org/. I forwarded the original response in the Slack thread, but the summary is: the Kubernetes installation of Superset is very restricted; any customisation requires rebuilding all of the images, and doing so again with every upgrade. As the embedded SDK is installed via npm, it falls into that category. It would be possible if we moved to VMs, but there is no current plan to do that.


Tagging @BTullis to see if this is a possibility for the internal instance, superset.wikimedia.org. Hi Ben, we would like to know if the Superset Embedded SDK is an option for the internal instance of Superset, especially when the frame would be embedded in a public page.

As a band-aid, I suggest we add a layer of caching to the slow queries made by this page. This will give us more time to think about solutions while keeping the page working and reducing load on the databases. Currently there is no caching at all. We can use the MainWANObjectCache service to cache the *processed* query results (numbers grouped by week/month), which should be relatively small. We can give this cache a long expiry, like a day or a week. We can augment the cached data by overlaying it with data from a live query for the past week, so that the recent days stay accurate, while any changes affecting historical data (which should be rare) could be slightly off.
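A minimal sketch of what that could look like inside the special page class; the queryTranslationTrend() helper, cache key, and TTL are hypothetical/illustrative, not a final design:

```
use MediaWiki\MediaWikiServices;

// Sketch only: cache the processed weekly totals for a day, then overlay the
// most recent buckets with a live query so current numbers stay accurate.
$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();

$historical = $cache->getWithSetCallback(
	// Per-wiki cache key; bump the suffix if the cached data shape changes.
	$cache->makeKey( 'cxstats', 'translation-trend', 'v1' ),
	$cache::TTL_DAY,
	function () {
		// Hypothetical helper wrapping the existing slow "group by week" query.
		return $this->queryTranslationTrend();
	}
);

// Same hypothetical helper, restricted to the last 7 days, run live on every view.
$recent = $this->queryTranslationTrend( wfTimestamp( TS_MW, time() - 7 * 86400 ) );

// Live recent buckets overwrite the possibly stale cached ones.
$trend = array_replace( $historical, $recent );
```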

As an additional step, preventative cache regeneration could be applied through the job queue in case the initial cache-filling requests get too slow for web requests, but we can leave that decision for later.
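For illustration, such a background refresh job could look roughly like the sketch below; the job name, class, and recomputeAndCache() helper are hypothetical, not existing code:

```
use MediaWiki\MediaWikiServices;

/**
 * Hypothetical job that regenerates the stats cache in the background, so web
 * requests never pay for the slow queries on a cache miss.
 */
class CxStatsCacheRefreshJob extends Job {
	public function __construct( array $params = [] ) {
		parent::__construct( 'cxStatsCacheRefresh', $params );
		// Collapse duplicate refresh requests into a single job run.
		$this->removeDuplicates = true;
	}

	public function run() {
		// Recompute the aggregated numbers and write them back to the WAN
		// cache with a fresh TTL (hypothetical helper, not existing code).
		CxStatsCache::recomputeAndCache();
		return true;
	}
}

// Queued periodically, or lazily from a page view, without blocking the request:
MediaWikiServices::getInstance()->getJobQueueGroup()->push( new CxStatsCacheRefreshJob() );
```

The job type would also need to be registered in the extension's extension.json (JobClasses) before it could be queued.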

Summary of the Slack discussion: it seems like the best option we have is http://superset.wmcloud.org/, which requires a CentralAuth login; similar to Quarry, anybody can register instantly and view the dashboard. Superset has some options for anonymous viewing; we will need to check with the Cloud Services team about that. The first step will be to get translations data into the wiki replicas. This can probably be prioritized for the next quarter.

Change #1036572 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[mediawiki/extensions/ContentTranslation@master] Add simple caching for slow stats queries

https://gerrit.wikimedia.org/r/1036572

Pginer-WMF triaged this task as Medium priority. Wed, May 29, 7:20 PM
Pginer-WMF moved this task from Needs Triage to Bugs on the ContentTranslation board.

Change #1036572 merged by jenkins-bot:

[mediawiki/extensions/ContentTranslation@master] Add simple caching for slow stats queries

https://gerrit.wikimedia.org/r/1036572

Change #1041413 had a related patch set uploaded (by Nikerabbit; author: Nikerabbit):

[mediawiki/extensions/ContentTranslation@master] Combine two slow queries into one

https://gerrit.wikimedia.org/r/1041413