
[SPIKE] Investigate how Wikilink could handle different DB providers [8HRS]
Closed, Resolved · Public · Spike

Description

Django supports the use of multiple databases in one project. The objective of this spike is to investigate whether using multiple databases for different Wikilink tables would improve performance, and how much work it would take to implement/migrate our project. Essentially, this is a cost/benefit analysis.

We have some prior art in a proof-of-concept project that uses a separate DB for facts tables, e.g.:

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.sqlite3",
        "NAME": BASE_DIR / "data/django.sqlite3",
    },
    "facts": {
        "ENGINE": "django.db.backends.sqlite3",
        "NAME": BASE_DIR / "data/facts.sqlite3",
        "OPTIONS": {
            "init_command": "PRAGMA journal_mode=wal;",
        },
    },
}

https://github.com/jsnshrmn/mw_revscore/blob/60505cafb762ff868d93e8ffe9a67ec1c58a119b/project/settings.py#L111-L123

In the example, both databases use the same engine, but this is not a requirement.
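For instance, a mixed-engine variant of the settings above might keep the default database on MariaDB while moving the facts tables to SQLite. This is a hypothetical sketch; the engine, host, and credential values are placeholders, not the project's actual configuration:

```python
DATABASES = {
    "default": {
        # MariaDB is served by Django's mysql backend.
        "ENGINE": "django.db.backends.mysql",
        "NAME": "wikilink",      # placeholder database name
        "HOST": "127.0.0.1",     # placeholder host
        "USER": "wikilink",      # placeholder credentials
        "PASSWORD": "change-me",
    },
    "facts": {
        "ENGINE": "django.db.backends.sqlite3",
        "NAME": BASE_DIR / "data/facts.sqlite3",
        "OPTIONS": {
            "init_command": "PRAGMA journal_mode=wal;",
        },
    },
}
```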

Questions:

  • Do we have different tables that would benefit from different DB engines?
  • Could we improve reliability by splitting the workload across separately resourced DBs?
  • If so, what would be the steps to implement such a change without data loss?
  • What would the level of effort be for such an implementation?
  • Do you recommend such a change? Feel free to add context not covered by the answers to the preceding questions.

Note:
The questions here closely relate to questions in T370969; feel free to use that task to inform this one, or vice versa.

Resources:

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". Jul 25 2024, 1:02 AM
jsn.sherman renamed this task from [SPIKE] Investigate how Wikilink could handle different DB providers to [SPIKE] Investigate how Wikilink could handle different DB providers [4HRS]. Mar 19 2025, 3:34 PM
jsn.sherman renamed this task from [SPIKE] Investigate how Wikilink could handle different DB providers [4HRS] to [SPIKE] Investigate how Wikilink could handle different DB providers [16HRS].
jsn.sherman renamed this task from [SPIKE] Investigate how Wikilink could handle different DB providers [16HRS] to [SPIKE] Investigate how Wikilink could handle different DB providers [8HRS].
jsn.sherman updated the task description.
DMburugu raised the priority of this task from Low to Medium. May 13 2025, 4:13 PM
Amdrel changed the task status from Open to In Progress. May 21 2025, 11:57 PM
Amdrel claimed this task.

I spent some time looking into this. I performed the investigation under the assumption that any new databases we might add would run on the same instance and disk as the current database, since everything in this project runs on a single instance to keep costs down.

Do we have different tables that would benefit from different DB engines?

The tables most likely to benefit from a different DB engine, SQLite specifically, are links_linkevent and links_linksearchtotal. These are the most write-heavy tables at all hours of the day, since the event stream writes to them whenever link-change events are received. I haven't run any benchmarks comparing SQLite (with WAL) to MariaDB, but it's my understanding that SQLite potentially has less overhead for high-frequency small writes. As for read and write contention on the instance: if a new database engine ends up running on the same instance, we might not see much benefit at all if the overall number of I/O operations is our performance bottleneck.
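One cheap way to sanity-check the "less overhead for small writes" hypothesis before committing to anything is a throwaway micro-benchmark against SQLite itself. The sketch below compares single-row commits under WAL versus the default rollback journal; the table shape is illustrative, not the real links_linkevent schema, and absolute numbers depend heavily on the disk, so treat the output as a rough signal only.

```python
import os
import sqlite3
import tempfile
import time


def time_single_row_commits(journal_mode: str, n: int = 1000) -> float:
    """Time n single-row transactions against a throwaway SQLite file."""
    path = os.path.join(tempfile.mkdtemp(), "bench.sqlite3")
    conn = sqlite3.connect(path)
    conn.execute(f"PRAGMA journal_mode={journal_mode};")
    conn.execute("CREATE TABLE linkevent (id INTEGER PRIMARY KEY, url TEXT)")
    start = time.perf_counter()
    for i in range(n):
        # One transaction per row mimics the event stream's write pattern.
        with conn:
            conn.execute(
                "INSERT INTO linkevent (url) VALUES (?)",
                (f"https://example.org/{i}",),
            )
    elapsed = time.perf_counter() - start
    conn.close()
    return elapsed


wal = time_single_row_commits("wal")
rollback = time_single_row_commits("delete")  # the default rollback journal
print(f"wal={wal:.3f}s delete={rollback:.3f}s for 1000 single-row commits")
```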

Could we improve reliability by splitting the workload across separately resourced DBs?

Potentially, yes. Since the MariaDB and SQLite database processes would be separate, if one database starts having issues, the other wouldn't be affected. For example, if the MariaDB database fails (for reasons other than I/O issues), this wouldn't affect the event stream service making writes to the SQLite database storing our link events.

If so, what would be the steps to implement such a change without data loss?

Since our link tables are not very large after the recent archival changes, we could probably migrate the data from MariaDB to SQLite quite quickly using a management script while the containers are off, then toggle a feature flag so that link events use the new database. If we're worried about the reliability of writes to the new database, we could mirror writes to the SQLite database while continuing to read from the MariaDB tables, to evaluate how well it works. We should be able to mirror writes in Django using a 'post_save' signal, avoiding any lengthy refactors. It's worth noting that in the meantime, the performance of the MariaDB database would be impacted if the SQLite database is present on the same disk.
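The post_save mirroring idea could be sketched roughly as follows. The "linkevents" database alias and the wiring location are assumptions for illustration; the guard on the signal's `using` kwarg stops the mirrored save, which fires post_save again, from re-triggering the handler.

```python
def mirror_linkevent(sender, instance, created, **kwargs):
    """Duplicate newly created LinkEvents into the 'linkevents' database.

    Hypothetical sketch: 'linkevents' stands in for the alias of the new
    SQLite database, and the sender for the existing LinkEvent model.
    """
    # Django passes the DB alias used for the save as the 'using' kwarg.
    # Guarding on it prevents infinite recursion when the mirrored save
    # fires post_save a second time.
    if created and kwargs.get("using") != "linkevents":
        # force_insert with the preserved pk keeps the two copies aligned
        # during the evaluation period.
        instance.save(using="linkevents", force_insert=True)


# Wiring (e.g. in the links app's AppConfig.ready()):
#   from django.db.models.signals import post_save
#   post_save.connect(mirror_linkevent, sender=LinkEvent)
```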

What would the level of effort be for such an implementation?

Medium. Thankfully, the links tables that would most benefit from moving to SQLite are fairly isolated from the rest of the schema; apart from references to collections and users, their relationships don't look problematic to me, and I don't see any JOINs that would complicate things. Django has routing capabilities that should allow us to split the schema without modifying all code that references LinkEvents. I think most of the effort would go into testing to ensure we don't introduce regressions, and into benchmarking to evaluate any performance gains or losses.
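Django's routing hook mentioned above is a plain class registered via DATABASE_ROUTERS. A rough sketch is below; the app label "links", the model names, and the "linkevents" alias are hypothetical stand-ins for the real names, and a Django router needs no imports of its own.

```python
class LinkEventRouter:
    """Route the write-heavy link models to a separate 'linkevents' DB.

    Hypothetical sketch: app label and model names are inferred from the
    table names in this task and would need to match the real app.
    """

    route_models = {"linkevent", "linksearchtotal"}

    def _is_link_model(self, model):
        return (model._meta.app_label == "links"
                and model._meta.model_name in self.route_models)

    def db_for_read(self, model, **hints):
        # Returning None defers to the next router / the default database.
        return "linkevents" if self._is_link_model(model) else None

    def db_for_write(self, model, **hints):
        return "linkevents" if self._is_link_model(model) else None

    def allow_relation(self, obj1, obj2, **hints):
        # Cross-database FKs can't be enforced by the database, but allowing
        # the relation keeps the existing references to users/collections
        # working at the ORM level.
        return True

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        if app_label == "links" and model_name in self.route_models:
            return db == "linkevents"
        return None


# Registered in settings, e.g.:
#   DATABASE_ROUTERS = ["links.routers.LinkEventRouter"]  # hypothetical path
```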

Do you recommend such a change? Feel free to add context not covered by the answers to the preceding questions.

Under the current constraint of running everything on the same instance, I don't think this would net us enough of a performance gain to be worth it. If link events were stored on a separate instance or disk from the aggregates, it might be worth it, since that could reduce read and write contention; however, we could also accomplish that with a read replica, which would require less work to implement. Recent performance improvements, such as warehousing program aggregates (still an open PR, if I recall) and truncating the link events tables, have been very impactful, and I don't expect adding another database to the mix would produce gains of the same magnitude. In my opinion, continuing to optimize the current setup makes more sense unless our constraints change.

> Do you recommend such a change? Feel free to add context not covered by the answers to the preceding questions.
>
> Under the current constraint of running everything on the same instance I don't think this will net us enough performance gains to be worth it. If link events were to be stored on a separate instance or disk from aggregates then it might be worth it as that could help with reducing read and write contention, though we can also accomplish this using a read replica which would require less work to implement.

Thanks for your thoughtful write-up. I agree with this conclusion.
I think the next steps would be to investigate adding read replicas.
This ticket T370946: SPIKE: Investigate implementing read-only replicas using a trove managed database in Cloud VPS [8H] is about using trove managed databases in particular.

I'll leave this open until another engineer takes a look, but thanks for your work!

We talked about this in our sync today and agreed that it would make sense to do this in concert with moving the other tables to a managed Trove instance, should we find that it's worth doing in T370946: SPIKE: Investigate implementing read-only replicas using a trove managed database in Cloud VPS [8H]. That way we keep "ephemeral" data on the local instance and only put persistent data in the managed service.