
For global elections, stop creating eligible voters table for each election on every wiki and keeping them forever
Open, Needs Triage, Public

Description

Currently, on each s3 replica alone, we have 9,477 tables that were created for SecurePoll elections. For example:

root@db1157:/srv/sqldata/fawikiquote# ls -Ssh *_edits.ibd 
368K bv2015_edits.ibd  176K bv2022_edits.ibd   176K ucoc2023_edits.ibd	160K bv2021_edits.ibd	 144K bv2017_edits.ibd	 96K pif_edits.ibd
192K bv2011_edits.ibd  176K u4c2024_edits.ibd  176K ucoc_edits.ibd	160K mcdc2021_edits.ibd  128K bv2009_edits.ibd

This is still holding data for elections from 15 years ago and adds a lot of maintenance burden (e.g. contributing to issues in T349360: Clean up dbbackups.backup_files table (backups tracking table growing really large), files opened by MariaDB, the inode table, etc.). It's hard to track down what these tables are supposed to be (it took me a while to figure out what pif_edits is) and it's inefficient: the table from the 2009 election has now been backed up thousands of times.

And this will get worse every year as more elections are held; it is not sustainable.

We either need to stop creating a new table for each election or phase out previous ones. If it's needed for legal reasons, we can create a swift container and dump them there.

Acceptance criteria

  • The current workflow is to run multiple maintenance scripts in a certain order. These scripts first write their data to intermediate tables such as ucoc2023_edits, then a final maintenance script writes its results to centralauth db -> securepoll_lists. Rewrite the maintenance scripts to skip the intermediate tables such as ucoc2023_edits and write directly to centralauth db -> securepoll_lists. List of maintenance scripts to rewrite:
    • TBD

Ideas for additional tickets

  • Investigate whether / for how many years the data in the intermediate tables such as ucoc2023_edits needs to be kept (by consulting T&S, Legal, DBAs, etc.). Make a decision on which / how many of these tables to delete, then delete them. Expected workflow impact from deleting these tables: zero impact on completed elections. The valuable data (the final voter eligibility whitelist) appears to be stored in centralauth db -> securepoll_lists; the intermediate tables are just a middle step used to generate that final whitelist.
  • T393247: SecurePoll: Drop old rows from the securepoll_lists table on WMF production - Investigate / decide if it's worth deleting a couple million rows of old voter eligibility data from securepoll_lists tables on non-centralauth wikis.

Related Objects

Event Timeline


Where in the code are these tables created? I don't recall the core code needing to create separate tables for elections. I know WMF has occasionally created tables (see ./cli/wm-scripts) but those are not for every single election either. And at least the present code (same link) only relates to 2022 and later.

Wikitech documentation details the process: https://wikitech.wikimedia.org/wiki/SecurePoll

For a global election we need to ensure voters are eligible no matter which project they made their edits on. Since that could be hundreds of separate projects (in theory), we need to make these tables on every wiki. It is far from an ideal solution of course but that's the reasoning for it right now.

We reduced a bit of maintenance debt, but future devs need to find a better solution to this.

Has a "one set of tables per wiki instead of per election" approach been considered? We could add an election_id column to each table. I think this would be much more scalable in the long run than a separate set of tables for each election.

It's because the columns depend on the election. You might have different criteria that could not fit inside the table.

OTOH, it could turn into a JSON blob instead (depends on the query patterns, but I highly doubt you'd need to query anything besides looking up values for a given user).

Or it could be pivoted into a long format (one row for each property, as opposed to one column for each property).
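The long-format idea above could be sketched roughly like this (a sketch only; the table and column names are illustrative, not an actual proposal from this task):

```sql
-- Hypothetical long-format sketch: one row per (election, user, property)
-- instead of one column per property, so elections with different
-- criteria fit the same schema.
CREATE TABLE securepoll_election_data (
	ed_election_id INT NOT NULL,
	ed_user INT NOT NULL,
	ed_property VARBINARY(64) NOT NULL,	-- e.g. 'long_edits', 'short_edits'
	ed_value INT NOT NULL,
	PRIMARY KEY (ed_election_id, ed_user, ed_property)
);
```

Looking up all values for a given user and election is then a single range scan on the primary key.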

Can we at least prefix each table with securepoll_ or whatever (in future) so they're more visibly grouped together in reports?

Can we at least prefix each table with securepoll_ or whatever (in future) so they're more visibly grouped together in reports?

We can update the technical docs to advise that; I've boldly done so on Wikitech.

As for the task itself I briefly spoke with @Ladsgroup and I think generally we can just drop these tables after some arbitrary time after the election is over (and the people elected are seated, etc., just in case we need to re-tally). He mentioned a year, and I think that's fine, since we can just re-run the scripts if necessary.

Can we at least prefix each table with securepoll_ or whatever (in future) so they're more visibly grouped together in reports?

We can update the technical docs to advise that; I've boldly done so on Wikitech.

Thanks, that'd be great!

As for the task itself I briefly spoke with @Ladsgroup and I think generally we can just drop these tables after some arbitrary time after the election is over (and the people elected are seated, etc., just in case we need to re-tally). He mentioned a year, and I think that's fine, since we can just re-run the scripts if necessary.

Ack. +1

Thank you, I turn on the chainsaw now.

Novem_Linguae renamed this task from SecurePoll creates a table for each election and keeps it forever to For global elections, SecurePoll creates an eligible voters table for each election on every wiki and keeps it forever. (Oct 29 2024, 10:09 AM)
kostajh subscribed.

There are at least two concerns:

  • a high number of tables in production affects the stability of MariaDB in some clusters
  • ad hoc tables per election mean the tables aren't listed in the new table catalog; the DBA preference is therefore one table, with rows per election

So, from TSP point of view, we prefer a single table with rows per election.

There's a related issue of filtering out who is eligible to vote, which would help reduce the number of rows created.

I'm moving this to "Needs engineering specification" so that we update the task to use the TSP task template, to clarify how we'd like to implement this.

kostajh changed the task status from Open to Stalled. (Jan 29 2025, 10:25 AM)

Thanks. Let me know if I can help with anything or provide further information regarding this issue.

@Ladsgroup Would it be sufficient/possible to have a cron job delete voter eligibility tables a year after they've been used? SecurePoll doesn't enforce naming on its tables, but it's now documented that they should be consistently named, so it should be safe to delete securepoll_bv* as long as they can be filtered by timestamp. If this is a viable solution, we could also look into technically enforcing naming as part of this work.

@STran the MW database users don't have DROP rights (because, due to a series of unfortunate events, an important and useful table was dropped, bringing down all of Wikipedia for an hour). If it's in one table, sure, we can drop the rows via a cron job (that only requires the DELETE right, and that's how the CU and RC tables are cleaned up).

Oh, and it won't really be helpful. We do have issues with the number of tables in s3, which hosts many wikis. One table per wiki per year might not sound like much, but it's causing strain on prod (e.g. currently we have four such tables per wiki because we had four global elections last year).
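For the single-table case mentioned above, the cron cleanup could run something like the following (a sketch only; the table and column names are hypothetical, mirroring the single-table proposals elsewhere in this task):

```sql
-- Hypothetical batched cleanup, similar in spirit to how CU/RC rows are
-- pruned; requires only the DELETE right, not DROP. Assumes a MediaWiki
-- binary(14) timestamp column for the election end date.
DELETE FROM securepoll_eligible_voters
WHERE ev_election_end_date < DATE_FORMAT(NOW() - INTERVAL 1 YEAR, '%Y%m%d%H%i%s')
LIMIT 1000;
```

Run repeatedly until zero rows are affected, to keep replication lag manageable.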

So, from TSP point of view, we prefer a single table with rows per election.

Given the comment by Ladsgroup above and the comment above by Kosta, could we focus on one table per wiki for all elections? Alternatively some kind of central table or extension1 table.

I get that it would be more work, but it feels like this is a better solution for production.

Given the comment by Ladsgroup above and the comment above by Kosta, could we focus on one table per wiki for all elections? Alternatively some kind of central table or extension1 table.

Yes. My question was more to evaluate whether we could pass along some well-scoped work to contractors to band-aid the problem until we had more engineering resources for this. If we can't, then it is what it is, and we should investigate the proposed solution more before committing to any engineering.

Hi! I was updating the list of tables that are not cataloged (T363581: Build a machine-readable catalogue of mariadb tables in production) and I saw the report ballooning: it seems we now have a securepoll_ucocar2025_edits table on every wiki. Is this intentional? If so, can this task be prioritized? We now have 5K tables in production just for random SecurePoll elections in the past year.

Dropping u4c2024_edits and u4c202404_edits everywhere since T&S confirmed it's okay to drop them.

Research

Here's a sampling of table schemas, to give us an idea of whether they're using the same fields or not.

cli/wm-scripts/ucocar2025/ucocar2025_tables.sql

CREATE TABLE securepoll_ucocar2025_edits (
	bv_user int not null primary key,
	bv_long_edits int not null,
	bv_short_edits int not null
);

cli/wm-scripts/wmcr2024/wmcr2024_tables.sql

CREATE TABLE wmcr2024_edits (
	bv_user int not null primary key,
	bv_long_edits int not null,
	bv_short_edits int not null
);

cli/wm-scripts/bv2024/bv2024_tables.sql

CREATE TABLE bv2024_edits (
	bv_user int not null primary key,
	bv_long_edits int not null,
	bv_short_edits int not null
);

More old scripts to look at, if needed: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/SecurePoll/+/750803

Proposal

These all appear to use the same fields. So maybe the fix here is to create one securepoll_eligible_voters table with the same fields, plus an election_wiki (votewiki) field, an election_id field, and a field with the end date of the election so that maintenance scripts can easily delete all rows before a certain date when cleaning up old data. And I guess we can throw in a primary key too. So something like...

CREATE TABLE securepoll_eligible_voters (
	ev_id int not null primary key,
	ev_election_wiki VARBINARY(32) NOT NULL,
	ev_election_id int not null,
	ev_election_end_date binary(14) null default null,
	ev_user_id int not null,
	ev_long_edits int not null,
	ev_short_edits int not null
);

Then we just create one table per wiki, update the maintenance scripts to use the new table going forward, and in a couple of years delete all the old tables without bothering to move their data into the new securepoll_eligible_voters table.
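If old rows are pruned by election end date as described, the proposed table would presumably also want an index on that column so the cleanup query doesn't scan the whole table (a sketch against the proposed schema above):

```sql
-- Supports "delete all rows before a certain date" without a full scan.
CREATE INDEX ev_election_end_date
    ON securepoll_eligible_voters (ev_election_end_date);
```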

Acceptance criteria / subtickets

Having a key-value column whose value is a JSON blob would be much simpler and more future-proof for different elections with different criteria. These tables are quite small, so they don't take up enough space to be worried about that.
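The key-value/JSON idea could look something like this (names are illustrative, not from this task; note that in MariaDB the JSON type is an alias for LONGTEXT with a validity check):

```sql
-- Hypothetical sketch: one row per user per election, with the criteria
-- values packed into a JSON blob so new elections need no schema change.
CREATE TABLE securepoll_eligibility_data (
	eld_election_id INT NOT NULL,
	eld_user INT NOT NULL,
	eld_data JSON NOT NULL,	-- e.g. {"long_edits": 120, "short_edits": 30}
	PRIMARY KEY (eld_election_id, eld_user)
);
```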

Should this ticket be changed from stalled to open? I don't think this ticket is blocked on anything external such as an upstream change or subtickets. I think it just needs a bit of discussion to hash out the approach.

mszabo changed the task status from Stalled to Open. (Apr 23 2025, 7:09 AM)
mszabo subscribed.

Unstalling per discussion.

After discussion with myself, Mate, Amir, and Dreamy Jazz at the hackathon...

  • We've discovered that there's a securepoll_lists table that already exists on every wiki, and it is used by CLI scripts to store eligible voters for global elections. The global election CLI scripts write to this table in the centralauth database as a final step in generating the global election voter lists.
  • Checking production, the securepoll_lists tables on regular wikis are also populated with many thousands of rows. This is perhaps left over from when data was being generated by CLI scripts but before we had CentralAuth.
  • In terms of approach, we are thinking of skipping having an intermediate table such as securepoll_ucocar2025_edits, securepoll_eligible_voters, etc. and just refactoring the CLI scripts to write directly to the centralauth securepoll_lists table.

It seems like we can fix this without doing any database schema changes.
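As a rough illustration of that final step, assuming securepoll_lists keeps its current shape (a list name plus one member user ID per row; li_name and li_member in the extension's schema, if memory serves), the refactored scripts would end by inserting rows like:

```sql
-- Hypothetical final write on the centralauth database: one row per
-- eligible user, keyed by the election's list name.
INSERT INTO securepoll_lists (li_name, li_member)
VALUES ('board-vote-2024', 12345);
```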

Another option is we could create an additional SQL table to make side-by-side testing easier. That is, for an upcoming global election or for a global test election, we'd run the old CLI scripts using the old tables and procedures, and run the new CLI scripts using the new tables and procedures, and make sure that their outputs are identical.

One challenge with side-by-side testing though is we'd need to make sure that we are able to get an accurate snapshot that isn't dependent on the timing of running the voter eligibility scripts. For example, sometimes one of the voter eligibility requirements is that the user is not a bot. If hypothetically the user's bot group changed and the CLI scripts processed that wiki at different times, then the eligible voter lists would not be identical.

Note that securepoll_lists will start getting filled once wikis start having local elections. Securepoll_lists stores the various whitelists / lists that can be imported via the Voter Eligibility page in SecurePoll. I originally had a hypothesis that securepoll_lists was only used by CLI, but this hypothesis was incorrect.

Thanks for the investigation and comments. Would someone be willing to update the task description based on the discussion?

Thanks for the investigation and comments. Would someone be willing to update the task description based on the discussion?

I took a stab at it. Others should feel free to refine it.

If there is some capacity, can this be picked up sooner? The reason I'm asking is that T&S is planning to run the next global election really soon, and if we don't fix it by then, we will have yet another table in every wiki: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/SecurePoll/+/1151383. If there is no capacity, that's fine. See also T395429: Stop the need to manually run write queries on primary databases when running global elections.

If there is some capacity, can this be picked up sooner? The reason I'm asking is that T&S is planning to run the next global election really soon, and if we don't fix it by then, we will have yet another table in every wiki: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/SecurePoll/+/1151383. If there is no capacity, that's fine. See also T395429: Stop the need to manually run write queries on primary databases when running global elections.

We are running low on engineering capacity. We're gearing up for temporary accounts deployments (happening this quarter and the next one). With the elections being in August, it feels like a very short window to make a major change like this one. Would it be alright to delay this until after the elections so we have a bit more breathing room to do this?

We are running low on engineering capacity. We're gearing up for temporary accounts deployments (happening this quarter and the next one). With the elections being in August, it feels like a very short window to make a major change like this one. Would it be alright to delay this until after the elections so we have a bit more breathing room to do this?

FTR the next global election is actually next week. (The U4C elections)

We are running low on engineering capacity. We're gearing up for temporary accounts deployments (happening this quarter and the next one). With the elections being in August, it feels like a very short window to make a major change like this one. Would it be alright to delay this until after the elections so we have a bit more breathing room to do this?

FTR the next global election is actually next week. (The U4C elections)

As long as this is fixed before the next global election after this, I have no objections.

@jrbs Err we discussed end of June as our deadline to make changes on the SecurePoll codebase. Don't we have the board elections in August? Should we stop making changes during the U4C election window?

@jrbs Err we discussed end of June as our deadline to make changes on the SecurePoll codebase. Don't we have the board elections in August? Should we stop making changes during the U4C election window?

I think this can be done concurrently with the U4C election. The creation of the new tables etc has already been done for that election so moving to a new method shouldn't impact it. It does mean later headaches for SRE which Amir wanted to avoid (sorry).

@jrbs Err we discussed end of June as our deadline to make changes on the SecurePoll codebase. Don't we have the board elections in August? Should we stop making changes during the U4C election window?

I think this can be done concurrently with the U4C election. The creation of the new tables etc has already been done for that election so moving to a new method shouldn't impact it. It does mean later headaches for SRE which Amir wanted to avoid (sorry).

@Ladsgroup @jrbs To clarify, our team does not have capacity to take on this task before the end of June, which then puts us dangerously close to the August elections. My proposal is to complete this task after the board elections (in October). Does this sound acceptable?

I need to look at the numbers on s3 health. We could drop a couple of last year's election tables to reduce the number of open files, but given that this is two elections (U4C and board), it will still be a lot.

@Ladsgroup @jrbs To clarify, our team does not have capacity to take on this task before the end of June, which then puts us dangerously close to the August elections. My proposal is to complete this task after the board elections (in October). Does this sound acceptable?

Ah, I see what you mean. I think this is a concern for SRE so would defer to them on how urgent this is.

I think since the lists are still available you can take the chainsaw to earlier ones. Those lists live in centralauth's securepoll_lists table and my understanding, which may be wrong!, is that those should contain the voters for each election anyway. I don't think they depend on the tables.

MariaDB [centralauth]> select distinct li_name from securepoll_lists;
+-----------------------+
| li_name               |
+-----------------------+
| board-vote-2015       |
| board-vote-2015a      |
| board-vote-2017       |
| board-vote-2021       |
| board-vote-2022       |
| board-vote-2024       |
| botgroup-users        |
| bots-                 |
| bots-2021             |
| bots-2022             |
| bots-2023             |
| bots-2024             |
| mcdc-vote-2021        |
| testvoteu4c-vote-2024 |
| u4c-vote-2024         |
| u4c-vote-202404       |
| u4c-vote-2025         |
| u4c202404-vote-2024   |
| ucoc-vote-2022        |
| ucoc-vote-2023        |
| ucocar-vote-2025      |
| wmcr2024-vote-2024    |
| botgroup-users        |
+-----------------------+
23 rows in set (0.001 sec)

(I don't know how to see the lists' contents to confirm my guess though)
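If the schema is the usual two-column one (a list name plus a member user ID, li_member if memory serves), the contents can presumably be inspected with something like:

```sql
-- Hypothetical: list a few (local) user IDs on one list; join against the
-- user table to resolve names if needed.
SELECT li_member
FROM securepoll_lists
WHERE li_name = 'board-vote-2015'
LIMIT 10;
```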

These are tables that are still in production: P75886

In s3 there are currently ~190,000 files that need to be opened by MariaDB, and around ~30,000 of them are just securepoll:

For example this is hywiki:

root@db1212:/srv/sqldata/hywiki# ls | grep -i securepoll
securepoll_cookie_match.frm
securepoll_cookie_match.ibd
securepoll_elections.frm
securepoll_elections.ibd
securepoll_entity.frm
securepoll_entity.ibd
securepoll_lists.frm
securepoll_lists.ibd
securepoll_msgs.frm
securepoll_msgs.ibd
securepoll_options.frm
securepoll_options.ibd
securepoll_properties.frm
securepoll_properties.ibd
securepoll_questions.frm
securepoll_questions.ibd
securepoll_strike.frm
securepoll_strike.ibd
securepoll_u4c2025_edits.frm
securepoll_u4c2025_edits.ibd
securepoll_ucocar2025_edits.frm
securepoll_ucocar2025_edits.ibd
securepoll_voters.frm
securepoll_voters.ibd
securepoll_votes.frm
securepoll_votes.ibd

(plus some other tables like bv2024_edits.frm and wmcr2024_edits.frm)

Can you at least drop some of these tables? Do we really need securepoll_msgs, securepoll_questions or securepoll_cookie_match?

Expanding on this idea that SecurePoll has too many SQL tables in its default installation, I've filed:

I also added this task to the parent task.

SD0001 renamed this task from For global elections, SecurePoll creates an eligible voters table for each election on every wiki and keeps it forever to For global elections, stop creating eligible voters table for each election on every wiki and keeping them forever. (Jun 10 2025, 7:06 PM)

(Changed title for clarity - SecurePoll by itself doesn't create any tables. The tables are created by running maintenance scripts and have nothing to do with MediaWiki-extensions-SecurePoll except that the scripts happen to live in the SecurePoll repo. I'd argue they should actually go in WikimediaMaintenance, as they're all WMF-specific.)

(Changed title for clarity - SecurePoll by itself doesn't create any tables. The tables are created by running maintenance scripts and have nothing to do with MediaWiki-extensions-SecurePoll except that the scripts happen to live in the SecurePoll repo. I'd argue they should actually go in WikimediaMaintenance, as they're all WMF-specific.)

This feels like an unnecessary separation. SecurePoll itself is WMF-specific.

SecurePoll itself is WMF-specific.

Nowadays it works fairly well as a generic voting tool. I don't think there's any Wikimedia-specific code (except in the /cli directory).


Back to the task itself – is there a reason the production db is used for this? Is it just to avoid the replication lag of the wiki replicas impacting a voter's eligibility?

Crunching the data on a Cloud VPS machine, using Trove (or even SQLite?) to store the intermediate tables, and exporting the final list as a CSV seems the obviously *right* way to do this. There's already a UI (Special:SecurePoll/votereligibility/<id>/edit/voter) to upload the final list to production's securepoll_lists table. If the list is too big for the web UI to handle, we need at most one maintenance script to do it programmatically – and that one can be made configurable and reused, removing the need to add new maint scripts for every election. EDIT: That UI isn't used for global elections, which instead use the need-central-list property – for which there's no UI, see T288183 – but there's an importGlobalVoterList maint script to populate it from an externally generated list of usernames.

I have been poking away at a port of the list generation code that can be run on a Cloud Services instance. I ported the PHP code into Node.js – it's at https://gitlab.wikimedia.org/sd/global-election-list-builder. I initially used SQLite to store the per-wiki edit counts, but it was rather slow, processing only 60–80 users/second. I switched to redis (valkey 8.0 to be precise), which improves throughput to 150 users/second. The EditCountStore was made an interface into which any data store can be easily plugged in.

As it's querying the wiki replicas, I also did some optimizations that probably won't be possible with prod dbs:

  • instead of fetching the list of global users in batches of 1000, I use a single query that fetches all 78 million users and streams it to a file. Remarkably, this takes just 37 seconds.
  • the PHP script fetches the actor_id for each user_id individually. Instead, I similarly prefetch all user_ids + actor_ids upfront and stream it to a file. A user_editcount > 0 condition is applied which cuts the number of users from 41 million on enwiki to just 14 million.
  • parallelized the queries for counting the "long edits" and "short edits", which can't be done in PHP as it's single-threaded.
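The upfront user/actor prefetch described above presumably boils down to a single query along these lines (column names are from MediaWiki core; the exact query in the linked repo may differ):

```sql
-- Stream all registered users with at least one edit, together with their
-- actor_id, in one pass instead of per-user lookups.
SELECT user_id, actor_id
FROM user
JOIN actor ON actor_user = user_id
WHERE user_editcount > 0;
```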

Tests were run on a g4.cores16.ram32.disk20 instance belonging to Quarry project. I estimate it would take about a day to process enwiki's users. As that's s1, and we can process other db slices in parallel, it would take one day overall. And then a couple of hours for the make-global-list script to sum up counts from all wikis and produce the final list.

@jrbs This is faster than the existing scripts I think? If there's interest, I can set up a dedicated VPS project, finish up the testing and help running this for the next election.

I have been poking away at a port of the list generation code that can be run on a Cloud Services instance. I ported the PHP code into Node.js – it's at https://gitlab.wikimedia.org/sd/global-election-list-builder. I initially used SQLite to store the per-wiki edit counts, but it was rather slow, processing only 60–80 users/second. I switched to redis (valkey 8.0 to be precise), which improves throughput to 150 users/second. The EditCountStore was made an interface into which any data store can be easily plugged in.

As it's querying the wiki replicas, I also did some optimizations that probably won't be possible with prod dbs:

  • instead of fetching the list of global users in batches of 1000, I use a single query that fetches all 78 million users and streams it to a file. Remarkably, this takes just 37 seconds.
  • the PHP script fetches the actor_id for each user_id individually. Instead, I similarly prefetch all user_ids + actor_ids upfront and stream it to a file. A user_editcount > 0 condition is applied which cuts the number of users from 41 million on enwiki to just 14 million.
  • parallelized the queries for counting the "long edits" and "short edits", which can't be done in PHP as it's single-threaded.

Tests were run on a g4.cores16.ram32.disk20 instance belonging to Quarry project. I estimate it would take about a day to process enwiki's users. As that's s1, and we can process other db slices in parallel, it would take one day overall. And then a couple of hours for the make-global-list script to sum up counts from all wikis and produce the final list.

@jrbs This is faster than the existing scripts I think? If there's interest, I can set up a dedicated VPS project, finish up the testing and help running this for the next election.

I'm not really technical enough to evaluate this but it does sound extremely promising. I spoke with Tim some years ago about a similar project.

In the meantime I have filed T398900 for this year's election since it will begin quite soon (August 27, 2025)

I created a subtask to explore the above approach: T404506. The TL;DR is that the populate-edit-count part took only 9 hours, much faster than the earlier estimate of 1 day!

@jrbs can I drop securepoll_u4c2025_edits everywhere now?

@jrbs can I drop securepoll_u4c2025_edits everywhere now?

Yes. Get the chainsaw :)

Mentioned in SAL (#wikimedia-operations) [2025-11-17T21:22:06Z] <Amir1> DROP table if exists securepoll_u4c2025_edits; on all wikis (T355594)