etherpad table size is 233GB / plan to delete all etherpads in May 2026
Open, High, Public

Description

The etherpad database (etherpadlite) lives in m1.
The current design (I don't know if this has changed in different versions) has just one table, store, and that design isn't very efficient:

cumin2024@db2160.codfw.wmnet[etherpadlite]> show create table store\G
*************************** 1. row ***************************
       Table: store
Create Table: CREATE TABLE `store` (
  `key` varchar(100) NOT NULL DEFAULT '',
  `value` longtext NOT NULL,
  PRIMARY KEY (`key`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin
1 row in set (0.031 sec)

This table is huge on disk:

root@db2160:/srv/sqldata.m1/etherpadlite# ls -lh store.ibd
-rw-r----- 1 mysql mysql 233G Jan 22 06:13 store.ibd

This is an approximate number of rows (it is probably higher):

cumin2024@db2160.codfw.wmnet[mysql]> select n_rows from innodb_table_stats where table_name='store';
+-----------+
| n_rows    |
+-----------+
| 490755404 |
+-----------+
1 row in set (0.031 sec)
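
As a rough back-of-the-envelope check of the two figures above, the on-disk size divided by the estimated row count gives an average footprint per row. This is only a sketch: it assumes that `ls -lh` is reporting binary units (GiB), and it relies on the approximate `n_rows` value, so the result is indicative at best.

```python
# Rough estimate of the average on-disk footprint per row in `store`.
# Assumptions: "233G" from `ls -lh` means 233 GiB, and n_rows is the
# approximate figure reported by innodb_table_stats.

GIB = 1024 ** 3


def avg_bytes_per_row(size_gib: float, n_rows: int) -> float:
    """Return the average number of on-disk bytes per row."""
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    return size_gib * GIB / n_rows


if __name__ == "__main__":
    estimate = avg_bytes_per_row(233, 490_755_404)
    print(f"~{estimate:.0f} bytes per row on average")
```

With these inputs the estimate comes out at roughly half a kilobyte per row. That figure includes index and page overhead in the .ibd file and says nothing about how row sizes are distributed.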

While this is not causing any immediate issues right now (other than making backups much slower), we are concerned about this model, its scalability, and its future. I don't think we have ever (in 10 years) purged etherpad or even done a clean-up.
We should discuss options and approaches if we want to maintain this tool.

Event Timeline

How has this never been an issue with any other Etherpad instance I'm aware of, is not mentioned in https://github.com/ether/etherpad-lite/issues , but is now an issue specifically with the one hosted on wikimedia.org? If you really can't provide a reliable service without regular complete purges, I agree with @MGChecker that opting for an external Etherpad instance (or even some other real-time collaboration text editor) would be the better choice.

There is https://github.com/ether/etherpad-lite/issues/6194 (at least) about problems in this space.

Google Docs is always a better choice

Google Docs is always a bad choice :) 🌈 https://meta.wikimedia.org/wiki/Wikimedians_for_software_freedom#Google_Suite

Yeah, it is optional and you can live without it if you don't like it: if you don't need real-time collaboration, just place everything on-wiki; otherwise, you can use a third-party Etherpad instance (which is FOSS), as long as you know they are less stable and reliable than Google Docs and you can tolerate that content will disappear in the future.

Anyway, it is just too risky to host an Etherpad at Wikimedia, and there is no longer a risk for WMF if third-party services are used.

Update: some statements are considered retracted. They should be considered informative only and no longer endorsed by me.

(@Bugreporter: We are going off-topic, but real-time collaboration is available in a stable and reliable way through Collabora Online + Nextcloud. 100% Libre Software. No need to use Google Docs for a Wikimedia organization in 2026. We should "just" not expect a service provider to offer a gratis Open Source service to billions of humans, forever, with infinite space, etc., since these kinds of gratis things only happen at the Big Techs, where we pay with our attention and our data and lose control of the software. Anyway, let's discuss elsewhere. Just please do not endorse Google Docs lol ❤️)

Edited: totally agree if somebody wants to explore budget for an external Etherpad-as-a-service!

Yeah, any third-party service would probably be fine (whether to use FOSS or non-free, gratis or paid, is your choice, if you understand their pros and cons); just do not install an anonymously editable Etherpad on a Wikimedia server.

That means if you need to run an event, you can easily check whether it'll fall under the purge time or not.
At any given point in time, there might be corruption of data or other issues that force us to remove pads or everything altogether.

Etherpad's reliability has always been a concern. A good example of this is during master switchovers on the DB hosts where Etherpad db runs: because the tool handles connection pooling poorly, it may remain connected to the old master. To avoid split-brain scenarios, we have to kill existing connections and sometimes restart the Etherpad service entirely — this is even documented in our procedure (https://wikitech.wikimedia.org/wiki/MariaDB/misc). When a restart is necessary, some pads or recent changes may be lost, and we have no way of knowing how many or which ones are affected. There is no way for us to redirect connections to the new master without doing that.

If I see it correctly, we did not try to do a global search across Phabricator yet, is this correct? The standard search seems to be a bit lacking for that purpose, but I don't know whether there is something better. It certainly finds stuff.

Etherpads found on Phabricator

P88888 (notice the nice round number ...)

This was generated using Maniphest search rather than global search since the latter has no API, but I don't think there's anything interesting other than tasks on this instance.

834 of those were valid and previously unsaved pads, which I now added to /mnt/nfs/labstore-secondary-tools-project/etherpad-backup/public_html/p/
The remaining ones were either mentioned in other lists before or non-existent / empty.

LSobanski triaged this task as Medium priority. Feb 20 2026, 12:25 PM
LSobanski raised the priority of this task from Medium to High.
LSobanski moved this task from Incoming to Work in Progress on the collaboration-services board.

Wikimedia Etherpads have been the WMF-recommended storage location for Wikimedia community organizing and governing notes since somewhere between 2011-2015.

Deleting these notes is not primarily a technical issue, but rather a social or ethical issue. Can I request that someone from the Wikimedia Foundation who feels empowered to speak on social and ethical issues in community governance please identify themselves, make themselves a point of contact for social questions, and make a statement on deleting this content? Phabricator is not a conventional place to have non-technical discussions on Wikimedia community governance matters. The etherpad notes have a lot of functions, including serving as WMF-mandated evidence of community meetings as required in metrics and impact reporting for Wikimedia grants.

For years I have migrated etherpad conversations into Meta-Wiki. Doing this raises its own set of ethical issues, including that meeting participants often consent to notes in Etherpad but object or decline to consent to have etherpad notes moved or ported to other platforms, like Meta-Wiki. Consent to be in etherpad is not necessarily consent to be quoted, archived in Meta-Wiki, or migrated elsewhere. Also, we never settled the issue of copyright in Etherpad instances. Are notes in Etherpad created with Wikimedia compatible licenses? What are we doing with these?

Some requests: WMF names a point of contact for the etherpad deletion project, WMF amnesty in grant reporting for lost community records in deleted Etherpads, WMF makes a recommendation for using AI tool prompts to convert Etherpad notes to Wikimedia markdown for posting into Meta, WMF shares a brief impact statement of how many pads containing notes will be deleted so that we have shared facts in discussing how to manage this.

There was recently a wide-ranging brainstorming session in Frankfurt about what Wikimedia should become.
One thing it should probably not become is "unreliable at storing the knowledge outputs of its communities".

The original phab issue here was well framed: this widely used service needs a new approach.
But phab feels like the wrong place to decide on deletion without backup.

How many non-empty pads are there? What stats do we have about them?
It's nice that a few thousand have been backed up, but that seems a fraction of pads made by wikimedians over time. There are also a lot that are just spam. But I know that the long, non-spam, community pads are a gold mine of collective knowledge, some as historical artefacts and some as captured insight with continuing relevance.

Whatever else happens, we should give a copy to IA or another archive to clean up and preserve. Fear of undiscovered and unspecified (and unlikely) Bad Text is not a good reason to prevent archiving, as that is a challenge archivists of the open web deal with every day. 🌺

On clickthrough licenses and regular deletion: this is a good moment to intentionally change the social norms that rely on Etherpad, or choose an alternative.
We shouldn't make a change for purely technical reasons and update the norms as an afterthought. Questions of "risk budget" affect users of the system just as much as they do the hosts of the resulting pads.

  • Clarifying license compatibility is an easy upgrade. CC-By style attribution can list all non-anon contributors on migration.
  • Deletion can be done in a staged way that limits surprise, guarantees preservation, and helps authors migrate.

Either way, the strategy currently suggested here won't be sustainable in practice. If pads can be deleted _at any time_, like while I am writing them or the day after, etherpad will be so unreliable that it would be irresponsible to use it, and people will opt to use Google Docs and other tools outside of our ecosystem as a replacement.

The regular purges will happen at regular, predetermined intervals and will be communicated properly when you create a pad. That means if you need to run an event, you can easily check whether it'll fall under the purge time or not. One idea I have is to delete pad rows after 60 or 90 days and do a full purge on Jan 1/Dec 31 to clean up orphaned, dangling rows.
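
To make the "check whether your event falls under the purge time" idea concrete, here is a minimal sketch. The 90-day retention figure and the Jan 1 full purge are taken from the idea above; the function names are hypothetical, the sketch assumes the yearly full purge removes every pad, and nothing here reflects an actual implementation.

```python
from datetime import date, timedelta

RETENTION_DAYS = 90  # the idea above floats 60 or 90 days; 90 is used here


def pad_expiry(created: date, retention_days: int = RETENTION_DAYS) -> date:
    """Date on which a pad created on `created` becomes eligible for deletion."""
    return created + timedelta(days=retention_days)


def next_full_purge(after: date) -> date:
    """Next Jan 1 strictly after `after` (assumed date of the yearly full purge)."""
    return date(after.year + 1, 1, 1)


def survives_event(created: date, event_day: date) -> bool:
    """True if a pad created on `created` is still expected to exist on `event_day`."""
    deadline = min(pad_expiry(created), next_full_purge(created))
    return event_day < deadline
```

Under these assumptions, a pad created on 2026-03-01 would still be around for an event on 2026-05-01, but a pad created on 2026-12-01 would not survive to 2027-01-15 because of the yearly purge.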

I'm with @MGChecker , the defaults here will determine whether people use the service or not.
@Ladsgroup here is a less variable approach, to ensure no good faith pads are deleted in less than 90 days, and that creators can define on creation a predictable archival URL where the pad will end up:

a) Copy riseup: Let people choose how they want pads kept, via name-suffix. When a pad is being dropped:
~ Check for and drop spam (w/ standard filters).
~ Snapshot TITLE-archive pads with IA's Archive-It.
~ Export TITLE-meta pads to, say, meta:Etherpad/TITLE.

b) When purging: create a new etherpad instance, rotate the last year to etherpad-old.wm.org, and make it read-only.
~ Keep etherpad-old up for 3 months. Show a banner with process & timeline, and one-click meta & Archive-It exports.
~ For people who just started new pads, you can include a "copy to active etherpad" option to return to the new eth.wm.o.
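
The suffix convention in (a) could be sketched roughly as follows. The `-archive` and `-meta` suffixes and the `meta:Etherpad/TITLE` target come from the proposal above; the function name and the returned action labels are hypothetical, and the spam check is left out entirely.

```python
from typing import Optional, Tuple


def archival_action(pad_title: str) -> Tuple[str, Optional[str]]:
    """Decide what happens to a pad when it is dropped, based on its name suffix.

    Returns (action, target); target is only set for exports to Meta.
    The action labels are illustrative, not part of any real tool.
    """
    if pad_title.endswith("-archive"):
        # TITLE-archive: snapshot with an external web archive before deletion.
        return ("snapshot", None)
    if pad_title.endswith("-meta"):
        # TITLE-meta: export to a predictable Meta page named after TITLE.
        base = pad_title[: -len("-meta")]
        return ("export", f"meta:Etherpad/{base}")
    # No suffix: the pad is simply deleted.
    return ("delete", None)
```

The point of keying on the name is that the creator picks the outcome when naming the pad, so the eventual archive location is predictable from day one.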

Deleting these notes is not primarily a technical issue, but rather a social or ethical issue. Can I request that someone from the Wikimedia Foundation who feels empowered to speak on social and ethical issues in community governance please identify themselves, make themselves a point of contact for social questions, and make a statement on deleting this content?

I gave some basic perspective on the security and safety implications of hosting years of essentially non-auditable anonymously contributed content in my comment above: https://phabricator.wikimedia.org/T415237#11621428

It is clear, both from your comment as well as others on this thread, that even though WMF has notionally presented this service as potentially ephemeral, our lack of enforcing that over many years has led people to assume it can be used indefinitely anyway. While I don't know of specific cases, from what you're saying you've heard, I would not be surprised to learn that WMF staff may have also slipped into assuming this, and made recommendations consistent with that, over the years.

We very likely should have been enforcing ephemerality more clearly from the beginning, to avoid getting into this situation. But we, like I think a lot of open internet organizations over the last 10-15 years, have gotten more sensitive to these kinds of online safety risks over time, and we do now need to start enforcing it.

Fear of undiscovered and unspecified (and unlikely) Bad Text is not a good reason to prevent archiving, as that is a challenge archivists of the open web deal with every day. 🌺

I deeply admire the internet's public interest archivists and they do amazing work. But I disagree that the chances of Bad Text are unlikely. We dealt with an urgent distressed-user-initiated request for the removal of PII just a few months ago. As these things go, that was a pretty simple situation. As @Audiodude mentioned in https://phabricator.wikimedia.org/T415237#11628351, people use anonymous online scratch space for all kinds of things, and it doesn't take a very high % of the text to be Bad for it to be bad.

@EMill-WMF Thanks Eric for being present to talk. A situation that I do not want is for all these notes to be treated as a janitorial matter with no thought into what we are throwing out.

The issue from the Wikimedia community view is that we have flown people around the world for years and at strategy meetings and conferences directed volunteers to put notes into these etherpads. Besides events directly organized by the Wikimedia Foundation, there are other events such as the North American WikiConference and special meetings of Wikimedia affiliate organizations which kept their notes in these pads. It would be unfortunate to delete them without having a public conversation about what it would cost to save them.

The notes in these documents are supposed to be less than 1GB of text, and somehow it is 230GB of spam and problems. I understand the mess, even if I have no awareness of how we got to this point.

Can you please do a calculation of the costs and benefits of preserving these notes? I do not want the 230GB of spam or the failing system. I want the notes only, in any format that makes them accessible to the community. If it is too costly to save them, then I can accept and understand that. Some information which would make this a lot easier for me to accept and understand is if you could assign a value to these notes and give an estimate to the cost of checking the collection to extract them. I am going to throw out some guesses - the notes are key records for important strategic conversations where we paid millions of dollars to convene. The notes are worth $50,000. Is there a budget of approximately that much labor available to save them? Is there a price point for saving 70% of them that we can afford, even if we cannot afford 99% of them? If I am off in my calculation, can you correct me with whatever insight you have as to the value of the notes and the resources available to collect them?

thanks

(FWIW, everything that has been linked on any wiki or anywhere else that could have been found is already backed up in a separate Toolforge tool. See T415237#11617021. Thanks @Tkarcher!)

Summary of the lists compiled by @Pppery above and the current status:

List    Description                    No. of pads  Status
P88800  Etherpad URLs - global search  4040         ✅ Done
P88780  Etherpads - misc               720          ✅ Done
P88802  Etherpad - templates           930          ✅ Done
P88781  Etherpads - Wikimania          463          ✅ Done
P88782  Etherpads - index              59           ✅ Done

All exported pads are saved as HTML files in /mnt/nfs/labstore-secondary-tools-project/etherpad-backup/public_html/p/ , accessible via https://etherpad-backup.toolforge.org/p/<title>

The remaining pads should be done by tomorrow.

I am sorry, how do I access any of these? That link is 404 forbidden and while I see the list of titles in other Phabricator tickets, it is not apparent to me where the etherpad content went or how to access it. Is there a way for me to click through and find an etherpad file online?

This comment was removed by Pppery.

(FWIW, everything that has been linked on any wiki or anywhere else that could have been found is already backed up in a separate Toolforge tool. See T415237#11617021. Thanks @Tkarcher!)

Overconfidence ("anywhere") makes me worry. Please be careful with community-produced knowledge, when making irreversible decisions.

Here is a historically important pad linked from a backed-up pad:

https://etherpad-backup.toolforge.org/p/MovementRoles

Here is a pad linked from meta:

https://etherpad.wikimedia.org/p/MR-definitions

Neither seems to be backed up on etherpad-backup. (I spent two minutes looking for these, there are many more.) @Pppery perhaps because the early pads (before the previous reboot of etherpad.wm.o) didn't have /p/ in the URL? Sources of etherpads that don't seem to have been backed up include: meetup pages, facebook pages, calendar invites, emails, google docs, and etherpads that reference other etherpads.

@EMill-WMF writes:

I deeply admire the internet's public interest archivists and they do amazing work. But I disagree that the chances of Bad Text are unlikely.

That's fair. But archivists are equipped to host public interest work that includes unwanted text, in non-public ways, while sifting them for good text.
If you are planning to bobby tables, please transmit a copy to such an archivist. Technical service maintenance and knowledge maintenance need not be at odds: both require graceful degradation of service and offramps for ethical preservation or self-hosting.

What happened there was that I was stupid and only included HTTPS URLs, not HTTP ones, in my search. I still have the code I used to generate that list, and will re-run it to include HTTP URLs too when I get the chance. Finding other etherpads linked in backed-up ones looks doable.

And while I will fix my boneheaded mistakes there and find some more things to back up I agree with SJ's point that what Tkarcher and I are doing should be seen as best-effort not a full solution.

Finding other etherpads linked in backed-up ones looks doable.

It is, but it's a recursive problem. Here're the pads I found linked in other pads I had already backed up:

P89803 - Pads linked from other pads

Once I have added these to the backup (already in progress, but it will take some time), I'll check again for new links in them. Let's see how deep the rabbit hole goes.
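
The "recursive problem" described here is essentially a breadth-first traversal of the link graph between pads. The sketch below is not the script actually used for the backup; it is a minimal illustration in which `links_of` is a hypothetical callable that returns the pad titles linked from a given pad (in practice it would fetch the pad and extract URLs).

```python
from collections import deque
from typing import Callable, Dict, Iterable


def discover(seeds: Iterable[str],
             links_of: Callable[[str], Iterable[str]]) -> Dict[str, int]:
    """Breadth-first discovery of pads reachable from `seeds`.

    Returns a mapping of pad title to the round in which it was found
    (0 for seeds, 1 for pads linked from seeds, and so on).
    """
    rounds = {title: 0 for title in seeds}
    queue = deque(rounds)
    while queue:
        title = queue.popleft()
        for linked in links_of(title):
            if linked not in rounds:  # skip pads already seen; this ends cycles
                rounds[linked] = rounds[title] + 1
                queue.append(linked)
    return rounds


if __name__ == "__main__":
    # Tiny made-up link graph: A links to B and C, B links back to A and to D.
    graph = {"A": ["B", "C"], "B": ["A", "D"], "C": [], "D": []}
    print(discover(["A"], lambda title: graph.get(title, [])))
    # prints {'A': 0, 'B': 1, 'C': 1, 'D': 2}
```

Because every title is recorded the first time it is seen, pads that link to each other do not cause an endless loop, and the traversal stops on its own once a round turns up nothing new.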

Quick update: ~1000 new pads added to the backup which were only linked from other pads and not included in previous lists. Currently in the second round (adding pads which were linked from pads which were linked from other pads). Identified ~700, but not sure yet how many of those actually exist, and which point to empty pads. Checking and backing them up right now.

Thanks for the progress and conversation, I was very anxious, and still am anxious, but I also see reasonable fixes to challenges

Created P89814 for HTTP etherpads, but that didn't find the one that SJ had pointed out. To be fair, that link is already broken, though.

Created P89815 for HTTP etherpads without "/p/" (these links have been broken for ages, but it does include the example SJ gave so I had to do it)

HTTPS without /p/ (not that many because of the anachronism)

https://etherpad.wikimedia.org/WLM-2013-documentation
https://etherpad.wikimedia.org/ro/r.Ia0q1hpsWWt5rxWq
https://etherpad.wikimedia.org/VertalingWikiIt
https://etherpad.wikimedia.org/BugTriage-Collection
https://etherpad.wikimedia.org/P/WBUG_2023_07_27
https://etherpad.wikimedia.org/P/WBUG_2023_09_28
https://etherpad.wikimedia.org/GLAM-Wiki-CH_OpenRefine_PAWS
https://etherpad.wikimedia.org/GLAMcampAmsterdamFri
https://etherpad.wikimedia.org/wlm2012-dresden
https://etherpad.wikimedia.org/P/WBUG_2023_09_28

Also, from a template that I overlooked dealing with:

https://etherpad.wikimedia.org/WIKISOO3
https://etherpad.wikimedia.org/WIKISOO4
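
The URL variants discussed here (HTTP vs. HTTPS, with `/p/`, with `/P/`, or with no prefix at all) can be covered by a single pattern. The following is a sketch, not the code used to generate the lists above; the function names are made up for illustration, and the trailing-punctuation handling is a guess at what free-text sources need.

```python
import re
from typing import Optional, Set

# Matches http:// and https:// links, with an optional "p/" or "P/" prefix.
PAD_URL = re.compile(
    r"https?://etherpad\.wikimedia\.org/(?:[pP]/)?(?P<title>[^\s?#]+)"
)


def pad_title(url: str) -> Optional[str]:
    """Return the pad title from a single URL, or None if it is not a pad URL."""
    match = PAD_URL.match(url.strip())
    return match.group("title") if match else None


def find_pad_titles(text: str) -> Set[str]:
    """Collect pad titles from free text, trimming trailing punctuation."""
    titles = set()
    for match in PAD_URL.finditer(text):
        title = match.group("title").rstrip(".,;)")
        if title:
            titles.add(title)
    return titles
```

For example, `pad_title("http://etherpad.wikimedia.org/WIKISOO3")` and `pad_title("https://etherpad.wikimedia.org/p/WIKISOO3")` both yield `"WIKISOO3"`, so old and new link styles collapse to the same title.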

Finally: After ~500 in round two and ~200 in round three, there're no more links from backed-up pads to other pads not saved before, so no other round is needed.

Now working on the additional pads identified by @Pppery above.

Ok, so I backed up all lists added by @Pppery yesterday, and just to be on the safe side, ran the "find links within pads" script again on the recently added pads. Another 125 popped up. 🙄 Working on those now.

If there are more fundamental concerns regarding having a world-writable platform available on WMF servers, I think we should address the root cause rather than focussing on stop gaps like regular deletion. Both with Etherpad and Hedgedoc, it wouldn't be difficult to require OAuth authentication with an SUL account, for instance.

Done: All pads mentioned / listed above (in total: 8337) are backed up on Toolforge: P89822

Feel free to browse through the list and let me know in case anything is missing. 😬

And just as a reminder: These are the pads which were created until ~2 weeks ago. We still don't have a solution for the pads which have been (or will be) created since then until end of April.

it wouldn't be difficult to require OAuth authentication with an SUL account, for instance.

Somebody believes OAuth is a bad idea, see T415237#11613890. But I think it is at least better than status quo.

But I think it is at least better than status quo.

How? The problem is unmoderated content, not anonymous users: With authentication, you'd still not prevent anyone from adding inappropriate content, and you'd still not be able to delete it from the history or block users from adding more. This would require moderation tools which do not exist in Etherpads.

And yes, I still think the discussion here should mainly focus on preventing data loss in the upcoming purge; the future / possible replacement of Etherpads should be discussed in a separate task or even outside Phabricator, as it will affect a lot of non-technical community members and is more a strategic question than a technical one.

Also, Etherpad is "off the shelf" third party open source software. It doesn't support OAuth, it is not meant for authenticated scenarios.

There are multiple Etherpad plugins around that support OAuth.

If keeping a read-only version at least for some time is not an option, can you *please* at least put a warning on the front page (https://etherpad.wikimedia.org/) and ideally also in the header of the editor page that everything will be deleted in 6 weeks? I see people still creating new pads and updating existing pads which were already backed-up (e.g. live vs. backup), making it close to impossible to keep track of what is saved already and what not.

Hi, Is there any plan to import back all backups to the etherpad after the cleanup?

Maybe it could be combined with the ephemerality: make a centralized list where Wikimedians put their etherpad links to be preserved. Everything else will be cleaned-up regularly. Pads listed in the backup-list will survive regular resets.

If it could be exported and it could be imported, then it could be preserved.

I would appreciate a statement whether there are plans to keep a backup in place for a limited period of time. WMF employees have expressed different views here, and it is unclear to me who's qualified to make the call.

Jelto removed Jelto as the assignee of this task. Mar 16 2026, 3:29 PM

I'm un-assigning the task from me until we have a decision about how to move forward in April 2026.

The deadline has been extended to April 30th.

I would really like to work on some long term helpful tooling for archiving etherpads (T417207). @Tkarcher has done some very helpful and timely work to create https://etherpad-backup.toolforge.org/ and use it to save 8338 pads (T415237#11617021, T415237#11617713). I would like to build on that work by adding a self-service archiving frontend service (paste in an etherpad URL to trigger an archival dump). I would also like to work on a storage backend that scales better than Toolforge NFS (leaning towards S3 compatible object storage).

This is the kind of work that Wikimedia-Hackathon-2026 is well suited for. Unfortunately the 2026 Hackathon starts on 2026-05-01 which is the day after the currently planned deletion of all legacy content. I am wondering if there is some flexibility in the implementation date that could give myself and others the chance to improve archiving tooling and ideally give folks a bit of time to try it out before the initial purge? A few weeks would make some difference; a month or two might make more, but I also very much recognize that at some point we need to move forward.

@Tkarcher and @Pppery and @bd thank you! for your work on this.

Another place to look is mailing list archives. Many pads were seeded from an email thread.

To reiterate @MGChecker's point, please store a dark copy of current revisions w/ archive.org, where someone could respond to a research request for access to an old URL. For instance: I've had serious projects that started with a private group chat or telegram thread that used an etherpad for notes. These get referenced every year or so when the project is revisited, but people might not connect the dots that a load-bearing reference in old notes is about to be wiped out.

If keeping a read-only version at least for some time is not an option, can you *please* at least put a warning on the front page (https://etherpad.wikimedia.org/) and ideally also in the header of the editor page that everything will be deleted in 6 weeks? I see people still creating new pads and updating existing pads which were already backed-up

I think a read-only version should stay up for a while longer, but this is a good idea either way.
I split this into T420793 for clarity

Seems like we're trying to reverse engineer a list of etherpads to back up at https://etherpad-backup.toolforge.org/, and keep slowly discovering more and more.

Perhaps it'd make sense to instead ask engineers to run a database query and dump the list of etherpads somewhere, then that list can be given to the maintainer of https://etherpad-backup.toolforge.org/, and a thorough, one time crawl of everything could be performed?
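
For illustration only, such a query might look like the sketch below. It runs against an in-memory SQLite stand-in with the same two columns as `store`, and it assumes (this is not confirmed anywhere in this task) that each pad has one record keyed `pad:<name>`, with revisions and chat messages stored under longer keys such as `pad:<name>:revs:<n>`. On the real table a scan like this could touch hundreds of millions of rows, so treat it as a sketch rather than something to run as-is.

```python
import sqlite3
from typing import List

# Assumed key layout (hypothetical): "pad:<name>" is the pad record itself,
# while "pad:<name>:revs:<n>" and similar longer keys hold per-pad details.
LIST_PADS_SQL = """
    SELECT "key" FROM store
    WHERE "key" LIKE 'pad:%' AND "key" NOT LIKE 'pad:%:%'
"""


def list_pad_names(conn: sqlite3.Connection) -> List[str]:
    """Return the sorted pad names found in the `store` table."""
    rows = conn.execute(LIST_PADS_SQL)
    # Note: a pad name that itself contains ":" would be missed by this filter.
    return sorted(row[0][len("pad:"):] for row in rows)


def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory stand-in for the real table, with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE store ("key" TEXT PRIMARY KEY, value TEXT NOT NULL)')
    conn.executemany(
        "INSERT INTO store VALUES (?, ?)",
        [
            ("pad:Foo", "{}"),
            ("pad:Foo:revs:0", "{}"),
            ("pad:Bar", "{}"),
            ("pad:Bar:chat:0", "{}"),
            ("globalAuthor:a.1", "{}"),
        ],
    )
    return conn
```

Calling `list_pad_names(demo_connection())` returns only the two pad names from the sample rows and ignores the revision, chat, and author keys.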

Or are there way more spammy etherpads than useful ones?

Too bad the etherpad ID strings aren't just sequential numbers.

list of public etherpads

All etherpads are public, even if many of them may be empty or spam.

@Novem_Linguae : It would probably be technically possible, but at the same time quite risky: I know users who consider Etherpads with a long enough random title as completely "private" (though they never really are), so your approach could accidentally unveil *a lot* of personal data, passwords and other unexpected content.

Also see concerns raised previously at T415237#11607940.

If people are storing important data, isn't it likely some of them were hoping that copy would stay around? This seems like an argument to preserve a copy whose access list matches the current one: a dev with admin access and someone who knows the old title (and is looking it up by that title).

It seems arbitrary to say "we're deleting everything because we said up front this was impermanent" while also saying ~ "we're blocking backups even though we said up front all pads are public".

I spent another hour+ looking through mailing lists and tweets and found another 50 or so pads of historical interest, just searching 2011-2014. I happened to be subscribed to the Wikimedia Kenya list when it existed, so I know that they kept many docs on etherpad... that's a particularly interesting historical example imo. I imagine dozens of other small community groups from around 2012-2016 also kept etherpad notes which might have been linked from email lists / social media / FB. Ideally someone would mine a mailing-list dump and major social-media archives.

I posted the list of links here, to keep this thread compact: https://meta.wikimedia.org/wiki/User:Sj/Etherpad_links

All pads listed on this page are now backed up.

Hi. We have the Wikimedia-Hackathon-2026 coming up soon. The organizing team would like to request that this task be delayed, preferably until a month after the Hackathon (May 15th at least, but I would aim for a month), since etherpad is a vital tool for the event and contains historical information that we need and would like to archive. This tool has been crucial for documenting sessions and presentations. It's not only about live note-taking but also about preserving important knowledge, recording agreements, and providing accessibility for people who are not able to attend the sessions or the event itself.

If delays are being requested, and for up to a month, I suggest delaying a bit longer. ESEAP Conference is being held between 15 and 17 May and WikiCon Christchurch on 16-18 May, and both are using Etherpad for notetaking. Please don't do the reset on the 15th, or during the events. Do allow the event organisers time to copy the information out as well. If there are delays, I would suggest moving it to the end of May instead.

Here is the issue IMHO. After ESEAP there will be another regional conference, and then Wikimania, and then WikiCon NA, and this will basically be delayed forever.

My suggestion is to move the current etherpad to etherpad-old right now, and once the due time has passed, we remove it. That will fix the "we need some time to prepare for the change".

This seems like a workable compromise to me.

  • Stand up new etherpad service with the proper plugins installed to enforce auto-deleting pads in the style of https://pad.riseup.net/
  • Point a new hostname at the legacy service
  • Point the etherpad.wikimedia.org hostname at the new service
  • Make the legacy service read-only if technically feasible
  • Announce a longer but still near-term enough deadline for the removal of the legacy deployment (6 months?)

To account for the planned use of Etherpad for Wikimedia Hackathon 2026 as well as other scheduled events, the Etherpad database truncation will be postponed until the end of May (exact date to follow). This should provide enough buffer ahead of Wikimania in July.

Pppery renamed this task from etherpad table size is 233GB / plan to delete all etherpads in April 2026 to etherpad table size is 233GB / plan to delete all etherpads in May 2026.Apr 17 2026, 2:40 PM

Change #1273889 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] etherpad: update warning message about truncation to end of May

https://gerrit.wikimedia.org/r/1273889

Change #1273889 merged by Dzahn:

[operations/puppet@production] etherpad: update warning message about truncation to end of May

https://gerrit.wikimedia.org/r/1273889

I edited the warning message to reflect the "end of May" part and also injected a warning message directly into the index HTML at:

https://etherpad.wikimedia.org/

resolving T420793.

@Tkarcher I was over at T417207 and scraped a few more links that may or may not already be included (I imagine you have a nice workflow for importing these now).

Phabricator etherpad links, which I believe should currently be complete.
Some of them look badly formatted and could be cleaned up: https://phabricator.wikimedia.org/P92134

For example, http://etherpad.wikimedia.org/WMF-TOC should actually be http://etherpad.wikimedia.org/p/WMF-TOC, etc.

And some more from other sources: https://phabricator.wikimedia.org/P92135
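The kind of cleanup needed for the badly formatted links mentioned above (a missing `/p/` segment) could be sketched roughly as below. This is only an illustration, not the importer actually used by the backup tool: the function name `normalize_pad_url` is made up, the output is assumed to always use `https`, and links with extra path segments are simply skipped rather than guessed at.

```python
from urllib.parse import urlsplit

PAD_HOST = "etherpad.wikimedia.org"


def normalize_pad_url(url):
    """Return the canonical https://<host>/p/<name> URL, or None if not a pad link."""
    parts = urlsplit(url.strip())
    if parts.hostname != PAD_HOST:
        return None
    path = parts.path.strip("/")
    # Drop an existing "p/" prefix so both link styles end up identical.
    if path.startswith("p/"):
        path = path[2:]
    # Skip the bare front page and anything with further path segments.
    if not path or "/" in path:
        return None
    return f"https://{PAD_HOST}/p/{path}"
```

For example, `normalize_pad_url("http://etherpad.wikimedia.org/WMF-TOC")` and `normalize_pad_url("http://etherpad.wikimedia.org/p/WMF-TOC")` both give the same `/p/WMF-TOC` URL, so a scraped list can be de-duplicated after normalization.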

I'm going to try and get these imported today using the new object store backend. I will also use that as practice for documenting how others can do bulk imports.

The pads that @Addshore found in Phabricator have been added to https://etherpad-backup.toolforge.org/.

@bd808 Could you please import all the pads referenced by mwstake.org? Special:LinkSearch can quickly list all of them.

The https://etherpad-backup.toolforge.org/ tool is ready for all y'all to use in archiving the HTML dump of individual pads now and in the future. If you have lists of pads that you feel will be too cumbersome to archive one at a time, please consider creating a Phabricator ticket tagged with Tool-etherpad-backup, providing your list as a linked Phabricator Paste or a downloadable file link.

I'm continuing to use this sporadically to back up pages I can manually find, which is surely a small portion of the addressable space. A sizeable share of historically significant pads are not yet backed up.
For instance, found today: /covid19 and /Covering_George_Floyd.

Requesting again for those with access: please store a proper backup with IA or another dark archive that has good privacy norms; don't just destroy these documents permanently.