Archive etherpads mentioned in hackathon/techconf tasks
Closed, Resolved | Public

Description

At various past hackathons and technical conferences we used a note-taking convention where an etherpad would be assigned to each session/presentation and used for live note taking. When everyone was thinking clearly about such things, there would be instructions to archive the etherpad content after the session. This advice, however, may not always have been present or, if present, may not always have been followed.

The project to purge the etherpad database (T415237) makes the always-present possibility of losing access to a pad a more pressing issue. It should be possible to make some tooling and processes to help folks review tasks from past events for etherpad links, check whether those pads are archived, and archive them if that has not already been done. Starting with the worst thing that could possibly work, namely a list of manual steps, is likely the best approach here. Let's make it possible first and then look for ways to reduce toil rather than trying to build a magical automated pipeline from the start.

Details

Related Changes in GitLab:
Title: cli(backup): Sleep between archive attempts
Reference: toolforge-repos/etherpad-backup!8
Author: bd808
Source Branch: work/bd808/slow-the-roll
Dest Branch: main

Event Timeline

There might be something worth stealing in https://github.com/okfn/etherpad-archiver/, but I think the API is disabled/hidden on our instance so the /export/html & /export/txt path parameters may really be all we can use.
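As a minimal sketch of what relying on the export paths looks like, the helper below builds the export URL for a pad name. The function name `export_url` and the choice to percent-encode the pad name are assumptions for illustration, not part of any existing tool.

```python
from urllib.parse import quote

BASE_URL = "https://etherpad.wikimedia.org"


def export_url(pad_name: str, fmt: str = "html") -> str:
    """Return the export URL for a pad in the given format ("html" or "txt")."""
    if fmt not in ("html", "txt"):
        raise ValueError(f"unsupported export format: {fmt}")
    # Percent-encode the pad name so unusual characters survive in the path
    # (assumption: the server accepts an encoded name).
    return f"{BASE_URL}/p/{quote(pad_name, safe='')}/export/{fmt}"
```

For example, `export_url("Tech_on_Tour_Berlin", "txt")` yields a URL ending in `/p/Tech_on_Tour_Berlin/export/txt`.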

A potentially interesting idea would be to build a Toolforge tool that could be used to request archiving of a pad and also view archived pads. S3 storage would be an interesting way to manage the saved content.

I pinged @Tkarcher's Wikitech talk page to see if there is reasonable overlap with his plans for https://toolsadmin.wikimedia.org/tools/id/etherpad-backup.

Maybe it is for another task, but we can get a list of all Etherpad pads publicly mentioned on the wikis by using LinkSearch. Example:

The URLs for pads from the early days did not have a /p prefix, so they now lead to a nonexistent page, but if one inserts the /p then the pad is surely present. Then I guess it is a matter of asking for a text/html export of each of those publicly known pads?

That's exactly what I did yesterday for the 927 pads linked from the German Wikipedia: All of them are now archived as text files in /mnt/nfs/labstore-secondary-tools-project/etherpad-backup/public_html/p/ , publicly available via https://etherpad-backup.toolforge.org/p/<title> (e.g. https://etherpad-backup.toolforge.org/p/Tech_on_Tour_Berlin ). But that was a one-time effort, and not a long-term solution for archiving future pads across all wikis.

As mentioned on my talk page already: So far, I didn't really have a "plan", but just wanted to quickly backup all German pads before the deadline - which is done now.

I still don't have a plan, but can offer at least some first random thoughts:

  • whatever solution we develop should ideally work automatically (or as easily as possible) for all pads created on wikimedia.org, regardless of whether and where they're linked on-wiki. One option would be a custom etherpad plugin initiating the backup in the background for every newly created pad.
  • most pads are created as stubs before the event and then filled during or shortly after the event. It would be nice if our solution could not only do a one-time backup, but revisit previously saved pads on a regular basis to check for changes (and stop revisiting if there was no change in the last x months).
  • if possible, a successful backup should be documented/visible in the etherpad itself - either with a text at the bottom ("A backup of this pad was created on March 3rd, 2026 and is available at https://..."), or via the custom backup plugin mentioned above, which could add a backup status message somewhere in the menu, so an end-user could immediately find the backup and also easily notice if the backup didn't work for whatever reason.
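The "revisit and stop after x months" idea from the second bullet could be sketched as below. This is only an illustration under stated assumptions: the per-pad `state` dictionary, the function names, and the use of a SHA-256 digest to detect changes are all hypothetical choices, not an existing design.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def digest(text: str) -> str:
    """Fingerprint exported pad text so two visits can be compared cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def record_visit(state: dict, text: str, now: datetime) -> bool:
    """Update the per-pad state; return True if a new copy should be saved."""
    new_digest = digest(text)
    if state.get("digest") == new_digest:
        return False  # nothing changed since the last saved copy
    state["digest"] = new_digest
    state["last_changed"] = now
    return True


def should_revisit(last_changed: datetime, now: datetime, max_idle: timedelta) -> bool:
    """Keep revisiting until the pad has been unchanged for longer than max_idle."""
    return now - last_changed <= max_idle


# Hypothetical usage: a pad saved on one day and found unchanged a month later.
_state: dict = {}
_t0 = datetime(2026, 3, 3, tzinfo=timezone.utc)
record_visit(_state, "stub text", _t0)
record_visit(_state, "stub text", _t0 + timedelta(days=30))
```

The value of `max_idle` stands in for the "x months" above and would need to be chosen by whoever runs the job.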

More random thoughts I had today:

We could fork https://github.com/ether/ep_mediawiki and use that as the basis for our archiving tool:

  • Use the extension to automatically create an archive page for all newly created Etherpads in [[meta:Etherpad/Archive/<title>]] (will only contain the template text on the first day)
  • Make sure that the extension automatically adds a [[Category:Active Etherpad]] at the end of the export file
  • Use a bot to crawl all pages in Category:Active Etherpad on a daily basis and update them if necessary (a new version will only be saved if there are changes).
  • If a pad wasn't updated for more than x months, the bot will automatically change the category to [[Category:Inactive Etherpad]]
  • These inactive pads will not be checked / updated any longer, and can then be deleted from the database in a separate cleanup / housekeeping process.
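The category switch in the bot steps above might look roughly like this on the wikitext side. It is a sketch under assumptions: `recategorize` is a hypothetical helper, the category strings are taken from the bullets, and how the bot learns `last_changed` (for example from page history) is left open.

```python
from datetime import datetime, timedelta, timezone

ACTIVE = "[[Category:Active Etherpad]]"
INACTIVE = "[[Category:Inactive Etherpad]]"


def recategorize(
    wikitext: str, last_changed: datetime, now: datetime, max_idle: timedelta
) -> str:
    """Swap the active category for the inactive one once a pad has gone quiet."""
    if ACTIVE in wikitext and now - last_changed > max_idle:
        return wikitext.replace(ACTIVE, INACTIVE)
    # Still active, or already marked inactive: leave the page untouched.
    return wikitext
```

Returning the text unchanged for active pages means the bot only needs to save an edit when the returned string differs from the input.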

FYI: I didn't comment on the VPS project request, but I'm aware of it and already successfully logged on to the Horizon dashboard to check my access to etherpads3 - which works as expected. I've no expertise whatsoever with object storage, but I'm willing to learn, so let me know if there's anything I can do to support you.

My plan is to create a service user for authentication and a bucket that we can stick data in. Once I have those things we can work on proof of concept use of the storage. I have a hope that we can make an initial frontend that works with both object storage and the NFS data you have already collected. With something like that we can keep work on changing the storage backend from blocking any work on archiving more content using the existing workflow you have established.
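One way to picture a frontend that reads from both backends is a small lookup that tries each store in turn. Everything here is an assumption for illustration: the class names are hypothetical, the NFS files are assumed to be named after the pad title, and the object storage side is represented by an in-memory stand-in rather than a real S3 client.

```python
from pathlib import Path
from typing import Optional, Protocol, Sequence


class ArchiveStore(Protocol):
    def read(self, pad_name: str) -> Optional[bytes]: ...


class DirectoryStore:
    """Reads archived pads from a directory, such as the existing NFS path."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def read(self, pad_name: str) -> Optional[bytes]:
        path = self.root / pad_name
        # Refuse names that would resolve outside the archive directory.
        if path.resolve().parent != self.root.resolve():
            return None
        return path.read_bytes() if path.is_file() else None


class InMemoryObjectStore:
    """Stand-in for a bucket; a real version would call an S3-compatible client."""

    def __init__(self, objects: dict[str, bytes]) -> None:
        self.objects = objects

    def read(self, pad_name: str) -> Optional[bytes]:
        return self.objects.get(pad_name)


def find_archive(pad_name: str, stores: Sequence[ArchiveStore]) -> Optional[bytes]:
    """Return the first archived copy found, checking the stores in order."""
    for store in stores:
        data = store.read(pad_name)
        if data is not None:
            return data
    return None
```

With the object store listed first and the directory second, newly archived pads would win over older NFS copies while the NFS data stays reachable.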

Sounds good. Just as a heads up: If you plan to copy / mirror all NFS content to S3, you'll quickly hit the storage limit of 4096 files mentioned in https://wikitech.wikimedia.org/wiki/Help:Object_storage_user_guide#Quotas_and_other_limitations and might need to open a quota request ticket.

Oh, and in case you need it for whatever reason, I just officially made you co-maintainer of etherpad-backup. But there's really nothing in there other than the static HTML files.

CONTRIBUTING.md

Please don't delete anything. :-)

After T423354: Increase object storage object count for etherpads3 project the quota we have now is 65K files. Hopefully that will last for quite a while; we can always ask for more if it looks like we are in any danger of running out of that inode equivalent pool.

Phabricator etherpad links that I believe should be currently complete.
Looks like some of them have bad formatting and could be cleaned up: https://phabricator.wikimedia.org/P92134

For example http://etherpad.wikimedia.org/WMF-TOC which would actually be http://etherpad.wikimedia.org/p/WMF-TOC etc.
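A small normalizer for this kind of cleanup could look like the sketch below. The function name and the decision to reject other hosts are assumptions; it only covers the missing /p prefix and the http: scheme, not other formatting problems in the paste.

```python
from urllib.parse import urlsplit

PAD_HOST = "etherpad.wikimedia.org"


def normalize_pad_url(url: str) -> str:
    """Rewrite an old-style or http: pad link to the current https /p/ form."""
    parts = urlsplit(url.strip())
    if parts.netloc != PAD_HOST:
        raise ValueError(f"not an {PAD_HOST} URL: {url}")
    name = parts.path.strip("/")
    # Early links had no p/ segment; drop it when present so both forms
    # reduce to the bare pad name.
    if name.startswith("p/"):
        name = name[len("p/"):]
    if not name:
        raise ValueError(f"no pad name in URL: {url}")
    return f"https://{PAD_HOST}/p/{name}"
```

Both `http://etherpad.wikimedia.org/WMF-TOC` and `https://etherpad.wikimedia.org/p/WMF-TOC` then map to the same canonical link.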

I downloaded the 2 pastes that @Addshore made to my laptop. I then performed quite a bit of data review and manual cleaning. The actual process was much messier than this, but the gist was:

  • sort unique the lines
  • delete all the obvious garbage lines that clump in the sorted results
  • normalize http: to https:
  • sort unique the lines
  • delete https://etherpad.wikimedia.org/ prefix from all lines
  • sort unique the lines
  • delete p/ prefix from lines where it exists (there was a point in time when p/ was not used in URLs)
  • sort unique the lines
  • delete all the obvious garbage lines that clump in the sorted results
  • search for lines matching [^a-zA-Z0-9_]$ (lines ending with non-alphanumeric or _) and manually clean
  • sort unique the lines

I think that was most of it... I unfortunately did not log cleanings as I went.

The result is a list of 2385 unique (potential) pad names. These are currently available as batch1.txt in the tool's $HOME.
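The mechanical parts of the cleaning steps listed above could be approximated in a few lines. This is a reconstruction under assumptions, not the process actually used: `clean_pad_names` is a hypothetical name, "garbage" is simplified to "does not start with the etherpad prefix", and names matching the `[^a-zA-Z0-9_]$` pattern are set aside for manual review instead of being fixed.

```python
import re
from typing import Iterable

PREFIX = "https://etherpad.wikimedia.org/"
SUSPECT = re.compile(r"[^a-zA-Z0-9_]$")  # same pattern as in the steps above


def clean_pad_names(lines: Iterable[str]) -> tuple[list[str], list[str]]:
    """Return (clean names, names needing manual review), each sorted and unique."""
    names: set[str] = set()
    suspects: set[str] = set()
    for line in lines:
        line = line.strip()
        # Normalize http: to https:
        if line.startswith("http:"):
            line = "https:" + line[len("http:"):]
        # Simplifying assumption: anything without the prefix is garbage.
        if not line.startswith(PREFIX):
            continue
        name = line[len(PREFIX):]
        # Drop the p/ prefix where it exists.
        if name.startswith("p/"):
            name = name[len("p/"):]
        if not name:
            continue
        (suspects if SUSPECT.search(name) else names).add(name)
    return sorted(names), sorted(suspects)
```

Using sets makes the repeated "sort unique" passes unnecessary; a single sort at the end gives the same deduplicated result.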

My CLI archiving tool is too fast at the moment. I keep getting my egress IPs shadow banned at the Wikimedia CDN edge.

tools.etherpad-backup@tools-bastion-14:~$ webservice buildservice shell --mount all
I have no name!@shell-1777812357:/workspace$ cd $TOOL_DATA_DIR
I have no name!@shell-1777812357:~$ wc -l batch1.txt
2385 batch1.txt
I have no name!@shell-1777812357:~$ etherpad-backup -v backup --user BryanDavis --decode batch1.txt 2>&1 | tee batch1-202605031310.log
2026-05-03T13:14:32Z etherpad_backup.utils INFO: Backing up https://etherpad.wikimedia.org/p/0WdKUDRA58/export/html
2026-05-03T13:14:33Z etherpad_backup.utils INFO: Backing up https://etherpad.wikimedia.org/p/1.25wmf15-perf-regression/export/html
2026-05-03T13:14:34Z etherpad_backup.utils INFO: Backing up https://etherpad.wikimedia.org/p/1.27.0-wmf.13-issues/export/html
2026-05-03T13:14:35Z etherpad_backup.utils INFO: Backing up https://etherpad.wikimedia.org/p/1.36-release/export/html
2026-05-03T13:14:35Z etherpad_backup.utils INFO: Backing up https://etherpad.wikimedia.org/p/119-beta/export/html
2026-05-03T13:14:36Z etherpad_backup.utils INFO: Backing up https://etherpad.wikimedia.org/p/119917/export/html
2026-05-03T13:14:37Z etherpad_backup.utils INFO: Backing up https://etherpad.wikimedia.org/p/119deployment/export/html
2026-05-03T13:14:38Z etherpad_backup.utils INFO: Backing up https://etherpad.wikimedia.org/p/119triage/export/html
2026-05-03T13:14:39Z etherpad_backup.utils INFO: Backing up https://etherpad.wikimedia.org/p/120wmf2deployment/export/html
2026-05-03T13:14:39Z etherpad_backup.utils INFO: Backing up https://etherpad.wikimedia.org/p/123/export/html
2026-05-03T13:14:40Z etherpad_backup.utils INFO: Backing up https://etherpad.wikimedia.org/p/19-jun-2014-parsercache-outage/export/html
2026-05-03T13:15:45Z etherpad_backup.cli ERROR: Archiving https://etherpad.wikimedia.org/p/19-jun-2014-parsercache-outage/export/html failed
Traceback (most recent call last):
  File "/workspace/src/etherpad_backup/utils.py", line 70, in backup_pad
    r.raise_for_status()
    ~~~~~~~~~~~~~~~~~~^^
  File "/layers/heroku_python/venv/lib/python3.14/site-packages/requests/models.py", line 1028, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 504 Server Error: Gateway Timeout for url: https://etherpad.wikimedia.org/p/19-jun-2014-parsercache-outage/export/html

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspace/src/etherpad_backup/cli.py", line 100, in backup
    utils.backup_pad(pad, user)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^
  File "/workspace/src/etherpad_backup/utils.py", line 78, in backup_pad
    raise BackupError(f"Backup of {pad} failed", url=backup_url) from e
etherpad_backup.utils.BackupError: ('Backup of 19-jun-2014-parsercache-outage failed', 'https://etherpad.wikimedia.org/p/19-jun-2014-parsercache-outage/export/html')
2026-05-03T13:15:45Z etherpad_backup.utils INFO: Backing up https://etherpad.wikimedia.org/p/1K-YSmk7PCIDjWlEQfMM/export/html
^C
I have no name!@shell-1777812357:~$

Mentioned in SAL (#wikimedia-cloud) [2026-05-03T13:59:55Z] <wmbot~bd808@tools-bastion-14> Built image from 746ac0bd and deployed (T417207)

Let's see how many of the new batch are actually new:

I have no name!@shell-1778013300:~$ etherpad-backup ls | cut -f 3 > saved.txt
I have no name!@shell-1778013300:~$ wc -l saved.txt
8397 saved.txt
I have no name!@shell-1778013300:~$ comm -13 saved.txt batch1.txt | wc -l
1204
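For readers less familiar with `comm -13`, it prints only the lines unique to the second file. A rough Python equivalent, with a hypothetical function name, is sketched here; unlike `comm`, it does not require sorted input.

```python
from typing import Iterable


def only_in_second(first: Iterable[str], second: Iterable[str]) -> list[str]:
    """Return the lines of `second` that do not appear in `first`, in order."""
    seen = set(first)
    return [line for line in second if line not in seen]
```

Applied to the saved names and the batch list, the length of the result corresponds to the count reported by `wc -l` above.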

Cutting down the batch size will help. I'm going to try a 30 second delay between archive attempts:

I have no name!@shell-1778013300:~$ comm -13 saved.txt batch1.txt > batch-$(date +%Y%m%d%H%M).txt
I have no name!@shell-1778013300:~$ ls batch-20260505*
batch-202605052117.txt
I have no name!@shell-1778013300:~$ etherpad-backup -v backup --user BryanDavis --decode --delay 30 batch-202605052117.txt 2>&1 | tee batch-202605052117.log
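The effect of a per-pad delay can be illustrated with a small loop. This is not the code behind the `--delay` option; `backup_all`, its parameters, and the choice to collect failures and keep going are assumptions made for the sketch.

```python
import time
from typing import Callable, Iterable


def backup_all(
    pads: Iterable[str],
    backup_one: Callable[[str], None],
    delay_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Back up each pad in turn, pausing between attempts; return the failures."""
    failed: list[str] = []
    for index, pad in enumerate(pads):
        if index > 0:
            # Pause before every attempt except the first to stay under
            # whatever rate the CDN edge tolerates.
            sleep(delay_seconds)
        try:
            backup_one(pad)
        except Exception:
            # Remember the failure (for example a 504) and continue.
            failed.append(pad)
    return failed
```

Passing `sleep` as a parameter keeps the loop testable without real waiting, and the returned list could seed a later retry batch.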

The run from T417207#11892073 stopped with a mwclient.errors.APIError, raised because a pad title tripped an abuse filter on Wikitech while being logged.

tools.etherpad-backup@tools-bastion-14:~$ webservice buildservice shell --mount all
I have no name!@shell-1778109079:/workspace$ cd $TOOL_DATA_DIR
I have no name!@shell-1778109079:~$ etherpad-backup -v backup --user BryanDavis --decode --delay 30  resume-202605062310.txt 2>&1 | tee resume-202605062310.log
bd808 claimed this task.

Thanks for the help on this @Addshore and @Tkarcher!