Set up UrlShortener dumps
Closed, Resolved, Public

Description

URL shorteners can be a reliability problem because if/when they go down, it becomes impossible to figure out where they actually pointed. So, we should be proactive and offer dumps of our short URLs.

Providing a raw SQL dump isn't going to be very useful because the PHP code does encoding and normalization of URLs at runtime. Also, ArchiveTeam/301works.org use comma/pipe-separated text files as dumps.

https://gerrit.wikimedia.org/r/#/c/248644/1 contains a maintenance script to generate a text file in that format. We would need to set up a cronjob for it, compress the result, and publish it somewhere.

Details

Related Gerrit Patches:
operations/puppet : production - dump url shorteners for wiki projects
mediawiki/extensions/UrlShortener : master - Add maintenance script to generate a dump of short codes and targets

Event Timeline

Legoktm created this task. Oct 28 2015, 10:08 PM
Legoktm raised the priority of this task from to Normal.
Legoktm updated the task description.
Restricted Application added a subscriber: Aklapper. Oct 28 2015, 10:08 PM

Change 248644 had a related patch set uploaded (by Legoktm):
Add maintenance script to generate a dump of short codes and targets

https://gerrit.wikimedia.org/r/248644

Hydriz added a subscriber: Hydriz.

@Nemo_bis We might have to liaise with URLTeam on archiving this. Do you know if they have a specific format that they want (as Legoktm mentioned) and how they want it to be uploaded to the Internet Archive (e.g. identifiers, collections, etc)?

@Nemo_bis We might have to liaise with URLTeam

Don't use such words or they might have a heart attack! The files can just be uploaded to any collection and Jason or Jeff will move them later. If particularly lazy, one can also drop them in the rsync target.

Better only worry about the format... http://urlte.am/ + https://github.com/ArchiveTeam/terroroftinytown/blob/master/README.md#notes are not especially comforting. In particular we can't satisfy both "use latin1 encoding" and "don't use percent-encoding", clearly. Our percent-encoding is broken anyway (T106793), so better satisfy the former than the latter.
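
The conflict between the two requirements shows up with any target URL that contains characters outside Latin-1. A minimal Python sketch, using an example URL chosen only for illustration (this is not the extension's code, and the `safe` character set below is an assumption):

```python
from urllib.parse import quote

# Illustrative target URL with characters outside Latin-1.
url = "https://ja.wikipedia.org/wiki/日本"

def fits_latin1(text: str) -> bool:
    """Return True if text can be written to a latin1 file unchanged."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

# Percent-encode non-ASCII characters as UTF-8 bytes, leaving URL
# delimiters (and existing '%' escapes) untouched.
encoded = quote(url, safe=":/?#[]@!$&'()*+,;=%")

print(fits_latin1(url))      # the raw URL
print(fits_latin1(encoded))  # the percent-encoded URL
print(encoded)
```

The raw URL cannot be written as latin1, while the percent-encoded form is plain ASCII and therefore can; that is why only one of the two requirements can be met for such URLs.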

Change 248644 merged by jenkins-bot:
Add maintenance script to generate a dump of short codes and targets

https://gerrit.wikimedia.org/r/248644

How often do we want this to run? Is once a week often enough? And like the rest of our cron-generated dumps it will likely wind up in "Other files".

How often do we want this to run? Is once a week often enough?

Yeah, I think that should be fine.

One more question, are the contents of this table the same across all wikis (i.e. I only need to dump it from one of them)?

There is only going to be one database table that all wikis will read from, so the dump script only needs to be run from a single wiki.

Any wiki will do I guess. OK!

RATS, it's not deployed yet? Is there an ETA/guess as to which branch, etc.?

Not yet :( We still have apache/varnish stuff that I need to figure out, so next week I'll try and get the MW parts of it set up in a read-only mode to unblock this task at least.

@ArielGlenn the extension is now deployed in read-only mode. I manually created a short url in the database so we can test and set up the dumps before allowing users to create new short urls.

legoktm@terbium:~$ mwscript extensions/UrlShortener/maintenance/dumpURLs.php --wiki=metawiki /home/legoktm/dump.txt
Writing to /home/legoktm/dump.txt...
Writing 1 entries...
Done!
legoktm@terbium:~$ cat dump.txt 
2|https://www.wikimedia.org/
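
For anyone consuming the file, the output above suggests one `code|target` pair per line. A minimal Python sketch of a reader, assuming the short code itself never contains a pipe (the function name `parse_dump_lines` is made up for this example):

```python
from typing import Iterable, Iterator, Tuple

def parse_dump_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (short_code, target_url) pairs from 'code|url' lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first pipe only: a target URL may itself contain '|'.
        code, sep, target = line.partition("|")
        if not sep:
            raise ValueError(f"malformed dump line: {line!r}")
        yield code, target

sample = ["2|https://www.wikimedia.org/\n"]
entries = dict(parse_dump_lines(sample))
print(entries)
```

Splitting only on the first separator keeps targets intact even when their query strings contain a pipe.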

https://gerrit.wikimedia.org/r/#/c/278400/ Here's the changeset that can go in as soon as there is real data to dump.

Change 278400 had a related patch set uploaded (by Legoktm):
dump url shorteners for wiki projects

https://gerrit.wikimedia.org/r/278400

Dzahn added a subscriber: Dzahn. Apr 12 2016, 6:16 PM

The Gerrit change says it's blocked by "as soon as the extension is deployed with the ability to actually add url shorteners"?

As soon as the extension is deployed allowing users to add data to the tables. Otherwise we just dump empty tables for a while, which is a waste.

Dzahn added a comment. Apr 12 2016, 9:26 PM

I saw we already have an entry in the database that has been added manually. How about just merging so we can see it works? The (almost) empty dumps don't seem to hurt.

I've already tested manually; this is how I know it works.

Dzahn added a comment. Apr 12 2016, 9:46 PM

Ok cool, so no blocker to resolve :)

When will this extension be live with users able to add shorteners? Any ETA?

Hey folks, do we have live data yet? Love to enable this whenever we do.

Just checking in. Any movement on enabling the extension on the wikis for url creation?

Izno added a subscriber: Izno. Sep 20 2016, 5:51 PM

Just checking in. Any movement on enabling the extension on the wikis for url creation?

Aren't you asking on the wrong task?

Dzahn added a comment. Sep 20 2016, 7:22 PM

Is T108557 the right one then?

Izno added a comment. Sep 20 2016, 8:24 PM

Is T108557 the right one then?

Prettyyyyyyyyyy sure.

Change 278400 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] dump url shorteners for wiki projects

https://gerrit.wikimedia.org/r/278400

TheDJ awarded a token. Apr 9 2019, 11:59 AM

It's unblocked now!

My changeset is ready to go, but there's one problem. There are already 5 thousand URLs in there, or over 1 MB of data. Obviously these files are going to grow pretty big. So we need to be able to compress them; would you be willing to modify the maintenance script so that it will write to a compressed output file if that is specified?

Ah, it appears I can just turn the file into the right uri and have it work. Back shortly.
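
The comment does not say which URI form was used, so the following is only an illustration of the general idea: write through a compressing stream rather than compressing a plain file afterwards. It is a Python sketch, not the maintenance script; the function name, file name and UTF-8 encoding are assumptions.

```python
import gzip
import os
import tempfile

def write_compressed_dump(entries, path):
    """Write 'code|url' lines to path, gzip-compressing as they are written."""
    with gzip.open(path, "wt", encoding="utf-8", newline="\n") as out:
        for code, target in entries:
            out.write(f"{code}|{target}\n")

entries = [("2", "https://www.wikimedia.org/")]
path = os.path.join(tempfile.mkdtemp(), "shorturls.gz")  # hypothetical file name
write_compressed_dump(entries, path)

# Read the file back to confirm the contents survive the round trip.
with gzip.open(path, "rt", encoding="utf-8") as f:
    round_trip = f.read()
print(round_trip, end="")
```

Because the writer only sees a file-like object, the code that produces the lines needs no changes to emit compressed output.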

The above change is ready to go whenever you folks like. A sample available for download right now is at https://dumps.wikimedia.org/other/shorturls/shorturls-20190510.gz for your perusal.
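
To consume such a file, something along these lines might work. It is a hedged Python sketch: it assumes later dumps keep the `shorturls-YYYYMMDD.gz` naming of the sample above and that the content is UTF-8 `code|target` lines, neither of which is confirmed in this task. `dump_url` and `load_dump` are names invented for the example, and a locally written stand-in file replaces the actual download so the sketch runs offline.

```python
import gzip
import os
import tempfile
from datetime import date

# Assumption: later dumps keep the naming of the 2019-05-10 sample.
BASE_URL = "https://dumps.wikimedia.org/other/shorturls/"

def dump_url(day: date) -> str:
    """Build the download URL for the dump generated on the given day."""
    return f"{BASE_URL}shorturls-{day:%Y%m%d}.gz"

def load_dump(path: str) -> dict:
    """Read a locally saved gzip dump into {short_code: target_url}."""
    mapping = {}
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            code, sep, target = line.rstrip("\n").partition("|")
            if sep:  # skip blank or malformed lines
                mapping[code] = target
    return mapping

# Stand-in for a downloaded file, so the sketch runs without network access.
local_copy = os.path.join(tempfile.mkdtemp(), "shorturls-20190510.gz")
with gzip.open(local_copy, "wt", encoding="utf-8") as f:
    f.write("2|https://www.wikimedia.org/\n")

print(dump_url(date(2019, 5, 10)))
print(load_dump(local_copy))
```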

Change 278400 merged by ArielGlenn:
[operations/puppet@production] dump url shorteners for wiki projects

https://gerrit.wikimedia.org/r/278400

ArielGlenn closed this task as Resolved. May 13 2019, 8:35 AM
ArielGlenn claimed this task.
ArielGlenn added a subscriber: Reedy.

OK I didn't bother to wait :-D

This is live: a dump was generated today and is available for download, and I've announced it to the xmldatadumps-l list. @Reedy feel free to announce in other venues as you see fit. Closing at last!

Dzahn awarded a token. May 14 2019, 1:07 AM
ArielGlenn moved this task from Incoming to Done on the Datasets-Archiving board. May 15 2019, 4:31 AM