
scrape RT ticket HTML files
Open, Low, Public

Description

This task is to scrape the HTML from our old ticket system https://rt.wikimedia.org and can be broken into these steps:

  • Find or make a tool/plugin/script to save the raw HTML of all the tickets, in a way that renders properly offline.
  • The paths to images, CSS, etc. need to keep working locally.

You get this desired behaviour if, for example, you use Firefox and manually save the page for offline use, but you will not get it with a simple wget or curl.

Also keep in mind that you need to be logged in as a user with permissions for all tickets.
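
A minimal sketch of what such a scraper could look like, assuming a valid RT session cookie is available; the cookie name, the asset handling, and the file layout below are assumptions for illustration, not verified details of rt.wikimedia.org:

```
# Hypothetical sketch: fetch one RT ticket with an authenticated session and
# save it with its images/CSS/scripts rewritten to local copies, approximating
# Firefox's "save page for offline use". The cookie name is an assumption.
import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

BASE = "https://rt.wikimedia.org"
SESSION_COOKIE = {"RT_SID": "..."}  # assumed cookie name; take it from a logged-in browser

def save_ticket(ticket_id: int, out_dir: str) -> None:
    session = requests.Session()
    session.cookies.update(SESSION_COOKIE)
    page = session.get(f"{BASE}/Ticket/Display.html", params={"id": ticket_id})
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")

    assets_dir = os.path.join(out_dir, "assets")
    os.makedirs(assets_dir, exist_ok=True)

    # Download every stylesheet, script and image, then point the tag at the
    # local copy so the saved page renders offline.
    for tag, attr in (("link", "href"), ("script", "src"), ("img", "src")):
        for node in soup.find_all(tag):
            url = node.get(attr)
            if not url:
                continue
            absolute = urljoin(BASE + "/", url)
            local_name = os.path.basename(urlparse(absolute).path) or "index"
            resp = session.get(absolute)
            if resp.ok:
                with open(os.path.join(assets_dir, local_name), "wb") as fh:
                    fh.write(resp.content)
                node[attr] = f"assets/{local_name}"

    with open(os.path.join(out_dir, f"ticket-{ticket_id}.html"), "w") as fh:
        fh.write(str(soup))
```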

Extra requirement:

  • Look at the "Queue:" field in tickets and save the HTML in a separate directory for each queue.

So, for example, https://rt.wikimedia.org/Ticket/Display.html?id=4802 should be in a directory called "ops-requests", while https://rt.wikimedia.org/Ticket/Display.html?id=2 should be in a directory called "pmtpa", and so on.
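
As a rough illustration of the per-queue layout, assuming the queue name can be located by its "Queue:" label in the ticket markup (the lookup logic is an assumption about RT's HTML, not a verified selector):

```
# Hypothetical sketch: read the "Queue:" field from a saved ticket page and
# file the HTML under a per-queue directory, e.g. ops-requests/ or pmtpa/.
import os
import shutil

from bs4 import BeautifulSoup

def queue_of(html_path: str) -> str:
    with open(html_path) as fh:
        soup = BeautifulSoup(fh.read(), "html.parser")
    # Assumption: RT renders the metadata as label/value pairs, so the queue
    # name is the next non-empty text node after the "Queue:" label.
    label = soup.find(string=lambda s: s and s.strip() == "Queue:")
    if label is None:
        return "unknown-queue"
    value = label.find_next(string=lambda s: s and s.strip())
    return value.strip().lower().replace(" ", "-") if value else "unknown-queue"

def file_by_queue(html_path: str, root: str) -> None:
    queue_dir = os.path.join(root, queue_of(html_path))
    os.makedirs(queue_dir, exist_ok=True)
    shutil.move(html_path, os.path.join(queue_dir, os.path.basename(html_path)))
```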

Some tickets can be public, but definitely not all of them, and specifically not those in the "procurement" queue. Some have been imported into Phabricator and then made public later in Phabricator, some have been imported but not made public, and some have not been imported.

Keep the result in a private location but with access for SRE, for now.

Once we have those files, the ticket is resolved. Later tickets will be about where we put them and how we shut down the actual RT app.

This is limited to producing these "static dumps".

Event Timeline

Is there any reason for choosing scraping over transferring the queues that haven't been moved over into Phabricator, so that we have a single location?

and some have not been imported.

Is that and T38#4840473 the reason this ticket was created, or are there additional reasons?

According to https://www.mediawiki.org/wiki/Phabricator/versus_RT , all queues except for access-requests@ and procurement@ were migrated to Phabricator in late 2014, so I'd expect those RT tickets to exist in Phabricator anyway. For procurement@, it's not clear to me whether T93760 is about migrating tickets or just moving the (future) workflow to Phab tooling. I don't know about access-requests@.

Tickets in the domains queue also weren't transferred.

Is there any reason for choosing scraping over transferring the queues that haven't been moved over into Phabricator, so that we have a single location?

Basically, this option means we can finally shut down a Perl application and figure the rest out later, while the other option means opening a Pandora's box of uncertainties, discussions, and unknown tools, and the people who did the past imports aren't here anymore. It's just being pragmatic, to avoid the "perfect is the enemy of good" trap and the lack of resources that led to the current situation.

LSobanski updated the task description.

An alternative proposal here is to create a static dump of the database only. As far as I'm aware, there is no requirement for regular access to the information in RT, and the purpose of this task is to have a data copy that can be accessed in an unlikely emergency. @wiki_willy do you have any thoughts on this?
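
For illustration, a minimal sketch of such a one-off dump, assuming RT's default MySQL backend and a database named "rt4" (both are assumptions for this host, not verified):

```
# Hypothetical sketch: take a one-off static dump of the RT database instead
# of (or alongside) the HTML scrape. Assumes mysqldump is installed and can
# authenticate; for a large database you would stream rather than buffer.
import gzip
import subprocess

def dump_rt_db(out_path: str = "rt-dump.sql.gz") -> None:
    # --single-transaction gives a consistent snapshot without locking tables.
    dump = subprocess.run(
        ["mysqldump", "--single-transaction", "rt4"],
        check=True,
        capture_output=True,
    )
    with gzip.open(out_path, "wb") as fh:
        fh.write(dump.stdout)
```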

Thanks for checking, @LSobanski. It's definitely rare that we need to refer back to RT. In the last 5 years, the 2-3 cases where we've had to reference RT were typically due to tracking down information about core routers that we had purchased back then. In Netbox, we only have 24 active devices left that still reference RT tasks. As long as we're able to access these in some way (ideally quickly and easily) on the rare occasions that it's needed, you should be able to move forward.

@wiki_willy thanks for the response, I have two more questions:

  • What are the EOL dates for the remaining devices (or if you could share a way to locate them I can check myself)?
  • Could you give me an example of information you would need from an RT ticket?

Sure, no prob @LSobanski. Here's the list of the 24 active devices that still reference RT tasks in Netbox, along with their purchase dates (network equipment usually EOLs every 8 years):

| Name | Purchase date | Procurement ticket | Device Type |
| --- | --- | --- | --- |
| msw-b7-eqiad-temp | 2011-02-18 | RT #0534 | Management Switch |
| msw-e1-eqiad | 2011-02-18 | RT #0534 | Management Switch |
| msw-c6-eqiad | 2012-04-04 | RT #2763 | Management Switch |
| cr2-codfw | 2012-07-19 | RT #3069 | Core Router |
| cr1-codfw | 2012-07-19 | RT #3069 | Core Router |
| kvm-ulsfo | 2012-08-29 | RT #3463 | KVM/LCD Console |
| msw-c5-eqiad | 2011-02-18 | RT #534 | Management Switch |
| cr1-eqiad | 2011-02-17 | RT #552 | Core Router |
| cr2-eqiad | 2011-02-17 | RT #552 | Core Router |
| msw-f1-eqiad | 2013-10-28 | RT #5892 | Management Switch |
| qfx5100-spare1-codfw | 2014-07-09 | RT #7077 | Core/Access Switch |
| asw-c1-codfw | 2014-07-09 | RT #7077 | Core/Access Switch |
| asw-c3-codfw | 2014-07-09 | RT #7077 | Core/Access Switch |
| asw-c5-codfw | 2014-07-09 | RT #7077 | Core/Access Switch |
| asw-c6-codfw | 2014-07-09 | RT #7077 | Core/Access Switch |
| asw-d1-codfw | 2014-07-09 | RT #7077 | Core/Access Switch |
| asw-d3-codfw | 2014-07-09 | RT #7077 | Core/Access Switch |
| asw-d5-codfw | 2014-07-09 | RT #7077 | Core/Access Switch |
| asw-d6-codfw | 2014-07-09 | RT #7077 | Core/Access Switch |
| asw-d8-codfw | 2014-07-09 | RT #7077 | Core/Access Switch |
| asw-d2-codfw | 2014-07-09 | RT #7077 | Core/Access Switch |
| asw-d7-codfw | 2014-07-09 | RT #7077 | Core/Access Switch |
| asw-c7-codfw | 2014-07-09 | RT #7077 | Core/Access Switch |
| atlas-eqiad | 2014-07-11 | RT #7390 | Server |

The core/access switches are scheduled to be replaced in FY24-25, and we'll be upgrading the management switches with old Juniper switches for cost savings... so I doubt we'll need to reference their related RT tickets. However, the core routers (which are just the router chassis) and probably the KVM/LCD consoles will never EOL, so every once in a while we might need to refer back to their procurement RT tasks to find information about the type of line cards we had purchased, the type of optics, their costs, things like that. If we're able to just save all the information captured in RT #3069 and RT #552, I think we should be fine.

We could easily save the listed tickets above by just manually clicking "save as" in Firefox without having to fix the scraping problem.

We could proceed with setting up the static RT service and just copy the HTML files into place in chunks.

I have been thinking that if we split up the work, it may not be that much effort to just click through all the tickets on the side over some time.

If it's so few, perhaps PDF or copy-and-paste them into new Phab procurement tasks and then update the Netbox refs?

If it's so few, perhaps PDF or copy-and-paste them into new Phab procurement tasks and then update the Netbox refs?

This seems like the most reasonable approach to me, and we can pair it with storing a DB dump for any unexpected use cases showing up at a later stage.
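
For illustration, a sketch of the Netbox side of that, assuming the RT reference lives in each device's comments field and using the pynetbox client; the URL, token, and the example Phabricator task number are placeholders:

```
# Hypothetical sketch: after re-filing an RT ticket as a Phabricator task,
# update the corresponding Netbox device reference. The field holding the
# RT reference is an assumption about this Netbox instance.
import pynetbox

nb = pynetbox.api("https://netbox.example.org", token="...")

def replace_rt_ref(device_name: str, rt_ref: str, phab_task: str) -> None:
    device = nb.dcim.devices.get(name=device_name)
    if device and rt_ref in (device.comments or ""):
        device.comments = device.comments.replace(rt_ref, phab_task)
        device.save()

# e.g. replace_rt_ref("cr1-eqiad", "RT #552", "T123456")  # task number hypothetical
```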