
scrape RT ticket HTML files
Open, Low, Public

Description

This task is to scrape the HTML from our old ticket system https://rt.wikimedia.org and can be broken into these steps:

  • Find or make a tool/plugin/script to save the raw HTML of all the tickets, in a way that renders properly offline.
  • The paths to images, CSS, etc. need to keep working locally.

You get this desired behaviour if, for example, you use Firefox and manually save the page for offline use, but you will not get it with a simple wget or curl.

Also keep in mind that you need to be logged in as a user with permissions for all tickets.
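
A minimal sketch of what such a scraper could look like, assuming a valid RT session cookie is available; the cookie name, the asset handling, and the file layout below are assumptions for illustration, not verified details of rt.wikimedia.org:

```
# Hypothetical sketch: fetch one RT ticket with an authenticated session and
# save it with its images/CSS/scripts rewritten to local copies, approximating
# Firefox's "save page for offline use". The cookie name is an assumption.
import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

BASE = "https://rt.wikimedia.org"
SESSION_COOKIE = {"RT_SID": "..."}  # assumed cookie name; take it from a logged-in browser

def save_ticket(ticket_id: int, out_dir: str) -> None:
    session = requests.Session()
    session.cookies.update(SESSION_COOKIE)
    page = session.get(f"{BASE}/Ticket/Display.html", params={"id": ticket_id})
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")

    assets_dir = os.path.join(out_dir, "assets")
    os.makedirs(assets_dir, exist_ok=True)

    # Download every stylesheet, script and image, then point the tag at the
    # local copy so the saved page renders offline.
    for tag, attr in (("link", "href"), ("script", "src"), ("img", "src")):
        for node in soup.find_all(tag):
            url = node.get(attr)
            if not url:
                continue
            absolute = urljoin(BASE + "/", url)
            local_name = os.path.basename(urlparse(absolute).path) or "index"
            resp = session.get(absolute)
            if resp.ok:
                with open(os.path.join(assets_dir, local_name), "wb") as fh:
                    fh.write(resp.content)
                node[attr] = f"assets/{local_name}"

    with open(os.path.join(out_dir, f"ticket-{ticket_id}.html"), "w") as fh:
        fh.write(str(soup))
```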

Extra requirement:

  • Look at the "Queue:" field in tickets and save the HTML in a separate directory for each queue.

So, for example, https://rt.wikimedia.org/Ticket/Display.html?id=4802 should be in a directory called "ops-requests", while https://rt.wikimedia.org/Ticket/Display.html?id=2 should be in a directory called "pmtpa", and so on.
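
As a rough illustration of the per-queue layout, assuming the queue name can be located by its "Queue:" label in the ticket markup (the lookup logic is an assumption about RT's HTML, not a verified selector):

```
# Hypothetical sketch: read the "Queue:" field from a saved ticket page and
# file the HTML under a per-queue directory, e.g. ops-requests/ or pmtpa/.
import os
import shutil

from bs4 import BeautifulSoup

def queue_of(html_path: str) -> str:
    with open(html_path) as fh:
        soup = BeautifulSoup(fh.read(), "html.parser")
    # Assumption: RT renders the metadata as label/value pairs, so the queue
    # name is the next non-empty text node after the "Queue:" label.
    label = soup.find(string=lambda s: s and s.strip() == "Queue:")
    if label is None:
        return "unknown-queue"
    value = label.find_next(string=lambda s: s and s.strip())
    return value.strip().lower().replace(" ", "-") if value else "unknown-queue"

def file_by_queue(html_path: str, root: str) -> None:
    queue_dir = os.path.join(root, queue_of(html_path))
    os.makedirs(queue_dir, exist_ok=True)
    shutil.move(html_path, os.path.join(queue_dir, os.path.basename(html_path)))
```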

Some tickets can be public, but definitely not all of them, and specifically not those in the "procurement" queue. Some have been imported into Phabricator and then made public later in Phabricator, some have been imported but not made public, and some have not been imported.

Keep the result in a private location but with access for SRE, for now.

Once we have those files, the ticket is resolved. Later tickets will be about where we put them and how we shut down the actual RT app.

This is limited to producing these "static dumps".

Event Timeline

Is there any reason for choosing scraping over transferring the queues that haven't been moved over into Phabricator, so that we have a single location?

and some have not been imported.

Is that and T38#4840473 the reason this ticket was created, or are there additional reasons?

According to https://www.mediawiki.org/wiki/Phabricator/versus_RT , all queues except for access-requests@ and procurement@ were migrated to Phabricator in late 2014, so I'd expect those RT tickets to exist in Phabricator anyway. For procurement@, it's not clear to me whether T93760 is about migrating tickets or just moving the (future) workflow to Phab tooling. I don't know about access-requests@.

Tickets in the domains queue also weren't transferred.

Is there any reason for choosing scraping over transferring the queues that haven't been moved over into Phabricator, so that we have a single location?

Basically, this option means we can finally shut down a Perl application and figure the rest out later, while the other option means opening a Pandora's box of uncertainties, discussions, and unknown tools, and the people who did the past imports aren't here anymore. It's just being pragmatic, to avoid the "perfect is the enemy of good" trap and the lack of resources that led to the current situation.

LSobanski updated the task description.

An alternative proposal here is to create a static dump of the database only. As far as I'm aware, there is no requirement for regular access to the information in RT, and the purpose of this task is to have a data copy that can be accessed in an unlikely emergency. @wiki_willy do you have any thoughts on this?
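
For illustration, a minimal sketch of such a one-off dump, assuming RT's default MySQL backend and a database named "rt4" (both are assumptions for this host, not verified):

```
# Hypothetical sketch: take a one-off static dump of the RT database instead
# of (or alongside) the HTML scrape. Assumes mysqldump is installed and can
# authenticate; for a large database you would stream rather than buffer.
import gzip
import subprocess

def dump_rt_db(out_path: str = "rt-dump.sql.gz") -> None:
    # --single-transaction gives a consistent snapshot without locking tables.
    dump = subprocess.run(
        ["mysqldump", "--single-transaction", "rt4"],
        check=True,
        capture_output=True,
    )
    with gzip.open(out_path, "wb") as fh:
        fh.write(dump.stdout)
```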

Thanks for checking, @LSobanski. It's definitely rare that we need to refer back to RT. In the last 5 years, the 2-3 cases where we've had to reference RT were typically due to tracking down information about core routers that we had purchased back then. In Netbox, we only have 24 active devices left that still reference RT tasks. As long as we're able to access these in some way (ideally quickly and easily) on the rare occasions that it's needed, you should be able to move forward.

@wiki_willy thanks for the response, I have two more questions:

  • What are the EOL dates for the remaining devices (or if you could share a way to locate them I can check myself)?
  • Could you give me an example of information you would need from an RT ticket?

Sure, no prob @LSobanski. Here's the list of the 24 active devices that still reference RT tasks in Netbox, along with their purchase dates (network equipment usually EOLs every 8 years):

| Name | Purchase date | Procurement ticket | Device Type |
| --- | --- | --- | --- |
| msw-b7-eqiad-temp | 2011-02-18 | RT #0534 | Management Switch |
| msw-e1-eqiad | 2011-02-18 | RT #0534 | Management Switch |
| msw-c6-eqiad | 2012-04-04 | RT #2763 | Management Switch |
| cr2-codfw | 2012-07-19 | RT #3069 | Core Router |
| cr1-codfw | 2012-07-19 | RT #3069 | Core Router |
| kvm-ulsfo | 2012-08-29 | RT #3463 | KVM/LCD Console |
| msw-c5-eqiad | 2011-02-18 | RT #534 | Management Switch |
| cr1-eqiad | 2011-02-17 | RT #552 | Core Router |
| cr2-eqiad | 2011-02-17 | RT #552 | Core Router |
| msw-f1-eqiad | 2013-10-28 | RT #5892 | Management Switch |
| qfx5100-spare1-codfw | 2014-07-09 | RT #7077 | Core/Access Switch |
| asw-c1-codfw | 2014-07-09 | RT #7077 | Core/Access Switch |
| asw-c3-codfw | 2014-07-09 | RT #7077 | Core/Access Switch |
| asw-c5-codfw | 2014-07-09 | RT #7077 | Core/Access Switch |
| asw-c6-codfw | 2014-07-09 | RT #7077 | Core/Access Switch |
| asw-d1-codfw | 2014-07-09 | RT #7077 | Core/Access Switch |
| asw-d3-codfw | 2014-07-09 | RT #7077 | Core/Access Switch |
| asw-d5-codfw | 2014-07-09 | RT #7077 | Core/Access Switch |
| asw-d6-codfw | 2014-07-09 | RT #7077 | Core/Access Switch |
| asw-d8-codfw | 2014-07-09 | RT #7077 | Core/Access Switch |
| asw-d2-codfw | 2014-07-09 | RT #7077 | Core/Access Switch |
| asw-d7-codfw | 2014-07-09 | RT #7077 | Core/Access Switch |
| asw-c7-codfw | 2014-07-09 | RT #7077 | Core/Access Switch |
| atlas-eqiad | 2014-07-11 | RT #7390 | Server |

The core/access switches are scheduled to be replaced in FY24-25, and we'll be upgrading the management switches with old Juniper switches for cost savings... so I doubt we'll need to reference their related RT tickets. However, the core routers (which are just the router chassis) and probably the KVM/LCD consoles will never EOL, so every once in a while we might need to refer back to their procurement RT tasks to find information about the type of line cards we had purchased, the type of optics, their costs, things like that. If we're able to just save all the information captured in RT #3069 and RT #552, I think we should be fine.

We could easily save the listed tickets above by just manually clicking "save as" in Firefox without having to fix the scraping problem.

We could proceed with setting up the static RT service and just copy the HTML files into place in chunks.

I have been thinking that if we split up the work, it may not be that much effort to just click through all the tickets on the side over some time.

If it's so few, perhaps PDF or copy-and-paste them into new Phab procurement tasks and then update the Netbox refs?

If it's so few, perhaps PDF or copy-and-paste them into new Phab procurement tasks and then update the Netbox refs?

This seems like the most reasonable approach to me, and we can pair it with storing a DB dump for any unexpected use cases showing up at a later stage.
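
For illustration, a sketch of the Netbox side of that, assuming the RT reference lives in each device's comments field and using the pynetbox client; the URL, token, and the example Phabricator task number are placeholders:

```
# Hypothetical sketch: after re-filing an RT ticket as a Phabricator task,
# update the corresponding Netbox device reference. The field holding the
# RT reference is an assumption about this Netbox instance.
import pynetbox

nb = pynetbox.api("https://netbox.example.org", token="...")

def replace_rt_ref(device_name: str, rt_ref: str, phab_task: str) -> None:
    device = nb.dcim.devices.get(name=device_name)
    if device and rt_ref in (device.comments or ""):
        device.comments = device.comments.replace(rt_ref, phab_task)
        device.save()

# e.g. replace_rt_ref("cr1-eqiad", "RT #552", "T123456")  # task number hypothetical
```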