Make an HTML dump of the output of the CodeReview extension on MediaWiki.org
Open, High, Public

Description

As demanded by @Legoktm.

Event Timeline

I guess this task is stalled? I'll check with @Legoktm to see if we can move it forward.

Also checking in with @CCicalese_WMF to see if we can rebalance some resources.

WDoranWMF added a subscriber: WDoranWMF.

It appears this extension's data is accessible here:
https://www.mediawiki.org/wiki/Special:Code

So we need to scrape all the data under these pages. However, the scale of the data is such that it would have to be paginated in some manner, which would then need to be ported to MW.org.

I'm moving this to the Core Platform Team Inbox so that it can be triaged and planned into a future sprint for one of our subteams.

We did a similar archiving in the past when we shut down our Bugzilla installation.
Don't know if you want to go this route, but it's an idea I explored last year at T205482:

  • Maybe a MW maintenance script or Python scraper to render the pages without the skin (like action=render, but for Special:Code), and upload them somewhere (puppet microsite can be used).
  • Redirect the Special:Code urls for mediawiki.org to this static site, using an Apache rewrite rule.
  • Then, undeploy the extension (reduced to only log entries, as for EducationProgram, could be in WikimediaMessages repo).
WDoranWMF edited projects, added Platform Team Legacy; removed Platform Engineering.
WDoranWMF raised the priority of this task from Low to High. Dec 10 2019, 3:20 PM

I have everything dumped locally; it's about 4 GB. I'll rsync it to people.wm.o so people can review it before we place it in its final location. I mostly did what @Krinkle suggested, but with a few regex tweaks to fix URLs.

Mentioned in SAL (#wikimedia-operations) [2020-01-16T02:35:18Z] <Krinkle> krinkle@mwmaint1002 Change code_repo.repo_viewvc from 'https://svn.wikimedia.org/viewvc/mediawiki' to '' for 'MediaWiki' repo_name. Ref 2162cf2fc46cfe, T205361.

@Legoktm Awesome. I do have a few small nitpicks:

  • Per T205361#5080437, I've applied the repo_viewvc change. This results in some of the broken interface links being omitted. If you re-run the script now, those links will be gone, and the paths will remain as plain text.
  • The archive pages could do with a basic <h1> heading. Maybe make them a simple copy of the <title> that you have already?
  • I noticed the links between archive pages are currently absolute, e.g. <a href="/~legoktm/CodeReview/MediaWiki/rev/2.html">. If these were relative, like ./2.html, the archive would be more portable (without needing string replacements or regeneration).
  • I have a few minor CSS tweaks (e.g. hide the no-op "purge" link), but I'll stuff that in a patch later.
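
The relative-link suggestion above could be sketched roughly as follows. This is only an illustration, not the archiver's actual code: the prefix is taken from the example link quoted in the list, and the function name `relativize_links` is hypothetical.

```python
import re

# Assumption: the prefix below is the one from the example link in the comment
# above; a real dump may be served from a different path.
ABSOLUTE_PREFIX = "/~legoktm/CodeReview/MediaWiki/rev/"

# Match href attributes that point at a sibling revision page, e.g. ".../rev/2.html".
_HREF_RE = re.compile(r'href="' + re.escape(ABSOLUTE_PREFIX) + r'(\d+\.html)"')


def relativize_links(html: str) -> str:
    """Rewrite absolute links to sibling revision pages as ./N.html."""
    return _HREF_RE.sub(r'href="./\1"', html)


if __name__ == "__main__":
    sample = '<a href="/~legoktm/CodeReview/MediaWiki/rev/2.html">r2</a>'
    print(relativize_links(sample))
```

Links that do not start with the assumed prefix are left untouched, so the same pass can be run safely over pages that also link elsewhere.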

Change 565805 had a related patch set uploaded (by Legoktm; owner: Legoktm):
[mediawiki/tools/codereview-archiver@master] Initial commit

https://gerrit.wikimedia.org/r/565805

Change 565805 merged by Legoktm:
[mediawiki/tools/codereview-archiver@master] Initial commit

https://gerrit.wikimedia.org/r/565805

I updated the dump last night with fixes for @Krinkle's feedback.

Marking as Resolved as it is in the Done column. Feel free to reopen if there is remaining work.

Where is the HTML dump located?

Thanks @Dzahn. I looked at https://people.wikimedia.org/~legoktm/CodeReview/MediaWiki/rev/35.html (example) and found that some URLs still point to MediaWiki.org, such as history and "purge".

@Legoktm Is the dump available somewhere more public or documented somewhere? Could you please add a link to the final location somewhere and re-resolve this task?

Maybe talk to @ArielGlenn about getting it on the official dumps servers (dumps.wikimedia.org) under "misc". That would be more stable than the people VM.

We could host a tarball of the HTML pages, but that's different from a static copy that people can browse online.

Since the SQL dumps for CodeReview are also on the dumps servers (T243055), wouldn't it make sense to keep the HTML together with them?

The HTML dump can be in a tarball for download, sure. But that is separate from what was requested in T243056 i.e. actually serving a static copy for browsing. I don't think the labstore boxes should be doing that.

In that case I think this should have a dedicated Ganeti VM just for this.

Sounds good to me, though we probably want that discussion on the other task.

This has happened in T243056 (sites have been added to the miscweb* VMs shared with other static sites)

https://static-codereview.wikimedia.org/MediaWiki/1.html

I think it's (basically?) done.

Some HTML corruption occurred in the post-processing step. This has caused "follow-up" links to become broken:

https://static-codereview.wikimedia.org/MediaWiki/75446.html?

<a ./75429.html" title="Special:Code/MediaWiki/75429">r75429</a>
<a ./75446.html" title="Special:Code/MediaWiki/75446">r75446</a>

Original from https://www.mediawiki.org/wiki/Special:Code/MediaWiki/75446

<a href="/wiki/Special:Code/MediaWiki/75466" title="Special:Code/MediaWiki/75466">r75466</a>
<a href="/wiki/Special:Code/MediaWiki/75467" title="Special:Code/MediaWiki/75467">r75467</a>
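
A quick way to spot this kind of damage across the dump might look like the sketch below. It is a diagnostic illustration only, assuming the corruption always takes the shape shown in the broken examples in this comment (the `href="` part missing before `./N.html"`); it is not the archiver's actual fix, and the function names are hypothetical.

```python
import re
from typing import List

# Matches anchors such as <a ./75429.html" ...> where 'href="' was lost.
# A well-formed <a href="./75429.html" ...> does not match, because "href"
# follows "<a " instead of "./".
_BROKEN_ANCHOR_RE = re.compile(r'<a (\./\d+\.html)"')


def find_broken_anchors(html: str) -> List[str]:
    """Return the link targets of anchors that lost their href attribute."""
    return _BROKEN_ANCHOR_RE.findall(html)


def repair_broken_anchors(html: str) -> str:
    """Re-insert the missing 'href="' in front of the link target."""
    return _BROKEN_ANCHOR_RE.sub(r'<a href="\1"', html)


if __name__ == "__main__":
    broken = '<a ./75429.html" title="Special:Code/MediaWiki/75429">r75429</a>'
    print(find_broken_anchors(broken))
    print(repair_broken_anchors(broken))
```

Regenerating the dump from a corrected script is still the cleaner route; a check like this is mainly useful for confirming that a new dump no longer contains the damage.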

Also, is it MediaWiki/1.html or MediaWiki/rev/1.html? I've seen both versions. It seems we're back to the former?

It's https://static-codereview.wikimedia.org/MediaWiki/1.html. The other version was MediaWiki/r1.html, not /rev/1.html.
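
For illustration, the mapping from a Special:Code path to the static archive URL could be sketched like this. It is not the Apache rewrite rule itself: the wiki path shape (`/wiki/Special:Code/<repo>/<rev>`) is taken from the links quoted earlier in this task, accepting an optional `r` prefix on the revision is an assumption, and `static_url` is a hypothetical name.

```python
import re
from typing import Optional

STATIC_BASE = "https://static-codereview.wikimedia.org"

# Assumed path shape: /wiki/Special:Code/<repo>/<rev>, where <rev> is a number
# with an optional leading "r" (the "r" form is an assumption).
_CODE_PATH_RE = re.compile(r'^/wiki/Special:Code/([^/]+)/r?(\d+)$')


def static_url(path: str) -> Optional[str]:
    """Map a Special:Code revision path to its static archive URL, or None."""
    match = _CODE_PATH_RE.match(path)
    if match is None:
        return None
    repo, rev = match.groups()
    return f"{STATIC_BASE}/{repo}/{rev}.html"


if __name__ == "__main__":
    print(static_url("/wiki/Special:Code/MediaWiki/75446"))
```

Paths that do not look like a single revision page return `None` here; how repository index pages or other subpages should be handled is left open.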

So, when is the Apache rewrite being put in place? That's blocking undeploying the extension.

Ping! :)

(I remembered this task after this comment on the GitLab consultation: https://www.mediawiki.org/wiki/Topic:Vu63x95by4od74uc )

Ping again. Does any SRE have the time to finish this?

Change 724049 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] mediawiki: Redirect Special:CodeReview to static archives

https://gerrit.wikimedia.org/r/724049

Change 744080 had a related patch set uploaded (by Legoktm; author: Legoktm):

[mediawiki/tools/codereview-archiver@master] Fix re1 replacement

https://gerrit.wikimedia.org/r/744080

Change 744080 merged by Legoktm:

[mediawiki/tools/codereview-archiver@master] Fix re1 replacement

https://gerrit.wikimedia.org/r/744080

Mentioned in SAL (#wikimedia-operations) [2021-12-06T19:58:36Z] <legoktm> trying new dump of Special:CodeReview on mwmaint1002 (T205361)

OK, I copied over the new dump to miscweb, the issues in T205361#6150835 are fixed now. I *think* we're ready to finally do this.

I take it we didn't pull the trigger last month, though? :-(

No, the Apache config change is non-trivial (disable puppet everywhere, roll out to one or a few servers, use httpbb for verification plus manual testing, then slowly enable everywhere) and it was too close to the freeze. I asked in this week's ServiceOps meeting if anyone wanted to pick it up, but I don't think anyone volunteered (or at least I didn't see it in the notes). Maybe we can do it during one of the puppet request windows next week.

Change 754088 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/tools/codereview-archiver@master] Set HTML doctype and lang, strip purge link, add basic styles

https://gerrit.wikimedia.org/r/754088

Change 754088 merged by Legoktm:

[mediawiki/tools/codereview-archiver@master] Set HTML doctype and lang, strip purge link, add basic styles

https://gerrit.wikimedia.org/r/754088

Mentioned in SAL (#wikimedia-operations) [2022-03-28T22:31:27Z] <rzl> rzl@cumin2002:~$ sudo cumin A:mw 'disable-puppet T205361'

Change 724049 merged by RLazarus:

[operations/puppet@production] mediawiki: Redirect Special:CodeReview to static archives

https://gerrit.wikimedia.org/r/724049

Mentioned in SAL (#wikimedia-operations) [2022-03-28T22:39:14Z] <rzl> rzl@cumin2002:~$ sudo cumin A:mw 'enable-puppet T205361'

Change 774821 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] httpbb: fix status code checks for CodeReview redirects

https://gerrit.wikimedia.org/r/774821

Change 774821 merged by RLazarus:

[operations/puppet@production] httpbb: fix status code checks for CodeReview redirects

https://gerrit.wikimedia.org/r/774821

Change 754088 merged by Legoktm:

[mediawiki/tools/codereview-archiver@master] Set HTML doctype and lang, strip purge link, add basic styles

https://gerrit.wikimedia.org/r/754088

It seems the result of this change was not deployed; the pages display:

This page is in Quirks Mode. Page layout may be impacted.

And the layout is impacted as a result (odd sizes, broken font, etc.)

Where are you seeing that?

Change 774943 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] mediawiki: fix r123 syntax for special:codereview redirects

https://gerrit.wikimedia.org/r/774943

Browser console. But view-source also shows the page is not in sync with the repo; note the lack of a doctype, the missing font styles, etc.:

<html>
<head>
<title>r113071 MediaWiki - Code Review archive</title>
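
A simple check for whether a dumped page declares a doctype could be sketched as below. This is only a rough heuristic for verifying a regenerated dump: browsers apply more detailed rules when choosing Quirks Mode, and `has_doctype` is a hypothetical helper, not part of the archiver.

```python
def has_doctype(html: str) -> bool:
    """Return True if the document starts with an HTML doctype declaration.

    Leading whitespace is ignored and the comparison is case-insensitive.
    """
    return html.lstrip().lower().startswith("<!doctype html")


if __name__ == "__main__":
    deployed = "<html>\n<head>\n<title>r113071 MediaWiki - Code Review archive</title>"
    print(has_doctype(deployed))
    print(has_doctype("<!DOCTYPE html>\n<html lang=\"en\">"))
```

Running something like this over a sample of files before and after copying a new dump would show whether the deployed pages match what the repository is expected to generate.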

Change 774981 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] httpbb: follow-up to 'fix status code checks for CodeReview redirects'

https://gerrit.wikimedia.org/r/774981

Change 774981 merged by Dzahn:

[operations/puppet@production] httpbb: follow-up to 'fix status code checks for CodeReview redirects'

https://gerrit.wikimedia.org/r/774981

Mentioned in SAL (#wikimedia-operations) [2022-03-29T22:50:25Z] <mutante> cumin1001 - systemctl start httpbb_hourly_appserver fixed Icinga alert after gerrit:774981 T205361

A month later, it'd be really nice to get this finally done so that we can undeploy the code. How feasible is this?

@Jdforrester-WMF I recall this issue from more than a year ago. I don't think Platform Engineering owns completing this (though I'm willing to be wrong), so who does own it right now? (Apologies if that is already evident and I just missed it; I'm curious where it fits in priorities and whether it should move up or down.)