Make an HTML dump of the output of the CodeReview extension on MediaWiki.org
Open, High, Public

Description

As demanded by @Legoktm.

Event Timeline

I think the diffs themselves are not that important to dump (we have them at https://phabricator.wikimedia.org/diffusion/SVN/), but the review comments are often very useful.

Soooo... @Legoktm were you able to give it a try, and if so, what happened?

T218079 brings T116948 up again and this task here is a blocker for both.

Change 500884 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/extensions/CodeReview@master] CodeRevisionView: Fix one case of viewvc not being optional

https://gerrit.wikimedia.org/r/500884

Change 500884 merged by jenkins-bot:
[mediawiki/extensions/CodeReview@master] CodeRevisionView: Fix one case of viewvc not being optional

https://gerrit.wikimedia.org/r/500884

I guess this task is stalled? I'll check with @Legoktm to see if we can move it forward.

Also checking in with @CCicalese_WMF to see if we can rebalance some resources.

WDoranWMF added a subscriber: WDoranWMF.

It appears this extension's data is accessible here
https://www.mediawiki.org/wiki/Special:Code

So we need to scrape all the data under these pages. However, the scale of the data is such that it would have to be paginated in some manner, which would then need to be ported to MW.org.

I'm moving this to the Core Platform Team Inbox so that it can be triaged and planned into a future sprint for one of our subteams.

We've done a similar archiving in the past when we shut down our Bugzilla installation.
Don't know if you want to go this route, but it's an idea I explored last year at T205482:

  • Maybe a MW maintenance script or Python scraper to render the pages without the skin (like action=render, but for Special:Code), and upload them somewhere (puppet microsite can be used).
  • Redirect the Special:Code urls for mediawiki.org to this static site, using an Apache rewrite rule.
  • Then, undeploy the extension (reduced to only log entries, as for EducationProgram, could be in WikimediaMessages repo).
WDoranWMF edited projects, added Platform Team Legacy; removed Platform Engineering.
WDoranWMF raised the priority of this task from Low to High. (Dec 10 2019, 3:20 PM)

I have everything dumped locally, it's about 4GB. I'll rsync it to people.wm.o so people can review it before we place it in its final location. I mostly did what @Krinkle suggested but with a few regex tweaks to fix URLs.

Mentioned in SAL (#wikimedia-operations) [2020-01-16T02:35:18Z] <Krinkle> krinkle@mwmaint1002 Change code_repo.repo_viewvc from 'https://svn.wikimedia.org/viewvc/mediawiki' to '' for 'MediaWiki' repo_name. Ref 2162cf2fc46cfe, T205361.

@Legoktm Awesome. I do have a few small nit picks:

  • Per T205361#5080437, I've applied the repo_viewvc change. As a result, some of the broken interface links are now omitted. If you re-run the script now, those links will be gone, and the paths will remain as plain text.
  • The archive pages could do with a basic <h1> heading. Maybe make them a simple copy of the <title> that you have already?
  • I noticed the internal links are currently absolute, e.g. <a href="/~legoktm/CodeReview/MediaWiki/rev/2.html">. If these were relative, like ./2.html, the archive would be more portable (without needing string replacements or regeneration).
  • I have a few minor CSS tweaks (e.g. hide the no-op "purge" link), but I'll stuff that in a patch later.
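
A minimal sketch of the relative-link suggestion above, assuming the pages are post-processed as strings. The helper name `relativize_links` and the `root` parameter are hypothetical and not taken from codereview-archiver; only hrefs under the archive root are rewritten, so links pointing elsewhere are left alone.

```python
import posixpath
import re

# Matches root-absolute hrefs such as href="/~legoktm/CodeReview/MediaWiki/rev/2.html".
HREF_RE = re.compile(r'href="(/[^"]*)"')


def relativize_links(html, page_path, root):
    """Rewrite hrefs under `root` so they are relative to the page's own directory.

    `page_path` is the root-absolute path of the page being processed.
    Hypothetical helper for illustration only.
    """
    page_dir = posixpath.dirname(page_path)

    def repl(match):
        target = match.group(1)
        if not target.startswith(root):
            # Leave links that point outside the archive untouched.
            return match.group(0)
        rel = posixpath.relpath(target, start=page_dir)
        if not rel.startswith("."):
            rel = "./" + rel
        return 'href="%s"' % rel

    return HREF_RE.sub(repl, html)
```

With this approach the archive can be moved to a different base path without regenerating the pages.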

Change 565805 had a related patch set uploaded (by Legoktm; owner: Legoktm):
[mediawiki/tools/codereview-archiver@master] Initial commit

https://gerrit.wikimedia.org/r/565805

Change 565805 merged by Legoktm:
[mediawiki/tools/codereview-archiver@master] Initial commit

https://gerrit.wikimedia.org/r/565805

I updated the dump last night with fixes for @Krinkle's feedback.

Marking as Resolved as it is in the Done column. Feel free to reopen if there is remaining work.

Where is the HTML dump located?

Thanks @Dzahn. I looked at https://people.wikimedia.org/~legoktm/CodeReview/MediaWiki/rev/35.html (example) and found that some URLs still point to MediaWiki.org, such as history, "purge"...

@Legoktm Is the dump available somewhere more public or documented somewhere? Could you please add a link to the final location somewhere and re-resolve this task?

Maybe talk to @ArielGlenn about getting it on the official dumps servers (dumps.wikimedia.org) under "misc". That would be more stable than the people VM.

We could host a tarball of the HTML pages, but that's different from a static copy that people can browse online.

Since the SQL dumps for codereview are also on the dumps servers (T243055), wouldn't it make sense to have the HTML alongside them?

The HTML dump can be in a tarball for download, sure. But that is separate from what was requested in T243056 i.e. actually serving a static copy for browsing. I don't think the labstore boxes should be doing that.

In that case, I think this should have a dedicated ganeti VM just for this.

Sounds good to me, though we probably want that discussion on the other task.

This has happened in T243056 (sites have been added to the miscweb* VMs shared with other static sites)

https://static-codereview.wikimedia.org/MediaWiki/1.html

I think it's (basically?) done.

Some HTML corruption occurred in the post-processing step. This has caused "follow-up" links to become broken:

https://static-codereview.wikimedia.org/MediaWiki/75446.html?

<a ./75429.html" title="Special:Code/MediaWiki/75429">r75429</a>
<a ./75446.html" title="Special:Code/MediaWiki/75446">r75446</a>

Original from https://www.mediawiki.org/wiki/Special:Code/MediaWiki/75446

<a href="/wiki/Special:Code/MediaWiki/75466" title="Special:Code/MediaWiki/75466">r75466</a>
<a href="/wiki/Special:Code/MediaWiki/75467" title="Special:Code/MediaWiki/75467">r75467</a>
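
For illustration, a sketch of a rewrite that keeps the `href=` attribute intact, plus a check that flags anchors damaged in the way shown above. The patterns and function names are assumptions for this example; the actual post-processing script may work differently.

```python
import re

# Assumed shape of the wiki-internal revision links in the original pages.
WIKI_LINK_RE = re.compile(r'href="/wiki/Special:Code/MediaWiki/(\d+)"')

# Flags <a> tags whose first attribute is not a name=value pair,
# e.g. `<a ./75429.html" title="...">`.
BROKEN_ANCHOR_RE = re.compile(r'<a\s+(?!\s)(?![A-Za-z][\w-]*=)')


def rewrite_revision_links(html):
    """Replace wiki revision links with relative static links, keeping href=."""
    return WIKI_LINK_RE.sub(r'href="./\1.html"', html)


def find_broken_anchors(html):
    """Return the offsets of <a> tags that appear to have lost their href=."""
    return [m.start() for m in BROKEN_ANCHOR_RE.finditer(html)]
```

Running `find_broken_anchors` over the generated files would be one way to count how many pages are affected before regenerating them.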

Also, is it MediaWiki/1.html or MediaWiki/rev/1.html? I've seen both versions. It seems we're back to the former?

It's https://static-codereview.wikimedia.org/MediaWiki/1.html. The other version was MediaWiki/r1.html but not /rev/1.html.
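
Given that layout, the mapping a redirect from the old Special:Code URLs would need can be sketched roughly as follows. This is not the Apache rule itself, just the path translation expressed in Python; the function name and the assumption that repository names contain only word characters, dots, and hyphens are mine.

```python
import re

STATIC_BASE = "https://static-codereview.wikimedia.org"

# Assumption: revision pages look like /wiki/Special:Code/<repo>/<numeric rev>.
SPECIAL_CODE_RE = re.compile(
    r"^/wiki/Special:Code/(?P<repo>[\w.-]+)/(?P<rev>\d+)/?$"
)


def static_url_for(path):
    """Map a Special:Code revision path to its static archive URL, or None."""
    m = SPECIAL_CODE_RE.match(path)
    if m is None:
        # Not a revision page; other Special:Code views are not handled here.
        return None
    return "%s/%s/%s.html" % (STATIC_BASE, m.group("repo"), m.group("rev"))
```

For example, `static_url_for("/wiki/Special:Code/MediaWiki/75446")` gives the static page for r75446, while paths that are not revision pages return `None` and would need separate handling.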

So, when is the Apache rewrite being put in place? That's blocking undeploying the extension.

Ping! :)

(I remembered this task after this comment on the GitLab consultation: https://www.mediawiki.org/wiki/Topic:Vu63x95by4od74uc )