Wikimedia.org portal broken in Beta Cluster (Domain unavailable)
Open, High · Public

Description

https://www.wikimedia.beta.wmflabs.org/ no longer works. Gets default Apache response for "Domain not configured".

The Wikipedia portal does work.
https://www.wikipedia.beta.wmflabs.org/

Krinkle created this task. Aug 22 2017, 10:40 PM
Restricted Application added a project: Discovery. Aug 22 2017, 10:40 PM
Restricted Application added a subscriber: Aklapper.
debt added a subscriber: debt.

I don't think this is something that we can fix.

greg added a subscriber: greg. Aug 24 2017, 5:03 PM

I don't think this is something that we can fix.

It's something that the portal devs *should* be able to fix. :)

Krinkle triaged this task as Unbreak Now! priority. Mar 20 2018, 8:14 PM
This comment was removed by Krinkle.
Restricted Application added subscribers: Liuxinyu970226, TerraCodes. Mar 20 2018, 8:14 PM
Krinkle added a comment. Edited Mar 20 2018, 8:14 PM

This is still broken (https://www.wikipedia.beta.wmflabs.org/) and is also affecting the stability of the WebPageTest infrastructure that the Performance Team runs for the portals.

Our shared Jenkins job has been failing for several months now due to the 404 for this URL. It's disruptive for overall server maintenance that nothing relating to portals can be verified in Beta Cluster. This used to work but has stopped working, and yet deployments for portals continue.
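A minimal sketch of the kind of smoke check that would surface the 404 described above. The function names `is_ok_status` and `check_portal` are hypothetical and are not taken from the actual Jenkins job; only standard curl options are used.

```shell
# Succeed only for a 2xx HTTP status code.
is_ok_status() {
    case $1 in
        2[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# Fetch only the status code of a URL and report whether it is healthy.
# Example: check_portal https://www.wikipedia.beta.wmflabs.org/
check_portal() {
    url=$1
    # -s: silent, -o /dev/null: discard the body, -w: print the status code.
    # curl prints 000 when the request itself fails (DNS, connection, ...).
    status=$(curl -s -o /dev/null -w '%{http_code}' "$url") || status=000
    if is_ok_status "$status"; then
        echo "OK   $status $url"
    else
        echo "FAIL $status $url"
        return 1
    fi
}
```

Redirects count as failures in this sketch; adding `-L` to the curl call would follow them instead.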

debt assigned this task to Jdrewniak. Mar 21 2018, 1:16 AM
debt added a project: Discovery-Portal-Sprint.
debt added a subscriber: Jdrewniak.

@Jdrewniak - can you take a look?

A change was deployed on Monday that probably broke the Wikipedia portal on beta https://www.wikipedia.beta.wmflabs.org/

This change involved replacing a submodule on mediawiki-config. As I recall, @hashar noticed that this symlink in mediawiki-config wasn't included in our sync-portals script and had to be synced manually.

So going off that, I assume that syncing that symlink to beta-cluster would fix the Wikipedia beta page.

I guess that is the same issue we had on production. Probably just need to scap sync-file it

Mentioned in SAL (#wikimedia-releng) [2018-03-21T17:52:00Z] <hashar> deployment-prep: scap sync-file docroot/wwwportal/portal 'T173887'

I did a sync of docroot/wwwportal/portal but that hasn't changed anything.

Even before the sync, a mediawiki server (deployment-mediawiki04) had the proper link:

$ readlink -f /srv/mediawiki/docroot/wwwportal/portal/
/srv/mediawiki/portals
$ ls /srv/mediawiki/portals/
package.json       sync-portals  urls-to-purge.txt  wikimedia.org  wikipedia.org  wikiversity.org  wiktionary.org
package-lock.json  tests         wikibooks.org      wikinews.org   wikiquote.org  wikivoyage.org
$

So it looks like something else is wrong, maybe at the Apache level. I can't debug it though.

At least https://www.wikipedia.beta.wmflabs.org/apple-app-site-association works.

Looking at the apache config on deployment-mediawiki04 in /etc/apache2/sites-enabled/01-wwwportals.conf it has:

<VirtualHost *:80>
    DocumentRoot /srv/mediawiki/docroot/wwwportal
    ServerName www.wikipedia.beta.wmflabs.org

    RewriteEngine On
    RewriteRule . - [E=RW_PROTO:%{HTTP:X-Forwarded-Proto}]
    RewriteCond %{ENV:RW_PROTO} !=https
    RewriteRule . - [E=RW_PROTO:http]

    # Front page...
    RewriteRule ^/$ /portal-master/wikipedia.org/index.html [L]
    RewriteRule ^/portal/(.*)$ /portal-master/$1 [L]
    RewriteRule ^/portal-master/.*$ - [L]

...

Note the front-page rewrite rule, which points / to /portal-master/wikipedia.org/index.html. That is relative to the DocumentRoot /srv/mediawiki/docroot/wwwportal. Hence on disk we have:

$ ls -l1 /srv/mediawiki/docroot/wwwportal
portal -> ../../portals
portal-master -> ../../portal-master/prod
static -> /srv/mediawiki/static
w
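To make the mapping from request path to on-disk path concrete, here is a rough sketch in shell. It is not Apache's rewrite engine: the function name `map_request` is hypothetical, and the regexes from the three quoted RewriteRule lines are translated by hand into shell patterns.

```shell
# Print the file path Apache would look up for a request path, following the
# quoted rules: "/" goes to the wikipedia.org front page, "/portal/..." is
# rewritten to "/portal-master/...", and "/portal-master/..." passes through.
# Example: map_request /srv/mediawiki/docroot/wwwportal /
map_request() {
    docroot=$1
    path=$2
    case $path in
        /)         path=/portal-master/wikipedia.org/index.html ;;
        /portal/*) path=/portal-master/${path#/portal/} ;;
    esac
    printf '%s%s\n' "$docroot" "$path"
}
```

With the DocumentRoot above, a request for / therefore ends up under the portal-master symlink rather than under portal.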

So while production points at portal:

$ readlink -f /srv/mediawiki/docroot/wwwportal/portal
/srv/mediawiki/portals

On deployment-prep the Apache configuration points at portal-master:

$ readlink -f /srv/mediawiki/docroot/wwwportal/portal-master
/srv/mediawiki/portal-master/prod

That is an empty directory.
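The manual readlink and ls checks above could be wrapped in a small helper. This is only a sketch: `check_docroot_link` is a hypothetical name, and it assumes GNU `readlink -f` as used in the commands in this task.

```shell
# Report whether a docroot entry resolves to a directory that contains files.
# Returns 0 if populated, 1 if the target directory is empty, 2 if it is
# missing or cannot be resolved.
# Example: check_docroot_link /srv/mediawiki/docroot/wwwportal/portal-master
check_docroot_link() {
    link=$1
    # readlink -f follows every symlink in the chain.
    target=$(readlink -f "$link") || { echo "unresolvable: $link"; return 2; }
    if [ ! -d "$target" ]; then
        echo "missing: $link -> $target"
        return 2
    fi
    # ls -A omits . and .., so empty output means an empty directory.
    if [ -z "$(ls -A "$target")" ]; then
        echo "empty: $link -> $target"
        return 1
    fi
    echo "ok: $link -> $target"
}
```

Run against both portal and portal-master, this would have flagged the second as empty even though both symlinks resolve.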

We made the Jenkins job update the portals git submodule to the tip of the remote branch (instead of the commit pointed to in mediawiki-config). So we can probably align the deployment-prep Apache configuration with production, namely by dropping the portal-master thing entirely.

@Jdrewniak let me know whether that makes sense. We can try some hack on deployment-prep and attempt to fix it once and for all.

@hashar that sounds good to me!
anything that helps simplify the Apache config works (I didn't know the rewrite rule was different on deployment-prep)

Mentioned in SAL (#wikimedia-releng) [2018-04-05T17:50:20Z] <eddiegp> Cherry-picking https://gerrit.wikimedia.org/r/c/424361/ on deployment-puppetmaster02 to try unbreak T173887

Change 424361 had a related patch set uploaded (by EddieGP; owner: EddieGP):
[operations/puppet@production] Beta: Unbreak wwwportals

https://gerrit.wikimedia.org/r/424361

EddieGP added a subscriber: EddieGP. Edited Apr 5 2018, 7:18 PM

The initial task description was about the wikiMedia portal. This task suddenly became UBN! and active after the wikiPedia portals broke.

For the wikiPedia portal: It's now fixed. It's working from a cherry-pick right now though. We don't want long-lived cherry-picks on deployment-prep. Any op feel free to puppet-merge https://gerrit.wikimedia.org/r/c/424361/ and remove the cherry-pick on deployment-prep, or I'll set that up for puppet swat next week (whatever happens faster).

For the wikiMedia portal: It's still broken. Because of claims on this task that this "used to work", I looked through the git history for an offending commit. But even after resetting my branch to January 2017 (an arbitrary point in time long before this task was opened) and searching the Apache configs in effect at the time, the only Apache config for www.wikimedia is loaded in prod_sites.pp, which does what the name suggests (only used in prod, not beta). That is the same situation as right now. @Krinkle Are you sure the wikiMedia portal ever worked in beta, and if yes, do you know when? Or was there just some confusion about wikiMedia vs wikiPedia here? I might try to make the wikiMedia portal work, which is a valid request regardless. But I'm not sure whether to start at "how it used to work" or at "how prod works right now"; this depends on whether it ever "used to work". (Edit: Oh, and if it never used to work, I don't think this task should be UBN.)

@EddieGP I'm pretty sure it used to work. Note that operations/puppet.git history is not (necessarily) a complete picture of how things were on Beta Cluster. There are various live hacks applied to the beta cluster's puppetmaster at any point in time, some of which are in Gerrit (unmerged), some of which may get merged (but with different contents and at a different point in time), some of which may get abandoned in Gerrit, or never be in Gerrit at all. Then there are also Hiera values which can come from Wikitech/Horizon, and then there's the case of /srv/mediawiki in Beta Cluster mostly being untracked (e.g. beta's clone of mediawiki, php-master, and portals, are not reflected in Git).
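As a rough illustration of the first point, committed live hacks on a puppetmaster checkout can be listed by comparing it with its upstream branch. The function name `list_local_commits` and the repository path argument are hypothetical; `origin/production` is assumed as the default from the branch name in the Gerrit messages on this task.

```shell
# List commits that exist on the local checkout but not on the upstream branch.
# Example: list_local_commits /path/to/puppet/checkout
list_local_commits() {
    repo=$1
    upstream=${2:-origin/production}
    # "A..B" selects commits reachable from B but not from A.
    git -C "$repo" log --oneline "$upstream..HEAD"
}
```

This only shows committed cherry-picks. It says nothing about Hiera values set through Horizon or about untracked state under /srv/mediawiki.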

(EDIT: Looking in the history I think it worked at 8b1f4f768c)

As for getting wikimedia.org portal to work in Beta, I understand it's essentially a static site for which the config is virtually identical between portals, so it "should" be straight-forward to mold into a shape where it's shared between portals, and shared between beta/prod. Most of it is in place already, but a few bits and pieces may need to be figured out still.

I marked the task UBN primarily due to wikipedia.org not working. I didn't check other portal sites at the time.

I consider it UBN because we should not encourage production deployments of Apache configuration and portal submodule updates, without being able to verify them on the Beta Cluster. Even if the portal maintainers themselves are fine without beta, it affects other teams. The portals are served from the application servers, they share the same Apache configuration as wikis, and share the same directory (/srv/mediawiki), and deployment process (scap from the mw deployment master). In order to be able to maintain and test changes to Apache configuration, mediawiki-config and/or deployment procedures, it is an unnecessary burden that we are unable to confirm whether or not the portals still work properly with the current state of master. Finding out in production is... not great.

Also, I would imagine that it's pretty useful to have a url that shows the current master of the portals repo. Both for sharing with one another, for others to easily confirm recent changes (e.g. from design or QA), cross-device/cross-browser testing, and the exposure of lots of volunteers and staff possibly finding issues before going to production.

Krinkle lowered the priority of this task from Unbreak Now! to High. Apr 5 2018, 11:01 PM

Mentioned in SAL (#wikimedia-releng) [2018-04-06T21:50:59Z] <eddiegp> beta: Cherry-picking https://gerrit.wikimedia.org/r/c/424707/ , test for T173887

Change 424707 had a related patch set uploaded (by EddieGP; owner: EddieGP):
[operations/puppet@production] mediawiki: Move www.wikimedia.org to wwwportals.conf

https://gerrit.wikimedia.org/r/424707

Change 424361 merged by Dzahn:
[operations/puppet@production] Beta: Unbreak wwwportals

https://gerrit.wikimedia.org/r/424361

Change 424361 merged by Dzahn:

The pedia change is now merged and no longer a cherry-pick.


Change 424707 had a related patch set uploaded (by EddieGP; owner: EddieGP):

That patch makes www.wikimedia.beta.wmflabs.org work. Cherry-picking confirmed it works in beta. It's unclear how it'll affect production though, because of the weird alias *.wikimedia.org. I cannot just move that one over to wwwportals.conf (this was attempted before and caused an outage).

I've removed it from the puppetmaster again for now and will try getting it merged in some state that doesn't break prod. wikimedia portals in beta should automatically unbreak once we get there.
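On the wildcard alias concern above, a toy model may help show why the position of *.wikimedia.org matters. It assumes that Apache picks the first name-based virtual host, in configuration order, whose ServerName or ServerAlias matches the Host header. The function name, the vhost labels, and the host other.wikimedia.org are all hypothetical, and this is not a claim about what caused the earlier outage.

```shell
# Print the label of the first vhost whose pattern matches the host.
# Each remaining argument is "label=pattern", in configuration load order.
first_matching_vhost() {
    host=$1
    shift
    for entry in "$@"; do
        name=${entry%%=*}
        pattern=${entry#*=}
        case $host in
            # Unquoted on purpose so that * acts as a wildcard.
            $pattern) echo "$name"; return 0 ;;
        esac
    done
    return 1
}

# If the wildcard loads first, it captures hosts meant for a later vhost:
#   first_matching_vhost other.wikimedia.org \
#       'wwwportals=*.wikimedia.org' 'main=other.wikimedia.org'
```

Under that assumption, moving the alias into a file that loads early (such as 01-wwwportals.conf) could change which virtual host answers for unrelated wikimedia.org subdomains.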