
secure.wikimedia.org speed and status
Closed, Declined (Public)

Description

Author: William.Allen.Simpson

Description:
[originally reported on wikitech]

I've been using secure for login for over a year now, and at first it seemed
pretty good, other than the inability to switch sites easily (bug 5440).

And always editing links from secure.wikimedia.org/.../w to en.wikipedia.org/w,
but I've gotten used to doing that extra bit by hand.

Anyway, it's just been a dog lately. During EDT daylight hours, it often
gives an error saying it cannot access the page, especially when saving.

So, I've reverted to the old practice from the days of 2005-2006, and
mostly edit in very off-peak hours. Yet it slowed down drastically again!

Here's my test log, edits queued and ready to go, demonstrating roughly how
long they take to come back and display:

;off hours

  1. 2009-07-01T06:54:45
  2. 2009-07-01T06:55:59 1 minute 14 seconds

;peak time

  1. 2009-07-01T17:05:24
  2. 2009-07-01T17:06:17 53 seconds
  3. 2009-07-01T17:06:53 36 seconds
  4. 2009-07-01T17:08:00 1 minute 7 seconds
  5. 2009-07-01T17:08:40 40 seconds
  6. 2009-07-01T17:09:45 1 minute 5 seconds
  7. 2009-07-01T17:11:49 2 minutes 4 seconds
  8. 2009-07-01T17:12:44 55 seconds
  9. 2009-07-01T17:13:49 1 minute 5 seconds
  10. 2009-07-01T17:15:00 1 minute 11 seconds
  11. 2009-07-01T17:16:10 1 minute 10 seconds

In short, sometimes as slow off-peak as peak.
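
The intervals in the log above can be rechecked mechanically. This is a minimal sketch that recomputes the gap between consecutive timestamps; the timestamps are copied from the log, while the function and variable names are only illustrative.

```python
from datetime import datetime

# Timestamps copied from the test log above.
OFF_HOURS_LOG = ["2009-07-01T06:54:45", "2009-07-01T06:55:59"]
PEAK_LOG = [
    "2009-07-01T17:05:24", "2009-07-01T17:06:17", "2009-07-01T17:06:53",
    "2009-07-01T17:08:00", "2009-07-01T17:08:40", "2009-07-01T17:09:45",
    "2009-07-01T17:11:49", "2009-07-01T17:12:44", "2009-07-01T17:13:49",
    "2009-07-01T17:15:00", "2009-07-01T17:16:10",
]


def save_intervals(timestamps):
    """Return the seconds elapsed between each pair of consecutive timestamps."""
    times = [datetime.strptime(t, "%Y-%m-%dT%H:%M:%S") for t in timestamps]
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


off_hours = save_intervals(OFF_HOURS_LOG)
peak = save_intervals(PEAK_LOG)
print("off hours:", off_hours)
print("peak:", peak, "mean:", sum(peak) / len(peak))
```

On these figures the single off-hours save (74 seconds) is slower than the peak-time average (about 65 seconds), which matches the observation above.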

Does this mean that many secure users are from Asia?

Are there too many secure users?

Is there anywhere that configuration and usage of secure is listed?


Version: unspecified
Severity: normal

Details

Reference
bz19587

Event Timeline

bzimport raised the priority of this task to Medium. Nov 21 2014, 10:41 PM
bzimport added a project: HTTPS.
bzimport set Reference to bz19587.

Fred, can you take a peek and see if we can monitor the status of the secure server? I haven't noticed any problems using it, nor did the load graphs look particularly unpleasant when I checked last week, but we want to make sure it's not going to crap when we're not looking.

Bug 19588 has been marked as a duplicate of this bug.

fvassard wrote:

Secure.wikimedia.org seems to point to bart and uses apache2 to proxy the ssl connection over to the cluster.
However, bart is also the nagios monitoring server and will therefore see spikes in CPU usage from time to time, depending on the nagios scheduler.
Also, this server is very low on memory:

[root@bart conf]# free -m
             total       used       free     shared    buffers     cached
Mem:          3550       3129        420          0        406       1118
-/+ buffers/cache:       1604       1945
Swap:         1983          0       1983

which could cause some of the issues you are seeing.
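
For reading the output above: in this older `free` layout, the "-/+ buffers/cache" row is derived from the "Mem:" row by treating buffers and page cache as reclaimable. The sketch below redoes that arithmetic with the numbers from the output; the function name is illustrative, and this is one reading of the figures, not a diagnosis of bart.

```python
def buffers_cache_row(used, free, buffers, cached):
    """Recompute the '-/+ buffers/cache' row from the 'Mem:' row (values in MB)."""
    # Memory held by processes once reclaimable buffers and cache are excluded.
    used_by_processes = used - buffers - cached
    # Memory that could be handed to processes if buffers and cache were dropped.
    free_for_processes = free + buffers + cached
    return used_by_processes, free_for_processes


# Values from the "Mem:" line above.
used_by_processes, free_for_processes = buffers_cache_row(
    used=3129, free=420, buffers=406, cached=1118
)
print(used_by_processes, free_for_processes)
```

The results differ from the reported 1604 and 1945 by 1 MB each, presumably because `free -m` rounds every column to whole megabytes independently.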

I will enable process accounting on that server to try and get a better view as to what is going on.

Ganglia graphs available at http://ganglia.wikimedia.org/pmtpa/?c=Miscellaneous&h=bart.wikimedia.org&m=&r=hour&s=descending&hc=4

Also note, this server is set to be decommissioned in the near future.

William.Allen.Simpson wrote:

peaks during off-peak time

Thank you for the ganglia link. The server list had "ssl"
instead of secure.wikimedia.org, so I'd missed it.

I've been looking at the graphs from time to time, and
this was a fine example.

Attached:

2009-0719-bart.graph.php.png (168×397 px, 15 KB)

William.Allen.Simpson wrote:

Noting for the record that http://nagios.wikimedia.org/ has been reporting
MEMCACHED CRITICAL - Can not connect to 10.0.2.159:11000 (Connection refused)
for some time now....

William.Allen.Simpson wrote:

peaks at same time each day

Comparing 07-19 and 07-20, the first CPU peak falls at the same time on
both days. However, there is a 07-20 network peak at the same time as the
second 07-19 CPU peak, indicating some kind of regular process, too.

Attached:

2009-0720-bart.CPU.graph.php.png (168×397 px, 15 KB)

wikimedia-bugzilla wrote:

Dunno if this helps, but I've noticed this problem only on these pages so far (I look at a lot of Wikipedia articles):
https://secure.wikimedia.org/wikipedia/en/wiki/Barack_Obama
https://secure.wikimedia.org/wikipedia/en/wiki/Barack
https://secure.wikimedia.org/wikipedia/en/wiki/Obama

I'm accessing Wikipedia from New Zealand. The pages seem to be perpetually inaccessible (a few days so far). Of course the non-secure pages work fine.

wikimedia-bugzilla wrote:

The above three pages are still inaccessible for me. Also I've found another:
https://secure.wikimedia.org/wikipedia/en/wiki/9/11

https://secure.wikimedia.org/wikipedia/en/wiki/Obama

The first time I tried to access it I got a 502 error about the proxy not being able to read. The second time, it went through rather quickly. Perhaps the parser cache is separate for secure and everything else, and that page just takes so insanely long to render that it times out(?)

(In reply to comment #10)

Some more: https://secure.wikimedia.org/wikipedia/en/wiki/World_War_II
https://secure.wikimedia.org/wikipedia/en/wiki/World_War_2
https://secure.wikimedia.org/wikipedia/en/wiki/World_war_2
https://secure.wikimedia.org/wikipedia/en/wiki/WWII
https://secure.wikimedia.org/wikipedia/en/wiki/Ww2
https://secure.wikimedia.org/wikipedia/en/wiki/WW2

For me too. All return "502 Proxy Error"
Proxy Error

The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request GET /wikipedia/en/wiki/World_War_II.

Reason: Error reading from remote server

Apache/2.2.8 (Ubuntu) mod_fastcgi/2.4.6 PHP/5.2.4-2ubuntu5.12wm1 with Suhosin-Patch mod_ssl/2.2.8 OpenSSL/0.9.8g Server at secure.wikimedia.org Port 443

Statuscode:502 Bad Gateway
Connection:Keep-Alive
Content-Length:616
Content-Type:text/html; charset=iso-8859-1
Date:Wed, 02 Mar 2011 07:33:01 GMT
Keep-Alive:timeout=1, max=100

That again appears to be the proxy timing out. I only get the 502 when the page was not served from the parser cache. If it's served from the parser cache, it works fine from secure.

Probably the timeout on the proxy server needs to be increased (or someone could make the parser be super fast, but that's a little more difficult ;)
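
The theory above can be written down as a toy model. This is only a sketch of the reasoning, not of the actual proxy: the function, its parameters, and the render and timeout figures are all hypothetical, and the real timeout on the proxy is not stated in this thread.

```python
def proxy_status(render_seconds, proxy_timeout_seconds, in_parser_cache):
    """Hypothetical model: HTTP status the proxy would return for one page view."""
    # A parser-cache hit skips rendering, so the backend answers almost at once.
    backend_seconds = 0.0 if in_parser_cache else render_seconds
    # If the backend needs longer than the proxy is willing to wait,
    # the proxy gives up and reports 502 to the client.
    return 502 if backend_seconds > proxy_timeout_seconds else 200


# Invented numbers: a page that takes 45 s to render behind a 30 s proxy timeout.
uncached = proxy_status(45, 30, in_parser_cache=False)
cached = proxy_status(45, 30, in_parser_cache=True)
longer_timeout = proxy_status(45, 60, in_parser_cache=False)
print(uncached, cached, longer_timeout)
```

Under this model the same page fails when uncached, succeeds once cached, and also succeeds uncached if the timeout is raised above the render time, which is the behaviour described above.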

William.Allen.Simpson wrote:

[tried sending this via email, trying again]

We've seen these Proxy errors before with the server overloaded. It's
currently on singer. But I don't see (via Ganglia) the huge cpu spikes
we used to have on bart with nagios.

However, I was just going to post to wikitech that I've been seeing
other problems from secure lately, too:

  • Edits don't seem to flush the cache properly. After noticing this
    weekend, I had to action=flush a dozen pages by hand to see my article
    and category changes reflected via normal access.

  • It's losing the user name on edits, showing up with IP instead. I'm
    not sure this wasn't due to my user error somehow -- but it was fairly
    frequent back in the old overloaded days, hadn't happened to me for a
    couple of years, and just showed up again yesterday!

(In reply to comment #13)

[tried sending this via email, trying again]

yeah, trying to reply by email to bugmail doesn't work.

We've seen these Proxy errors before with the server overloaded. It's
currently on singer. But I don't see (via Ganglia) the huge cpu spikes
we used to have on bart with nagios.

If my theory is correct, it's not caused by load.

[..]

  • Edits don't seem to flush the cache properly. After noticing this
    weekend, I had to action=flush a dozen pages by hand to see my article
    and category changes reflected via normal access.

There were recently some issues with the job queue (bug 27727); this may be related to that. (That wouldn't be secure-specific, though.)

  • It's losing the user name on edits, showing up with IP instead. I'm
    not sure this wasn't due to my user error somehow -- but it was fairly
    frequent back in the old overloaded days, hadn't happened to me for a
    couple of years, and just showed up again yesterday!

That's a more interesting issue, I have no idea what could cause that.

Giving half of Fred's old bugs to Ashar since I trust him to get it done or reassign if he doesn't have time.

Resetting this back to wikibugs, and almost willing to close it.

It appears this was assigned back before we had status.* and the related tools, and that is what was wanted. Getting it included there is now Bug 27912.

And the 502 errors are also a separate bug (bug 25271), which could probably get duped either way.

Assigning back to me. Pending actions:

  • make sure it is monitored by nagios and ganglia
  • check whether the peaks have disappeared; if not, either
    • move the process generating them elsewhere
    • move secure.w.o somewhere else

William.Allen.Simpson wrote:

Regarding comment 16, I had already filed Bug 19588 on the Proxy errors, but Brion marked it as a duplicate of this bug (back in comment 2). So maybe they should be split again?

wikimedia-bugzilla wrote:

Merge with bug 25271?

The Wikimedia Foundation operations team is rebuilding the HTTPS system from scratch, which will solve this bug for good.

HTTPS was enabled on test some days ago:
http://blog.wikimedia.org/2011/07/19/protocol-relative-urls-enabled-on-test-wikipedia-org/

Therefore, this bug will not be fixed since the architecture is going to be replaced.

Sounds more like "almost FIXED" than WONTFIX. :)