Page MenuHomePhabricator

Investigate CAS performance
Closed, ResolvedPublic

Description

Apereo cas at idp.wikimedia.org seems to take anywhere from ~5 to 20 seconds to login, this seems a bit to long and we should seek to lower it

Related Objects

StatusSubtypeAssignedTask
OpenNone
Resolvedjbond

Event Timeline

I have turned on trace level logging and the main gap is between the following two log entries

2020-02-24 16:32:37,560 DEBUG [org.apereo.cas.util.scripting.ScriptingUtils] - <Preparing constructor arguments [[null, SimplePrincipal(id=jbond, attributes={cn=[Jbond], memberOf=[cn=project-puppet-diffs,ou=groups,dc=wikimedia,dc=org, cn=mfa-enabled,ou=groups,dc=wikimedia,dc=org, cn=project-deployment-prep,ou=groups,dc=wikimedia,dc=org, cn=project-cloudinfra,ou=groups,dc=wikimedia,dc=org, cn=project-tools,ou=groups,dc=wikimedia,dc=org, cn=project-automation-framework,ou=groups,dc=wikimedia,dc=org, cn=ops,ou=groups,dc=wikimedia,dc=org, cn=project-sso,ou=groups,dc=wikimedia,dc=org, cn=wmf,ou=groups,dc=wikimedia,dc=org, cn=project-bastion,ou=groups,dc=wikimedia,dc=org, cn=project-monitoring,ou=groups,dc=wikimedia,dc=org, cn=project-puppet,ou=groups,dc=wikimedia,dc=org, cn=project-testlabs,ou=groups,dc=wikimedia,dc=org, cn=project-traffic,ou=groups,dc=wikimedia,dc=org], mail=[jbond@wikimedia.org], mfa-method=[mfa-u2f]}), [AbstractMultifactorAuthenticationProvider(bypassEvaluator=org.apereo.cas.authentication.bypass.DefaultChainingMultifactorAuthenticationBypassProvider@4fbc516f, failureModeEvaluator=org.apereo.cas.authentication.DefaultMultifactorAuthenticationFailureModeEvaluator@4040ccae, failureMode=UNDEFINED, id=mfa-u2f, order=0)], org.apache.logging.slf4j.Log4jLogger@62e77ffe]] for resource [URL [file:/etc/cas/global_principal_attribute_predicate.groovy]]>




2020-02-24 16:32:46,845 DEBUG [org.apereo.cas.authentication.trigger.PredicatedPrincipalAttributeMultifactorAuthenticationTrigger] - <Created predicate instance [PredicateLDAP] from [URL [file:/etc/cas/global_principal_attribute_predicate.groovy]] to filter multifactor authentication providers [[AbstractMultifactorAuthenticationProvider(bypassEvaluator=org.apereo.cas.authentication.bypass.DefaultChainingMultifactorAuthenticationBypassProvider@4fbc516f, failureModeEvaluator=org.apereo.cas.authentication.DefaultMultifactorAuthenticationFailureModeEvaluator@4040ccae, failureMode=UNDEFINED, id=mfa-u2f, order=0)]]>
2020-02-24 16:32:46,846 DEBUG [org.apereo.cas.authentication.trigger.PredicatedPrincipalAttributeMultifactorAuthenticationTrigger] - <mfa-method detected, testing value [[mfa-u2f]] against provider [AbstractMultifactorAuthenticationProvider(bypassEvaluator=org.apereo.cas.authentication.bypass.DefaultChainingMultifactorAuthenticationBypassProvider@4fbc516f, failureModeEvaluator=org.apereo.cas.authentication.DefaultMultifactorAuthenticationFailureModeEvaluator@4040ccae, failureMode=UNDEFINED, id=mfa-u2f, order=0)] with service [null]>

unfortunatly nothing is logged between theses to events which may highlight where things have been stuck

Other things i have tested:

  • i monitored tcpdump port 636 and traffic the lag is not the ldap server
  • Attached VisualVM to the jmx/jstatd service but was not able to see anything obvious (im not an expert here so could have easily missed something)

Its also worth noting that when you restart the cas service the first time you visit any page it takes a bit longer then normal to render (assuming it is cached after the first hit)

there seem to be a few resources which take a bit of time so we may be able to set some cache options* to help however this is *not* the main bottleneck

image.png (260×1 px, 37 KB)

I checked and most of zxcvbn.js is a password list probably used to dissuade common passwords. We could probably update this so it is much smaller, as well as changing the cache control headers

*options are currently: cache-control: no-cache, no-store, max-age=0, must-revalidate

@jbond: Assuming this task is about SRE, hence adding that project tag.

jbond triaged this task as Medium priority.Feb 25 2020, 9:57 AM

Change 574808 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/software/cas-overlay-template@master] templates: update so that CSS and JS files come from CF CDN

https://gerrit.wikimedia.org/r/574808

Change 574884 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/software/cas-overlay-template@master] templates: update so that CSS and JS files come from CF CDN

https://gerrit.wikimedia.org/r/574884

Change 574808 abandoned by Jbond:
templates: update so that CSS and JS files come from CF CDN

Reason:
use 574884 as it provides istory

https://gerrit.wikimedia.org/r/574808

Change 574884 merged by Jbond:
[operations/software/cas-overlay-template@master] templates: update so that CSS and JS files come from CF CDN

https://gerrit.wikimedia.org/r/574884

initial page load is a bit better

image.png (315×1 px, 54 KB)

Change 575215 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/software/cas-overlay-template@master] themes: don't use externally hosted js/css files

https://gerrit.wikimedia.org/r/575215

Change 575215 merged by Jbond:
[operations/software/cas-overlay-template@master] themes: don't use externally hosted js/css files

https://gerrit.wikimedia.org/r/575215

I dug into this a little further, but no luck yet. Some further findings:

Opening the login page took ~ 20s from Firefox after a restart of CAS. There's two significant, long gaps (where nothing is logged):

2020-03-09 16:31:21,767 DEBUG [org.springframework.webflow.engine.impl.FlowExecutionImplFactory] - <Creating new execution of 'login'>

[ mysterious 8 seconds gap where nothing gets logged ]

2020-03-09 16:31:29,379 DEBUG [org.springframework.webflow.engine.impl.FlowExecutionImpl] - <Starting in org.springframework.webflow.mvc.servlet.MvcExternalContext@7476eb07 with input null>
2020-03-09 16:31:29,505 DEBUG [org.springframework.security.web.context.SecurityContextPersistenceFilter] - <SecurityContextHolder now cleared, as request processing completed>

[ mysterious 7 seconds gap where nothing gets logged ]

2020-03-09 16:31:36,470 DEBUG [org.springframework.webflow.engine.Flow] - <Creating [FlowVariable@7d7447fc name = 'credential', valueFactory = [BeanFactoryVariableValueFactory@776ff643 type = UsernamePasswordCredential]]>

[ This does the grunt work of the actual request processing with lots of activity, for which the 4 seconds in total seem in line with a Java web application ]

2020-03-09 16:31:40,928 DEBUG [org.springframework.web.servlet.DispatcherServlet] - <Completed 200 OK>
2020-03-09 16:31:40,928 DEBUG [org.springframework.security.web.access.ExceptionTranslationFilter] - <Chain processed normally>

The second delay happens within Spring:

2020-03-10 12:13:53,660 DEBUG [org.springframework.binding.mapping.impl.DefaultMapper] - <Beginning mapping between source [org.springframework.webflow.core.collection.LocalParameterMap] and target [org.apereo.cas.authentication.credential.UsernamePasswordCredential]>

(nothing else logged)

2020-03-10 12:14:01,483 DEBUG [org.springframework.binding.mapping.impl.DefaultMapping] - <Adding mapping result [Success@5f30d5 mapping = parameter:'username' -> username, code = 'success', error = false, originalValue = 'Muehlenhoff', mappedValue = 'Muehlenhoff']>

Code in Spring Webflow for the above:

public MappingResults map(Object source, Object target) {
        if (logger.isDebugEnabled()) {
                logger.debug("Beginning mapping between source [" + source.getClass().getName() + "] and target ["
                                + target.getClass().getName() + "]");
        }
        DefaultMappingContext context = new DefaultMappingContext(source, target);
        for (DefaultMapping mapping : mappings) {
                mapping.map(context);
        }
        MappingResults results = context.getMappingResults();
        if (logger.isDebugEnabled()) {
                logger.debug("Completing mapping between source [" + source.getClass().getName() + "] and target ["
                                + target.getClass().getName() + "]; total mappings = " + results.getAllResults().size()
                                + "; total errors = " + results.getErrorResults().size());
        }
        return results;
}

Change 582803 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/software/cas-overlay-template@master] FasterXML: add jackson-module-kotlin

https://gerrit.wikimedia.org/r/582803

Change 582803 merged by Jbond:
[operations/software/cas-overlay-template@master] FasterXML: add jackson-module-kotlin

https://gerrit.wikimedia.org/r/582803

Currently seeing a 1.1 minute (!) TTFB for https://idp.wikimedia.org/logout followed by a 504 response code with body "upstream request timeout"

Currently seeing a 1.1 minute (!) TTFB for https://idp.wikimedia.org/logout followed by a 504 response code with body "upstream request timeout"

Do you see this every time you logout. I have noticed that after cas is the first person to hit a page pays a large price in the request time ~20-30 seconds however subsequent requests are quicker.

I have also noticed that the logout page also suffers from the issue you describe. i.e. it takes a long time then issues a 5xx (not checked if its always 504 but assume it is). however it dose actually log the user out further upon login and back out again the logout process takes < 5 seconds. Can you confirm if this seems to match your experience.

Going forward we plan to install and manage tomcat via apt. There is some anecdotal evidence this will help with the large lag when login in or authorising an application and i am also hoping that this will allow use to pre-generate the servlet pages to reduce the initial penalty on the first render. If this is not possible iill add a script to preform a login and logout after a service restart

Currently seeing a 1.1 minute (!) TTFB for https://idp.wikimedia.org/logout followed by a 504 response code with body "upstream request timeout"

Do you see this every time you logout. I have noticed that after cas is the first person to hit a page pays a large price in the request time ~20-30 seconds however subsequent requests are quicker.

I have also noticed that the logout page also suffers from the issue you describe. i.e. it takes a long time then issues a 5xx (not checked if its always 504 but assume it is). however it dose actually log the user out further upon login and back out again the logout process takes < 5 seconds. Can you confirm if this seems to match your experience.

No, I haven't seen it every time -- although the last time I tried to use the logout page was a couple weeks ago, it worked fine then (5-10 seconds latency and a success response).

Logging out now also worked fine.

Logging out now also worked fine.

ack thanks please add a comment here f you see it again and we should be able to validate if it is only caused by the restart

Change 582835 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/puppet@production] apereo_cas: fix lag

https://gerrit.wikimedia.org/r/582835

Change 582835 merged by Jbond:
[operations/puppet@production] apereo_cas: fix lag

https://gerrit.wikimedia.org/r/582835

ok so i don't want to get too excited but it appears to me that the last patch may have fixed the issue with lag. The issues relating to rendering the first hit and the logout page causing a 503 is still problematic however i think the performance is usable now.

The fix was to add the following line to cas.properties

spring.autoconfigure.exclude=org.springframework.boot.autoconfigure.web.embedded.EmbeddedWebServerFactoryCustomizerAutoConfiguration

I'm unsrue why this fixed the issue however i cam across the suggestion while attempting to configure a standalone war to work with tomcat as provided by Debian and hitting an error regarding a missing method

Change 584595 had a related patch set uploaded (by Jbond; owner: John Bond):
[operations/software/cas-overlay-template@master] tomcatversion: correct tomcat version to 9.0.30

https://gerrit.wikimedia.org/r/584595

Change 584595 merged by Jbond:
[operations/software/cas-overlay-template@master] tomcatversion: correct tomcat version to 9.0.30

https://gerrit.wikimedia.org/r/584595

jbond claimed this task.
jbond added a subscriber: Zbyszko.

I asked @Zbyszko to look at this and they where unable to get to the root cause however as we now use an external tomcat instance this issue no longer affects us. As such im resolving.