Page MenuHomePhabricator

Can we make the quibble webserver used in CI log the time it has taken to execute calls to the API?
Closed, ResolvedPublic

Description

Can we make the webserver used in quibble CI jobs log the time it has taken to execute calls to the API?

Currently we have various outputs such as:

23:15:31 INFO:backend.PhpWebserver:[Thu Feb 25 23:15:31 2021] 127.0.0.1:39412 [200]: //api.php?format=json
23:15:31 INFO:backend.PhpWebserver:[Thu Feb 25 23:15:31 2021] 127.0.0.1:39414 [200]: //index.php?title=Talk%3AData-bridge-test-page-0.5750508401188479-I%C3%B1t%C3%ABrn%C3%A2ti%C3%B4n%C3%A0liz%C3%A6ti%C3%B8n
23:15:32 INFO:backend.PhpWebserver:[Thu Feb 25 23:15:32 2021] 127.0.0.1:39428 [200]: /load.php?lang=en&modules=jquery%2Coojs-ui-core%2Coojs-ui-widgets%7Cjquery.ui&skin=vector&version=1e40s
23:15:33 INFO:backend.PhpWebserver:[Thu Feb 25 23:15:33 2021] 127.0.0.1:39438 [200]: /api.php?action=query&format=json&titles=Talk%3AData-bridge-test-page-0.5750508401188479-I%C3%B1t%C3%ABrn%C3%A2ti%C3%B4n%C3%A0liz%C3%A6ti%C3%B8n&prop=info&meta=siteinfo&intestactions=edit&intestactionsdetail=full&siprop=restrictions&errorformat=raw&formatversion=2
23:15:33 INFO:backend.PhpWebserver:[Thu Feb 25 23:15:33 2021] 127.0.0.1:39440 [200]: /api.php?action=wbgetentities&format=json&props=labels%7Cdatatype%7Cinfo%7Cclaims&ids=P8%7CQ25&languages=en&languagefallback=true&formatversion=2
23:15:33 INFO:backend.PhpWebserver:[Thu Feb 25 23:15:33 2021] 127.0.0.1:39442 [200]: /api.php?action=query&format=json&meta=wbdatabridgeconfig%7Csiteinfo&formatversion=2&errorformat=raw&titles=Item%3AQ25&prop=info&intestactions=edit&intestactionsdetail=full&siprop=restrictions
23:15:37 INFO:backend.PhpWebserver:[Thu Feb 25 23:15:37 2021] 127.0.0.1:39490 [200]: //api.php?format=json

Seemingly sometimes such API calls take a long amount of time and cause tests to fail. (I created this after needing a deep dive into T273327)
It would be good to be able to identify which of the API calls actually cause the issues so we could look for trends and tackle issues if spotted.

I did some searching and couldn't find out if this was possible, so opening this ticket to see if anyone else knows / has ideas.

Another alternative would be to make the JS log and output timing of network requests somehow?

Another alternative would be to use apache, which allows more flexible logging options

Event Timeline

Those messages come from php -S stderr being relayed. PHP does not seem to have a way to change the output format (such as adding the request time), maybe later PHP version do?

The queries are served serially which is not ideal. There is some project to move to Apache (and thus have multithread serving) which would make the API more responsive. But that opens some other issues such as race conditions between queries (and potentially locking issue with Sqlite).

If those requests originate from Chromium / Selenium / devtools, maybe there is a way to have the browser session to capture all the requests in a .har which would could attach to the build. One can then potentially download it and replay the file locally? That is what we have in the Fresnel jobs, tests are run for the parent patch and the current patch, performances are recorded in .har which can then be compared. Example artifacts under: https://integration.wikimedia.org/ci/job/mediawiki-fresnel-patch-docker/lastSuccessfulBuild/artifact/log/fresnel_records/ , one should be able to drag'n drop one of the .har to the Chromium development console and that will load the view.

I however have no idea how to enable the functionality via Selenium / devtools.

Those messages come from php -S stderr being relayed. PHP does not seem to have a way to change the output format (such as adding the request time), maybe later PHP version do?

I did lots of searching and couldn't find any :(

The queries are served serially which is not ideal.

I didn't know that :(

There is some project to move to Apache (and thus have multithread serving)

Ticket?

But that opens some other issues such as race conditions between queries (and potentially locking issue with Sqlite).

Hmm, where is that perceived to be an issue?
I don't think we would run into anything like that for our tests for example.
Setup is generally done serially anyway, the current pinch point seems to be API calls that happen when the UI is actually loaded, and these seem to cause many timeouts! :(

If those requests originate from Chromium / Selenium / devtools, maybe there is a way to have the browser session to capture all the requests in a .har which would could attach to the build. One can then potentially download it and replay the file locally? That is what we have in the Fresnel jobs, tests are run for the parent patch and the current patch, performances are recorded in .har which can then be compared. Example artifacts under: https://integration.wikimedia.org/ci/job/mediawiki-fresnel-patch-docker/lastSuccessfulBuild/artifact/log/fresnel_records/ , one should be able to drag'n drop one of the .har to the Chromium development console and that will load the view.

That sounds quite nice.
We looked into if we could make these web requests just be output from the JS side of things in CI / test output before but drew a blank.

I did lots of searching and couldn't find any :(

Maybe in php8.0 source code there could be a hint. Else add the new feature by writing more Zend C dialect :-\

The queries are served serially which is not ideal.

I didn't know that :(

Kosta found that out a while back. php 7.4 does have an option to support serving requests across multiple threads but:

  • Quibble doesn't have that support, would be all about checking the PHP version then enable the relevant option
  • bunch of workload on our CI is on php 7.2

    >> There is some project to move to Apache (and thus have multithread serving)

Ticket?

T225218 for Quibble support , we even have a CI image. The idea is to then add some non-voting / experimental jobs: T276428 which would let us surface issues in the tests and get them addressed. Then with COVID and lockdown, that dynamic is a bit frozen :-/

But that opens some other issues such as race conditions between queries (and potentially locking issue with Sqlite).

Hmm, where is that perceived to be an issue?
I don't think we would run into anything like that for our tests for example.
Setup is generally done serially anyway, the current pinch point seems to be API calls that happen when the UI is actually loaded, and these seem to cause many timeouts! :(

T199393#6702427 is an example, and I am pretty sure we had issue when multiple API calls triggered a localisation cache rebuild which raced for a lock on the sqlite database (got solved by always generating the loalisation cache when installing mediawiki). There must be a bunch of other related races. But we will need to have the jobs and run them to surface them.

If those requests originate from Chromium / Selenium / devtools, maybe there is a way to have the browser session to capture all the requests in a .har which would could attach to the build. One can then potentially download it and replay the file locally? That is what we have in the Fresnel jobs, tests are run for the parent patch and the current patch, performances are recorded in .har which can then be compared. Example artifacts under: https://integration.wikimedia.org/ci/job/mediawiki-fresnel-patch-docker/lastSuccessfulBuild/artifact/log/fresnel_records/ , one should be able to drag'n drop one of the .har to the Chromium development console and that will load the view.

That sounds quite nice.
We looked into if we could make these web requests just be output from the JS side of things in CI / test output before but drew a blank.

The repository is https://gerrit.wikimedia.org/r/performance/fresnel . That uses browser Javascript API exposed via the performance variable:

function browserCode() {
    /* eslint-env browser */
    return performance.getEntriesByType( 'resource' )
        .concat( performance.getEntriesByType( 'navigation' ) )
        .map( ( entry ) => entry.toJSON() );
}

I would assume webdriver.io has a similar system?

Timo would know more for sure.

I don't know the answer to your question, but if you know a certain test is having issues then I would look into running it locally with a $wgProfiler enabled and going from there.

Or do you want to collect aggregates so that you can see if the patch incurs a performance cost, like Fresnel does?

I don't know the answer to your question, but if you know a certain test is having issues then I would look into running it locally with a $wgProfiler enabled and going from there.

Or do you want to collect aggregates so that you can see if the patch incurs a performance cost, like Fresnel does?

I mean, the main goal would be able to easily spot issues in CI.
But it sounds like most of this will just be down to the php dev server being used.
The particular app / ui loading here makes 3 API calls, but all of that combined with the load.php, index.php and other api.php calls just lead to these 3 API calls not even returning until after the test has already timed out.
But if these are being blocked by all of the other requests that started first completing first it all starts to make sense.

I might investigate rolling out some non voting jobs in line with T276428 for wikibase testing this apache thing

@Addshore rolled the Quibble apache jobs today. There is a non voting one running the Selenium jobs for Wikibase. I was quite surprised to see the Apache access log being attached to the build \o/

So we can do some tuning to the Apache log. In integration/config we have:

git grep -n Log dockerfiles/quibble-stretch-php72-apache
dockerfiles/quibble-stretch-php72-apache/apache/000-default.conf:12:    ErrorLog /workspace/log/apache-error.log
dockerfiles/quibble-stretch-php72-apache/apache/000-default.conf:13:    CustomLog /workspace/log/apache-access.log combined
dockerfiles/quibble-stretch-php72-apache/apache/apache2.conf:10:ErrorLog /tmp/error.log
dockerfiles/quibble-stretch-php72-apache/apache/apache2.conf:11:LogLevel warn
dockerfiles/quibble-stretch-php72-apache/apache/apache2.conf:38:LogFormat "%v:%p %h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" vhost_combined
dockerfiles/quibble-stretch-php72-apache/apache/apache2.conf:39:LogFormat "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" combined
dockerfiles/quibble-stretch-php72-apache/apache/apache2.conf:40:LogFormat "%h %l %u %t \"%r\" %>s %O" common
dockerfiles/quibble-stretch-php72-apache/apache/apache2.conf:41:LogFormat "%{Referer}i -> %U" referer
dockerfiles/quibble-stretch-php72-apache/apache/apache2.conf:42:LogFormat "%{User-agent}i" agent

Wikimedia went with a standard log format which is tab separated. The high level overview is https://wikitech.wikimedia.org/wiki/Apache_log_format . It is defined in Puppet modules/httpd/files/defaults.conf:

LogFormat "%{%Y-%m-%dT%H:%M:%S}t  %D      %a      %R/%>s  %B      %m      http://%{Host}i%U%q     -       %{Content-Type}o        %{Referer}i     %{X-Forwarded-For}i     %{User-Agent}i  %{Accept-Language}i     %{X-Analytics}i %u      %{User}o        %a      %{X-Request-Id}i        %{X-Client-IP}i" wmf

The available fields are described at: https://httpd.apache.org/docs/2.4/mod/mod_log_config.html#formats . Guess we can use the wmf format as a basis and strip from it anything that is not relevant.

%{%Y-%m-%dT%H:%M:%S}tDateKeep
%DTime taken to server the request, in microsecondsKeep, that is what @Addshore wants!
%aClient IPDrop
%R/%>sHandlerDrop?
%BSize of response in bytesKeep
%mRequest methodKeep
http://%{Host}i%U%qURLKeep
%{Content-Type}o-Keep
%{Referer}i-Keep?
%{X-Forwarded-For}i-Drop
%{User-Agent}i-Keep
%{Accept-Language}i-Keep
%{X-Analytics}i %u-Drop, wmf specific
%{User}o-Drop, auth related
%aOriginal client ipDrop
%{X-Request-Id}i-Drop, wmf specific
%{X-Client-IP}i"Original client IPDrop

So I think we can go with a tab separated:

apache/apache2.conf
LogFormat %{%Y-%m-%dT%H:%M:%S}t   %D  %B  %m  http://%{Host}i%U%q %{Content-Type}o    %{Referer}i %{User-Agent}i  %{Accept-Language}i quibble

And:

apache/000-default.conf
CustomLog /workspace/log/apache-access.log quibble

Change 671207 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[integration/config@master] dockerfiles: [quibble-stretch-php72-apache] Add custom log to time requests

https://gerrit.wikimedia.org/r/671207

%D Time taken to server the request, in microseconds Keep, that is what @Addshore wants!

Yup!

Addshore renamed this task from Can we make the PHP inbuilt webserver used in CI log the time it has taken to execute calls to the API? to Can we make the quibble webserver used in CI log the time it has taken to execute calls to the API?.Mar 15 2021, 11:09 PM
Addshore updated the task description. (Show Details)

Change 671207 merged by jenkins-bot:
[integration/config@master] dockerfiles: [quibble-stretch-php72-apache] Add custom log to time requests

https://gerrit.wikimedia.org/r/671207

Change 672759 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[integration/config@master] jjb: Switch apache jobs to 0.0.46-s1 with request timing logging

https://gerrit.wikimedia.org/r/672759

Change 672759 merged by jenkins-bot:
[integration/config@master] jjb: Switch apache jobs to 0.0.46-s1 with request timing logging

https://gerrit.wikimedia.org/r/672759

Since Jenkins artifacts tend to disappear:

127.0.0.1 - - [16/Mar/2021:16:54:37 +0000] "GET //index.php?title=Special:JavaScriptTest/qunit/export HTTP/1.1" 200 21740 "http://localhost:9876/context.html" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/73.0.3683.75 Safari/537.36"

2021-03-16T17:20:54 1755295 21206 GET http://127.0.0.1:9413/index.php?title=Special:JavaScriptTest/qunit/export text/javascript http://localhost:9876/context.html Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/73.0.3683.75 Safari/537.36 -

The last field showing up with - is for Accept-Language

The 2nd and 3rd fields are all integers and make it a bit confusing as to what is what. They are respectively the request service time in micro seconds and the response size. I had two ideas to enhance the readability:

  • adding a suffix in CustomLog to each of the the fields: %D us\t%B bytes\t
  • have supervisord to create the Apache log file with some headers, but that would surely drift whenever CustomLog is amended later on leading to more confusion later

I am not sure how fixed assets ends up served in less than a tenth of a miliseconds. That is super fast ;) Then computers are faster than us so maybe that is normal.

Regardless, thank you @Jdforrester-WMF to have taken care of the image build, jobs update and verifying it works as expected!