Periodic 500s from piwik.wikimedia.org
Closed, Resolved · Public

Description

I noticed a periodic spike in 500s from Varnish; it turns out Piwik quite regularly spews 500s back to clients.

The root cause seems to be exhausted MySQL connections; from /var/log/apache2/error.log:

[Wed Jan 04 01:43:15.171605 2017] [:error] [pid 22875] [client 10.64.32.97:5493] Error in Piwik (tracker): SQLSTATE[08004] [1040] Too many connections
[Wed Jan 04 01:43:15.381398 2017] [:error] [pid 22875] [client 10.64.0.95:32831] Error in Piwik (tracker): SQLSTATE[08004] [1040] Too many connections
[Wed Jan 04 01:43:15.421769 2017] [:error] [pid 22875] [client 10.64.0.98:42041] Error in Piwik (tracker): SQLSTATE[08004] [1040] Too many connections

Sometimes there are also errors similar to this:

[Wed Jan 04 02:01:03.905886 2017] [:error] [pid 25107] [client 10.64.32.97:12162] Error in Piwik (tracker): Error query: SQLSTATE[40001]: Serialization failure: 1213 Deadlock found when trying to get lock; try restarting transaction In query: UPDATE piwik_log_visit SET idvisitor = ?, visit_last_action_time = ?,  visit_total_actions = visit_total_actions + 1 ,  visit_total_events = visit_total_events + 1 , visit_total_time = ?, custom_var_k1 = ?, custom_var_v1 = ?, custom_var_k2 = ?, custom_var_v2 = ?, custom_var_k3 = ?, custom_var_v3 = ? WHERE idsite = ? AND idvisit = ? Parameters: array (   0 =>
Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Jan 4 2017, 2:04 AM
elukey added a subscriber: elukey. · Jan 11 2017, 6:07 PM
Nuria edited projects, added Analytics-Kanban; removed Analytics. · Jan 23 2017, 4:57 PM

Yep, this is definitely still happening. I took a look, and what's happening is that every time someone visits a page instrumented with Piwik, Piwik opens a new connection instead of re-using connections. Upstream closed a bug about this roughly 7 years ago, basically saying there's not much they can do [1], and new reports of the same issue seem to go unanswered [2]. I think Piwik is just not made for very high traffic, and maybe the stats we get from it are flawed. It certainly needs a deeper look; this problem is real.

We could kick the can down the road by increasing max_connections on MySQL, which is currently at the default of 151. That seems kind of low in general, but it also seems really high for how I think Piwik should work (a quick way to check current usage is sketched below the links). The other problem, the deadlock one, also seems to have gone unanswered for a few years [3].

[1] https://issues.piwik.org/1760
[2] https://forum.piwik.org/t/error-in-piwik-tracker-too-many-connections/15365
[3] https://github.com/piwik/piwik/issues/6398
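
For reference, a quick way to check how close the server is to that limit (the exact numbers on bohrium will of course differ; this is only an illustrative sketch):

# Compare the configured limit with the observed peak and current usage
sudo mysql -e "SHOW GLOBAL VARIABLES LIKE 'max_connections';"
sudo mysql -e "SHOW GLOBAL STATUS LIKE 'Max_used_connections';"
sudo mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected';"
# Kicking the can down the road (takes effect immediately, not persisted across restarts):
sudo mysql -e "SET GLOBAL max_connections = 300;"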

Milimetric moved this task from Next Up to In Progress on the Analytics-Kanban board.

Assuming this is MySQL, this is something I can indeed help with relatively easily without touching code. As a reminder, I don't only handle MySQL for MediaWiki and labs; I also handle dozens of other misc db backends for other services, many of them with poor database client code and connection handling.

The only thing I would ask in exchange is some concessions in the draconian "enforcing standards and quality" department, by adding proper monitoring and other improvements :-)

This is a humble offer for help...

Thanks, @jcrespo, I didn't see the ping, it looks like Phabricator had some notification issues.

The idea with this service is that it wouldn't take time away from ops, so I hesitate to get you involved. If we wanted a monitored, scalable tool, we would revisit the idea of using piwik.

I'll try to troubleshoot this and see if there's an easy config change we can make to keep it hobbling along. I appreciate any links or research you think might be useful.

jcrespo added a comment. · Edited · Jan 30 2017, 6:14 PM

I recently packaged and puppetized ProxySQL: https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/proxysql/manifests/init.pp It allows multiplexing connections by creating queues in front of the servers. If there is only one client server, it can be installed there as middleware and behave like the Java MySQL connector.

That functionality is also part of the MariaDB server itself: on mediawiki production we have a max_connections limit of 10000 open connections, but only allow 36 concurrent connections running at the same time thanks to its own pool of connections. https://mariadb.com/kb/en/mariadb/thread-pool-in-mariadb/ While it is designed for scalability rather than reliability, it avoids many issues we have hit with mediawiki bugs from time to time, adding only a negligible amount of latency while increasing throughput by orders of magnitude.
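
For the record, enabling the MariaDB thread pool is only a couple of lines of server configuration. This is just a sketch with illustrative values, not the actual production settings:

# /etc/mysql/my.cnf (MariaDB; illustrative values only)
[mysqld]
thread_handling  = pool-of-threads
thread_pool_size = 36        # roughly how many queries are allowed to run concurrently
max_connections  = 10000     # many clients may stay connected, few run at the same time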

Just checked with Manuel whether thread_pool_size was available for MySQL 5.5, but it seems that it needs a proprietary extension to work.

I executed SET GLOBAL max_connections=300 as a test to see if things improve.

My personal feeling is that this project, even if experimental, would need at least some monitoring to figure out how well the system works. I'd love to have Jaime and Manuel involved in tuning Piwik, but as Dan was saying, the team seems to prefer a self-hosted solution for the moment.

Summary of today:

  • I followed https://github.com/piwik/piwik/issues/6398#issuecomment-91093146 and manually set bulk_requests_use_transaction=0 to fix an error showing up in the Apache error logs.
  • Disabled mod_deflate manually to rule out gzip encoding issues.
  • Focused on Varnish, and I am currently tracing this path: cp4004 frontend --> cp2012 backend --> cp1058 backend. On the cp2012 backend I found FetchError http first read error: EOF (https://varnish-cache.org/tips/varnishlog/fetcherror.html). This is not the last step before fetching from bohrium, since cp1058 is mentioned, so I just started a varnishlog | tee to get more info (the commands are sketched after this list).
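
For reference, the first two items boil down to small configuration changes, and the tracing is a varnishlog invocation; the snippets below are only a sketch of what was done (paths and filters are illustrative):

# Piwik: disable transactions for bulk tracking requests, per the upstream issue above
# (setting in config/config.ini.php):
#   [Tracker]
#   bulk_requests_use_transaction = 0

# Apache on bohrium: rule out gzip encoding
sudo a2dismod deflate && sudo service apache2 reload

# Varnish backend (e.g. cp2012): keep a copy of backend transactions that hit a fetch error
sudo varnishlog -b -q 'FetchError' | tee /tmp/piwik-fetcherrors.log
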
elukey triaged this task as "Normal" priority. · Feb 15 2017, 6:04 PM
elukey claimed this task.
elukey added a project: User-Elukey.

Details from cp1058:

--  VCL_call       BACKEND_FETCH
--  VCL_return     fetch
--  FetchError     no backend connection
--  Timestamp      Beresp: 1487218002.833983 0.000049 0.000049
--  Timestamp      Error: 1487218002.833998 0.000064 0.000015
--  BerespProtocol HTTP/1.1
--  BerespStatus   503
--  BerespReason   Service Unavailable
--  BerespReason   Backend fetch failed
--  BerespHeader   Date: Thu, 16 Feb 2017 04:06:42 GMT
--  BerespHeader   Server: Varnish
--  VCL_call       BACKEND_ERROR
--  BerespHeader   X-CDIS: int
--  BerespHeader   Content-Type: text/html; charset=utf-8
--  VCL_return     deliver
--  Storage        malloc Transient
--  ObjProtocol    HTTP/1.1
--  ObjStatus      503
--  ObjReason      Backend fetch failed
--  ObjHeader      Date: Thu, 16 Feb 2017 04:06:42 GMT
--  ObjHeader      Server: Varnish
--  ObjHeader      X-CDIS: int
--  ObjHeader      Content-Type: text/html; charset=utf-8
--  Length         1554
--  BereqAcct      0 0 0 0 0 0
--  End

This one is an example from the last Varnish backend before the fetch to bohrium, so following https://varnish-cache.org/tips/varnishlog/fetcherror.html the error might be due to the Varnish director considering bohrium unhealthy.

elukey@oxygen:/srv/log/webrequest$ grep piwik archive/5xx.json-20170216 | jq -r '[.http_status,.dt]| @csv' | awk -F":" '{print $1}'| sort | uniq -c
      1 "500","2017-02-15T06
      1 "500","2017-02-15T08
     37 "503","2017-02-15T06
     70 "503","2017-02-15T07
     70 "503","2017-02-15T08
     60 "503","2017-02-15T09
     64 "503","2017-02-15T10
     73 "503","2017-02-15T11
     65 "503","2017-02-15T12
    199 "503","2017-02-15T13
     52 "503","2017-02-15T14
     58 "503","2017-02-15T15
     59 "503","2017-02-15T16
     48 "503","2017-02-15T17
     68 "503","2017-02-15T18
    168 "503","2017-02-15T19
     93 "503","2017-02-15T20
     77 "503","2017-02-15T21
     68 "503","2017-02-15T22
     49 "503","2017-02-15T23
    191 "503","2017-02-16T00
     84 "503","2017-02-16T01
     55 "503","2017-02-16T02
    236 "503","2017-02-16T03
    146 "503","2017-02-16T04
     68 "503","2017-02-16T05
     49 "503","2017-02-16T06
elukey@oxygen:/srv/log/webrequest$ grep piwik 5xx.json | jq -r '[.http_status,.dt]| @csv' | awk -F":" '{print $1}'| sort | uniq -c
     47 "503","2017-02-16T06
     50 "503","2017-02-16T07
     59 "503","2017-02-16T08
     65 "503","2017-02-16T09
     56 "503","2017-02-16T10

As far as I know this problem has been ongoing for a long time..

elukey moved this task from Backlog to In Progress on the User-Elukey board. · Feb 16 2017, 11:58 AM

Yes, it has definitely been going on for a while, because we never really looked at it. We assumed the numbers Piwik reported made sense because they went up and down. Looking at how often the problem happens, for some of our low-traffic sites it could basically mean that every logging request returned an error.

ema moved this task from Triage to Caching on the Traffic board. · Feb 20 2017, 9:14 AM

Change 338953 had a related patch set uploaded (by Ema):
cache: allow specifying applayer backend probes and probe piwik

https://gerrit.wikimedia.org/r/338953

Mentioned in SAL (#wikimedia-operations) [2017-02-21T15:06:25Z] <elukey> Increased manually maximum httpd keep alive requests and timeout on bohrium (piwik) - T154558

elukey added a comment. · Edited · Feb 21 2017, 4:42 PM

Checked the oxygen logs, and the following UA pattern (it should be the iOS app) is the only one getting 503s during the past 21 days:

244268 "Wikipedia/1046 CFNetwork/808.3 Darwin/16.3.0"

What I'd need to figure out is whether the error we are seeing is due to an Apache/Piwik scalability issue or something else. I can't find anything in the Apache access logs on bohrium that might indicate a resource exhaustion issue, but I might be wrong.

Ping @Fjalapeno this UA is the iOS app, right? Any help you can provide Luca in finding out why we might be seeing 503s from just that UA?

Just adding a note: I am also seeing other similar UAs that follow the same pattern, but nothing else. I suspect I am seeing the distribution of app versions deployed on the various phones accessing our websites.

In the meantime, I'll try to figure out if the traffic volume is causing this.

@Milimetric having @JoeWalsh verify this for you

@Milimetric @elukey verified that this is the iOS app

JoeWalsh added a comment. · Edited · Feb 21 2017, 6:43 PM

@Milimetric this UA is from the iOS app. In testing locally, I didn't see any 503s. A potential cause of the surge of traffic could be that the current version of the app in the store has a 100% piwik event sampling rate, whereas previous versions shipped with a 10% sampling rate.

The change has been reverted to be a 10% sampling rate, but the fix is not released to the app store yet.

Change 339123 had a related patch set uploaded (by Elukey):
Add the Apache Prometheus exporter to bohrium

https://gerrit.wikimedia.org/r/339123

Change 339123 merged by Elukey:
Add the Apache Prometheus exporter to bohrium

https://gerrit.wikimedia.org/r/339123

jcrespo removed a subscriber: jcrespo. · Feb 22 2017, 8:58 AM

Today I tried to do the following:

  • Bumped the maximum number of Apache connections allowed (and hence the number of processes, due to mpm_prefork) from 150 to 250, also reducing the churn from process kill/re-spawn by setting MaxSpareServers to 150 (and MinSpareServers to 50); a sketch of the relevant settings is below the tcpdump output. Ganglia shows some interesting metrics with little spikes in traffic, and my suspicion was that Apache was not able to cope with bursts of requests. The number of 503s didn't change much after the change.
  • Tracked down TCP RSTs; nothing related to the caching hosts:
elukey@bohrium:~$ sudo tcpdump -vv 'tcp[tcpflags] & (tcp-rst) != 0'
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
09:47:55.787258 IP (tos 0x0, ttl 62, id 62255, offset 0, flags [DF], proto TCP (6), length 40)
    tegmen.wikimedia.org.33256 > bohrium.eqiad.wmnet.ssh: Flags [R], cksum 0x778a (correct), seq 1570161823, win 0, length 0
09:47:55.787758 IP (tos 0x0, ttl 62, id 62256, offset 0, flags [DF], proto TCP (6), length 40)
    tegmen.wikimedia.org.33256 > bohrium.eqiad.wmnet.ssh: Flags [R], cksum 0x778a (correct), seq 1570161823, win 0, length 0
09:47:57.084262 IP (tos 0x0, ttl 63, id 40658, offset 0, flags [DF], proto TCP (6), length 40)
    einsteinium.wikimedia.org.38242 > bohrium.eqiad.wmnet.ssh: Flags [R], cksum 0x28f3 (correct), seq 503135529, win 0, length 0
09:48:55.848922 IP (tos 0x0, ttl 62, id 6722, offset 0, flags [DF], proto TCP (6), length 40)
    tegmen.wikimedia.org.46344 > bohrium.eqiad.wmnet.ssh: Flags [R], cksum 0x83a5 (correct), seq 3141478331, win 0, length 0
09:48:55.849510 IP (tos 0x0, ttl 62, id 6723, offset 0, flags [DF], proto TCP (6), length 40)
    tegmen.wikimedia.org.46344 > bohrium.eqiad.wmnet.ssh: Flags [R], cksum 0x83a5 (correct), seq 3141478331, win 0, length 0
09:48:57.176230 IP (tos 0x0, ttl 63, id 54648, offset 0, flags [DF], proto TCP (6), length 40)
    einsteinium.wikimedia.org.51394 > bohrium.eqiad.wmnet.ssh: Flags [R], cksum 0xf45f (correct), seq 71456791, win 0, length 0
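
For reference, the Apache side of the changes above amounts to something like the following mpm_prefork and keep-alive settings. The values mirror what is described in this task, but the snippet itself is only an illustrative sketch, not the puppetized configuration on bohrium:

# /etc/apache2/mods-enabled/mpm_prefork.conf (illustrative)
<IfModule mpm_prefork_module>
    StartServers             50
    MinSpareServers          50
    MaxSpareServers         150
    MaxRequestWorkers       250
    MaxConnectionsPerChild    0
</IfModule>

# Keep-alive bumps mentioned in the SAL entry above (values here are placeholders)
KeepAlive            On
MaxKeepAliveRequests 500
KeepAliveTimeout     15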

After the changes made this morning, something *seems* to have changed: I no longer see FetchError no backend connection on cp1* misc hosts.

Most of the errors now seem to be FetchError http first read error: EOF from codfw/ulsfo/esams backends trying to contact eqiad.

Change 338953 merged by Ema:
cache: allow specifying applayer backend probes and probe piwik

https://gerrit.wikimedia.org/r/338953
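
For context, an application-layer health probe in VCL looks roughly like the sketch below. The probe name, URL and thresholds are illustrative and not necessarily what the merged patch uses:

# Generic VCL 4.0 sketch of a backend with an active health probe (illustrative)
vcl 4.0;

probe piwik_probe {
    .url       = "/";
    .interval  = 10s;
    .timeout   = 5s;
    .window    = 5;
    .threshold = 3;
}

backend bohrium {
    .host  = "bohrium.eqiad.wmnet";
    .port  = "80";
    .probe = piwik_probe;
}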

> @Milimetric this UA is from the iOS app. In testing locally, I didn't see any 503s. A potential cause of the surge of traffic could be that the current version of the app in the store has a 100% piwik event sampling rate, whereas previous versions shipped with a 10% sampling rate.
>
> The change has been reverted to be a 10% sampling rate, but the fix is not released to the app store yet.

Hi Joe! To be honest, I don't think this is a load problem. I took a look at the numbers and it doesn't seem that Piwik itself is failing; it looks like something subtle sitting between Apache and Varnish.

More numbers about the requests landing on piwik/apache/bohrium and the failed ones (503s).

The following numbers are the requests logged in the Apache access logs for Piwik, broken down by hour. The number of 503s (not registered in Apache, only seen by Varnish) is in the range of 40~80 per hour (data from oxygen's 5xx.log), so against tens of thousands of requests per hour well under 1% of requests end up as 503s.

So again I think that we are hitting some sort of weird Varnish/Apache network behavior that causes failures to read from sockets once in a while, not that we are failing due to load.

root@bohrium:/var/log/apache2# cut -d ":" -f 1 other_vhosts_access.log | sort | uniq -c
  25984 2017-02-24T06
  41047 2017-02-24T07
  37623 2017-02-24T08
  34962 2017-02-24T09
  43187 2017-02-24T10
  36751 2017-02-24T11
root@bohrium:/var/log/apache2# cut -d ":" -f 1 other_vhosts_access.log.1 | sort | uniq -c
  24734 2017-02-23T06
  39088 2017-02-23T07
  35369 2017-02-23T08
  34217 2017-02-23T09
  34050 2017-02-23T10
  37974 2017-02-23T11
  42504 2017-02-23T12
  44891 2017-02-23T13
  46716 2017-02-23T14
  47303 2017-02-23T15
  47263 2017-02-23T16
  50234 2017-02-23T17
  53773 2017-02-23T18
  60940 2017-02-23T19
  68436 2017-02-23T20
  71886 2017-02-23T21
  66917 2017-02-23T22
  51934 2017-02-23T23
  50398 2017-02-24T00
  53196 2017-02-24T01
  55495 2017-02-24T02
  58139 2017-02-24T03
  54071 2017-02-24T04
  49383 2017-02-24T05
  19331 2017-02-24T06

It seems to me you can close this task and open a new one to investigate the Varnish / Apache problems (as those are probably relevant in general for other things we run on Apache). It's really strange that it's only this user agent that seems to have the problem, though.

Mentioned in SAL (#wikimedia-operations) [2017-02-24T15:35:01Z] <ema> temporarily bumping timeout_idle to 60s on cache_misc T154558

Mentioned in SAL (#wikimedia-operations) [2017-02-27T18:12:25Z] <ema> temporarily bumping timeout_idle to 120s on cache_misc T154558
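
For reference, timeout_idle is a varnishd runtime parameter; one way to inspect and change it by hand is sketched below (the bumps above may well have been applied through the usual configuration mechanism instead):

# Illustrative only: inspect and temporarily raise varnishd's idle timeout
sudo varnishadm param.show timeout_idle
sudo varnishadm param.set timeout_idle 60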

Mentioned in SAL (#wikimedia-operations) [2017-03-02T09:55:40Z] <elukey> increased PHP memory_limit on bohrium for Piwik (T154558)

elukey added a comment. · Mar 3 2017, 1:29 PM

Summary before closing:

Piwik was showing a lot of errors in the Apache logs; we fixed them and opened a task to apply the fixes permanently via puppet (https://phabricator.wikimedia.org/T159136). As part of the annual budget for the new year we asked for a new dedicated machine for Piwik, or at least a more powerful VM.

We also opened https://phabricator.wikimedia.org/T158322 to upgrade Piwik to a newer version.

Last but not least, we opened https://phabricator.wikimedia.org/T159429 to investigate how to properly tune Varnish timeouts.

As far as I can see from the 50X rate, the background noise has been reduced to an acceptable level, so the immediate issue has been mitigated.

elukey moved this task from In Progress to Done on the Analytics-Kanban board. · Mar 3 2017, 1:29 PM
elukey moved this task from Done to In Progress on the Analytics-Kanban board. · Mar 6 2017, 4:07 PM
Nuria moved this task from In Progress to Paused on the Analytics-Kanban board. · Mar 8 2017, 8:04 PM

Change 342007 had a related patch set uploaded (by Elukey):
[operations/puppet] Remove Piwik backend probe from Varnish Misc backends

https://gerrit.wikimedia.org/r/342007

elukey moved this task from Paused to In Progress on the Analytics-Kanban board. · Mar 9 2017, 3:08 PM

Change 342007 merged by Elukey:
[operations/puppet] Remove Piwik backend probe from Varnish Misc backends

https://gerrit.wikimedia.org/r/342007

I followed https://piwik.org/docs/optimize-how-to/ and applied SET GLOBAL innodb_flush_log_at_trx_commit=2; as root on bohrium's MySQL; it seems to have helped a lot in decreasing IOPS and disk utilization.

Added innodb_buffer_pool_size = 512M and innodb_flush_log_at_trx_commit = 2 to /etc/mysql/my.cnf and restarted MySQL on bohrium.
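
For reference, the change amounts to a couple of lines in the [mysqld] section; this is only a sketch of its shape (the buffer pool was later raised further, see below):

# /etc/mysql/my.cnf on bohrium (illustrative sketch of the settings described above)
[mysqld]
innodb_buffer_pool_size        = 512M   # later bumped to 2048M
innodb_flush_log_at_trx_commit = 2      # write the redo log at commit, fsync ~once per second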

Set innodb_buffer_pool_size = 2048M to see if it helps. I checked SHOW ENGINE INNODB STATUS and some of the data is relevant:

--------
FILE I/O
--------
I/O thread 0 state: waiting for completed aio requests (insert buffer thread)
I/O thread 1 state: waiting for completed aio requests (log thread)
I/O thread 2 state: waiting for completed aio requests (read thread)
I/O thread 3 state: waiting for completed aio requests (read thread)
I/O thread 4 state: waiting for completed aio requests (read thread)
I/O thread 5 state: waiting for completed aio requests (read thread)
I/O thread 6 state: waiting for completed aio requests (write thread)
I/O thread 7 state: waiting for completed aio requests (write thread)
I/O thread 8 state: waiting for completed aio requests (write thread)
I/O thread 9 state: waiting for completed aio requests (write thread)
Pending normal aio reads: 0 [0, 0, 0, 0] , aio writes: 0 [0, 0, 0, 0] ,
 ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
281210 OS file reads, 340273 OS file writes, 8070 OS fsyncs
100.17 reads/s, 16384 avg bytes/read, 137.70 writes/s, 3.35 fsyncs/s

At the moment the traffic is at its lowest (compared to US daytime) and these stats are not great for a VM on Ganeti with 8GB of RAM.

----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 2197815296; in additional pool allocated 0
Dictionary memory allocated 982919
Buffer pool size   131072
Free buffers       0
Database pages     121667
Old database pages 44892
Modified db pages  2106
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 177062, not young 0
137.82 youngs/s, 0.00 non-youngs/s
Pages read 281201, created 4222, written 52669
100.19 reads/s, 1.94 creates/s, 20.43 writes/s
Buffer pool hit rate 995 / 1000, young-making rate 7 / 1000 not 0 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 121667, unzip_LRU len: 0
I/O sum[6091]:cur[3], unzip sum[0]:cur[0]

Good hit rate for the buffer cache, good news :)

--------------
ROW OPERATIONS
--------------
0 queries inside InnoDB, 0 queries in queue
1 read views open inside InnoDB
Main thread process no. 4696, id 140523010610944, state: sleeping
Number of rows inserted 161572, updated 122017, deleted 0, read 6167061
65.84 inserts/s, 49.21 updates/s, 0.00 deletes/s, 2848.14 reads/s

Again, not high numbers for a moderately good bare-metal server with SSDs, but probably too much for a VM?

@JoeWalsh: Hi! I am wondering whether the 100% -> 10% reduction in the sampling rate has been released or not, since I am currently leaning towards the idea that our little bohrium VM is not suitable for all the analytics tracking traffic coming from the iOS apps. The number of 503s registered in our metrics is not super high (way less than 1% of the total traffic), but the Piwik service in its current state is probably not able to handle all this traffic well.

We (as Analytics team) will try to invest time during the next quarter to improve the service, but we'll probably need new (real) hardware and a review of the use cases (like the iOS app) to decide how much time and energy (and money!) to invest.

I propose, if you agree, to reduce the iOS traffic as soon as possible by lowering the sampling rate, and then re-sync next quarter to figure out whether we need to raise it back to 100%.

What do you think?

@elukey the reduction to 10% will be in the next app version, 5.4. Currently it's scheduled to be released at the end of this week or early next week. I don't think we'd need to re-raise it to 100%.

Nuria moved this task from In Progress to Paused on the Analytics-Kanban board. · Mar 13 2017, 3:15 PM
elukey moved this task from In Progress to Stalled on the User-Elukey board. · Mar 14 2017, 7:17 AM
JMinor added a subscriber: JMinor. · Mar 22 2017, 6:16 PM

The version returning to 10% sampling rate is up on the app store.

The traffic has definitely decreased a lot since last week, but I am still seeing some 503s (way more than before). I am going to wait a few days to let more people upgrade, then I'll resume the investigation. Upgrading Piwik to the newer version will probably be a safe bet to improve performance.

elukey moved this task from Paused to Done on the Analytics-Kanban board. · Wed, Apr 19, 8:34 AM
Nuria closed this task as "Resolved". · Fri, Apr 21, 6:43 PM