
Investigate missing WikibaseQualityConstraints logs in logstash.
Closed, Declined · Public

Description

While looking at T204267 I noticed that there has been another spike in requests to the WBQC api in the past 48 hours.
Specifically, on the 15th the spike started between 15:25 and 15:27 and tailed off toward 16:40.
These requests can be seen on the WBQC dashboard @ https://grafana.wikimedia.org/d/000000344/wikidata-quality?orgId=1&from=1547564623742&to=1547573021473

The SPARQL throttling panel on that dashboard indicates that WDQS was overloaded by the requests that WBQC then made.
Conveniently, there is a gap in the WDQS logs for other reasons, so no data appears from the wdqs servers @ https://logstash.wikimedia.org/goto/9676cbb299b21b114c80c7d8499c8e07

When previously investigating a spike (T204267#4865669), the WDQS responses that came back with a 429 / retry-after were logged in logstash.
An example can be seen at https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2018.12.30/mediawiki?id=AWgAckPBzpjgITg61Cm8&_g=h@4e853bb
For the most recent spike on the 15th I can't find any of this data in logstash; I would expect to see it here: https://logstash.wikimedia.org/goto/bcb628c5f31be343d9ae51aa215dbabd
(there are logs appearing in the channel, though).
Even expanding the search to the full day doesn't show any of these throttle-related logs.

Looking on mwlog1001, however, I do see a collection of these logs:

addshore@mwlog1001:/srv/mw-log$ zcat ./archive/WikibaseQualityConstraints.log-20190116.gz | grep retry-after | wc -l
638

and

addshore@mwlog1001:/srv/mw-log$ zcat ./archive/WikibaseQualityConstraints.log-20190116.gz | grep retry-after | head
2019-01-15 15:27:10 [XD37zApAMEAAAFaniE4AAAAF] mw1229 wikidatawiki 1.33.0-wmf.12 WikibaseQualityConstraints WARNING: Sparql API replied with status 429 and no valid retry-after header. {"responseHeaders":"{\"server\":[\"nginx\\/1.13.6\"],\"date\":[\"Tue, 15 Jan 2019 15:27:10 GMT\"],\"content-type\":[\"application\\/octet-stream\"],\"connection\":[\"close\"],\"cache-control\":[\"no-cache\"]}","responseContent":"Rate limit exceeded
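
For what it's worth, bucketing the same file by minute should confirm that these entries line up with the spike window on the dashboard, e.g.:

zcat ./archive/WikibaseQualityConstraints.log-20190116.gz | grep retry-after | awk '{ print substr($2, 1, 5) }' | sort | uniq -c

(the awk bit just keeps the HH:MM part of the timestamp, so the counts should cluster between ~15:25 and ~16:40 if they match the spike)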

Looking in logstash for XD37zApAMEAAAFaniE4AAAAF specifically also reveals nothing.
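
The Kibana search above should boil down to something like the query below against the Elasticsearch backend behind logstash (the host/port and the exact reqId field name are guesses on my part; the Kibana query bar does the same thing):

# host/port and the reqId field name are assumptions, run from wherever the elasticsearch backend is reachable
curl -s 'http://localhost:9200/logstash-2019.01.15/_search?pretty' -H 'Content-Type: application/json' -d '{"query": {"query_string": {"query": "reqId:XD37zApAMEAAAFaniE4AAAAF"}}}'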

AFAIK the logging config should mean that everything is being logged everywhere?

https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/InitialiseSettings.php#L5591

'wmgMonologChannels' => [
	'default' => [
...
		'WikibaseQualityConstraints' => 'debug',

What's up? Did I miss something?

Event Timeline

The first suspect I checked was the move to the new logging infrastructure (on the MediaWiki side this in practice means syslogging to localhost and prefixing syslog payloads with @cee:), and there seem to be similar logs that came in through that, though not on the 15th, e.g.:
https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.01.08/mediawiki?id=AWgsslHQzpjgITg6gXm7&_g=h@9894d8e
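
For anyone not familiar with the @cee: convention: the syslog message body is just the cookie followed by a JSON payload, very roughly along these lines (an illustration of the shape, not an actual captured line):

# illustrative shape only, field set depends on what the MediaWiki formatter emits:
@cee: {"channel": "WikibaseQualityConstraints", "level": "WARNING", "message": "Sparql API replied with status 429 and no valid retry-after header.", "reqId": "..."}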

On mwlog, that message is:

2019-01-08 09:01:50 [XDRm-QpAIDwAABSB67sAAACH] mw1348 wikidatawiki 1.33.0-wmf.9 WikibaseQualityConstraints WARNING: Sparql API replied with status 429 and no valid retry-after header. {"responseHeaders":"{\"server\":[\"nginx\\/1.13.6\"],\"date\":[\"Tue, 08 Jan 2019 09:01:50 GMT\"],\"content-type\":[\"application\\/octet-stream\"],\"connection\":[\"close\"],\"cache-control\":[\"no-cache\"]}","responseContent":"Rate limit exceeded"}

Whereas this message on mwlog from the 15th, for example, I can't find in logstash:

2019-01-15 15:28:05 [XD38AwpAEMUAAEp9w5MAAADS] mw1316 wikidatawiki 1.33.0-wmf.12 WikibaseQualityConstraints WARNING: Sparql API replied with status 429 and no valid retry-after header. {"responseHeaders":"{\"server\":[\"nginx\\/1.13.6\"],\"date\":[\"Tue, 15 Jan 2019 15:28:05 GMT\"],\"content-type\":[\"application\\/octet-stream\"],\"connection\":[\"close\"],\"cache-control\":[\"no-cache\"]}","responseContent":"Rate limit exceeded"}

I'll keep looking

I can't find anything obviously wrong. The pipeline looks like this, with protocols in parentheses: mw -> (syslog on localhost) -> rsyslog -> (kafka tls) -> kafka <- (kafka tls) <- logstash -> (http) -> elasticsearch.
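
One way to narrow down where the messages get dropped would be to consume the logging topic directly and grep for one of the missing request IDs, e.g. with kafkacat (broker, topic name and TLS settings below are placeholders, not the real values):

# broker/topic are placeholders; real consumption would also need the TLS cert options
kafkacat -C -b kafka-logging1001.eqiad.wmnet:9093 -X security.protocol=ssl -t mediawiki-logging -o beginning -e | grep -c XD38AwpAEMUAAEp9w5MAAADS

If the reqId shows up there, the loss is on the logstash/elasticsearch side; if it doesn't, it never made it past rsyslog.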

On the 15th we bumped the kafka partitions from one to three in T213081: Consider increasing kafka logging topic partitions, though that happened later than these timestamps (at around 17 UTC) and AFAIK it shouldn't impact anything.
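
The partition count itself is easy to double check with the stock kafka tooling (topic name and zookeeper address below are placeholders):

kafka-topics.sh --describe --zookeeper conf1001.eqiad.wmnet:2181 --topic mediawiki-logging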

CDanis triaged this task as Medium priority. Jan 17 2019, 4:51 PM

Might this be related to how MediaWiki logging is configured? Some messages do get through, like jobrunner and some messages from the api at warning level and above. See:

// InitializeSettings.php
//
// Configure Monolog logging to udp2log (logfiles), Logstash, EventBus extension, etc.
// See logging.php for more detailed information
// channel => false  == ignore all log events on this channel
// channel => level  == record all events of this level or higher to udp2log and logstash (except:
//    logstash won't go below info level, use explicit logstash=>debug field for that)
// channel => [ 'udp2log'=>level, 'logstash'=>level, 'sample'=>rate, 'buffer'=>buffer, 'eventbus' => 'debug' ]
// Defaults: [ 'udp2log'=>'debug', 'logstash'=>'info', 'sample'=>false, 'buffer'=>false, 'eventbus' => 'debug' ]
// Valid levels: 'debug', 'info', 'warning', 'error', false
// Note: sampled logs will not be sent to Logstash
// Note: Udp2log events are sent to udp://{$wmgUdp2logDest}/{$channel}
'wmgMonologChannels' => [
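
Per the format documented above, if we wanted to be explicit about shipping this channel to logstash down to debug level, the per-channel override would presumably look something like this (a sketch, not a deployed change):

	'WikibaseQualityConstraints' => [ 'udp2log' => 'debug', 'logstash' => 'debug' ],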

It's been a while and it's still not clear where these logs got lost. Closing as there have been no new reports. Please let us know if you encounter an issue like this again.