
investigate recent divot in landingpage log activity
Closed, ResolvedPublic

Description

We're getting alerts from monitoring of kafkatee's landingpages log rotation, for zero activity for >2h (warning) or >4h (critical). CentralNotice logs are active during these times. Is it just that we have no fundraising campaigns up? Or is there a change to the traffic pattern we need to account for in monitoring?

Event Timeline

Jgreen created this task.May 31 2019, 2:52 PM
Restricted Application added a subscriber: Aklapper.May 31 2019, 2:52 PM
Jgreen added a comment.EditedMay 31 2019, 3:52 PM

Here are the blocks of time where there were no logs to rotate; the 'min' value is the time elapsed since the last non-zero log file.

2019-05-04T01:30:03.014643+00:00 120 min
2019-05-04T01:45:03.083360+00:00 135 min

2019-05-05T03:00:02.915055+00:00 120 min
2019-05-05T03:15:02.742720+00:00 135 min

2019-05-11T00:30:02.263044+00:00 120 min
2019-05-11T03:30:02.965101+00:00 300 min

2019-05-29T15:30:02.267733+00:00 120 min
2019-05-29T20:45:02.174639+00:00 435 min

2019-05-29T22:45:03.039016+00:00 120 min
2019-05-31T09:30:02.853534+00:00 2205 min

2019-05-31T11:30:02.343803+00:00 120 min
2019-05-31T15:35:02.561301+00:00 365 min
...
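The gap computation behind those 'min' values can be sketched as follows (a minimal illustration, not the actual monitoring code; the 23:30 last-log time is inferred from the first 120 min entry above):

```python
from datetime import datetime, timezone

# Thresholds from the alert described above: 2h+ of zero activity warns, 4h+ is critical.
WARN_MIN = 120
CRIT_MIN = 240

def gap_minutes(now, last_nonzero):
    """Minutes elapsed since the last rotation that produced a non-empty log."""
    return int((now - last_nonzero).total_seconds() // 60)

# First gap in the list above: last non-empty log assumed at 23:30, check at 01:30.
last = datetime(2019, 5, 3, 23, 30, tzinfo=timezone.utc)
now = datetime(2019, 5, 4, 1, 30, 3, tzinfo=timezone.utc)

gap = gap_minutes(now, last)
level = "critical" if gap >= CRIT_MIN else "warning" if gap >= WARN_MIN else "ok"
```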

Adding analytics. Let us know if this is the right tag. We were wondering if anything has changed in the last 2 weeks.

Raw hit counts from the logs:

2019-05-20 171
2019-05-21 162
2019-05-22 173
2019-05-23 172
2019-05-24 138
2019-05-25 149
2019-05-26 180
2019-05-27 148
2019-05-28 136
2019-05-29 69
2019-05-30 0
2019-05-31 1

From this it looks pretty clear something broke/stopped on the 29th, and drilling down on that date a bit, it looks like it happened roughly around 14:00 UTC:

2019-05-29T00 3
2019-05-29T01 3
2019-05-29T02 1
2019-05-29T03 5
2019-05-29T04 6
2019-05-29T05 2
2019-05-29T06 4
2019-05-29T07 9
2019-05-29T08 4
2019-05-29T09 7
2019-05-29T10 4
2019-05-29T11 6
2019-05-29T12 9
2019-05-29T13 5
2019-05-29T14 0
2019-05-29T15 0
2019-05-29T16 0
2019-05-29T17 0
2019-05-29T18 0
2019-05-29T19 0
2019-05-29T20 1
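For reference, an hourly drill-down like the one above can be produced by bucketing hit timestamps on their hour prefix (illustrative only; the real log format and hit extraction are assumed):

```python
from collections import Counter

# Hypothetical log data: one ISO-8601 timestamp per landing-page hit.
hits = [
    "2019-05-29T13:12:44Z",
    "2019-05-29T13:58:01Z",
    "2019-05-29T20:40:08Z",
]

# Bucket by hour; ts[:13] yields keys like "2019-05-29T13", matching the table above.
per_hour = Counter(ts[:13] for ts in hits)
```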

fdans added a subscriber: fdans.Jun 3 2019, 3:37 PM

@Jgreen can you provide a little more context if you think we could be of help? Are you also missing data on webrequest table on hive?

fdans moved this task from Incoming to Radar on the Analytics board.Jun 3 2019, 3:37 PM
DStrine moved this task from Triage to FR-Ops on the Fundraising-Backlog board.Jun 3 2019, 7:30 PM

@Jgreen can you provide a little more context if you think we could be of help?

Hi @fdans! These are logs generated with Kafkatee, which, I think, connects to and filters the firehose of all webrequests, recording only hits to Donate wiki landing pages. The logs are then processed and the hits ingested into a database on the fundraising cluster.

This is part of an old data pipeline that we're working to replace, BTW.

Are you also missing data on webrequest table on hive?

We haven't checked yet... Definitely that's the first place to check.

Thanks much!!!

Jgreen added a comment.Jun 4 2019, 8:15 PM

(quoting the exchange with @fdans above)

I think this is the new pipeline, where there is pre-filtering happening on the analytics side and kafkatee is consuming only that specific traffic.

Jgreen added a comment.Jun 4 2019, 8:32 PM

More context/clarification: we currently have two kafkatee pipelines, which I'll refer to as "old" vs "new".

The old pipeline is running on americium.frack.eqiad.wmnet with this general config:

kafka.metadata.broker.list = kafka-jumbo1001.eqiad.wmnet:9093,kafka-jumbo1002.eqiad.wmnet:9093,kafka-jumbo1003.eqiad.wmnet:9093,kafka-jumbo1004.eqiad.wmnet:9093,kafka-jumbo1005.eqiad.wmnet:9093,kafka-jumbo1006.eqiad.wmnet:9093

input [encoding=json] kafka topic webrequest_text partition 0-23 from stored

output pipe 10 /usr/local/bin/beaconImpressions-filter

output pipe 1 /usr/local/bin/landingpages-filter

The old pipeline seems to be collecting normally, not disrupted.

The new pipeline runs on alnitak.frack.codfw.wmnet with this basic config:

kafkatee process #1

kafka.metadata.broker.list = kafka-jumbo1001.eqiad.wmnet:9093,kafka-jumbo1002.eqiad.wmnet:9093,kafka-jumbo1003.eqiad.wmnet:9093,kafka-jumbo1004.eqiad.wmnet:9093,kafka-jumbo1005.eqiad.wmnet:9093,kafka-jumbo1006.eqiad.wmnet:9093

kafka.offset.store.method=broker

input [encoding=string] kafka topic eventlogging_CentralNoticeImpression partition 0 from stored

output file 10 /srv/kafkatee/centralnotice-impressions.+.json.sampled10.log

kafkatee process #2

kafka.metadata.broker.list = kafka-jumbo1001.eqiad.wmnet:9093,kafka-jumbo1002.eqiad.wmnet:9093,kafka-jumbo1003.eqiad.wmnet:9093,kafka-jumbo1004.eqiad.wmnet:9093,kafka-jumbo1005.eqiad.wmnet:9093,kafka-jumbo1006.eqiad.wmnet:9093

kafka.offset.store.method=broker

input [encoding=string] kafka topic eventlogging_LandingPageImpression partition 0 from stored

output file 1 /srv/kafkatee/landingpage-impressions.+.json.log

I ~think~ the new pipeline is prefiltered on the analytics side, but am not positive. The hit counts mentioned earlier on this task refer to the second (landingpages) kafkatee process, which seems to have stopped collecting entirely. Restarting kafkatee does not make a difference.
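As a side note on the configs above: the number in each output line (10 vs 1) is, as I understand it, kafkatee's sample rate, i.e. it keeps roughly 1 of every N messages. A rough sketch of the effect (not kafkatee's actual implementation):

```python
def sample(messages, rate):
    """Keep 1 of every `rate` messages, approximating kafkatee's output sample rate."""
    return [m for i, m in enumerate(messages) if i % rate == 0]
```

So `output file 10 ...centralnotice-impressions...` keeps about a tenth of the CentralNotice impressions, while `output file 1 ...landingpage-impressions...` keeps every landing-page impression.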

Nuria added a subscriber: Nuria.EditedJun 4 2019, 9:37 PM

@Jgreen I think you might want to get in touch with @AndyRussG

Both of the Kafka topics you are filtering are persisted to Hive, in the CentralNoticeImpression and LandingPageImpression tables in the events database. I would look there first to see whether you also see inconsistencies in the data. If you do, it is likely that the events sent from the client are invalid (and, as such, never persisted).

Please take a look at hive and let us know what you find.

The events are not in Hive:

select year, month, day, count(*) from LandingPageImpression where year=2019 and month=5 and day in (27,28,29,30,31) group by year, month, day order by day limit 10;
year	month	day	_c3
2019	5	27	148
2019	5	28	136
2019	5	29	69
2019	5	31	1

And, since May 31 is less than 7 days ago, we can check if the data is still in Kafka:

# Show the dt field of the latest 3 messages in the eventlogging_LandingPageImpression topic:
kafkacat -C -b kafka-jumbo1001.eqiad.wmnet:9092 -t eventlogging_LandingPageImpression -o -3 -c 3 | jq .dt
"2019-05-29T20:40:08Z"
"2019-05-31T09:25:03Z"
"2019-06-02T22:39:41Z"

Looks like the latest 3 messages in Kafka match up with what you see: not a lot of data.

I just tailed some log output of the eventlogging processors and do see an error like:

Jun 04 16:39:21 eventlog1002 eventlogging-processor@client-side-00[7169]: 2019-06-04 16:39:21,221 [7169] (MainThread) root [ERROR] Unable to validate: <Event 984ec52a4f8a5d5c855cbe8349f33696 of schema (u'LandingPageImpression', -1)> (Invalid revision ID -1)

I just verified that indeed the eventlogging-client-side topic (which is produced directly from clients) has revision: -1.

I believe that @AndyRussG recently deployed a change to the producer code to update the revision used?
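The validation failure above boils down to the event carrying a revision ID that can't resolve to a schema. A hypothetical simplification (the real eventlogging-processor fetches the schema by revision ID; the `{7}` revision set below is made up for illustration):

```python
def validate_event(event, known_revisions):
    """Reject events whose schema revision is not a known, positive revision ID."""
    schema, revision = event["schema"]
    if revision < 1 or revision not in known_revisions.get(schema, set()):
        raise ValueError("Invalid revision ID %d" % revision)
    return True

# An event stamped with revision -1, as seen in the processor error above.
bad = {"schema": ("LandingPageImpression", -1)}
try:
    validate_event(bad, {"LandingPageImpression": {7}})
    err = None
except ValueError as e:
    err = str(e)
```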

Ejegg added a subscriber: Ejegg.Jun 4 2019, 11:08 PM

Possibly-related commit in the FundraiserLandingPage extension:
Remove temporary EventLogging code-duplication rEFUL9f0535ff0750ee605eede776facc83cf0f32bd44 (merged May 22)

I believe that @AndyRussG recently deployed a change to the producer code to update the revision used?

@Ottomata thanks so much for digging in!!! The schema revision changes were for other events, not this one. But we can definitely dig into what's going on with the client code now...

Thanks again!!

Jgreen removed Jgreen as the assignee of this task.Jun 27 2019, 5:37 PM

Unassigning because it does not look like there's anything wrong with the frack part of the pipeline (kafkatee and the filters).

AndyRussG closed this task as Resolved.Sep 14 2019, 5:05 AM
AndyRussG claimed this task.

I've verified that both types of event are making it correctly from clients, through the Kafka streams and into log files in alnitak:/srv/banner_logs/. So, closing this task for now! Thanks much!!

Jgreen moved this task from In Progress to Done on the fundraising-tech-ops board.Feb 13 2020, 7:37 PM
Aklapper edited projects, added Analytics-Radar; removed Analytics.Jun 10 2020, 6:44 AM