
Event partitions missing since 2023-02-21T10:00 for stream without events (canary events not produced?)
Closed, ResolvedPublic

Description

It seems like partitions for the event datasets are missing since 2023-02-21T10:00.

Example:

dcausse@stat1004:~$ hdfs  dfs -ls hdfs:///wmf/data/raw/event/codfw.mediawiki.revision-score/year=2023/month=02/day=21
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Found 11 items
drwxr-x---   - analytics analytics-privatedata-users          0 2023-02-21 01:09 hdfs:///wmf/data/raw/event/codfw.mediawiki.revision-score/year=2023/month=02/day=21/hour=00
drwxr-x---   - analytics analytics-privatedata-users          0 2023-02-21 02:09 hdfs:///wmf/data/raw/event/codfw.mediawiki.revision-score/year=2023/month=02/day=21/hour=01
drwxr-x---   - analytics analytics-privatedata-users          0 2023-02-21 03:09 hdfs:///wmf/data/raw/event/codfw.mediawiki.revision-score/year=2023/month=02/day=21/hour=02
drwxr-x---   - analytics analytics-privatedata-users          0 2023-02-21 04:09 hdfs:///wmf/data/raw/event/codfw.mediawiki.revision-score/year=2023/month=02/day=21/hour=03
drwxr-x---   - analytics analytics-privatedata-users          0 2023-02-21 05:09 hdfs:///wmf/data/raw/event/codfw.mediawiki.revision-score/year=2023/month=02/day=21/hour=04
drwxr-x---   - analytics analytics-privatedata-users          0 2023-02-21 06:09 hdfs:///wmf/data/raw/event/codfw.mediawiki.revision-score/year=2023/month=02/day=21/hour=05
drwxr-x---   - analytics analytics-privatedata-users          0 2023-02-21 07:08 hdfs:///wmf/data/raw/event/codfw.mediawiki.revision-score/year=2023/month=02/day=21/hour=06
drwxr-x---   - analytics analytics-privatedata-users          0 2023-02-21 08:08 hdfs:///wmf/data/raw/event/codfw.mediawiki.revision-score/year=2023/month=02/day=21/hour=07
drwxr-x---   - analytics analytics-privatedata-users          0 2023-02-21 09:09 hdfs:///wmf/data/raw/event/codfw.mediawiki.revision-score/year=2023/month=02/day=21/hour=08
drwxr-x---   - analytics analytics-privatedata-users          0 2023-02-21 10:09 hdfs:///wmf/data/raw/event/codfw.mediawiki.revision-score/year=2023/month=02/day=21/hour=09
drwxr-x---   - analytics analytics-privatedata-users          0 2023-02-21 11:09 hdfs:///wmf/data/raw/event/codfw.mediawiki.revision-score/year=2023/month=02/day=21/hour=10

This is problematic because some Airflow sensors are waiting on these partitions and might fail or miss their SLA.
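
For reference, a small loop like this (same path as above, purely illustrative) lists which hours are missing:

# Illustrative helper: list which hourly partitions are missing for the day above.
base=hdfs:///wmf/data/raw/event/codfw.mediawiki.revision-score/year=2023/month=02/day=21
for hour in $(seq -f "%02g" 0 23); do
  hdfs dfs -test -d "${base}/hour=${hour}" || echo "missing: hour=${hour}"
done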

Event Timeline

org.wikimedia.analytics.refinery.job.ProduceCanaryEvents seems to have been stuck since yesterday's maintenance operation on the k8s cluster; one thread appears blocked on:

"main" #1 prio=5 os_prio=0 tid=0x00007f301000b000 nid=0x138e runnable [0x00007f3015216000]
   java.lang.Thread.State: RUNNABLE
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
	at java.net.SocketInputStream.read(SocketInputStream.java:171)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:464)
	at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:68)
	at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1346)
	at sun.security.ssl.SSLSocketImpl.access$300(SSLSocketImpl.java:73)
	at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:962)
	at org.apache.http.impl.conn.LoggingInputStream.read(LoggingInputStream.java:84)
	at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
	at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
	at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:280)
	at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
	at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
	at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
	at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
	at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157)
	at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
	at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
	at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
	at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108)
	at org.wikimedia.eventutilities.core.http.BasicHttpClient.post(BasicHttpClient.java:107)
	at org.wikimedia.eventutilities.monitoring.CanaryEventProducer.postEvents(CanaryEventProducer.java:286)
	at org.wikimedia.eventutilities.monitoring.CanaryEventProducer.lambda$postEventsToUris$2(CanaryEventProducer.java:253)
	at org.wikimedia.eventutilities.monitoring.CanaryEventProducer$$Lambda$52/478182173.apply(Unknown Source)
	at java.util.stream.Collectors.lambda$toMap$58(Collectors.java:1321)
	at java.util.stream.Collectors$$Lambda$37/1365790282.accept(Unknown Source)
	at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
	at java.util.HashMap$EntrySpliterator.forEachRemaining(HashMap.java:1723)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
	at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
	at org.wikimedia.eventutilities.monitoring.CanaryEventProducer.postEventsToUris(CanaryEventProducer.java:251)
	at org.wikimedia.analytics.refinery.job.ProduceCanaryEvents$.produceCanaryEvents(ProduceCanaryEvents.scala:307)
	at org.wikimedia.analytics.refinery.job.ProduceCanaryEvents$$anonfun$4$$anonfun$apply$1.apply$mcZ$sp(ProduceCanaryEvents.scala:212)
	at org.wikimedia.analytics.refinery.job.ProduceCanaryEvents$$anonfun$4$$anonfun$apply$1.apply(ProduceCanaryEvents.scala:212)
	at org.wikimedia.analytics.refinery.job.ProduceCanaryEvents$$anonfun$4$$anonfun$apply$1.apply(ProduceCanaryEvents.scala:212)
	at scala.util.Try$.apply(Try.scala:192)
	at org.wikimedia.analytics.refinery.job.ProduceCanaryEvents$$anonfun$4.apply(ProduceCanaryEvents.scala:211)
	at org.wikimedia.analytics.refinery.job.ProduceCanaryEvents$$anonfun$4.apply(ProduceCanaryEvents.scala:210)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at org.wikimedia.analytics.refinery.job.ProduceCanaryEvents$.apply(ProduceCanaryEvents.scala:210)
	at org.wikimedia.analytics.refinery.job.ProduceCanaryEvents$.main(ProduceCanaryEvents.scala:147)
	at org.wikimedia.analytics.refinery.job.ProduceCanaryEvents.main(ProduceCanaryEvents.scala)

Perhaps it is stuck in a retry loop and/or on an excessive HTTP read timeout that started when eventgate-main@codfw was depooled as part of the k8s upgrade (T329664).

@dcausse taking a look. It seems adjacent to Event-Platform. You caught me in a blind spot; I’m not familiar with the lifecycle (and recovery) of this job.

I can't help much right now, but I'd like to understand the process better. Do you maybe know who the (historical) point of contact / steward for ProduceCanaryEvents is?
Does it fall under the general analytics-refinery ops umbrella? cc @BTullis

I think @Ottomata used to take care of this. I wasn't sure what tags to add, so please feel free to remove it from here :)

I think the ProduceCanaryEvents job is only helpful for batch-related jobs that need to be triggered on the existence of HDFS partitions.
I'm not familiar either with the lifecycle (and recovery) of this job, but I think it runs via a systemd timer and is expected to run every 15 minutes.
It might make more sense to move it to an Airflow job so that every run can be tracked and replayed.

@dcausse yes the plan is to move this job to airflow eventually.

It's a little tricky because it dynamically discovers the work to do via stream config. Marcel has prototyped ways to handle this in Airflow. Looking at the issue now...

This does indeed look like it was caused by T329664: Update wikikube codfw to k8s 1.23. The ProduceCanaryEvents job is configured to explicitly produce to the appropriate eventgate service in both DCs, e.g. eventgate-main.svc.eqiad.wmnet and eventgate-main.svc.codfw.wmnet, instead of eventgate-main.discovery.wmnet.

We do this so that we can be sure we will have canary events in both the eqiad- and codfw-prefixed Kafka topics. This allows us to differentiate between low-volume empty topics and broken pipelines. As long as each topic has at least one canary event per hour, we know that every hour should have some data in it, and we can create the Hive partitions once there is actual data for an hour.
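
For example, to spot-check that a per-DC prefixed topic has canary events in it, something like the below works (broker and topic names here are only illustrative):

# Illustrative spot-check: consume the last few messages from the codfw-prefixed
# topic and look for canary events in the payload (broker/topic are examples only).
kafkacat -b kafka-jumbo1001.eqiad.wmnet:9092 \
  -t codfw.mediawiki.revision-score -C -o -10 -e | grep -i canary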

In this case codfw was really offline, so this actually worked as designed: codfw was 'broken'. These partitions really had no data for those hours, and canary events really did fail to be produced.

I'm not really sure how to account for this in the future. Service Ops can and should be able to depool and take services offline in one DC at a time, and things should keep working. All of the actual producers use the discovery addresses, so there was no real outage.

Probably we should run the ProduceCanaryEvents job in both DCs and use the discovery addresses too. ProduceCanaryEvents would then not encounter errors, but during maintenance like this there would still be no canary events for the offline DC, and we'd still have the same Hive partition problem.

Are there ways to unblock it? It's causing plenty of hourly jobs to fail on our side.

It should be fixed for recent data, since they turned codfw back on. There is no (easy) way to produce past canary events. There really was no codfw data for those hours. Is there a way to make airflow run even if it thinks the data is missing?

I'm working through marking everything so Airflow will run now. The basic idea is that you have to mark the task as success; from the CLI it looks something like this (adjusting the days / dag_ids / task_ids for each relevant DAG):

dag_id=ores_predictions_hourly
task_id=wait_for_hourly_data
for hour in $(seq -f "%02g" 11 23); do
  sudo -u analytics-search airflow run --mark_success ${dag_id} ${task_id} 20230221T${hour}0000
done
for hour in $(seq -f "%02g" 0 17); do
  sudo -u analytics-search airflow run --mark_success ${dag_id} ${task_id} 20230222T${hour}0000
done

Thanks for fixing the issue and dealing with the failed sensors! :)
Regarding what we could do to mitigate this in the future, I don't think a great out-of-the-box solution exists...
One could set soft_fail=True on the sensors, but that means you have to set a meaningful timeout, and I'm not sure the benefit would be worth the possible harm if the data finally appears a couple of hours later...
Perhaps these "empty" partitions could be created without relying on canary events?

Perhaps these "empty" partitions could be created without relying on canary events?

Hm, yeah maybe a script that could easily be run manually for a range?
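
Just as a sketch of what that script could look like (the table name and partition layout are illustrative, and whether it should add Hive partitions or raw directories depends on what the sensors actually check):

# Hypothetical sketch only: add empty Hive partitions for a range of hours.
# Table name and partition columns are illustrative.
table=event.mediawiki_revision_score
for hour in $(seq 11 23); do
  hive -e "ALTER TABLE ${table} ADD IF NOT EXISTS PARTITION (datacenter='codfw', year=2023, month=2, day=21, hour=${hour});"
done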

Change 894642 had a related patch set uploaded (by DCausse; author: DCausse):

[analytics/refinery/source@master] ProduceCanaryEvents: set a timeout on the http client

https://gerrit.wikimedia.org/r/894642

I think this happened again.

I noticed today that there were no canary events for a new topic in codfw. Looking on an-launcher1002, systemctl list-timers showed me:

NEXT                         LEFT                LAST                         PASSED             UNIT                                                     ACTIVATES
n/a                          n/a                 Thu 2023-05-04 16:15:00 UTC  1 weeks 6 days ago produce_canary_events.timer                              produce_canary_events.service

Meaning, the job hasn't run since 2023-05-04 16:15:00!

I saw a produce_canary_events java process that had been running on an-launcher1002 since that time. There were no logs available, since journalctl only keeps a short history before rotating them out. I do not know why this job was stuck.

I killed the job. I tried to restart it, but noticed that it took a very long time to run. I stopped it, and ran it manually. The job does indeed finish, but takes several minutes now, whereas previously I believe it would only take several seconds.

I just remembered this task. I did not get to check and see what the java process was stuck on.
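
Next time this happens, something like this would capture where the process is blocked before killing it (the user name below is an assumption):

# Hypothetical next-time debugging steps; the analytics user is an assumption.
pid=$(pgrep -f ProduceCanaryEvents)
sudo -u analytics jstack ${pid} > /tmp/produce_canary_events_threads.txt
sudo systemctl restart produce_canary_events.service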

@dcausse submitted a workaround that would have at least mitigated this problem. It got lost in the backlog and was never deployed. Deploying it now.

Change 894642 merged by Ottomata:

[analytics/refinery/source@master] ProduceCanaryEvents: set a timeout on the http client

https://gerrit.wikimedia.org/r/894642

Change 920762 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Bump refinery version for produce_canary_events job

https://gerrit.wikimedia.org/r/920762

Change 920762 merged by Ottomata:

[operations/puppet@production] Bump refinery version for produce_canary_events job

https://gerrit.wikimedia.org/r/920762

Change 920764 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] produce_canary_events - fix refinery-job to use shaded jar

https://gerrit.wikimedia.org/r/920764

Change 920764 merged by Ottomata:

[operations/puppet@production] produce_canary_events - fix refinery-job to use shaded jar

https://gerrit.wikimedia.org/r/920764

Change 920767 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] produce_canary_events - revert bump of refinery-job version

https://gerrit.wikimedia.org/r/920767

Change 920767 merged by Ottomata:

[operations/puppet@production] produce_canary_events - revert bump of refinery-job version

https://gerrit.wikimedia.org/r/920767

Everything is deployed, but I had to revert the puppet change that would use the new code. It looks like refinery-job-shaded no longer includes the Scala dependencies. Perhaps since the Spark 3 upgrade?

@JAllemandou @mforns @Antoine_Quhen Any clues?

Refinery-source does not ship Scala anymore, because it used to be pulled in via wikihadoop, which is no longer included.
https://archiva.wikimedia.org/#artifact-dependencies/org.wikimedia/wikihadoop/0.3-wmf1

In the meantime, if you get stuck, a quick fix is to add the Scala jars from the Spark install to the classpath:

export SCALA_JARS=$(ls /opt/conda-analytics/lib/python3.10/site-packages/pyspark/jars/scala*.jar | tr '\n' ':')
java -cp $SCALA_JARS ...

Joal suggested creating a puppet profile for Scala to install it cleanly on the machines.

There is a second problem hidden behind the missing Scala lib: a Guava version mismatch between the version provided by Hadoop and the one included in eventutilities.

A simple solution exists for produce-canary-events: put our refinery-job-shaded.jar before the Hadoop classpath in the -cp option when launching the Java app.
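
Something like this, as a sketch (the jar path is illustrative):

# Sketch of the classpath ordering described above; the jar path is illustrative.
REFINERY_JAR=/srv/deployment/analytics/refinery/artifacts/refinery-job-shaded.jar
java -cp "${REFINERY_JAR}:$(hadoop classpath)" \
  org.wikimedia.analytics.refinery.job.ProduceCanaryEvents ...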

The Refine job also uses eventutilities:

cd analytics-refinery-source
ack -l eventutilities
changelog.md
refinery-job/pom.xml
refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/ProduceCanaryEvents.scala
refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/Refine.scala
refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/SchemaLoaders.scala
refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/tool/EvolveHiveTable.scala

And the same problem arises here https://phabricator.wikimedia.org/T335308

Change 932456 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] refinery::job::canary_events - use spark to launch, bump to version 0.2.17

https://gerrit.wikimedia.org/r/932456

Change 932456 merged by Ottomata:

[operations/puppet@production] refinery::job::canary_events - use spark to launch, bump to version 0.2.17

https://gerrit.wikimedia.org/r/932456