
Java Prep for Webrequest Load
Closed, Resolved · Public · 9 Estimated Story Points

Description

Goal:
Do the required Java prep work to migrate the webrequest load jobs to Airflow

Job Details:

Input    | Processing | Output
Raw JSON | Hive       | Hive + Table Tests

Success Criteria:

  • Have the 2 Jobs Migrated (SLA 5 Hours)
Gotchas
  • This job includes archiving of results. We may need to adapt the existing custom Airflow ArchiveOperator to match this job's output format.
  • The job needs to be rewritten; how is still TBD.

Gerrit organisation:

  • 1 merge request replacing the Guava cache with Caffeine and removing the Guava dependency
  • 1 merge request making the existing UDF code thread-compatible (removing singletons + handling function serialization)

Event Timeline

EChetty updated Other Assignee, added: Antoine_Quhen.
EChetty moved this task from To be prioritised to Sprint 07 on the Data Pipelines board.
EChetty edited projects, added Data Pipelines (Sprint 07); removed Data Pipelines.
EChetty set the point value for this task to 9. · Jan 16 2023, 4:22 PM

Change 883118 had a related patch set uploaded (by Aqu; author: Aqu):

[analytics/refinery/source@master] [WIP] Java preparations before migrating webrequest to Spark

https://gerrit.wikimedia.org/r/883118

We encountered a bug when using Caffeine with the Maxmind geocoding library: an infinite loop was triggered at unit-test time. The value returned by loader.load is sometimes not compatible with Caffeine.cache.get, which forced us to use Caffeine.cache.getIfPresent + Caffeine.cache.put instead:
https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/source/+/0e56e1f949222865b70578d19aa3eacb3e489519/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/maxmind/DatabaseReaderCache.java#45
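The workaround replaces the single "look up or load" call with an explicit check-then-populate sequence. A minimal sketch of the pattern, using a plain ConcurrentHashMap as a stand-in for the Caffeine cache so the sketch is self-contained (the real code at the link above calls Caffeine's getIfPresent and put; the class and loader here are hypothetical):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class GetIfPresentPutCache<K, V> {
    // Stand-in for com.github.benmanes.caffeine.cache.Cache.
    private final ConcurrentHashMap<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    public GetIfPresentPutCache(Function<K, V> loader) {
        this.loader = loader;
    }

    // Instead of a combined cache.get(key, loader) call -- which looped
    // forever with the Maxmind loader -- check for the value first and
    // populate the cache explicitly.
    public V get(K key) {
        V value = cache.get(key);      // Caffeine: cache.getIfPresent(key)
        if (value == null) {
            value = loader.apply(key); // Maxmind database lookup in the real code
            cache.put(key, value);     // Caffeine: cache.put(key, value)
        }
        return value;
    }
}
```

Note that this two-step version is not atomic: two threads can race and both invoke the loader for the same key. That is acceptable here because the loaded value is idempotent.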

Change 886800 had a related patch set uploaded (by Aqu; author: Aqu):

[analytics/refinery/source@master] Java Hive UDF thread safety

https://gerrit.wikimedia.org/r/886800

Notice about bucketing: The wmf.webrequest table is currently bucketed by hostname and sequence; the bucketing happens when Hive writes the table. Bucketing is mainly useful for sampling a table. Spark does not support Hive-style table bucketing when writing through SQL. So, if we want to migrate the refine_webrequest job from Hive to Spark, we may have to drop the bucketing optimization.

The bucketing of the written files is relied on by downstream processes such as the Druid sampled-webrequest job. (It's the only one I know of, actually.)

Given the prospect of storing our data in Iceberg, we won't need this optimization for much longer.

What do you think?
Is the product analytics team using this feature?

Notice about Java optimizations for Spark:

Even though we now have a working solution for refine_webrequest, we could still apply these refactorings to the code:

1/ Remove the class singletons used in the context of UDFs, as they are not thread-safe

https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/886800/
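The thread-safety problem behind point 1 can be illustrated with a small sketch (the class is hypothetical, not actual refinery code): a static singleton shares one mutable, non-thread-safe object, such as a SimpleDateFormat, across every Spark task thread, whereas giving each UDF instance its own field keeps the state independent.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical illustration -- not actual refinery code.
public class ParserUdf {
    // BAD (the pattern being removed): a static singleton would share one
    // mutable SimpleDateFormat across all threads calling the UDF:
    //   private static final ParserUdf INSTANCE = new ParserUdf();

    // GOOD: each UDF instance owns its own parser, so concurrent Spark
    // tasks each work on independent state.
    private final SimpleDateFormat format;

    public ParserUdf() {
        format = new SimpleDateFormat("yyyy-MM-dd");
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
    }

    public String evaluate(Date d) {
        return format.format(d);
    }
}
```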

2/ Take care of the serialization of objects used in UDFs

  • create unit tests with SerializationUtils.clone()
  • add implements Serializable
  • make sure all static are finals
  • declare some properties transient, with a method to rebuild them both from the constructor and from readObject (a de-serialization hook). A good example of this is in AbstractDatabaseReader, where the property is rebuilt in readObject; a bad one is initializeDigestIfNeeded, which runs on every call.

refine_webrequest is currently working with Spark and refinery-hive-0.2.5 jar

The current trouble is with the Guava-to-Caffeine cache replacement: https://phabricator.wikimedia.org/P43846

Caffeine caches are not easily serializable with Kryo, the serialization library used by Spark. I've already tried to tune those parameters:

  • spark.kryo.registrator
  • spark.kryo.classesToRegister

without success, as it is not clear which Caffeine classes are missing, e.g. com.github.benmanes.caffeine.cache.LocalLoadingCache$$Lambda$1983/1345962703

As an alternative, we are going to make the cache property transient and manually initialize it.
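A minimal sketch of that alternative (names are hypothetical; the real field would hold a Caffeine cache, modeled here by a plain HashMap to keep the sketch self-contained). The field is marked transient so Kryo skips it, and every access goes through a getter that lazily rebuilds it after de-serialization; this avoids relying on a readObject hook, which Kryo does not call by default:

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public class GeocodeUdf implements Serializable {
    private static final long serialVersionUID = 1L;

    // transient: both Kryo and Java serialization skip this field, so the
    // non-serializable cache never travels from the driver to the executors.
    private transient Map<String, String> cache; // stand-in for a Caffeine cache

    // Manual initialization: rebuild the cache lazily on first use after
    // de-serialization, instead of trying to serialize it.
    private Map<String, String> cache() {
        if (cache == null) {
            cache = new HashMap<>(); // real code: Caffeine.newBuilder()...build()
        }
        return cache;
    }

    public String geocode(String ip) {
        // Hypothetical lookup standing in for the Maxmind geocoding call.
        return cache().computeIfAbsent(ip, k -> "geo:" + k);
    }
}
```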

Change 886800 abandoned by Aqu:

[analytics/refinery/source@master] Java Hive UDF thread safety

Reason:

Replaced by 883118

https://gerrit.wikimedia.org/r/886800

Replacing Guava forced me:

  • to make sure our singletons are not serialized;
  • to check the thread safety of those instances.

In other words, the code of this ticket is already in here: T325266

Change 883118 merged by Joal:

[analytics/refinery/source@master] Remove Guava from dependency

https://gerrit.wikimedia.org/r/883118

Change 904778 had a related patch set uploaded (by Aqu; author: Aqu):

[analytics/refinery/source@master] Review Java UDFs used in refine webrequest

https://gerrit.wikimedia.org/r/904778