
Java Prep for Webrequest Load
Closed, Resolved · Public · 9 Estimated Story Points

Description

Goal:
Do the required Java prep work to migrate the webrequest load jobs to Airflow

Job Details:

Input    | Processing | Output
Raw JSON | Hive       | Hive + Table Tests

Success Criteria:

  • Have the 2 Jobs Migrated (SLA 5 Hours)
Gotchas
  • This job includes archiving of results. We may need to adapt the existing custom Airflow ArchiveOperator to match this job's output format.
  • The job needs to be rewritten; how is still TBD.

Gerrit organisation:

  • 1 merge request replacing the Guava cache with Caffeine and removing the Guava dependency
  • 1 merge request making the existing UDF code thread-compatible (removing singletons + handling function serialization)

Event Timeline

EChetty updated Other Assignee, added: Antoine_Quhen.
EChetty moved this task from To be prioritised to Sprint 07 on the Data Pipelines board.
EChetty edited projects, added Data Pipelines (Sprint 07); removed Data Pipelines.
EChetty set the point value for this task to 9. · Jan 16 2023, 4:22 PM

Change 883118 had a related patch set uploaded (by Aqu; author: Aqu):

[analytics/refinery/source@master] [WIP] Java preparations before migrating webrequest to Spark

https://gerrit.wikimedia.org/r/883118

We encountered a bug when using Caffeine with the Maxmind geocoding library: an infinite loop was triggered at unit-test time. The value returned by loader.load is sometimes not compatible with Caffeine.cache.get, which forced us to use Caffeine.cache.getIfPresent + Caffeine.cache.put instead:
https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/refinery/source/+/0e56e1f949222865b70578d19aa3eacb3e489519/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/maxmind/DatabaseReaderCache.java#45
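The workaround replaces the single "look up or load" call with an explicit check-then-populate sequence. A minimal sketch of the pattern, using a plain ConcurrentHashMap as a stand-in for the Caffeine cache so the sketch is self-contained (the real code at the link above calls Caffeine's getIfPresent and put; the class and loader here are hypothetical):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class GetIfPresentPutCache<K, V> {
    // Stand-in for com.github.benmanes.caffeine.cache.Cache.
    private final ConcurrentHashMap<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    public GetIfPresentPutCache(Function<K, V> loader) {
        this.loader = loader;
    }

    // Instead of a combined cache.get(key, loader) call -- which looped
    // forever with the Maxmind loader -- check for the value first and
    // populate the cache explicitly.
    public V get(K key) {
        V value = cache.get(key);      // Caffeine: cache.getIfPresent(key)
        if (value == null) {
            value = loader.apply(key); // Maxmind database lookup in the real code
            cache.put(key, value);     // Caffeine: cache.put(key, value)
        }
        return value;
    }
}
```

Note that this two-step version is not atomic: two threads can race and both invoke the loader for the same key. That is acceptable here because the loaded value is idempotent.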

Change 886800 had a related patch set uploaded (by Aqu; author: Aqu):

[analytics/refinery/source@master] Java Hive UDF thread safety

https://gerrit.wikimedia.org/r/886800

Notice about bucketing: The wmf.webrequest table is currently bucketed by hostname and sequence; the bucketing happens when Hive writes the table. Bucketing is mainly useful for sampling a table. Spark does not support Hive-style table bucketing when writing through SQL. So, if we want to migrate the refine_webrequest job from Hive to Spark, we may have to drop the bucketing optimization.

The bucketing of the written files is relied on by downstream processes such as the Druid sampled-webrequest job. (It's the only one I know of, actually.)

Given the prospect of storing our data in Iceberg, we won't need this optimization for much longer.

What do you think?
Is the product analytics team using this feature?

Notice about Java optimizations for Spark:

Even though we now have a working solution for refine_webrequest, we could still apply these refactorings to the code:

1/ Remove the class singletons used in the context of UDFs, as they are not thread-safe

https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/886800/
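The thread-safety problem behind point 1 can be illustrated with a small sketch (the class is hypothetical, not actual refinery code): a static singleton shares one mutable, non-thread-safe object, such as a SimpleDateFormat, across every Spark task thread, whereas giving each UDF instance its own field keeps the state independent.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical illustration -- not actual refinery code.
public class ParserUdf {
    // BAD (the pattern being removed): a static singleton would share one
    // mutable SimpleDateFormat across all threads calling the UDF:
    //   private static final ParserUdf INSTANCE = new ParserUdf();

    // GOOD: each UDF instance owns its own parser, so concurrent Spark
    // tasks each work on independent state.
    private final SimpleDateFormat format;

    public ParserUdf() {
        format = new SimpleDateFormat("yyyy-MM-dd");
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
    }

    public String evaluate(Date d) {
        return format.format(d);
    }
}
```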

2/ Take care of the serialization of objects used in UDFs

  • create unit tests with SerializationUtils.clone()
  • add implements Serializable
  • make sure all static are finals
  • declare some properties transient, with a method to rebuild them both from the constructor and from readObject (a de-serialization hook). A good example of this is in AbstractDatabaseReader, where the property is rebuilt in readObject; a bad one is initializeDigestIfNeeded, which runs on every call.

refine_webrequest is currently working with Spark and refinery-hive-0.2.5 jar

The current trouble is with the Guava-to-Caffeine cache replacement: https://phabricator.wikimedia.org/P43846

Caffeine caches are not easily serializable with Kryo, the serialization library used by Spark. I've already tried to tune those parameters:

  • spark.kryo.registrator
  • spark.kryo.classesToRegister

without success, as it is not clear which Caffeine classes are missing, e.g. com.github.benmanes.caffeine.cache.LocalLoadingCache$$Lambda$1983/1345962703

As an alternative, we are going to make the cache property transient and manually initialize it.
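A minimal sketch of that alternative (names are hypothetical; the real field would hold a Caffeine cache, modeled here by a plain HashMap to keep the sketch self-contained). The field is marked transient so Kryo skips it, and every access goes through a getter that lazily rebuilds it after de-serialization; this avoids relying on a readObject hook, which Kryo does not call by default:

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public class GeocodeUdf implements Serializable {
    private static final long serialVersionUID = 1L;

    // transient: both Kryo and Java serialization skip this field, so the
    // non-serializable cache never travels from the driver to the executors.
    private transient Map<String, String> cache; // stand-in for a Caffeine cache

    // Manual initialization: rebuild the cache lazily on first use after
    // de-serialization, instead of trying to serialize it.
    private Map<String, String> cache() {
        if (cache == null) {
            cache = new HashMap<>(); // real code: Caffeine.newBuilder()...build()
        }
        return cache;
    }

    public String geocode(String ip) {
        // Hypothetical lookup standing in for the Maxmind geocoding call.
        return cache().computeIfAbsent(ip, k -> "geo:" + k);
    }
}
```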

Change 886800 abandoned by Aqu:

[analytics/refinery/source@master] Java Hive UDF thread safety

Reason:

Replaced by 883118

https://gerrit.wikimedia.org/r/886800

Replacing Guava forced me:

  • to make sure our singletons are not serialized;
  • to check the thread safety of those instances.

In other words, the code of this ticket is already in here: T325266

Change 883118 merged by Joal:

[analytics/refinery/source@master] Remove Guava from dependency

https://gerrit.wikimedia.org/r/883118

Change 904778 had a related patch set uploaded (by Aqu; author: Aqu):

[analytics/refinery/source@master] Review Java UDFs used in refine webrequest

https://gerrit.wikimedia.org/r/904778