Replace refinery-source Guava caches by Caffeine
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	JAllemandou
	Dec 15 2022, 8:53 AM

Description

This change will allow refinery-source to get rid of it's guava dependendcy, facilitating integration with other modules (guava version sync is complicated, see https://gerrit.wikimedia.org/r/c/wikimedia-event-utilities/+/857779)

Note:
There is another usage of guava in StemmerUDF line 209:

return Joiner.on(" ").join(words);

We can replace this using return StringUtils.join(words, " "); from org.apache.commons.lang.StringUtils

Status	Assigned	Task
Resolved	Ottomata	T324484 [Migrate] Oozie jobs migration for Webrequest
Resolved	Antoine_Quhen	T327072 Java Prep for Webrequest Load
Resolved	Antoine_Quhen	T325266 Replace refinery-source Guava caches by Caffeine

Mind if I hijack this task and we do this for wikimedia-event-utilities at the same time?

Oh, there is one more nice thing guava has that we use though: Resources. Makes it really easy to load bytes from almost any URL.

I've performed some tests with Hive and Spark to make sure geocoded_data generates the same output as before:

We have noticed that Caffeine caches were not easily de-serialized by Spark-Kryo: https://phabricator.wikimedia.org/P43846

Let's add 1 more test to make sure all UDFs are yielding the same results with Spark+Caffeine: https://gist.github.com/aq/056c1248e906b184a1106368274044c6

Result seems all good now. Well with refine_webrequest at least.

The last round of review is needed.

I've made sure that the use of our singletons is thread-safe + lazy-loaded.
I've ensured that Kryo does not serialize our singletons (and their caches).
I've fixed some suggestions from Sonar, and I've added some cleanup along the way.

Antoine_Quhen claimed this task.