Page MenuHomePhabricator

Replace refinery-source Guava caches by Caffeine
Closed, ResolvedPublic

Description

The refinery-source code depends on guava only for its caching implementation. However as @Ottomata noted on a CR review, Caffeine is now the way to go for caching instead of guava: https://guava.dev/releases/31.0-jre/api/docs/com/google/common/cache/CacheBuilder.html#:~:text=The%20successor%20to%20Guava's%20caching,CacheBuilder%20is%20its%20Caffeine%20class

This change will allow refinery-source to get rid of it's guava dependendcy, facilitating integration with other modules (guava version sync is complicated, see https://gerrit.wikimedia.org/r/c/wikimedia-event-utilities/+/857779)

Note:
There is another usage of guava in StemmerUDF line 209:

return Joiner.on(" ").join(words);

We can replace this using return StringUtils.join(words, " "); from org.apache.commons.lang.StringUtils

Event Timeline

Mind if I hijack this task and we do this for wikimedia-event-utilities at the same time?

Oh, there is one more nice thing guava has that we use though: Resources. Makes it really easy to load bytes from almost any URL.

I've performed some tests with Hive and Spark to make sure geocoded_data generates the same output as before:

We have noticed that Caffeine caches were not easily de-serialized by Spark-Kryo: https://phabricator.wikimedia.org/P43846

So after creating a new patch to avoid serializing the caches: https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/883118

Let's add 1 more test to make sure all UDFs are yielding the same results with Spark+Caffeine: https://gist.github.com/aq/056c1248e906b184a1106368274044c6

Result seems all good now. Well with refine_webrequest at least.

The last round of review is needed.

  • I've made sure that the use of our singletons is thread-safe + lazy-loaded.
  • I've ensured that Kryo does not serialize our singletons (and their caches).
  • I've fixed some suggestions from Sonar, and I've added some cleanup along the way.