Page MenuHomePhabricator

Replace refinery-source Guava caches by Caffeine
Closed, ResolvedPublic


The refinery-source code depends on guava only for its caching implementation. However as @Ottomata noted on a CR review, Caffeine is now the way to go for caching instead of guava:'s%20caching,CacheBuilder%20is%20its%20Caffeine%20class

This change will allow refinery-source to get rid of it's guava dependendcy, facilitating integration with other modules (guava version sync is complicated, see

There is another usage of guava in StemmerUDF line 209:

return Joiner.on(" ").join(words);

We can replace this using return StringUtils.join(words, " "); from org.apache.commons.lang.StringUtils

Event Timeline

Mind if I hijack this task and we do this for wikimedia-event-utilities at the same time?

Oh, there is one more nice thing guava has that we use though: Resources. Makes it really easy to load bytes from almost any URL.

I've performed some tests with Hive and Spark to make sure geocoded_data generates the same output as before:

We have noticed that Caffeine caches were not easily de-serialized by Spark-Kryo:

So after creating a new patch to avoid serializing the caches:

Let's add 1 more test to make sure all UDFs are yielding the same results with Spark+Caffeine:

Result seems all good now. Well with refine_webrequest at least.

The last round of review is needed.

  • I've made sure that the use of our singletons is thread-safe + lazy-loaded.
  • I've ensured that Kryo does not serialize our singletons (and their caches).
  • I've fixed some suggestions from Sonar, and I've added some cleanup along the way.