Useful classes for data engineers - Scala and Java

We all have our habits and, as programmers, our favorite libraries and frameworks are definitely among them. In this blog post I'll share a list of Java and Scala classes I use in almost every data engineering project. The part for Python will follow next week!

Unit testing - Diffx

It's one of my favorite ways to add extra context to test failures. Diffx is a Scala library that pinpoints the differing values in case classes. So instead of getting this:

Letter(1,a,AA) did not equal Letter(1,a,A)
ScalaTestFailureLocation: com.waitingforcode.DiffxExample at (DiffxExample.scala:16)
Expected :Letter(1,a,A)
Actual   :Letter(1,a,AA)

You'll get the exact difference for the mismatched fields:

Matching error:
Letter(
 	id: 1,
 	lower: a,
 	upper: A[A])

Additionally, you can customize the output by defining your own ShowConfig class.
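
To make the idea of a field-level diff more concrete, here is a minimal sketch written with the Scala standard library only. It is not how Diffx is implemented and it uses none of its API. The differingFields helper is a hypothetical name, the Letter definition is inferred from the output above, and the code assumes Scala 2.13 or later because of productElementNames.

```scala
object FieldDiffSketch {
  // Case class shape inferred from the failure message shown above
  final case class Letter(id: Int, lower: String, upper: String)

  // Hypothetical helper, not part of Diffx: returns (field name, expected, actual)
  // for every field whose values differ between two instances of the same case class.
  def differingFields[T <: Product](expected: T, actual: T): List[(String, Any, Any)] =
    expected.productElementNames
      .zip(expected.productIterator.zip(actual.productIterator))
      .collect { case (name, (exp, act)) if exp != act => (name, exp, act) }
      .toList
}
```

Calling FieldDiffSketch.differingFields(Letter(1, "a", "A"), Letter(1, "a", "AA")) returns a single entry for the upper field. Diffx goes much further, with nested structures, collections and readable rendering, which is why I prefer the library over a helper like this one.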

Maps.difference

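Unfortunately, Diffx shines only for case class comparisons. What if you have an arbitrary map instead? The Open Source community takes care of us here too. Google's Guava project, which you certainly know well if you came to data engineering from software engineering, brings a method called Maps.difference. As the name suggests, it returns the difference between 2 maps, including the entries present only in the left map, the entries present only in the right map, and the keys present in both maps but with different values.

The method is really powerful because it even detects differences in nested maps! Values are compared with equals, so a nested map that differs is reported as a whole, with its left and right versions side by side.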

import com.google.common.collect.Maps

val mapLeft = new java.util.HashMap[String, Any]()
mapLeft.put("common_equal", 1)
mapLeft.put("common_different", 1)
mapLeft.put("extra_in_left", "22222")
val nestedMapLeft = new java.util.HashMap[String, Int]()
nestedMapLeft.put("extra_left", 1)
nestedMapLeft.put("common_equal", 2)
mapLeft.put("different_nested_map", nestedMapLeft)

val mapRight = new java.util.HashMap[String, Any]()
mapRight.put("common_equal", 1)
mapRight.put("common_different", 11)
mapRight.put("extra_in_right", "33333")
val nestedMapRight = new java.util.HashMap[String, Int]()
nestedMapRight.put("extra_right", 11)
nestedMapRight.put("common_equal", 2)
mapRight.put("different_nested_map", nestedMapRight)

val diff = Maps.difference[String, Any](mapLeft, mapRight)

diff.areEqual() shouldEqual false
// 2 entries: common_different=(1, 11) and different_nested_map with both
// nested maps as the left and right values
println(diff.entriesDiffering())
// {extra_in_left=22222}
println(diff.entriesOnlyOnLeft())
// {extra_in_right=33333}
println(diff.entriesOnlyOnRight())
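
If you wonder what these three views contain, the sketch below rebuilds them with plain immutable Scala maps. It is only an illustration under my own assumptions and not Guava's implementation: SimpleDiff and difference are hypothetical names, and the values are compared with ==, which is why a nested map shows up as a single differing entry.

```scala
object MapDiffSketch {
  // Hypothetical result type mirroring the three views used in the Guava example
  final case class SimpleDiff[K, V](
      onlyOnLeft: Map[K, V],
      onlyOnRight: Map[K, V],
      differing: Map[K, (V, V)])

  def difference[K, V](left: Map[K, V], right: Map[K, V]): SimpleDiff[K, V] = {
    // Keys that exist on one side only
    val onlyOnLeft = left.filter { case (key, _) => !right.contains(key) }
    val onlyOnRight = right.filter { case (key, _) => !left.contains(key) }
    // Keys present on both sides whose values are not equal;
    // a nested map is compared as a whole value
    val differing = (left.keySet intersect right.keySet).iterator
      .filter(key => left(key) != right(key))
      .map(key => (key, (left(key), right(key))))
      .toMap
    SimpleDiff(onlyOnLeft, onlyOnRight, differing)
  }
}
```

The sketch has no notion of custom equivalence or sorted maps, so for real tests I would still reach for Maps.difference.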

TimeUnit

That's probably the most useful class if you deal with time. It simplifies the code a lot because instead of converting time units with multiplications or divisions, you call a method named after the target unit on the source unit, as below:

import java.util.concurrent.TimeUnit

val inputSeconds = 120

TimeUnit.SECONDS.toMinutes(inputSeconds) shouldEqual 2
TimeUnit.SECONDS.toMillis(inputSeconds) shouldEqual 120000

Beautiful, isn't it?
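
Two details are worth keeping in mind, and the short sketch below illustrates them: conversions to a coarser unit truncate, and convert() reads the other way around, with the target unit as the receiver. The TimeUnitSketch object and the timeoutMillis helper are hypothetical names used only for this illustration.

```scala
import java.util.concurrent.TimeUnit

object TimeUnitSketch {
  // Converting to a coarser unit truncates: 119 seconds is 1 full minute
  val truncatedMinutes: Long = TimeUnit.SECONDS.toMinutes(119L)

  // convert() is called on the target unit and takes the source unit as a parameter
  val twoMinutesInMillis: Long = TimeUnit.MILLISECONDS.convert(2L, TimeUnit.MINUTES)

  // Hypothetical helper: normalizes a timeout expressed in any unit to milliseconds
  def timeoutMillis(amount: Long, unit: TimeUnit): Long = unit.toMillis(amount)
}
```

A helper like timeoutMillis is handy when a configuration accepts a value with its unit, for example timeoutMillis(3L, TimeUnit.HOURS).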

FileUtils

A common requirement is to serialize a class and store it in a file as text. Even though Apache Spark does it without any specific code on your side, you may not always use Apache Spark. It's especially true for tests or small context files, where the FileUtils.writeStringToFile method from Apache Commons IO shines:

import java.io.File
import org.apache.commons.io.FileUtils

val textToWrite =
  """
    |line#1
    |line#2
    |line#3
    |""".stripMargin

FileUtils.writeStringToFile(new File("/tmp/test.txt"), textToWrite, "UTF-8")

FileUtils.readFileToString(new File("/tmp/test.txt"), "UTF-8") shouldEqual textToWrite

Besides the write, the snippet also shows the opposite method, which reads a file into a string.

ObjectMapper

Sometimes FileUtils alone is not sufficient to write an object as text. It's especially true for JSON, where building the document manually is cumbersome and error-prone. One way to address that issue is to use a dedicated JSON serialization library, and the one that has worked best for me for several years is Jackson.

Saving a case class as JSON with Jackson is easy. You simply create an ObjectMapper, register all required modules (DefaultScalaModule in the example), and use one of the existing write and read methods:

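import java.io.File
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.commons.io.FileUtils

val scalaJsonMapper = new ObjectMapper()
scalaJsonMapper.registerModule(DefaultScalaModule)

// Person is a case class with two String fields
val personToSave = Person("Save", "Me")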

val personJson = scalaJsonMapper.writeValueAsString(personToSave)
FileUtils.writeStringToFile(new File("/tmp/test.txt"), personJson, "UTF-8")

scalaJsonMapper.readValue(new File("/tmp/test.txt"), classOf[Person]) shouldEqual personToSave

CountDownLatch

Lastly, a class that should help you coordinate asynchronous code. I like using it to start a background process, let the main thread continue with its own work, and make it wait at an explicit point if it gets there before the background process has finished. Of course, there are certainly multiple other ways to achieve this, but for me an explicit blocker shows the intent much better than, for example, a Future.

The class that helps here is CountDownLatch. It's a counter-based lock where you define how many times the counter must be decreased (countDown()) before the execution resumes from the blocking point (await()). An example is just below, with a background process generating a file while the main thread does something else:

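import java.io.File
import java.util.concurrent.{CountDownLatch, TimeUnit}
import org.apache.commons.io.FileUtils

val textToWrite =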
  """
    |line#1
    |line#2
    |line#3
    |""".stripMargin

val countDownLatch = new CountDownLatch(1)
new Thread(new Runnable() {
  override def run(): Unit = {
    // Give some time to see the synchronization
    Thread.sleep(3000L)
    try {
      FileUtils.writeStringToFile(new File("/tmp/test.txt"), textToWrite, "UTF-8")
    } finally {
      countDownLatch.countDown()
    }
  }
}).start()

println("Doing some other, more important work here")

// await returns false if the timeout elapses before the counter reaches zero
countDownLatch.await(10, TimeUnit.SECONDS) shouldEqual true

FileUtils.readFileToString(new File("/tmp/test.txt"), "UTF-8") shouldEqual textToWrite
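
The example above uses a counter of 1, but the same pattern extends to several background processes. Below is a minimal sketch with the standard library only, where LatchSketch and runWorkers are hypothetical names. Each worker decrements the counter in a finally block, and the caller checks the boolean returned by await to know whether all of them finished in time.

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, CountDownLatch, TimeUnit}

object LatchSketch {
  // Starts `workers` background threads and waits until all of them are done
  // or the timeout expires. Returns (finished in time, number of recorded results).
  def runWorkers(workers: Int, timeoutSeconds: Long): (Boolean, Int) = {
    val latch = new CountDownLatch(workers)
    val results = new ConcurrentLinkedQueue[String]()
    (1 to workers).foreach { id =>
      new Thread(new Runnable() {
        override def run(): Unit = {
          try {
            results.add(s"worker-$id done")
          } finally {
            // Always decrement, even on failure, so the waiting thread is not stuck
            latch.countDown()
          }
        }
      }).start()
    }
    // await returns false when the timeout elapses before the counter reaches zero
    val finishedInTime = latch.await(timeoutSeconds, TimeUnit.SECONDS)
    (finishedInTime, results.size())
  }
}
```

Keep in mind that a CountDownLatch cannot be reset, so each call creates a new one.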

I hope you discovered something new here. Until last year I was not aware of Diffx and Maps.difference, but they turned out to be a better way to compare objects than visual checks or custom comparison code! What about your favorite Java or Scala libraries?