Java Programming Question, Unique but sorted

When doing technical job interviews I often ask my candidates simple questions to see how good their knowledge of several aspects of the Java programming language is.

The most basic knowledge every programmer must have is: data structures and algorithms. Therefore the most basic knowledge every Java programmer must have is: the Collections framework. The following examines the candidates knowledge on sets and lists (and their differences). The key isn’t to solve the question, but to show you why you did what you did.

The question

You are given a list of 1000 strings. Write a program that will remove any duplicates from the list while keeping the order.

The naive approach

The naive approach is to walk over the given list and add all elements to a result list, that does not yet contain the current element. The following example code demonstrates this:

public List<String> filter(List<String> strings) {
  List<String> result = new ArrayList<>();
  for (String string : strings) {
    if (!result.contains(string)) {
      result.add(string);
    }
  }
  return result;
}

This will produce the correct result. Let’s examine this code for two factors: Memory usage and runtime (measured in Big-O complexity).

Memory usage is quite low in this example. We use the 1000 strings which are already in memory and create a new list with references to them. No worries here.

Let’s examine the runtime next. The part of this example algorithm are:

  1. Walking the input list of strings
    This is unavoidable. We must take a look at each of the elements in the input list, which mean O(n)
  2. Checking if the result list already contains the current string
    The cost of this call depends on the implemenation within the class ArrayList. This is something an educated Java developer must know (or at least must guess correctly). In a best case scenario the element we are looking for is the first one. The worst case is that the element is the last one. Therefore the implementation of ArrayList must walk the result list from the beginning to the end for each contains() call, again O(n).
  3. Inserting the element in the result list
    Inserting an element to the end of an ArrayList is cheap (i.e. O(1)), but (an this is again something to identify an educated developer) only if there is enough room left in the array. If there is not enough room left (the data array within the ArrayList is full), the ArrayList will have to grow(). This will result in a call to Arrays#copyOf(), which will create a new array in memory and copy over the elements, which again means O(n). Note that this will happen more seldom, because ArrayList does a good job ensuring grow() doesn’t get called to often.

This results in a complexity of O(n²), because with the loop over the given strings we loop over the result set. If you come up with this solution, you definitely must give the explanations above. And you definitely must show the interest to improve your answer!

The educated guess

Now we know the naive approach is expensive because of it’s contains() calls. The next step is to think about what we can do to improve this. Another data structure provides a faster implementation of contains(): the Set. But simply jumping on this data structure is not enough, because one thing is not featured by this data structure: keeping the order elements were inserted. Using these two data structures we can implement the requested functionality.

public List<String> filter(List<String> strings) {
  List<String> result = new ArrayList<>();
  Set<String> seen = new HashSet<>();
  for (String string : strings) {
    if (!seen.contains(string)) {
      result.add(string);
      seen.add(string);
    }
  }
  return result;
}

Again, let’s take a look at the memory usage and complexity. The memory usage is low again, and only slightly more expensive. The implementation of HashSet uses a HashMap internally, which means it will store the references to the strings in memory plus the hashcode for them (i.e. 1000 integers).

Determining the complexity of this algorithm gets a bit harder.

  1. Walking the input list of strings
    Same as above, no changes here
  2. Checking is the element is already present in the result
    As mentioned the implementation of Set#contains() is cheaper than List#contains(). This is due to the fact that it will use Object#hashCode() to check if the element is already present. The HashMap used in the HashSet will guarantee that this is O(1). Note: if you also mention that on a worst case a HashMap could have O(n) due to using a single hash bucket, you would definitely gain extra points!
  3. Adding the element to the result list
    Same as above, no changes here
  4. Adding the element to the set
    HashMap#put(), which is used by HashSet#add(), also has O(1). Note that we could also trigger a HashMap#resize() here.

Now our algorithm has linear complexity O(n), which is way better than the quadratic complexity before.

Getting immediately hired

If you are a long time Java developer, you are likely to come up with the solution above. You would now be presented with the question: „Can we do better than this?“

The answer is that O(n) is the best we can do in regards of complexity. We must take a look at each element of the list at least once. Therefore we can’t get below O(n).

But in regards of the implementation there are several points we can still improve. First of all we could initialize our result List and seen Set with appropriate capacities to reduce calls to ArrayList#grow() or HashMap#resize(). If nothing else is specified in JDK 8 an ArrayList is initialized with space for 10 elements, and an HashMap with 16 elements for a load of 0.75. This is way too small for 1000 elements.

If you point out LinkedHashSet in the interview, chances are good that you could be hired immediately. This simplifies the implementation to the following two lines:

public List<String> filter(List<String> strings) {
  Set<String> result = new LinkedHashSet<>(strings);
  return new ArrayList<>(result);
}

This implementation avoids all manual size initialization and iteration mentioned above.

Copyright © christophbrill.de, 2002-2017.