30.07.2017 - hashCode()/equals() using lambdas

Writing proper hashCode() and equals() methods is really important, but a pretty boring task. Luckily IDEs nowadays generate these, so we don’t have to worry about them too much. But this approach has two downsides:

  1. it adds at least 20 lines of complex-looking code to often very simple classes (e.g. DTOs or JPA entities), which reduces readability and might have a negative impact on static code analysis
  2. the IDEs take different approaches, so the code looks different for e.g. Eclipse and IntelliJ

Lambdas as an alternative

To simplify the process of generating hashCode() and equals() methods I thought about using Java 8 lambdas. This is the approach I came up with:

import java.util.function.Function;

public class HashCodeEqualsBuilder {

    @SafeVarargs
    public static <T> int buildHashCode(T t, Function<? super T, ?>... functions) {
        final int prime = 31;
        int result = 1;
        for (Function<? super T, ?> f : functions) {
            Object o = f.apply(t);
            result = prime * result + ((o == null) ? 0 : o.hashCode());
        }
        return result;
    }

    @SafeVarargs
    public static <T> boolean isEqual(T t, Object obj, Function<? super T, ?>... functions) {
        if (t == obj) {
            return true;
        }
        if (obj == null) {
            return false;
        }
        if (t.getClass() != obj.getClass()) {
            return false;
        }
        @SuppressWarnings("unchecked")
        T other = (T) obj;
        for (Function<? super T, ?> f : functions) {
            Object own = f.apply(t);
            Object foreign = f.apply(other);
            if (own == null) {
                if (foreign != null) {
                    return false;
                }
            } else if (!own.equals(foreign)) {
                return false;
            }
        }
        return true;
    }
}

To use it in a class (e.g. a DTO), add the following:

@Override
public int hashCode() {
    return HashCodeEqualsBuilder.buildHashCode(this, MyClass::getA, MyClass::getB);
}

@Override
public boolean equals(Object obj) {
    return HashCodeEqualsBuilder.isEqual(this, obj, MyClass::getA, MyClass::getB);
}
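
For completeness, a hypothetical MyClass these overrides could live in might look like this (the field names and types are just placeholders, not part of the original example):

public class MyClass {

    private String a;
    private Integer b;

    public String getA() { return a; }
    public Integer getB() { return b; }

    // ... plus the hashCode() and equals() overrides shown above
}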

Based on some microbenchmarks I did, the performance is on par with the autogenerated implementation (e.g. by Eclipse). I will start using this from now on and can recommend the approach for two reasons:

  1. it does not add additional lines of code to your project that might pop up as uncovered by unit tests
  2. updating the implementation (if ever) only needs to happen in a single place

03.10.2015 - Postfix domain whitelist

If you want to send e-mails only to a specific domain and block all other domains (e.g. when you are developing software and testing it without sending e-mails to your customers), you need to filter or rewrite the recipient. This domain whitelist can be implemented efficiently using Postfix as a filter.

Set up the relay in /etc/postfix/main.cf

relayhost = some.ip.or.hostname

Set up rewriting in /etc/postfix/main.cf

smtp_generic_maps = regexp:/etc/postfix/generic

Set the whitelist in /etc/postfix/generic

!/(.*@allowed.domain)/ admin.to.inform@allowed.domain

Note that you might have to regenerate /etc/postfix/generic.db and restart postfix.

postmap /etc/postfix/generic
sudo systemctl restart postfix

This is pretty simple and will ensure no accidental e-mails leave your development environment. You can test your setup using:

postmap -fq someone@allowed.domain regexp:/etc/postfix/generic
postmap -fq someoneelse@bad.domain regexp:/etc/postfix/generic

07.08.2015 - Hibernate, @Fetch(FetchMode.SUBSELECT) and accidentally loading a whole database into memory

This week I was facing a very suspicious performance issue with @Fetch(FetchMode.SUBSELECT) and Hibernate. Let’s imagine the following entities:

@Entity
public class Parent {
    @OneToMany
    @Fetch(FetchMode.SUBSELECT)
    private List<Child> children;
}

@Entity
public class Child {
    private String something;
}

This comes with the following logic:

for (int i = 0; i < parentCount; i += 10) {
    List<Parent> tenParents = entityManager
            .createQuery("from Parent", Parent.class)
            .setFirstResult(i)
            .setMaxResults(10)
            .getResultList();
    for (Parent parent : tenParents) {
        for (Child child : parent.children) {
            System.out.println(child.something);
        }
    }
}

Long pause at the start

Don’t try to make too much sense of the code above. What it does is load the Parent entities in batches of 10 and print the children’s member variable ‘something’. This looks pretty straightforward, but it caused a long pause when accessing the children relation of the first Parent entity. It just didn’t make much sense. We took a look at the SQL statements generated by Hibernate:

SELECT ... FROM Parent LIMIT 0, 10

This is our manual query which loads 10 parents.

SELECT ... FROM Child WHERE parentId IN (SELECT id FROM Parent)

This one is … suspicious. It is the query that caused the program to stall for a while. And it got me thinking.

The magic of SUBSELECT

What Hibernate does for SUBSELECT is a pretty brilliant idea. When you load entities using Hibernate, it will remember which query you used to load the data. This query is then used to lazily load the child entities: if you used query X to load the data, it will load the children using query X as a subquery (i.e. “SELECT … FROM Y WHERE parentId IN (X)”).

This is helpful in certain situations. One example is where FetchMode.SELECT fails. FetchMode.SELECT will load the sub-entities using the ID of the parent, i.e. “SELECT … FROM Y WHERE parentId = ?”. Combined with @BatchSize(size = 10) this will generate optimized queries like “SELECT … FROM Y WHERE parentId IN (?, ?, ?, …)”. Larger batch sizes will generate more question marks. This forces Hibernate to transfer many IDs to the database server, where using a subselect might simply be faster (of course only if the SQL query used in the subselect is fast).

The issue with SUBSELECTs

The issue in my specific example is: while Hibernate remembers the query used to load the data, it does not remember offset and maxResults! In my case this caused Hibernate to load more than 1 million Children just to access a few of them in one batch. While this is correct behavior, it consumes a lot of memory, which might be unexpected. If this is combined with flushes to the entity manager in the loops above, Hibernate will load the data multiple times.

My solution

In my case switching to @Fetch(FetchMode.SELECT) with @BatchSize was the best solution. My batch size is small so sending a few IDs to the database does no harm.
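
A sketch of what the mapping could look like (the @OneToMany mapping and the batch size of 10 are assumptions for illustration, not the exact mapping from my project):

import java.util.List;

import javax.persistence.Entity;
import javax.persistence.OneToMany;

import org.hibernate.annotations.BatchSize;
import org.hibernate.annotations.Fetch;
import org.hibernate.annotations.FetchMode;

@Entity
public class Parent {
    // Children are now loaded via "... WHERE parentId IN (?, ?, ...)" queries,
    // for at most 10 parents per query, instead of re-running the original
    // query as a subselect.
    @OneToMany
    @Fetch(FetchMode.SELECT)
    @BatchSize(size = 10)
    private List<Child> children;
}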

04.04.2015 - Java Programming Question, Unique but sorted

When doing technical job interviews I often ask my candidates simple questions to see how good their knowledge of several aspects of the Java programming language is.

The most basic knowledge every programmer must have is: data structures and algorithms. Therefore the most basic knowledge every Java programmer must have is: the Collections framework. The following examines the candidate’s knowledge of sets and lists (and their differences). The key isn’t just to solve the question, but to explain why you did what you did.

The question

You are given a list of 1000 strings. Write a program that will remove any duplicates from the list while keeping the order.

The naive approach

The naive approach is to walk over the given list and add each element to a result list if it is not already contained in it. The following example code demonstrates this:

public List<String> filter(List<String> strings) {
    List<String> result = new ArrayList<>();
    for (String string : strings) {
        if (!result.contains(string)) {
            result.add(string);
        }
    }
    return result;
}

This will produce the correct result. Let’s examine this code for two factors: Memory usage and runtime (measured in Big-O complexity).

Memory usage is quite low in this example. We use the 1000 strings which are already in memory and create a new list with references to them. No worries here.

Let’s examine the runtime next. The parts of this example algorithm are:

  1. Walking the input list of strings
    This is unavoidable. We must take a look at each of the elements in the input list, which means O(n)

  2. Checking if the result list already contains the current string
    The cost of this call depends on the implementation within the class ArrayList. This is something an educated Java developer must know (or at least must guess correctly). In the best case the element we are looking for is the first one; in the worst case it is the last one. Therefore ArrayList must walk the result list from the beginning to the end for each contains() call, which is again O(n).

  3. Inserting the element in the result list
    Inserting an element at the end of an ArrayList is cheap (i.e. O(1)), but (and this is again something that identifies an educated developer) only if there is enough room left in the array. If there is not enough room left (the data array within the ArrayList is full), the ArrayList will have to grow(). This results in a call to Arrays#copyOf(), which will create a new array in memory and copy over the elements, which again means O(n). Note that this happens rarely, because ArrayList does a good job of ensuring grow() doesn’t get called too often.

This results in a complexity of O(n²), because within the loop over the given strings we also loop over the result list. If you come up with this solution, you definitely must give the explanations above. And you definitely must show an interest in improving your answer!

The educated guess

Now we know the naive approach is expensive because of its contains() calls. The next step is to think about what we can do to improve this. Another data structure provides a faster implementation of contains(): the Set. But simply jumping to this data structure is not enough, because it lacks one feature: keeping the order in which elements were inserted. Using these two data structures together we can implement the requested functionality.

public List<String> filter(List<String> strings) {
    List<String> result = new ArrayList<>();
    Set<String> seen = new HashSet<>();
    for (String string : strings) {
        if (!seen.contains(string)) {
            result.add(string);
            seen.add(string);
        }
    }
    return result;
}

Again, let’s take a look at the memory usage and complexity. The memory usage is low again, and only slightly higher than before. The implementation of HashSet uses a HashMap internally, which means it will store the references to the strings in memory plus their hash codes (i.e. 1000 integers).

Determining the complexity of this algorithm gets a bit harder.

  1. Walking the input list of strings
    Same as above, no changes here

  2. Checking if the element is already present in the result
    As mentioned, the implementation of Set#contains() is cheaper than List#contains(). This is due to the fact that it uses Object#hashCode() to check if the element is already present. The HashMap used in the HashSet guarantees that this is O(1) on average. Note: if you also mention that in the worst case a HashMap could degrade to O(n) due to using a single hash bucket, you definitely gain extra points!

  3. Adding the element to the result list
    Same as above, no changes here

  4. Adding the element to the set
    HashMap#put(), which is used by HashSet#add(), also has O(1). Note that we could also trigger a HashMap#resize() here.

Now our algorithm has linear complexity O(n), which is way better than the quadratic complexity before.

Getting immediately hired

If you are a long time Java developer, you are likely to come up with the solution above. You would now be presented with the question: „Can we do better than this?“

The answer is that O(n) is the best we can do in terms of complexity. We must take a look at each element of the list at least once, therefore we can’t get below O(n).

But in terms of the implementation there are several points we can still improve. First of all, we could initialize our result List and seen Set with appropriate capacities to reduce calls to ArrayList#grow() or HashMap#resize(). If nothing else is specified, in JDK 8 an ArrayList is initialized with space for 10 elements and a HashMap with 16 buckets at a load factor of 0.75. This is way too small for 1000 elements; a sketch of a pre-sized version follows below.
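
A sketch of how the pre-sizing could look (it also uses the return value of Set#add() to save the separate contains() call; the capacity calculation mirrors the default load factor of 0.75 mentioned above):

public List<String> filter(List<String> strings) {
    // pre-size both collections so neither ArrayList#grow() nor HashMap#resize() is triggered
    List<String> result = new ArrayList<>(strings.size());
    Set<String> seen = new HashSet<>((int) (strings.size() / 0.75f) + 1);
    for (String string : strings) {
        if (seen.add(string)) { // add() returns false if the element was already present
            result.add(string);
        }
    }
    return result;
}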

If you point out LinkedHashSet in the interview, chances are good that you could be hired immediately. This simplifies the implementation to the following two lines:

public List<String> filter(List<String> strings) {
    Set<String> result = new LinkedHashSet<>(strings);
    return new ArrayList<>(result);
}

This implementation avoids all manual size initialization and iteration mentioned above.

14.03.2015 - Continuous Integration Setup

Improving the code quality of software as an everyday process is challenging. It consists of multiple parts like unit and integration tests, test code coverage, and static code analysis including code metrics and proper comments. All this work is additional to the main task of software developers: creating fantastic new features … and fixing bugs from time to time. Over the years I’ve played around with several tools to ease this task for all parties involved and have come up with a solution that performs most of the job automatically. Without further ado I present a continuous integration setup that I think works best (it’s actually very similar to what my company is doing).

Overview of the continuous integration setup

Continuous Integration Setup

Now that you managed to scroll past the large image you want to know more about the details, right? First of all, this scenario is meant for Java EE based development processes, but I see no reason not to apply it to other types of development, too (keeping in mind that some components might need different implementations). I’d like to look at this from two different angles: the developer view and the manager view.

Developer view

When doing development I want my focus to be on the primary target: adding a new, cool feature. In the current workflow at my company it is necessary that I also add a black-box test case for whatever code I added, to verify I did well. The next step (usually done by a fellow software engineer of the same team) is to convert the black-box test case into a white-box test case by expanding it to cover most of the code paths (especially including negative testing). We are currently focusing on integration tests (i.e. not testing using mock objects, but on a dedicated machine hosting the complete software including a database), simply because we want to test a scenario most similar to our production system. Also, having a large integration test suite is really awesome, as it can easily be used for regression testing.

But integration tests are not suited for running on a developer machine. The main reason is that the tests are quite time consuming and require lots of resources (RAM, CPU, etc.). Another reason is that comparing the results is like comparing apples and oranges: they were produced on different systems with possibly different configurations and possibly different datasets. This is where our automated process comes into play: each time a developer commits code and pushes it to the central server (we are using Git), our build server will compile our software, run the unit tests and notify the developer of bad results. Now we have a built and unit-tested piece of software. The next step is running the integration tests and, again, getting notified about bad results. In the best case the developer will not be notified about anything, which makes every developer happy.

After a successful result the last step is publishing the artifacts. There are two possible points at which to publish the built artifact: either after it was built and unit-tested, or after it was verified by the integration tests. It boils down to the question: is an artifact with a failed integration test still a valid artifact? In my current process we accept artifacts with failed integration tests for non-release versions (i.e. Maven SNAPSHOT versions), but not for software that might be deployed to production servers.

And this is basically where the developer view ends. Every developer is encouraged to look at other measurements for quality, but in a team solely consisting of senior developers it’s not essential.

Manager and quality assurance view

All information must be aggregated in some kind of portal to be easily accessible for everyone. From a management perspective it can be used to see how well (or badly) things are going. This is essential for everyone not directly involved in the actual development process but still working on or with the product in development. So the portal is primarily used by software architects, team leads, even CTOs or vice presidents (and of course the product owner in a Scrum process).

When diving deeper into the metrics, test results and documentation, the same portal can be used for quality assurance. It helps the product owner verify whether stories were implemented properly (e.g. no failing tests and proper documentation), and testers identify which parts of the software aren’t completely tested yet and write tests for those cases. The really crucial part is to aggregate the information in a proper way. In our setup SonarQube does a lot of this work for us; it provides coverage for unit and integration tests next to code metrics. Other tools can be used as well (e.g. Checkstyle, FindBugs, or PMD) and there are really good plugins integrating the data into Jenkins.

A key component is also automatic delivery of the software to an internal test system. It is a reference system hosting the latest software for everyone; holding the necessary configuration options, specifying the matching database and also providing a stable test system for the latest code.

Final words

Obviously, setting up the complete infrastructure is a little bit time consuming, but it is absolutely worth it. Many components have to be brought together, but the degree of automation really simplifies the everyday work. The key is to have a transparent development process that especially highlights its weaknesses. This enables developers to improve their skills and largely improves the code quality.

A final remark: not all default rules of the various analysis tools will match your project. Keep in mind that tweaking the rules (e.g. starting with slightly more relaxed rules) is an essential part of setting up your infrastructure. It will allow you to start with a reduced number of issues.

06.06.2014 - Creating an 'Indiana Jones flight map effect' animation

I recently saw a simple 2D animation that rendered, as a red line, how a plane moved over a map. This is what you might also know as the „Indiana Jones flight map effect“. I immediately had an idea on how to do this in Blender, so I went ahead and did it. Here is the result: Click to show the animation.

How to achieve the effect

The idea is quite simple: to make the red path appear on the map we move it from behind the map to the front of it. The trick is that the path isn’t planar to the map; the front is „higher“ than the end of the line. This way, moving it through the map will make the red line appear at the start first and reveal more and more until the line reaches the end.

So first of all you need to create the background map for your animation. I used tiles from OpenStreetMap on 3×3 planes for this, but you can use whatever background you prefer. Afterwards you need to create the line the flight will take. I thought using a straight line was boring, so I went for a curve: I created a Bezier curve along the path I wanted to animate. Make sure you create a 3D curve, because we need the Z-axis for our animation. Also make sure to set the curve’s „Twisting“ to „Z-Up“, which is necessary so that the line stays flat in relation to the map. Once you are satisfied with your path, you need to add a mesh to solidify the curve. Use a simple small plane for this, set its color to red and add two modifiers:

  • Array: This will convert the plane into a line. Adjust width and height until you are satisfied with your line.
  • Curve: This will make your line follow the curve.

The red material must not cast any shadows. If you are using Cycles to render the animation I recommend creating the material as shown in the Blenderartists thread Cycles: make object cast no shadow.

Material setup without shadows

After you have modeled your path (the mesh and the curve), move it behind your map. Then insert a location keyframe (press „I“ and select „Location“) for both, go to your target frame (e.g. 90), move the line to the front and insert another location keyframe. Blender will now interpolate between the two points for you. Press „Alt+A“ to see a preview of the animation. You might need to adjust the curve to make the animation look smooth (rotate some nodes, add some nodes, etc.). Once satisfied, go ahead and render the animation.

To create a GIF from the generated PNGs you can (as for most imaging automation tasks) use ImageMagick:

convert $( for a in 00*.png; do printf -- "-delay 10 %s " $a; done; ) result.gif

Feel free to play with the final blend file.

10.02.2014 - Finding relevant images in a large set of files

I recently had to recover data (images to be precise) from a 1 TB drive. I used photorec to find the data on the drive. It resulted in lots and lots of images within lots and lots of files. How do you approach finding relevant images within that amount of data? For initial filtering I focused on a single criterion: image dimensions. For my case I used 1024 x 768 as the minimal resolution (which is 786432 pixels when multiplied). I wrote a little script to find all images in a given directory, check their size, and symlink them to a folder sorting them by YEAR/MONTH/DAY. You can invoke it by calling the following:

./findimages.sh /some/directory

It will produce /some/directory/found with corresponding subfolders for the date of each image. Note that you need the exiv2 binary to properly use this script.

#!/bin/bash
# bash is required because the script uses the 'function' keyword and [[ =~ ]]

export LC_ALL=C
# double quotes so that ${1} is actually expanded
DIR="${1}"
TARGET_DIR="${DIR}/found/"

function symlink() {
    IMAGE="${1}"
    EXIV="${2}"
    IMAGE_DATE=$(echo "${EXIV}" | grep "Image timestamp" | sed -e "s/Image timestamp\s*:\s*//")
    DATE_FOLDER=$(echo ${IMAGE_DATE} | sed -e "s:\([0-9:]\+\) [0-9:]\+:\1:" | sed -e "s|:|/|g")
    mkdir -p "${TARGET_DIR}${DATE_FOLDER}"
    ln "${IMAGE}" "${TARGET_DIR}${DATE_FOLDER}"
    sync
}

COUNT=$(find ${DIR} -name '*.jpg' 2> /dev/null | wc -l)
CURRENT=1

for IMAGE in $(find ${DIR} -name '*.jpg' 2> /dev/null); do
    echo "${CURRENT}/${COUNT}"
    CURRENT=$((${CURRENT} + 1))
    EXIV="$(exiv2 ${IMAGE} 2> /dev/null)"
    IMAGE_SIZE=$(echo "${EXIV}" | grep "Image size" | sed -e "s/Image size\s*:\s*//")
    if [[ ${IMAGE_SIZE} =~ ^[0-9]+( )x( )[0-9]+.*$ ]]; then
        X=$(echo ${IMAGE_SIZE} | sed -e "s:\([0-9]\+\) x [0-9]\+:\1:")
        Y=$(echo ${IMAGE_SIZE} | sed -e "s:[0-9]\+ x \([0-9]\+\):\1:")
        if [ "$((${X} * ${Y}))" -gt "786432" ]; then
            symlink "${IMAGE}" "${EXIV}"
        fi
    else
        echo "${IMAGE}: ${IMAGE_SIZE} (${EXIV})"
        symlink "${IMAGE}" "${EXIV}"
    fi
done

If you have suggestions on how to improve the script or smarter alternatives, please let me know.

27.11.2013 - nginx + cgit in a subfolder

I’ve recently decided to give nginx a try. So far I’ve used Apache httpd 2.4 exclusively, but I wanted to get to know the competitor before deciding whether to stick with Apache httpd or switch to nginx. There are lots of claims about nginx’s performance and Apache httpd’s bloat, but I simply wanted to get started with a test scenario. For me this scenario was running the cgit web interface on http://localhost/cgit/. Apart from the configuration files the only major difference to me is: nginx only provides a FastCGI interface, while I was using the built-in CGI interface in Apache. Luckily FastCGI is not hard to set up nowadays. On Arch Linux it basically boils down to the following three commands:

pacman -S nginx fcgiwrap
systemctl enable fcgiwrap.socket
systemctl start fcgiwrap.socket

After this you can use FastCGI, and thanks to systemd it is socket activated, i.e. it is only started when someone accesses it. Afterwards I reused my /etc/cgitrc from the previous Apache httpd setup, which looks more or less like this:

root-title=My git Server
root-desc=This is my very own git server!
css=/cgit-web/cgit.css
logo=/cgit-web/cgit.png

virtual-root=/cgit

section=Some section

repo.url=Some Repo
repo.path=/srv/git/somerepo.git

For me the largest obstacle was setting „virtual-root = /cgit“, which was not necessary for Apache httpd but is mandatory for nginx. It has to match the location set in /etc/nginx/nginx.conf:

http {
    ...
    server {
        ...
        location /cgit {
            include fastcgi_params;
            fastcgi_param SCRIPT_FILENAME /usr/lib/cgit/cgit.cgi;
            fastcgi_pass unix:/run/fcgiwrap.sock;
            fastcgi_split_path_info ^(/cgit/?)(.+)$;
            fastcgi_param PATH_INFO $fastcgi_path_info;
            fastcgi_param QUERY_STRING $args;
        }
        location /cgit-web {
            rewrite ^/cgit-web(/.*)$ $1 break;
            root /usr/share/webapps/cgit;
        }
    }
}

After performing this small amount of configuration in nginx the server was up and running. In my very own and non-representative test scenario nginx and Apache httpd performed very similarly (tested using JMeter). Even the memory usage was not that different. So at least for me the difference was not enough to convince me to permanently change my setup for now. Most of the things I need (PHP, ~/public_html, cgit) work fine on both, and my Apache httpd configuration has been solid and proven for years.

If you have other suggestions on why to use nginx over Apache httpd, feel free to let me know.

12.10.2013 - Unit testing with junit and mockito

If it’s not tested, it’s broken. – Bruce Eckel

I’m a great fan of reliable software. And as a software developer I only want to create software that works reliably. In my opinion crashes in any software (or even worse: data loss) cause distrust from the user. Would you want to run a piece of software that crashes? I wouldn’t. That’s why I think that testing software (and finding crashes before your user does) is a very important part of software development.

So how do we test software? As usual, Wikipedia knows lots of testing types. You see, there is a large number of ways to test software. I’ll focus on testing typically done by software developers: unit testing. Unit testing is done, as the name suggests, on a „unit“ of the software. I prefer to use my classes and their methods as my units; others might do it differently. So let’s start with a trivial example unit, the SomeDao:

package de.egore911.test;
import javax.inject.Inject;
import javax.persistence.EntityManager;
import javax.persistence.Id;

public class SomeDao {
    public static class Some {
        @Id public Integer id;
    }

    @Inject private EntityManager em;

    protected EntityManager getEntityManager() { return em; }

    public Some comeGetSome(Integer id) {
        return getEntityManager().find(Some.class, id);
    }
}

This code is really simple. We have a DAO (data access object) which is able to load entities of the type Some by their ID. The class itself uses CDI to inject the EntityManager instance. This is a very common type of class I’ve seen in lots of web projects, but how do we test this unit? At first glance it depends on an injection framework, which needs an EntityManager, which needs a database, which needs dummy data. This is a lot, and lots of things could cause the test to fail that should not be part of this unit test (e.g. the database not being available). So we need a mock object for the EntityManager, which does not actually need to be injected and also does not need a database. This is where Mockito comes into play.

package de.egore911.test;

import javax.persistence.EntityManager;

import org.junit.Assert;
import org.junit.Test;
import org.mockito.Mockito;

import de.egore911.test.SomeDao.Some;

public class SomeDaoTest {

    @Test
    public void testComeGetSome() {
        // Mock the EntityManager to return our dummy element
        Some dummy = new Some();
        EntityManager em = Mockito.mock(EntityManager.class);
        Mockito.when(em.find(Some.class, 1234)).thenReturn(dummy);

        // Mock the SomeDao to use our EntityManager
        SomeDao someDao = Mockito.mock(SomeDao.class);
        Mockito.when(someDao.comeGetSome(1234)).thenCallRealMethod();
        Mockito.when(someDao.getEntityManager()).thenReturn(em);

        // Perform the actual test
        Assert.assertSame(dummy, someDao.comeGetSome(1234));
        Assert.assertNull(someDao.comeGetSome(4321));
    }
}

You can see that this is fairly simple. First you mock an EntityManager that will return a dummy object when the EntityManager.find() method is called. Then we make sure our mocked EntityManager is returned when SomeDao.getEntityManager() is called, and that the real SomeDao.comeGetSome() is invoked. Of course we could do all this ourselves using reflection, but Mockito does the groundwork for us.

Mockito offers some other nice features, but my primary use is stubbing method calls.
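
As one illustrative sketch of those features (not part of the original test, just an example using the same classes as above), verify() can check that a mock was actually used as expected:

@Test
public void testVerifyInteractions() {
    EntityManager em = Mockito.mock(EntityManager.class);

    // use the mock somewhere ...
    em.find(Some.class, 1234);

    // ... and assert afterwards that find() was called exactly once with these arguments
    Mockito.verify(em).find(Some.class, 1234);
    // ... and never with an id we did not expect
    Mockito.verify(em, Mockito.never()).find(Some.class, 9999);
}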

10.07.2013 - Creating a chair in Blender

Today I felt like … creating a chair in Blender! That’s the virtual result. Now I just need to figure out how to do that for real 😉

Model of a chair

Modelling was completely done in Blender, rendering was done using the cycles renderer.
