Exceptional Java - Design the failure case - Part 1


Good exception handling doesn’t happen by chance. It is designed and planned, and when done properly it is one of the main roads to the software “Holy Grail” - quality and reliability.
But leave it to chance and soon all hell breaks loose.

Exceptions are a fact of life

Any non-trivial application will encounter error conditions. It happens because the program interacts with unreliable humans (who don’t read manuals), unreliable systems (like networks) and systems prone to failure (hard drives, memory chips etc). Applications are spread across a continuum of functional and reliability requirements. For some applications it is fine to log or display a message to the user and then abandon the current task, or even roll over and die. A GUI application for displaying pictures can crash on a corrupted image file without causing more than annoyance to the user. On the other hand, server applications in critical domains like government, military, medicine and finance are not allowed to crash randomly, since a crash can cause major damage in the real world.

Causes of exceptions

Exceptions can be caused by mistakes in the code and by failed interactions with external systems. There is also a gray area where exceptions are used more for flow control than to signal real failures. I will discuss those in another post.

Bugs in the code cause exceptions when preconditions are not met and when unexpected results and states are not taken into account.
As examples of preconditions we can think of the range of valid values for method parameters or, more generally, of the configuration needed by any piece of software, from a single method to a big framework.
Unsatisfied preconditions can be a problem at the API implementation level: when the programmer doesn’t validate the range of values, the API code crashes randomly or produces wrong results.
They can also be a problem at the application level, when the API is poorly documented and the caller passes wrong parameters to it.
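To make the API-level case concrete, here is a minimal sketch of validating preconditions at the boundary of a method. The class and method names (Discounts, applyDiscount) and the valid ranges are hypothetical, invented only for this illustration.

```java
// Hypothetical API class, invented only to illustrate precondition checks.
final class Discounts
{
    private Discounts()
    {
    }

    /**
     * Applies a percentage discount to a price expressed in cents.
     * Preconditions: priceInCents >= 0 and 0 <= percent <= 100.
     */
    static long applyDiscount(long priceInCents, int percent)
    {
        // Reject bad input at the boundary instead of computing garbage later.
        if (priceInCents < 0)
        {
            throw new IllegalArgumentException(
                "priceInCents must be >= 0, got " + priceInCents);
        }
        if (percent < 0 || percent > 100)
        {
            throw new IllegalArgumentException(
                "percent must be in [0, 100], got " + percent);
        }
        // Integer division: fractions of a cent are truncated.
        return priceInCents * (100 - percent) / 100;
    }

    public static void main(String[] args)
    {
        System.out.println(applyDiscount(20000, 25));
        try
        {
            applyDiscount(20000, 150);
        }
        catch (IllegalArgumentException e)
        {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

The point of the sketch is only that the failure is reported where the bad value enters, with a message naming the violated precondition, rather than surfacing later as a wrong result.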
Untreated postconditions are also bugs. Just imagine code that looks up a value in a map and then uses it without checking whether it is null. In the same category I also include uncaught, swallowed and untreated exceptions.
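The map scenario can be sketched as follows. The PortLookup class and its contents are made up for illustration; the only real API used is java.util.Map, whose get method returns null when the key is missing.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical lookup class; the names and data are made up for illustration.
final class PortLookup
{
    private static final Map<String, Integer> PORTS = new HashMap<>();

    static
    {
        PORTS.put("http", 80);
        PORTS.put("https", 443);
    }

    private PortLookup()
    {
    }

    // Untreated postcondition: a missing key yields null, and unboxing
    // null to int throws NullPointerException far from the real cause.
    static int portUnchecked(String scheme)
    {
        return PORTS.get(scheme);
    }

    // Treated postcondition: the "not found" result is handled explicitly.
    static int portChecked(String scheme)
    {
        Integer port = PORTS.get(scheme);
        if (port == null)
        {
            throw new IllegalArgumentException("Unknown scheme: " + scheme);
        }
        return port;
    }

    public static void main(String[] args)
    {
        System.out.println(portChecked("https"));
        try
        {
            portUnchecked("ftp");
        }
        catch (NullPointerException e)
        {
            System.out.println("Unchecked lookup failed with " + e.getClass().getName());
        }
        try
        {
            portChecked("ftp");
        }
        catch (IllegalArgumentException e)
        {
            System.out.println("Checked lookup failed with: " + e.getMessage());
        }
    }
}
```

Both variants fail for an unknown key, but only the checked one fails with a message that says what actually went wrong.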

The other main category of causes is interaction with external systems. Examples are: the user (user input), database servers, network communication, file systems etc. Since these systems are external and not under the programmer’s control at development time, all possible failure scenarios should be discovered, analyzed and treated.

The most severe form of failure happens when the system running the software has physical problems - system failures. In this case there is not much a program can do, and it is actually preferable to recognize the condition (if possible) and exit in order to avoid data corruption. If reliability is important then redundant systems with continuous external monitoring should be implemented.

The importance of treating exceptions

If exceptions are a fact of life, then how should we prepare for them? I see a lot of reasons for treating any exception as soon as possible:

  • When some action fails we might want to retry it a number of times based on a configuration. This is a valid approach for any kind of network communication for example. It is not fine to get a communication error and throw our hands in the air and give up without retrying.
  • A possible course of action when exceptions are encountered is temporary graceful service degradation. Some failure conditions can be temporary and the program can still function and provide useful services even if the performance is worse or the functionality is limited.
  • If the error persists after retry, then we want to enter recovery mode. There are at least two important reasons to think specifically about recovery. One is obviously the attempt to finish the action successfully by resorting to backup solutions. Maybe we can connect to another file server, database, web service or service provider.
    The second reason might be even more important - I call it “state stabilization”. When failure occurs your application might not be in a well defined state and it has to be brought back to a known stable state. Bringing an application from a failure state to a functional stable state can be simple or complex depending on what the application does, but it can involve closing resources, cleaning up data, notifying client threads and so on. Failure to think about and implement state stabilization can lead to a program that continues to (mal)function after failure and produces garbage instead of the expected results.
  • An important reason for treating exceptions as close as possible to the source is to avoid “error context expansion”. Every time we defer treating an exception and delegate to the next caller on the stack, we create a bigger context for the handling code. Since the handling code needs to know as much as possible about the exception context, this can lead to breaking encapsulation. The problem becomes even more acute when multithreaded code is involved.
  • And of course the most important reason is cleanup. Resources have to be closed, disposed of, stabilized.
  • The above reasons come from reliability requirements. There are also maintenance and repair requirements. A lot of times exceptions occur because of bugs in the code. In this situation it is really important to capture and log/record the failure with as much as possible of the original failure context.
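
The retry and backup ideas from the list above can be sketched roughly like this. The Retry class, its method name and its policy (a fixed number of attempts, no delay between them, one fallback) are assumptions made for illustration; a real implementation would likely take the attempt count from configuration and wait between attempts.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper; the names and the retry policy are illustrative assumptions.
final class Retry
{
    private Retry()
    {
    }

    /**
     * Runs the primary action up to maxAttempts times. If every attempt
     * fails, runs the fallback once. If the fallback fails too, the last
     * primary failure is attached to its exception as a suppressed
     * exception, so the original failure context is not lost.
     */
    static <T> T callWithRetry(Callable<T> primary, Callable<T> fallback,
        int maxAttempts) throws Exception
    {
        if (maxAttempts < 1)
        {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                return primary.call();
            }
            catch (Exception e)
            {
                last = e;
                System.out.println("Attempt " + attempt + " failed: " + e.getMessage());
            }
        }
        try
        {
            return fallback.call();
        }
        catch (Exception e)
        {
            if (e != last)
            {
                e.addSuppressed(last);
            }
            throw e;
        }
    }

    public static void main(String[] args) throws Exception
    {
        final AtomicInteger calls = new AtomicInteger();
        // Simulated flaky service: fails twice, then succeeds.
        String result = callWithRetry(
            () -> {
                if (calls.incrementAndGet() < 3)
                {
                    throw new IOException("timeout");
                }
                return "primary data";
            },
            () -> "backup data",
            3);
        System.out.println(result);
    }
}
```

Note that the helper deliberately knows nothing about what the actions do, so any state stabilization still has to happen inside the actions themselves, close to the source of the failure.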

Dangers and traps - 1

Let’s look at a simple example. In the LostThread class two threads are created and started. The two threads synchronize their execution using a common object as a monitor.

package com.littletutorials.ex;

public class LostThread
{
    public static void main(String[] args)
    {
        final Object m = new Object();

        Runnable r1 = new Runnable()
        {
            @Override
            public void run()
            {
                String tn = Thread.currentThread().getName();

                synchronized (m)
                {
                    try
                    {
                        System.out.println(tn + " is waiting…");
                        m.wait();
                        System.out.println(tn + " finished waiting!");
                    }
                    catch (InterruptedException e)
                    {
                        e.printStackTrace();
                    }
                }
            }
        };

        Runnable r2 = new Runnable()
        {
            @Override
            public void run()
            {
                String tn = Thread.currentThread().getName();

                if (false)
                {
                    throw new RuntimeException("Kill the thread " + tn);
                }

                synchronized (m)
                {
                    System.out.println(tn + " is notifying…");
                    m.notify();
                    System.out.println(tn + " finished notifying!");
                }
            }
        };

        Thread t1 = new Thread(r1, "W-Thread");
        t1.start();

        try
        {
            Thread.sleep(1000);
        }
        catch (InterruptedException e)
        {
            e.printStackTrace();
        }

        Thread t2 = new Thread(r2, "N-Thread");
        t2.start();
    }
}

Playing with this example can reveal some of the dangers one can run into when exception handling is not well designed. Running this code as is will show the expected behavior. If we change the if (false) condition in r2’s run method to true, we can see how a naive implementation runs into trouble. With no handling for the runtime exception, the N-Thread dies before it calls notify, and the W-Thread is lost, waiting forever. Of course there are simple solutions in this simple case. Let’s look at two of them. First we can do the cleanup in a finally block. In r2’s run method, replace the if block and the synchronized block that follows it with:

try
{
    if (true)
    {
        throw new RuntimeException("Kill the thread " +
            Thread.currentThread().getName());
    }
}
finally
{
    synchronized (m)
    {
        System.out.println(tn + " is notifying...");
        m.notify();
        System.out.println(tn + " finished notifying!");
    }
}

This solves the initial problem. But what if, in order to make our cleanup decisions, we need to know if we are running the finally block in the success situation or in the failure situation? Just a try/finally block is no longer enough and we need to actually catch the unchecked exception and mark the failure scenario so the finally block can do the right thing.
This is actually a problem in Java. Maybe a better implementation would have been to also have a Throwable parameter for the finally block with a null value in the success case.
The problem becomes even bigger when mainly runtime exceptions are used, a trend in many modern frameworks. When doing cleanup after an operation the idea is to clean up for the failure case in the specific catch block and do common cleanup (for both failure and success cases) in the finally block. But with unchecked exceptions we lose information. While executing the finally block are we in the success case, the failure case… and which failure case?
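
A minimal sketch of the catch-and-mark workaround described above: the catch block records the failure and rethrows it, so the finally block can tell the two cases apart. The class name, the log and the commit/rollback wording are hypothetical, chosen only to make the two paths visible.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example; the class, the log and the "work" are invented for illustration.
final class OutcomeAwareCleanup
{
    // Records the steps taken so the demo can show which path ran.
    static final List<String> LOG = new ArrayList<>();

    private OutcomeAwareCleanup()
    {
    }

    static void runTask(Runnable work)
    {
        Throwable failure = null;
        try
        {
            work.run();
        }
        catch (RuntimeException | Error e)
        {
            failure = e;   // mark the failure scenario for the finally block
            throw e;       // rethrow: the caller still sees the original exception
        }
        finally
        {
            if (failure == null)
            {
                LOG.add("commit");
            }
            else
            {
                LOG.add("rollback after " + failure.getClass().getSimpleName());
            }
            LOG.add("release resources");   // common cleanup for both cases
        }
    }

    public static void main(String[] args)
    {
        runTask(() -> { });
        try
        {
            runTask(() -> { throw new IllegalStateException("boom"); });
        }
        catch (IllegalStateException e)
        {
            System.out.println("Caller saw: " + e.getMessage());
        }
        for (String step : LOG)
        {
            System.out.println(step);
        }
    }
}
```

The local failure variable plays the role of the Throwable parameter that finally does not have: it is null in the success case and holds the exception otherwise.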

Another possible solution for the code above is to provide an UncaughtExceptionHandler for the thread. To do this, insert this code right after the line that creates t2 and before the call to t2.start():

t2.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler()
{
    @Override
    public void uncaughtException(Thread t, Throwable e)
    {
        String tn = t.getName();
        System.out.println(tn + " is cleaning up after exception: " +
            e.getClass().getName());
        e.printStackTrace(System.out);
        synchronized (m)
        {
            System.out.println(tn + " is notifying...");
            m.notify();
            System.out.println(tn + " finished notifying!");
        }
    }
});

There are problems with this solution too. First, this is a last resort solution - the thread is dead. Second, we can see how context expansion works: now a high level piece of code has to know details about the algorithm that was running on the thread.

For this kind of situation I think the advantage provided by checked exceptions is very important. They force you to take action and think about the failure case. Of course programmers can just swallow the exception with an empty catch block, but they might just as well do nothing at all about unchecked exceptions. In the end it doesn’t matter if one uses checked or unchecked exceptions, the problem to solve is the same.
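
As a small illustration of that difference, here is a sketch with two hypothetical loaders that fail in the same way. Only the checked one makes the compiler remind the caller about the failure case; all names are invented for this example.

```java
import java.io.IOException;

// Hypothetical example; the loaders and their failure modes are invented.
final class CheckedVsUnchecked
{
    private CheckedVsUnchecked()
    {
    }

    // Checked: callers must catch or declare IOException, or the code does not compile.
    static String loadChecked(boolean fail) throws IOException
    {
        if (fail)
        {
            throw new IOException("disk unavailable");
        }
        return "data";
    }

    // Unchecked: nothing in the signature warns the caller about the failure case.
    static String loadUnchecked(boolean fail)
    {
        if (fail)
        {
            throw new IllegalStateException("disk unavailable");
        }
        return "data";
    }

    // The compiler forces a decision here; this caller chooses a default value.
    static String loadOrDefault(boolean fail)
    {
        try
        {
            return loadChecked(fail);
        }
        catch (IOException e)
        {
            return "default";
        }
    }

    public static void main(String[] args)
    {
        System.out.println(loadOrDefault(false));
        System.out.println(loadOrDefault(true));
        // Compiles without any handling and simply propagates the failure.
        System.out.println(loadUnchecked(true));
    }
}
```

Whether returning a default is the right decision is a separate design question; the sketch only shows that with the checked variant the decision cannot be skipped silently.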

Read part 2 of this article…

This post is part of a series on exceptions:

  1. Thoughts on Java exceptions
  2. Some exceptions are more equal than others
  3. Less than perfect exceptions hierarchy
  4. Checked exceptions are priceless… For everything else there is the RuntimeException
  5. Design the failure case - Part 1
  6. Design the failure case - Part 2
  7. Exception design relativity
  8. Bad advice on exceptions from Joel
