2013-08-25

JNI, Strings, Modified UTF-8 ... oh my!

While developing a native component on Android, I ran into an interesting issue during testing. The application passed Java Strings between Dalvik and the native component. The same data was also sent to and received from a remote server, which always returned string data encoded in UTF-8. While testing with some sample data, the application crashed with the following logcat output:

JNI WARNING: input is not valid Modified UTF-8: illegal start byte 0xf0
             string: 'A𝌹'
             in ..... (NewStringUTF)

While the input data received from the remote server was encoded in normal UTF-8, Dalvik (and also the JVM) expects strings passed to JNI functions such as NewStringUTF to be in “Modified UTF-8” form. “The Java Native Interface Programmers Guide and Specification” has the following to say about Modified UTF-8:

There are two differences between this format and the standard UTF-8 format. First, the null byte (byte)0 is encoded using the two-byte format rather than the one-byte format. This means that JNI UTF-8 strings never have embedded nulls. Second, only the one-byte, two-byte, and three-byte formats are used. The JNI does not recognize the longer UTF-8 formats.
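The first of these differences can be observed without leaving Java. The following is a minimal sketch, assuming that DataOutputStream.writeUTF (which is documented to write Modified UTF-8 behind a two-byte length prefix) is an acceptable stand-in for what the JNI does; the class and method names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of the original application.
final class ModifiedUtf8Demo {

    // Encodes s as Modified UTF-8 using DataOutputStream.writeUTF and
    // strips the two-byte length prefix that writeUTF puts in front.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = bytes.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        String s = "A\u0000B";
        // Standard UTF-8 keeps the embedded null as a single zero byte.
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));
        // prints [65, 0, 66]

        // Modified UTF-8 encodes U+0000 as the two bytes 0xC0 0x80.
        System.out.println(Arrays.toString(modifiedUtf8(s)));
        // prints [65, -64, -128, 66]
    }
}
```

Because the null character never shows up as a zero byte, native code can treat the result as an ordinary null-terminated C string.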

What does this mean for the string in the warning above? How is it encoded in normal UTF-8 and Modified UTF-8?

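// U+1D339 lies outside the BMP, so Java stores it as the surrogate pair D834 DF39
String s = "A" + "\uD834" + "\uDF39";
System.out.println(Arrays.toString(s.getBytes("UTF-8"))); // requires java.util.Arrays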

Output:

[65, -16, -99, -116, -71]

Using GetStringUTFChars, the JNI function that converts a Java String object to a Modified UTF-8 encoded byte array:

const char* buffer = env->GetStringUTFChars(s, NULL);
size_t length = strlen(buffer);
// Print the bytes in buffer here, then release it
env->ReleaseStringUTFChars(s, buffer);

Output:

[65,-19,-96,-76,-19,-68,-71]

The result: getBytes() encodes the second character as a single 4-byte sequence, while GetStringUTFChars() encodes it as two 3-byte sequences, one for each UTF-16 surrogate. Passing the byte array from getBytes("UTF-8") to NewStringUTF, for example, results in a crash in the emulator, while the byte array from GetStringUTFChars() is converted correctly to a Java String object.

So how do we handle this issue? One could definitely write a converter to/from Modified UTF-8, but I chose a different way.
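For comparison, here is a rough sketch of what one direction of such a converter could look like. It is written in Java to keep it short (a real one would more likely live in the native code), the class and method names are hypothetical, and it assumes well-formed input: the String constructor silently replaces malformed UTF-8 with U+FFFD.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical converter, for illustration only.
final class ModifiedUtf8Converter {

    // Re-encodes standard UTF-8 bytes as Modified UTF-8 by going through
    // UTF-16: every char, including each half of a surrogate pair, is
    // encoded on its own using at most three bytes.
    static byte[] toModifiedUtf8(byte[] standardUtf8) {
        String s = new String(standardUtf8, StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                // One-byte form; U+0000 is deliberately excluded.
                out.write(c);
            } else if (c <= 0x07FF) {
                // Two-byte form, also used for U+0000 (yields 0xC0 0x80).
                out.write(0xC0 | (c >> 6));
                out.write(0x80 | (c & 0x3F));
            } else {
                // Three-byte form, including lone surrogate code units.
                out.write(0xE0 | (c >> 12));
                out.write(0x80 | ((c >> 6) & 0x3F));
                out.write(0x80 | (c & 0x3F));
            }
        }
        return out.toByteArray();
    }
}
```

Fed the five bytes that getBytes("UTF-8") produced for the sample string, this yields the same seven bytes that GetStringUTFChars() returned.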

I no longer pass Strings through the JNI layer. Instead, I do all String encoding/decoding in the VM and pass the normal UTF-8 encoded byte arrays down to the native component. The Java/Android API already provides the necessary methods for that:

// Create a Java String from a byte[] with string data encoded in normal UTF-8
new String(buffer, "UTF-8");

// Convert Java String to byte[] encoded in normal UTF-8
s.getBytes("UTF-8");

If you hide the native calls behind an interface, you can still use Java Strings as method parameters and return values. Just encode and decode the String objects in the wrapper class.

public interface Component {

    public String doSomething(String data);
}
public class NativeComponent implements Component {

    private native byte[] nativeDoSomething(long nativePtr, byte[] data);

    public String doSomething(String data) {
        return fromUtf8(nativeDoSomething(nativePtr, toUtf8(data)));
    }

    // fromUtf8, toUtf8, nativePtr initialization, ctor, ....

}
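The fromUtf8 and toUtf8 helpers elided above could look roughly like this. This is only a sketch: the Utf8 holder class and the null handling are assumptions, and it relies on java.nio.charset.StandardCharsets being available on the target platform (Java 7 and later); where it is not, Charset.forName("UTF-8") should work the same way.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical holder class for the two helpers.
final class Utf8 {

    private Utf8() {}

    // Java String -> byte[] in standard UTF-8 (4-byte sequences allowed).
    static byte[] toUtf8(String s) {
        return s == null ? null : s.getBytes(StandardCharsets.UTF_8);
    }

    // byte[] in standard UTF-8 -> Java String.
    static String fromUtf8(byte[] bytes) {
        return bytes == null ? null : new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Using the Charset overloads instead of the "UTF-8" name also avoids having to handle the checked UnsupportedEncodingException.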

A sample Android application that demonstrates this issue is available on GitHub.

2013-08-24

Hello World

Well, it’s that time again to start a new blog.