Class Converter

java.lang.Object
org.eclipse.swt.internal.Converter

public final class Converter extends Object
About this class:
#################
This class implements the conversions between Unicode characters and the platform-supported representation for characters.

Note that Unicode characters which cannot be found in the platform encoding will be converted to an arbitrary platform-specific character.

This class is tested via: org.eclipse.swt.tests.gtk.Test_GtkTextEncoding

About JNI & string conversion:
##############################
- Regular JNI String conversion usually uses a modified UTF-8, see: https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
- In JNI, (*env)->GetStringUTFChars(..) is normally used to convert a Java String into a C string. See: http://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/functions.html#GetStringUTFChars

Modified UTF-8 works well with C system functions because it contains no embedded nulls and is null terminated. However, modified UTF-8 only supports sequences of up to 3 bytes (not up to 4 as in regular UTF-8), so characters that require 4 bytes (e.g. emoji) are not translated properly from Java to C.

To work around this issue, we convert the Java string to a byte array on the Java side manually and then pass it to C. See: http://stackoverflow.com/questions/32205446/getting-true-utf-8-characters-in-java-jni

Note:
Java uses UTF-16 wide characters internally to represent a string.
C uses UTF-8 multibyte characters (null terminated) to represent a string.

About encoding on Linux/Gtk & its relevance to SWT:
####################################################
UTF-8/16 = variable-length encodings.
UTF-8 = minimum is 8 bits. The original design allowed up to 6 bytes, but the current standard (RFC 3629) caps it at 4 bytes. Gtk & most of the web use this.
UTF-16 = minimum is 16 bits. Java's strings are stored this way.
UTF-16 can be:
  Big Endian    : 65 = 00000000 01000001   # Human friendly, reads left to right.
  Little Endian : 65 = 01000001 00000000   # Intel x86 and AMD64 / x86-64 processors use little-endian [1],
                                           # i.e. in SWT we often have to deal with UTF-16 LE.

Some terminology:
- "Code point" is the numerical value of a Unicode character.
- All of UTF-* have the same letter to code-point mapping, but UTF-8/16/32 have different "back-ends". Illustration:
  (char) = (code point) = (back end)
  A = 65 = 01000001            UTF-8
         = 00000000 01000001   UTF-16 BE
         = 01000001 00000000   UTF-16 LE
- Byte Order Marks (BOM) are a few bytes at the start of a *file* indicating which endianness is used. Problem: Gtk/webkit often don't give us BOMs. (further reading: [3])
- We can reliably encode a character to a back-end (A -> UTF-8/16), but the other way round is guesswork, since byte order marks are often missing and UTF-16 bits are technically valid UTF-8 (see Converter.byteToStringViaHeuristic for details). We could improve our heuristic by using something like http://jchardet.sourceforge.net/.
- Glib has some conversion functions:
  g_utf16_to_utf8
  g_utf8_to_utf16
- So does Java: (e.g. null-terminated UTF-8)
  ("myString" + '\0').getBytes(StandardCharsets.UTF_8)
- I suggest using Java functions where possible to avoid memory leaks. (Yes, they happen and are a big pain to find: https://bugs.eclipse.org/bugs/show_bug.cgi?id=533995)

Learning about encoding:
########################
I suggest the following videos to understand ASCII/UTF-8/UTF-16[LE|BE]/UTF-32 encoding:
Overview: https://www.youtube.com/watch?v=MijmeoH9LT4
Details:
  Part-1: https://www.youtube.com/watch?v=B1Sf1IhA0j4
  Part-2: https://www.youtube.com/watch?v=-oYfv794R9s
  Part-3: https://www.youtube.com/watch?v=vLBtrd9Ar28

Also read all of this: http://kunststube.net/encoding/
and this: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

And lastly, a good UTF-8 reference: https://en.wikipedia.org/wiki/UTF-8#Description

You should now be a master of encoding. I wish you luck on your journey.

[1] https://en.wikipedia.org/wiki/Endianness
[2] https://en.wikipedia.org/wiki/Byte_order_mark
[3] BOMs: http://unicode.org/faq/utf_bom.html#BOM
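The byte patterns and the 3-byte vs 4-byte issue described above can be checked with plain JDK code. The following is a minimal sketch, not part of SWT: the class name EncodingDemo and its helper methods are hypothetical. It assumes that java.io.DataOutputStream.writeUTF, which is documented to emit modified UTF-8, is a fair stand-in for what JNI string conversion produces.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingDemo {

    /** Render bytes as space-separated two-digit hex, e.g. "00 41". */
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encode with modified UTF-8 via writeUTF, dropping writeUTF's 2-byte length prefix. */
    public static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataOutputStream data = new DataOutputStream(out)) {
            data.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] withLength = out.toByteArray();
        return Arrays.copyOfRange(withLength, 2, withLength.length);
    }

    /** Null-terminated regular UTF-8, as suggested in the class description. */
    public static byte[] nullTerminatedUtf8(String s) {
        return (s + '\0').getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_8)));    // 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16LE))); // 41 00

        // U+1F600 lies outside the Basic Multilingual Plane, so it needs 4 bytes in regular UTF-8.
        String emoji = new String(Character.toChars(0x1F600));
        System.out.println(hex(emoji.getBytes(StandardCharsets.UTF_8)));  // F0 9F 98 80
        // Modified UTF-8 encodes each UTF-16 surrogate separately as 3 bytes.
        System.out.println(hex(modifiedUtf8(emoji)));                     // ED A0 BD ED B8 80
        System.out.println(hex(nullTerminatedUtf8("A")));                 // 41 00
    }
}
```

A C library expecting regular UTF-8 would reject or mangle the six-byte surrogate form, which is why the byte array is built on the Java side.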
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final byte[]
    EmptyByteArray
     
    static final char[]
    EmptyCharArray
     
    static final byte[]
    NullByteArray
     
  • Constructor Summary

    Constructors
    Constructor
    Description
    Converter()
     
  • Method Summary

    Modifier and Type
    Method
    Description
    static String
    byteToStringViaHeuristic(byte[] bytes)
    Given a byte array with unknown encoding, try to decode it via a (relatively simple) heuristic.
    static String
    cCharPtrToJavaString(long cCharPtr, boolean freecCharPtr)
    This method takes a 'C' pointer (char *) or (gchar *), reads characters up to the terminating symbol '\0' and converts it into a Java String.
    static byte[]
    javaStringToCString(String string)
    Given a Java String, convert it to a regular null-terminated C string, to be used when calling a native C function.
    static char[]
    mbcsToWcs(byte[] buffer)
    Convert a "C" multibyte UTF-8 string byte array into a Java UTF-16 Wide character array.
    static char
    mbcsToWcs(char ch)
    Convert C UTF-8 Multibyte character into a Java UTF-16 Wide character.
    static char
    wcsToMbcs(char ch)
    Convert a Java UTF-16 Wide character into a single C UTF-8 Multibyte character that you can pass to a native function.
    static byte[]
    wcsToMbcs(char[] chars, boolean terminate)
    Convert a Java UTF-16 Wide character array into a C UTF-8 Multibyte byte array.
    static byte[]
    wcsToMbcs(String string, boolean terminate)
    Convert a Java UTF-16 Wide character string into a C UTF-8 Multibyte byte array.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • NullByteArray

      public static final byte[] NullByteArray
    • EmptyByteArray

      public static final byte[] EmptyByteArray
    • EmptyCharArray

      public static final char[] EmptyCharArray
  • Constructor Details

    • Converter

      public Converter()
  • Method Details

    • mbcsToWcs

      public static char[] mbcsToWcs(byte[] buffer)
      Convert a "C" multibyte UTF-8 string byte array into a Java UTF-16 Wide character array.
      Parameters:
      buffer - byte buffer with C bytes representing a string.
      Returns:
      char array representing the string. Usually used for String construction like: new String(mbcsToWcs(..))
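For readers without the SWT internals at hand, the conversion described here can be approximated with the JDK alone. This is a sketch under assumptions, not SWT's implementation: the class MbcsToWcsSketch and its method decodeUtf8 are hypothetical names, and ignoring everything from the first null byte onward is an assumption about how a C string buffer would be treated.

```java
import java.nio.charset.StandardCharsets;

public class MbcsToWcsSketch {

    /**
     * Decode UTF-8 bytes into UTF-16 chars.
     * Assumption: bytes from the first '\0' onward are ignored, as for a C string.
     */
    public static char[] decodeUtf8(byte[] buffer) {
        int length = 0;
        while (length < buffer.length && buffer[length] != 0) {
            length++;
        }
        return new String(buffer, 0, length, StandardCharsets.UTF_8).toCharArray();
    }

    public static void main(String[] args) {
        byte[] cString = {'h', 'i', 0};
        System.out.println(new String(decodeUtf8(cString))); // hi
    }
}
```

A 4-byte UTF-8 sequence decodes to two chars (a surrogate pair), so the returned array can be shorter or longer than a naive byte count suggests.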
    • wcsToMbcs

      public static byte[] wcsToMbcs(String string, boolean terminate)
      Convert a Java UTF-16 Wide character string into a C UTF-8 Multibyte byte array. This algorithm stops when it finds the first NULL character. I.e., if your Java String has embedded NULL characters, then the returned byte array will only go up to the first NULL character.
      Parameters:
      string - a regular Java String
      terminate - if true, the byte buffer is terminated with a null character.
      Returns:
      byte array that can be passed to a native function.
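The documented behaviour (stop at the first NULL character, optionally append a terminator) can be illustrated with standard Java. This is a minimal sketch under the assumption that the output is regular UTF-8; the class WcsToMbcsSketch and the method encodeUtf8 are hypothetical and are not the SWT implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class WcsToMbcsSketch {

    /** Encode the part of the string before the first '\0' as UTF-8, optionally null terminated. */
    public static byte[] encodeUtf8(String string, boolean terminate) {
        int end = string.indexOf('\0');
        String head = end >= 0 ? string.substring(0, end) : string;
        byte[] encoded = head.getBytes(StandardCharsets.UTF_8);
        // Arrays.copyOf pads the extra slot with a zero byte, which serves as the terminator.
        return terminate ? Arrays.copyOf(encoded, encoded.length + 1) : encoded;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encodeUtf8("ab\0cd", true))); // [97, 98, 0]
        System.out.println(Arrays.toString(encodeUtf8("ab", false)));    // [97, 98]
    }
}
```

Truncating at the embedded NULL on the Java side keeps the byte array consistent with what a C function would see anyway, since C treats the first '\0' as the end of the string.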
    • javaStringToCString

      public static byte[] javaStringToCString(String string)
      Given a Java String, convert it to a regular null-terminated C string, to be used when calling a native C function.
      Parameters:
      string - a Java String.
      Returns:
      a null-terminated byte array holding the C string. On the C side, this is received as a 'char *'.
    • cCharPtrToJavaString

      public static String cCharPtrToJavaString(long cCharPtr, boolean freecCharPtr)
      This method takes a 'C' pointer (char *) or (gchar *), reads characters up to the terminating symbol '\0' and converts it into a Java String. Note: In SWT we don't use JNI's native String functions because of the 3 vs 4 byte issue explained in Class description. Instead we pass a character pointer from C to java and convert it to a String in Java manually.
      Parameters:
      cCharPtr - a char * or a gchar *. It is freed afterwards only if freecCharPtr is true.
      freecCharPtr - "true" means free the memory pointed to by cCharPtr. CAREFUL! If this string is part of a struct (e.g. GError) and a specialized free function (like g_error_free(..)) is called on the whole struct, then you should not free individual struct members with this function, as otherwise you can get unpredictable behavior.
      Returns:
      a Java String object.
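The "read up to '\0', then decode in Java" step can be sketched without native code by letting a java.nio.ByteBuffer stand in for the memory behind the pointer. Everything here is an assumption for illustration: the class CStringReadSketch and the method readCString are hypothetical, the real method takes a native address rather than a buffer, and freeing the native memory is not modelled.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class CStringReadSketch {

    /**
     * Read bytes from the buffer's current position up to (not including) the first '\0'
     * and decode them as regular UTF-8. Unlike real native memory, the buffer has a known
     * end, so the scan also stops there if no terminator is found.
     */
    public static String readCString(ByteBuffer memory) {
        ByteBuffer view = memory.slice(); // independent view starting at the current position
        int length = 0;
        while (length < view.limit() && view.get(length) != 0) {
            length++;
        }
        byte[] bytes = new byte[length];
        view.get(bytes); // bulk read of exactly 'length' bytes from the start of the view
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteBuffer memory = ByteBuffer.wrap(new byte[] {'o', 'k', 0, 'x'});
        System.out.println(readCString(memory)); // ok
    }
}
```

Decoding with StandardCharsets.UTF_8 handles 4-byte sequences correctly, which is the reason for doing the conversion in Java rather than through JNI's modified UTF-8 functions.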
    • wcsToMbcs

      public static byte[] wcsToMbcs(char[] chars, boolean terminate)
      Convert a Java UTF-16 Wide character array into a C UTF-8 Multibyte byte array. This algorithm stops when it finds the first NULL character. I.e., if your char array has embedded NULL characters, then the returned byte array will only go up to the first NULL character.
      Parameters:
      chars - a Java UTF-16 char array
      terminate - if true, the byte buffer is terminated with a null character.
      Returns:
      byte array that can be passed to a native function.
    • wcsToMbcs

      public static char wcsToMbcs(char ch)
      Convert a Java UTF-16 Wide character into a single C UTF-8 Multibyte character that you can pass to a native function.
      Parameters:
      ch - Java UTF-16 wide character.
      Returns:
      C UTF-8 Multibyte character.
    • mbcsToWcs

      public static char mbcsToWcs(char ch)
      Convert C UTF-8 Multibyte character into a Java UTF-16 Wide character.
      Parameters:
      ch - C Multibyte UTF-8 character
      Returns:
      Java UTF-16 Wide character
    • byteToStringViaHeuristic

      public static String byteToStringViaHeuristic(byte[] bytes)
      Given a byte array with unknown encoding, try to decode it via a (relatively simple) heuristic. This is useful when we're not provided the encoding by the OS/library.
      The current implementation only supports standard Java charsets but can be extended as needed. This method could be improved by using http://jchardet.sourceforge.net/
      Run time is O(a * n), where a is a factor (roughly 1-20) that varies with the size of the input n.
      Parameters:
      bytes - raw bytes from the OS.
      Returns:
      String based on the most pop