public class PDFStringUtil
extends java.lang.Object
Utility methods for dealing with PDF Strings, such as:
converting to text strings
converting to PDFDocEncoded strings
converting to UTF-16BE strings
converting basic strings between byte and string representations
We refer to basic strings as those corresponding to the PDF 'string' type.
PDFRenderer represents these as Strings, though this is somewhat
deceptive, as they are effectively just sequences of bytes, although byte
values <= 127 do correspond to the ASCII character set. Beyond this,
the 'string' type, as represented by basic strings, does not possess any
character set or encoding, and byte values >= 128 are entirely acceptable.
For a basic string represented as a String, each character has a value
less than 256 and appears in the String as if the underlying bytes were in
ISO-8859-1 encoding. This, however, is merely for convenience. For
strings that are user visible, and that don't merely represent some
identifying token, the PDF standard employs a 'text string' type, which
treats the basic string as an encoding in either UTF-16BE (with a byte order
mark) or a specific 8-bit encoding, PDFDocEncoding. Using a basic string
without conversion when the actual type is a 'text string' is erroneous
(though without consequence if the string consists only of ASCII
alphanumeric values). Care must be taken to convert basic strings to
text strings (also expressed as a String) when appropriate, using either the
methods in this class or PDFObject.getTextStringValue(). For
strings that are 'byte strings', asBytes(String) or PDFObject.getStream()
should be used.
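The byte-per-character convention described above can be illustrated with the JDK alone. The following is a minimal sketch, not the PDFRenderer implementation: the class BasicStringSketch and its two helpers are hypothetical names, and it assumes only that a basic string maps each byte to the character with the same unsigned value, which is what ISO-8859-1 decoding does.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class; not part of PDFRenderer.
final class BasicStringSketch {

    // One char per byte; each char value equals the unsigned byte value.
    static String toBasicString(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Inverse mapping; only lossless while every char is less than 256.
    static byte[] toBytes(String basicString) {
        return basicString.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] raw = { 0x50, 0x44, 0x46, (byte) 0xFE, (byte) 0xFF };
        String basic = toBasicString(raw);
        System.out.println(basic.length());                      // 5
        System.out.println((int) basic.charAt(3));               // 254
        System.out.println(Arrays.equals(raw, toBytes(basic)));  // true
    }
}
```

The round trip is exact for any byte sequence, which is why the String form is a convenience rather than a statement about the text the bytes encode.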
Constructor and Description |
---|
PDFStringUtil() |
Modifier and Type | Method and Description |
---|---|
static java.lang.String | asBasicString(byte[] bytes): Create a basic string from bytes. |
static java.lang.String | asBasicString(byte[] bytes, int offset, int length): Create a basic string from bytes. |
static byte[] | asBytes(java.lang.String basicString): Get the corresponding byte array for a basic string. |
static java.lang.String | asPDFDocEncoded(java.lang.String basicString): Take a basic PDF string and produce a string of its bytes as encoded in PDFDocEncoding. |
static java.lang.String | asTextString(java.lang.String basicString): Take a basic PDF string and determine if it is in UTF-16BE encoding by looking at the lead characters for a byte order marking (BOM). |
static java.lang.String | asUTF16BEEncoded(java.lang.String basicString): Take a basic PDF string and produce a string from its bytes as a UTF-16BE encoding. |
byte[] | toPDFDocEncoded(java.lang.String string) |
public static java.lang.String asTextString(java.lang.String basicString)
Take a basic PDF string and determine if it is in UTF-16BE encoding by looking at the lead characters for a byte order marking (BOM). If it appears to be UTF-16BE, we return the string representation of the UTF-16BE encoding of those bytes. If the BOM is not present, the bytes from the input string are decoded using the PDFDocEncoding charset.
From the PDF Reference 1.7, p158:
The text string type is used for character strings that are encoded in either PDFDocEncoding or the UTF-16BE Unicode character encoding scheme. PDFDocEncoding can encode all of the ISO Latin 1 character set and is documented in Appendix D. UTF-16BE can encode all Unicode characters. UTF-16BE and Unicode character encoding are described in the Unicode Standard by the Unicode Consortium (see the Bibliography). Note that PDFDocEncoding does not support all Unicode characters whereas UTF-16BE does.
basicString
- the basic PDF string, as offered by PDFObject.getStringValue()
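The BOM check described for asTextString can be sketched with standard JDK calls. This is an illustration under stated assumptions, not the library's code: TextStringSketch and decodeTextString are hypothetical names, and because the JDK ships no PDFDocEncoding charset, the non-BOM branch returns the input unchanged as a stand-in. That stand-in agrees with PDFDocEncoding for printable ASCII but not for every byte value.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of PDFRenderer.
final class TextStringSketch {

    static String decodeTextString(String basicString) {
        // A UTF-16BE text string starts with the two BOM bytes 0xFE 0xFF.
        if (basicString.length() >= 2
                && basicString.charAt(0) == 0xFE
                && basicString.charAt(1) == 0xFF) {
            // Recover the raw bytes, skip the BOM, decode the rest as UTF-16BE.
            byte[] bytes = basicString.getBytes(StandardCharsets.ISO_8859_1);
            return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
        }
        // ASSUMPTION: stand-in for a real PDFDocEncoding decoder, which the
        // JDK does not provide. Correct for printable ASCII only.
        return basicString;
    }

    public static void main(String[] args) {
        // Bytes FE FF 00 48 00 69: a BOM followed by "Hi" in UTF-16BE.
        String withBom = "\u00FE\u00FF\u0000H\u0000i";
        System.out.println(decodeTextString(withBom));  // Hi
        System.out.println(decodeTextString("Title"));  // Title
    }
}
```

A real implementation would replace the fallback branch with a table-driven PDFDocEncoding decoder built from Appendix D of the PDF Reference.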
public static java.lang.String asPDFDocEncoded(java.lang.String basicString)
basicString
- the basic PDF string, as offered by PDFObject.getStringValue()
public byte[] toPDFDocEncoded(java.lang.String string) throws java.nio.charset.CharacterCodingException
java.nio.charset.CharacterCodingException
public static java.lang.String asUTF16BEEncoded(java.lang.String basicString)
basicString
- the basic PDF string, as offered by PDFObject.getStringValue()
public static byte[] asBytes(java.lang.String basicString)
basicString
- the basic PDF string, as offered by PDFObject.getStringValue()
public static java.lang.String asBasicString(byte[] bytes, int offset, int length)
bytes
- the source of the bytes for the basic string
offset
- the offset into bytes where the string starts
length
- the number of bytes to turn into a string
public static java.lang.String asBasicString(byte[] bytes)
bytes
- the bytes, all of which are used
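The offset and length parameters behave like those of the JDK's own String constructors. The sketch below shows the likely semantics using only standard calls. BasicStringSliceSketch, slice and whole are hypothetical names, and the assumption is that the three-argument form reads length bytes starting at offset while the one-argument form uses the entire array.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of PDFRenderer.
final class BasicStringSliceSketch {

    // Turn bytes[offset .. offset + length - 1] into a basic string.
    static String slice(byte[] bytes, int offset, int length) {
        return new String(bytes, offset, length, StandardCharsets.ISO_8859_1);
    }

    // Convenience form: all of the bytes are used.
    static String whole(byte[] bytes) {
        return slice(bytes, 0, bytes.length);
    }

    public static void main(String[] args) {
        byte[] buffer = "(Hello) Tj".getBytes(StandardCharsets.US_ASCII);
        System.out.println(slice(buffer, 1, 5));  // Hello
        System.out.println(whole(buffer));        // (Hello) Tj
    }
}
```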