Module Mail::Multibyte::Unicode
In: lib/mail/multibyte/unicode.rb

Methods

Classes and Modules

Class Mail::Multibyte::Unicode::Codepoint
Class Mail::Multibyte::Unicode::UnicodeDatabase

Constants

NORMALIZATION_FORMS = [:c, :kc, :d, :kd]   A list of all available normalization forms. See www.unicode.org/reports/tr15/tr15-29.html for more information about normalization.
UNICODE_VERSION = '5.2.0'   The Unicode version that is supported by the implementation
HANGUL_SBASE = 0xAC00   Hangul character boundaries and properties
HANGUL_LBASE = 0x1100
HANGUL_VBASE = 0x1161
HANGUL_TBASE = 0x11A7
HANGUL_LCOUNT = 19
HANGUL_VCOUNT = 21
HANGUL_TCOUNT = 28
HANGUL_NCOUNT = HANGUL_VCOUNT * HANGUL_TCOUNT
HANGUL_SCOUNT = 11172
HANGUL_SLAST = HANGUL_SBASE + HANGUL_SCOUNT
HANGUL_JAMO_FIRST = 0x1100
HANGUL_JAMO_LAST = 0x11FF
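
The Hangul constants above are the ones used by the standard arithmetic decomposition of precomposed Hangul syllables. A minimal sketch of that arithmetic follows; the module name HangulSketch and the method decompose are hypothetical and are not part of this library, and the constants are copied from the list above so that the snippet runs on its own.

```ruby
# Hypothetical namespace; the constants mirror the values listed above.
module HangulSketch
  SBASE  = 0xAC00
  LBASE  = 0x1100
  VBASE  = 0x1161
  TBASE  = 0x11A7
  VCOUNT = 21
  TCOUNT = 28
  NCOUNT = VCOUNT * TCOUNT # 588 syllables per leading consonant
  SCOUNT = 11172

  # Split a precomposed syllable codepoint into its jamo codepoints.
  # Codepoints outside the syllable block are returned unchanged.
  def self.decompose(codepoint)
    index = codepoint - SBASE
    return [codepoint] unless (0...SCOUNT).cover?(index)

    l = LBASE + index / NCOUNT
    v = VBASE + (index % NCOUNT) / TCOUNT
    t = TBASE + index % TCOUNT
    # A trailing index of zero means the syllable has no final consonant.
    t == TBASE ? [l, v] : [l, v, t]
  end
end

HangulSketch.decompose(0xD55C) # => [0x1112, 0x1161, 0x11AB]
```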
WHITESPACE = [
    (0x0009..0x000D).to_a, # White_Space # Cc [5] <control-0009>..<control-000D>
    0x0020,                # White_Space # Zs SPACE
    0x0085,                # White_Space # Cc <control-0085>
    0x00A0,                # White_Space # Zs NO-BREAK SPACE
    0x1680,                # White_Space # Zs OGHAM SPACE MARK
    0x180E,                # White_Space # Zs MONGOLIAN VOWEL SEPARATOR
    (0x2000..0x200A).to_a, # White_Space # Zs [11] EN QUAD..HAIR SPACE
    0x2028,                # White_Space # Zl LINE SEPARATOR
    0x2029,                # White_Space # Zp PARAGRAPH SEPARATOR
    0x202F,                # White_Space # Zs NARROW NO-BREAK SPACE
    0x205F,                # White_Space # Zs MEDIUM MATHEMATICAL SPACE
    0x3000,                # White_Space # Zs IDEOGRAPHIC SPACE
  ].flatten.freeze   All the Unicode whitespace codepoints.
LEADERS_AND_TRAILERS = WHITESPACE + [65279]   The BOM (byte order mark, U+FEFF) can also be seen as whitespace: it's a non-rendering character used to distinguish between little and big endian. This is not an issue in UTF-8, so it must be ignored.
TRAILERS_PAT = /(#{codepoints_to_pattern(LEADERS_AND_TRAILERS)})+\Z/u
LEADERS_PAT = /\A(#{codepoints_to_pattern(LEADERS_AND_TRAILERS)})+/u
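
As an illustration of how patterns like these can strip leading and trailing Unicode whitespace, here is a minimal sketch. The helper below is only a stand-in for the library's codepoints_to_pattern, whose body is not shown on this page; SAMPLE_CODEPOINTS is a hypothetical subset of LEADERS_AND_TRAILERS, and strip_unicode_whitespace is a hypothetical method name.

```ruby
# Stand-in for the library helper (assumption): build a regex alternation
# from a list of codepoints.
def codepoints_to_pattern(codepoints)
  codepoints.map { |cp| Regexp.escape([cp].pack('U')) }.join('|')
end

# Hypothetical subset: SPACE, NO-BREAK SPACE, IDEOGRAPHIC SPACE, BOM.
SAMPLE_CODEPOINTS = [0x0020, 0x00A0, 0x3000, 0xFEFF].freeze

SAMPLE_LEADERS_PAT  = /\A(#{codepoints_to_pattern(SAMPLE_CODEPOINTS)})+/u
SAMPLE_TRAILERS_PAT = /(#{codepoints_to_pattern(SAMPLE_CODEPOINTS)})+\Z/u

# Remove leading and trailing characters matched by the sample patterns.
def strip_unicode_whitespace(string)
  string.sub(SAMPLE_LEADERS_PAT, '').sub(SAMPLE_TRAILERS_PAT, '')
end

strip_unicode_whitespace("\uFEFF\u00A0 hello world \u3000") # => "hello world"
```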

Attributes

default_normalization_form  [RW]  The default normalization used for operations that require normalization. It can be set to any of the normalizations in NORMALIZATION_FORMS.

Example:

  Mail::Multibyte::Unicode.default_normalization_form = :c

Public Instance methods

Compose decomposed characters to the composed form.

Decompose composed characters to the decomposed form.

Reverse operation of g_unpack.

Example:

  Unicode.g_pack(Unicode.g_unpack('क्षि')) # => 'क्षि'
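
Because g_pack reverses g_unpack, one plausible way to express it is to flatten the nested codepoint lists and pack them as UTF-8. The following is a sketch under that assumption, not necessarily how the library implements it; g_pack_sketch is a hypothetical name.

```ruby
# Flatten a list of grapheme clusters (each a list of codepoints)
# and pack the codepoints back into a UTF-8 string.
def g_pack_sketch(unpacked)
  unpacked.flatten.pack('U*')
end

g_pack_sketch([[67], [97], [102], [233]]) # => "Café"
```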

Unpack the string at grapheme boundaries. Returns a list of character lists.

Example:

  Unicode.g_unpack('क्षि') # => [[2325, 2381], [2359], [2367]]
  Unicode.g_unpack('Café') # => [[67], [97], [102], [233]]

Detect whether the codepoint is in a certain character class. Returns true when it's in the specified character class and false otherwise. Valid character classes are: :cr, :lf, :l, :v, :lv, :lvt and :t.

Primarily used by the grapheme cluster support.

Returns the KC normalization of the string by default. NFKC is considered the best normalization form for passing strings to databases and validations.

  • string - The string to perform normalization on.
  • form - The form you want to normalize in. Should be one of the following: :c, :kc, :d, or :kd. Default is Mail::Multibyte::Unicode.default_normalization_form
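
To show what the four forms do, here is a minimal sketch that maps this module's form symbols onto Ruby's built-in String#unicode_normalize (available since Ruby 2.2). It is a stand-in for illustration, not this library's implementation; normalize_sketch and FORM_MAP are hypothetical names.

```ruby
# Map the short form names used on this page to the names Ruby expects.
FORM_MAP = { c: :nfc, kc: :nfkc, d: :nfd, kd: :nfkd }.freeze

# Normalize a UTF-8 string; defaults to :kc as described above.
def normalize_sketch(string, form = :kc)
  string.unicode_normalize(FORM_MAP.fetch(form))
end

normalize_sketch("\uFB01")         # => "fi" (compatibility mapping of the fi ligature)
normalize_sketch("e\u0301", :c)    # => "\u00E9" (composed e with acute)
normalize_sketch("\u00E9", :d)     # => "e\u0301" (decomposed)
```

The compatibility forms (:kc, :kd) fold characters such as ligatures into their plain equivalents, while the canonical forms (:c, :d) leave them alone.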

Re-order codepoints so the string becomes canonical.

Replaces all ISO-8859-1 or CP1252 characters by their UTF-8 equivalent resulting in a valid UTF-8 string.

Passing true will forcibly tidy all bytes, assuming that the string's encoding is entirely CP1252 or ISO-8859-1.
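
The idea can be approximated with Ruby's standard encoding API. The sketch below is deliberately simplified and all-or-nothing: it reinterprets the whole string as Windows-1252 when it is not valid UTF-8 (or when force is true), whereas a byte-by-byte repair of mixed strings would need more work. tidy_bytes_sketch is a hypothetical name, and replacing the few bytes that Windows-1252 leaves undefined is an assumption of this sketch.

```ruby
# Return a valid UTF-8 string, treating the input as Windows-1252 when needed.
def tidy_bytes_sketch(string, force = false)
  utf8 = string.dup.force_encoding(Encoding::UTF_8)
  return utf8 if !force && utf8.valid_encoding?

  string.dup
        .force_encoding(Encoding::Windows_1252)
        .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

tidy_bytes_sketch("Caf\xE9".b) # => "Café"
```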

Unpack the string at codepoint boundaries. Raises an EncodingError when the encoding of the string isn't valid UTF-8.

Example:

  Unicode.u_unpack('Café') # => [67, 97, 102, 233]
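
The behaviour described above can be sketched with String#unpack, which raises ArgumentError on malformed UTF-8; the sketch converts that into an EncodingError. u_unpack_sketch is a hypothetical name, and the error message is illustrative only.

```ruby
# Split a UTF-8 string into its codepoints, raising EncodingError
# when the bytes are not well-formed UTF-8.
def u_unpack_sketch(string)
  string.unpack('U*')
rescue ArgumentError
  raise EncodingError, 'malformed UTF-8 character'
end

u_unpack_sketch("Caf\u00E9") # => [67, 97, 102, 233]
```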
