\section{\module{unicodedata} ---
         Unicode Database}

\declaremodule{standard}{unicodedata}
\modulesynopsis{Access the Unicode Database.}
\moduleauthor{Marc-Andre Lemburg}{mal@lemburg.com}
\sectionauthor{Marc-Andre Lemburg}{mal@lemburg.com}
\sectionauthor{Martin v. L\"owis}{martin@v.loewis.de}

\indexii{Unicode}{database}

This module provides access to the Unicode Character Database, which
defines character properties for all Unicode characters. The data in
this database is based on version 3.2.0 of the
\file{UnicodeData.txt} file, which is publicly available from
\url{ftp://ftp.unicode.org/}.

The module uses the same names and symbols as defined by the
UnicodeData File Format 3.2.0 (see
\url{http://www.unicode.org/Public/UNIDATA/UnicodeData.html}). It
defines the following functions:

\begin{funcdesc}{lookup}{name}
  Look up a character by name. If a character with the given name is
  found, return the corresponding Unicode character. If not found,
  \exception{KeyError} is raised.
\end{funcdesc}

\begin{funcdesc}{name}{unichr\optional{, default}}
  Returns the name assigned to the Unicode character \var{unichr} as a
  string. If no name is defined, \var{default} is returned, or, if not
  given, \exception{ValueError} is raised.
\end{funcdesc}
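
As an illustration of how \code{lookup()} and \code{name()} mirror each
other, here is a minimal sketch. It assumes a current Python 3
interpreter, where ordinary strings are Unicode strings; the placeholder
text \code{<unnamed>} and the variable names are chosen only for this
example.

```python
import unicodedata

# lookup() maps a character name to the character itself.
cedilla_c = unicodedata.lookup("LATIN CAPITAL LETTER C WITH CEDILLA")
print(hex(ord(cedilla_c)))

# name() is the inverse mapping, from character to name.
print(unicodedata.name(cedilla_c))

# An unknown name makes lookup() raise KeyError.
try:
    unicodedata.lookup("NO SUCH CHARACTER NAME")
except KeyError:
    lookup_failed = True
else:
    lookup_failed = False

# Control characters such as U+0000 have no name, so name() would raise
# ValueError here unless a default is supplied.
fallback = unicodedata.name("\x00", "<unnamed>")
print(fallback)
```

Supplying a default is usually the more convenient choice when scanning
arbitrary text, since unnamed characters are common there.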

\begin{funcdesc}{decimal}{unichr\optional{, default}}
  Returns the decimal value assigned to the Unicode character
  \var{unichr} as an integer. If no such value is defined,
  \var{default} is returned, or, if not given,
  \exception{ValueError} is raised.
\end{funcdesc}

\begin{funcdesc}{digit}{unichr\optional{, default}}
  Returns the digit value assigned to the Unicode character
  \var{unichr} as an integer. If no such value is defined,
  \var{default} is returned, or, if not given,
  \exception{ValueError} is raised.
\end{funcdesc}

\begin{funcdesc}{numeric}{unichr\optional{, default}}
  Returns the numeric value assigned to the Unicode character
  \var{unichr} as a float. If no such value is defined, \var{default}
  is returned, or, if not given, \exception{ValueError} is raised.
\end{funcdesc}
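
The three numeric properties differ in which characters they cover. The
following sketch, assuming a current Python 3 interpreter, compares them
for an ordinary digit, a superscript digit and a fraction. The helper
\code{numeric_properties()} is hypothetical and exists only for this
illustration; it passes \code{None} as the default so that missing
values do not raise.

```python
import unicodedata

def numeric_properties(ch):
    """Return (decimal, digit, numeric) for ch, with None where undefined."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )

nine = numeric_properties("9")                 # DIGIT NINE
superscript_two = numeric_properties("\u00b2")  # SUPERSCRIPT TWO
one_half = numeric_properties("\u00bd")         # VULGAR FRACTION ONE HALF

for label, values in [("9", nine), ("U+00B2", superscript_two),
                      ("U+00BD", one_half)]:
    print(label, values)

# Without a default, an undefined value raises ValueError.
try:
    unicodedata.decimal("\u00bd")
except ValueError:
    decimal_failed = True
else:
    decimal_failed = False
```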

\begin{funcdesc}{category}{unichr}
  Returns the general category assigned to the Unicode character
  \var{unichr} as a string.
\end{funcdesc}

\begin{funcdesc}{bidirectional}{unichr}
  Returns the bidirectional category assigned to the Unicode character
  \var{unichr} as a string. If no such value is defined, an empty string
  is returned.
\end{funcdesc}
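
Both \code{category()} and \code{bidirectional()} return the short
abbreviations used in the Unicode Character Database. A minimal sketch,
assuming a current Python 3 interpreter; the sample characters are
chosen only for illustration.

```python
import unicodedata

samples = ["A", "a", "1", " ", "\u05d0"]  # U+05D0 is HEBREW LETTER ALEF

# Collect the general category and bidirectional category of each sample.
categories = {ch: unicodedata.category(ch) for ch in samples}
bidi = {ch: unicodedata.bidirectional(ch) for ch in samples}

for ch in samples:
    print("U+%04X" % ord(ch), categories[ch], bidi[ch])
```

Here \code{Lu} and \code{Ll} denote upper- and lowercase letters,
\code{Nd} a decimal digit and \code{Zs} a space separator, while the
bidirectional values distinguish left-to-right (\code{L}) from
right-to-left (\code{R}) characters.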

\begin{funcdesc}{combining}{unichr}
  Returns the canonical combining class assigned to the Unicode
  character \var{unichr} as an integer. Returns \code{0} if no combining
  class is defined.
\end{funcdesc}

\begin{funcdesc}{mirrored}{unichr}
  Returns the mirrored property assigned to the Unicode character
  \var{unichr} as an integer. Returns \code{1} if the character has been
  identified as a ``mirrored'' character in bidirectional text,
  \code{0} otherwise.
\end{funcdesc}
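
A short sketch of \code{combining()} and \code{mirrored()}, assuming a
current Python 3 interpreter. It uses the combining cedilla that also
appears in the normalization discussion, plus a parenthesis as an
example of a mirrored character.

```python
import unicodedata

# COMBINING CEDILLA attaches below its base character; a plain letter
# has no combining class and therefore reports 0.
cedilla_class = unicodedata.combining("\u0327")
letter_class = unicodedata.combining("C")
print(cedilla_class, letter_class)

# Parentheses are displayed mirrored in right-to-left text; letters are not.
paren_mirrored = unicodedata.mirrored("(")
letter_mirrored = unicodedata.mirrored("C")
print(paren_mirrored, letter_mirrored)
```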

\begin{funcdesc}{decomposition}{unichr}
  Returns the character decomposition mapping assigned to the Unicode
  character \var{unichr} as a string. An empty string is returned if
  no such mapping is defined.
\end{funcdesc}
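
The decomposition mapping is returned as a string of hexadecimal code
points, optionally preceded by a tag in angle brackets for compatibility
mappings. A minimal sketch, assuming a current Python 3 interpreter:

```python
import unicodedata

# Canonical decomposition: C WITH CEDILLA is C followed by COMBINING CEDILLA.
canonical = unicodedata.decomposition("\u00c7")

# Compatibility decomposition: ROMAN NUMERAL ONE maps to LATIN CAPITAL
# LETTER I, and the mapping is tagged with <compat>.
compat = unicodedata.decomposition("\u2160")

# A character with no mapping yields an empty string.
none_defined = unicodedata.decomposition("C")

print(repr(canonical), repr(compat), repr(none_defined))
```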

\begin{funcdesc}{normalize}{form, unistr}

Return the normal form \var{form} for the Unicode string \var{unistr}.
Valid values for \var{form} are 'NFC', 'NFKC', 'NFD', and 'NFKD'.

The Unicode standard defines various normalization forms of a Unicode
string, based on the definition of canonical equivalence and
compatibility equivalence. In Unicode, several characters can be
expressed in various ways. For example, the character U+00C7 (LATIN
CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence
U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

For each character, there are two normal forms: normal form C and
normal form D. Normal form D (NFD) is also known as canonical
decomposition, and translates each character into its decomposed form.
Normal form C (NFC) first applies a canonical decomposition, then
composes pre-combined characters again.

In addition to these two forms, there are two additional normal forms
based on compatibility equivalence. In Unicode, certain characters are
supported which normally would be unified with other characters. For
example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049
(LATIN CAPITAL LETTER I). However, it is supported in Unicode for
compatibility with existing character sets (e.g. gb2312).

The normal form KD (NFKD) will apply the compatibility decomposition,
i.e. replace all compatibility characters with their equivalents. The
normal form KC (NFKC) first applies the compatibility decomposition,
followed by the canonical composition.
\end{funcdesc}
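
The four normal forms can be compared on the two characters discussed
for \code{normalize()}. This is a minimal sketch, assuming a current
Python 3 interpreter; the variable names are chosen only for
illustration.

```python
import unicodedata

composed = "\u00c7"       # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = "C\u0327"    # LATIN CAPITAL LETTER C + COMBINING CEDILLA
roman_one = "\u2160"      # ROMAN NUMERAL ONE

# NFD splits the precomposed character; NFC puts the pieces back together.
nfd = unicodedata.normalize("NFD", composed)
nfc = unicodedata.normalize("NFC", decomposed)

# Canonical forms leave compatibility characters alone, while the
# K forms replace them with their equivalents.
nfc_roman = unicodedata.normalize("NFC", roman_one)
nfkd_roman = unicodedata.normalize("NFKD", roman_one)
nfkc_roman = unicodedata.normalize("NFKC", roman_one)

print([hex(ord(c)) for c in nfd], [hex(ord(c)) for c in nfc])
print(repr(nfc_roman), repr(nfkd_roman), repr(nfkc_roman))
```

Normalizing both operands to the same form before comparing them is one
way to make strings that differ only in composition compare equal.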

In addition, the module exposes the following constant:

\begin{datadesc}{unidata_version}
The version of the Unicode database used in this module.
\end{datadesc}
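
Because the bundled database differs between Python releases, code that
depends on particular character data may want to inspect the version at
run time. A minimal sketch, assuming a current Python 3 interpreter and
assuming the version string consists of dot-separated integers:

```python
import unicodedata

version = unicodedata.unidata_version
print(version)

# Assumption: the string is made of dot-separated integers, so it can be
# turned into a tuple for comparisons.
version_tuple = tuple(int(part) for part in version.split("."))
print(version_tuple >= (3, 2, 0))
```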