Doc/c-api/unicode.rst

   1 .. highlightlang:: c
   2
   3 .. _unicodeobjects:
   4
   5 Unicode Objects and Codecs
   6 --------------------------
   7
   8 .. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
   9
  10 Unicode Objects
  11 ^^^^^^^^^^^^^^^
  12
  13 These are the basic Unicode object types used for the Unicode implementation in
  14 Python:
  15
  16 .. % --- Unicode Type -------------------------------------------------------
  17
  18
  19 .. ctype:: Py_UNICODE
  20
  21    This type represents the storage type that Python uses internally as the
  22    basis for holding Unicode ordinals.  Python's default builds use a 16-bit type
  23    for :ctype:`Py_UNICODE` and store Unicode values internally as UCS2. It is also
  24    possible to build a UCS4 version of Python (most recent Linux distributions come
  25    with UCS4 builds of Python). These builds then use a 32-bit type for
  26    :ctype:`Py_UNICODE` and store Unicode data internally as UCS4. On platforms
  27    where :ctype:`wchar_t` is available and compatible with the chosen Python
  28    Unicode build variant, :ctype:`Py_UNICODE` is a typedef alias for
  29    :ctype:`wchar_t` to enhance native platform compatibility. On all other
  30    platforms, :ctype:`Py_UNICODE` is a typedef alias for either :ctype:`unsigned
  31    short` (UCS2) or :ctype:`unsigned long` (UCS4).
  32
  33 Note that UCS2 and UCS4 Python builds are not binary compatible. Please keep
  34 this in mind when writing extensions or interfaces.
  35
  36
  37 .. ctype:: PyUnicodeObject
  38
  39    This subtype of :ctype:`PyObject` represents a Python Unicode object.
  40
  41
  42 .. cvar:: PyTypeObject PyUnicode_Type
  43
  44    This instance of :ctype:`PyTypeObject` represents the Python Unicode type.  It
  45    is exposed to Python code as ``str``.
  46
  47 The following APIs are really C macros and can be used to do fast checks and to
  48 access internal read-only data of Unicode objects:
  49
  50
  51 .. cfunction:: int PyUnicode_Check(PyObject *o)
  52
  53    Return true if the object *o* is a Unicode object or an instance of a Unicode
  54    subtype.
  55
  56
  57 .. cfunction:: int PyUnicode_CheckExact(PyObject *o)
  58
  59    Return true if the object *o* is a Unicode object, but not an instance of a
  60    subtype.
  61
  62
  63 .. cfunction:: Py_ssize_t PyUnicode_GET_SIZE(PyObject *o)
  64
  65    Return the size of the object.  *o* has to be a :ctype:`PyUnicodeObject` (not
  66    checked).
  67
  68
  69 .. cfunction:: Py_ssize_t PyUnicode_GET_DATA_SIZE(PyObject *o)
  70
  71    Return the size of the object's internal buffer in bytes.  *o* has to be a
  72    :ctype:`PyUnicodeObject` (not checked).
  73
  74
  75 .. cfunction:: Py_UNICODE* PyUnicode_AS_UNICODE(PyObject *o)
  76
  77    Return a pointer to the internal :ctype:`Py_UNICODE` buffer of the object.  *o*
  78    has to be a :ctype:`PyUnicodeObject` (not checked).
  79
  80
  81 .. cfunction:: const char* PyUnicode_AS_DATA(PyObject *o)
  82
  83    Return a pointer to the internal buffer of the object. *o* has to be a
  84    :ctype:`PyUnicodeObject` (not checked).
  85
  86
  87 .. cfunction:: int PyUnicode_ClearFreeList(void)
  88
  89    Clear the free list. Return the total number of freed items.
  90
  91 Unicode provides many different character properties. The most often needed ones
  92 are available through these macros, which are mapped to C functions depending on
  93 the Python configuration.
  94
  95 .. % --- Unicode character properties ---------------------------------------
  96
  97
  98 .. cfunction:: int Py_UNICODE_ISSPACE(Py_UNICODE ch)
  99
 100    Return 1 or 0 depending on whether *ch* is a whitespace character.
 101
 102
 103 .. cfunction:: int Py_UNICODE_ISLOWER(Py_UNICODE ch)
 104
 105    Return 1 or 0 depending on whether *ch* is a lowercase character.
 106
 107
 108 .. cfunction:: int Py_UNICODE_ISUPPER(Py_UNICODE ch)
 109
 110    Return 1 or 0 depending on whether *ch* is an uppercase character.
 111
 112
 113 .. cfunction:: int Py_UNICODE_ISTITLE(Py_UNICODE ch)
 114
 115    Return 1 or 0 depending on whether *ch* is a titlecase character.
 116
 117
 118 .. cfunction:: int Py_UNICODE_ISLINEBREAK(Py_UNICODE ch)
 119
 120    Return 1 or 0 depending on whether *ch* is a linebreak character.
 121
 122
 123 .. cfunction:: int Py_UNICODE_ISDECIMAL(Py_UNICODE ch)
 124
 125    Return 1 or 0 depending on whether *ch* is a decimal character.
 126
 127
 128 .. cfunction:: int Py_UNICODE_ISDIGIT(Py_UNICODE ch)
 129
 130    Return 1 or 0 depending on whether *ch* is a digit character.
 131
 132
 133 .. cfunction:: int Py_UNICODE_ISNUMERIC(Py_UNICODE ch)
 134
 135    Return 1 or 0 depending on whether *ch* is a numeric character.
 136
 137
 138 .. cfunction:: int Py_UNICODE_ISALPHA(Py_UNICODE ch)
 139
 140    Return 1 or 0 depending on whether *ch* is an alphabetic character.
 141
 142
 143 .. cfunction:: int Py_UNICODE_ISALNUM(Py_UNICODE ch)
 144
 145    Return 1 or 0 depending on whether *ch* is an alphanumeric character.
 146
 147 These APIs can be used for fast direct character conversions:
 148
 149
 150 .. cfunction:: Py_UNICODE Py_UNICODE_TOLOWER(Py_UNICODE ch)
 151
 152    Return the character *ch* converted to lower case.
 153
 154
 155 .. cfunction:: Py_UNICODE Py_UNICODE_TOUPPER(Py_UNICODE ch)
 156
 157    Return the character *ch* converted to upper case.
 158
 159
 160 .. cfunction:: Py_UNICODE Py_UNICODE_TOTITLE(Py_UNICODE ch)
 161
 162    Return the character *ch* converted to title case.
 163
 164
 165 .. cfunction:: int Py_UNICODE_TODECIMAL(Py_UNICODE ch)
 166
 167    Return the character *ch* converted to a non-negative decimal integer.  Return
 168    ``-1`` if this is not possible.  This macro does not raise exceptions.
 169
 170
 171 .. cfunction:: int Py_UNICODE_TODIGIT(Py_UNICODE ch)
 172
 173    Return the character *ch* converted to a single digit integer. Return ``-1`` if
 174    this is not possible.  This macro does not raise exceptions.
 175
 176
 177 .. cfunction:: double Py_UNICODE_TONUMERIC(Py_UNICODE ch)
 178
 179    Return the character *ch* converted to a double. Return ``-1.0`` if this is not
 180    possible.  This macro does not raise exceptions.
 181
 182 To create Unicode objects and access their basic sequence properties, use these
 183 APIs:
 184
 185 .. % --- Plain Py_UNICODE ---------------------------------------------------
 186
 187
 188 .. cfunction:: PyObject* PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)
 189
 190    Create a Unicode object from the :ctype:`Py_UNICODE` buffer *u* of the given
 191    size.  *u* may be *NULL*, which leaves the contents undefined; it is then the
 192    caller's responsibility to fill in the needed data.  Otherwise the buffer is
 193    copied into the new object.  If *u* is not *NULL*, the return value might be a
 194    shared object, so modifying the resulting Unicode object is only allowed when
 195    *u* is *NULL*.
 196
 197
 198 .. cfunction:: PyObject* PyUnicode_FromStringAndSize(const char *u, Py_ssize_t size)
 199
 200    Create a Unicode object from the char buffer *u*.  The bytes will be interpreted
 201    as being UTF-8 encoded.  *u* may also be *NULL*, which leaves the contents
 202    undefined; it is then the caller's responsibility to fill in the needed data.
 203    Otherwise the buffer is copied into the new object.  If *u* is not *NULL*, the
 204    return value might be a shared object, so modifying the resulting Unicode
 205    object is only allowed when *u* is *NULL*.
 206
 207
 208 .. cfunction:: PyObject *PyUnicode_FromString(const char *u)
 209
 210    Create a Unicode object from a UTF-8 encoded null-terminated char buffer
 211    *u*.
 212
 213
 214 .. cfunction:: PyObject* PyUnicode_FromFormat(const char *format, ...)
 215
 216    Take a C :cfunc:`printf`\ -style *format* string and a variable number of
 217    arguments, calculate the size of the resulting Python unicode string and return
 218    a string with the values formatted into it.  The variable arguments must be C
 219    types and must correspond exactly to the format characters in the *format*
 220    string.  The following format characters are allowed:
 221
 222    .. % The descriptions for %zd and %zu are wrong, but the truth is complicated
 223    .. % because not all compilers support the %z width modifier -- we fake it
 224    .. % when necessary via interpolating PY_FORMAT_SIZE_T.
 225
 226    +-------------------+---------------------+--------------------------------+
 227    | Format Characters | Type                | Comment                        |
 228    +===================+=====================+================================+
 229    | :attr:`%%`        | *n/a*               | The literal % character.       |
 230    +-------------------+---------------------+--------------------------------+
 231    | :attr:`%c`        | int                 | A single character,            |
 232    |                   |                     | represented as a C int.        |
 233    +-------------------+---------------------+--------------------------------+
 234    | :attr:`%d`        | int                 | Exactly equivalent to          |
 235    |                   |                     | ``printf("%d")``.              |
 236    +-------------------+---------------------+--------------------------------+
 237    | :attr:`%u`        | unsigned int        | Exactly equivalent to          |
 238    |                   |                     | ``printf("%u")``.              |
 239    +-------------------+---------------------+--------------------------------+
 240    | :attr:`%ld`       | long                | Exactly equivalent to          |
 241    |                   |                     | ``printf("%ld")``.             |
 242    +-------------------+---------------------+--------------------------------+
 243    | :attr:`%lu`       | unsigned long       | Exactly equivalent to          |
 244    |                   |                     | ``printf("%lu")``.             |
 245    +-------------------+---------------------+--------------------------------+
 246    | :attr:`%zd`       | Py_ssize_t          | Exactly equivalent to          |
 247    |                   |                     | ``printf("%zd")``.             |
 248    +-------------------+---------------------+--------------------------------+
 249    | :attr:`%zu`       | size_t              | Exactly equivalent to          |
 250    |                   |                     | ``printf("%zu")``.             |
 251    +-------------------+---------------------+--------------------------------+
 252    | :attr:`%i`        | int                 | Exactly equivalent to          |
 253    |                   |                     | ``printf("%i")``.              |
 254    +-------------------+---------------------+--------------------------------+
 255    | :attr:`%x`        | int                 | Exactly equivalent to          |
 256    |                   |                     | ``printf("%x")``.              |
 257    +-------------------+---------------------+--------------------------------+
 258    | :attr:`%s`        | char\*              | A null-terminated C character  |
 259    |                   |                     | array.                         |
 260    +-------------------+---------------------+--------------------------------+
 261    | :attr:`%p`        | void\*              | The hex representation of a C  |
 262    |                   |                     | pointer. Mostly equivalent to  |
 263    |                   |                     | ``printf("%p")`` except that   |
 264    |                   |                     | it is guaranteed to start with |
 265    |                   |                     | the literal ``0x`` regardless  |
 266    |                   |                     | of what the platform's         |
 267    |                   |                     | ``printf`` yields.             |
 268    +-------------------+---------------------+--------------------------------+
 269    | :attr:`%U`        | PyObject\*          | A unicode object.              |
 270    +-------------------+---------------------+--------------------------------+
 271    | :attr:`%V`        | PyObject\*, char \* | A unicode object (which may be |
 272    |                   |                     | *NULL*) and a null-terminated  |
 273    |                   |                     | C character array as a second  |
 274    |                   |                     | parameter (which will be used, |
 275    |                   |                     | if the first parameter is      |
 276    |                   |                     | *NULL*).                       |
 277    +-------------------+---------------------+--------------------------------+
 278    | :attr:`%S`        | PyObject\*          | The result of calling          |
 279    |                   |                     | :cfunc:`PyObject_Unicode`.     |
 280    +-------------------+---------------------+--------------------------------+
 281    | :attr:`%R`        | PyObject\*          | The result of calling          |
 282    |                   |                     | :cfunc:`PyObject_Repr`.        |
 283    +-------------------+---------------------+--------------------------------+
 284
 285    An unrecognized format character causes all the rest of the format string to be
 286    copied as-is to the result string, and any extra arguments discarded.
 287
 288
 289 .. cfunction:: PyObject* PyUnicode_FromFormatV(const char *format, va_list vargs)
 290
 291    Identical to :cfunc:`PyUnicode_FromFormat` except that it takes exactly two
 292    arguments.
 293
 294
 295 .. cfunction:: Py_UNICODE* PyUnicode_AsUnicode(PyObject *unicode)
 296
 297    Return a read-only pointer to the Unicode object's internal :ctype:`Py_UNICODE`
 298    buffer, or *NULL* if *unicode* is not a Unicode object.
 299
 300
 301 .. cfunction:: Py_ssize_t PyUnicode_GetSize(PyObject *unicode)
 302
 303    Return the length of the Unicode object.
 304
 305
 306 .. cfunction:: PyObject* PyUnicode_FromEncodedObject(PyObject *obj, const char *encoding, const char *errors)
 307
 308    Coerce an encoded object *obj* to a Unicode object and return a reference with
 309    incremented refcount.
 310
 311    String and other char buffer compatible objects are decoded according to the
 312    given *encoding* and using the error handling defined by *errors*.  Both can be
 313    *NULL* to have the interface use the default values (see the next section for
 314    details).
 315
 316    All other objects, including Unicode objects, cause a :exc:`TypeError` to be
 317    set.
 318
 319    The API returns *NULL* if there was an error.  The caller is responsible for
 320    decref'ing the returned objects.
 321
 322
 323 .. cfunction:: PyObject* PyUnicode_FromObject(PyObject *obj)
 324
 325    Shortcut for ``PyUnicode_FromEncodedObject(obj, NULL, "strict")`` which is used
 326    throughout the interpreter whenever coercion to Unicode is needed.
 327
 328 If the platform supports :ctype:`wchar_t` and provides a header file wchar.h,
 329 Python can interface directly to this type using the following functions.
 330 Support is optimized if Python's own :ctype:`Py_UNICODE` type is identical to
 331 the system's :ctype:`wchar_t`.
 332
 333 .. % --- wchar_t support for platforms which support it ---------------------
 334
 335
 336 .. cfunction:: PyObject* PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)
 337
 338    Create a Unicode object from the :ctype:`wchar_t` buffer *w* of the given size.
 339    Passing ``-1`` as the size indicates that the function must itself compute the
 340    length, using :cfunc:`wcslen`.
 341    Return *NULL* on failure.
 342
 343
 344 .. cfunction:: Py_ssize_t PyUnicode_AsWideChar(PyUnicodeObject *unicode, wchar_t *w, Py_ssize_t size)
 345
 346    Copy the Unicode object contents into the :ctype:`wchar_t` buffer *w*.  At most
 347    *size* :ctype:`wchar_t` characters are copied (excluding a possibly trailing
 348    0-termination character).  Return the number of :ctype:`wchar_t` characters
 349    copied or -1 in case of an error.  Note that the resulting :ctype:`wchar_t`
 350    string may or may not be 0-terminated.  It is the responsibility of the caller
 351    to make sure that the :ctype:`wchar_t` string is 0-terminated in case this is
 352    required by the application.
 353
 354
 355 .. _builtincodecs:
 356
 357 Built-in Codecs
 358 ^^^^^^^^^^^^^^^
 359
 360 Python provides a set of builtin codecs which are written in C for speed. All of
 361 these codecs are directly usable via the following functions.
 362
 363 Many of the following APIs take two arguments, *encoding* and *errors*.  These
 364 have the same semantics as the parameters of the same name in the builtin
 365 :func:`unicode` Unicode object constructor.
 366
 367 Setting *encoding* to *NULL* causes the default encoding to be used, which is
 368 ASCII.  The file system calls should use :cdata:`Py_FileSystemDefaultEncoding`
 369 as the encoding for file names.  This variable should be treated as read-only: on
 370 some systems it will be a pointer to a static string, on others it will change
 371 at run-time (such as when the application invokes :cfunc:`setlocale`).
 372
 373 Error handling is set by *errors*, which may also be set to *NULL*, meaning to use
 374 the default handling defined for the codec.  Default error handling for all
 375 builtin codecs is "strict" (:exc:`ValueError` is raised).
 376
 377 The codecs all use a similar interface.  For simplicity, only deviations from the
 378 following generic ones are documented.
 379
 380 These are the generic codec APIs:
 381
 382 .. % --- Generic Codecs -----------------------------------------------------
 383
 384
 385 .. cfunction:: PyObject* PyUnicode_Decode(const char *s, Py_ssize_t size, const char *encoding, const char *errors)
 386
 387    Create a Unicode object by decoding *size* bytes of the encoded string *s*.
 388    *encoding* and *errors* have the same meaning as the parameters of the same name
 389    in the :func:`unicode` builtin function.  The codec to be used is looked up
 390    using the Python codec registry.  Return *NULL* if an exception was raised by
 391    the codec.
 392
 393
 394 .. cfunction:: PyObject* PyUnicode_Encode(const Py_UNICODE *s, Py_ssize_t size, const char *encoding, const char *errors)
 395
 396    Encode the :ctype:`Py_UNICODE` buffer of the given size and return a Python
 397    string object.  *encoding* and *errors* have the same meaning as the parameters
 398    of the same name in the Unicode :meth:`encode` method.  The codec to be used is
 399    looked up using the Python codec registry.  Return *NULL* if an exception was
 400    raised by the codec.
 401
 402
 403 .. cfunction:: PyObject* PyUnicode_AsEncodedString(PyObject *unicode, const char *encoding, const char *errors)
 404
 405    Encode a Unicode object and return the result as a Python string object.
 406    *encoding* and *errors* have the same meaning as the parameters of the same name
 407    in the Unicode :meth:`encode` method. The codec to be used is looked up using
 408    the Python codec registry. Return *NULL* if an exception was raised by the
 409    codec.
 410
 411 These are the UTF-8 codec APIs:
 412
 413 .. % --- UTF-8 Codecs -------------------------------------------------------
 414
 415
 416 .. cfunction:: PyObject* PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size, const char *errors)
 417
 418    Create a Unicode object by decoding *size* bytes of the UTF-8 encoded string
 419    *s*. Return *NULL* if an exception was raised by the codec.
 420
 421
 422 .. cfunction:: PyObject* PyUnicode_DecodeUTF8Stateful(const char *s, Py_ssize_t size, const char *errors, Py_ssize_t *consumed)
 423
 424    If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF8`. If
 425    *consumed* is not *NULL*, trailing incomplete UTF-8 byte sequences will not be
 426    treated as an error. Those bytes will not be decoded and the number of bytes
 427    that have been decoded will be stored in *consumed*.
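
        As a rough illustration of what *consumed* reports, the following standalone
        sketch computes how many leading bytes of a buffer end on a complete UTF-8
        sequence.  It does not call the Python API: the helper name
        ``utf8_complete_prefix`` is hypothetical, and the logic is an assumption based
        on the description above, not the interpreter's implementation::

           #include <assert.h>
           #include <stddef.h>

           /* Return the number of leading bytes of s that do not end in the
              middle of a UTF-8 sequence.  Malformed input is not validated
              here; this only detects a truncated final sequence. */
           size_t
           utf8_complete_prefix(const unsigned char *s, size_t size)
           {
               size_t i = size;
               size_t back = 0;    /* continuation bytes stepped over */
               size_t need;        /* sequence length announced by the lead byte */
               unsigned char lead;

               assert(s != NULL || size == 0);

               /* Step back over at most three continuation bytes (10xxxxxx). */
               while (i > 0 && back < 3 && (s[i - 1] & 0xC0) == 0x80) {
                   i--;
                   back++;
               }
               if (i == 0)
                   return size;    /* no lead byte found: nothing to hold back */

               lead = s[i - 1];
               if (lead < 0x80)
                   need = 1;
               else if ((lead & 0xE0) == 0xC0)
                   need = 2;
               else if ((lead & 0xF0) == 0xE0)
                   need = 3;
               else if ((lead & 0xF8) == 0xF0)
                   need = 4;
               else
                   return size;    /* not a lead byte: a real decoder reports it */

               /* Hold the final sequence back only if it is truncated. */
               return (back + 1 < need) ? i - 1 : size;
           }

        A caller reading data in chunks would keep the bytes from the returned offset
        onward and prepend them to the next chunk before decoding again.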
 428
 429
 430 .. cfunction:: PyObject* PyUnicode_EncodeUTF8(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
 431
 432    Encode the :ctype:`Py_UNICODE` buffer of the given size using UTF-8 and return a
 433    Python string object.  Return *NULL* if an exception was raised by the codec.
 434
 435
 436 .. cfunction:: PyObject* PyUnicode_AsUTF8String(PyObject *unicode)
 437
 438    Encode a Unicode object using UTF-8 and return the result as a Python string
 439    object.  Error handling is "strict".  Return *NULL* if an exception was raised
 440    by the codec.
 441
 442 These are the UTF-32 codec APIs:
 443
 444 .. % --- UTF-32 Codecs ------------------------------------------------------
 445
 446
 447 .. cfunction:: PyObject* PyUnicode_DecodeUTF32(const char *s, Py_ssize_t size, const char *errors, int *byteorder)
 448
 449    Decode *size* bytes from the UTF-32 encoded buffer string *s* and return the
 450    corresponding Unicode object.  *errors* (if non-*NULL*) defines the error
 451    handling. It defaults to "strict".
 452
 453    If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
 454    order::
 455
 456       *byteorder == -1: little endian
 457       *byteorder == 0:  native order
 458       *byteorder == 1:  big endian
 459
 460    and then switches if the first four bytes of the input data are a byte order mark
 461    (BOM) and the specified byte order is native order.  This BOM is not copied into
 462    the resulting Unicode string.  After completion, *\*byteorder* is set to the
 463    current byte order at the end of input data.
 464
 465    In a narrow build, code points outside the BMP will be decoded as surrogate pairs.
 466
 467    If *byteorder* is *NULL*, the codec starts in native order mode.
 468
 469    Return *NULL* if an exception was raised by the codec.
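
        The *byteorder* convention can be sketched without the Python API.  The
        standalone helper below (``utf32_check_bom`` is a hypothetical name) only
        mirrors the rule described above: a BOM is honoured when the caller asked
        for native order, and otherwise the requested order is kept.  It is an
        illustration under that assumption, not the decoder itself::

           #include <assert.h>
           #include <stddef.h>

           /* Inspect the first four bytes of s for a UTF-32 BOM (U+FEFF).
              Acts only when *byteorder is 0 (native order).  Returns the
              number of BOM bytes to skip (0 or 4) and sets *byteorder to
              -1 (little endian) or 1 (big endian) when a BOM is found. */
           size_t
           utf32_check_bom(const unsigned char *s, size_t size, int *byteorder)
           {
               assert(byteorder != NULL);

               if (*byteorder != 0 || size < 4)
                   return 0;
               if (s[0] == 0xFF && s[1] == 0xFE && s[2] == 0x00 && s[3] == 0x00) {
                   *byteorder = -1;
                   return 4;
               }
               if (s[0] == 0x00 && s[1] == 0x00 && s[2] == 0xFE && s[3] == 0xFF) {
                   *byteorder = 1;
                   return 4;
               }
               return 0;    /* no BOM: stay in the order that was requested */
           }

        Skipping the returned number of bytes corresponds to the BOM not being copied
        into the result.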
 470
 471
 472 .. cfunction:: PyObject* PyUnicode_DecodeUTF32Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)
 473
 474    If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF32`. If
 475    *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF32Stateful` will not treat
 476    trailing incomplete UTF-32 byte sequences (such as a number of bytes not divisible
 477    by four) as an error. Those bytes will not be decoded and the number of bytes
 478    that have been decoded will be stored in *consumed*.
 479
 480
 481 .. cfunction:: PyObject* PyUnicode_EncodeUTF32(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)
 482
 483    Return a Python bytes object holding the UTF-32 encoded value of the Unicode
 484    data in *s*.  Output is written according to the value of *byteorder*, as
 485    follows::
 486
 487       byteorder == -1: little endian
 488       byteorder == 0:  native byte order (writes a BOM)
 489       byteorder == 1:  big endian
 490
 491    If *byteorder* is ``0``, the output string will always start with the Unicode BOM
 492    (U+FEFF).  In the other two modes, no BOM is prepended.
 493
 494    If *Py_UNICODE_WIDE* is not defined, surrogate pairs will be output
 495    as a single codepoint.
 496
 497    Return *NULL* if an exception was raised by the codec.
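
        The surrogate handling mentioned above amounts to combining two 16-bit units
        into one code point before it is written as a single UTF-32 unit.  A minimal
        standalone sketch of that arithmetic follows; the helper names are
        hypothetical and the code does not use the Python API::

           #include <stdint.h>

           /* Return 1 if hi and lo form a high/low surrogate pair, else 0. */
           int
           is_surrogate_pair(uint16_t hi, uint16_t lo)
           {
               return hi >= 0xD800 && hi <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF;
           }

           /* Combine a surrogate pair into the code point it represents.
              The caller is expected to check is_surrogate_pair() first. */
           uint32_t
           combine_surrogates(uint16_t hi, uint16_t lo)
           {
               return 0x10000u
                      + (((uint32_t)hi - 0xD800u) << 10)
                      + ((uint32_t)lo - 0xDC00u);
           }

        For example, the pair ``0xD83D``, ``0xDE00`` combines to U+1F600.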
 498
 499
 500 .. cfunction:: PyObject* PyUnicode_AsUTF32String(PyObject *unicode)
 501
 502    Return a Python string using the UTF-32 encoding in native byte order. The
 503    string always starts with a BOM.  Error handling is "strict".  Return
 504    *NULL* if an exception was raised by the codec.
 505
 506
 507 These are the UTF-16 codec APIs:
 508
 509 .. % --- UTF-16 Codecs ------------------------------------------------------
 510
 511
 512 .. cfunction:: PyObject* PyUnicode_DecodeUTF16(const char *s, Py_ssize_t size, const char *errors, int *byteorder)
 513
 514    Decode *size* bytes from the UTF-16 encoded buffer string *s* and return the
 515    corresponding Unicode object.  *errors* (if non-*NULL*) defines the error
 516    handling. It defaults to "strict".
 517
 518    If *byteorder* is non-*NULL*, the decoder starts decoding using the given byte
 519    order::
 520
 521       *byteorder == -1: little endian
 522       *byteorder == 0:  native order
 523       *byteorder == 1:  big endian
 524
 525    and then switches if the first two bytes of the input data are a byte order mark
 526    (BOM) and the specified byte order is native order.  This BOM is not copied into
 527    the resulting Unicode string.  After completion, *\*byteorder* is set to the
 528    current byte order at the end of input data.
 529
 530    If *byteorder* is *NULL*, the codec starts in native order mode.
 531
 532    Return *NULL* if an exception was raised by the codec.
 533
 534
 535 .. cfunction:: PyObject* PyUnicode_DecodeUTF16Stateful(const char *s, Py_ssize_t size, const char *errors, int *byteorder, Py_ssize_t *consumed)
 536
 537    If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeUTF16`. If
 538    *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeUTF16Stateful` will not treat
 539    trailing incomplete UTF-16 byte sequences (such as an odd number of bytes or a
 540    split surrogate pair) as an error. Those bytes will not be decoded and the
 541    number of bytes that have been decoded will be stored in *consumed*.
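
        The two kinds of incomplete trailing data named above (an odd byte and a split
        surrogate pair) can be sketched as follows.  This is a standalone
        illustration that assumes little-endian input; ``utf16le_complete_prefix`` is
        a hypothetical helper, not part of the Python API::

           #include <assert.h>
           #include <stddef.h>

           /* Return how many leading bytes of the little-endian UTF-16 buffer
              s form complete code units, holding back a trailing odd byte or
              a high surrogate whose partner has not arrived yet.  A high
              surrogate followed by a full unit is consumed without checking
              that the unit is a low surrogate; that check is left to a real
              decoder. */
           size_t
           utf16le_complete_prefix(const unsigned char *s, size_t size)
           {
               size_t i = 0;

               assert(s != NULL || size == 0);

               while (i + 2 <= size) {
                   unsigned int unit = s[i] | ((unsigned int)s[i + 1] << 8);

                   if (unit >= 0xD800 && unit <= 0xDBFF) {
                       if (i + 4 > size)
                           break;      /* split surrogate pair: stop here */
                       i += 4;
                   }
                   else {
                       i += 2;
                   }
               }
               return i;               /* a trailing odd byte is never counted */
           }

        The returned offset plays the role of *consumed*: bytes past it would be
        carried over to the next call.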
 542
 543
 544 .. cfunction:: PyObject* PyUnicode_EncodeUTF16(const Py_UNICODE *s, Py_ssize_t size, const char *errors, int byteorder)
 545
 546    Return a Python string object holding the UTF-16 encoded value of the Unicode
 547    data in *s*.  Output is written according to the value of *byteorder*, as
 548    follows::
 549
 550       byteorder == -1: little endian
 551       byteorder == 0:  native byte order (writes a BOM)
 552       byteorder == 1:  big endian
 553
 554    If *byteorder* is ``0``, the output string will always start with the Unicode BOM
 555    (U+FEFF).  In the other two modes, no BOM is prepended.
 556
 557    If *Py_UNICODE_WIDE* is defined, a single :ctype:`Py_UNICODE` value may get
 558    represented as a surrogate pair. If it is not defined, each :ctype:`Py_UNICODE`
 559    value is interpreted as a UCS-2 character.
 560
 561    Return *NULL* if an exception was raised by the codec.
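
        How a single wide value can turn into a surrogate pair is shown by the
        standalone sketch below.  ``utf16_units`` is a hypothetical helper that
        only demonstrates the standard UTF-16 arithmetic; it is not taken from the
        encoder and does not use the Python API::

           #include <assert.h>
           #include <stdint.h>

           /* Store the UTF-16 code units for code point cp in out.
              Returns the number of units written (1 or 2), or 0 if cp
              lies beyond U+10FFFF. */
           int
           utf16_units(uint32_t cp, uint16_t out[2])
           {
               assert(out != NULL);

               if (cp < 0x10000) {
                   out[0] = (uint16_t)cp;          /* BMP: one unit */
                   return 1;
               }
               if (cp <= 0x10FFFF) {
                   cp -= 0x10000;                  /* 20 bits remain */
                   out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high */
                   out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low */
                   return 2;
               }
               return 0;
           }

        For U+1F600 this yields the two units ``0xD83D`` and ``0xDE00``.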
 562
 563
 564 .. cfunction:: PyObject* PyUnicode_AsUTF16String(PyObject *unicode)
 565
 566    Return a Python string using the UTF-16 encoding in native byte order. The
 567    string always starts with a BOM.  Error handling is "strict".  Return
 568    *NULL* if an exception was raised by the codec.
 569
 570 These are the "Unicode Escape" codec APIs:
 571
 572 .. % --- Unicode-Escape Codecs ----------------------------------------------
 573
 574
 575 .. cfunction:: PyObject* PyUnicode_DecodeUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)
 576
 577    Create a Unicode object by decoding *size* bytes of the Unicode-Escape encoded
 578    string *s*.  Return *NULL* if an exception was raised by the codec.
 579
 580
 581 .. cfunction:: PyObject* PyUnicode_EncodeUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size)
 582
 583    Encode the :ctype:`Py_UNICODE` buffer of the given size using Unicode-Escape and
 584    return a Python string object.  Return *NULL* if an exception was raised by the
 585    codec.
 586
 587
 588 .. cfunction:: PyObject* PyUnicode_AsUnicodeEscapeString(PyObject *unicode)
 589
 590    Encode a Unicode object using Unicode-Escape and return the result as a Python
 591    string object.  Error handling is "strict". Return *NULL* if an exception was
 592    raised by the codec.
 593
 594 These are the "Raw Unicode Escape" codec APIs:
 595
 596 .. % --- Raw-Unicode-Escape Codecs ------------------------------------------
 597
 598
 599 .. cfunction:: PyObject* PyUnicode_DecodeRawUnicodeEscape(const char *s, Py_ssize_t size, const char *errors)
 600
 601    Create a Unicode object by decoding *size* bytes of the Raw-Unicode-Escape
 602    encoded string *s*.  Return *NULL* if an exception was raised by the codec.
 603
 604
 605 .. cfunction:: PyObject* PyUnicode_EncodeRawUnicodeEscape(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
 606
 607    Encode the :ctype:`Py_UNICODE` buffer of the given size using Raw-Unicode-Escape
 608    and return a Python string object.  Return *NULL* if an exception was raised by
 609    the codec.
 610
 611
 612 .. cfunction:: PyObject* PyUnicode_AsRawUnicodeEscapeString(PyObject *unicode)
 613
 614    Encode a Unicode object using Raw-Unicode-Escape and return the result as
 615    a Python string object.  Error handling is "strict".  Return *NULL* if an exception
 616    was raised by the codec.
 617
 618 These are the Latin-1 codec APIs.  Latin-1 corresponds to the first 256 Unicode
 619 ordinals and only these are accepted by the codecs during encoding.
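
     Because Latin-1 bytes and the first 256 Unicode ordinals coincide, encoding
     is a range check followed by a narrowing copy.  The standalone sketch below
     illustrates this under that assumption; ``latin1_encode_strict`` is a
     hypothetical helper, and returning ``-1`` merely stands in for the error a
     strict codec would raise::

        #include <assert.h>
        #include <stddef.h>

        /* Copy size ordinals from u into out as Latin-1 bytes.  Returns
           the number of bytes written, or -1 at the first ordinal above
           255, which Latin-1 cannot represent. */
        long
        latin1_encode_strict(const unsigned long *u, size_t size,
                             unsigned char *out)
        {
            size_t i;

            assert(u != NULL && out != NULL);

            for (i = 0; i < size; i++) {
                if (u[i] > 0xFF)
                    return -1;
                out[i] = (unsigned char)u[i];
            }
            return (long)size;
        }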
 620
 621 .. % --- Latin-1 Codecs -----------------------------------------------------
 622
 623
 624 .. cfunction:: PyObject* PyUnicode_DecodeLatin1(const char *s, Py_ssize_t size, const char *errors)
 625
 626    Create a Unicode object by decoding *size* bytes of the Latin-1 encoded string
 627    *s*.  Return *NULL* if an exception was raised by the codec.
 628
 629
 630 .. cfunction:: PyObject* PyUnicode_EncodeLatin1(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
 631
 632    Encode the :ctype:`Py_UNICODE` buffer of the given size using Latin-1 and return
 633    a Python string object.  Return *NULL* if an exception was raised by the codec.
 634
 635
 636 .. cfunction:: PyObject* PyUnicode_AsLatin1String(PyObject *unicode)
 637
 638    Encode a Unicode object using Latin-1 and return the result as a Python string
 639    object.  Error handling is "strict".  Return *NULL* if an exception was raised
 640    by the codec.
 641
 642 These are the ASCII codec APIs.  Only 7-bit ASCII data is accepted. All other
 643 codes generate errors.
 644
 645 .. % --- ASCII Codecs -------------------------------------------------------
 646
 647
 648 .. cfunction:: PyObject* PyUnicode_DecodeASCII(const char *s, Py_ssize_t size, const char *errors)
 649
 650    Create a Unicode object by decoding *size* bytes of the ASCII encoded string
 651    *s*.  Return *NULL* if an exception was raised by the codec.
 652
 653
 654 .. cfunction:: PyObject* PyUnicode_EncodeASCII(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
 655
 656    Encode the :ctype:`Py_UNICODE` buffer of the given size using ASCII and return a
 657    Python string object.  Return *NULL* if an exception was raised by the codec.
 658
 659
 660 .. cfunction:: PyObject* PyUnicode_AsASCIIString(PyObject *unicode)
 661
 662    Encode a Unicode object using ASCII and return the result as a Python string
 663    object.  Error handling is "strict".  Return *NULL* if an exception was raised
 664    by the codec.
 665
 666 These are the mapping codec APIs:
 667
 668 .. % --- Character Map Codecs -----------------------------------------------
 669
 670 This codec is special in that it can be used to implement many different codecs
 671 (and this is in fact what was done to obtain most of the standard codecs
 672 included in the :mod:`encodings` package).  The codec uses mappings to encode and
 673 decode characters.
 674
 675 Decoding mappings must map single string characters to single Unicode
 676 characters, integers (which are then interpreted as Unicode ordinals) or None
 677 (meaning "undefined mapping" and causing an error).
 678
 679 Encoding mappings must map single Unicode characters to single string
 680 characters, integers (which are then interpreted as Latin-1 ordinals) or None
 681 (meaning "undefined mapping" and causing an error).
 682
The mapping objects provided need only support the :meth:`__getitem__` mapping
interface.
 685
If a character lookup fails with a :exc:`LookupError`, the character is copied
as-is, meaning that its ordinal value will be interpreted as a Unicode or Latin-1
ordinal, respectively. Because of this, mappings only need to contain entries for
characters that map to different code points.
 690
 691
 692 .. cfunction:: PyObject* PyUnicode_DecodeCharmap(const char *s, Py_ssize_t size, PyObject *mapping, const char *errors)
 693
   Create a Unicode object by decoding *size* bytes of the encoded string *s* using
   the given *mapping* object.  Return *NULL* if an exception was raised by the
   codec.  If *mapping* is *NULL*, Latin-1 decoding will be done.  Otherwise it can
   be a dictionary mapping bytes, or a Unicode string, which is treated as a lookup
   table.  Byte values greater than the length of the string and U+FFFE
   "characters" are treated as "undefined mapping".
 700
 701
 702 .. cfunction:: PyObject* PyUnicode_EncodeCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *mapping, const char *errors)
 703
 704    Encode the :ctype:`Py_UNICODE` buffer of the given size using the given
 705    *mapping* object and return a Python string object. Return *NULL* if an
 706    exception was raised by the codec.
 707
 708
 709 .. cfunction:: PyObject* PyUnicode_AsCharmapString(PyObject *unicode, PyObject *mapping)
 710
   Encode a Unicode object using the given *mapping* object and return the result
   as a Python string object.  Error handling is "strict".  Return *NULL* if an
   exception was raised by the codec.
 714
The following codec API is special in that it maps Unicode to Unicode.
 716
 717
 718 .. cfunction:: PyObject* PyUnicode_TranslateCharmap(const Py_UNICODE *s, Py_ssize_t size, PyObject *table, const char *errors)
 719
 720    Translate a :ctype:`Py_UNICODE` buffer of the given length by applying a
 721    character mapping *table* to it and return the resulting Unicode object.  Return
 722    *NULL* when an exception was raised by the codec.
 723
   The *table* must map Unicode ordinal integers to Unicode ordinal integers or
   None (causing deletion of the character).
 726
 727    Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
 728    and sequences work well.  Unmapped character ordinals (ones which cause a
 729    :exc:`LookupError`) are left untouched and are copied as-is.
 730
 731 These are the MBCS codec APIs. They are currently only available on Windows and
 732 use the Win32 MBCS converters to implement the conversions.  Note that MBCS (or
 733 DBCS) is a class of encodings, not just one.  The target encoding is defined by
 734 the user settings on the machine running the codec.
 735
 736 .. % --- MBCS codecs for Windows --------------------------------------------
 737
 738
 739 .. cfunction:: PyObject* PyUnicode_DecodeMBCS(const char *s, Py_ssize_t size, const char *errors)
 740
 741    Create a Unicode object by decoding *size* bytes of the MBCS encoded string *s*.
 742    Return *NULL* if an exception was raised by the codec.
 743
 744
 745 .. cfunction:: PyObject* PyUnicode_DecodeMBCSStateful(const char *s, int size, const char *errors, int *consumed)
 746
   If *consumed* is *NULL*, behave like :cfunc:`PyUnicode_DecodeMBCS`. If
   *consumed* is not *NULL*, :cfunc:`PyUnicode_DecodeMBCSStateful` will not decode
   a trailing lead byte, and the number of bytes that have been decoded will be
   stored in *consumed*.
 751
 752
 753 .. cfunction:: PyObject* PyUnicode_EncodeMBCS(const Py_UNICODE *s, Py_ssize_t size, const char *errors)
 754
 755    Encode the :ctype:`Py_UNICODE` buffer of the given size using MBCS and return a
 756    Python string object.  Return *NULL* if an exception was raised by the codec.
 757
 758
 759 .. cfunction:: PyObject* PyUnicode_AsMBCSString(PyObject *unicode)
 760
   Encode a Unicode object using MBCS and return the result as a Python string
   object.  Error handling is "strict".  Return *NULL* if an exception was raised
   by the codec.
 764
 765 .. % --- Methods & Slots ----------------------------------------------------
 766
 767
 768 .. _unicodemethodsandslots:
 769
 770 Methods and Slot Functions
 771 ^^^^^^^^^^^^^^^^^^^^^^^^^^
 772
 773 The following APIs are capable of handling Unicode objects and strings on input
 774 (we refer to them as strings in the descriptions) and return Unicode objects or
 775 integers as appropriate.
 776
 777 They all return *NULL* or ``-1`` if an exception occurs.
 778
 779
 780 .. cfunction:: PyObject* PyUnicode_Concat(PyObject *left, PyObject *right)
 781
   Concatenate two strings, giving a new Unicode string.
 783
 784
 785 .. cfunction:: PyObject* PyUnicode_Split(PyObject *s, PyObject *sep, Py_ssize_t maxsplit)
 786
   Split a string, giving a list of Unicode strings.  If *sep* is *NULL*, splitting
   will be done at all whitespace substrings.  Otherwise, splits occur at the given
   separator.  At most *maxsplit* splits will be done.  If *maxsplit* is negative,
   no limit is set.  Separators are not included in the resulting list.
 791
 792
 793 .. cfunction:: PyObject* PyUnicode_Splitlines(PyObject *s, int keepend)
 794
   Split a Unicode string at line breaks, returning a list of Unicode strings.
   CRLF is considered to be one line break.  If *keepend* is 0, the line break
   characters are not included in the resulting strings.
 798
 799
 800 .. cfunction:: PyObject* PyUnicode_Translate(PyObject *str, PyObject *table, const char *errors)
 801
 802    Translate a string by applying a character mapping table to it and return the
 803    resulting Unicode object.
 804
 805    The mapping table must map Unicode ordinal integers to Unicode ordinal integers
 806    or None (causing deletion of the character).
 807
 808    Mapping tables need only provide the :meth:`__getitem__` interface; dictionaries
 809    and sequences work well.  Unmapped character ordinals (ones which cause a
 810    :exc:`LookupError`) are left untouched and are copied as-is.
 811
 812    *errors* has the usual meaning for codecs. It may be *NULL* which indicates to
 813    use the default error handling.
 814
 815
 816 .. cfunction:: PyObject* PyUnicode_Join(PyObject *separator, PyObject *seq)
 817
 818    Join a sequence of strings using the given separator and return the resulting
 819    Unicode string.
 820
 821
 822 .. cfunction:: int PyUnicode_Tailmatch(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)
 823
 824    Return 1 if *substr* matches *str*[*start*:*end*] at the given tail end
 825    (*direction* == -1 means to do a prefix match, *direction* == 1 a suffix match),
 826    0 otherwise. Return ``-1`` if an error occurred.
 827
 828
 829 .. cfunction:: Py_ssize_t PyUnicode_Find(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end, int direction)
 830
 831    Return the first position of *substr* in *str*[*start*:*end*] using the given
 832    *direction* (*direction* == 1 means to do a forward search, *direction* == -1 a
 833    backward search).  The return value is the index of the first match; a value of
 834    ``-1`` indicates that no match was found, and ``-2`` indicates that an error
 835    occurred and an exception has been set.
 836
 837
 838 .. cfunction:: Py_ssize_t PyUnicode_Count(PyObject *str, PyObject *substr, Py_ssize_t start, Py_ssize_t end)
 839
 840    Return the number of non-overlapping occurrences of *substr* in
 841    ``str[start:end]``.  Return ``-1`` if an error occurred.
 842
 843
 844 .. cfunction:: PyObject* PyUnicode_Replace(PyObject *str, PyObject *substr, PyObject *replstr, Py_ssize_t maxcount)
 845
 846    Replace at most *maxcount* occurrences of *substr* in *str* with *replstr* and
 847    return the resulting Unicode object. *maxcount* == -1 means replace all
 848    occurrences.
 849
 850
 851 .. cfunction:: int PyUnicode_Compare(PyObject *left, PyObject *right)
 852
 853    Compare two strings and return -1, 0, 1 for less than, equal, and greater than,
 854    respectively.
 855
 856
.. cfunction:: PyObject* PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)
 858
 859    Rich compare two unicode strings and return one of the following:
 860
 861    * ``NULL`` in case an exception was raised
 862    * :const:`Py_True` or :const:`Py_False` for successful comparisons
 863    * :const:`Py_NotImplemented` in case the type combination is unknown
 864
 865    Note that :const:`Py_EQ` and :const:`Py_NE` comparisons can cause a
 866    :exc:`UnicodeWarning` in case the conversion of the arguments to Unicode fails
 867    with a :exc:`UnicodeDecodeError`.
 868
 869    Possible values for *op* are :const:`Py_GT`, :const:`Py_GE`, :const:`Py_EQ`,
 870    :const:`Py_NE`, :const:`Py_LT`, and :const:`Py_LE`.
 871
 872
 873 .. cfunction:: PyObject* PyUnicode_Format(PyObject *format, PyObject *args)
 874
 875    Return a new string object from *format* and *args*; this is analogous to
 876    ``format % args``.  The *args* argument must be a tuple.
 877
 878
 879 .. cfunction:: int PyUnicode_Contains(PyObject *container, PyObject *element)
 880
 881    Check whether *element* is contained in *container* and return true or false
 882    accordingly.
 883
   *element* has to coerce to a one-element Unicode string. ``-1`` is returned if
   there was an error.
 886
 887
 888 .. cfunction:: void PyUnicode_InternInPlace(PyObject **string)
 889
 890    Intern the argument *\*string* in place.  The argument must be the address of a
 891    pointer variable pointing to a Python unicode string object.  If there is an
 892    existing interned string that is the same as *\*string*, it sets *\*string* to
 893    it (decrementing the reference count of the old string object and incrementing
 894    the reference count of the interned string object), otherwise it leaves
 895    *\*string* alone and interns it (incrementing its reference count).
 896    (Clarification: even though there is a lot of talk about reference counts, think
 897    of this function as reference-count-neutral; you own the object after the call
 898    if and only if you owned it before the call.)
 899
 900
 901 .. cfunction:: PyObject* PyUnicode_InternFromString(const char *v)
 902
 903    A combination of :cfunc:`PyUnicode_FromString` and
 904    :cfunc:`PyUnicode_InternInPlace`, returning either a new unicode string object
 905    that has been interned, or a new ("owned") reference to an earlier interned
 906    string object with the same value.
 907