Doc/lib/libcodecs.tex

   1 \section{\module{codecs} ---
   2          Codec registry and base classes}
   3
   4 \declaremodule{standard}{codecs}
   5 \modulesynopsis{Encode and decode data and streams.}
   6 \moduleauthor{Marc-Andre Lemburg}{mal@lemburg.com}
   7 \sectionauthor{Marc-Andre Lemburg}{mal@lemburg.com}
   8
   9
  10 \index{Unicode}
  11 \index{Codecs}
  12 \indexii{Codecs}{encode}
  13 \indexii{Codecs}{decode}
  14 \index{streams}
  15 \indexii{stackable}{streams}
  16
  17
  18 This module defines base classes for standard Python codecs (encoders
  19 and decoders) and provides access to the internal Python codec
  20 registry which manages the codec lookup process.
  21
  22 It defines the following functions:
  23
  24 \begin{funcdesc}{register}{search_function}
  25 Register a codec search function. Search functions are expected to
  26 take one argument, the encoding name in all lower case letters, and
  27 return a tuple of functions \code{(\var{encoder}, \var{decoder}, \var{stream_reader},
  28 \var{stream_writer})} whose members must meet the following requirements:
  29
  30   \var{encoder} and \var{decoder}: These must be functions or methods
  31   which have the same interface as the
  32   \method{encode()}/\method{decode()} methods of Codec instances (see
  33   Codec Interface). The functions/methods are expected to work in a
  34   stateless mode.
  35
  36   \var{stream_reader} and \var{stream_writer}: These have to be
  37   factory functions providing the following interface:
  38
  39         \code{factory(\var{stream}, \var{errors}='strict')}
  40
  41   The factory functions must return objects providing the interfaces
  42   defined by the base classes \class{StreamWriter} and
  43   \class{StreamReader}, respectively. Stream codecs can maintain
  44   state.
  45
  46   Possible values for errors are \code{'strict'} (raise an exception
  47   in case of an encoding error), \code{'replace'} (replace malformed
  48   data with a suitable replacement marker, such as \character{?}) and
  49   \code{'ignore'} (ignore malformed data and continue without further
  50   notice).
  51
  52 In case a search function cannot find a given encoding, it should
  53 return \code{None}.
  54 \end{funcdesc}
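
The registration protocol described above can be sketched as follows. This is
a minimal illustration, assuming a current Python 3 interpreter; the search
function \function{demo_search()} and the alias \code{'demo_latin1'} are
made-up names, and the example simply reuses the functions of the existing
Latin-1 codec rather than implementing a new encoding.

```python
import codecs


def demo_search(name):
    """Hypothetical search function serving one made-up alias."""
    # The registry passes the encoding name in lower case.
    if name != "demo_latin1":
        return None  # not ours: let other search functions try
    base = codecs.lookup("latin-1")
    # Return the 4-tuple (encoder, decoder, stream_reader, stream_writer).
    return (base[0], base[1], base[2], base[3])


codecs.register(demo_search)

encoder, decoder, stream_reader, stream_writer = codecs.lookup("demo_latin1")
encoded, consumed = encoder("caf\u00e9")
decoded, _ = decoder(encoded)
```

The search function returns \code{None} for every name it does not recognise,
so lookups for other encodings are unaffected.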
  55
  56 \begin{funcdesc}{lookup}{encoding}
  57 Looks up a codec tuple in the Python codec registry and returns the
  58 function tuple as defined above.
  59
  60 Encodings are first looked up in the registry's cache. If not found,
  61 the list of registered search functions is scanned. If no codecs tuple
  62 is found, a \exception{LookupError} is raised. Otherwise, the codecs
  63 tuple is stored in the cache and returned to the caller.
  64 \end{funcdesc}
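
A short sketch of both outcomes of \function{lookup()}, assuming a current
Python 3 interpreter; the name \code{'no-such-encoding-demo'} is a made-up
encoding name chosen so that no search function will recognise it.

```python
import codecs

# A successful lookup yields the four-element codec tuple.
encoder, decoder, stream_reader, stream_writer = codecs.lookup("utf-8")
encoded, consumed = encoder("abc")

# An unknown encoding makes every search function return None,
# so the registry raises LookupError.
try:
    codecs.lookup("no-such-encoding-demo")
    found = True
except LookupError:
    found = False
```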
  65
  66 To simplify access to the various codecs, the module provides these
  67 additional functions which use \function{lookup()} for the codec
  68 lookup:
  69
  70 \begin{funcdesc}{getencoder}{encoding}
  71 Look up the codec for the given encoding and return its encoder
  72 function.
  73
  74 Raises a \exception{LookupError} in case the encoding cannot be found.
  75 \end{funcdesc}
  76
  77 \begin{funcdesc}{getdecoder}{encoding}
  78 Look up the codec for the given encoding and return its decoder
  79 function.
  80
  81 Raises a \exception{LookupError} in case the encoding cannot be found.
  82 \end{funcdesc}
  83
  84 \begin{funcdesc}{getreader}{encoding}
  85 Look up the codec for the given encoding and return its StreamReader
  86 class or factory function.
  87
  88 Raises a \exception{LookupError} in case the encoding cannot be found.
  89 \end{funcdesc}
  90
  91 \begin{funcdesc}{getwriter}{encoding}
  92 Look up the codec for the given encoding and return its StreamWriter
  93 class or factory function.
  94
  95 Raises a \exception{LookupError} in case the encoding cannot be found.
  96 \end{funcdesc}
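
The four convenience functions might be used as in the following sketch. It
assumes a current Python 3 interpreter (text is \code{str}, encoded data is
\code{bytes}) and uses an in-memory \class{io.BytesIO} object as a stand-in
for a real file.

```python
import codecs
import io

# Stateless encoder and decoder functions.
encode = codecs.getencoder("utf-8")
decode = codecs.getdecoder("utf-8")
raw, chars_consumed = encode("h\u00e9llo")
text, bytes_consumed = decode(raw)

# Stream factories: each is called with the underlying stream.
buffer = io.BytesIO()
writer = codecs.getwriter("utf-8")(buffer)
writer.write("h\u00e9llo")

reader = codecs.getreader("utf-8")(io.BytesIO(buffer.getvalue()))
round_trip = reader.read()
```

Note that the second element of each result is the length consumed from the
input, so it counts characters for the encoder and bytes for the decoder.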
  97
  98 To simplify working with encoded files or streams, the module
  99 also defines these utility functions:
 100
 101 \begin{funcdesc}{open}{filename, mode\optional{, encoding\optional{,
 102                        errors\optional{, buffering}}}}
 103 Open an encoded file using the given \var{mode} and return
 104 a wrapped version providing transparent encoding/decoding.
 105
 106 \note{The wrapped version will only accept the object format
 107 defined by the codecs, i.e.\ Unicode objects for most built-in
 108 codecs.  Output is also codec-dependent and will usually be Unicode as
 109 well.}
 110
  111 \var{encoding} specifies the encoding which is to be used for the
  112 file.
 113
 114 \var{errors} may be given to define the error handling. It defaults
 115 to \code{'strict'} which causes a \exception{ValueError} to be raised
 116 in case an encoding error occurs.
 117
 118 \var{buffering} has the same meaning as for the built-in
 119 \function{open()} function.  It defaults to line buffered.
 120 \end{funcdesc}
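
A minimal sketch of writing and reading an encoded file, assuming a current
Python 3 interpreter. The temporary path is created only for the
illustration. Newer interpreters may emit a \exception{DeprecationWarning}
for \function{codecs.open()}, so the sketch silences that warning category.

```python
import codecs
import os
import tempfile
import warnings

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with warnings.catch_warnings():
    # Some newer interpreters warn about codecs.open(); ignore that here.
    warnings.simplefilter("ignore", DeprecationWarning)

    f = codecs.open(path, "w", encoding="utf-8")
    f.write("caf\u00e9\n")      # text in, UTF-8 bytes on disk
    f.close()

    f = codecs.open(path, "r", encoding="utf-8")
    text = f.read()             # UTF-8 bytes on disk, text out
    f.close()

# Inspect the raw file contents to see the encoded form.
with open(path, "rb") as raw_file:
    raw_bytes = raw_file.read()
```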
 121
 122 \begin{funcdesc}{EncodedFile}{file, input\optional{,
 123                               output\optional{, errors}}}
 124 Return a wrapped version of file which provides transparent
 125 encoding translation.
 126
 127 Strings written to the wrapped file are interpreted according to the
 128 given \var{input} encoding and then written to the original file as
 129 strings using the \var{output} encoding. The intermediate encoding will
 130 usually be Unicode but depends on the specified codecs.
 131
 132 If \var{output} is not given, it defaults to \var{input}.
 133
 134 \var{errors} may be given to define the error handling. It defaults to
 135 \code{'strict'}, which causes \exception{ValueError} to be raised in case
 136 an encoding error occurs.
 137 \end{funcdesc}
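
The translation performed by \function{EncodedFile()} can be sketched as
follows, assuming a current Python 3 interpreter and an in-memory
\class{io.BytesIO} object in place of a real file. Data handed to the wrapper
is Latin-1, while the underlying file holds UTF-8.

```python
import codecs
import io

backing = io.BytesIO()

# input encoding: Latin-1 (what the caller sees)
# output encoding: UTF-8 (what the underlying file stores)
wrapped = codecs.EncodedFile(backing, "latin-1", "utf-8")
wrapped.write(b"caf\xe9")
stored = backing.getvalue()

# Reading through a fresh wrapper translates in the other direction.
source = io.BytesIO(stored)
read_back = codecs.EncodedFile(source, "latin-1", "utf-8").read()
```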
 138
 139 The module also provides the following constants which are useful
 140 for reading and writing to platform dependent files:
 141
 142 \begin{datadesc}{BOM}
 143 \dataline{BOM_BE}
 144 \dataline{BOM_LE}
 145 \dataline{BOM32_BE}
 146 \dataline{BOM32_LE}
 147 \dataline{BOM64_BE}
 148 \dataline{BOM64_LE}
 149 These constants define the byte order marks (BOM) used in data
 150 streams to indicate the byte order used in the stream or file.
 151 \constant{BOM} is either \constant{BOM_BE} or \constant{BOM_LE}
 152 depending on the platform's native byte order, while the others
 153 represent big endian (\samp{_BE} suffix) and little endian
 154 (\samp{_LE} suffix) byte order using 32-bit and 64-bit encodings.
 155 \end{datadesc}
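
As an illustration, the 16-bit marks might be used to guess the byte order of
incoming data. The helper \function{detect_byte_order()} is a hypothetical
name introduced for this sketch and is not part of the module; a current
Python 3 interpreter is assumed.

```python
import codecs
import sys


def detect_byte_order(data):
    """Hypothetical helper: report the byte order announced by a leading BOM."""
    if data.startswith(codecs.BOM_BE):
        return "big"
    if data.startswith(codecs.BOM_LE):
        return "little"
    return None  # no byte order mark present


# BOM matches the platform's native byte order.
native = codecs.BOM_LE if sys.byteorder == "little" else codecs.BOM_BE
native_is_consistent = (codecs.BOM == native)
```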
 156
 157
 158 \begin{seealso}
 159   \seeurl{http://sourceforge.net/projects/python-codecs/}{A
 160           SourceForge project working on additional support for Asian
 161           codecs for use with Python.  They are in the early stages of
 162           development at the time of this writing --- look in their
 163           FTP area for downloadable files.}
 164 \end{seealso}
 165
 166
 167 \subsection{Codec Base Classes}
 168
  169 The \module{codecs} module defines a set of base classes which define the
  170 interface and can also be used to easily write your own codecs for use
  171 in Python.
 172
  173 Each codec has to define four interfaces to make it usable as a codec in
  174 Python: stateless encoder, stateless decoder, stream reader and stream
  175 writer. The stream readers and writers typically reuse the stateless
  176 encoder/decoder to implement the file protocols.
 177
 178 The \class{Codec} class defines the interface for stateless
 179 encoders/decoders.
 180
 181 To simplify and standardize error handling, the \method{encode()} and
 182 \method{decode()} methods may implement different error handling
 183 schemes by providing the \var{errors} string argument.  The following
 184 string values are defined and implemented by all standard Python
 185 codecs:
 186
 187 \begin{tableii}{l|l}{code}{Value}{Meaning}
 188   \lineii{'strict'}{Raise \exception{ValueError} (or a subclass);
 189                     this is the default.}
 190   \lineii{'ignore'}{Ignore the character and continue with the next.}
 191   \lineii{'replace'}{Replace with a suitable replacement character;
 192                      Python will use the official U+FFFD REPLACEMENT
 193                      CHARACTER for the built-in Unicode codecs.}
 194 \end{tableii}
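
The three error handling schemes can be compared on the same malformed input.
The sketch below assumes a current Python 3 interpreter and uses the ASCII
decoder, for which the byte \code{0xFF} is invalid.

```python
import codecs

decode = codecs.getdecoder("ascii")
raw = b"abc\xff"                      # the last byte is not valid ASCII

ignored, _ = decode(raw, "ignore")    # the bad byte is dropped
replaced, _ = decode(raw, "replace")  # the bad byte becomes U+FFFD

try:
    decode(raw, "strict")
    strict_failed = False
except ValueError:                    # or a subclass of ValueError
    strict_failed = True
```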
 195
 196
 197 \subsubsection{Codec Objects \label{codec-objects}}
 198
 199 The \class{Codec} class defines these methods which also define the
 200 function interfaces of the stateless encoder and decoder:
 201
 202 \begin{methoddesc}{encode}{input\optional{, errors}}
 203   Encodes the object \var{input} and returns a tuple (output object,
 204   length consumed).
 205
 206   \var{errors} defines the error handling to apply. It defaults to
 207   \code{'strict'} handling.
 208
  209   The method may not store state in the \class{Codec} instance. Use
  210   \class{StreamWriter} for codecs which have to keep state in order to
  211   make encoding efficient.
 212
 213   The encoder must be able to handle zero length input and return an
 214   empty object of the output object type in this situation.
 215 \end{methoddesc}
 216
 217 \begin{methoddesc}{decode}{input\optional{, errors}}
 218   Decodes the object \var{input} and returns a tuple (output object,
 219   length consumed).
 220
 221   \var{input} must be an object which provides the \code{bf_getreadbuf}
 222   buffer slot.  Python strings, buffer objects and memory mapped files
 223   are examples of objects providing this slot.
 224
 225   \var{errors} defines the error handling to apply. It defaults to
 226   \code{'strict'} handling.
 227
  228   The method may not store state in the \class{Codec} instance. Use
  229   \class{StreamReader} for codecs which have to keep state in order to
  230   make decoding efficient.
 231
 232   The decoder must be able to handle zero length input and return an
 233   empty object of the output object type in this situation.
 234 \end{methoddesc}
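
A stateless codec following this interface might look like the sketch below.
The class \class{HexCodec} is a hypothetical example invented for
illustration (it renders text as hexadecimal digits of its UTF-8 form); a
current Python 3 interpreter is assumed.

```python
import binascii
import codecs


class HexCodec(codecs.Codec):
    """Hypothetical stateless codec: text <-> hex digits of its UTF-8 bytes."""

    def encode(self, input, errors="strict"):
        raw = input.encode("utf-8", errors)
        # Return (output object, length consumed from the input).
        return binascii.hexlify(raw), len(input)

    def decode(self, input, errors="strict"):
        raw = binascii.unhexlify(bytes(input))
        return raw.decode("utf-8", errors), len(input)


codec = HexCodec()
encoded = codec.encode("hi")
decoded = codec.decode(b"6869")
empty = codec.encode("")   # zero length input yields an empty output object
```

Neither method stores anything on the instance, so one \class{HexCodec}
object can be shared freely between callers.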
 235
  236 The \class{StreamWriter} and \class{StreamReader} classes provide
  237 generic working interfaces which can be used to implement new
  238 encoding submodules very easily. See \module{encodings.utf_8} for an
  239 example of how this is done.
 240
 241
 242 \subsubsection{StreamWriter Objects \label{stream-writer-objects}}
 243
 244 The \class{StreamWriter} class is a subclass of \class{Codec} and
 245 defines the following methods which every stream writer must define in
  246 order to be compatible with the Python codec registry.
 247
 248 \begin{classdesc}{StreamWriter}{stream\optional{, errors}}
 249   Constructor for a \class{StreamWriter} instance.
 250
 251   All stream writers must provide this constructor interface. They are
 252   free to add additional keyword arguments, but only the ones defined
 253   here are used by the Python codec registry.
 254
 255   \var{stream} must be a file-like object open for writing (binary)
 256   data.
 257
 258   The \class{StreamWriter} may implement different error handling
 259   schemes by providing the \var{errors} keyword argument. These
 260   parameters are defined:
 261
 262   \begin{itemize}
 263     \item \code{'strict'} Raise \exception{ValueError} (or a subclass);
 264                           this is the default.
 265     \item \code{'ignore'} Ignore the character and continue with the next.
  266     \item \code{'replace'} Replace with a suitable replacement character.
 267   \end{itemize}
 268 \end{classdesc}
 269
 270 \begin{methoddesc}{write}{object}
 271   Writes the object's contents encoded to the stream.
 272 \end{methoddesc}
 273
 274 \begin{methoddesc}{writelines}{list}
 275   Writes the concatenated list of strings to the stream (possibly by
 276   reusing the \method{write()} method).
 277 \end{methoddesc}
 278
 279 \begin{methoddesc}{reset}{}
 280   Flushes and resets the codec buffers used for keeping state.
 281
 282   Calling this method should ensure that the data on the output is put
 283   into a clean state, that allows appending of new fresh data without
 284   having to rescan the whole stream to recover state.
 285 \end{methoddesc}
 286
  287 In addition to the above methods, the \class{StreamWriter} must also
  288 inherit all other methods and attributes from the underlying stream.
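
The writer interface can be exercised as in the following sketch, which
assumes a current Python 3 interpreter and an in-memory \class{io.BytesIO}
stream. The final call shows a method of the underlying stream being reached
through the writer.

```python
import codecs
import io

stream = io.BytesIO()
writer = codecs.getwriter("utf-8")(stream, errors="strict")

writer.write("caf\u00e9\n")             # encode and write one object
writer.writelines(["one\n", "two\n"])   # write a list of strings
writer.reset()                          # flush and clear any codec state

# getvalue() belongs to the BytesIO stream, not to StreamWriter itself;
# the writer passes the attribute lookup on to the stream.
written = writer.getvalue()
```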
 289
 290
 291 \subsubsection{StreamReader Objects \label{stream-reader-objects}}
 292
 293 The \class{StreamReader} class is a subclass of \class{Codec} and
 294 defines the following methods which every stream reader must define in
  295 order to be compatible with the Python codec registry.
 296
 297 \begin{classdesc}{StreamReader}{stream\optional{, errors}}
 298   Constructor for a \class{StreamReader} instance.
 299
 300   All stream readers must provide this constructor interface. They are
 301   free to add additional keyword arguments, but only the ones defined
 302   here are used by the Python codec registry.
 303
 304   \var{stream} must be a file-like object open for reading (binary)
 305   data.
 306
 307   The \class{StreamReader} may implement different error handling
 308   schemes by providing the \var{errors} keyword argument. These
 309   parameters are defined:
 310
 311   \begin{itemize}
 312     \item \code{'strict'} Raise \exception{ValueError} (or a subclass);
 313                           this is the default.
 314     \item \code{'ignore'} Ignore the character and continue with the next.
 315     \item \code{'replace'} Replace with a suitable replacement character.
 316   \end{itemize}
 317 \end{classdesc}
 318
 319 \begin{methoddesc}{read}{\optional{size}}
 320   Decodes data from the stream and returns the resulting object.
 321
 322   \var{size} indicates the approximate maximum number of bytes to read
 323   from the stream for decoding purposes. The decoder can modify this
 324   setting as appropriate. The default value -1 indicates to read and
 325   decode as much as possible.  \var{size} is intended to prevent having
 326   to decode huge files in one step.
 327
 328   The method should use a greedy read strategy meaning that it should
 329   read as much data as is allowed within the definition of the encoding
  330   and the given size, e.g.\ if optional encoding endings or state
 331   markers are available on the stream, these should be read too.
 332 \end{methoddesc}
 333
  334 \begin{methoddesc}{readline}{\optional{size}}
 335   Read one line from the input stream and return the
 336   decoded data.
 337
 338   Unlike the \method{readlines()} method, this method inherits
 339   the line breaking knowledge from the underlying stream's
 340   \method{readline()} method -- there is currently no support for line
 341   breaking using the codec decoder due to lack of line buffering.
  342   Subclasses should however, if possible, try to implement this method
 343   using their own knowledge of line breaking.
 344
 345   \var{size}, if given, is passed as size argument to the stream's
 346   \method{readline()} method.
 347 \end{methoddesc}
 348
  349 \begin{methoddesc}{readlines}{\optional{sizehint}}
 350   Read all lines available on the input stream and return them as list
 351   of lines.
 352
 353   Line breaks are implemented using the codec's decoder method and are
 354   included in the list entries.
 355
 356   \var{sizehint}, if given, is passed as \var{size} argument to the
 357   stream's \method{read()} method.
 358 \end{methoddesc}
 359
 360 \begin{methoddesc}{reset}{}
 361   Resets the codec buffers used for keeping state.
 362
 363   Note that no stream repositioning should take place.  This method is
 364   primarily intended to be able to recover from decoding errors.
 365 \end{methoddesc}
 366
 367 In addition to the above methods, the \class{StreamReader} must also
  368 inherit all other methods and attributes from the underlying stream.
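
The reader interface can be exercised in the same way. This sketch assumes a
current Python 3 interpreter and decodes UTF-8 data held in an in-memory
\class{io.BytesIO} stream.

```python
import codecs
import io

stream = io.BytesIO("first\nsecond\nthird\n".encode("utf-8"))
reader = codecs.getreader("utf-8")(stream)

first = reader.readline()    # one decoded line, line break included
rest = reader.readlines()    # all remaining lines as a list
reader.reset()               # clear codec state; the stream is not moved

at_end = reader.read()       # nothing left to decode
```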
 369
 370 The next two base classes are included for convenience. They are not
  371 needed by the codec registry, but may prove useful in practice.
 372
 373
 374 \subsubsection{StreamReaderWriter Objects \label{stream-reader-writer}}
 375
 376 The \class{StreamReaderWriter} allows wrapping streams which work in
 377 both read and write modes.
 378
 379 The design is such that one can use the factory functions returned by
 380 the \function{lookup()} function to construct the instance.
 381
 382 \begin{classdesc}{StreamReaderWriter}{stream, Reader, Writer, errors}
 383   Creates a \class{StreamReaderWriter} instance.
 384   \var{stream} must be a file-like object.
  385   \var{Reader} and \var{Writer} must be factory functions or classes
  386   providing the \class{StreamReader} and \class{StreamWriter} interfaces,
  387   respectively.
 388   Error handling is done in the same way as defined for the
 389   stream readers and writers.
 390 \end{classdesc}
 391
 392 \class{StreamReaderWriter} instances define the combined interfaces of
 393 \class{StreamReader} and \class{StreamWriter} classes. They inherit
  394 all other methods and attributes from the underlying stream.
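
A minimal sketch of building a \class{StreamReaderWriter} from the factory
functions returned by \function{lookup()}, assuming a current Python 3
interpreter and an in-memory \class{io.BytesIO} stream opened for both
reading and writing.

```python
import codecs
import io

codec = codecs.lookup("utf-8")
stream_reader, stream_writer = codec[2], codec[3]

stream = io.BytesIO()
rw = codecs.StreamReaderWriter(stream, stream_reader, stream_writer, "strict")

rw.write("caf\u00e9")    # goes through the StreamWriter
rw.seek(0)               # reposition the underlying stream
text = rw.read()         # comes back through the StreamReader
```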
 395
 396
 397 \subsubsection{StreamRecoder Objects \label{stream-recoder-objects}}
 398
  399 The \class{StreamRecoder} class provides a frontend-backend view of
 400 encoding data which is sometimes useful when dealing with different
 401 encoding environments.
 402
 403 The design is such that one can use the factory functions returned by
 404 the \function{lookup()} function to construct the instance.
 405
 406 \begin{classdesc}{StreamRecoder}{stream, encode, decode,
 407                                  Reader, Writer, errors}
 408   Creates a \class{StreamRecoder} instance which implements a two-way
 409   conversion: \var{encode} and \var{decode} work on the frontend (the
 410   input to \method{read()} and output of \method{write()}) while
 411   \var{Reader} and \var{Writer} work on the backend (reading and
 412   writing to the stream).
 413
 414   You can use these objects to do transparent direct recodings from
 415   e.g.\ Latin-1 to UTF-8 and back.
 416
 417   \var{stream} must be a file-like object.
 418
  419   \var{encode} and \var{decode} must adhere to the \class{Codec}
  420   interface; \var{Reader} and \var{Writer} must be factory functions or
  421   classes providing objects of the \class{StreamReader} and
  422   \class{StreamWriter} interface respectively.
 423
 424   \var{encode} and \var{decode} are needed for the frontend
 425   translation, \var{Reader} and \var{Writer} for the backend
 426   translation.  The intermediate format used is determined by the two
  427   sets of codecs, e.g.\ the Unicode codecs will use Unicode as the
  428   intermediate encoding.
 429
 430   Error handling is done in the same way as defined for the
 431   stream readers and writers.
 432 \end{classdesc}
 433
 434 \class{StreamRecoder} instances define the combined interfaces of
 435 \class{StreamReader} and \class{StreamWriter} classes. They inherit
  436 all other methods and attributes from the underlying stream.
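
The Latin-1 to UTF-8 recoding mentioned above might be set up as in this
sketch, assuming a current Python 3 interpreter. The frontend speaks Latin-1
while the backend stream stores UTF-8; both streams are in-memory
\class{io.BytesIO} objects used only for illustration.

```python
import codecs
import io

latin1 = codecs.lookup("latin-1")   # frontend: stateless encode/decode
utf8 = codecs.lookup("utf-8")       # backend: stream reader/writer factories

backend = io.BytesIO()
recoder = codecs.StreamRecoder(
    backend,
    latin1[0], latin1[1],           # encode, decode
    utf8[2], utf8[3],               # Reader, Writer
    "strict",
)

recoder.write(b"caf\xe9")           # Latin-1 in, UTF-8 stored
stored = backend.getvalue()

# Reading through a second recoder converts UTF-8 back to Latin-1.
source = io.BytesIO(stored)
read_back = codecs.StreamRecoder(
    source, latin1[0], latin1[1], utf8[2], utf8[3], "strict"
).read()
```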
 437