Doc/lib/libcodecs.tex

   1 \section{\module{codecs} ---
   2          Codec registry and base classes}
   3
   4 \declaremodule{standard}{codecs}
   5 \modulesynopsis{Encode and decode data and streams.}
   6 \moduleauthor{Marc-Andre Lemburg}{mal@lemburg.com}
   7 \sectionauthor{Marc-Andre Lemburg}{mal@lemburg.com}
   8
   9
  10 \index{Unicode}
  11 \index{Codecs}
  12 \indexii{Codecs}{encode}
  13 \indexii{Codecs}{decode}
  14 \index{streams}
  15 \indexii{stackable}{streams}
  16
  17
  18 This module defines base classes for standard Python codecs (encoders
  19 and decoders) and provides access to the internal Python codec
  20 registry which manages the codec lookup process.
  21
  22 It defines the following functions:
  23
  24 \begin{funcdesc}{register}{search_function}
  25 Register a codec search function. Search functions are expected to
  26 take one argument, the encoding name in all lower case letters, and
  27 return a tuple of functions \code{(\var{encoder}, \var{decoder}, \var{stream_reader},
  28 \var{stream_writer})} taking the following arguments:
  29
  30   \var{encoder} and \var{decoder}: These must be functions or methods
  31   which have the same interface as the
  32   \method{encode()}/\method{decode()} methods of Codec instances (see
  33   Codec Interface). The functions/methods are expected to work in a
  34   stateless mode.
  35
  36   \var{stream_reader} and \var{stream_writer}: These have to be
  37   factory functions providing the following interface:
  38
  39         \code{factory(\var{stream}, \var{errors}='strict')}
  40
  41   The factory functions must return objects providing the interfaces
  42   defined by the base classes \class{StreamWriter} and
  43   \class{StreamReader}, respectively. Stream codecs can maintain
  44   state.
  45
  46   Possible values for \var{errors} are \code{'strict'} (raise an exception
  47   in case of an encoding error), \code{'replace'} (replace malformed
  48   data with a suitable replacement marker, such as \character{?}) and
  49   \code{'ignore'} (ignore malformed data and continue without further
  50   notice).
  51
  52 In case a search function cannot find a given encoding, it should
  53 return \code{None}.
  54 \end{funcdesc}
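
As a rough illustration of the search-function protocol described above,
the following sketch registers a function that recognizes a single
made-up encoding name.  The name \code{demolatin} and the function
\function{demo_search()} are hypothetical; the sketch simply hands back
the tuple of the existing Latin-1 codec, and it assumes a current
Python~3 interpreter, where Unicode objects are \class{str} and encoded
data is \class{bytes}.

```python
import codecs


def demo_search(name):
    """Hypothetical search function that knows exactly one encoding name."""
    # The registry passes the encoding name in lower case.
    if name == "demolatin":
        # Reuse the (encoder, decoder, stream_reader, stream_writer)
        # tuple of an existing codec instead of writing new functions.
        return codecs.lookup("latin-1")
    # Unknown encoding: let the registry try the next search function.
    return None


codecs.register(demo_search)

encoder, decoder, stream_reader, stream_writer = codecs.lookup("demolatin")
encoded, consumed = encoder("caf\u00e9")
decoded, _ = decoder(encoded)
```

Returning \code{None} for every other name keeps the function from
shadowing encodings that other search functions provide.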
  55
  56 \begin{funcdesc}{lookup}{encoding}
  57 Looks up a codec tuple in the Python codec registry and returns the
  58 function tuple as defined above.
  59
  60 Encodings are first looked up in the registry's cache. If not found,
  61 the list of registered search functions is scanned. If no codecs tuple
  62 is found, a \exception{LookupError} is raised. Otherwise, the codecs
  63 tuple is stored in the cache and returned to the caller.
  64 \end{funcdesc}
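
A minimal sketch of both outcomes of \function{lookup()}, assuming a
current Python~3 interpreter.  The name \code{no_such_encoding_demo} is
invented for the example and is assumed not to be registered anywhere.

```python
import codecs

# A successful lookup yields the four-element codec tuple.
encoder, decoder, stream_reader, stream_writer = codecs.lookup("utf-8")
data, consumed = encoder("abc")

# For an unknown name every search function returns None,
# so the registry raises LookupError.
try:
    codecs.lookup("no_such_encoding_demo")
    found = True
except LookupError:
    found = False
```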
  65
  66 To simplify working with encoded files or streams, the module
  67 also defines these utility functions:
  68
  69 \begin{funcdesc}{open}{filename, mode\optional{, encoding\optional{,
  70                        errors\optional{, buffering}}}}
  71 Open an encoded file using the given \var{mode} and return
  72 a wrapped version providing transparent encoding/decoding.
  73
  74 \strong{Note:} The wrapped version will only accept the object format
  75 defined by the codecs, i.e.\ Unicode objects for most built-in
  76 codecs.  Output is also codec-dependent and will usually be Unicode as
  77 well.
  78
  79 \var{encoding} specifies the encoding which is to be used for the
  80 file.
  81
  82 \var{errors} may be given to define the error handling. It defaults
  83 to \code{'strict'} which causes a \exception{ValueError} to be raised
  84 in case an encoding error occurs.
  85
  86 \var{buffering} has the same meaning as for the built-in
  87 \function{open()} function.  It defaults to line buffered.
  88 \end{funcdesc}
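
The following sketch shows the round trip that \function{open()}
provides.  It assumes a current Python~3 interpreter on which
\function{codecs.open()} is still available; the temporary file name
\file{demo.txt} is made up for the example.

```python
import codecs
import os
import tempfile

# Hypothetical scratch file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Text written to the wrapper is encoded as UTF-8 on its way to disk.
f = codecs.open(path, "w", encoding="utf-8")
f.write("caf\u00e9\n")
f.close()

# Reading through the wrapper decodes the bytes again.
f = codecs.open(path, "r", encoding="utf-8")
text = f.read()
f.close()

# The file itself holds the encoded form.
with open(path, "rb") as raw:
    raw_bytes = raw.read()
```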
  89
  90 \begin{funcdesc}{EncodedFile}{file, input\optional{,
  91                               output\optional{, errors}}}
  92 Return a wrapped version of \var{file} which provides transparent
  93 encoding translation.
  94
  95 Strings written to the wrapped file are interpreted according to the
  96 given \var{input} encoding and then written to the original file as
  97 strings using the \var{output} encoding. The intermediate encoding will
  98 usually be Unicode but depends on the specified codecs.
  99
 100 If \var{output} is not given, it defaults to \var{input}.
 101
 102 \var{errors} may be given to define the error handling. It defaults to
 103 \code{'strict'}, which causes \exception{ValueError} to be raised in case
 104 an encoding error occurs.
 105 \end{funcdesc}
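
A small sketch of the translation performed by
\function{EncodedFile()}, assuming a current Python~3 interpreter where
encoded strings are \class{bytes}.  An in-memory \class{io.BytesIO}
object stands in for a real file; that choice is only for illustration.

```python
import codecs
import io

# Writing: data is taken as Latin-1 (input) and stored as UTF-8 (output).
raw = io.BytesIO()
wrapped = codecs.EncodedFile(raw, "latin-1", "utf-8")
wrapped.write(b"caf\xe9")
stored = raw.getvalue()

# Reading: UTF-8 data in the file comes back as Latin-1.
source = io.BytesIO(b"caf\xc3\xa9")
fetched = codecs.EncodedFile(source, "latin-1", "utf-8").read()
```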
 106
 107 The module also provides the following constants which are useful
 108 for reading from and writing to platform-dependent files:
 109
 110 \begin{datadesc}{BOM}
 111 \dataline{BOM_BE}
 112 \dataline{BOM_LE}
 113 \dataline{BOM32_BE}
 114 \dataline{BOM32_LE}
 115 \dataline{BOM64_BE}
 116 \dataline{BOM64_LE}
 117 These constants define the byte order marks (BOM) used in data
 118 streams to indicate the byte order used in the stream or file.
 119 \constant{BOM} is either \constant{BOM_BE} or \constant{BOM_LE}
 120 depending on the platform's native byte order, while the others
 121 represent big endian (\samp{_BE} suffix) and little endian
 122 (\samp{_LE} suffix) byte order using 32-bit and 64-bit encodings.
 123 \end{datadesc}
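
As a sketch of how these constants might be used, the code below checks
that \constant{BOM} matches the native byte order and inspects the start
of a data string.  The helper \function{leading_byte_order()} is
hypothetical and only looks at the \constant{BOM_BE}/\constant{BOM_LE}
pair; a current Python~3 interpreter is assumed.

```python
import codecs
import sys

# BOM should equal the constant for the platform's native byte order.
native = codecs.BOM_LE if sys.byteorder == "little" else codecs.BOM_BE
bom_is_native = (codecs.BOM == native)


def leading_byte_order(data):
    """Hypothetical helper: name the byte order mark that starts data."""
    if data.startswith(codecs.BOM_BE):
        return "big"
    if data.startswith(codecs.BOM_LE):
        return "little"
    return None  # no byte order mark from this pair


order = leading_byte_order(codecs.BOM_BE + b"\x00a")
```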
 124
 125
 126 \begin{seealso}
 127   \seeurl{http://sourceforge.net/projects/python-codecs/}{A
 128           SourceForge project working on additional support for Asian
 129           codecs for use with Python.  They are in the early stages of
 130           development at the time of this writing --- look in their
 131           FTP area for downloadable files.}
 132 \end{seealso}
 133
 134
 135 \subsection{Codec Base Classes}
 136
 137 The \module{codecs} module defines a set of base classes which define the
 138 interface and can also be used to easily write your own codecs for use
 139 in Python.
 140
 141 Each codec has to define four interfaces to make it usable as a codec in
 142 Python: stateless encoder, stateless decoder, stream reader and stream
 143 writer. The stream readers and writers typically reuse the stateless
 144 encoder/decoder to implement the file protocols.
 145
 146 The \class{Codec} class defines the interface for stateless
 147 encoders/decoders.
 148
 149 To simplify and standardize error handling, the \method{encode()} and
 150 \method{decode()} methods may implement different error handling
 151 schemes by providing the \var{errors} string argument.  The following
 152 string values are defined and implemented by all standard Python
 153 codecs:
 154
 155 \begin{tableii}{l|l}{code}{Value}{Meaning}
 156   \lineii{'strict'}{Raise \exception{ValueError} (or a subclass);
 157                     this is the default.}
 158   \lineii{'ignore'}{Ignore the character and continue with the next.}
 159   \lineii{'replace'}{Replace with a suitable replacement character;
 160                      Python will use the official U+FFFD REPLACEMENT
 161                      CHARACTER for the built-in Unicode codecs.}
 162 \end{tableii}
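
The three error handling schemes can be compared with a short sketch
like the following, which uses the stateless ASCII encoder and decoder
obtained from \function{lookup()}.  It assumes a current Python~3
interpreter; the sample text is arbitrary.

```python
import codecs

encoder, decoder = codecs.lookup("ascii")[:2]
text = "caf\u00e9"  # the last character cannot be encoded as ASCII

ignored, _ = encoder(text, "ignore")    # drop the offending character
replaced, _ = encoder(text, "replace")  # substitute a replacement marker

try:
    encoder(text, "strict")
    raised = False
except ValueError:  # the error raised here is a ValueError subclass
    raised = True

# When decoding, the built-in Unicode codecs substitute U+FFFD.
decoded, _ = decoder(b"caf\xe9", "replace")
```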
 163
 164
 165 \subsubsection{Codec Objects \label{codec-objects}}
 166
 167 The \class{Codec} class defines these methods which also define the
 168 function interfaces of the stateless encoder and decoder:
 169
 170 \begin{methoddesc}{encode}{input\optional{, errors}}
 171   Encodes the object \var{input} and returns a tuple (output object,
 172   length consumed).
 173
 174   \var{errors} defines the error handling to apply. It defaults to
 175   \code{'strict'} handling.
 176
 177   The method may not store state in the \class{Codec} instance. Use
 178   \class{StreamWriter} for codecs which have to keep state in order to
 179   make encoding efficient.
 180
 181   The encoder must be able to handle zero length input and return an
 182   empty object of the output object type in this situation.
 183 \end{methoddesc}
 184
 185 \begin{methoddesc}{decode}{input\optional{, errors}}
 186   Decodes the object \var{input} and returns a tuple (output object,
 187   length consumed).
 188
 189   \var{input} must be an object which provides the \code{bf_getreadbuf}
 190   buffer slot.  Python strings, buffer objects and memory mapped files
 191   are examples of objects providing this slot.
 192
 193   \var{errors} defines the error handling to apply. It defaults to
 194   \code{'strict'} handling.
 195
 196   The method may not store state in the \class{Codec} instance. Use
 197   \class{StreamReader} for codecs which have to keep state in order to
 198   make decoding efficient.
 199
 200   The decoder must be able to handle zero length input and return an
 201   empty object of the output object type in this situation.
 202 \end{methoddesc}
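
A minimal sketch of a stateless codec following the interface above.
The class name \class{DemoCodec} is hypothetical; rather than
implementing a new encoding, it delegates to the built-in UTF-8
functions exposed by the \module{codecs} module.  A current Python~3
interpreter is assumed.

```python
import codecs


class DemoCodec(codecs.Codec):
    """Hypothetical stateless codec built on the UTF-8 functions."""

    def encode(self, input, errors="strict"):
        # Returns (output object, length consumed); nothing is stored on self.
        return codecs.utf_8_encode(input, errors)

    def decode(self, input, errors="strict"):
        # Returns (output object, length consumed); nothing is stored on self.
        return codecs.utf_8_decode(input, errors)


codec = DemoCodec()
encoded, consumed = codec.encode("caf\u00e9")
decoded, used = codec.decode(encoded)

# Zero length input must give an empty object of the output type.
empty_encoded = codec.encode("")
empty_decoded = codec.decode(b"")
```

Note that the length consumed counts items of the input: four
characters when encoding, five bytes when decoding.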
 203
 204 The \class{StreamWriter} and \class{StreamReader} classes provide
 205 generic working interfaces which can be used to implement new
 206 encoding submodules very easily. See \module{encodings.utf_8} for an
 207 example of how this is done.
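
One way this can look in practice is sketched below: a stateless codec
class is combined with the two stream base classes, which then supply
the stream handling.  The class names are hypothetical and the codec
merely delegates to the built-in Latin-1 functions; a current Python~3
interpreter is assumed.

```python
import codecs
import io


class DemoCodec(codecs.Codec):
    """Hypothetical stateless codec built on the Latin-1 functions."""

    def encode(self, input, errors="strict"):
        return codecs.latin_1_encode(input, errors)

    def decode(self, input, errors="strict"):
        return codecs.latin_1_decode(input, errors)


class DemoStreamWriter(DemoCodec, codecs.StreamWriter):
    # write() from the base class calls DemoCodec.encode().
    pass


class DemoStreamReader(DemoCodec, codecs.StreamReader):
    # read() from the base class calls DemoCodec.decode().
    pass


raw = io.BytesIO()
DemoStreamWriter(raw).write("caf\u00e9")
written = raw.getvalue()
read_back = DemoStreamReader(io.BytesIO(written)).read()
```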
 208
 209
 210 \subsubsection{StreamWriter Objects \label{stream-writer-objects}}
 211
 212 The \class{StreamWriter} class is a subclass of \class{Codec} and
 213 defines the following methods which every stream writer must define in
 214 order to be compatible with the Python codec registry.
 215
 216 \begin{classdesc}{StreamWriter}{stream\optional{, errors}}
 217   Constructor for a \class{StreamWriter} instance.
 218
 219   All stream writers must provide this constructor interface. They are
 220   free to add additional keyword arguments, but only the ones defined
 221   here are used by the Python codec registry.
 222
 223   \var{stream} must be a file-like object open for writing (binary)
 224   data.
 225
 226   The \class{StreamWriter} may implement different error handling
 227   schemes by providing the \var{errors} keyword argument. These
 228   values are defined:
 229
 230   \begin{itemize}
 231     \item \code{'strict'} Raise \exception{ValueError} (or a subclass);
 232                           this is the default.
 233     \item \code{'ignore'} Ignore the character and continue with the next.
 234     \item \code{'replace'} Replace with a suitable replacement character.
 235   \end{itemize}
 236 \end{classdesc}
 237
 238 \begin{methoddesc}{write}{object}
 239   Writes the object's contents encoded to the stream.
 240 \end{methoddesc}
 241
 242 \begin{methoddesc}{writelines}{list}
 243   Writes the concatenated list of strings to the stream (possibly by
 244   reusing the \method{write()} method).
 245 \end{methoddesc}
 246
 247 \begin{methoddesc}{reset}{}
 248   Flushes and resets the codec buffers used for keeping state.
 249
 250   Calling this method should ensure that the data on the output is put
 251   into a clean state that allows new data to be appended without
 252   having to rescan the whole stream to recover state.
 253 \end{methoddesc}
 254
 255 In addition to the above methods, the \class{StreamWriter} must also
 256 inherit all other methods and attributes from the underlying stream.
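
The stream writer interface can be exercised with a sketch such as the
following, which takes the UTF-8 stream writer factory from
\function{lookup()} and wraps an in-memory stream.  The use of
\class{io.BytesIO} is only for illustration, and a current Python~3
interpreter is assumed.

```python
import codecs
import io

writer_factory = codecs.lookup("utf-8")[3]  # the stream_writer entry

raw = io.BytesIO()
writer = writer_factory(raw, "strict")

writer.write("caf\u00e9\n")         # encoded, then written to raw
writer.writelines(["a\n", "b\n"])   # the strings are concatenated first
writer.reset()                      # leave the output in a clean state

# Other methods, such as tell(), come from the underlying stream.
position = writer.tell()
contents = raw.getvalue()
```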
 257
 258
 259 \subsubsection{StreamReader Objects \label{stream-reader-objects}}
 260
 261 The \class{StreamReader} class is a subclass of \class{Codec} and
 262 defines the following methods which every stream reader must define in
 263 order to be compatible with the Python codec registry.
 264
 265 \begin{classdesc}{StreamReader}{stream\optional{, errors}}
 266   Constructor for a \class{StreamReader} instance.
 267
 268   All stream readers must provide this constructor interface. They are
 269   free to add additional keyword arguments, but only the ones defined
 270   here are used by the Python codec registry.
 271
 272   \var{stream} must be a file-like object open for reading (binary)
 273   data.
 274
 275   The \class{StreamReader} may implement different error handling
 276   schemes by providing the \var{errors} keyword argument. These
 277   values are defined:
 278
 279   \begin{itemize}
 280     \item \code{'strict'} Raise \exception{ValueError} (or a subclass);
 281                           this is the default.
 282     \item \code{'ignore'} Ignore the character and continue with the next.
 283     \item \code{'replace'} Replace with a suitable replacement character.
 284   \end{itemize}
 285 \end{classdesc}
 286
 287 \begin{methoddesc}{read}{\optional{size}}
 288   Decodes data from the stream and returns the resulting object.
 289
 290   \var{size} indicates the approximate maximum number of bytes to read
 291   from the stream for decoding purposes. The decoder can modify this
 292   setting as appropriate. The default value -1 indicates to read and
 293   decode as much as possible.  \var{size} is intended to prevent having
 294   to decode huge files in one step.
 295
 296   The method should use a greedy read strategy meaning that it should
 297   read as much data as is allowed within the definition of the encoding
 298   and the given size, e.g.\ if optional encoding endings or state
 299   markers are available on the stream, these should be read too.
 300 \end{methoddesc}
 301
 302 \begin{methoddesc}{readline}{\optional{size}}
 303   Read one line from the input stream and return the
 304   decoded data.
 305
 306   Note: Unlike the \method{readlines()} method, this method inherits
 307   the line breaking knowledge from the underlying stream's
 308   \method{readline()} method -- there is currently no support for line
 309   breaking using the codec decoder due to lack of line buffering.
 310   Subclasses should however, if possible, try to implement this method
 311   using their own knowledge of line breaking.
 312
 313   \var{size}, if given, is passed as size argument to the stream's
 314   \method{readline()} method.
 315 \end{methoddesc}
 316
 317 \begin{methoddesc}{readlines}{\optional{sizehint}}
 318   Read all lines available on the input stream and return them as list
 319   of lines.
 320
 321   Line breaks are implemented using the codec's decoder method and are
 322   included in the list entries.
 323
 324   \var{sizehint}, if given, is passed as \var{size} argument to the
 325   stream's \method{read()} method.
 326 \end{methoddesc}
 327
 328 \begin{methoddesc}{reset}{}
 329   Resets the codec buffers used for keeping state.
 330
 331   Note that no stream repositioning should take place.  This method is
 332   primarily intended to be able to recover from decoding errors.
 333 \end{methoddesc}
 334
 335 In addition to the above methods, the \class{StreamReader} must also
 336 inherit all other methods and attributes from the underlying stream.
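
A short sketch of the stream reader interface, using the UTF-8 stream
reader factory from \function{lookup()} on an in-memory stream.  The
sample data is arbitrary and a current Python~3 interpreter is assumed.

```python
import codecs
import io

reader_factory = codecs.lookup("utf-8")[2]  # the stream_reader entry

raw = io.BytesIO("caf\u00e9\nsecond\nthird\n".encode("utf-8"))
reader = reader_factory(raw, "strict")

first = reader.readline()   # one decoded line, line break included
rest = reader.readlines()   # the remaining lines as a list

# At the end of the stream, read() returns an empty object.
leftover = reader.read()
```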
 337
 338 The next two base classes are included for convenience. They are not
 339 needed by the codec registry, but may prove useful in practice.
 340
 341
 342 \subsubsection{StreamReaderWriter Objects \label{stream-reader-writer}}
 343
 344 The \class{StreamReaderWriter} allows wrapping streams which work in
 345 both read and write modes.
 346
 347 The design is such that one can use the factory functions returned by
 348 the \function{lookup()} function to construct the instance.
 349
 350 \begin{classdesc}{StreamReaderWriter}{stream, Reader, Writer, errors}
 351   Creates a \class{StreamReaderWriter} instance.
 352   \var{stream} must be a file-like object.
 353   \var{Reader} and \var{Writer} must be factory functions or classes
 354   providing the \class{StreamReader} and \class{StreamWriter} interface
 355   respectively.
 356   Error handling is done in the same way as defined for the
 357   stream readers and writers.
 358 \end{classdesc}
 359
 360 \class{StreamReaderWriter} instances define the combined interfaces of
 361 \class{StreamReader} and \class{StreamWriter} classes. They inherit
 362 all other methods and attributes from the underlying stream.
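
As a sketch, a \class{StreamReaderWriter} can be built from the factory
functions that \function{lookup()} returns.  The in-memory stream is
only a stand-in for a file opened for both reading and writing, and a
current Python~3 interpreter is assumed.

```python
import codecs
import io

encoder, decoder, stream_reader, stream_writer = codecs.lookup("utf-8")

raw = io.BytesIO()
rw = codecs.StreamReaderWriter(raw, stream_reader, stream_writer, "strict")

rw.write("caf\u00e9\n")  # goes through the stream writer
rw.seek(0)               # reposition the underlying stream
text = rw.read()         # goes through the stream reader
```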
 363
 364
 365 \subsubsection{StreamRecoder Objects \label{stream-recoder-objects}}
 366
 367 The \class{StreamRecoder} class provides a frontend/backend view of
 368 encoding data which is sometimes useful when dealing with different
 369 encoding environments.
 370
 371 The design is such that one can use the factory functions returned by
 372 the \function{lookup()} function to construct the instance.
 373
 374 \begin{classdesc}{StreamRecoder}{stream, encode, decode,
 375                                  Reader, Writer, errors}
 376   Creates a \class{StreamRecoder} instance which implements a two-way
 377   conversion: \var{encode} and \var{decode} work on the frontend (the
 378   input to \method{read()} and output of \method{write()}) while
 379   \var{Reader} and \var{Writer} work on the backend (reading and
 380   writing to the stream).
 381
 382   You can use these objects to do transparent direct recodings from
 383   e.g.\ Latin-1 to UTF-8 and back.
 384
 385   \var{stream} must be a file-like object.
 386
 387   \var{encode} and \var{decode} must adhere to the \class{Codec}
 388   interface; \var{Reader} and \var{Writer} must be factory functions or
 389   classes providing objects of the \class{StreamReader} and
 390   \class{StreamWriter} interface respectively.
 391
 392   \var{encode} and \var{decode} are needed for the frontend
 393   translation, \var{Reader} and \var{Writer} for the backend
 394   translation.  The intermediate format used is determined by the two
 395   sets of codecs, e.g.\ the Unicode codecs will use Unicode as
 396   the intermediate encoding.
 397
 398   Error handling is done in the same way as defined for the
 399   stream readers and writers.
 400 \end{classdesc}
 401
 402 \class{StreamRecoder} instances define the combined interfaces of
 403 \class{StreamReader} and \class{StreamWriter} classes. They inherit
 404 all other methods and attributes from the underlying stream.
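
The Latin-1 to UTF-8 recoding mentioned above might be set up as in the
following sketch, where Latin-1 is the frontend encoding and UTF-8 the
backend encoding.  The in-memory stream is only for illustration and a
current Python~3 interpreter is assumed.

```python
import codecs
import io

latin1 = codecs.lookup("latin-1")
utf8 = codecs.lookup("utf-8")

raw = io.BytesIO()
recoder = codecs.StreamRecoder(
    raw,
    latin1[0], latin1[1],  # encode/decode: frontend translation
    utf8[2], utf8[3],      # Reader/Writer: backend translation
    "strict",
)

recoder.write(b"caf\xe9")  # Latin-1 in, UTF-8 on the stream
backend = raw.getvalue()

recoder.seek(0)
frontend = recoder.read()  # UTF-8 on the stream, Latin-1 out
```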
 405