1 \section{\module{codecs} ---
2 Codec registry and base classes}
4 \declaremodule{standard}{codecs}
5 \modulesynopsis{Encode and decode data and streams.}
6 \moduleauthor{Marc-Andre Lemburg}{mal@lemburg.com}
7 \sectionauthor{Marc-Andre Lemburg}{mal@lemburg.com}
10 \index{Unicode}
11 \index{Codecs}
12 \indexii{Codecs}{encode}
13 \indexii{Codecs}{decode}
14 \index{streams}
15 \indexii{stackable}{streams}
18 This module defines base classes for standard Python codecs (encoders
19 and decoders) and provides access to the internal Python codec
20 registry which manages the codec lookup process.
22 It defines the following functions:
24 \begin{funcdesc}{register}{search_function}
25 Register a codec search function. Search functions are expected to
26 take one argument, the encoding name in all lower case letters, and
27 return a tuple of functions \code{(\var{encoder}, \var{decoder}, \var{stream_reader},
28 \var{stream_writer})} taking the following arguments:
30 \var{encoder} and \var{decoder}: These must be functions or methods
31 which have the same interface as the
32 \method{encode()}/\method{decode()} methods of Codec instances (see
33 Codec Interface). The functions/methods are expected to work in a
34 stateless mode.
36 \var{stream_reader} and \var{stream_writer}: These have to be
37 factory functions providing the following interface:
39 \code{factory(\var{stream}, \var{errors}='strict')}
41 The factory functions must return objects providing the interfaces
42 defined by the base classes \class{StreamWriter} and
43 \class{StreamReader}, respectively. Stream codecs can maintain
44 state.
46 Possible values for errors are \code{'strict'} (raise an exception
47 in case of an encoding error), \code{'replace'} (replace malformed
48 data with a suitable replacement marker, such as \character{?}) and
49 \code{'ignore'} (ignore malformed data and continue without further
50 notice).
52 In case a search function cannot find a given encoding, it should
53 return \code{None}.
54 \end{funcdesc}
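The search function contract above can be sketched as follows. This is a minimal illustration, assuming a current Python 3 interpreter (where encoders take \code{str} and return \code{bytes}); the function \code{demo_search} and the encoding name \code{demoascii} are hypothetical and simply reuse the built-in ASCII codec.

```python
import codecs

def demo_search(name):
    """Hypothetical search function for the made-up encoding 'demoascii'."""
    # The registry passes the requested encoding name in lower case.
    if name != "demoascii":
        return None  # unknown encoding: let other search functions try
    # Reuse the built-in ASCII codec for all four slots of the tuple.
    return (
        codecs.getencoder("ascii"),   # encoder
        codecs.getdecoder("ascii"),   # decoder
        codecs.getreader("ascii"),    # stream_reader factory
        codecs.getwriter("ascii"),    # stream_writer factory
    )

codecs.register(demo_search)

# The lookup name is normalized to lower case before the search runs.
encoder, decoder, stream_reader, stream_writer = codecs.lookup("DemoASCII")
encoded, consumed = encoder("abc")
```

Returning \code{None} for every name the function does not recognize keeps it from shadowing codecs provided by other search functions.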
56 \begin{funcdesc}{lookup}{encoding}
57 Looks up a codec tuple in the Python codec registry and returns the
58 function tuple as defined above.
60 Encodings are first looked up in the registry's cache. If not found,
61 the list of registered search functions is scanned. If no codecs tuple
62 is found, a \exception{LookupError} is raised. Otherwise, the codecs
63 tuple is stored in the cache and returned to the caller.
64 \end{funcdesc}
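A short sketch of \function{lookup()} in use, assuming a current Python 3 interpreter, where the returned object can still be unpacked as the four-element tuple described above. The encoding name in the failing lookup is made up for the illustration.

```python
import codecs

# Unpack the codec tuple for UTF-8.
encoder, decoder, stream_reader, stream_writer = codecs.lookup("utf-8")

# Stateless functions return (output object, length consumed).
data, chars_consumed = encoder("caf\u00e9")
text, bytes_consumed = decoder(data)

# An unknown encoding raises LookupError.
try:
    codecs.lookup("no-such-encoding-demo")
    found = True
except LookupError:
    found = False
```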
66 To simplify access to the various codecs, the module provides these
67 additional functions which use \function{lookup()} for the codec
68 lookup:
70 \begin{funcdesc}{getencoder}{encoding}
71 Look up the codec for the given encoding and return its encoder
72 function.
74 Raises a \exception{LookupError} in case the encoding cannot be found.
75 \end{funcdesc}
77 \begin{funcdesc}{getdecoder}{encoding}
78 Look up the codec for the given encoding and return its decoder
79 function.
81 Raises a \exception{LookupError} in case the encoding cannot be found.
82 \end{funcdesc}
84 \begin{funcdesc}{getreader}{encoding}
85 Look up the codec for the given encoding and return its StreamReader
86 class or factory function.
88 Raises a \exception{LookupError} in case the encoding cannot be found.
89 \end{funcdesc}
91 \begin{funcdesc}{getwriter}{encoding}
92 Look up the codec for the given encoding and return its StreamWriter
93 class or factory function.
95 Raises a \exception{LookupError} in case the encoding cannot be found.
96 \end{funcdesc}
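The four accessor functions can be exercised as in the sketch below. It assumes a current Python 3 interpreter and uses an in-memory \code{io.BytesIO} object as a stand-in for a real binary stream.

```python
import codecs
import io

# Stateless encoder and decoder functions.
encode = codecs.getencoder("latin-1")
decode = codecs.getdecoder("latin-1")
encoded, _ = encode("\u00e9t\u00e9")
decoded, _ = decode(encoded)

# Stream factories wrap a binary stream.
raw = io.BytesIO()
writer = codecs.getwriter("utf-8")(raw)
writer.write("gr\u00fc\u00df")

reader = codecs.getreader("utf-8")(io.BytesIO(raw.getvalue()))
round_trip = reader.read()
```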
98 To simplify working with encoded files or streams, the module
99 also defines these utility functions:
101 \begin{funcdesc}{open}{filename, mode\optional{, encoding\optional{,
102 errors\optional{, buffering}}}}
103 Open an encoded file using the given \var{mode} and return
104 a wrapped version providing transparent encoding/decoding.
106 \note{The wrapped version will only accept the object format
107 defined by the codecs, i.e.\ Unicode objects for most built-in
108 codecs. Output is also codec-dependent and will usually be Unicode as
109 well.}
111 \var{encoding} specifies the encoding which is to be used for the
112 file.
114 \var{errors} may be given to define the error handling. It defaults
115 to \code{'strict'} which causes a \exception{ValueError} to be raised
116 in case an encoding error occurs.
118 \var{buffering} has the same meaning as for the built-in
119 \function{open()} function. It defaults to line buffered.
120 \end{funcdesc}
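A minimal round trip through \function{open()}, assuming a current Python 3 interpreter. The temporary path is created only for the illustration, and the warning filter is an assumption: newer interpreters may flag \function{codecs.open()} as deprecated, and the filter merely keeps the sketch quiet.

```python
import codecs
import os
import tempfile
import warnings

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with warnings.catch_warnings():
    # Newer interpreters may report codecs.open() as deprecated.
    warnings.simplefilter("ignore", DeprecationWarning)

    # Text written through the wrapper is encoded on its way to disk.
    with codecs.open(path, "w", encoding="utf-8") as f:
        f.write("na\u00efve\n")

    # Reading through the wrapper decodes the bytes again.
    with codecs.open(path, "r", encoding="utf-8") as f:
        text = f.read()

# Inspect the encoded bytes that actually reached the file.
with open(path, "rb") as fb:
    on_disk = fb.read()
```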
122 \begin{funcdesc}{EncodedFile}{file, input\optional{,
123 output\optional{, errors}}}
124 Return a wrapped version of \var{file} which provides transparent
125 encoding translation.
127 Strings written to the wrapped file are interpreted according to the
128 given \var{input} encoding and then written to the original file as
129 strings using the \var{output} encoding. The intermediate encoding will
130 usually be Unicode but depends on the specified codecs.
132 If \var{output} is not given, it defaults to \var{input}.
134 \var{errors} may be given to define the error handling. It defaults to
135 \code{'strict'}, which causes \exception{ValueError} to be raised in case
136 an encoding error occurs.
137 \end{funcdesc}
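The translation performed by \function{EncodedFile()} can be sketched as follows, assuming a current Python 3 interpreter, where the data on both sides of the wrapper are \code{bytes}. Here Latin-1 plays the role of \var{input} and UTF-8 the role of \var{output}; the in-memory streams are stand-ins for real files.

```python
import codecs
import io

backing = io.BytesIO()

# Callers speak Latin-1; the underlying file stores UTF-8.
wrapped = codecs.EncodedFile(backing, "latin-1", "utf-8")
wrapped.write(b"caf\xe9")
stored = backing.getvalue()

# Reading through a wrapper translates in the other direction.
reader = codecs.EncodedFile(io.BytesIO(stored), "latin-1", "utf-8")
read_back = reader.read()
```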
139 The module also provides the following constants which are useful
140 for reading and writing to platform dependent files:
142 \begin{datadesc}{BOM}
143 \dataline{BOM_BE}
144 \dataline{BOM_LE}
145 \dataline{BOM32_BE}
146 \dataline{BOM32_LE}
147 \dataline{BOM64_BE}
148 \dataline{BOM64_LE}
149 These constants define the byte order marks (BOM) used in data
150 streams to indicate the byte order used in the stream or file.
151 \constant{BOM} is either \constant{BOM_BE} or \constant{BOM_LE}
152 depending on the platform's native byte order, while the others
153 represent big endian (\samp{_BE} suffix) and little endian
154 (\samp{_LE} suffix) byte order using 32-bit and 64-bit encodings.
155 \end{datadesc}
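One possible use of these constants is to sniff the byte order of incoming data. The sketch below assumes a current Python 3 interpreter; \code{sniff_byte_order} is a hypothetical helper, not part of the module, and it only checks the \constant{BOM_BE} and \constant{BOM_LE} marks.

```python
import codecs
import sys

# BOM follows the byte order of the platform running the interpreter.
native = codecs.BOM_LE if sys.byteorder == "little" else codecs.BOM_BE

def sniff_byte_order(data):
    """Hypothetical helper: report the byte order announced by a leading BOM."""
    if data.startswith(codecs.BOM_BE):
        return "big"
    if data.startswith(codecs.BOM_LE):
        return "little"
    return None  # no byte order mark present

big = sniff_byte_order(codecs.BOM_BE + b"\x00A")
little = sniff_byte_order(codecs.BOM_LE + b"A\x00")
unmarked = sniff_byte_order(b"plain")
```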
158 \begin{seealso}
159 \seeurl{http://sourceforge.net/projects/python-codecs/}{A
160 SourceForge project working on additional support for Asian
161 codecs for use with Python. They are in the early stages of
162 development at the time of this writing --- look in their
163 FTP area for downloadable files.}
164 \end{seealso}
167 \subsection{Codec Base Classes}
169 The \module{codecs} module defines a set of base classes which define the
170 interface and can also be used to easily write your own codecs for use
171 in Python.
173 Each codec has to define four interfaces to make it usable as a codec in
174 Python: stateless encoder, stateless decoder, stream reader and stream
175 writer. The stream readers and writers typically reuse the stateless
176 encoder/decoder to implement the file protocols.
178 The \class{Codec} class defines the interface for stateless
179 encoders/decoders.
181 To simplify and standardize error handling, the \method{encode()} and
182 \method{decode()} methods may implement different error handling
183 schemes by providing the \var{errors} string argument. The following
184 string values are defined and implemented by all standard Python
185 codecs:
187 \begin{tableii}{l|l}{code}{Value}{Meaning}
188 \lineii{'strict'}{Raise \exception{ValueError} (or a subclass);
189 this is the default.}
190 \lineii{'ignore'}{Ignore the character and continue with the next.}
191 \lineii{'replace'}{Replace with a suitable replacement character;
192 Python will use the official U+FFFD REPLACEMENT
193 CHARACTER for the built-in Unicode codecs.}
194 \end{tableii}
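The three error handling schemes in the table can be compared side by side. This sketch assumes a current Python 3 interpreter and uses the ASCII codec, which cannot represent the accented character in the sample text.

```python
import codecs

encode = codecs.getencoder("ascii")
decode = codecs.getdecoder("ascii")

# 'ignore' drops the offending character.
ignored, _ = encode("caf\u00e9", "ignore")

# 'replace' substitutes a replacement marker when encoding.
replaced, _ = encode("caf\u00e9", "replace")

# 'strict' raises ValueError (or a subclass of it).
try:
    encode("caf\u00e9", "strict")
    strict_failed = False
except ValueError:
    strict_failed = True

# When decoding, 'replace' yields U+FFFD REPLACEMENT CHARACTER.
decoded, _ = decode(b"caf\xe9", "replace")
```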
197 \subsubsection{Codec Objects \label{codec-objects}}
199 The \class{Codec} class defines these methods which also define the
200 function interfaces of the stateless encoder and decoder:
202 \begin{methoddesc}{encode}{input\optional{, errors}}
203 Encodes the object \var{input} and returns a tuple (output object,
204 length consumed).
206 \var{errors} defines the error handling to apply. It defaults to
207 \code{'strict'} handling.
209 The method may not store state in the \class{Codec} instance. Use
210 \class{StreamWriter} for codecs which have to keep state in order to
211 make encoding efficient.
213 The encoder must be able to handle zero length input and return an
214 empty object of the output object type in this situation.
215 \end{methoddesc}
217 \begin{methoddesc}{decode}{input\optional{, errors}}
218 Decodes the object \var{input} and returns a tuple (output object,
219 length consumed).
221 \var{input} must be an object which provides the \code{bf_getreadbuf}
222 buffer slot. Python strings, buffer objects and memory mapped files
223 are examples of objects providing this slot.
225 \var{errors} defines the error handling to apply. It defaults to
226 \code{'strict'} handling.
228 The method may not store state in the \class{Codec} instance. Use
229 \class{StreamReader} for codecs which have to keep state in order to
230 make decoding efficient.
232 The decoder must be able to handle zero length input and return an
233 empty object of the output object type in this situation.
234 \end{methoddesc}
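A stateless codec following this interface might look like the sketch below. It assumes a current Python 3 interpreter; \code{HexCodec} is a hypothetical toy codec that stores text as the ASCII hex digits of its UTF-8 bytes, chosen only because it is short.

```python
import codecs

class HexCodec(codecs.Codec):
    """Hypothetical toy codec: text <-> ASCII hex digits of its UTF-8 bytes."""

    def encode(self, input, errors="strict"):
        raw = input.encode("utf-8", errors)
        # Return (output object, length consumed); no state is kept.
        return raw.hex().encode("ascii"), len(input)

    def decode(self, input, errors="strict"):
        # bytes() accepts any object exposing a readable buffer.
        digits = bytes(input).decode("ascii")
        raw = bytes.fromhex(digits)
        return raw.decode("utf-8", errors), len(input)

codec = HexCodec()
encoded, consumed = codec.encode("hi")
decoded, used = codec.decode(encoded)

# Zero length input yields an empty object of the output type.
empty_bytes, _ = codec.encode("")
empty_text, _ = codec.decode(b"")
```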
236 The \class{StreamWriter} and \class{StreamReader} classes provide
237 generic working interfaces which can be used to implement new
238 encoding submodules very easily. See \module{encodings.utf_8} for an
239 example of how this is done.
242 \subsubsection{StreamWriter Objects \label{stream-writer-objects}}
244 The \class{StreamWriter} class is a subclass of \class{Codec} and
245 defines the following methods which every stream writer must define in
246 order to be compatible with the Python codec registry.
248 \begin{classdesc}{StreamWriter}{stream\optional{, errors}}
249 Constructor for a \class{StreamWriter} instance.
251 All stream writers must provide this constructor interface. They are
252 free to add additional keyword arguments, but only the ones defined
253 here are used by the Python codec registry.
255 \var{stream} must be a file-like object open for writing (binary)
256 data.
258 The \class{StreamWriter} may implement different error handling
259 schemes by providing the \var{errors} keyword argument. These
260 parameters are defined:
262 \begin{itemize}
263 \item \code{'strict'} Raise \exception{ValueError} (or a subclass);
264 this is the default.
265 \item \code{'ignore'} Ignore the character and continue with the next.
266 \item \code{'replace'} Replace with a suitable replacement character.
267 \end{itemize}
268 \end{classdesc}
270 \begin{methoddesc}{write}{object}
271 Writes the object's contents encoded to the stream.
272 \end{methoddesc}
274 \begin{methoddesc}{writelines}{list}
275 Writes the concatenated list of strings to the stream (possibly by
276 reusing the \method{write()} method).
277 \end{methoddesc}
279 \begin{methoddesc}{reset}{}
280 Flushes and resets the codec buffers used for keeping state.
282 Calling this method should ensure that the data on the output is put
283 into a clean state, that allows appending of new fresh data without
284 having to rescan the whole stream to recover state.
285 \end{methoddesc}
287 In addition to the above methods, the \class{StreamWriter} must also
288 inherit all other methods and attributes from the underlying stream.
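The writer methods described above can be tried on an in-memory stream. This sketch assumes a current Python 3 interpreter; the \code{io.BytesIO} object stands in for a file opened for binary writing, and \code{getvalue()} is a method of that stream, reached through the writer.

```python
import codecs
import io

raw = io.BytesIO()

# Build an ASCII writer that replaces unencodable characters.
writer = codecs.getwriter("ascii")(raw, errors="replace")

writer.write("caf\u00e9\n")            # encoded, then written to raw
writer.writelines(["one\n", "two\n"])  # concatenated and written
writer.reset()                         # leave the codec in a clean state

# Attributes the writer does not define are taken from the stream.
contents = writer.getvalue()
```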
291 \subsubsection{StreamReader Objects \label{stream-reader-objects}}
293 The \class{StreamReader} class is a subclass of \class{Codec} and
294 defines the following methods which every stream reader must define in
295 order to be compatible with the Python codec registry.
297 \begin{classdesc}{StreamReader}{stream\optional{, errors}}
298 Constructor for a \class{StreamReader} instance.
300 All stream readers must provide this constructor interface. They are
301 free to add additional keyword arguments, but only the ones defined
302 here are used by the Python codec registry.
304 \var{stream} must be a file-like object open for reading (binary)
305 data.
307 The \class{StreamReader} may implement different error handling
308 schemes by providing the \var{errors} keyword argument. These
309 parameters are defined:
311 \begin{itemize}
312 \item \code{'strict'} Raise \exception{ValueError} (or a subclass);
313 this is the default.
314 \item \code{'ignore'} Ignore the character and continue with the next.
315 \item \code{'replace'} Replace with a suitable replacement character.
316 \end{itemize}
317 \end{classdesc}
319 \begin{methoddesc}{read}{\optional{size}}
320 Decodes data from the stream and returns the resulting object.
322 \var{size} indicates the approximate maximum number of bytes to read
323 from the stream for decoding purposes. The decoder can modify this
324 setting as appropriate. The default value -1 indicates to read and
325 decode as much as possible. \var{size} is intended to prevent having
326 to decode huge files in one step.
328 The method should use a greedy read strategy, meaning that it should
329 read as much data as is allowed within the definition of the encoding
330 and the given size, e.g.\ if optional encoding endings or state
331 markers are available on the stream, these should be read too.
332 \end{methoddesc}
334 \begin{methoddesc}{readline}{\optional{size}}
335 Read one line from the input stream and return the
336 decoded data.
338 Unlike the \method{readlines()} method, this method inherits
339 the line breaking knowledge from the underlying stream's
340 \method{readline()} method -- there is currently no support for line
341 breaking using the codec decoder due to lack of line buffering.
342 Subclasses should however, if possible, try to implement this method
343 using their own knowledge of line breaking.
345 \var{size}, if given, is passed as size argument to the stream's
346 \method{readline()} method.
347 \end{methoddesc}
349 \begin{methoddesc}{readlines}{\optional{sizehint}}
350 Read all lines available on the input stream and return them as a list
351 of lines.
353 Line breaks are implemented using the codec's decoder method and are
354 included in the list entries.
356 \var{sizehint}, if given, is passed as \var{size} argument to the
357 stream's \method{read()} method.
358 \end{methoddesc}
360 \begin{methoddesc}{reset}{}
361 Resets the codec buffers used for keeping state.
363 Note that no stream repositioning should take place. This method is
364 primarily intended to be able to recover from decoding errors.
365 \end{methoddesc}
367 In addition to the above methods, the \class{StreamReader} must also
368 inherit all other methods and attributes from the underlying stream.
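The reader methods described above can be tried on an in-memory stream. This sketch assumes a current Python 3 interpreter, whose \method{readline()} behaviour may differ in detail from the description in this section; the \code{io.BytesIO} object stands in for a file opened for binary reading.

```python
import codecs
import io

raw = io.BytesIO("alpha\nb\u00eata\ngamma".encode("utf-8"))
reader = codecs.getreader("utf-8")(raw)

first = reader.readline()   # one decoded line, line break included
rest = reader.readlines()   # remaining lines, line breaks included
leftover = reader.read()    # nothing is left on the stream

# Drop any buffered codec state; the stream position is not changed.
reader.reset()
```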
370 The next two base classes are included for convenience. They are not
371 needed by the codec registry, but may prove useful in practice.
374 \subsubsection{StreamReaderWriter Objects \label{stream-reader-writer}}
376 The \class{StreamReaderWriter} allows wrapping streams which work in
377 both read and write modes.
379 The design is such that one can use the factory functions returned by
380 the \function{lookup()} function to construct the instance.
382 \begin{classdesc}{StreamReaderWriter}{stream, Reader, Writer, errors}
383 Creates a \class{StreamReaderWriter} instance.
384 \var{stream} must be a file-like object.
385 \var{Reader} and \var{Writer} must be factory functions or classes
386 providing the \class{StreamReader} and \class{StreamWriter} interface
387 respectively.
388 Error handling is done in the same way as defined for the
389 stream readers and writers.
390 \end{classdesc}
392 \class{StreamReaderWriter} instances define the combined interfaces of
393 \class{StreamReader} and \class{StreamWriter} classes. They inherit
394 all other methods and attributes from the underlying stream.
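A \class{StreamReaderWriter} can be assembled from the factories returned by \function{lookup()}, as in this sketch. It assumes a current Python 3 interpreter, where the wrapper also offers \method{seek()}; the in-memory stream stands in for a file opened for both reading and writing.

```python
import codecs
import io

raw = io.BytesIO()

# Take the two stream factories from the codec tuple.
encoder, decoder, Reader, Writer = codecs.lookup("utf-8")
stream = codecs.StreamReaderWriter(raw, Reader, Writer, "strict")

stream.write("sm\u00f8rrebr\u00f8d\n")  # handled by the writer
stream.seek(0)                          # rewind the underlying stream
text = stream.read()                    # handled by the reader
```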
397 \subsubsection{StreamRecoder Objects \label{stream-recoder-objects}}
399 The \class{StreamRecoder} provides a frontend-backend view of
400 encoding data which is sometimes useful when dealing with different
401 encoding environments.
403 The design is such that one can use the factory functions returned by
404 the \function{lookup()} function to construct the instance.
406 \begin{classdesc}{StreamRecoder}{stream, encode, decode,
407 Reader, Writer, errors}
408 Creates a \class{StreamRecoder} instance which implements a two-way
409 conversion: \var{encode} and \var{decode} work on the frontend (the
410 input to \method{read()} and output of \method{write()}) while
411 \var{Reader} and \var{Writer} work on the backend (reading and
412 writing to the stream).
414 You can use these objects to do transparent direct recodings from
415 e.g.\ Latin-1 to UTF-8 and back.
417 \var{stream} must be a file-like object.
419 \var{encode} and \var{decode} must adhere to the \class{Codec}
420 interface; \var{Reader} and \var{Writer} must be factory functions or
421 classes providing objects of the \class{StreamReader} and
422 \class{StreamWriter} interface respectively.
424 \var{encode} and \var{decode} are needed for the frontend
425 translation, \var{Reader} and \var{Writer} for the backend
426 translation. The intermediate format used is determined by the two
427 sets of codecs, e.g.\ the Unicode codecs will use Unicode as
428 intermediate encoding.
430 Error handling is done in the same way as defined for the
431 stream readers and writers.
432 \end{classdesc}
434 \class{StreamRecoder} instances define the combined interfaces of
435 \class{StreamReader} and \class{StreamWriter} classes. They inherit
436 all other methods and attributes from the underlying stream.
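The Latin-1 to UTF-8 recoding mentioned above can be sketched as follows, assuming a current Python 3 interpreter, where the frontend data are \code{bytes}. The Latin-1 functions serve the frontend and the UTF-8 factories serve the backend; the in-memory streams stand in for real files.

```python
import codecs
import io

# Frontend: Latin-1 encode/decode functions.
latin1_encode, latin1_decode, _, _ = codecs.lookup("latin-1")
# Backend: UTF-8 stream reader and writer factories.
_, _, Utf8Reader, Utf8Writer = codecs.lookup("utf-8")

backend = io.BytesIO()
recoder = codecs.StreamRecoder(
    backend, latin1_encode, latin1_decode, Utf8Reader, Utf8Writer, "strict"
)

# Latin-1 bytes written to the frontend land on the stream as UTF-8.
recoder.write(b"\xfcber")
stored = backend.getvalue()

# Reading a UTF-8 stream through a recoder hands back Latin-1 bytes.
source = codecs.StreamRecoder(
    io.BytesIO(stored), latin1_encode, latin1_decode,
    Utf8Reader, Utf8Writer, "strict"
)
front = source.read()
```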