\section{\module{rfc822} ---
Parse RFC 822 mail headers.}
\declaremodule{standard}{rfc822}

\modulesynopsis{Parse \rfc{822} style mail headers.}

This module defines a class, \class{Message}, which represents a
collection of ``email headers'' as defined by the Internet standard
\rfc{822}. It is used in various contexts, usually to read such
headers from a file. This module also defines a helper class
\class{AddressList} for parsing \rfc{822} addresses.

Note that there's a separate module to read \UNIX{}, MH, and MMDF
style mailbox files: \module{mailbox}\refstmodindex{mailbox}.

\begin{classdesc}{Message}{file\optional{, seekable}}
A \class{Message} instance is instantiated with an input object as
parameter. \class{Message} relies only on the input object having a
\method{readline()} method; in particular, ordinary file objects
qualify. Instantiation reads headers from the input object up to a
delimiter line (normally a blank line) and stores them in the
instance.

This class can work with any input object that supports a
\method{readline()} method. If the input object has seek and tell
capability, the \method{rewindbody()} method will work; also, illegal
lines will be pushed back onto the input stream. If the input object
lacks seek but has an \method{unread()} method that can push back a
line of input, \class{Message} will use that to push back illegal
lines. Thus this class can be used to parse messages coming from a
buffered stream.

The optional \var{seekable} argument is provided as a workaround for
certain stdio libraries in which \cfunction{tell()} discards buffered
data before discovering that the \cfunction{lseek()} system call
doesn't work. For maximum portability, you should set the
\var{seekable} argument to zero to prevent that initial
\method{tell()} when passing in an unseekable object such as a file
object created from a socket object.

Input lines as read from the file may either be terminated by CR-LF or
by a single linefeed; a terminating CR-LF is replaced by a single
linefeed before the line is stored.

All header matching is done independent of upper or lower case;
e.g.\ \code{\var{m}['From']}, \code{\var{m}['from']} and
\code{\var{m}['FROM']} all yield the same result.
\end{classdesc}

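The reading rules above can be illustrated with a minimal sketch.
\code{read_headers()} is a hypothetical helper written for this
illustration and is not part of \module{rfc822}. It assumes Python 3
and an in-memory stream, ignores continuation lines, and only mimics
the behaviour described above: it stops at the blank delimiter line,
replaces a terminating CR-LF by a linefeed, and stores field names in
lower case so that lookups are independent of case.

```python
import io


def read_headers(fp):
    """Read header lines from fp up to and including the blank delimiter.

    Hypothetical helper, not part of rfc822. Returns a dict keyed by the
    lower-cased field name; only the first occurrence of a field is kept.
    """
    headers = {}
    while True:
        line = fp.readline()
        if not line:
            break  # end of input
        # A terminating CR-LF is replaced by a single linefeed.
        if line.endswith("\r\n"):
            line = line[:-2] + "\n"
        if line == "\n":
            break  # blank delimiter line: stop, body stays unread
        name, sep, value = line.partition(":")
        if not sep:
            break  # not a header line; a real parser would push it back
        headers.setdefault(name.strip().lower(), value.strip())
    return headers


fp = io.StringIO(
    "From: jack@cwi.nl (Jack Jansen)\r\nSubject: Hello\r\n\r\nBody text\r\n"
)
m = read_headers(fp)
sender = m["FROM".lower()]  # lookup is independent of case
body = fp.read()            # the stream is left at the start of the body
```

After the call the stream is positioned just past the delimiter line,
so the body can be read directly from the same object.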
\begin{classdesc}{AddressList}{field}
You may instantiate the \class{AddressList} helper class using a single
string parameter, a comma-separated list of \rfc{822} addresses to be
parsed. (The parameter \code{None} yields an empty list.)
\end{classdesc}

\begin{funcdesc}{parsedate}{date}
Attempts to parse a date according to the rules in \rfc{822}.
However, some mailers don't follow that format as specified, so
\function{parsedate()} tries to guess correctly in such cases.
\var{date} is a string containing an \rfc{822} date, such as
\code{'Mon, 20 Nov 1995 19:12:08 -0500'}. If it succeeds in parsing
the date, \function{parsedate()} returns a 9-tuple that can be passed
directly to \function{time.mktime()}; otherwise \code{None} will be
returned.
\end{funcdesc}

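As a hedged illustration of the call described above: the
\module{rfc822} module is not available in current Python releases, so
this sketch assumes Python 3 and uses \function{email.utils.parsedate()}
as a stand-in with the same calling convention. The stand-in is an
assumption of this example and may differ from \module{rfc822} in edge
cases.

```python
import time
from email.utils import parsedate  # stand-in for rfc822.parsedate (assumption)

t = parsedate('Mon, 20 Nov 1995 19:12:08 -0500')
# The first six items are the date and time fields as written in the header.
fields = t[:6]

# The 9-tuple can be passed directly to time.mktime(), which interprets
# the fields as local time and ignores the header's timezone.
stamp = time.mktime(t)

# A string that cannot be parsed as a date yields None.
bad = parsedate('not a date')
```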
\begin{funcdesc}{parsedate_tz}{date}
Performs the same function as \function{parsedate()}, but returns
either \code{None} or a 10-tuple; the first 9 elements make up a tuple
that can be passed directly to \function{time.mktime()}, and the tenth
is the offset of the date's timezone from UTC (which is the official
term for Greenwich Mean Time). (Note that the sign of the timezone
offset is the opposite of the sign of the \code{time.timezone}
variable for the same timezone; the latter variable follows the
\POSIX{} standard while this module follows \rfc{822}.) If the input
string has no timezone, the last element of the tuple returned is
\code{None}.
\end{funcdesc}

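The sign convention can be checked with a short sketch. It assumes
Python 3 and uses \function{email.utils.parsedate_tz()} as a stand-in
for the function described above; the stand-in reports the offset in
seconds and may differ from \module{rfc822} in edge cases such as a
missing timezone.

```python
from email.utils import parsedate_tz  # stand-in for rfc822.parsedate_tz (assumption)

t = parsedate_tz('Mon, 20 Nov 1995 19:12:08 -0500')

# The tenth item is the offset from UTC in seconds. A zone west of
# Greenwich such as -0500 gives a negative value, whereas
# time.timezone would be positive for the same zone.
offset = t[9]
hours = offset / 3600
```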
\begin{funcdesc}{mktime_tz}{tuple}
Turn a 10-tuple as returned by \function{parsedate_tz()} into a UTC
timestamp. If the timezone item in the tuple is \code{None}, assume
local time. Minor deficiency: this first interprets the first 8
elements as a local time and then compensates for the timezone
difference; this may yield a slight error around daylight saving time
switch dates. The error is not large enough to matter for common use.
\end{funcdesc}

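A small sketch of the conversion, again assuming Python 3 and the
\module{email.utils} stand-ins rather than \module{rfc822} itself. The
stand-in \function{mktime_tz()} is not guaranteed to share the
deficiency noted above; the example only shows how the timezone offset
enters the result.

```python
import calendar
from email.utils import mktime_tz, parsedate_tz  # stand-ins (assumption)

t = parsedate_tz('Mon, 20 Nov 1995 19:12:08 -0500')
stamp = mktime_tz(t)

# 19:12:08 at five hours west of UTC is 00:12:08 UTC on the next day.
expected = calendar.timegm((1995, 11, 21, 0, 12, 8, 0, 0, 0))
```

Because the offset is applied, the result does not depend on the
timezone of the machine running the code.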
\subsection{Message Objects}
\label{message-objects}

A \class{Message} instance has the following methods:

\begin{methoddesc}{rewindbody}{}
Seek to the start of the message body. This only works if the file
object is seekable.
\end{methoddesc}

\begin{methoddesc}{isheader}{line}
Returns a line's canonicalized fieldname (the dictionary key that will
be used to index it) if the line is a legal \rfc{822} header; otherwise
returns \code{None} (implying that parsing should stop here and the
line be pushed back on the input stream). It is sometimes useful to
override this method in a subclass.
\end{methoddesc}

\begin{methoddesc}{islast}{line}
Return true if the given line is a delimiter on which \class{Message}
should stop. The delimiter line is consumed, and the file object's read
location positioned immediately after it. By default this method just
checks that the line is blank, but you can override it in a subclass.
\end{methoddesc}

\begin{methoddesc}{iscomment}{line}
Return true if the given line should be ignored entirely, just skipped.
By default this is a stub that always returns false, but you can
override it in a subclass.
\end{methoddesc}

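The override hooks described above can be sketched with a reduced
stand-in class. \code{MiniMessage} and \code{DashDelimited} are
hypothetical names invented for this illustration (Python 3 assumed);
they copy only the hook names \method{islast()} and
\method{iscomment()} from the description and are not the module's
implementation.

```python
import io


class MiniMessage:
    """Hypothetical stand-in for Message, reduced to the two hooks."""

    def __init__(self, fp):
        self.fp = fp
        self.headers = []
        while True:
            line = fp.readline()
            if not line or self.islast(line):
                break       # the delimiter line is consumed
            if self.iscomment(line):
                continue    # comment lines are skipped entirely
            self.headers.append(line)

    def islast(self, line):
        # Default: a blank line ends the headers.
        return line in ("\n", "\r\n")

    def iscomment(self, line):
        # Default stub: no line is treated as a comment.
        return False


class DashDelimited(MiniMessage):
    """Stops at a line starting with '--' and skips '#' lines."""

    def islast(self, line):
        return line.startswith("--")

    def iscomment(self, line):
        return line.startswith("#")


fp = io.StringIO("# generated\nFrom: a@example.org\n--\nbody\n")
msg = DashDelimited(fp)
rest = fp.read()  # reading resumes immediately after the delimiter
```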
\begin{methoddesc}{getallmatchingheaders}{name}
Return a list of lines consisting of all headers matching
\var{name}, if any. Each physical line, whether it is a continuation
line or not, is a separate list item. Return the empty list if no
header matches \var{name}.
\end{methoddesc}

\begin{methoddesc}{getfirstmatchingheader}{name}
Return a list of lines comprising the first header matching
\var{name}, and its continuation line(s), if any. Return \code{None}
if there is no header matching \var{name}.
\end{methoddesc}

\begin{methoddesc}{getrawheader}{name}
Return a single string consisting of the text after the colon in the
first header matching \var{name}. This includes leading whitespace,
the trailing linefeed, and internal linefeeds and whitespace if
any continuation line(s) were present. Return \code{None} if there is
no header matching \var{name}.
\end{methoddesc}

\begin{methoddesc}{getheader}{name\optional{, default}}
Like \code{getrawheader(\var{name})}, but strip leading and trailing
whitespace. Internal whitespace is not stripped. The optional
\var{default} argument can be used to specify a different default to
be returned when there is no header matching \var{name}.
\end{methoddesc}

\begin{methoddesc}{get}{name\optional{, default}}
An alias for \method{getheader()}, to make the interface more compatible
with regular dictionaries.
\end{methoddesc}

\begin{methoddesc}{getaddr}{name}
Return a pair \code{(\var{full name}, \var{email address})} parsed
from the string returned by \code{getheader(\var{name})}. If no
header matching \var{name} exists, return \code{(None, None)};
otherwise both the full name and the address are (possibly empty)
strings.

Example: If \var{m}'s first \code{From} header contains the string
\code{'jack@cwi.nl (Jack Jansen)'}, then
\code{m.getaddr('From')} will yield the pair
\code{('Jack Jansen', 'jack@cwi.nl')}.
If the header contained
\code{'Jack Jansen <jack@cwi.nl>'} instead, it would yield the
exact same result.
\end{methoddesc}

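The two header forms in the example above can be tried out with a
short sketch. It assumes Python 3 and uses
\function{email.utils.parseaddr()} on the header value as a stand-in
for \method{getaddr()}; that substitution is an assumption of this
example, since \module{rfc822} is not available in current releases.

```python
from email.utils import parseaddr  # stand-in for Message.getaddr (assumption)

# Comment form: the address followed by the full name in parentheses.
a = parseaddr('jack@cwi.nl (Jack Jansen)')

# Angle-bracket form: the full name followed by the address.
b = parseaddr('Jack Jansen <jack@cwi.nl>')
```

Both calls produce the same (full name, email address) pair.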
\begin{methoddesc}{getaddrlist}{name}
This is similar to \code{getaddr(\var{name})}, but parses a header
containing a list of email addresses (e.g.\ a \code{To} header) and
returns a list of \code{(\var{full name}, \var{email address})} pairs
(even if there was only one address in the header). If there is no
header matching \var{name}, return an empty list.

XXX The current version of this function is not really correct. It
yields bogus results if a full name contains a comma.
\end{methoddesc}

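A hedged sketch of the list form of the result, assuming Python 3 and
using \function{email.utils.getaddresses()} on the header value as a
stand-in for \method{getaddrlist()}. The stand-in is a different
implementation, so the caveat above about commas in full names need
not apply to it; the addresses are made up for the example.

```python
from email.utils import getaddresses  # stand-in for getaddrlist (assumption)

# A To header with two comma-separated addresses.
pairs = getaddresses(['Jack Jansen <jack@cwi.nl>, guido@example.org'])

# A header with a single address still produces a list.
single = getaddresses(['jack@cwi.nl'])
```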
\begin{methoddesc}{getdate}{name}
Retrieve a header using \method{getheader()} and parse it into a 9-tuple
compatible with \function{time.mktime()}. If there is no header matching
\var{name}, or it is unparsable, return \code{None}.

Date parsing appears to be a black art, and not all mailers adhere to
the standard. While this method has been tested and found correct on a
large collection of email from many sources, it may still occasionally
yield an incorrect result.
\end{methoddesc}

\begin{methoddesc}{getdate_tz}{name}
Retrieve a header using \method{getheader()} and parse it into a
10-tuple; the first 9 elements will make a tuple compatible with
\function{time.mktime()}, and the 10th is a number giving the offset
of the date's timezone from UTC. Similarly to \method{getdate()}, if
there is no header matching \var{name}, or it is unparsable, return
\code{None}.
\end{methoddesc}

\class{Message} instances also support a read-only mapping interface.
In particular: \code{\var{m}[\var{name}]} is like
\code{\var{m}.getheader(\var{name})} but raises \exception{KeyError} if
there is no matching header; and \code{len(\var{m})},
\code{\var{m}.has_key(\var{name})}, \code{\var{m}.keys()},
\code{\var{m}.values()} and \code{\var{m}.items()} act as expected
(and consistently).

Finally, \class{Message} instances have two public instance variables:

\begin{memberdesc}{headers}
A list containing the entire set of header lines, in the order in
which they were read (except that setitem calls may disturb this
order). Each line contains a trailing newline. The
blank line terminating the headers is not contained in the list.
\end{memberdesc}

\begin{memberdesc}{fp}
The file object passed at instantiation time.
\end{memberdesc}

\subsection{AddressList Objects}
\label{addresslist-objects}

An \class{AddressList} instance has the following methods:

\begin{methoddesc}{__len__}{}
Return the number of addresses in the address list.
\end{methoddesc}

\begin{methoddesc}{__str__}{}
Return a canonicalized string representation of the address list.
Addresses are rendered in \code{"name" <host@domain>} form,
comma-separated.
\end{methoddesc}

\begin{methoddesc}{__add__}{alist}
Return an \class{AddressList} instance that contains all addresses in
both \class{AddressList} operands, with duplicates removed (set union).
\end{methoddesc}

\begin{methoddesc}{__sub__}{alist}
Return an \class{AddressList} instance that contains every address in
the left-hand \class{AddressList} operand that is not present in the
right-hand \class{AddressList} operand (set difference).
\end{methoddesc}

Finally, \class{AddressList} instances have one public instance variable:

\begin{memberdesc}{addresslist}
A list of string pairs (2-tuples), one per address. In each pair, the
first item is the canonicalized name part of the address and the second
is the route-address (\code{@}-separated host-domain pair).
\end{memberdesc}
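The set-like semantics of addition and subtraction can be sketched on
plain lists of pairs shaped like \member{addresslist}. The functions
\code{union()} and \code{difference()} are hypothetical helpers written
for this illustration (Python 3 assumed) and are not the methods of
\class{AddressList}; the addresses are made up.

```python
def union(left, right):
    """All pairs from both lists, in order, with duplicates removed."""
    result = []
    for pair in list(left) + list(right):
        if pair not in result:
            result.append(pair)
    return result


def difference(left, right):
    """Every pair of the left list that is absent from the right list."""
    return [pair for pair in left if pair not in right]


a = [('Jack Jansen', 'jack@cwi.nl'), ('', 'guido@example.org')]
b = [('', 'guido@example.org'), ('', 'fred@example.org')]

both = union(a, b)
only_a = difference(a, b)
```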