\section{\module{xmllib} ---
         A parser for XML documents}

\declaremodule{standard}{xmllib}
\modulesynopsis{A parser for XML documents.}
\moduleauthor{Sjoerd Mullender}{Sjoerd.Mullender@cwi.nl}
\sectionauthor{Sjoerd Mullender}{Sjoerd.Mullender@cwi.nl}

\index{Extensible Markup Language}

\versionchanged{1.5.2}

This module defines a class \class{XMLParser} which serves as the basis
for parsing text files formatted in XML (Extensible Markup Language).
\begin{classdesc}{XMLParser}{}
The \class{XMLParser} class must be instantiated without
arguments.\footnote{Actually, a number of keyword arguments are
recognized which influence the parser to accept certain non-standard
constructs.  The following keyword arguments are currently
recognized; the default for all of them is \code{0} (false):
\var{accept_unquoted_attributes} (accept certain attribute values
without requiring quotes), \var{accept_missing_endtag_name} (accept
end tags that look like \code{</>}), \var{map_case} (map upper case to
lower case in tags and attributes), and \var{accept_utf8} (allow UTF-8
characters in input; this is required according to the XML standard,
but Python does not as yet deal properly with these characters, so
this is not the default).}
\end{classdesc}
This class provides the following interface methods and instance variables:
\begin{memberdesc}{attributes}
A mapping of element names to mappings.  The latter mapping maps
attribute names that are valid for the element to the default value of
the attribute, or, if there is no default, to \code{None}.  The default
value is the empty dictionary.  This variable is meant to be
overridden, not extended, since the default is shared by all instances
of \class{XMLParser}.
\end{memberdesc}
\begin{memberdesc}{elements}
A mapping of element names to tuples.  Each tuple contains a function
for handling the start tag and a function for handling the end tag of
the element, in that order; either may be \code{None} if the method
\method{unknown_starttag()} or \method{unknown_endtag()} is to be
called instead.  The default value is the empty dictionary.  This
variable is meant to be overridden, not extended, since the default is
shared by all instances of \class{XMLParser}.
\end{memberdesc}
\begin{memberdesc}{entitydefs}
A mapping of entity names to their values.  The default value contains
definitions for \code{'lt'}, \code{'gt'}, \code{'amp'}, \code{'quot'},
and \code{'apos'}.
\end{memberdesc}
\begin{methoddesc}{reset}{}
Reset the instance.  Loses all unprocessed data.  This is called
implicitly at instantiation time.
\end{methoddesc}
\begin{methoddesc}{setnomoretags}{}
Stop processing tags.  Treat all following input as literal input.
\end{methoddesc}
\begin{methoddesc}{setliteral}{}
Enter literal mode (CDATA mode).  This mode is automatically exited
when the close tag matching the last unclosed open tag is encountered.
\end{methoddesc}
\begin{methoddesc}{feed}{data}
Feed some text to the parser.  It is processed insofar as it consists
of complete tags; incomplete data is buffered until more data is
fed or \method{close()} is called.
\end{methoddesc}
\begin{methoddesc}{close}{}
Force processing of all buffered data as if it were followed by an
end-of-file mark.  This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call the base class
\method{close()}.
\end{methoddesc}
\begin{methoddesc}{translate_references}{data}
Translate all entity and character references in \var{data} and
return the translated string.
\end{methoddesc}
\begin{methoddesc}{handle_xml}{encoding, standalone}
This method is called when the \samp{<?xml ...?>} tag is processed.
The arguments are the values of the encoding and standalone attributes
in the tag.  Both encoding and standalone are optional.  When an
attribute is absent, the value passed to \method{handle_xml()} is
\code{None} for \var{encoding} and the string \code{'no'} for
\var{standalone}.
\end{methoddesc}
\begin{methoddesc}{handle_doctype}{tag, data}
This method is called when the \samp{<!DOCTYPE...>} tag is processed.
The arguments are the name of the root element and the uninterpreted
contents of the tag, starting after the white space after the name of
the root element.
\end{methoddesc}
\begin{methoddesc}{handle_starttag}{tag, method, attributes}
This method is called to handle start tags for which a start tag
handler is defined in the instance variable \member{elements}.  The
\var{tag} argument is the name of the tag, and the \var{method}
argument is the function (method) which should be used to support semantic
interpretation of the start tag.  The \var{attributes} argument is a
dictionary of attributes, the key being the \var{name} and the value
being the \var{value} of the attribute found inside the tag's
\code{<>} brackets.  Character and entity references in the
\var{value} have been interpreted.  For instance, for the start tag
\code{<A HREF="http://www.cwi.nl/">}, this method would be called as
\code{handle_starttag('A', self.elements['A'][0], \{'HREF': 'http://www.cwi.nl/'\})}.
The base implementation simply calls \var{method} with \var{attributes}
as the only argument.
\end{methoddesc}
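The dispatch described above can be sketched with a small stand-in
class.  This is not the real \class{XMLParser}: the names
\code{MiniDispatcher}, \code{LinkCollector}, \code{dispatch_start()} and
\code{start_a()} are hypothetical and exist only for this illustration.
Only the lookup in \member{elements} and the calling convention of
\method{handle_starttag()} are taken from the description.

```python
class MiniDispatcher:
    """Hypothetical stand-in that mimics only the start tag dispatch."""

    # Maps element names to (start handler, end handler) tuples.
    elements = {}

    def handle_starttag(self, tag, method, attributes):
        # Mirrors the documented base behaviour: call the handler
        # with the attribute dictionary as the only argument.
        method(attributes)

    def unknown_starttag(self, tag, attributes):
        # Fallback for tags without a start handler; does nothing here.
        pass

    def dispatch_start(self, tag, attributes):
        # Assumed helper: choose between the registered handler
        # and the unknown_starttag() fallback.
        handlers = self.elements.get(tag)
        if handlers is None or handlers[0] is None:
            self.unknown_starttag(tag, attributes)
        else:
            self.handle_starttag(tag, handlers[0], attributes)


class LinkCollector(MiniDispatcher):
    def __init__(self):
        self.links = []
        # Override the mapping per instance instead of extending
        # the shared class-level default.
        self.elements = {'A': (self.start_a, None)}

    def start_a(self, attributes):
        self.links.append(attributes.get('HREF'))


collector = LinkCollector()
collector.dispatch_start('A', {'HREF': 'http://www.cwi.nl/'})
collector.dispatch_start('B', {})   # no handler registered: ignored
```

Assigning \member{elements} in the constructor, rather than mutating
the class-level dictionary, keeps the handlers of one instance from
leaking into others.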
\begin{methoddesc}{handle_endtag}{tag, method}
This method is called to handle end tags for which an end tag handler
is defined in the instance variable \member{elements}.  The \var{tag}
argument is the name of the tag, and the \var{method} argument is the
function (method) which should be used to support semantic
interpretation of the end tag.  For instance, for the end tag
\code{</A>}, this method would be called as
\code{handle_endtag('A', self.elements['A'][1])}.  The base
implementation simply calls \var{method}.
\end{methoddesc}
\begin{methoddesc}{handle_data}{data}
This method is called to process arbitrary data.  It is intended to be
overridden by a derived class; the base class implementation does
nothing.
\end{methoddesc}
\begin{methoddesc}{handle_charref}{ref}
This method is called to process a character reference of the form
\samp{\&\#\var{ref};}.  \var{ref} can either be a decimal number,
or a hexadecimal number when preceded by an \character{x}.
In the base implementation, \var{ref} must be a number in the
range 0--255.  It translates the character to \ASCII{} and calls the
method \method{handle_data()} with the character as argument.  If
\var{ref} is invalid or out of range, the method
\code{unknown_charref(\var{ref})} is called to handle the error.  A
subclass must override this method to provide support for character
references outside of the \ASCII{} range.
\end{methoddesc}
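The decoding rule for \var{ref} can be sketched as a standalone
function.  This is a minimal illustration under the rules stated above,
not the module's own code; the name \code{decode_charref()} is
hypothetical, and returning \code{None} stands in for the call to
\method{unknown_charref()}.

```python
def decode_charref(ref):
    """Return the character for a reference body such as '65' or 'x41'.

    Returns None when the reference is malformed or outside 0-255,
    which is where the parser would call unknown_charref() instead.
    """
    try:
        if ref[:1] == 'x':
            number = int(ref[1:], 16)   # hexadecimal form
        else:
            number = int(ref)           # decimal form
    except ValueError:
        return None
    if not 0 <= number <= 255:
        return None
    return chr(number)


decimal_a = decode_charref('65')
hex_a = decode_charref('x41')
too_large = decode_charref('300')
malformed = decode_charref('zz')
```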
\begin{methoddesc}{handle_comment}{comment}
This method is called when a comment is encountered.  The
\var{comment} argument is a string containing the text between the
\samp{<!--} and \samp{-->} delimiters, but not the delimiters
themselves.  For example, the comment \samp{<!--text-->} will
cause this method to be called with the argument \code{'text'}.  The
default method does nothing.
\end{methoddesc}
\begin{methoddesc}{handle_cdata}{data}
This method is called when a CDATA section is encountered.  The
\var{data} argument is a string containing the text between the
\samp{<![CDATA[} and \samp{]]>} delimiters, but not the delimiters
themselves.  For example, the section \samp{<![CDATA[text]]>} will
cause this method to be called with the argument \code{'text'}.  The
default method does nothing, and is intended to be overridden.
\end{methoddesc}
\begin{methoddesc}{handle_proc}{name, data}
This method is called when a processing instruction (PI) is
encountered.  The \var{name} is the PI target, and the \var{data}
argument is a string containing the text between the PI target and the
closing delimiter, but not the delimiter itself.  For example, the
instruction \samp{<?XML text?>} will cause this method to be called
with the arguments \code{'XML'} and \code{'text'}.  The default method
does nothing.  Note that if a document starts with
\samp{<?xml ...?>}, \method{handle_xml()} is called to handle it.
\end{methoddesc}
\begin{methoddesc}{handle_special}{data}
This method is called when a declaration is encountered.  The
\var{data} argument is a string containing the text between the
\samp{<!} and \samp{>} delimiters, but not the delimiters
themselves.  For example, the declaration \samp{<!ENTITY text>} will
cause this method to be called with the argument \code{'ENTITY text'}.  The
default method does nothing.  Note that \samp{<!DOCTYPE ...>} is
handled separately if it is located at the start of the document.
\end{methoddesc}
\begin{methoddesc}{syntax_error}{message}
This method is called when a syntax error is encountered.  The
\var{message} is a description of what was wrong.  The default method
raises a \exception{RuntimeError} exception.  If this method is
overridden, it is permissible for it to return.  This method is only
called when the error can be recovered from.  Unrecoverable errors
raise a \exception{RuntimeError} without first calling
\method{syntax_error()}.
\end{methoddesc}
\begin{methoddesc}{unknown_starttag}{tag, attributes}
This method is called to process an unknown start tag.  It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}
\begin{methoddesc}{unknown_endtag}{tag}
This method is called to process an unknown end tag.  It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}
\begin{methoddesc}{unknown_charref}{ref}
This method is called to process unresolvable numeric character
references.  It is intended to be overridden by a derived class; the
base class implementation does nothing.
\end{methoddesc}
\begin{methoddesc}{unknown_entityref}{ref}
This method is called to process an unknown entity reference.  It is
intended to be overridden by a derived class; the base class
implementation calls \method{syntax_error()} to signal an error.
\end{methoddesc}
\begin{seealso}
  \seetext{The Python XML Topic Guide provides a great deal of information
           on using XML from Python and links to other sources of information
           on XML.  It's located on the Web at
           \url{http://www.python.org/topics/xml/}.}

  \seetext{The Python XML Special Interest Group is developing substantial
           support for processing XML from Python.  See
           \url{http://www.python.org/sigs/xml-sig/} for more information.}
\end{seealso}
\subsection{XML Namespaces \label{xml-namespace}}

This module has support for XML namespaces as defined in the XML
Namespaces proposed recommendation.
\indexii{XML}{namespaces}

Tag and attribute names that are defined in an XML namespace are
handled as if the name of the tag or attribute consisted of the
namespace (i.e., the URL that defines the namespace) followed by a
space and the name of the tag or attribute.  For instance, the tag
\code{<html xmlns='http://www.w3.org/TR/REC-html40'>} is treated as if
the tag name were \code{'http://www.w3.org/TR/REC-html40 html'}, and
the tag \code{<html:a href='http://frob.com'>} inside the
above-mentioned element is treated as if the tag name were
\code{'http://www.w3.org/TR/REC-html40 a'} and the attribute name as
if it were \code{'http://www.w3.org/TR/REC-html40 href'}.
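The naming scheme above can be sketched with two small helpers.  Both
\code{qualified_name()} and \code{split_qualified()} are hypothetical
names introduced for this illustration and are not part of the module;
they only show how a namespace URL and a local name combine into, and
can be recovered from, the single string the handlers receive.

```python
def qualified_name(namespace_url, local_name):
    """Join a namespace URL and a local name with a single space."""
    return namespace_url + ' ' + local_name


def split_qualified(name):
    """Split a reported name into (namespace URL or None, local name)."""
    namespace_url, separator, local_name = name.rpartition(' ')
    if not separator:
        # No space: the name was not defined in any namespace.
        return None, local_name
    return namespace_url, local_name


HTML40 = 'http://www.w3.org/TR/REC-html40'
tag_name = qualified_name(HTML40, 'a')
plain = split_qualified('br')
```

A handler mapping such as \member{elements} would then be keyed on the
combined string, for instance \code{tag_name} above, rather than on the
prefixed form that appears in the document.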

An older draft of the XML Namespaces proposal is also recognized, but