\section{\module{xmllib} ---
         A parser for XML documents}

\declaremodule{standard}{xmllib}
\modulesynopsis{A parser for XML documents.}
\moduleauthor{Sjoerd Mullender}{Sjoerd.Mullender@cwi.nl}
\sectionauthor{Sjoerd Mullender}{Sjoerd.Mullender@cwi.nl}

\index{Extensible Markup Language}
\deprecated{2.0}{Use \refmodule{xml.sax} instead.  The newer XML
package includes full support for XML 1.0.}

\versionchanged[Added namespace support.]{1.5.2}

This module defines a class \class{XMLParser} which serves as the basis
for parsing text files formatted in XML (Extensible Markup Language).
\begin{classdesc}{XMLParser}{}
The \class{XMLParser} class must be instantiated without
arguments.\footnote{Actually, a number of keyword arguments are
recognized which influence the parser to accept certain non-standard
constructs.  The following keyword arguments are currently
recognized.  The default for all of these is \code{0} (false) except
for the last one, for which the default is \code{1} (true).
\var{accept_unquoted_attributes} (accept certain attribute values
without requiring quotes), \var{accept_missing_endtag_name} (accept
end tags that look like \code{</>}), \var{map_case} (map upper case to
lower case in tags and attributes), \var{accept_utf8} (allow UTF-8
characters in input; this is required according to the XML standard,
but Python does not as yet deal properly with these characters, so
this is not the default), \var{translate_attribute_references}
(attempt to translate character and entity references in attribute
values).}
This class provides the following interface methods and instance variables:
\begin{memberdesc}{attributes}
A mapping of element names to mappings.  The latter mapping maps
attribute names that are valid for the element to the default value of
the attribute, or to \code{None} if there is no default.  The default
value is the empty dictionary.  This variable is meant to be
overridden, not extended, since the default is shared by all instances
\begin{memberdesc}{elements}
A mapping of element names to tuples.  Each tuple contains the functions
for handling the start tag and the end tag of the element, respectively, or
\code{None} if the method \method{unknown_starttag()} or
\method{unknown_endtag()} is to be called.  The default value is the
empty dictionary.  This variable is meant to be overridden, not
extended, since the default is shared by all instances of
\begin{memberdesc}{entitydefs}
A mapping of entity names to their values.  The default value contains
definitions for \code{'lt'}, \code{'gt'}, \code{'amp'}, \code{'quot'},
\begin{methoddesc}{reset}{}
Reset the instance.  Loses all unprocessed data.  This is called
implicitly at instantiation time.
\begin{methoddesc}{setnomoretags}{}
Stop processing tags.  Treat all following input as literal input
\begin{methoddesc}{setliteral}{}
Enter literal mode (CDATA mode).  This mode is automatically exited
when the close tag matching the last unclosed open tag is encountered.
\begin{methoddesc}{feed}{data}
Feed some text to the parser.  It is processed insofar as it consists
of complete tags; incomplete data is buffered until more data is
fed or \method{close()} is called.
\begin{methoddesc}{close}{}
Force processing of all buffered data as if it were followed by an
end-of-file mark.  This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call the base class version of
\method{close()}.
\begin{methoddesc}{translate_references}{data}
Translate all entity and character references in \var{data} and
return the translated string.
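
As a rough illustration of what translating references means, the following
sketch replaces character and entity references in a string.  It is not the
module's implementation: the function, the regular expression, and the
\code{ENTITYDEFS} table (whose values are plain characters chosen for this
example) are all assumptions made for illustration.

```python
import re

# Hypothetical entity table for this sketch; only names that appear in the
# text above are used, and the values are an assumption.
ENTITYDEFS = {'lt': '<', 'gt': '>', 'amp': '&', 'quot': '"'}

# Matches &#NNN; and &#xHHH; (group 1) or &name; (group 2).
_REF = re.compile(r'&(?:#(x[0-9a-fA-F]+|[0-9]+)|([A-Za-z_:][-\w.:]*));')


def translate_references(data, entitydefs=ENTITYDEFS):
    """Return data with character and known entity references replaced."""
    def replace(match):
        charref, name = match.groups()
        if charref is not None:
            if charref[0] == 'x':
                return chr(int(charref[1:], 16))  # hexadecimal reference
            return chr(int(charref))              # decimal reference
        # Unknown entity names are left untouched in this sketch.
        return entitydefs.get(name, match.group(0))
    return _REF.sub(replace, data)
```

For instance, \code{translate_references('a \&lt; b')} would yield
\code{'a < b'} under these assumptions.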
\begin{methoddesc}{getnamespace}{}
Return a mapping of namespace abbreviations to namespace URIs that are
\begin{methoddesc}{handle_xml}{encoding, standalone}
This method is called when the \samp{<?xml ...?>} tag is processed.
The arguments are the values of the encoding and standalone attributes
in the tag.  Both encoding and standalone are optional.  The values
passed to \method{handle_xml()} default to \code{None} and the string
\code{'no'} respectively.
\begin{methoddesc}{handle_doctype}{tag, pubid, syslit, data}
This\index{DOCTYPE declaration} method is called when the
\samp{<!DOCTYPE...>} declaration is processed.  The arguments are the
tag name of the root element, the Formal Public\index{Formal Public
Identifier} Identifier (or \code{None} if not specified), the system
identifier, and the uninterpreted contents of the internal DTD subset
as a string (or \code{None} if not present).
\begin{methoddesc}{handle_starttag}{tag, method, attributes}
This method is called to handle start tags for which a start tag
handler is defined in the instance variable \member{elements}.  The
\var{tag} argument is the name of the tag, and the
\var{method} argument is the function (method) which should be used to
support semantic interpretation of the start tag.  The
\var{attributes} argument is a dictionary of attributes, the key being
the \var{name} and the value being the \var{value} of the attribute
found inside the tag's \code{<>} brackets.  Character and entity
references in the \var{value} have been interpreted.  For instance,
for the start tag \code{<A HREF="http://www.cwi.nl/">}, this method
would be called as \code{handle_starttag('A', self.elements['A'][0],
\{'HREF': 'http://www.cwi.nl/'\})}.  The base implementation simply
calls \var{method} with \var{attributes} as the only argument.
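
The dispatch described above can be sketched with a small stand-in class.
This is a toy model written for illustration, not \module{xmllib} itself:
the class name \code{MiniDispatcher}, the \code{start()} driver method, the
\code{start_a()} handler, and the \code{log} list are hypothetical.

```python
class MiniDispatcher:
    """Toy stand-in for start-tag dispatch; not the xmllib parser."""

    def __init__(self):
        self.log = []
        # Element name -> (start handler, end handler).  None means the
        # unknown_starttag()/unknown_endtag() fallback should be used.
        self.elements = {'A': (self.start_a, None)}

    def start_a(self, attributes):
        # Hypothetical handler registered for <A ...> start tags.
        self.log.append(('start_a', attributes))

    def handle_starttag(self, tag, method, attributes):
        # Mirrors the documented base behaviour: call method(attributes).
        method(attributes)

    def unknown_starttag(self, tag, attributes):
        self.log.append(('unknown_starttag', tag, attributes))

    def start(self, tag, attributes):
        # Hypothetical driver: look up the start handler and dispatch.
        method = self.elements.get(tag, (None, None))[0]
        if method is None:
            self.unknown_starttag(tag, attributes)
        else:
            self.handle_starttag(tag, method, attributes)
```

Calling \code{start('A', \{'HREF': 'http://www.cwi.nl/'\})} on an instance
reaches \code{start_a()} through \code{handle_starttag()}, while a tag with no
entry in \code{elements} falls through to \code{unknown_starttag()}.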
\begin{methoddesc}{handle_endtag}{tag, method}
This method is called to handle end tags for which an end tag handler
is defined in the instance variable \member{elements}.  The \var{tag}
argument is the name of the tag, and the \var{method} argument is the
function (method) which should be used to support semantic
interpretation of the end tag.  For instance, for the end tag
\code{</A>}, this method would be called as \code{handle_endtag('A',
self.elements['A'][1])}.  The base implementation simply calls
\begin{methoddesc}{handle_data}{data}
This method is called to process arbitrary data.  It is intended to be
overridden by a derived class; the base class implementation does
\begin{methoddesc}{handle_charref}{ref}
This method is called to process a character reference of the form
\samp{\&\#\var{ref};}.  \var{ref} can either be a decimal number,
or a hexadecimal number when preceded by an \character{x}.
In the base implementation, \var{ref} must be a number in the
range 0-255.  It translates the character to \ASCII{} and calls the
method \method{handle_data()} with the character as argument.  If
\var{ref} is invalid or out of range, the method
\code{unknown_charref(\var{ref})} is called to handle the error.  A
subclass must override this method to provide support for character
references outside of the \ASCII{} range.
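
The decoding rule above (decimal, or hexadecimal after an \character{x},
restricted to the range 0-255) might be sketched as follows.  The helper name
\code{parse_charref()} is hypothetical, and returning \code{None} stands in
for the point at which \method{unknown_charref()} would be called.

```python
def parse_charref(ref):
    """Return the character for ref, or None if invalid or out of range."""
    try:
        if ref[:1] == 'x':
            number = int(ref[1:], 16)   # hexadecimal form, e.g. 'x41'
        else:
            number = int(ref)           # decimal form, e.g. '65'
    except ValueError:
        return None                     # not a number at all
    if not 0 <= number <= 255:
        return None                     # outside the documented range
    return chr(number)
```

A subclass that needs wider character support would replace the range check
with whatever its target encoding allows.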
\begin{methoddesc}{handle_comment}{comment}
This method is called when a comment is encountered.  The
\var{comment} argument is a string containing the text between the
\samp{<!--} and \samp{-->} delimiters, but not the delimiters
themselves.  For example, the comment \samp{<!--text-->} will
cause this method to be called with the argument \code{'text'}.  The
default method does nothing.
\begin{methoddesc}{handle_cdata}{data}
This method is called when a CDATA section is encountered.  The
\var{data} argument is a string containing the text between the
\samp{<![CDATA[} and \samp{]]>} delimiters, but not the delimiters
themselves.  For example, the section \samp{<![CDATA[text]]>} will
cause this method to be called with the argument \code{'text'}.  The
default method does nothing, and is intended to be overridden.
\begin{methoddesc}{handle_proc}{name, data}
This method is called when a processing instruction (PI) is
encountered.  The \var{name} is the PI target, and the \var{data}
argument is a string containing the text between the PI target and the
closing delimiter, but not the delimiter itself.  For example, the
instruction \samp{<?XML text?>} will cause this method to be called
with the arguments \code{'XML'} and \code{'text'}.  The default method
does nothing.  Note that if a document starts with \samp{<?xml
..?>}, \method{handle_xml()} is called to handle it.
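
To make the split between PI target and data concrete, here is a minimal
sketch that separates the two parts of a processing instruction.  The helper
\code{split\_pi()} and its regular expression are hypothetical, and discarding
all whitespace between the target and the data is an assumption of this
sketch rather than documented parser behaviour.

```python
import re

# <? target, optional whitespace, data (possibly empty), ?>
_PI = re.compile(r'<\?(\S+)\s*(.*?)\?>', re.DOTALL)


def split_pi(text):
    """Return (name, data) for a processing instruction, or None."""
    match = _PI.fullmatch(text)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

With this helper, \code{split\_pi('<?XML text?>')} gives the same pair of
values that the example above passes to \method{handle_proc()}.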
\begin{methoddesc}{handle_special}{data}
This method is called when a declaration is encountered.  The
\var{data} argument is a string containing the text between the
\samp{<!} and \samp{>} delimiters, but not the delimiters
themselves.  For example, the \index{ENTITY declaration}entity
declaration \samp{<!ENTITY text>} will cause this method to be called
with the argument \code{'ENTITY text'}.  The default method does
nothing.  Note that \samp{<!DOCTYPE ...>} is handled separately if it
is located at the start of the document.
\begin{methoddesc}{syntax_error}{message}
This method is called when a syntax error is encountered.  The
\var{message} is a description of what was wrong.  The default method
raises a \exception{RuntimeError} exception.  If this method is
overridden, it is permissible for it to return.  This method is only
called when the error can be recovered from.  Unrecoverable errors
raise a \exception{RuntimeError} without first calling
\method{syntax_error()}.
\begin{methoddesc}{unknown_starttag}{tag, attributes}
This method is called to process an unknown start tag.  It is intended
to be overridden by a derived class; the base class implementation
\begin{methoddesc}{unknown_endtag}{tag}
This method is called to process an unknown end tag.  It is intended
to be overridden by a derived class; the base class implementation
\begin{methoddesc}{unknown_charref}{ref}
This method is called to process unresolvable numeric character
references.  It is intended to be overridden by a derived class; the
base class implementation does nothing.
\begin{methoddesc}{unknown_entityref}{ref}
This method is called to process an unknown entity reference.  It is
intended to be overridden by a derived class; the base class
implementation calls \method{syntax_error()} to signal an error.
\seetitle[http://www.w3.org/TR/REC-xml]{Extensible Markup Language
          (XML) 1.0}{The XML specification, published by the World
          Wide Web Consortium (W3C), defines the syntax and
          processor requirements for XML.  References to additional
          material on XML, including translations of the
          specification, are available at
          \url{http://www.w3.org/XML/}.}

\seetitle[http://www.python.org/topics/xml/]{Python and XML
          Processing}{The Python XML Topic Guide provides a great
          deal of information on using XML from Python and links to
          other sources of information on XML.}
\seetitle[http://www.python.org/sigs/xml-sig/]{SIG for XML
          Processing in Python}{The Python XML Special Interest
          Group is developing substantial support for processing XML
\subsection{XML Namespaces \label{xml-namespace}}

This module has support for XML namespaces as defined in the XML
Namespaces proposed recommendation.
\indexii{XML}{namespaces}
Tag and attribute names that are defined in an XML namespace are
handled as if the name of the tag or attribute consisted of the
namespace (i.e. the URL that defines the namespace) followed by a
space and the name of the tag or attribute.  For instance, the tag
\code{<html xmlns='http://www.w3.org/TR/REC-html40'>} is treated as if
the tag name were \code{'http://www.w3.org/TR/REC-html40 html'}, and
the tag \code{<html:a href='http://frob.com'>} inside the
above-mentioned element is treated as if the tag name were
\code{'http://www.w3.org/TR/REC-html40 a'} and the attribute name as
if it were \code{'http://www.w3.org/TR/REC-html40 href'}.
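
The naming scheme above can be illustrated with a small helper that builds
the combined name.  This is only a sketch: the function
\code{expand\_name()}, its parameters, and the assumption that the
\code{html} prefix is bound to the same URI as the default namespace are all
hypothetical and not taken from the module.

```python
HTML40 = 'http://www.w3.org/TR/REC-html40'


def expand_name(qname, namespaces, default=None):
    """Return 'URI localname' when a namespace applies, else qname."""
    if ':' in qname:
        prefix, local = qname.split(':', 1)
        uri = namespaces.get(prefix)    # prefix bound via xmlns:prefix
    else:
        local = qname
        uri = default                   # namespace in effect, if any
    if uri is None:
        return qname                    # no namespace: name is unchanged
    return uri + ' ' + local


# Assumed binding for this sketch: the prefix 'html' maps to HTML40.
bindings = {'html': HTML40}
tag_name = expand_name('html:a', bindings)
attr_name = expand_name('href', bindings, default=HTML40)
```

Here \code{tag\_name} and \code{attr\_name} come out as the two combined
names quoted in the paragraph above.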
An older draft of the XML Namespaces proposal is also recognized, but
\seetitle[http://www.w3.org/TR/REC-xml-names/]{Namespaces in XML}{
          This World-Wide Web Consortium recommendation describes the
          proper syntax and processing requirements for namespaces in