\section{\module{xml.parsers.expat} ---
         Fast XML parsing using Expat}

% Many of the attributes of the XMLParser objects are callbacks.
% Since signature information must be presented, these are described
% using the methoddesc environment.  Since they are attributes which
% are set by client code, in-text references to these attributes
% should be marked using the \member macro and should not include the
% parentheses used when marking functions and methods.

\declaremodule{standard}{xml.parsers.expat}
\modulesynopsis{An interface to the Expat non-validating XML parser.}
\moduleauthor{Paul Prescod}{paul@prescod.net}

The \module{xml.parsers.expat} module is a Python interface to the
Expat\index{Expat} non-validating XML parser.
The module provides a single extension type, \class{xmlparser}, that
represents the current state of an XML parser.  After an
\class{xmlparser} object has been created, various attributes of the object
can be set to handler functions.  When an XML document is then fed to
the parser, the handler functions are called for the character data
and markup in the XML document.

This module uses the \module{pyexpat}\refbimodindex{pyexpat} module to
provide access to the Expat parser.  Direct use of the
\module{pyexpat} module is deprecated.

This module provides one exception and one type object:

\begin{excdesc}{ExpatError}
The exception raised when Expat reports an error.  See section
\ref{expaterror-objects}, ``ExpatError Exceptions,'' for more
information on interpreting Expat errors.
\end{excdesc}

\begin{excdesc}{error}
Alias for \exception{ExpatError}.
\end{excdesc}

\begin{datadesc}{XMLParserType}
The type of the return values from the \function{ParserCreate()}
function.
\end{datadesc}

The \module{xml.parsers.expat} module contains two functions:

\begin{funcdesc}{ErrorString}{errno}
Returns an explanatory string for a given error number \var{errno}.
\end{funcdesc}

\begin{funcdesc}{ParserCreate}{\optional{encoding\optional{,
                               namespace_separator}}}
Creates and returns a new \class{xmlparser} object.
\var{encoding}, if specified, must be a string naming the encoding
used by the XML data.  Expat doesn't support as many encodings as
Python does, and its repertoire of encodings can't be extended; it
supports UTF-8, UTF-16, ISO-8859-1 (Latin1), and ASCII.  If
\var{encoding} is given it will override the implicit or explicit
encoding of the document.

Expat can optionally do XML namespace processing for you, enabled by
providing a value for \var{namespace_separator}.  The value must be a
one-character string; a \exception{ValueError} will be raised if the
string has an illegal length (\code{None} is considered the same as
omission).  When namespace processing is enabled, element type names
and attribute names that belong to a namespace will be expanded.  The
element name passed to the element handlers
\member{StartElementHandler} and \member{EndElementHandler}
will be the concatenation of the namespace URI, the namespace
separator character, and the local part of the name.  If the namespace
separator is a zero byte (\code{chr(0)}) then the namespace URI and
the local part will be concatenated without any separator.

For example, if \var{namespace_separator} is set to a space character
(\character{ }) and the following document is parsed:

\begin{verbatim}
<root xmlns    = "http://default-namespace.org/"
      xmlns:py = "http://www.python.org/ns/">
\end{verbatim}

\member{StartElementHandler} will receive the following strings:

\begin{verbatim}
http://default-namespace.org/ root
http://www.python.org/ns/ elem1
\end{verbatim}
\end{funcdesc}

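The name expansion described above can be observed with a short
script.  This is a minimal sketch: the sample document and the
\code{names} list are made up for illustration, and a space is used as
the separator.

```python
import xml.parsers.expat

names = []

def start_element(name, attrs):
    # With namespace processing enabled, a namespaced name arrives as
    # "<namespace URI><separator><local part>".
    names.append(name)

# A single space is used as the namespace separator.
parser = xml.parsers.expat.ParserCreate(namespace_separator=' ')
parser.StartElementHandler = start_element

# Illustrative document: a default namespace, a prefixed namespace, and
# one element that undeclares the default namespace.
document = (
    b'<root xmlns="http://default-namespace.org/"'
    b' xmlns:py="http://www.python.org/ns/">'
    b'<py:elem1/><elem2 xmlns=""/>'
    b'</root>'
)
parser.Parse(document, True)

for name in names:
    print(name)
```

Elements that belong to no namespace, such as \code{elem2} here, are
reported with their plain name and no separator.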
\subsection{XMLParser Objects \label{xmlparser-objects}}

\class{xmlparser} objects have the following methods:

\begin{methoddesc}[xmlparser]{Parse}{data\optional{, isfinal}}
Parses the contents of the string \var{data}, calling the appropriate
handler functions to process the parsed data.  \var{isfinal} must be
true on the final call to this method.  \var{data} can be the empty
string.
\end{methoddesc}

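Because \method{Parse()} may be called repeatedly, a document can be
fed to the parser in pieces.  The following sketch assumes the data
arrives as a list of byte chunks (the chunk boundaries are arbitrary
and chosen only for illustration) and ends with an empty final call.

```python
import xml.parsers.expat

seen = []

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = lambda name, attrs: seen.append(name)

# Chunks may split the document anywhere, even inside a tag.
chunks = [b'<log><ent', b'ry id="1"/><entry', b' id="2"/></log>']
for chunk in chunks:
    parser.Parse(chunk, False)   # more data will follow

# Empty data with a true isfinal tells the parser the input is complete.
parser.Parse(b'', True)
```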
\begin{methoddesc}[xmlparser]{ParseFile}{file}
Parse XML data reading from the object \var{file}.  \var{file} only
needs to provide the \method{read(\var{nbytes})} method, returning the
empty string when there's no more data.
\end{methoddesc}

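Any object with a suitable \method{read()} method can serve as the
input.  As a small sketch, an in-memory \class{io.BytesIO} object
stands in for a real file here; the document content is illustrative.

```python
import io
import xml.parsers.expat

closed = []

parser = xml.parsers.expat.ParserCreate()
# The handler receives one argument, the element name.
parser.EndElementHandler = closed.append

# BytesIO provides read(nbytes) and returns an empty result at EOF.
source = io.BytesIO(b'<a><b/><c/></a>')
parser.ParseFile(source)
```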
\begin{methoddesc}[xmlparser]{SetBase}{base}
Sets the base to be used for resolving relative URIs in system
identifiers in declarations.  Resolving relative identifiers is left
to the application: this value will be passed through as the
\var{base} argument to the \member{ExternalEntityRefHandler},
\member{NotationDeclHandler}, and
\member{UnparsedEntityDeclHandler} functions.
\end{methoddesc}

\begin{methoddesc}[xmlparser]{GetBase}{}
Returns a string containing the base set by a previous call to
\method{SetBase()}, or \code{None} if
\method{SetBase()} hasn't been called.
\end{methoddesc}

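The two methods form a simple setter and getter pair.  A brief sketch,
using a made-up base URI:

```python
import xml.parsers.expat

parser = xml.parsers.expat.ParserCreate()

before = parser.GetBase()                  # nothing has been set yet
parser.SetBase('http://example.org/docs/') # hypothetical base URI
after = parser.GetBase()
```

The parser only stores the value; combining it with a relative system
identifier remains the application's job.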
\begin{methoddesc}[xmlparser]{GetInputContext}{}
Returns the input data that generated the current event as a string.
The data is in the encoding of the entity which contains the text.
When called while an event handler is not active, the return value is
\code{None}.
\end{methoddesc}

\begin{methoddesc}[xmlparser]{ExternalEntityParserCreate}{context\optional{,
                                                          encoding}}
Create a ``child'' parser which can be used to parse an external
parsed entity referred to by content parsed by the parent parser.  The
\var{context} parameter should be the string passed to the
\member{ExternalEntityRefHandler} handler function, described below.
The child parser is created with the \member{ordered_attributes},
\member{returns_unicode} and \member{specified_attributes} set to the
values of this parser.
\end{methoddesc}

\class{xmlparser} objects have the following attributes:

\begin{memberdesc}[xmlparser]{ordered_attributes}
Setting this attribute to a non-zero integer causes the attributes to
be reported as a list rather than a dictionary.  The attributes are
presented in the order found in the document text.  For each
attribute, two list entries are presented: the attribute name and the
attribute value.  (Older versions of this module also used this
format.)  By default, this attribute is false; it may be changed at
any time.
\end{memberdesc}

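The flat name/value layout is easiest to see side by side with the
source.  A minimal sketch with an illustrative one-element document:

```python
import xml.parsers.expat

captured = []

parser = xml.parsers.expat.ParserCreate()
parser.ordered_attributes = 1
parser.StartElementHandler = (
    lambda name, attrs: captured.append((name, attrs)))

# The attributes are deliberately written out of alphabetical order.
parser.Parse(b'<point y="2" x="1"/>', True)

name, attrs = captured[0]
# attrs alternates names and values in document order; pair them up.
pairs = list(zip(attrs[0::2], attrs[1::2]))
```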
\begin{memberdesc}[xmlparser]{returns_unicode}
If this attribute is set to a non-zero integer, the handler functions
will be passed Unicode strings.  If \member{returns_unicode} is 0,
8-bit strings containing UTF-8 encoded data will be passed to the
handlers.  This attribute can be changed at any time to affect the
result type.
\end{memberdesc}

\begin{memberdesc}[xmlparser]{specified_attributes}
If set to a non-zero integer, the parser will report only those
attributes which were specified in the document instance and not those
which were derived from attribute declarations.  Applications which
set this need to be especially careful to use what additional
information is available from the declarations as needed to comply
with the standards for the behavior of XML processors.  By default,
this attribute is false; it may be changed at any time.
\end{memberdesc}

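The difference shows up when a declaration supplies a default value.
In this sketch the document, the helper \function{attributes_seen()}
and its argument name are all illustrative; the internal subset
declares a default for \code{kind}, which the element itself does not
specify.

```python
import xml.parsers.expat

DOCUMENT = (b'<!DOCTYPE r [<!ATTLIST r kind CDATA "default">]>'
            b'<r id="1"/>')

def attributes_seen(specified_only):
    """Parse DOCUMENT and return the attributes reported for <r>."""
    result = []
    parser = xml.parsers.expat.ParserCreate()
    parser.specified_attributes = specified_only
    parser.StartElementHandler = lambda name, attrs: result.append(attrs)
    parser.Parse(DOCUMENT, True)
    return result[0]

all_attrs = attributes_seen(0)       # includes the declared default
explicit_attrs = attributes_seen(1)  # only what the instance specified
```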
The following attributes contain values relating to the most recent
error encountered by an \class{xmlparser} object, and will only have
correct values once a call to \method{Parse()} or \method{ParseFile()}
has raised a \exception{xml.parsers.expat.ExpatError} exception.

\begin{memberdesc}[xmlparser]{ErrorByteIndex}
Byte index at which an error occurred.
\end{memberdesc}

\begin{memberdesc}[xmlparser]{ErrorCode}
Numeric code specifying the problem.  This value can be passed to the
\function{ErrorString()} function, or compared to one of the constants
defined in the \code{errors} object.
\end{memberdesc}

\begin{memberdesc}[xmlparser]{ErrorColumnNumber}
Column number at which an error occurred.
\end{memberdesc}

\begin{memberdesc}[xmlparser]{ErrorLineNumber}
Line number at which an error occurred.
\end{memberdesc}

Here is the list of handlers that can be set.  To set a handler on an
\class{xmlparser} object \var{o}, use
\code{\var{o}.\var{handlername} = \var{func}}.  \var{handlername} must
be taken from the following list, and \var{func} must be a callable
object accepting the correct number of arguments.  The arguments are
all strings, unless otherwise stated.

\begin{methoddesc}[xmlparser]{XmlDeclHandler}{version, encoding, standalone}
Called when the XML declaration is parsed.  The XML declaration is the
(optional) declaration of the applicable version of the XML
recommendation, the encoding of the document text, and an optional
``standalone'' declaration.  \var{version} and \var{encoding} will be
strings of the type dictated by the \member{returns_unicode}
attribute, and \var{standalone} will be \code{1} if the document is
declared standalone, \code{0} if it is declared not to be standalone,
or \code{-1} if the standalone clause was omitted.
This is only available with Expat version 1.95.0 or newer.
\end{methoddesc}

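The three \var{standalone} outcomes are easy to confuse, so a short
sketch may help.  The helper \function{read_declaration()} and both
sample documents are illustrative only.

```python
import xml.parsers.expat

def read_declaration(document):
    """Return the (version, encoding, standalone) triple reported."""
    found = []
    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = (
        lambda version, encoding, standalone:
            found.append((version, encoding, standalone)))
    parser.Parse(document, True)
    return found[0]

declared = read_declaration(
    b'<?xml version="1.0" encoding="utf-8" standalone="yes"?><r/>')
omitted = read_declaration(b'<?xml version="1.0"?><r/>')
```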
\begin{methoddesc}[xmlparser]{StartDoctypeDeclHandler}{doctypeName,
                                                       systemId, publicId,
                                                       has_internal_subset}
Called when Expat begins parsing the document type declaration
(\code{<!DOCTYPE \ldots}).  The \var{doctypeName} is provided exactly
as presented.  The \var{systemId} and \var{publicId} parameters give
the system and public identifiers if specified, or \code{None} if
omitted.  \var{has_internal_subset} will be true if the document
contains an internal document declaration subset.
This requires Expat version 1.2 or newer.
\end{methoddesc}

\begin{methoddesc}[xmlparser]{EndDoctypeDeclHandler}{}
Called when Expat is done parsing the document type declaration.
This requires Expat version 1.2 or newer.
\end{methoddesc}

\begin{methoddesc}[xmlparser]{ElementDeclHandler}{name, model}
Called once for each element type declaration.  \var{name} is the name
of the element type, and \var{model} is a representation of the
content model.
\end{methoddesc}

\begin{methoddesc}[xmlparser]{AttlistDeclHandler}{elname, attname,
                                                  type, default, required}
Called for each declared attribute for an element type.  If an
attribute list declaration declares three attributes, this handler is
called three times, once for each attribute.  \var{elname} is the name
of the element to which the declaration applies and \var{attname} is
the name of the attribute declared.  The attribute type is a string
passed as \var{type}; the possible values are \code{'CDATA'},
\code{'ID'}, \code{'IDREF'}, ...
\var{default} gives the default value for the attribute used when the
attribute is not specified by the document instance, or \code{None} if
there is no default value (\code{\#IMPLIED} values).  If the attribute
is required to be given in the document instance, \var{required} will
be true.
This requires Expat version 1.95.0 or newer.
\end{methoddesc}

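To make the ``one call per attribute'' behaviour concrete, the sketch
below declares three attributes in a single attribute list.  The
document and the \code{declared} list are illustrative; the handler's
third parameter is renamed \code{attr_type} only to avoid shadowing
the built-in \function{type()}.

```python
import xml.parsers.expat

declared = []

def attlist_decl(elname, attname, attr_type, default, required):
    # Normalize the required flag to a bool for easier inspection.
    declared.append((elname, attname, attr_type, default, bool(required)))

parser = xml.parsers.expat.ParserCreate()
parser.AttlistDeclHandler = attlist_decl
parser.Parse(
    b'<!DOCTYPE r [<!ATTLIST r'
    b' a CDATA #REQUIRED'
    b' b CDATA "x"'
    b' c ID #IMPLIED>]>'
    b'<r a="1"/>',
    True)

for entry in declared:
    print(entry)
```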
\begin{methoddesc}[xmlparser]{StartElementHandler}{name, attributes}
Called for the start of every element.  \var{name} is a string
containing the element name, and \var{attributes} is a dictionary
mapping attribute names to their values.
\end{methoddesc}

\begin{methoddesc}[xmlparser]{EndElementHandler}{name}
Called for the end of every element.
\end{methoddesc}

\begin{methoddesc}[xmlparser]{ProcessingInstructionHandler}{target, data}
Called for every processing instruction.
\end{methoddesc}

\begin{methoddesc}[xmlparser]{CharacterDataHandler}{data}
Called for character data.  This will be called for normal character
data, CDATA marked content, and ignorable whitespace.  Applications
which must distinguish these cases can use the
\member{StartCdataSectionHandler}, \member{EndCdataSectionHandler},
and \member{ElementDeclHandler} callbacks to collect the required
information.
\end{methoddesc}

\begin{methoddesc}[xmlparser]{UnparsedEntityDeclHandler}{entityName, base,
                                                         systemId, publicId,
                                                         notationName}
Called for unparsed (NDATA) entity declarations.  This is only present
for version 1.2 of the Expat library; for more recent versions, use
\member{EntityDeclHandler} instead.  (The underlying function in the
Expat library has been declared obsolete.)
\end{methoddesc}

\begin{methoddesc}[xmlparser]{EntityDeclHandler}{entityName,
                                                 is_parameter_entity, value,
                                                 base, systemId,
                                                 publicId,
                                                 notationName}
Called for all entity declarations.  For parameter and internal
entities, \var{value} will be a string giving the declared contents
of the entity; this will be \code{None} for external entities.  The
\var{notationName} parameter will be \code{None} for parsed entities,
and the name of the notation for unparsed entities.
\var{is_parameter_entity} will be true if the entity is a parameter
entity or false for general entities (most applications only need to
be concerned with general entities).
This is only available starting with version 1.95.0 of the Expat
library.
\end{methoddesc}

\begin{methoddesc}[xmlparser]{NotationDeclHandler}{notationName, base,
                                                   systemId, publicId}
Called for notation declarations.  \var{notationName}, \var{base},
\var{systemId}, and \var{publicId} are strings if given.  If the
public identifier is omitted, \var{publicId} will be \code{None}.
\end{methoddesc}

\begin{methoddesc}[xmlparser]{StartNamespaceDeclHandler}{prefix, uri}
Called when an element contains a namespace declaration.  Namespace
declarations are processed before the \member{StartElementHandler} is
called for the element on which declarations are placed.
\end{methoddesc}

\begin{methoddesc}[xmlparser]{EndNamespaceDeclHandler}{prefix}
Called when the closing tag is reached for an element
that contained a namespace declaration.  This is called once for each
namespace declaration on the element, in the reverse of the order in
which the \member{StartNamespaceDeclHandler} was called to indicate
the start of each namespace declaration's scope.  Calls to this
handler are made after the corresponding \member{EndElementHandler}
for the end of the element.
\end{methoddesc}

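The relative ordering of the namespace and element callbacks can be
recorded directly.  This sketch assumes namespace processing has been
enabled by passing a separator to \function{ParserCreate()}; the
document and the event labels are illustrative.

```python
import xml.parsers.expat

events = []

parser = xml.parsers.expat.ParserCreate(namespace_separator=' ')
parser.StartNamespaceDeclHandler = (
    lambda prefix, uri: events.append(('start-ns', prefix, uri)))
parser.EndNamespaceDeclHandler = (
    lambda prefix: events.append(('end-ns', prefix)))
parser.StartElementHandler = (
    lambda name, attrs: events.append(('start', name)))
parser.EndElementHandler = (
    lambda name: events.append(('end', name)))

parser.Parse(b'<r xmlns:p="urn:example"><p:c/></r>', True)

for event in events:
    print(event)
```

The namespace scope opens before the start of \code{r} is reported and
closes only after its end has been reported.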
\begin{methoddesc}[xmlparser]{CommentHandler}{data}
Called for comments.  \var{data} is the text of the comment, excluding
the leading `\code{<!-}\code{-}' and trailing `\code{-}\code{->}'.
\end{methoddesc}

\begin{methoddesc}[xmlparser]{StartCdataSectionHandler}{}
Called at the start of a CDATA section.  This and
\member{EndCdataSectionHandler} are needed to be able to identify
the syntactical start and end for CDATA sections.
\end{methoddesc}

\begin{methoddesc}[xmlparser]{EndCdataSectionHandler}{}
Called at the end of a CDATA section.
\end{methoddesc}

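One way to tell CDATA content from ordinary character data is to track
the two CDATA callbacks with a flag.  The following is a sketch of that
idea; the \code{state} dictionary, the two result lists and the
document are illustrative.

```python
import xml.parsers.expat

state = {'in_cdata': False}
plain, cdata = [], []

def start_cdata():
    state['in_cdata'] = True

def end_cdata():
    state['in_cdata'] = False

def char_data(data):
    # Route each piece of text according to where it was found.
    if state['in_cdata']:
        cdata.append(data)
    else:
        plain.append(data)

parser = xml.parsers.expat.ParserCreate()
parser.StartCdataSectionHandler = start_cdata
parser.EndCdataSectionHandler = end_cdata
parser.CharacterDataHandler = char_data
parser.Parse(b'<r>plain text<![CDATA[<raw> & unescaped]]></r>', True)

# Text may be delivered in several pieces, so join before using it.
plain_text = ''.join(plain)
cdata_text = ''.join(cdata)
```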
\begin{methoddesc}[xmlparser]{DefaultHandler}{data}
Called for any characters in the XML document for
which no applicable handler has been specified.  This means
characters that are part of a construct which could be reported, but
for which no handler has been supplied.
\end{methoddesc}

\begin{methoddesc}[xmlparser]{DefaultHandlerExpand}{data}
This is the same as the \member{DefaultHandler},
but doesn't inhibit expansion of internal entities.
The entity reference will not be passed to the default handler.
\end{methoddesc}

\begin{methoddesc}[xmlparser]{NotStandaloneHandler}{}
Called if the XML document hasn't been declared as being a standalone
document.  This happens when there is an external subset or a
reference to a parameter entity, but the XML declaration does not set
standalone to \code{yes}.  If this handler returns \code{0},
then the parser will raise an \constant{XML_ERROR_NOT_STANDALONE}
error.  If this handler is not set, no exception is raised by the
parser for this condition.
\end{methoddesc}

\begin{methoddesc}[xmlparser]{ExternalEntityRefHandler}{context, base,
                                                        systemId, publicId}
Called for references to external entities.  \var{base} is the current
base, as set by a previous call to \method{SetBase()}.  The public and
system identifiers, \var{systemId} and \var{publicId}, are strings if
given; if the public identifier is not given, \var{publicId} will be
\code{None}.  The \var{context} value is opaque and should only be
used as described below.

For external entities to be parsed, this handler must be implemented.
It is responsible for creating the sub-parser using
\code{ExternalEntityParserCreate(\var{context})}, initializing it with
the appropriate callbacks, and parsing the entity.  This handler
should return an integer; if it returns \code{0}, the parser will
raise an \constant{XML_ERROR_EXTERNAL_ENTITY_HANDLING} error,
otherwise parsing will continue.

If this handler is not provided, external entities are reported by the
\member{DefaultHandler} callback, if provided.
\end{methoddesc}

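The steps described for this handler (create the sub-parser, install
callbacks, parse the entity, return a non-zero integer) can be sketched
as follows.  The \code{ENTITY_SOURCES} dictionary is a hypothetical
stand-in for whatever mechanism an application uses to fetch entity
content; a real application would resolve \var{systemId} against
\var{base} itself.

```python
import xml.parsers.expat

# Hypothetical in-memory replacement for fetching external entities.
ENTITY_SOURCES = {'chapter1.xml': b'<title>Intro</title>'}

started = []
requested = []

def record_start(name, attrs):
    started.append(name)

parent = xml.parsers.expat.ParserCreate()

def external_entity_ref(context, base, system_id, public_id):
    requested.append(system_id)
    # Create the child parser from the opaque context value.
    child = parent.ExternalEntityParserCreate(context)
    child.StartElementHandler = record_start
    child.Parse(ENTITY_SOURCES[system_id], True)
    return 1  # non-zero: let the parent continue parsing

parent.StartElementHandler = record_start
parent.ExternalEntityRefHandler = external_entity_ref
parent.Parse(
    b'<!DOCTYPE book [<!ENTITY ch1 SYSTEM "chapter1.xml">]>'
    b'<book>&ch1;</book>',
    True)
```

Elements from the entity are reported in document position, between
the start and end of the element that referenced it.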
\subsection{ExpatError Exceptions \label{expaterror-objects}}
\sectionauthor{Fred L. Drake, Jr.}{fdrake@acm.org}

\exception{ExpatError} exceptions have a number of interesting
attributes:

\begin{memberdesc}[ExpatError]{code}
Expat's internal error number for the specific error.  This will
match one of the constants defined in the \code{errors} object from
this module.
\end{memberdesc}

\begin{memberdesc}[ExpatError]{lineno}
Line number on which the error was detected.  The first line is
numbered \code{1}.
\end{memberdesc}

\begin{memberdesc}[ExpatError]{offset}
Character offset into the line where the error occurred.  The first
column is numbered \code{0}.
\end{memberdesc}

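A typical use of these attributes is to build a readable diagnostic.
The sketch below parses a deliberately malformed, illustrative document
and turns the numeric code into text with \function{ErrorString()}.

```python
import xml.parsers.expat

parser = xml.parsers.expat.ParserCreate()
error = None
try:
    # The end tag on line 3 does not match the open <b> element.
    parser.Parse(b'<a>\n<b>\n</a>', True)
except xml.parsers.expat.ExpatError as exc:
    error = exc

# Translate the numeric code into Expat's explanatory string.
message = xml.parsers.expat.ErrorString(error.code)
print('%s at line %d, column %d' % (message, error.lineno, error.offset))
```

The same values are also available afterwards on the parser object as
\member{ErrorCode}, \member{ErrorLineNumber} and
\member{ErrorColumnNumber}.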
\subsection{Example \label{expat-example}}

The following program defines three handlers that just print out their
arguments.

\begin{verbatim}
import xml.parsers.expat

# 3 handler functions
def start_element(name, attrs):
    print 'Start element:', name, attrs
def end_element(name):
    print 'End element:', name
def char_data(data):
    print 'Character data:', repr(data)

p = xml.parsers.expat.ParserCreate()

p.StartElementHandler = start_element
p.EndElementHandler = end_element
p.CharacterDataHandler = char_data

p.Parse("""<?xml version="1.0"?>
<parent id="top"><child1 name="paul">Text goes here</child1>
<child2 name="fred">More text</child2>
</parent>""", 1)
\end{verbatim}

The output from this program is:

\begin{verbatim}
Start element: parent {'id': 'top'}
Start element: child1 {'name': 'paul'}
Character data: 'Text goes here'
End element: child1
Character data: '\n'
Start element: child2 {'name': 'fred'}
Character data: 'More text'
End element: child2
Character data: '\n'
End element: parent
\end{verbatim}

\subsection{Content Model Descriptions \label{expat-content-models}}
\sectionauthor{Fred L. Drake, Jr.}{fdrake@acm.org}

Content models are described using nested tuples.  Each tuple
contains four values: the type, the quantifier, the name, and a tuple
of children.  Children are simply additional content model
descriptions.

The values of the first two fields are constants defined in the
\code{model} object of the \module{xml.parsers.expat} module.  These
constants can be collected in two groups: the model type group and the
quantifier group.

The constants in the model type group are:

\begin{datadescni}{XML_CTYPE_ANY}
The element named by the model name was declared to have a content
model of \code{ANY}.
\end{datadescni}

\begin{datadescni}{XML_CTYPE_CHOICE}
The named element allows a choice from a number of options; this is
used for content models such as \code{(A | B | C)}.
\end{datadescni}

\begin{datadescni}{XML_CTYPE_EMPTY}
Elements which are declared to be \code{EMPTY} have this model type.
\end{datadescni}

\begin{datadescni}{XML_CTYPE_MIXED}
\end{datadescni}

\begin{datadescni}{XML_CTYPE_NAME}
\end{datadescni}

\begin{datadescni}{XML_CTYPE_SEQ}
Models which represent a series of models which follow one after the
other are indicated with this model type.  This is used for models
such as \code{(A, B, C)}.
\end{datadescni}

The constants in the quantifier group are:

\begin{datadescni}{XML_CQUANT_NONE}
No modifier is given, so it can appear exactly once, as for \code{A}.
\end{datadescni}

\begin{datadescni}{XML_CQUANT_OPT}
The model is optional: it can appear once or not at all, as for
\code{A?}.
\end{datadescni}

\begin{datadescni}{XML_CQUANT_PLUS}
The model must occur one or more times (like \code{A+}).
\end{datadescni}

\begin{datadescni}{XML_CQUANT_REP}
The model must occur zero or more times, as for \code{A*}.
\end{datadescni}

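To see how the type and quantifier constants fit into the nested
tuples, a content model can be captured with
\member{ElementDeclHandler} and taken apart.  This is a sketch with an
illustrative declaration, \code{(a, b?)}; the \code{models} dictionary
is only a convenience for the example.

```python
import xml.parsers.expat

model_constants = xml.parsers.expat.model
models = {}

def element_decl(name, model):
    models[name] = model

parser = xml.parsers.expat.ParserCreate()
parser.ElementDeclHandler = element_decl
parser.Parse(b'<!DOCTYPE r [<!ELEMENT r (a, b?)>]><r/>', True)

# Each description is (type, quantifier, name, children).
model_type, quantifier, name, children = models['r']

# Collect (name, quantifier) for each child of the sequence.
child_summary = [(child[2], child[1]) for child in children]
```

The outer tuple describes the sequence as a whole, and each child is a
further four-tuple of the same shape.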
\subsection{Expat error constants \label{expat-errors}}

The following constants are provided in the \code{errors} object of
the \refmodule{xml.parsers.expat} module.  These constants are useful
in interpreting some of the attributes of the \exception{ExpatError}
exception objects raised when an error has occurred.

The \code{errors} object has the following attributes:

\begin{datadescni}{XML_ERROR_ASYNC_ENTITY}
\end{datadescni}

\begin{datadescni}{XML_ERROR_ATTRIBUTE_EXTERNAL_ENTITY_REF}
An entity reference in an attribute value referred to an external
entity instead of an internal entity.
\end{datadescni}

\begin{datadescni}{XML_ERROR_BAD_CHAR_REF}
A character reference referred to a character which is illegal in XML
(for example, character \code{0}, or `\code{\&\#0;}').
\end{datadescni}

\begin{datadescni}{XML_ERROR_BINARY_ENTITY_REF}
An entity reference referred to an entity which was declared with a
notation, so cannot be parsed.
\end{datadescni}

\begin{datadescni}{XML_ERROR_DUPLICATE_ATTRIBUTE}
An attribute was used more than once in a start tag.
\end{datadescni}

\begin{datadescni}{XML_ERROR_INCORRECT_ENCODING}
\end{datadescni}

\begin{datadescni}{XML_ERROR_INVALID_TOKEN}
Raised when an input byte could not properly be assigned to a
character; for example, a NUL byte (value \code{0}) in a UTF-8 input
stream.
\end{datadescni}

\begin{datadescni}{XML_ERROR_JUNK_AFTER_DOC_ELEMENT}
Something other than whitespace occurred after the document element.
\end{datadescni}

\begin{datadescni}{XML_ERROR_MISPLACED_XML_PI}
An XML declaration was found somewhere other than the start of the
input data.
\end{datadescni}

\begin{datadescni}{XML_ERROR_NO_ELEMENTS}
The document contains no elements (XML requires all documents to
contain exactly one top-level element).
\end{datadescni}

\begin{datadescni}{XML_ERROR_NO_MEMORY}
Expat was not able to allocate memory internally.
\end{datadescni}

\begin{datadescni}{XML_ERROR_PARAM_ENTITY_REF}
A parameter entity reference was found where it was not allowed.
\end{datadescni}

\begin{datadescni}{XML_ERROR_PARTIAL_CHAR}
\end{datadescni}

\begin{datadescni}{XML_ERROR_RECURSIVE_ENTITY_REF}
An entity reference contained another reference to the same entity,
possibly via a different name, and possibly indirectly.
\end{datadescni}

\begin{datadescni}{XML_ERROR_SYNTAX}
Some unspecified syntax error was encountered.
\end{datadescni}

\begin{datadescni}{XML_ERROR_TAG_MISMATCH}
An end tag did not match the innermost open start tag.
\end{datadescni}

\begin{datadescni}{XML_ERROR_UNCLOSED_TOKEN}
Some token (such as a start tag) was not closed before the end of the
stream or the next token was encountered.
\end{datadescni}

\begin{datadescni}{XML_ERROR_UNDEFINED_ENTITY}
A reference was made to an entity which was not defined.
\end{datadescni}

\begin{datadescni}{XML_ERROR_UNKNOWN_ENCODING}
The document encoding is not supported by Expat.
\end{datadescni}