\section{Standard Module \sectcode{sgmllib}}
\stmodindex{sgmllib}
\index{SGML}

\renewcommand{\indexsubitem}{(in module sgmllib)}

This module defines a class \code{SGMLParser} which serves as the
basis for parsing text files formatted in SGML (Standard Generalized
Mark-up Language).  In fact, it does not provide a full SGML parser;
it only parses SGML insofar as it is used by HTML, and the module
exists only as a basis for the \code{htmllib} module.
\stmodindex{htmllib}

In particular, the parser is hardcoded to recognize the following
elements:

\begin{itemize}

\item
Opening and closing tags of the form
``\code{<\var{tag} \var{attr}="\var{value}" ...>}'' and
``\code{</\var{tag}>}'', respectively.

\item
Character references of the form ``\code{\&\#\var{name};}''.

\item
Entity references of the form ``\code{\&\var{name};}''.

\item
SGML comments of the form ``\code{<!--\var{text}-->}''.

\end{itemize}

The \code{SGMLParser} class must be instantiated without arguments.
It has the following interface methods:

\begin{funcdesc}{reset}{}
Reset the instance.  Loses all unprocessed data.  This is called
implicitly at instantiation time.
\end{funcdesc}

\begin{funcdesc}{setnomoretags}{}
Stop processing tags.  Treat all following input as literal input
(CDATA).  (This is only provided so the HTML tag \code{<PLAINTEXT>}
can be implemented.)
\end{funcdesc}

\begin{funcdesc}{setliteral}{}
Enter literal mode (CDATA mode).
\end{funcdesc}

\begin{funcdesc}{feed}{data}
Feed some text to the parser.  It is processed insofar as it consists
of complete elements; incomplete data is buffered until more data is
fed or \code{close()} is called.
\end{funcdesc}

\begin{funcdesc}{close}{}
Force processing of all buffered data as if it were followed by an
end-of-file mark.  This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call \code{SGMLParser.close()}.
\end{funcdesc}

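The buffering behaviour of \code{feed()} and \code{close()} can be
illustrated with a minimal sketch.  The class \code{BufferDemo} and its
\code{events} list are hypothetical names invented for this
illustration and are not part of \code{sgmllib}.  The sketch assumes
that anything between \code{<} and \code{>} is one complete element
and ignores every other kind of markup.

```python
class BufferDemo:
    """Hypothetical stand-in showing feed()/close() buffering only."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Drop any unprocessed data, as described for reset().
        self.rawdata = ""
        self.events = []   # (kind, text) pairs, recorded for inspection

    def feed(self, data):
        self.rawdata += data
        self._process(at_end=False)

    def close(self):
        # Treat the buffer as if it were followed by end-of-file.
        self._process(at_end=True)

    def _process(self, at_end):
        buf = self.rawdata
        i = 0
        while i < len(buf):
            lt = buf.find("<", i)
            if lt < 0:
                # No markup left: the rest is plain data.
                self.events.append(("data", buf[i:]))
                i = len(buf)
                break
            if lt > i:
                self.events.append(("data", buf[i:lt]))
                i = lt
            gt = buf.find(">", i)
            if gt < 0:
                # Incomplete element: keep it buffered unless input ended.
                if at_end:
                    self.events.append(("data", buf[i:]))
                    i = len(buf)
                break
            self.events.append(("tag", buf[i:gt + 1]))
            i = gt + 1
        self.rawdata = buf[i:]
```

In this sketch an element split across two \code{feed()} calls is
reported only once its closing \code{>} arrives, and a dangling
fragment is flushed as data by \code{close()}.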
\begin{funcdesc}{handle_charref}{ref}
This method is called to process a character reference of the form
``\code{\&\#\var{ref};}'' where \var{ref} is a decimal number in the
range 0--255.  It translates the character to \ASCII{} and calls the
method \code{handle_data()} with the character as argument.  If
\var{ref} is invalid or out of range, the method
\code{unknown_charref(\var{ref})} is called instead.
\end{funcdesc}

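The dispatch described for \code{handle_charref()} might look roughly
like the following.  This is a sketch under stated assumptions, not
the module's source: \code{CharrefDemo} and its \code{data} and
\code{unknown} lists are hypothetical names, and the two handlers only
record what they receive.

```python
class CharrefDemo:
    """Hypothetical stand-in mirroring the charref handling described above."""

    def __init__(self):
        self.data = []      # text passed to handle_data()
        self.unknown = []   # refs passed to unknown_charref()

    def handle_charref(self, ref):
        # ref is the text between "&#" and ";".
        try:
            n = int(ref)
        except ValueError:
            # Not a decimal number at all.
            self.unknown_charref(ref)
            return
        if not 0 <= n <= 255:
            # Outside the documented range.
            self.unknown_charref(ref)
            return
        self.handle_data(chr(n))

    def handle_data(self, data):
        self.data.append(data)

    def unknown_charref(self, ref):
        self.unknown.append(ref)
```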
\begin{funcdesc}{handle_entityref}{ref}
This method is called to process an entity reference of the form
``\code{\&\var{ref};}'' where \var{ref} is an alphabetic entity
reference.  It looks for \var{ref} in the instance (or class)
variable \code{entitydefs} which should give the entity's translation.
If a translation is found, it calls the method \code{handle_data()}
with the translation; otherwise, it calls the method
\code{unknown_entityref(\var{ref})}.
\end{funcdesc}

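The lookup in \code{entitydefs} can be sketched in the same way.  The
class \code{EntityrefDemo}, the contents of its \code{entitydefs}
table, and the recording lists are assumptions made for this
illustration; the real table is whatever the parser instance or class
provides.

```python
class EntityrefDemo:
    """Hypothetical stand-in mirroring the entity lookup described above."""

    # Assumed sample table; a class variable that an instance may shadow.
    entitydefs = {"lt": "<", "gt": ">", "amp": "&"}

    def __init__(self):
        self.data = []      # text passed to handle_data()
        self.unknown = []   # refs passed to unknown_entityref()

    def handle_entityref(self, ref):
        # Attribute lookup finds an instance table first, then the class one.
        table = self.entitydefs
        if ref in table:
            self.handle_data(table[ref])
        else:
            self.unknown_entityref(ref)

    def handle_data(self, data):
        self.data.append(data)

    def unknown_entityref(self, ref):
        self.unknown.append(ref)
```

Because \code{entitydefs} is reached through ordinary attribute
lookup, assigning a dictionary to \code{self.entitydefs} overrides the
class-level table for that instance only.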
\begin{funcdesc}{handle_data}{data}
This method is called to process arbitrary data.  It is intended to be
overridden by a derived class; the base class implementation does
nothing.
\end{funcdesc}

\begin{funcdesc}{unknown_starttag}{tag\, attributes}
This method is called to process an unknown start tag.  It is intended
to be overridden by a derived class; the base class implementation
does nothing.  The \var{attributes} argument is a list of
(\var{name}, \var{value}) pairs containing the attributes found inside
the tag's \code{<>} brackets.  The \var{name} has been translated to
lower case and double quotes and backslashes in the \var{value} have
been interpreted.  For instance, for the tag
\code{<A HREF="http://www.cwi.nl/">}, this method would be
called as \code{unknown_starttag('a', [('href', 'http://www.cwi.nl/')])}.
\end{funcdesc}

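To make the shape of the \var{attributes} argument concrete, the
following sketch turns a start tag into the (\var{tag},
\var{attributes}) pair that would be passed on.  The helper
\code{parse_starttag()} is hypothetical and much simpler than a real
parser: it assumes every attribute value is enclosed in double quotes
and does not interpret backslashes.

```python
import re

# Assumed, simplified patterns: a tag name, then name="value" attributes.
_TAG = re.compile(r'<([A-Za-z][A-Za-z0-9]*)')
_ATTR = re.compile(r'\s+([A-Za-z][A-Za-z0-9]*)\s*=\s*"([^"]*)"')


def parse_starttag(text):
    """Return (tag, attributes) with tag and attribute names lowercased."""
    m = _TAG.match(text)
    if m is None:
        raise ValueError("not a start tag: %r" % (text,))
    tag = m.group(1).lower()
    # Scan for attributes starting just after the tag name.
    attributes = [(name.lower(), value)
                  for name, value in _ATTR.findall(text, m.end())]
    return tag, attributes
```

For the tag from the description above,
\code{parse_starttag('<A HREF="http://www.cwi.nl/">')} yields
\code{('a', [('href', 'http://www.cwi.nl/')])}.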
\begin{funcdesc}{unknown_endtag}{tag}
This method is called to process an unknown end tag.  It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{funcdesc}

\begin{funcdesc}{unknown_charref}{ref}
This method is called to process an unknown character reference.  It
is intended to be overridden by a derived class; the base class
implementation does nothing.
\end{funcdesc}

\begin{funcdesc}{unknown_entityref}{ref}
This method is called to process an unknown entity reference.  It is
intended to be overridden by a derived class; the base class
implementation does nothing.
\end{funcdesc}

Apart from overriding or extending the methods listed above, derived
classes may also define methods of the following form to define
processing of specific tags.  Tag names in the input stream are
case-insensitive; the \var{tag} occurring in method names must be in
lower case:

\begin{funcdesc}{start_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag}.  It takes
precedence over \code{do_\var{tag}()}.  The \var{attributes} argument
has the same meaning as described for \code{unknown_starttag()} above.
\end{funcdesc}

\begin{funcdesc}{do_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag} that does
not come with a matching closing tag.  The \var{attributes} argument
has the same meaning as described for \code{unknown_starttag()} above.
\end{funcdesc}

\begin{funcdesc}{end_\var{tag}}{}
This method is called to process a closing tag \var{tag}.
\end{funcdesc}

Note that the parser maintains a stack of opening tags for which no
matching closing tag has been found yet.  Only tags processed by
\code{start_\var{tag}()} are pushed on this stack.  Definition of an
\code{end_\var{tag}()} method is optional for these tags.  For tags
processed by \code{do_\var{tag}()} or by \code{unknown_starttag()}, no
\code{end_\var{tag}()} method must be defined.
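
The interplay between \code{start_\var{tag}()}, \code{do_\var{tag}()},
\code{end_\var{tag}()} and the tag stack can be sketched as follows.
The names \code{DispatchDemo}, \code{dispatch_start()},
\code{dispatch_end()} and \code{log} are hypothetical, and the
treatment of closing tags (only the innermost open tag may be closed)
is an assumption of this sketch rather than documented behaviour.

```python
class DispatchDemo:
    """Hypothetical stand-in for the tag dispatch described above."""

    def __init__(self):
        self.stack = []   # open tags that were handled by start_<tag>()
        self.log = []     # record of handler calls, for inspection

    def dispatch_start(self, tag, attributes):
        tag = tag.lower()   # method names use the lower-case tag
        start = getattr(self, "start_" + tag, None)
        if start is not None:
            # start_<tag>() wins over do_<tag>() and pushes the tag.
            self.stack.append(tag)
            start(attributes)
            return
        do = getattr(self, "do_" + tag, None)
        if do is not None:
            do(attributes)   # not pushed: no closing tag is expected
            return
        self.unknown_starttag(tag, attributes)

    def dispatch_end(self, tag):
        tag = tag.lower()
        # Assumption: only the innermost open tag can be closed here.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
            end = getattr(self, "end_" + tag, None)
            if end is not None:   # end_<tag>() is optional
                end()
            return
        self.unknown_endtag(tag)

    def unknown_starttag(self, tag, attributes):
        self.log.append(("unknown_starttag", tag))

    def unknown_endtag(self, tag):
        self.log.append(("unknown_endtag", tag))


class Demo(DispatchDemo):
    def start_a(self, attributes):
        self.log.append(("start_a", attributes))

    def end_a(self):
        self.log.append(("end_a",))

    def do_br(self, attributes):
        self.log.append(("do_br", attributes))
```

With \code{Demo}, an \code{<A>} tag is pushed on the stack and later
popped by \code{</A>}, whereas \code{<BR>} is handled by
\code{do_br()} and leaves the stack untouched.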