\section{\module{sgmllib} ---
         Simple SGML parser}

\declaremodule{standard}{sgmllib}
\modulesynopsis{Only as much of an SGML parser as needed to parse HTML.}

This module defines a class \class{SGMLParser} which serves as the
basis for parsing text files formatted in SGML (Standard Generalized
Mark-up Language).  It does not provide a full SGML parser: it parses
SGML only insofar as it is used by HTML, and the module exists only as
a base for the \refmodule{htmllib}\refstmodindex{htmllib} module.

\begin{classdesc}{SGMLParser}{}
The \class{SGMLParser} class is instantiated without arguments.
The parser is hardcoded to recognize the following constructs:

\begin{itemize}
\item
Opening and closing tags of the form
\samp{<\var{tag} \var{attr}="\var{value}" ...>} and
\samp{</\var{tag}>}, respectively.

\item
Numeric character references of the form \samp{\&\#\var{name};}.

\item
Entity references of the form \samp{\&\var{name};}.

\item
SGML comments of the form \samp{<!--\var{text}-->}.  Note that
spaces, tabs, and newlines are allowed between the trailing
\samp{>} and the immediately preceding \samp{--}.

\end{itemize}
\end{classdesc}

\class{SGMLParser} instances have the following interface methods:

\begin{methoddesc}{reset}{}
Reset the instance, discarding all unprocessed data.  This method is
called implicitly at instantiation time.
\end{methoddesc}

\begin{methoddesc}{setnomoretags}{}
Stop processing tags.  Treat all following input as literal input
(CDATA).  (This is only provided so the HTML tag
\code{<PLAINTEXT>} can be implemented.)
\end{methoddesc}

\begin{methoddesc}{setliteral}{}
Enter literal mode (CDATA mode).
\end{methoddesc}

\begin{methoddesc}{feed}{data}
Feed some text to the parser.  It is processed insofar as it consists
of complete elements; incomplete data is buffered until more data is
fed or \method{close()} is called.
\end{methoddesc}

\begin{methoddesc}{close}{}
Force processing of all buffered data as if it were followed by an
end-of-file mark.  This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call the base class
\method{close()}.
\end{methoddesc}

\begin{methoddesc}{get_starttag_text}{}
Return the text of the most recently opened start tag.  This should
not normally be needed for structured processing, but may be useful in
dealing with HTML ``as deployed'' or for re-generating input with
minimal changes (whitespace between attributes can be preserved,
etc.).
\end{methoddesc}

\begin{methoddesc}{handle_starttag}{tag, method, attributes}
This method is called to handle start tags for which either a
\method{start_\var{tag}()} or \method{do_\var{tag}()} method has been
defined.  The \var{tag} argument is the name of the tag converted to
lower case, and the \var{method} argument is the bound method which
should be used to support semantic interpretation of the start tag.
The \var{attributes} argument is a list of \code{(\var{name},
\var{value})} pairs containing the attributes found inside the tag's
\code{<>} brackets.  The \var{name} has been translated to lower case
and double quotes and backslashes in the \var{value} have been
interpreted.  For instance, for the tag
\code{<A HREF="http://www.cwi.nl/">}, \var{tag} is \code{'a'} and
\var{attributes} is \code{[('href', 'http://www.cwi.nl/')]}.  The base
implementation simply calls \var{method} with \var{attributes} as the
only argument.
\end{methoddesc}

\begin{methoddesc}{handle_endtag}{tag, method}
This method is called to handle end tags for which an
\method{end_\var{tag}()} method has been defined.  The
\var{tag} argument is the name of the tag converted to lower case, and
the \var{method} argument is the bound method which should be used to
support semantic interpretation of the end tag.  If no
\method{end_\var{tag}()} method is defined for the closing element,
this handler is not called.  The base implementation simply calls
\var{method}.
\end{methoddesc}

\begin{methoddesc}{handle_data}{data}
This method is called to process arbitrary data.  It is intended to be
overridden by a derived class; the base class implementation does
nothing.
\end{methoddesc}

\begin{methoddesc}{handle_charref}{ref}
This method is called to process a character reference of the form
\samp{\&\#\var{ref};}.  In the base implementation, \var{ref} must
be a decimal number in the range 0--255.  It translates the number to
the corresponding \ASCII{} character and calls the method
\method{handle_data()} with the character as argument.  If
\var{ref} is invalid or out of range, the method
\code{unknown_charref(\var{ref})} is called to handle the error.  A
subclass must override this method to provide support for named
character entities.
\end{methoddesc}

\begin{methoddesc}{handle_entityref}{ref}
This method is called to process a general entity reference of the
form \samp{\&\var{ref};} where \var{ref} is the name of the entity.
It looks for \var{ref} in the instance (or class)
variable \member{entitydefs}, which should be a mapping from entity
names to corresponding translations.  If a translation is found, it
calls the method \method{handle_data()} with the translation;
otherwise, it calls the method \code{unknown_entityref(\var{ref})}.
The default \member{entitydefs} defines translations for
\code{\&amp;}, \code{\&apos;}, \code{\&gt;}, \code{\&lt;}, and
\code{\&quot;}.
\end{methoddesc}

\begin{methoddesc}{handle_comment}{comment}
This method is called when a comment is encountered.  The
\var{comment} argument is a string containing the text between the
\samp{<!--} and \samp{-->} delimiters, but not the delimiters
themselves.  For example, the comment \samp{<!--text-->} will
cause this method to be called with the argument \code{'text'}.  The
default method does nothing.
\end{methoddesc}

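
The relationship between a comment in the input and the string passed
to \method{handle_comment()} can be sketched with a regular
expression.  This is a minimal illustration only, not the parser's own
scanning code: the pattern \code{COMMENT} and the helper
\code{comment_texts()} are hypothetical names introduced here, and the
pattern simply assumes the comment form described in this section
(\samp{<!--}, text, \samp{--}, optional whitespace, \samp{>}).

```python
import re

# Assumed pattern, for illustration only: "<!--", the comment text, "--",
# optional spaces/tabs/newlines, then the closing ">".
COMMENT = re.compile(r'<!--(.*?)--\s*>', re.DOTALL)


def comment_texts(source):
    """Return the text between the comment delimiters for each comment.

    Hypothetical helper: it mimics the argument a handle_comment()
    override would see, without using the parser itself.
    """
    return [match.group(1) for match in COMMENT.finditer(source)]


texts = comment_texts('<p>a</p><!--text--><!-- note --  \n>')
```

Here \code{texts} is \code{['text', ' note ']}: the delimiters are
dropped, and the whitespace before the final \samp{>} of the second
comment is not part of the comment text.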
\begin{methoddesc}{report_unbalanced}{tag}
This method is called when an end tag is found which does not
correspond to any open element.
\end{methoddesc}

\begin{methoddesc}{unknown_starttag}{tag, attributes}
This method is called to process an unknown start tag.  It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_endtag}{tag}
This method is called to process an unknown end tag.  It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_charref}{ref}
This method is called to process unresolvable numeric character
references.  Refer to \method{handle_charref()} to determine what is
handled by default.  It is intended to be overridden by a derived
class; the base class implementation does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_entityref}{ref}
This method is called to process an unknown entity reference.  It is
intended to be overridden by a derived class; the base class
implementation does nothing.
\end{methoddesc}

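
The way character and entity references are resolved, and how
unresolved ones fall through to the \method{unknown_charref()} and
\method{unknown_entityref()} hooks, can be modelled in a few lines.
This is a toy sketch of the behaviour described above, not the
module's implementation: the class \code{ReferenceModel} and its
\code{data} and \code{unknown} lists are hypothetical, and the
\member{entitydefs} table shown is an assumed default.

```python
class ReferenceModel(object):
    """Toy model of reference handling; it does not use sgmllib."""

    # Assumed default table mapping entity names to translations.
    entitydefs = {'lt': '<', 'gt': '>', 'amp': '&', 'quot': '"', 'apos': "'"}

    def __init__(self):
        self.data = []     # everything passed to handle_data()
        self.unknown = []  # everything passed to an unknown_* hook

    def handle_data(self, data):
        self.data.append(data)

    def handle_charref(self, ref):
        # ref must be a decimal number in the range 0-255.
        try:
            number = int(ref)
        except ValueError:
            self.unknown_charref(ref)
            return
        if not 0 <= number <= 255:
            self.unknown_charref(ref)
            return
        self.handle_data(chr(number))

    def handle_entityref(self, ref):
        # Look the name up in entitydefs; fall back to the unknown hook.
        if ref in self.entitydefs:
            self.handle_data(self.entitydefs[ref])
        else:
            self.unknown_entityref(ref)

    def unknown_charref(self, ref):
        self.unknown.append(('charref', ref))

    def unknown_entityref(self, ref):
        self.unknown.append(('entityref', ref))


model = ReferenceModel()
model.handle_charref('65')       # valid: becomes the character 'A'
model.handle_charref('300')      # out of range
model.handle_charref('x41')      # not a decimal number
model.handle_entityref('amp')    # found in entitydefs
model.handle_entityref('nbsp')   # not in the assumed table
```

After these calls \code{model.data} holds \code{['A', '\&']} and the
three rejected references are recorded in \code{model.unknown}.  A
derived class that wants to support more entities would extend
\member{entitydefs} or override the \method{unknown_entityref()} hook.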
Apart from overriding or extending the methods listed above, derived
classes may also define methods of the following form to define
processing of specific tags.  Tag names in the input stream are case
independent; the \var{tag} occurring in method names must be in lower
case.

\begin{methoddescni}{start_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag}.  It has
preference over \method{do_\var{tag}()}.  The
\var{attributes} argument has the same meaning as described for
\method{handle_starttag()} above.
\end{methoddescni}

\begin{methoddescni}{do_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag} that does
not come with a matching closing tag.  The \var{attributes} argument
has the same meaning as described for \method{handle_starttag()} above.
\end{methoddescni}

\begin{methoddescni}{end_\var{tag}}{}
This method is called to process a closing tag \var{tag}.
\end{methoddescni}

Note that the parser maintains a stack of open elements for which no
end tag has been found yet.  Only tags processed by
\method{start_\var{tag}()} are pushed on this stack.  Definition of an
\method{end_\var{tag}()} method is optional for these tags.  For tags
processed by \method{do_\var{tag}()} or by
\method{unknown_starttag()}, no \method{end_\var{tag}()} method needs
to be defined; if one is defined, it will not be used.  If both
\method{start_\var{tag}()} and \method{do_\var{tag}()} methods exist
for a tag, the \method{start_\var{tag}()} method takes precedence.
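
The dispatch rules above can be summarised as a small model.  The
sketch below does not use the module itself; \code{DispatchModel},
\code{open_tag()}, \code{close_tag()} and the \code{log} list are
hypothetical names introduced for illustration, and the handling of
end tags is deliberately simplified (every unmatched end tag is
reported as unbalanced).

```python
class DispatchModel(object):
    """Toy model of the documented tag dispatch; not sgmllib's own code."""

    def __init__(self):
        self.stack = []  # open elements that were handled by start_<tag>()
        self.log = []    # which handler ran, recorded for illustration

    def open_tag(self, tag, attributes):
        tag = tag.lower()  # tag names in the input are case independent
        start = getattr(self, 'start_' + tag, None)
        do = getattr(self, 'do_' + tag, None)
        if start is not None:
            self.stack.append(tag)  # only start_<tag>() tags are pushed
            start(attributes)
        elif do is not None:
            do(attributes)          # no closing tag expected, nothing pushed
        else:
            self.unknown_starttag(tag, attributes)

    def close_tag(self, tag):
        tag = tag.lower()
        if tag not in self.stack:
            # Simplification: any end tag without an open element.
            self.report_unbalanced(tag)
            return
        while self.stack:
            top = self.stack.pop()
            end = getattr(self, 'end_' + top, None)
            if end is not None:     # end_<tag>() is optional
                end()
            if top == tag:
                break

    def unknown_starttag(self, tag, attributes):
        self.log.append(('unknown_starttag', tag))

    def report_unbalanced(self, tag):
        self.log.append(('report_unbalanced', tag))


class Demo(DispatchModel):
    def start_p(self, attributes):
        self.log.append(('start_p', attributes))

    def do_p(self, attributes):  # shadowed: start_p() takes precedence
        self.log.append(('do_p', attributes))

    def end_p(self):
        self.log.append(('end_p',))

    def do_br(self, attributes):
        self.log.append(('do_br', attributes))

    def end_br(self):  # never used: <br> is handled by do_br()
        self.log.append(('end_br',))


demo = Demo()
demo.open_tag('P', [('align', 'left')])
demo.open_tag('BR', [])
demo.open_tag('blink', [])
demo.close_tag('P')
demo.close_tag('BR')
```

In this run \code{start_p()} is chosen over \code{do_p()}, the
\code{BR} tag goes to \code{do_br()} without touching the stack, the
\code{blink} tag reaches \code{unknown_starttag()}, and the closing
\code{BR} is reported as unbalanced, so \code{end_br()} is never
called.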