\section{\module{sgmllib} ---
         Simple SGML parser}

\declaremodule{standard}{sgmllib}
\modulesynopsis{Only as much of an SGML parser as needed to parse HTML.}

This module defines a class \class{SGMLParser} which serves as the
basis for parsing text files formatted in SGML (Standard Generalized
Mark-up Language).  It does not provide a full SGML parser: it parses
SGML only insofar as it is used by HTML, and the module exists only as
a base for the \refmodule{htmllib}\refstmodindex{htmllib} module.

\begin{classdesc}{SGMLParser}{}
The \class{SGMLParser} class is instantiated without arguments.
The parser is hardcoded to recognize the following constructs:

\begin{itemize}
\item
Opening and closing tags of the form
\samp{<\var{tag} \var{attr}="\var{value}" ...>} and
\samp{</\var{tag}>}, respectively.

\item
Numeric character references of the form \samp{\&\#\var{name};}.

\item
Entity references of the form \samp{\&\var{name};}.

\item
SGML comments of the form \samp{<!--\var{text}-->}.  Note that
spaces, tabs, and newlines are allowed between the trailing
\samp{>} and the immediately preceding \samp{--}.

\end{itemize}
\end{classdesc}

\class{SGMLParser} instances have the following interface methods:

\begin{methoddesc}{reset}{}
Reset the instance.  Loses all unprocessed data.  This is called
implicitly at instantiation time.
\end{methoddesc}

\begin{methoddesc}{setnomoretags}{}
Stop processing tags.  Treat all following input as literal input
(CDATA).  (This is only provided so the HTML tag
\code{<PLAINTEXT>} can be implemented.)
\end{methoddesc}

\begin{methoddesc}{setliteral}{}
Enter literal mode (CDATA mode).
\end{methoddesc}

\begin{methoddesc}{feed}{data}
Feed some text to the parser.  It is processed insofar as it consists
of complete elements; incomplete data is buffered until more data is
fed or \method{close()} is called.
\end{methoddesc}

\begin{methoddesc}{close}{}
Force processing of all buffered data as if it were followed by an
end-of-file mark.  This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call the base class \method{close()}.
\end{methoddesc}

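The buffering behaviour of \method{feed()} and \method{close()} can be
illustrated with a small stand-alone model.  This is a minimal sketch,
not the module's implementation: the class \code{BufferingSketch}, its
\code{events} list, and the rule that a tag is complete once its closing
\code{>} has arrived are all assumptions made for illustration.

```python
class BufferingSketch:
    """Toy model of feed()/close() buffering; NOT the real SGMLParser."""

    def __init__(self):
        self.rawdata = ""   # text received but not yet processed
        self.events = []    # (kind, text) tuples recorded for inspection

    def feed(self, data):
        self.rawdata += data
        self._process(at_eof=False)

    def close(self):
        # Treat whatever is still buffered as if end-of-file followed it.
        self._process(at_eof=True)

    def _process(self, at_eof):
        buf = self.rawdata
        pos = 0
        while pos < len(buf):
            lt = buf.find("<", pos)
            if lt < 0:
                # No markup left: everything remaining is plain data.
                self.events.append(("data", buf[pos:]))
                pos = len(buf)
                break
            if lt > pos:
                self.events.append(("data", buf[pos:lt]))
                pos = lt
            gt = buf.find(">", lt)
            if gt < 0:
                break  # incomplete tag: keep it buffered (assumed rule)
            self.events.append(("tag", buf[lt:gt + 1]))
            pos = gt + 1
        if at_eof and pos < len(buf):
            # At end of input the leftover is flushed as data (assumption).
            self.events.append(("data", buf[pos:]))
            pos = len(buf)
        self.rawdata = buf[pos:]


parser = BufferingSketch()
parser.feed("hi <a hr")      # the tag is incomplete, so it stays buffered
pending = parser.rawdata
parser.feed('ef="x">')       # now the tag is complete and is processed
parser.feed("<b")
parser.close()               # flushes the dangling "<b"
```

A derived class that redefines \method{close()} would do its extra work
and then call the base class version, so that the final flush still happens.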
\begin{methoddesc}{handle_starttag}{tag, method, attributes}
This method is called to handle start tags for which either a
\method{start_\var{tag}()} or \method{do_\var{tag}()} method has been
defined.  The \var{tag} argument is the name of the tag converted to
lower case, and the \var{method} argument is the bound method which
should be used to support semantic interpretation of the start tag.
The \var{attributes} argument is a list of \code{(\var{name},
\var{value})} pairs containing the attributes found inside the tag's
\code{<>} brackets.  The \var{name} has been translated to lower case
and double quotes and backslashes in the \var{value} have been interpreted.
For instance, for the tag \code{<A HREF="http://www.cwi.nl/">},
\var{tag} would be \code{'a'} and \var{attributes} would be
\samp{[('href', 'http://www.cwi.nl/')]}.  The base implementation simply
calls \var{method} with \var{attributes} as the only argument.
\end{methoddesc}

\begin{methoddesc}{handle_endtag}{tag, method}
This method is called to handle end tags for which an
\method{end_\var{tag}()} method has been defined.  The
\var{tag} argument is the name of the tag converted to lower case, and
the \var{method} argument is the bound method which should be used to
support semantic interpretation of the end tag.  If no
\method{end_\var{tag}()} method is defined for the closing element,
this handler is not called.  The base implementation simply calls
\var{method}.
\end{methoddesc}

\begin{methoddesc}{handle_data}{data}
This method is called to process arbitrary data.  It is intended to be
overridden by a derived class; the base class implementation does
nothing.
\end{methoddesc}

\begin{methoddesc}{handle_charref}{ref}
This method is called to process a character reference of the form
\samp{\&\#\var{ref};}.  In the base implementation, \var{ref} must
be a decimal number in the
range 0-255.  It translates the character to \ASCII{} and calls the
method \method{handle_data()} with the character as argument.  If
\var{ref} is invalid or out of range, the method
\code{unknown_charref(\var{ref})} is called to handle the error.  A
subclass must override this method to provide support for named
character entities.
\end{methoddesc}

\begin{methoddesc}{handle_entityref}{ref}
This method is called to process a general entity reference of the
form \samp{\&\var{ref};} where \var{ref} is a general entity
reference.  It looks for \var{ref} in the instance (or class)
variable \member{entitydefs} which should be a mapping from entity
names to corresponding translations.  If a translation is found, it
calls the method \method{handle_data()} with the translation;
otherwise, it calls the method \code{unknown_entityref(\var{ref})}.
The default \member{entitydefs} defines translations for
\code{\&amp;}, \code{\&apos;}, \code{\&gt;}, \code{\&lt;}, and
\code{\&quot;}.
\end{methoddesc}

\begin{methoddesc}{handle_comment}{comment}
This method is called when a comment is encountered.  The
\var{comment} argument is a string containing the text between the
\samp{<!--} and \samp{-->} delimiters, but not the delimiters
themselves.  For example, the comment \samp{<!--text-->} will
cause this method to be called with the argument \code{'text'}.  The
default method does nothing.
\end{methoddesc}

\begin{methoddesc}{report_unbalanced}{tag}
This method is called when an end tag is found which does not
correspond to any open element.
\end{methoddesc}

\begin{methoddesc}{unknown_starttag}{tag, attributes}
This method is called to process an unknown start tag.  It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_endtag}{tag}
This method is called to process an unknown end tag.  It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_charref}{ref}
This method is called to process unresolvable numeric character
references.  Refer to \method{handle_charref()} to determine what is
handled by default.  It is intended to be overridden by a derived
class; the base class implementation does nothing.
\end{methoddesc}

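The interplay between \method{handle_charref()} and
\method{unknown_charref()} can be sketched as follows.  This is a
stand-alone illustration under stated assumptions, not the module's code:
the helper \code{resolve_charref()} and the class \code{CharrefSketch}
are hypothetical names, and the only rule modelled is the documented one
(a decimal number in the range 0-255 is translated, anything else is
reported as unknown).

```python
def resolve_charref(ref):
    """Return the character for a decimal reference in 0-255, else None.

    Hypothetical helper; it models only the rule stated in the text.
    """
    try:
        number = int(ref)
    except ValueError:
        return None          # not a decimal number, e.g. "x41"
    if not 0 <= number <= 255:
        return None          # outside the range handled by default
    return chr(number)


class CharrefSketch:
    """Toy stand-in for the parser; NOT the real SGMLParser."""

    def __init__(self):
        self.data = []       # characters passed to handle_data()
        self.unknown = []    # references passed to unknown_charref()

    def handle_data(self, data):
        self.data.append(data)

    def unknown_charref(self, ref):
        self.unknown.append(ref)

    def handle_charref(self, ref):
        char = resolve_charref(ref)
        if char is None:
            self.unknown_charref(ref)
        else:
            self.handle_data(char)


sketch = CharrefSketch()
for reference in ("65", "300", "x41"):
    sketch.handle_charref(reference)
```

A derived class wanting to support more references would typically
override \method{unknown_charref()} rather than leave them silently dropped.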
\begin{methoddesc}{unknown_entityref}{ref}
This method is called to process an unknown entity reference.  It is
intended to be overridden by a derived class; the base class
implementation does nothing.
\end{methoddesc}

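The lookup performed for entity references can be sketched in the same
way.  The classes \code{EntitySketch} and \code{WithCopyright} and the
\code{copy} entry are hypothetical and exist only for illustration; the
five entries in the base mapping are assumed to correspond to the default
translations (\code{lt}, \code{gt}, \code{amp}, \code{quot}, \code{apos}).

```python
class EntitySketch:
    """Toy model of the entitydefs lookup; NOT the real SGMLParser."""

    # Assumed to mirror the default translations.
    entitydefs = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}

    def __init__(self):
        self.data = []       # translations passed to handle_data()
        self.unknown = []    # names passed to unknown_entityref()

    def handle_data(self, data):
        self.data.append(data)

    def unknown_entityref(self, ref):
        self.unknown.append(ref)

    def handle_entityref(self, ref):
        # self.entitydefs finds an instance attribute first, then the class one.
        table = self.entitydefs
        if ref in table:
            self.handle_data(table[ref])
        else:
            self.unknown_entityref(ref)


class WithCopyright(EntitySketch):
    # Extend the mapping at class level; "copy" is a made-up example entry.
    entitydefs = dict(EntitySketch.entitydefs, copy="(c)")


sketch = WithCopyright()
for name in ("amp", "copy", "nbsp"):
    sketch.handle_entityref(name)
```

Extending the mapping keeps the default lookup logic intact, while
overriding \method{unknown_entityref()} is the place to handle names that
have no translation.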
Apart from overriding or extending the methods listed above, derived
classes may also define methods of the following form to define
processing of specific tags.  Tag names in the input stream are case
independent; the \var{tag} occurring in method names must be in lower
case.

\begin{methoddescni}{start_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag}.  It has
preference over \method{do_\var{tag}()}.  The
\var{attributes} argument has the same meaning as described for
\method{handle_starttag()} above.
\end{methoddescni}

\begin{methoddescni}{do_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag} that does
not come with a matching closing tag.  The \var{attributes} argument
has the same meaning as described for \method{handle_starttag()} above.
\end{methoddescni}

\begin{methoddescni}{end_\var{tag}}{}
This method is called to process a closing tag \var{tag}.
\end{methoddescni}

Note that the parser maintains a stack of open elements for which no
end tag has been found yet.  Only tags processed by
\method{start_\var{tag}()} are pushed on this stack.  Definition of an
\method{end_\var{tag}()} method is optional for these tags.  For tags
processed by \method{do_\var{tag}()} or by \method{unknown_starttag()},
an \method{end_\var{tag}()} method need not be defined; if one is
defined, it will not be used.  If both \method{start_\var{tag}()} and
\method{do_\var{tag}()} methods exist for a tag, the
\method{start_\var{tag}()} method takes precedence.
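
The dispatch and stack rules described above can be sketched with a small
stand-alone model.  It is an illustration under stated assumptions, not
the module's implementation: \code{DispatchSketch}, its \code{starttag()}
and \code{endtag()} entry points, and the \code{log} list are hypothetical,
and implicitly closing elements that were opened after the one being
closed is an assumption of the sketch.

```python
class DispatchSketch:
    """Toy model of tag dispatch; NOT the real SGMLParser."""

    def __init__(self):
        self.stack = []   # open elements that were handled by start_<tag>()
        self.log = []     # record of calls, for inspection

    def starttag(self, tag, attributes):
        tag = tag.lower()                      # tag names are case independent
        method = getattr(self, "start_" + tag, None)
        if method is not None:
            self.stack.append(tag)             # only start_<tag>() tags are pushed
        else:
            method = getattr(self, "do_" + tag, None)
        if method is None:
            self.unknown_starttag(tag, attributes)
        else:
            self.handle_starttag(tag, method, attributes)

    def endtag(self, tag):
        tag = tag.lower()
        if tag not in self.stack:
            self.report_unbalanced(tag)        # no matching open element
            return
        while self.stack:                      # assumed: close nested elements too
            open_tag = self.stack.pop()
            method = getattr(self, "end_" + open_tag, None)
            if method is not None:             # end_<tag>() is optional
                self.handle_endtag(open_tag, method)
            if open_tag == tag:
                break

    def handle_starttag(self, tag, method, attributes):
        method(attributes)

    def handle_endtag(self, tag, method):
        method()

    def unknown_starttag(self, tag, attributes):
        self.log.append(("unknown_starttag", tag))

    def report_unbalanced(self, tag):
        self.log.append(("unbalanced", tag))


class Demo(DispatchSketch):
    def start_p(self, attributes):
        self.log.append(("start_p", attributes))

    def do_p(self, attributes):                # shadowed by start_p()
        self.log.append(("do_p", attributes))

    def end_p(self):
        self.log.append(("end_p",))

    def do_br(self, attributes):
        self.log.append(("do_br", attributes))

    def end_br(self):                          # never used: <br> is not on the stack
        self.log.append(("end_br",))


demo = Demo()
demo.starttag("P", [])
demo.starttag("BR", [])
demo.endtag("br")
demo.endtag("p")
demo.starttag("x", [])
```

In this run \code{start_p()} is chosen over \code{do_p()}, and the
\code{end_br()} method is never called because \code{br} was handled by a
\code{do_} method and so was never pushed on the stack.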