=head1 NAME

parsel - Select parts of an HTML document based on CSS selectors

=head1 SYNOPSIS

parsel <B<SELECTOR>> [<B<SELECTOR>> [...]] < document.html

=head1 DESCRIPTION

This command reads an HTML document from STDIN and takes CSS selectors as
arguments. See the 'parsel' and 'cssselect' Python modules for the
selectors and pseudo selectors that are supported.

Each B<SELECTOR> selects a part of the DOM but, unlike in CSS, does not
narrow the DOM tree down for subsequent selectors. So the two-argument sequence
C<div p> selects all C<< <DIV> >> elements and then all C<< <P> >> elements in
the document; in other words, it is NOT equivalent to the CSS selector
C<div p>, which selects only those C<< <P> >> elements that are under a C<< <DIV> >>.
To combine selectors, see the C</> (slash) operator below.

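To make the difference concrete, here is a minimal sketch. It uses the path syntax of Python's standard C<xml.etree.ElementTree> as a stand-in for CSS selectors, so parsel itself is not involved, and the sample markup is made up for illustration. It contrasts two independent queries on the whole document with one combined descendant query.

```python
import xml.etree.ElementTree as ET

# Made-up sample document: one <p> inside a <div>, one outside of it.
doc = ET.fromstring(
    "<html><body><div><p>inside</p></div><p>outside</p></body></html>"
)

# Analogue of the two arguments `div p`: each query runs on the whole document.
all_divs = doc.findall(".//div")
all_paragraphs = [p.text for p in doc.findall(".//p")]

# Analogue of the single CSS selector "div p": only <p> elements below a <div>.
nested_paragraphs = [p.text for p in doc.findall(".//div//p")]

print(len(all_divs), all_paragraphs, nested_paragraphs)
```

The second query list contains both paragraphs, while the combined query keeps only the nested one.
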
Each B<SELECTOR> also outputs what it matched, in the following format:
first an integer telling how many distinct HTML parts were selected, then
the selected parts themselves, each on its own line.
CR, LF, and backslash characters are each escaped with one backslash. This is
useful for programmatic consumption, because you only have to first read
a line which tells how many subsequent lines to read: each one is one
selected DOM sub-tree on its own (or text, see C<::text> and C<[[ATTRIB]]> below).
Then just unescape backslash-R, backslash-N, and double backslashes
(for example with C<sed -e 's/\\\\/\\/g; s/\\r/\r/g; s/\\n/\n/g'>)
to get the HTML content.

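The output format described above could be consumed along these lines. This is only a sketch: the function names C<parsel_unescape> and C<read_hit_groups> are hypothetical and not part of parsel(1). It decodes each line in a single pass, because chained substitutions can mis-decode a literal backslash that happens to be followed by C<n> or C<r>.

```python
import re

# Escape sequences produced by the tool: \\ for backslash, \r for CR, \n for LF.
_UNESCAPE = {"r": "\r", "n": "\n", "\\": "\\"}


def parsel_unescape(line):
    """Decode one output line in a single left-to-right pass."""
    return re.sub(r"\\(.)", lambda m: _UNESCAPE.get(m.group(1), m.group(0)), line)


def read_hit_groups(lines):
    """Turn the output into a list of groups, one group per printed selector.

    Each group starts with a count line, followed by that many escaped lines.
    """
    it = iter(lines)
    groups = []
    for count_line in it:
        count = int(count_line)
        groups.append([parsel_unescape(next(it).rstrip("\n")) for _ in range(count)])
    return groups


if __name__ == "__main__":
    # Hypothetical captured output: two groups, with 2 and 1 hits.
    sample = ["2\n", "<p>a\\nb</p>\n", "<p>c\\\\d</p>\n", "1\n", "x\n"]
    for group in read_hit_groups(sample):
        print(group)
```

In practice the lines would come from the standard output of a parsel(1) run without the C<-1> option, since that option prints a single unescaped hit and no count.
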
Additionally, it takes these special arguments:

=over 4

=item B<@SELECTOR>

Prefix your selector with an C<@> (at sign) to suppress its output.
Mnemonic: command line echo suppression in DOS batch files and in Makefiles.

=item B<text{}> or B<::text>

Removes HTML tags and leaves only the text content before output.
The C<text{}> syntax is borrowed from pup(1).
The C<::text> form is there in case curly brackets are special in your shell and you do not want to escape them.
Note that C<::text> is not a standard CSS pseudo selector at the moment.

=item B<attr{ATTRIB}> or B<[[ATTRIB]]>

Outputs only the value of the ATTRIB attribute of the uppermost selected element.
The C<attr{}> syntax is borrowed from pup(1).
Mnemonic for the C<[[ATTRIB]]> form: in CSS you filter by tag attribute
with C<[attr]> square brackets; since that is already a valid selector,
parsel(1) takes double square brackets to actually output the attribute.

=item B</> (forward slash)

A stand-alone C</> takes the current selection as the base for the rest of the selectors.
Therefore the subsequent I<SELECTOR>s work on the previously selected elements,
not on the document root.
Mnemonic: one directory level deeper.
So the argument sequence C<.content / p div> selects only those P and DIV elements
which are inside an element of the "content" class.
This is useful because with CSS alone you cannot group P and DIV together here.
In other words, neither C<.content p, div> nor C<.content E<gt> p, div> provides
the same result.

=item B<SEL1/SEL2/SEL3>

A series of selectors delimited by C</> forward slashes in a single argument
delves into the DOM tree, but shows only those elements which the last selector yields.
This is in contrast to the multi-argument variant C<SEL1 / SEL2 / SEL3>, which shows everything
that SEL1, SEL2, SEL3, etc. produce.
It is similar to this five-argument sequence: C<@SEL1 / @SEL2 / SEL3>, except that C<SEL1/SEL2/SEL3>
rewinds the base selection to the one before SEL1, while the five-argument form moves the
base selection to SEL3 at the end.

You may still silence its output by prepending C<@>, as in C<@SEL1/SEL2/SEL3>, so
that not even SEL3 is shown.
This is useful when you want only its attributes or inner text (see B<text{}> and B<attr{}>).

Since slashes may occur in valid CSS selectors,
please double those C</> slashes which are not meant to separate selectors
but are part of a selector, usually a URL in a tag attribute.
E.g. instead of C<a[href="http://example.net/page"]>, enter C<a[href="http:////example.net//page"]>.

=item B<..> (double period)

A stand-alone C<..> rewinds the base DOM selection to the
previous base selection, the one in effect before the last C</>.
Mnemonic: parent directory.
Note that it does not select the parent element in the DOM tree,
but whatever was previously selected in this parsel(1) run.
To select the parent element(s), use C<parent{}>.

=item B<parent{}> or B<:parent>

Selects the parent elements of the currently selected elements in the DOM tree.
Note that C<:parent> is not a standard CSS selector at the moment.
Use the C<parent{}> form to disambiguate it from real (standardized) CSS selectors in your code.

=item B<:root>

Rewinds the base selection back to the DOM's root.
Note that C<:root> is also a valid CSS pseudo selector, but in a subtree (entered by C</>)
it would yield only that subtree, not the original DOM, so parsel(1) steps in and goes back to the root at this point.
You will likely need C<@> as well, to suppress output of the whole document here.

=back

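The interplay of C</>, C<..> and the single-argument C<SEL1/SEL2/SEL3> form can be sketched with a purely symbolic model. The helper names C<split_slash_argument> and C<trace_base> are hypothetical, selections are represented as plain strings rather than real parsel objects, and the stack discipline (C</> saves the base that was in effect, C<..> restores it) is an assumption based on the descriptions above. Only the splitting regular expression is taken from the script itself.

```python
import re


def split_slash_argument(arg):
    """Split one argument on single slashes; doubled slashes become literal ones."""
    parts = re.split(r"(?<!/)/(?!/)", arg)
    return [part.replace("//", "/") for part in parts]


def trace_base(args, root="ROOT"):
    """Return the symbolic base selection in effect after each argument."""
    base = current = root
    stack = []  # earlier bases, pushed by '/' and popped by '..'
    trace = []
    for arg in args:
        if arg == "/":
            stack.append(base)
            base = current
        elif arg == "..":
            base = stack.pop()
        else:
            # A selector, or a SEL1/SEL2 chain, always starts from the base.
            # A leading '@' only mutes output, so it is dropped in this model.
            selector = arg[1:] if arg.startswith("@") else arg
            current = " -> ".join([base] + split_slash_argument(selector))
        trace.append(base)
    return trace


if __name__ == "__main__":
    print(split_slash_argument('div/a[href="http:////example.net//page"]'))
    print(trace_base([".content", "/", "p", "/", "..", ".."]))
    print(trace_base(["a/b/c", "p"]))
```

The last call illustrates that a slash-delimited single argument leaves the base selection where it was, while stand-alone C</> arguments move it.
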
=head1 OPTIONS

=over 4

=item B<-1>

Show only the first element found.
The output is not escaped in this case.

=back

=head1 EXAMPLE OUTPUT

  $ parsel input[type=text] < page.html
  2
  <input type="text" name="domain" />
  <input type="text" name="username" />

  $ parsel input[type=text] [[name]] < page.html
  2
  <input type="text" name="domain" />
  <input type="text" name="username" />
  2
  domain
  username

  $ parsel @input[type=text] [[name]] < page.html
  2
  domain
  username

  $ parsel @form ::text < page.html
  1
  Enter your logon details:\r\nDomain:\r\nUsername:\r\nPassword:\r\nClick here to login:\r\n

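How arguments such as C<[[name]]> and C<::text> in the examples above are told apart from ordinary selectors can be sketched as follows. The function C<classify_argument> and its return values are hypothetical; only the two regular expressions mirror the ones used in the script. A leading C<@> is assumed to have been stripped already.

```python
import re


def classify_argument(arg):
    """Return a (kind, value) pair for one command-line argument."""
    if arg in ("::text", "text{}"):
        return ("text", None)
    if arg in (":parent", "parent{}"):
        return ("parent", None)
    # attr{NAME} and [[NAME]] both ask for the value of attribute NAME.
    match = re.search(r"^attr\{(.+)\}$", arg) or re.search(r"^\[\[(.+)\]\]$", arg)
    if match:
        return ("attr", match.group(1))
    # Anything else is passed on as a CSS selector.
    return ("selector", arg)


if __name__ == "__main__":
    for example in ("[[name]]", "attr{href}", "::text", "[name]"):
        print(example, classify_argument(example))
```

Note that the single-bracket C<[name]> stays an ordinary CSS attribute selector, which is why the double brackets are needed to request attribute output.
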
=head1 SEE ALSO

=over 4

=item L<https://www.w3schools.com/cssref/css_selectors.php>

=item L<https://developer.mozilla.org/en-US/docs/Web/CSS/Reference#selectors>

=item L<https://github.com/scrapy/cssselect>

=item L<https://cssselect.readthedocs.io/en/latest/#supported-selectors>

=item L<https://github.com/ericchiang/pup>

=item L<https://github.com/suntong/cascadia>

=item L<https://github.com/mgdm/htmlq>

=back

=cut

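import argparse
import re
import sys

import parsel
import w3lib.html
from parsel import Selector

# Command line: any number of selector arguments, plus -1 to print a single unescaped hit.
argparser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
argparser.add_argument('-1', action='store_true', help="show only 1 element unescaped")
argparser.add_argument('SELECTORS', nargs='*', help="see man page for details")
cli_args = argparser.parse_args()
# The option's destination is named '1', which is not a valid identifier, hence getattr().
opt_single_hit = getattr(cli_args, '1')
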
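
def parsel_escape(s):
    """Escape backslash, CR and LF so that every hit fits on one output line."""
    # Backslashes are doubled first, before the replacements below introduce new ones.
    s = s.replace('\\', '\\\\')
    s = s.replace('\r', '\\r')
    s = s.replace('\n', '\\n')
    return s
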
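
def show_hits(selection):
    """Print the number of hits, then each hit on its own escaped line."""
    if not opt_single_hit:
        print(len(selection))

    for hit in selection:
        out = hit.get()
        if output_text_only:
            out = w3lib.html.remove_tags(out)
        if output_attribute_only is not None:
            out = hit.attrib.get(output_attribute_only, '')

        if opt_single_hit:
            # -1 mode: print the first hit only, unescaped.
            print(out)
            break
        print(parsel_escape(out))
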

html = sys.stdin.read()
whole_selection = Selector(text=html)
# Stack of earlier base selections: pushed by '/', popped by '..'.
prev_selections = []
base_selection = parsel.selector.SelectorList([whole_selection])
curr_selection = base_selection


for arg in cli_args.SELECTORS:
    # Per-argument state; show_hits() reads the two output_* flags as globals.
    output_text_only = False
    output_attribute_only = None
    apply_current_selector = True
    show_output = True

    if arg.startswith('@'):
        # A leading '@' suppresses output for this argument.
        show_output = False
        arg = arg[1:]

    # Split on single slashes only; doubled slashes stand for literal slashes.
    sub_selectors = re.split(r'(?<!/)/(?!/)', arg)
    if len(sub_selectors) > 1 and arg != '/':
        # SEL1/SEL2/SEL3 form: descend step by step, show only the last result.
        sub_selectors = [sel.replace('//', '/') for sel in sub_selectors]
        sub_selection = base_selection

        for sel in sub_selectors:
            try:
                sub_selection = sub_selection.css(sel)
            except Exception:
                sys.stderr.write("CSS selection error at '%s' in '%s'.\n" % (sel, arg))
                raise
        if show_output:
            show_hits(sub_selection)

        curr_selection = sub_selection
        continue

    arg = arg.replace('//', '/')

    if arg == '/':
        # Step into the current selection. The base in effect so far is saved,
        # so that '..' can return to it as documented.
        prev_selections.append(base_selection)
        base_selection = curr_selection
        continue

    if arg == '..':
        base_selection = prev_selections.pop()
        continue

    if arg == ':root':
        # cssselect knows this ':root' pseudo selector, but it'd select
        # only the current scope's root which we have narrowed, so step in
        # here and rewind to the original DOM.
        base_selection = parsel.selector.SelectorList([whole_selection])

    if arg == ':parent' or arg == 'parent{}':
        curr_selection = curr_selection.xpath('..')
        apply_current_selector = False

    if arg == '::text' or arg == 'text{}':
        output_text_only = True
        apply_current_selector = False

    attr_match = re.search(r'^attr\{(.+)\}$', arg) or re.search(r'^\[\[(.+)\]\]$', arg)
    if attr_match:
        output_attribute_only = attr_match.group(1)
        apply_current_selector = False

    if apply_current_selector:
        try:
            curr_selection = base_selection.css(arg)
        except Exception:
            sys.stderr.write("CSS selection error at '%s'.\n" % arg)
            raise

    if show_output:
        show_hits(curr_selection)