doc/README.display_filter

(This is a consolidation of documentation written by stig, sahlberg, and gram)

What is the display filter system?
==================================
The display filter system allows the user to select packets by testing
for values in the proto_tree that Wireshark constructs for that packet.
Every proto_item in the proto_tree has an 'abbrev' field
and a 'type' field, which tell the display filter engine the name
of the field and its type (what values it can hold).

For example, this is the definition of the ip.proto field from packet-ip.c:

{ &hf_ip_proto,
      { "Protocol", "ip.proto", FT_UINT8, BASE_DEC | BASE_EXT_STRING,
              &ipproto_val_ext, 0x0, NULL, HFILL }},

This definition says that "ip.proto" is the display-filter name for
this field, and that its field-type is FT_UINT8.
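
As a rough illustration of what the filter engine needs from such a
definition, the sketch below models a field registry as a plain table that
maps an abbreviation to a type. Everything named mini_* is hypothetical and
is not Wireshark API; the table entries are only examples.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for a few FT_* values; not the real ftenum. */
typedef enum { MINI_FT_UINT8, MINI_FT_UINT16, MINI_FT_IPv4 } mini_ftenum;

/* One registered field: display name, filter abbreviation, and type. */
typedef struct {
    const char  *name;
    const char  *abbrev;
    mini_ftenum  type;
} mini_field;

static const mini_field mini_fields[] = {
    { "Protocol",       "ip.proto", MINI_FT_UINT8  },
    { "Total Length",   "ip.len",   MINI_FT_UINT16 },
    { "Source Address", "ip.src",   MINI_FT_IPv4   },
};

/* Return the field registered under abbrev, or NULL if there is none. */
const mini_field *mini_lookup_field(const char *abbrev)
{
    assert(abbrev != NULL);
    for (size_t i = 0; i < sizeof mini_fields / sizeof mini_fields[0]; i++) {
        if (strcmp(mini_fields[i].abbrev, abbrev) == 0)
            return &mini_fields[i];
    }
    return NULL;
}
```

A filter such as "ip.proto == 6" would start by looking up "ip.proto" and
reading its type to decide how the constant 6 should be interpreted.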

The display filter system has 3 major parts to it:

    1. A type system (field types, or "ftypes")
    2. A parser, to convert a user's query to an internal representation
    3. An engine that uses the internal representation to select packets.


code:
epan/dfilter/* - the display filter engine, including
                scanner, parser, syntax-tree semantics checker, DFVM bytecode
                generator, and DFVM engine.
epan/ftypes/* - the definitions of the various FT_* field types.
epan/proto.c   - proto_tree-related routines


The field type system
=====================
The field type system is stored in epan/ftypes.

The proto_tree system #includes ftypes.h, which gives it the ftenum
definition, which is the enum of all possible ftypes:

/* field types */
enum ftenum {
        FT_NONE,        /* used for text labels with no value */
        FT_PROTOCOL,
        FT_BOOLEAN,
        FT_CHAR,        /* 1-octet character as 0-255 */
        FT_UINT8,
        FT_UINT16,
        FT_UINT24,      /* really a UINT32, but displayed as 6 hex-digits if FD_HEX*/
        FT_UINT32,
        FT_UINT40,      /* really a UINT64, but displayed as 10 hex-digits if FD_HEX*/
        FT_UINT48,      /* really a UINT64, but displayed as 12 hex-digits if FD_HEX*/
        FT_UINT56,      /* really a UINT64, but displayed as 14 hex-digits if FD_HEX*/
        FT_UINT64,
        /* etc., etc. */
};

It also provides the definition of fvalue_t, the struct that holds the *value*
that corresponds to the type. Each proto_item (proto_node) holds an fvalue_t
through its field_info struct (defined in proto.h).

The fvalue_t is mostly just a gigantic union of possible C-language types
(as opposed to FT_* types):

typedef struct _fvalue_t {
        ftype_t *ftype;
        union {
                /* Put a few basic types in here */
                uint32_t                uinteger;
                int32_t                 sinteger;
                uint64_t                uinteger64;
                int64_t                 sinteger64;
                double                  floating;
                wmem_strbuf_t           *strbuf;
                GByteArray              *bytes;
                ipv4_addr_and_mask      ipv4;
                ipv6_addr_and_prefix    ipv6;
                e_guid_t                guid;
                nstime_t                time;
                protocol_value_t        protocol;
                uint16_t                sfloat_ieee_11073;
                uint32_t                float_ieee_11073;
        } value;
} fvalue_t;
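
The "type pointer plus union" layout is a tagged union. The following is a
minimal, self-contained sketch of that pattern, not the real fvalue_t: the
mini_* names are made up for illustration, and a plain enum tag stands in
for the ftype_t pointer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type tags; the real fvalue_t points at an ftype_t instead. */
typedef enum { MINI_UINT, MINI_SINT, MINI_DOUBLE } mini_tag;

typedef struct {
    mini_tag tag;            /* says which union member is valid */
    union {
        uint32_t uinteger;
        int32_t  sinteger;
        double   floating;
    } value;
} mini_fvalue;

/* Constructors set the tag and the matching union member together. */
mini_fvalue mini_fvalue_from_uint(uint32_t v)
{
    mini_fvalue fv = { .tag = MINI_UINT, .value.uinteger = v };
    return fv;
}

mini_fvalue mini_fvalue_from_double(double v)
{
    mini_fvalue fv = { .tag = MINI_DOUBLE, .value.floating = v };
    return fv;
}

/* Readers check the tag before touching the union. */
uint32_t mini_fvalue_get_uint(const mini_fvalue *fv)
{
    assert(fv->tag == MINI_UINT);
    return fv->value.uinteger;
}
```

Only the member that matches the tag is meaningful; reading any other
member would reinterpret the same bytes as a different C type.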


Defining a field type
---------------------
The ftype system itself is designed to be modular, so that new field types
can be added when necessary.

Each field type must implement an ftype_t structure, defined in
ftypes-int.h. This is the way a field type is registered with the ftype engine.

If you take a look at ftype-integer.c, you will see that it provides
an ftype_register_integers() function that fills in many such ftype_t
structs. It creates one for each integer type: FT_UINT8, FT_UINT16,
FT_UINT32, etc.

The ftype_t struct defines the things needed for the ftype:

    * its ftenum value
    * a string representation of the FT name ("FT_UINT8")
    * how much data it consumes in the packet
    * how to store that value in an fvalue_t: new(), free(),
        various value-related functions
    * how to compare that value against another
    * how to slice that value (strings and byte ranges can be sliced)

Using an fvalue_t
-----------------
Once the value of a field is stored in an fvalue_t (stored in
each proto_item via field_info), it's easy to use those values,
thanks to the various fvalue_*() functions defined in ftypes.h.

Functions like fvalue_get(), fvalue_eq(), etc., are all generic
interfaces to get information about the field's value. They work
on any field type because of the ftype_t struct, which is the lookup
table that the field-type engine uses to work with any field type.
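
The lookup-table idea can be sketched in a few lines. This is a simplified
model under stated assumptions, not the contents of ftypes-int.h: the
mini_ftype struct carries a single comparison callback, and
mini_fvalue_eq() is a hypothetical generic wrapper that dispatches
through it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct mini_fvalue;

/* Per-type operations, loosely modelled on the role ftype_t plays. */
typedef struct {
    const char *name;
    bool (*eq)(const struct mini_fvalue *a, const struct mini_fvalue *b);
} mini_ftype;

typedef struct mini_fvalue {
    const mini_ftype *ftype;
    union {
        uint32_t    uinteger;
        const char *string;
    } value;
} mini_fvalue;

static bool uint_eq(const mini_fvalue *a, const mini_fvalue *b)
{
    return a->value.uinteger == b->value.uinteger;
}

static bool string_eq(const mini_fvalue *a, const mini_fvalue *b)
{
    return strcmp(a->value.string, b->value.string) == 0;
}

const mini_ftype mini_ft_uint   = { "MINI_FT_UINT",   uint_eq   };
const mini_ftype mini_ft_string = { "MINI_FT_STRING", string_eq };

/* Generic entry point: works for any type by going through the table. */
bool mini_fvalue_eq(const mini_fvalue *a, const mini_fvalue *b)
{
    assert(a->ftype != NULL && b->ftype != NULL);
    if (a->ftype != b->ftype)
        return false;   /* simplification: no cross-type comparison */
    return a->ftype->eq(a, b);
}
```

The caller never needs to know which concrete type it holds; adding a new
type in this model means adding one more table, not touching the callers.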

The display filter parser
=========================
The display filter parser (along with the comparison engine)
is stored in epan/dfilter.

The scanner/parser pair reads the string representing the display filter
and converts it into a very simple syntax tree.  The tree is simple in
the sense that many of its nodes may still contain unparsed
chunks of text from the display filter.

There are four phases to parsing a user's request:

 1. Scanning the string for dfilter syntax
 2. Parsing the keywords according to the dfilter grammar, into a
        syntax tree
 3. Doing a semantic check of the nodes in that syntax tree
 4. Converting the syntax tree into a series of DFVM byte codes

The dfilter_compile() function, in epan/dfilter/dfilter.c,
runs these 4 phases. While they run, the state is kept in a dfwork_t
object (dfw). The end result is a dfilter_t object that
can be passed to dfilter_apply() to actually run the display filter
against a set of proto_trees.


Scanning the display filter string
----------------------------------
epan/dfilter/scanner.l is the lex scanner for finding keywords
in the user's display filter string.

Its operation is simple. It finds the special function and comparison
operators ("==", "!=", "eq", "ne", etc.), it finds slice operations
( "[0:1]" ), quoted strings, IP addresses, numbers, and any other "special"
keywords or string types.

Anything it doesn't know how to handle is passed to the grammar parser
as an unparsed string (TOKEN_UNPARSED). This includes field names. The
scanner does not interpret any protocol field names at all.
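
To make the "recognise what you know, pass the rest through" behaviour
concrete, here is a toy scanner. It is only a sketch: it splits on
whitespace, recognises nothing but the two equality operators, and leaves
everything else (numbers included) unparsed. The MINI_* token names and
mini_scan() are hypothetical and unrelated to the generated lex code.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token types; the real scanner defines many more TOKEN_* values. */
typedef enum { MINI_EOF, MINI_TEST_EQ, MINI_TEST_NE, MINI_UNPARSED } mini_token;

/*
 * Copy the next whitespace-delimited word from *input into buf, advance
 * *input past it, and classify it. Words the scanner does not recognise,
 * including field names, come back as MINI_UNPARSED. Words longer than
 * buflen - 1 characters are truncated.
 */
mini_token mini_scan(const char **input, char *buf, size_t buflen)
{
    const char *p = *input;
    size_t n = 0;

    assert(buflen > 0);
    while (isspace((unsigned char)*p))
        p++;
    while (*p != '\0' && !isspace((unsigned char)*p)) {
        if (n + 1 < buflen)
            buf[n++] = *p;
        p++;
    }
    buf[n] = '\0';
    *input = p;

    if (n == 0)
        return MINI_EOF;
    if (strcmp(buf, "==") == 0 || strcmp(buf, "eq") == 0)
        return MINI_TEST_EQ;
    if (strcmp(buf, "!=") == 0 || strcmp(buf, "ne") == 0)
        return MINI_TEST_NE;
    return MINI_UNPARSED;
}
```

Scanning "ip.proto == 1" with this sketch yields an unparsed word, an
equality token, and another unparsed word; deciding that "ip.proto" is a
field is left to a later phase.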

The scanner has to return a token type (TOKEN_*), and in many cases,
a value. The value will be an stnode_t struct, which is a syntax
tree node object. Since the final storage of the parse will
be in a syntax tree, it is convenient for the scanner to fill in
syntax tree nodes with values when it can.

The stnode_t definition is in epan/dfilter/syntax-tree.h.


Parsing the keywords according to the dfilter grammar
-----------------------------------------------------
The grammar parser is implemented with the 'lemon' tool,
rather than the traditional yacc or bison grammar parser,
as lemon grammars were found to be easier to work with. The
lemon parser specification (epan/dfilter/grammar.lemon) is
much easier to read than its bison counterpart would be,
thanks to lemon's feature of being able to name fields, rather
than using numbers ($1, $2, etc.).

The lemon tool is located in tools/lemon in the Wireshark
distribution.

An on-line introduction to lemon is available at:

http://www.sqlite.org/src/doc/trunk/doc/lemon.html

The grammar specifies which types of constructs are possible
within the dfilter language ("dfilter-lang").

An "expression" in dfilter-lang can be a relational test or a logical test.

A relational test compares one value against another, usually
a field (or a slice of a field) against some static value, like:

    ip.proto == 1
    eth.dst != ff:ff:ff:ff:ff:ff

A logical test combines other expressions with "and", "or", and "not".
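
The nesting of relational and logical tests can be pictured as a small
tree. The sketch below evaluates such a tree directly, purely to show how
the tests combine; Wireshark does not do this, it compiles the tree to
DFVM bytecode instead. The mini_node layout is an assumption made for the
example and does not mirror stnode_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical node kinds: one relational test plus the three logical ones. */
typedef enum { MINI_EQ, MINI_AND, MINI_OR, MINI_NOT } mini_op;

typedef struct mini_node {
    mini_op op;
    const struct mini_node *left;   /* operand of NOT, or left operand */
    const struct mini_node *right;  /* NULL for MINI_NOT and MINI_EQ */
    int field_value;                /* MINI_EQ only: value read from the packet */
    int constant;                   /* MINI_EQ only: value from the filter */
} mini_node;

/* Evaluate the tree recursively, starting from the root test. */
bool mini_eval(const mini_node *n)
{
    assert(n != NULL);
    switch (n->op) {
    case MINI_EQ:  return n->field_value == n->constant;
    case MINI_AND: return mini_eval(n->left) && mini_eval(n->right);
    case MINI_OR:  return mini_eval(n->left) || mini_eval(n->right);
    case MINI_NOT: return !mini_eval(n->left);
    }
    return false;
}
```

The root node is always the outermost test, so "a and not b" has an AND
node at the root whose right child is a NOT node.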

At the end of the grammatical parsing, the dfw object will
have a valid syntax tree, pointed at by dfw->st_root.

If there is an error in the syntax, the parser will call dfilter_fail()
with an appropriate error message, which the UI will need to report
to the user.

The syntax tree system
----------------------
The syntax tree is created as a result of running the lemon-based
grammar parser on the scanned tokens. The syntax tree code
is in epan/dfilter/syntax-tree* and epan/dfilter/sttype-*. It too
uses a set of code modules that implement different syntax node types,
similar to how the field-type system registers a set of ftypes
with a central engine.

Each node (stnode_t) in the syntax tree has a type (sttype).
These sttypes are very much related to ftypes (field types), but there
is not a one-to-one correspondence. The syntax tree nodes are slightly
higher-level abstractions. The root node of the syntax tree is the main
test or comparison being done.

Semantic Check
--------------
After the parsing is done and a syntax tree is available, the
code in semcheck.c does a semantic check of what is in the syntax
tree.

The semantics of the simple syntax tree are checked to make sure that
the fields that are being compared are being compared to appropriate
values.  For example, if a field is an integer, it can't be compared to
a string, unless a value_string has been defined for that field.

During the process of checking the semantics, the simple syntax tree is
fleshed out and no longer contains nodes with unparsed information.  The
syntax tree is no longer in its simple form, but in its complete form.

For example, if the dfilter is slicing a field and comparing
against a set of bytes, semcheck.c has to check that the field
in question can indeed be sliced.

It also has to check whether a field can be compared against a certain
type of value (string, integer, float, IPv4 address, etc.).

The semcheck code also makes adjustments to the syntax tree
when it needs to. The parser sometimes stores raw, unparsed strings
in the syntax tree, and semcheck has to convert them to
certain types. For example, the display filter may contain
a value_string string (the "enum" type that protocols can use
to define the possible textual descriptions of numeric fields), and
semcheck will convert that value_string string into the correct
integer value.
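
The value_string conversion amounts to a reverse lookup in a table of
number/label pairs. The sketch below shows that lookup in isolation. The
mini_value_string struct is modelled on the shape of a value_string entry,
while the table contents and mini_string_to_value() are illustrative
assumptions rather than code from semcheck.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same shape as a value_string entry: a number and its textual label. */
typedef struct {
    uint32_t    value;
    const char *strptr;
} mini_value_string;

/* A few IP protocol numbers, terminated by a NULL label. */
const mini_value_string mini_ipproto_vals[] = {
    { 1,  "ICMP" },
    { 6,  "TCP"  },
    { 17, "UDP"  },
    { 0,  NULL   },
};

/*
 * Convert a label typed in the filter to its number. Returns false if
 * the label is unknown, in which case a semantic check would fail.
 */
bool mini_string_to_value(const mini_value_string *vs, const char *label,
                          uint32_t *out)
{
    assert(vs != NULL && label != NULL && out != NULL);
    for (; vs->strptr != NULL; vs++) {
        if (strcmp(vs->strptr, label) == 0) {
            *out = vs->value;
            return true;
        }
    }
    return false;
}
```

With this table, a filter that compares an integer field against "TCP"
would end up comparing it against 6.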

Truth be told, the semcheck.c code is a bit disorganized, and could
be re-designed & re-written.

DFVM Byte Codes
---------------
The syntax tree is analyzed to create a sequence of bytecodes in the
"DFVM" language.  "DFVM" stands for Display Filter Virtual Machine.  The
DFVM is similar in spirit, but not in definition, to the BPF VM that
libpcap uses to analyze packets.

A virtual bytecode is created and used so that the actual process of
filtering packets will be fast.  That is, it should be faster to process
a list of VM bytecodes than to attempt to filter packets directly from
the syntax tree.  (heh...  no measurement has been made to support this
supposition)

The DFVM opcodes are defined in epan/dfilter/dfvm.h (dfvm_opcode_t).
Similar to how the BPF opcode system works in libpcap, there is a
limited set of opcodes. They operate by loading values from the
proto_tree into registers, loading pre-defined values into
registers, and comparing them. The opcodes are checked in sequence, and
there are only 2 branching opcodes: IF_TRUE_GOTO and IF_FALSE_GOTO.
Both of these can only branch forwards, and never backwards. In this way
sets of DFVM instructions will never get into an infinite loop.
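
The forward-only rule is easy to state as a check over a program. The
sketch below is one way to express it, using a hypothetical mini_insn
layout; it is not taken from dfvm.c or gencode.c, and whether Wireshark
performs such a check explicitly is not claimed here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { MINI_READ_TREE, MINI_ANY_EQ, MINI_IF_TRUE_GOTO,
               MINI_IF_FALSE_GOTO, MINI_RETURN } mini_opcode;

typedef struct {
    mini_opcode op;
    size_t      target;   /* branch target; only used by the GOTO opcodes */
} mini_insn;

/*
 * Return true if every branch jumps to a later instruction inside the
 * program. With that property each instruction runs at most once, so
 * execution takes at most n steps.
 */
bool mini_branches_are_forward(const mini_insn *prog, size_t n)
{
    assert(prog != NULL);
    for (size_t i = 0; i < n; i++) {
        bool is_branch = prog[i].op == MINI_IF_TRUE_GOTO ||
                         prog[i].op == MINI_IF_FALSE_GOTO;
        if (is_branch && (prog[i].target <= i || prog[i].target >= n))
            return false;
    }
    return true;
}
```

Because the program counter can only grow, termination follows without
any loop detection at run time.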

The epan/dfilter/gencode.c code converts the syntax tree
into a set of dfvm instructions.

The constants that are in the DFVM instructions (the constant
values that the user is checking against) are pre-loaded
into registers via the dfvm_init_const() call, and stored
in the dfilter_t structure for when the display filter is
actually applied.


DFVM Engine
===========
Once the DFVM bytecode has been produced, it's a simple matter of
running the DFVM engine against the proto_tree from the packet
dissection, using the DFVM bytecodes as instructions.  If the DFVM
bytecode is known before packet dissection occurs, the
proto_tree-related code can be "primed" to store away pointers to
field_info structures that are interesting to the display filter.  This
makes lookup of those field_info structures during the filtering process
faster.

The dfilter_apply() function runs a single pre-compiled
display filter against a single proto_tree, and returns
true or false, meaning that the filter matched or not.

That function calls dfvm_apply(), which runs across the DFVM
instructions, loading protocol field values into DFVM registers
and doing the comparisons.

There is a top-level Makefile target called 'dftest' which
builds a 'dftest' executable that will print out the DFVM
bytecode for any display filter given on the command-line.
To build it, run:

$ make dftest

To use it, give it the display filter on the command-line:

$ ./dftest 'ip.addr == 127.0.0.1'
Filter: ip.addr == 127.0.0.1

Constants:
00000 PUT_FVALUE        127.0.0.1 <FT_IPv4> -> reg#1

Instructions:
00000 READ_TREE         ip.addr -> reg#0
00001 IF-FALSE-GOTO     3
00002 ANY_EQ            reg#0 == reg#1
00003 RETURN


The output shows the original display filter, then the opcodes
that put constant values into registers. The registers are
numbered, and are shown in the output as "reg#n", where 'n' is the
identifying number.

Then the instructions are shown. These are the instructions
which are run for each proto_tree.

This is what happens in this example:

00000 READ_TREE         ip.addr -> reg#0

Any ip.addr fields in the proto_tree are loaded into register 0. Yes,
multiple values can be loaded into a single register. As a result
of this READ_TREE, the accumulator will hold true or false, indicating
if any field's value was loaded, or not.

00001 IF-FALSE-GOTO     3

If the load failed because there were no ip.addr fields
in the proto_tree, then we jump to instruction 3.

00002 ANY_EQ            reg#0 == reg#1

This checks to see if any of the fields in register 1
(which has the pre-loaded constant value of 127.0.0.1) are equal
to any of the fields in register 0 (which are all of the ip.addr
fields in the proto tree). The resulting value in the
accumulator will be true if any of the fields match, or false
if none match.

00003 RETURN

This returns the accumulator's value, either true or false.
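
The four-instruction program above can be replayed with a toy interpreter.
The sketch below is a model written for this walk-through, not the code in
dfvm.c: addresses are plain 32-bit integers, the "proto_tree" is an array
of field values, and all mini_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { MINI_READ_TREE, MINI_IF_FALSE_GOTO, MINI_ANY_EQ,
               MINI_RETURN } mini_opcode;

typedef struct {
    mini_opcode op;
    size_t      arg1;   /* register number, or branch target */
    size_t      arg2;   /* second register number for MINI_ANY_EQ */
} mini_insn;

/* A register holds a list of values, since a field can occur many times. */
typedef struct {
    const uint32_t *vals;
    size_t          count;
} mini_reg;

/*
 * Run prog against the field values found in one packet. The caller
 * pre-loads constant registers; MINI_READ_TREE fills its target register.
 */
bool mini_dfvm_apply(const mini_insn *prog, size_t n, mini_reg *regs,
                     const uint32_t *field_vals, size_t field_count)
{
    bool acc = false;   /* the accumulator */

    for (size_t pc = 0; pc < n; pc++) {
        const mini_insn *in = &prog[pc];
        switch (in->op) {
        case MINI_READ_TREE:
            regs[in->arg1].vals = field_vals;
            regs[in->arg1].count = field_count;
            acc = field_count > 0;
            break;
        case MINI_IF_FALSE_GOTO:
            if (!acc) {
                assert(in->arg1 > pc);   /* forward branches only */
                pc = in->arg1 - 1;       /* the loop's pc++ lands on the target */
            }
            break;
        case MINI_ANY_EQ:
            acc = false;
            for (size_t i = 0; i < regs[in->arg1].count && !acc; i++)
                for (size_t j = 0; j < regs[in->arg2].count && !acc; j++)
                    acc = regs[in->arg1].vals[i] == regs[in->arg2].vals[j];
            break;
        case MINI_RETURN:
            return acc;
        }
    }
    return acc;
}
```

A packet with no matching field skips the comparison entirely and returns
false, which is the path taken by the IF-FALSE-GOTO in the dftest output.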

In addition to dftest, there is also a unit-test script for the
display filter engine - test/suite_dfilter/dfiltertest.py.
It makes use of tshark to run specific display filters against
specific captures in test/captures. See the "Wireshark Tests" chapter
in the Wireshark Developer’s Guide.



Display Filter Functions
========================
You define a display filter function by adding an entry to
the df_functions table in epan/dfilter/dfunctions.c. The record struct
is defined in dfunctions.h, and shown here:

typedef struct {
    char            *name;
    DFFuncType      function;
    ftenum_t        retval_ftype;
    unsigned        min_nargs;
    unsigned        max_nargs;
    DFSemCheckType  semcheck_param_function;
} df_func_def_t;

name - the name of the function; this is how the user will call your
    function in the display filter language

function - this is the run-time processing of your function.

retval_ftype - what type of FT_* type does your function return?

min_nargs - minimum number of arguments your function accepts
max_nargs - maximum number of arguments your function accepts

semcheck_param_function - called during the semantic check of the
    display filter string.

DFFuncType function
-------------------
typedef bool (*DFFuncType)(GList *arg1list, GList *arg2list, GList **retval);

The return value of your function is a bool; true if processing went fine,
or false if there was some sort of exception.

For now, display filter functions can accept a maximum of 2 arguments.
The "arg1list" parameter is the GList for the first argument. The
"arg2list" parameter is the GList for the second argument. All arguments
to display filter functions are lists. This is because in the display
filter language a protocol field may have multiple instances. For example,
a field like "ip.addr" will exist more than once in a single frame. So
when the user invokes this display filter:

    somefunc(ip.addr) == true

even though "ip.addr" is a single argument, the "somefunc" function will
receive a GList of *all* the values of "ip.addr" in the frame.

Similarly, the return value of the function needs to be a GList, since all
values in the display filter language are lists. The GList** retval argument
is passed to your function so you can set the pointer to your return value.
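
The "lists in, list out" convention can be sketched without GLib. In the
example below a hand-rolled singly linked list stands in for GList, and
mini_len() is a made-up function that maps every string in its first
argument to its length. Only the overall shape (two argument lists, a
result list returned through a pointer, a bool status) follows the
DFFuncType typedef; everything else is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical singly linked list standing in for GList. */
typedef struct mini_list {
    const char       *str;     /* input values: one string per field instance */
    size_t            number;  /* output values: one length per input */
    struct mini_list *next;
} mini_list;

/* Free every node of a list built by mini_len(). */
void mini_list_free(mini_list *l)
{
    while (l != NULL) {
        mini_list *next = l->next;
        free(l);
        l = next;
    }
}

/*
 * Same shape as DFFuncType: two argument lists in, one result list out.
 * Produces the length of every string in arg1list; arg2list is unused.
 * Returns false, with *retval left NULL, if allocation fails.
 */
bool mini_len(const mini_list *arg1list, const mini_list *arg2list,
              mini_list **retval)
{
    mini_list *head = NULL, **tail = &head;

    (void)arg2list;
    assert(retval != NULL);
    *retval = NULL;
    for (const mini_list *a = arg1list; a != NULL; a = a->next) {
        mini_list *node = calloc(1, sizeof *node);
        if (node == NULL) {
            mini_list_free(head);
            return false;
        }
        node->number = strlen(a->str);
        *tail = node;
        tail = &node->next;
    }
    *retval = head;
    return true;
}
```

An empty input list produces an empty (NULL) result list and still counts
as success, which matches the idea that a field may simply be absent.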

DFSemCheckType
--------------
typedef void (*DFSemCheckType)(dfwork_t *dfw, int param_num, stnode_t *st_node);

For each parameter in the syntax tree, this function will be called.
"param_num" will indicate the number of the parameter, starting with 0.
The "stnode_t" is the syntax-tree node representing that parameter.
If everything is okay with the value of that stnode_t, your function
does nothing; it merely returns. If something is wrong, however,
it should call dfilter_fail(dfw,...) and THROW a TypeError exception.


Example: add an 'in' display filter operation
=============================================

This example was discussed on ethereal-dev in April 2004.
[Ethereal-dev] Need for an 'in' dfilter operator?
(https://lists.wireshark.org/archives/ethereal-dev/200404/msg00372.html)
It illustrates how a more complex operation can be added to the display filter language.

Question:

        If I want to add an 'in' display filter operation, I need to define
        several things. This can happen in different ways. For instance,
        every value from the "in" value collection will result in a test.
        There are 2 options here, either a test against a list of single
        values:

                (x in {a b c})

        or a test for a value in a given range:

                (x in {a ... z})

        or even a combination of both. The former example can be reduced to:

                ((x == a) or (x == b) or (x == c))

        while the latter can be reduced to

                ((x >= MIN(a, z)) and (x <= MAX(a, z)))

        I understand that I can replace "x in {" with the following steps:
        first store x in the "in" test buffer, then add "(" to the display
        filter expression internally.

        Similarly I can replace the closing brace "}" with the following
        steps: release x from the "in" test buffer and then add ")"
        to the display filter expression internally.

        How could I do this?

Answer:

        This could be done in grammar.lemon. The grammar would produce
        syntax tree nodes, combining them with "or", when it is given
        tokens that represent the "in" syntax.

        It could also be done later in the process, maybe in
        semcheck.c. But if you can do it earlier, in grammar.lemon,
        then you shouldn't have to worry about modifying anything in
        semcheck.c, as the syntax tree that is passed to semcheck.c
        won't contain any new type of operators... just lots of nodes
        combined with "or".
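
The two reductions given in the question can be written down as plain C
predicates. This is only a sketch of the logic on integers, with
hypothetical function names; it says nothing about how the grammar would
build the corresponding syntax tree nodes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* x in {a b c}  reduces to  (x == a) or (x == b) or (x == c) */
bool mini_in_set(int x, const int *members, size_t n)
{
    assert(members != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (x == members[i])
            return true;
    }
    return false;
}

/* x in {a ... z}  reduces to  (x >= MIN(a, z)) and (x <= MAX(a, z)) */
bool mini_in_range(int x, int a, int z)
{
    int lo = a < z ? a : z;
    int hi = a < z ? z : a;
    return x >= lo && x <= hi;
}
```

Taking MIN and MAX of the two bounds means the range matches the same
values whichever way round the user writes it.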

How to add an operator FOO to the display filter language?
==========================================================

Go to wireshark/epan/dfilter/

Edit grammar.lemon and add the operator. Add the operator FOO and the
test logic (defining TEST_OP_FOO).

Edit scanner.l and add the operator name(s), hence defining
TOKEN_TEST_FOO. Also update simple() or add the new operand's code.

Edit sttype-test.h and add TEST_OP_FOO to the list of test operations.

Edit sttype-test.c and add TEST_OP_FOO to the num_operands() method.

Edit gencode.c, add TEST_OP_FOO in the gen_test() method by defining
ANY_FOO.

Edit dfvm.h and add ANY_FOO to the enum dfvm_opcode_t structure.

Edit dfvm.c and add ANY_FOO to dfvm_dump() (for the dftest display filter
test binary) and to dfvm_apply(), hence defining the method fvalue_foo().

Edit semcheck.c and check whether the check_relation_XXX() methods
still apply to the foo operator; if not, amend the code. Start from the
check_test() method to discover the logic.

Go to wireshark/epan/ftypes/

Edit ftypes.h and declare the fvalue_foo() and ftype_can_foo()
methods. Add the cmp_foo() method to the struct _ftype_t.

This is the first time that a make in wireshark/epan/dfilter/ can
succeed. If it fails, then some code in the previously edited files must
be corrected.

Edit ftypes.c and define the fvalue_foo() method with its associated
logic. Also define the ftype_can_foo() method.

Edit all ftype-*.c files and add the required fvalue_foo() methods.

This is the point where you should be able to compile without errors in
wireshark/epan/ftypes/. If not, first fix the errors.

Go to wireshark/epan/ and run make. If this one succeeds, then we're
almost done as no errors should occur here.

Go to wireshark/ and run make. One thing to do is make dftest and see
if you can construct valid display filters with your new operator. Or
you may want to move directly to building Wireshark itself.

Also look at ui/qt/display_filter_expression_dialog.cpp and the display
filter expression generator.
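
How the ftypes pieces of such an operator fit together can be sketched in
isolation. The model below assumes a per-type table with an optional
cmp_foo slot, a "can" predicate that tests the slot, and a generic wrapper
that calls through it. The mini_* names and the made-up meaning of "foo"
(a is a multiple of b) are illustrative only and are not the real ftypes
code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct mini_fvalue;
typedef bool (*mini_cmp_func)(const struct mini_fvalue *a,
                              const struct mini_fvalue *b);

/* Per-type table with a slot for the hypothetical new "foo" comparison. */
typedef struct {
    const char    *name;
    mini_cmp_func  cmp_foo;   /* NULL if the type does not support foo */
} mini_ftype;

typedef struct mini_fvalue {
    const mini_ftype *ftype;
    uint32_t          uinteger;
} mini_fvalue;

/* Made-up semantics for illustration: "a foo b" means a is a multiple of b. */
static bool uint_cmp_foo(const mini_fvalue *a, const mini_fvalue *b)
{
    return b->uinteger != 0 && a->uinteger % b->uinteger == 0;
}

const mini_ftype mini_ft_uint = { "MINI_FT_UINT", uint_cmp_foo };
const mini_ftype mini_ft_none = { "MINI_FT_NONE", NULL };

/* Lets a semantic check reject "foo" on types that lack it. */
bool mini_ftype_can_foo(const mini_ftype *ft)
{
    return ft->cmp_foo != NULL;
}

/* Generic wrapper a VM could call for an ANY_FOO-style instruction. */
bool mini_fvalue_foo(const mini_fvalue *a, const mini_fvalue *b)
{
    assert(mini_ftype_can_foo(a->ftype));
    return a->ftype->cmp_foo(a, b);
}
```

Types that leave the slot empty are rejected up front by the "can"
predicate, so the wrapper never has to handle a missing callback at run
time.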

How to add a new test to the test suite
=======================================

All display filter tests are located in test/suite_dfilter.
You can add a test to an existing file or create a new file.

Each new test class must define "trace_file", which names
a capture file in "test/captures". All the tests
run in that class will use that one capture file.

There are 2 fixtures you can use for testing:

checkDFilterCount(dfilter, expected_count)

    This will run the display filter through tshark, on the
    file named by "trace_file", and assert that the
    number of resulting packets equals "expected_count". This
    also asserts that tshark does not fail; success with zero
    matches is not the same as failure to compile the display
    filter string.

checkDFilterFail(dfilter, error)

    This will run dftest with the display filter, and check
    that it fails with a given error message. This is useful
    when expecting display filter syntax errors to be caught.

To execute tests:

# Run all dfilter tests
$ test/test.py suite_dfilter

# Run all tests from group_tvb.py:
$ test/test.py suite_dfilter.group_tvb

# For faster, parallel tests, install "pytest-xdist" first
# (for example, using "pip install pytest-xdist"), then:
$ pytest -nauto test -k suite_dfilter

# Run all tests from group_tvb.py, in parallel:
$ pytest -nauto test -k case_tvb

# Run a single test from group_tvb.py, case_tvb.test_slice_4:
$ pytest test -k "case_tvb and test_slice_4"

See also https://www.wireshark.org/docs/wsdg_html_chunked/ChapterTests.html