.. toctree::
   :hidden:

   DataFlowSanitizerDesign


DataFlowSanitizer is a generalised dynamic data flow analysis.

Unlike other Sanitizer tools, this tool is not designed to detect a
specific class of bugs on its own. Instead, it provides a generic
dynamic data flow analysis framework to be used by clients to help
detect application-specific issues within their own code.

How to build libc++ with DFSan
==============================

DFSan requires either all of your code to be instrumented or for uninstrumented
functions to be listed as ``uninstrumented`` in the `ABI list`_.

If you'd like to have instrumented libc++ functions, then you need to build it
with DFSan instrumentation from source. Here is an example of how to build
libc++ and the libc++ ABI with data flow sanitizer instrumentation.

.. code-block:: console

  # An example using ninja
  cmake -GNinja path/to/llvm-project/llvm \
    -DCMAKE_C_COMPILER=clang \
    -DCMAKE_CXX_COMPILER=clang++ \
    -DLLVM_USE_SANITIZER="DataFlow" \
    -DLLVM_ENABLE_LIBCXX=ON \
    -DLLVM_ENABLE_PROJECTS="libcxx;libcxxabi"

Note: Ensure you are building with a sufficiently new version of Clang.

Usage
=====

With no program changes, applying DataFlowSanitizer to a program
will not alter its behavior. To use DataFlowSanitizer, the program
uses API functions to apply tags to data to cause it to be tracked, and to
check the tag of a specific data item. DataFlowSanitizer manages
the propagation of tags through the program according to its data flow.

The APIs are defined in the header file ``sanitizer/dfsan_interface.h``.
For further information about each function, please refer to the header
file.
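
As a minimal sketch of this tag-and-check workflow, the helpers below label a
buffer and later test whether any label is attached to it. ``mark_buffer`` and
``buffer_is_labelled`` are hypothetical names introduced for illustration;
``dfsan_set_label`` and ``dfsan_read_label`` are taken from the header. The
fallback stubs are an assumption of this sketch so that it also compiles when
DFSan is not enabled, in which case nothing is ever reported as labelled.

```cpp
#include <stddef.h>

// Use the real DFSan interface only when compiling with -fsanitize=dataflow.
#if defined(__has_feature)
#if __has_feature(dataflow_sanitizer)
#include <sanitizer/dfsan_interface.h>
#define EXAMPLE_HAVE_DFSAN 1
#endif
#endif

#ifndef EXAMPLE_HAVE_DFSAN
// Fallback stubs (an assumption of this sketch, not part of DFSan) so the
// example still builds without the sanitizer; nothing is ever labelled.
typedef unsigned char dfsan_label;
static inline void dfsan_set_label(dfsan_label, void *, size_t) {}
static inline dfsan_label dfsan_read_label(const void *, size_t) { return 0; }
#endif

// Hypothetical helper: attach `label` to every byte of the buffer.
inline void mark_buffer(void *buf, size_t len, dfsan_label label) {
  dfsan_set_label(label, buf, len);
}

// Hypothetical helper: true if any byte of the buffer carries a label.
inline bool buffer_is_labelled(const void *buf, size_t len) {
  return dfsan_read_label(buf, len) != 0;
}
```

In an instrumented build, data derived from a buffer passed to ``mark_buffer``
should then also be reported as labelled, since DataFlowSanitizer propagates
the tag along the data flow.
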

.. _ABI list:

ABI List
--------

DataFlowSanitizer uses a list of functions known as an ABI list to decide
whether a call to a specific function should use the operating system's native
ABI or whether it should use a variant of this ABI that also propagates labels
through function parameters and return values. The ABI list file also controls
how labels are propagated in the former case. DataFlowSanitizer comes with a
default ABI list which is intended to eventually cover the glibc library on
Linux. Users may still need to extend the ABI list when a particular library
or function cannot be instrumented (e.g. because it is implemented in assembly
or another language which DataFlowSanitizer does not support), or when a
function is called from a library or function which cannot be instrumented.

DataFlowSanitizer's ABI list file is a :doc:`SanitizerSpecialCaseList`.
The pass treats every function in the ``uninstrumented`` category in the
ABI list file as conforming to the native ABI. Unless the ABI list contains
additional categories for those functions, a call to one of those functions
will produce a warning message, as the labelling behavior of the function
is unknown. The other supported categories are ``discard``, ``functional``
and ``custom``.

* ``discard`` -- To the extent that this function writes to (user-accessible)
  memory, it also updates labels in shadow memory (this condition is trivially
  satisfied for functions which do not write to user-accessible memory). Its
  return value is unlabelled.
* ``functional`` -- Like ``discard``, except that the label of its return value
  is the union of the labels of its arguments.
* ``custom`` -- Instead of calling the function, a custom wrapper ``__dfsw_F``
  is called, where ``F`` is the name of the function. This function may wrap
  the original function or provide its own implementation. This category is
  generally used for uninstrumentable functions which write to user-accessible
  memory or which have more complex label propagation behavior. The signature
  of ``__dfsw_F`` is based on that of ``F`` with each argument having a
  label of type ``dfsan_label`` appended to the argument list. If ``F``
  has a non-void return type, a final argument of type ``dfsan_label *``
  is appended, through which the custom function can store the label for the
  return value. For example:

  .. code-block:: c

    void f(int x);
    void __dfsw_f(int x, dfsan_label x_label);

    void *memcpy(void *dest, const void *src, size_t n);
    void *__dfsw_memcpy(void *dest, const void *src, size_t n,
                        dfsan_label dest_label, dfsan_label src_label,
                        dfsan_label n_label, dfsan_label *ret_label);
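
To make the wrapper convention concrete, here is a small sketch of a ``custom``
wrapper for a function with a non-void return type. The function
``clamp_to_byte`` is hypothetical and exists only for this illustration, and
``dfsan_label`` is redeclared locally (an assumption) so that the sketch builds
without the DFSan headers; a real wrapper would include
``sanitizer/dfsan_interface.h`` instead.

```cpp
#include <stdint.h>

// In a real DFSan build, dfsan_label comes from <sanitizer/dfsan_interface.h>.
// It is redeclared here (an assumption of this sketch) so the example builds
// without the sanitizer runtime.
typedef uint8_t dfsan_label;

// Hypothetical uninstrumented function F with a non-void return type.
extern "C" int clamp_to_byte(int x) {
  if (x < 0) return 0;
  if (x > 255) return 255;
  return x;
}

// Sketch of the matching wrapper: the arguments of F, then one label per
// argument, then a pointer through which the return value's label is stored.
extern "C" int __dfsw_clamp_to_byte(int x, dfsan_label x_label,
                                    dfsan_label *ret_label) {
  // The result depends only on x, so it inherits the label of x.
  *ret_label = x_label;
  return clamp_to_byte(x);
}
```

For such a wrapper to be used, the ABI list would presumably need to place
``clamp_to_byte`` in both the ``uninstrumented`` and ``custom`` categories.
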

If a function defined in the translation unit being compiled belongs to the
``uninstrumented`` category, it will be compiled so as to conform to the
native ABI. Its arguments will be assumed to be unlabelled, but it will
propagate labels in shadow memory.

For example:

.. code-block:: none

  # main is called by the C runtime using the native ABI.
  fun:main=uninstrumented
  fun:main=discard

  # malloc only writes to its internal data structures, not user-accessible memory.
  fun:malloc=uninstrumented
  fun:malloc=discard

  # tolower is a pure function.
  fun:tolower=uninstrumented
  fun:tolower=functional

  # memcpy needs to copy the shadow from the source to the destination region.
  # This is done in a custom function.
  fun:memcpy=uninstrumented
  fun:memcpy=custom

Compilation Flags
-----------------

* ``-dfsan-abilist`` -- The additional ABI list files that control how shadow
  parameters are passed. File names are separated by commas.
* ``-dfsan-combine-pointer-labels-on-load`` -- Controls whether to include or
  ignore the labels of pointers in load instructions. Its default value is true.
  For example, for the load ``v = *p;``:

  If the flag is true, the label of ``v`` is the union of the label of ``p`` and
  the label of ``*p``. If the flag is false, the label of ``v`` is the label of
  ``*p``.

* ``-dfsan-combine-pointer-labels-on-store`` -- Controls whether to include or
  ignore the labels of pointers in store instructions. Its default value is
  false. For example, for the store ``*p = v;``:

  If the flag is true, the label of ``*p`` is the union of the label of ``p`` and
  the label of ``v``. If the flag is false, the label of ``*p`` is the label of
  ``v``.

* ``-dfsan-combine-offset-labels-on-gep`` -- Controls whether to propagate
  labels of offsets in GEP instructions. Its default value is true. For example,
  for the pointer arithmetic ``p += i;``:

  If the flag is true, the label of ``p`` is the union of the label of ``p`` and
  the label of ``i``. If the flag is false, the label of ``p`` is unchanged.

* ``-dfsan-track-select-control-flow`` -- Controls whether to track the control
  flow of select instructions. Its default value is true. For example, for the
  select ``v = b ? v1 : v2;``:

  If the flag is true, the label of ``v`` is the union of the labels of ``b``,
  ``v1`` and ``v2``. If the flag is false, the label of ``v`` is the union of the
  labels of just ``v1`` and ``v2``.

* ``-dfsan-event-callbacks`` -- An experimental feature that inserts callbacks for
  certain data events. Currently callbacks are only inserted for loads, stores,
  memory transfers (i.e. memcpy and memmove), and comparisons. Its default value
  is false. If this flag is set to true, a user must provide definitions for the
  following callback functions:

  .. code-block:: c

    void __dfsan_load_callback(dfsan_label Label, void* Addr);
    void __dfsan_store_callback(dfsan_label Label, void* Addr);
    void __dfsan_mem_transfer_callback(dfsan_label *Start, size_t Len);
    void __dfsan_cmp_callback(dfsan_label CombinedLabel);

* ``-dfsan-track-origins`` -- Controls how to track origins. When its value is
  0, the runtime does not track origins. When its value is 1, the runtime tracks
  origins at memory store operations. When its value is 2, the runtime tracks
  origins at memory load and store operations. Its default value is 0.

* ``-dfsan-instrument-with-call-threshold`` -- If a function being instrumented
  requires more than this number of origin stores, use callbacks instead of
  inline checks (-1 means never use callbacks). Its default value is 3500.
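
For the ``-dfsan-event-callbacks`` flag described above, the user-provided
definitions might look like the following sketch, which only counts events that
involve a non-zero label. The ``EventCounts`` structure and ``g_events`` are
hypothetical names used for illustration, and ``dfsan_label`` is redeclared
locally (an assumption) so that the sketch builds without the DFSan headers.

```cpp
#include <stddef.h>
#include <stdint.h>

// Redeclared so the sketch builds without the DFSan headers (an assumption;
// a real build would include <sanitizer/dfsan_interface.h> instead).
typedef uint8_t dfsan_label;

// Hypothetical bookkeeping for this example only.
struct EventCounts {
  unsigned long labelled_loads = 0;
  unsigned long labelled_stores = 0;
  unsigned long transfers = 0;
  unsigned long labelled_compares = 0;
};
static EventCounts g_events;

// Called for loads; count only those that read labelled data.
extern "C" void __dfsan_load_callback(dfsan_label Label, void *Addr) {
  (void)Addr;
  if (Label != 0) ++g_events.labelled_loads;
}

// Called for stores; count only those that write labelled data.
extern "C" void __dfsan_store_callback(dfsan_label Label, void *Addr) {
  (void)Addr;
  if (Label != 0) ++g_events.labelled_stores;
}

// Called for memory transfers; this sketch counts every transfer without
// inspecting the labels behind Start.
extern "C" void __dfsan_mem_transfer_callback(dfsan_label *Start, size_t Len) {
  (void)Start;
  (void)Len;
  ++g_events.transfers;
}

// Called for comparisons; count only those with a labelled operand.
extern "C" void __dfsan_cmp_callback(dfsan_label CombinedLabel) {
  if (CombinedLabel != 0) ++g_events.labelled_compares;
}
```

It is probably safest to compile such definitions in a separate translation
unit without ``-dfsan-event-callbacks``, so that the callbacks do not trigger
further callbacks themselves.
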

Environment Variables
---------------------

The following runtime options are passed through the ``DFSAN_OPTIONS``
environment variable:

* ``warn_unimplemented`` -- Whether to warn on unimplemented functions. Its
  default value is false.
* ``strict_data_dependencies`` -- Whether to propagate labels only when there is
  an explicit, obvious data dependency (e.g., when comparing strings, ignore the
  fact that the output of the comparison might be implicitly data-dependent on
  the content of the strings). This applies only to functions in the ``custom``
  category of the ABI list. Its default value is true.
* ``origin_history_size`` -- The limit on the origin chain length. Non-positive
  values mean unlimited. Its default value is 16.
* ``origin_history_per_stack_limit`` -- The limit on the reference count of an
  origin node. Non-positive values mean unlimited. Its default value is 20000.
* ``store_context_size`` -- The depth limit of origin tracking stack traces. Its
  default value is 20.
* ``zero_in_malloc`` -- Whether to zero the shadow space of newly allocated
  memory. Its default value is true.
* ``zero_in_free`` -- Whether to zero the shadow space of deallocated memory. Its
  default value is true.

Example
=======

DataFlowSanitizer supports up to 8 labels, to achieve low CPU and code
size overhead. Base labels are simply 8-bit unsigned integers that are
powers of 2 (i.e. 1, 2, 4, 8, ..., 128), and union labels are created
by ORing base labels.
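
The label arithmetic itself can be modelled with plain integers. The sketch
below does not call into DataFlowSanitizer; ``Label``, ``base_label``,
``union_labels`` and ``has_label`` are hypothetical names chosen for this
illustration of the power-of-2 and OR scheme.

```cpp
#include <stdint.h>

// Plain-integer model of the label scheme; it does not call into DFSan.
// `Label` is a stand-in for the 8-bit dfsan_label type.
typedef uint8_t Label;

// Base label number `bit` (which must be in 0..7) is the power of two 1 << bit.
constexpr Label base_label(unsigned bit) {
  return static_cast<Label>(1u << bit);
}

// A union label is the bitwise OR of its members.
constexpr Label union_labels(Label a, Label b) {
  return static_cast<Label>(a | b);
}

// True if every bit set in `elem` is also set in `label`.
constexpr bool has_label(Label label, Label elem) {
  return (label & elem) == elem;
}
```

With this model, the union of base labels 1 and 2 is 3, which contains both of
them but not base label 4.
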

The following program demonstrates label propagation by checking that
the correct labels are propagated.

.. code-block:: c++

  #include <sanitizer/dfsan_interface.h>
  #include <assert.h>

  int main(void) {
    // Arbitrary values; only their labels matter.
    int i = 100;
    int j = 200;
    int k = 300;
    dfsan_label i_label = 1;
    dfsan_label j_label = 2;
    dfsan_label k_label = 4;
    dfsan_set_label(i_label, &i, sizeof(i));
    dfsan_set_label(j_label, &j, sizeof(j));
    dfsan_set_label(k_label, &k, sizeof(k));

    dfsan_label ij_label = dfsan_get_label(i + j);

    assert(ij_label & i_label);  // ij_label has i_label
    assert(ij_label & j_label);  // ij_label has j_label
    assert(!(ij_label & k_label));  // ij_label doesn't have k_label
    assert(ij_label == 3);  // Verifies all of the above

    // Or, equivalently:
    assert(dfsan_has_label(ij_label, i_label));
    assert(dfsan_has_label(ij_label, j_label));
    assert(!dfsan_has_label(ij_label, k_label));

    dfsan_label ijk_label = dfsan_get_label(i + j + k);

    assert(ijk_label & i_label);  // ijk_label has i_label
    assert(ijk_label & j_label);  // ijk_label has j_label
    assert(ijk_label & k_label);  // ijk_label has k_label
    assert(ijk_label == 7);  // Verifies all of the above

    // Or, equivalently:
    assert(dfsan_has_label(ijk_label, i_label));
    assert(dfsan_has_label(ijk_label, j_label));
    assert(dfsan_has_label(ijk_label, k_label));

    return 0;
  }

Origin Tracking
===============

DataFlowSanitizer can track origins of labeled values. This feature is enabled by
``-mllvm -dfsan-track-origins=1``. For example,

.. code-block:: console

  % cat test.cc
  #include <sanitizer/dfsan_interface.h>
  #include <stdio.h>

  int main(int argc, char** argv) {
    int i = 0;
    dfsan_set_label(1, &i, sizeof(i));
    int j = i + 1;
    dfsan_print_origin_trace(&j, "A flow from i to j");
    return 0;
  }

  % clang++ -fsanitize=dataflow -mllvm -dfsan-track-origins=1 -fno-omit-frame-pointer -g -O2 test.cc
  % ./a.out
  Taint value 0x1 (at 0x7ffd42bf415c) origin tracking (A flow from i to j)
  Origin value: 0x13900001, Taint value was stored to memory at
    #0 0x55676db85a62 in main test.cc:7:7
    #1 0x7f0083611bbc in __libc_start_main libc-start.c:285

  Origin value: 0x9e00001, Taint value was created at
    #0 0x55676db85a08 in main test.cc:6:3
    #1 0x7f0083611bbc in __libc_start_main libc-start.c:285

With ``-mllvm -dfsan-track-origins=1``, DataFlowSanitizer collects only the
intermediate stores that a labeled value went through. Origin tracking slows
down program execution by a factor of 2x on top of the usual DataFlowSanitizer
slowdown and increases memory overhead by 1x. With
``-mllvm -dfsan-track-origins=2``, DataFlowSanitizer also collects the
intermediate loads that a labeled value went through. This mode slows down
program execution by a factor of 4x.

Current status
==============

DataFlowSanitizer is a work in progress, currently under development for
x86_64 Linux.

Design
======

Please refer to the :doc:`design document<DataFlowSanitizerDesign>`.