perf-amd-ibs - Support for AMD Instruction-Based Sampling (IBS) with perf tool

'perf record' -e ibs_op//
'perf record' -e ibs_fetch//

Instruction-Based Sampling (IBS) provides precise Instruction Pointer (IP)
profiling support on AMD platforms. IBS has two independent components: IBS
Op and IBS Fetch. IBS Op sampling provides information about instruction
execution (micro-op execution, to be precise), with details such as d-cache
hit/miss, d-TLB hit/miss, cache miss latency, load/store data source and
branch behavior. IBS Fetch sampling provides information about instruction
fetch, with details such as i-cache hit/miss, i-TLB hit/miss and fetch
latency. IBS is per SMT thread, i.e. each SMT hardware thread contains its
own standalone IBS units.

Both IBS Op and IBS Fetch are exposed as PMUs by Linux and can be used
through the Linux perf utility. The following directories are created at
boot time if IBS is supported by both the hardware and the kernel:

	/sys/bus/event_source/devices/ibs_op/
	/sys/bus/event_source/devices/ibs_fetch/
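
The availability check described above can be sketched as a small shell helper. This is only an illustration: the function name `ibs_pmus` and its optional sysfs-root argument are hypothetical and exist purely so the check can be tried against a scratch directory.

```shell
# ibs_pmus: report which IBS PMUs are exposed under a sysfs root.
# The optional first argument (default: /sys) is a hypothetical knob
# that lets the check run against a scratch directory.
ibs_pmus() {
    root="${1:-/sys}"
    for pmu in ibs_op ibs_fetch; do
        if [ -d "$root/bus/event_source/devices/$pmu" ]; then
            echo "$pmu: available"
        else
            echo "$pmu: not found"
        fi
    done
}

ibs_pmus
```

If either PMU is reported as not found, the corresponding `-e ibs_op//` or `-e ibs_fetch//` event cannot be used on that system.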

IBS Op PMU supports two events: cycles and micro ops. IBS Fetch PMU supports
one event: fetch ops.

IBS PMUs do not have user/kernel filtering capability and thus require
CAP_SYS_ADMIN or CAP_PERFMON privilege.

IBS VS. REGULAR CORE PMU
------------------------

IBS gives samples with a precise IP, i.e. the IP recorded with an IBS sample
has no skid, whereas the IP recorded by a regular core PMU has some skid
(the sample was generated at IP X but perf records it at IP X+n). Hence, a
regular core PMU might not help for profiling with instruction-level
precision. Further, IBS provides additional information about the sample in
question. On the other hand, the regular core PMU has its own advantages,
such as a plethora of events, counting mode (less interference), up to 6
parallel counters, event grouping support and filtering capabilities.

Three regular core PMU events are internally forwarded to IBS Op PMU when
the precise_ip attribute is set:

	-e cpu-cycles:p becomes -e ibs_op//
	-e r076:p becomes -e ibs_op//
	-e r0C1:p becomes -e ibs_op/cnt_ctl=1/
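
The forwarding table above can be written down as a lookup, which is handy when a script needs to know which IBS Op event a precise core event ends up as. The helper name `core_to_ibs` is hypothetical, and the sketch only mirrors the three mappings listed above; it does not query the kernel.

```shell
# core_to_ibs: print the ibs_op event that a precise core PMU event
# is forwarded to, according to the three mappings listed above.
# Returns non-zero for any event that is not in that list.
core_to_ibs() {
    case "$1" in
        cpu-cycles:p|r076:p) echo "ibs_op//" ;;
        r0C1:p)              echo "ibs_op/cnt_ctl=1/" ;;
        *)                   echo "not forwarded"; return 1 ;;
    esac
}

core_to_ibs r0C1:p
```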

System-wide profile, cycles event, sampling period: 100000

	# perf record -e ibs_op// -c 100000 -a

Per-cpu profile (cpu10), cycles event, sampling period: 100000

	# perf record -e ibs_op// -c 100000 -C 10

Per-cpu profile (cpu10), cycles event, sampling freq: 1000

	# perf record -e ibs_op// -F 1000 -C 10

System-wide profile, uOps event, sampling period: 100000

	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a

Same command, but also capture the IBS register raw dump along with each perf sample:

	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a --raw-samples

System-wide profile, uOps event, sampling period: 100000, L3MissOnly (Zen4 onward)

	# perf record -e ibs_op/cnt_ctl=1,l3missonly=1/ -c 100000 -a

Per-process profile of an existing process (PID 1234; upstream v6.2 onward), uOps event, sampling period: 100000

	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -p 1234

Per-process profile of a newly launched command (ls; upstream v6.2 onward), uOps event, sampling period: 100000

	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -- ls

To analyse the recorded profile in aggregate mode

	# perf report
	/* Select a line and press 'a' to drill down at instruction level. */

To go over each sample

	# perf script

Raw dump of IBS registers when profiled with --raw-samples

	# perf report -D
	/* Look for PERF_RECORD_SAMPLE */

Example register raw dump:

	ibs_op_ctl:     000002c30006186a MaxCnt 100000 L3MissOnly 0 En 1
	        Val 1 CntCtl 0=cycles CurCnt 707
	IbsOpRip:       ffffffff8204aea7
	ibs_op_data:    0000010002550001 CompToRetCtr 1 TagToRetCtr 597
	        BrnRet 0 RipInvalid 0 BrnFuse 0 Microcode 1
	ibs_op_data2:   0000000000000013 RmtNode 1 DataSrc 3=DRAM
	ibs_op_data3:   0000000031960092 LdOp 0 StOp 1 DcL1TlbMiss 0
	        DcL2TlbMiss 0 DcL1TlbHit2M 1 DcL1TlbHit1G 0 DcL2TlbHit2M 0
	        DcMiss 1 DcMisAcc 0 DcWcMemAcc 0 DcUcMemAcc 0 DcLockedOp 0
	        DcMissNoMabAlloc 0 DcLinAddrValid 1 DcPhyAddrValid 1
	        DcL2TlbHit1G 0 L2Miss 1 SwPf 0 OpMemWidth 32 bytes
	        OpDcMissOpenMemReqs 12 DcMissLat 0 TlbRefillLat 0
	IbsDCLinAd:     ff110008a5398920
	IbsDCPhysAd:    00000008a5398920
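
To make the relation between the raw `ibs_op_ctl` value and its decoded fields concrete, here is a minimal sketch that splits the value with shell arithmetic. The bit positions are an assumption based on the IBS Op control register layout (MaxCnt[19:4] in bits 0-15, L3MissOnly in bit 16, En in bit 17, Val in bit 18, CntCtl in bit 19, MaxCnt[26:20] in bits 20-26, CurCnt in bits 32-58); the helper name `decode_ibs_op_ctl` is hypothetical and the sketch is not part of perf.

```shell
# decode_ibs_op_ctl: split a raw ibs_op_ctl value into a few of the
# fields shown in the dump above. Bit positions are assumptions stated
# in the lead-in, not taken from this document.
decode_ibs_op_ctl() {
    reg=$(( $1 ))
    max_lo=$(( reg & 0xffff ))             # MaxCnt[19:4]
    l3miss=$(( (reg >> 16) & 0x1 ))        # L3MissOnly
    en=$(( (reg >> 17) & 0x1 ))            # En
    val=$(( (reg >> 18) & 0x1 ))           # Val
    cntctl=$(( (reg >> 19) & 0x1 ))        # CntCtl: 0=cycles, 1=uOps
    max_hi=$(( (reg >> 20) & 0x7f ))       # MaxCnt[26:20]
    curcnt=$(( (reg >> 32) & 0x7ffffff ))  # CurCnt
    # The stored count omits the low 4 bits, so shift it back up.
    maxcnt=$(( ((max_hi << 16) | max_lo) << 4 ))
    echo "MaxCnt $maxcnt L3MissOnly $l3miss En $en Val $val CntCtl $cntctl CurCnt $curcnt"
}

decode_ibs_op_ctl 0x000002c30006186a
# prints: MaxCnt 100000 L3MissOnly 0 En 1 Val 1 CntCtl 0 CurCnt 707
```

The printed fields match the first two lines of the example dump, which is a quick way to sanity-check the assumed layout.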

IBS applied in a real-world use case

	A ~90% regression was observed in tbench with a specific scheduler hint,
	which was counterintuitive. IBS profiles of the good and bad runs,
	captured using perf, helped in identifying the exact cause of the problem:

	https://lore.kernel.org/r/20220921063638.2489-1-kprateek.nayak@amd.com

Similar commands can be used with Fetch PMU as well.

System-wide profile, fetch ops event, sampling period: 100000

	# perf record -e ibs_fetch// -c 100000 -a

System-wide profile, fetch ops event, sampling period: 100000, Random enable

	# perf record -e ibs_fetch/rand_en=1/ -c 100000 -a

	Random enable adds a small degree of variability to the sample period.
	This helps in cases like long-running loops, where the PMU tags the same
	instruction over and over because of a fixed sample period.

PERF MEM AND PERF C2C
---------------------

perf mem is a memory access profiler tool and perf c2c is a shared data
cacheline analyser tool. Both of them internally use the IBS Op PMU on AMD.
Below is a simple example of the perf mem tool.

	# perf mem record -c 100000 -- make

A normal perf mem report output provides a detailed memory access profile.
However, it can also be aggregated based on output fields. For example:

	# perf mem report -F mem,sample,snoop
	Samples: 3M of event 'ibs_op//', Event count (approx.): 23524876
	Memory access                                 Samples  Snoop
	Remote node, same socket RAM hit                 3241  N/A
	Remote core, same node Any cache hit             1572  HitM
	Remote core, same node Any cache hit              514  N/A
	Remote node, same socket Any cache hit           1216  HitM
	Remote node, same socket Any cache hit            350  N/A
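
An aggregated report like the one above is plain text, so it can be summarised further with standard tools. The sketch below totals the samples whose Snoop column is HitM, using the rows shown above as inline sample data. It assumes Snoop is the last whitespace-separated field and Samples the one before it; the helper name `hitm_samples` is hypothetical.

```shell
# Rows copied from the report above; the last two fields of each row
# are the Samples and Snoop columns.
report='Remote node, same socket RAM hit 3241 N/A
Remote core, same node Any cache hit 1572 HitM
Remote core, same node Any cache hit 514 N/A
Remote node, same socket Any cache hit 1216 HitM
Remote node, same socket Any cache hit 350 N/A'

# hitm_samples: sum the Samples column over rows whose Snoop is HitM.
hitm_samples() {
    awk '$NF == "HitM" { sum += $(NF - 1) } END { print sum + 0 }'
}

printf '%s\n' "$report" | hitm_samples
# prints: 2788
```

On a real system the same filter could be fed from the saved text output of the report instead of the inline sample rows.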

Please refer to the perf mem and perf c2c man pages for more details.

linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
linkperf:perf-mem[1], linkperf:perf-c2c[1]