tools/perf/Documentation/perf-amd-ibs.txt

perf-amd-ibs(1)
===============

NAME
----
perf-amd-ibs - Support for AMD Instruction-Based Sampling (IBS) with perf tool

SYNOPSIS
--------
[verse]
'perf record' -e ibs_op//
'perf record' -e ibs_fetch//

DESCRIPTION
-----------

Instruction-Based Sampling (IBS) provides precise Instruction Pointer (IP)
profiling support on AMD platforms. IBS has two independent components: IBS
Op and IBS Fetch. IBS Op sampling provides information about instruction
execution (micro-op execution to be precise) with details like d-cache
hit/miss, d-TLB hit/miss, cache miss latency, load/store data source, branch
behavior etc. IBS Fetch sampling provides information about instruction fetch
with details like i-cache hit/miss, i-TLB hit/miss, fetch latency etc. IBS is
per SMT thread, i.e. each SMT hardware thread contains a standalone IBS unit.

Both IBS Op and IBS Fetch are exposed as PMUs by Linux and can be used with
the Linux perf utility. The following sysfs entries are created at boot time
if IBS is supported by the hardware and kernel.

  /sys/bus/event_source/devices/ibs_op/
  /sys/bus/event_source/devices/ibs_fetch/
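
Whether both PMUs were registered can be checked from a script. The following
is a minimal sketch: the function name `ibs_pmus_present` and its optional
sysfs-root argument are hypothetical helpers introduced here for illustration
and are not part of perf.

```shell
#!/bin/sh
# Hypothetical helper (not part of perf): succeeds only if both IBS PMU
# entries exist. The optional argument overrides the sysfs root so the
# check can be exercised on a machine without IBS hardware.
ibs_pmus_present() {
    root="${1:-/sys}"
    [ -d "$root/bus/event_source/devices/ibs_op" ] &&
    [ -d "$root/bus/event_source/devices/ibs_fetch" ]
}

if ibs_pmus_present; then
    echo "IBS Op and IBS Fetch PMUs are available"
else
    echo "IBS PMUs not found (unsupported hardware or kernel)"
fi
```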

IBS Op PMU supports two events: cycles and micro ops. IBS Fetch PMU supports
one event: fetch ops.

IBS PMUs do not have user/kernel filtering capability and thus require
CAP_SYS_ADMIN or CAP_PERFMON privilege.

IBS VS. REGULAR CORE PMU
------------------------

IBS gives samples with precise IP, i.e. the IP recorded with an IBS sample has
no skid, whereas the IP recorded by the regular core PMU will have some skid
(the sample was generated at IP X but perf would record it at IP X+n). Hence,
the regular core PMU might not help for profiling with instruction-level
precision. Further, IBS provides additional information about the sample in
question. On the other hand, the regular core PMU has its own advantages like
a plethora of events, counting mode (less interference), up to 6 parallel
counters, event grouping support, filtering capabilities etc.

Three regular core PMU events are internally forwarded to IBS Op PMU when
precise_ip attribute is set:

        -e cpu-cycles:p becomes -e ibs_op//
        -e r076:p becomes -e ibs_op//
        -e r0C1:p becomes -e ibs_op/cnt_ctl=1/
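
The three forwarding rules above can be written down as a small lookup. This
is only an illustration of the mapping, not how perf or the kernel implements
it; the function name `ibs_equivalent` is hypothetical, and events are matched
exactly as spelled in the list above.

```shell
#!/bin/sh
# Hypothetical lookup (not part of perf) mirroring the three forwarding
# rules listed above. Anything else is reported as "not forwarded".
ibs_equivalent() {
    case "$1" in
        cpu-cycles:p|r076:p) echo "ibs_op//" ;;
        r0C1:p)              echo "ibs_op/cnt_ctl=1/" ;;
        *)                   return 1 ;;   # no precise modifier or other event
    esac
}

ibs_equivalent cpu-cycles:p   # prints: ibs_op//
ibs_equivalent r0C1:p         # prints: ibs_op/cnt_ctl=1/
```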

EXAMPLES
--------

IBS Op PMU
~~~~~~~~~~

System-wide profile, cycles event, sampling period: 100000

        # perf record -e ibs_op// -c 100000 -a

Per-cpu profile (cpu10), cycles event, sampling period: 100000

        # perf record -e ibs_op// -c 100000 -C 10

Per-cpu profile (cpu10), cycles event, sampling freq: 1000

        # perf record -e ibs_op// -F 1000 -C 10

System-wide profile, uOps event, sampling period: 100000

        # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a

Same command, but also capture IBS register raw dump along with perf sample:

        # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a --raw-samples

System-wide profile, uOps event, sampling period: 100000, L3MissOnly (Zen4 onward)

        # perf record -e ibs_op/cnt_ctl=1,l3missonly=1/ -c 100000 -a

Per-process profile of an existing process (upstream v6.2 onward), uOps event,
sampling period: 100000

        # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -p 1234

Per-process profile of a launched command (upstream v6.2 onward), uOps event,
sampling period: 100000

        # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -- ls

To analyse the recorded profile in aggregate mode

        # perf report
        /* Select a line and press 'a' to drill down at instruction level. */

To go over each sample

        # perf script

Raw dump of IBS registers when profiled with --raw-samples

        # perf report -D
        /* Look for PERF_RECORD_SAMPLE */

        Example register raw dump:

        ibs_op_ctl:     000002c30006186a MaxCnt    100000 L3MissOnly 0 En 1
                Val 1 CntCtl 0=cycles CurCnt       707
        IbsOpRip:       ffffffff8204aea7
        ibs_op_data:    0000010002550001 CompToRetCtr     1 TagToRetCtr   597
                BrnRet 0  RipInvalid 0 BrnFuse 0 Microcode 1
        ibs_op_data2:   0000000000000013 RmtNode 1 DataSrc 3=DRAM
        ibs_op_data3:   0000000031960092 LdOp 0 StOp 1 DcL1TlbMiss 0
                DcL2TlbMiss 0 DcL1TlbHit2M 1 DcL1TlbHit1G 0 DcL2TlbHit2M 0
                DcMiss 1 DcMisAcc 0 DcWcMemAcc 0 DcUcMemAcc 0 DcLockedOp 0
                DcMissNoMabAlloc 0 DcLinAddrValid 1 DcPhyAddrValid 1
                DcL2TlbHit1G 0 L2Miss 1 SwPf 0 OpMemWidth 32 bytes
                OpDcMissOpenMemReqs 12 DcMissLat     0 TlbRefillLat     0
        IbsDCLinAd:     ff110008a5398920
        IbsDCPhysAd:    00000008a5398920
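
To see how the decoded fields relate to the raw hex values in the dump above,
the two registers can be unpacked by hand with shell arithmetic. This is a
sketch only: the bit positions are an assumption based on the IBS register
layout in the kernel headers, the function names are hypothetical, and only
the low three bits of DataSrc are decoded.

```shell
#!/bin/sh
# Hypothetical decoder for the ibs_op_ctl value printed by 'perf report -D'.
# Assumed layout: MaxCnt is stored in units of 16, split across two fields.
decode_ibs_op_ctl() {
    v=$(( 0x$1 ))
    max_lo=$((  v        & 0xffff ))      # bits 0-15:  MaxCnt[19:4]
    l3miss=$(( (v >> 16) & 0x1 ))         # bit 16:     L3MissOnly
    en=$((     (v >> 17) & 0x1 ))         # bit 17:     En
    val=$((    (v >> 18) & 0x1 ))         # bit 18:     Val
    cntctl=$(( (v >> 19) & 0x1 ))         # bit 19:     CntCtl (0=cycles)
    max_hi=$(( (v >> 20) & 0x7f ))        # bits 20-26: MaxCnt[26:20]
    curcnt=$(( (v >> 32) & 0x7ffffff ))   # bits 32-58: CurCnt
    maxcnt=$(( ((max_hi << 16) | max_lo) << 4 ))
    echo "MaxCnt $maxcnt L3MissOnly $l3miss En $en Val $val CntCtl $cntctl CurCnt $curcnt"
}

# Hypothetical decoder for ibs_op_data2 (low DataSrc bits and RmtNode only).
decode_ibs_op_data2() {
    v=$(( 0x$1 ))
    echo "RmtNode $(( (v >> 4) & 0x1 )) DataSrc $(( v & 0x7 ))"
}

decode_ibs_op_ctl   000002c30006186a
# prints: MaxCnt 100000 L3MissOnly 0 En 1 Val 1 CntCtl 0 CurCnt 707
decode_ibs_op_data2 0000000000000013
# prints: RmtNode 1 DataSrc 3
```

Both results agree with the fields perf printed next to the raw values in the
example dump.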

IBS applied in a real-world use case

        A ~90% regression was observed in tbench with a specific scheduler
        hint, which was counterintuitive. IBS profiles of the good and bad
        runs captured using perf helped in identifying the exact cause of
        the problem:

        https://lore.kernel.org/r/20220921063638.2489-1-kprateek.nayak@amd.com

IBS Fetch PMU
~~~~~~~~~~~~~

Similar commands can be used with Fetch PMU as well.

System-wide profile, fetch ops event, sampling period: 100000

        # perf record -e ibs_fetch// -c 100000 -a

System-wide profile, fetch ops event, sampling period: 100000, Random enable

        # perf record -e ibs_fetch/rand_en=1/ -c 100000 -a

        Random enable adds a small degree of variability to the sample
        period. This helps in cases like long-running loops where the PMU
        keeps tagging the same instruction over and over because of a fixed
        sample period.

etc.
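
Why a fixed sample period can keep tagging the same instruction in a loop can
be illustrated with a toy model. This is not a description of the hardware:
the loop length of 16 and the jitter values are made-up assumptions, and
`tagged_offsets` is a hypothetical helper.

```shell
#!/bin/sh
# Toy model (not the hardware algorithm): a loop body of LOOP_LEN ops is
# sampled every PERIOD ops, plus a per-sample jitter. Prints the offset
# inside the loop body that each successive sample lands on.
tagged_offsets() {
    period=$1; loop_len=$2; shift 2
    pos=0
    out=""
    for jitter in "$@"; do
        pos=$(( (pos + period + jitter) % loop_len ))
        out="$out${out:+ }$pos"
    done
    echo "$out"
}

# Fixed period: 100000 is a multiple of 16, so every sample hits offset 0.
tagged_offsets 100000 16 0 0 0 0      # prints: 0 0 0 0
# Small made-up jitter values spread the samples across the loop body.
tagged_offsets 100000 16 3 9 1 14     # prints: 3 12 13 11
```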

PERF MEM AND PERF C2C
---------------------

perf mem is a memory access profiler tool and perf c2c is a shared data
cacheline analyser tool. Both of them internally use the IBS Op PMU on AMD.
Below is a simple example of the perf mem tool.

        # perf mem record -c 100000 -- make
        # perf mem report

A normal perf mem report output will provide a detailed memory access profile.
However, it can also be aggregated based on output fields. For example:

        # perf mem report -F mem,sample,snoop
        Samples: 3M of event 'ibs_op//', Event count (approx.): 23524876
        Memory access                                 Samples  Snoop
        N/A                                           1903343  N/A
        L1 hit                                        1056754  N/A
        L2 hit                                          75231  N/A
        L3 hit                                           9496  HitM
        L3 hit                                           2270  N/A
        RAM hit                                          8710  N/A
        Remote node, same socket RAM hit                 3241  N/A
        Remote core, same node Any cache hit             1572  HitM
        Remote core, same node Any cache hit              514  N/A
        Remote node, same socket Any cache hit           1216  HitM
        Remote node, same socket Any cache hit            350  N/A
        Uncached hit                                       18  N/A
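
An aggregated report like the one above is easy to post-process. As a sketch,
the following totals the samples whose Snoop column is HitM. It assumes the
report text is fed on standard input with Snoop as the last field and Samples
as the field before it; `hitm_samples` is a hypothetical helper, and how the
plain-text report is obtained is left out here.

```shell
#!/bin/sh
# Hypothetical helper: sum the Samples column for rows with Snoop == HitM.
# Relies on Snoop being the last whitespace-separated field of each row.
hitm_samples() {
    awk '$NF == "HitM" { sum += $(NF - 1) } END { print sum + 0 }'
}

# The three HitM rows from the example report, plus one non-HitM row.
hitm_samples <<'EOF'
L3 hit                                           9496  HitM
L3 hit                                           2270  N/A
Remote core, same node Any cache hit             1572  HitM
Remote node, same socket Any cache hit           1216  HitM
EOF
# prints: 12284
```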

Please refer to their man pages for more details.

SEE ALSO
--------

linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
linkperf:perf-mem[1], linkperf:perf-c2c[1]