Documentation/gpu/amdgpu/display/dc-debug.rst

========================
Display Core Debug tools
========================

In this section, you will find helpful information on debugging the amdgpu
driver from the display perspective. This page introduces debug mechanisms and
procedures to help you identify whether an issue is related to display code.

Narrow down display issues
==========================

Since the display is the driver's visual component, it is common to see users
reporting issues as display problems when another component causes them. This
section equips users to determine if a specific issue was caused by the display
component or another part of the driver.

DC dmesg important messages
---------------------------

The dmesg log is the first source of information to be checked, and amdgpu
takes advantage of this feature by logging some valuable information. When
looking for issues associated with amdgpu, remember that each component of
the driver (e.g., smu, PSP, dm, etc.) is loaded one by one, and this
information can be found in the dmesg log. In this sense, look for the part of
the log that looks like the log snippet below::

  [    4.254295] [drm] initializing kernel modesetting (IP DISCOVERY 0x1002:0x744C 0x1002:0x0E3B 0xC8).
  [    4.254718] [drm] register mmio base: 0xFCB00000
  [    4.254918] [drm] register mmio size: 1048576
  [    4.260095] [drm] add ip block number 0 <soc21_common>
  [    4.260318] [drm] add ip block number 1 <gmc_v11_0>
  [    4.260510] [drm] add ip block number 2 <ih_v6_0>
  [    4.260696] [drm] add ip block number 3 <psp>
  [    4.260878] [drm] add ip block number 4 <smu>
  [    4.261057] [drm] add ip block number 5 <dm>
  [    4.261231] [drm] add ip block number 6 <gfx_v11_0>
  [    4.261402] [drm] add ip block number 7 <sdma_v6_0>
  [    4.261568] [drm] add ip block number 8 <vcn_v4_0>
  [    4.261729] [drm] add ip block number 9 <jpeg_v4_0>
  [    4.261887] [drm] add ip block number 10 <mes_v11_0>

From the above example, you can see the line that reports that `<dm>`
(**Display Manager**) was loaded, which means that display can be part of the
issue. If you do not see that line, something else might have failed before
amdgpu loaded the display component, indicating that the problem is not a
display issue.
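As a quick way to apply the check above, the following shell sketch reports
whether the `dm` IP block shows up in a log. The helper name `dm_block_line`
is hypothetical (it is not part of any amdgpu tooling); it only filters the
log text it receives on standard input::

  # Print the "add ip block" line for <dm>, if any.
  # The exit status is non-zero when no such line exists.
  dm_block_line() {
      grep -E 'add ip block number [0-9]+ <dm>'
  }

  # Usage (reading the kernel log may require root):
  #   dmesg | dm_block_line || echo "dm was not added"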

After you have identified that the DM was loaded correctly, you can check for
the display version of the hardware in use, which can be retrieved from the
dmesg log with the command::

  dmesg | grep -i 'display core'

This command shows a message that looks like this::

  [    4.655828] [drm] Display Core v3.2.285 initialized on DCN 3.2

This message has two key pieces of information:

* **The DC version (e.g., v3.2.285)**: Display developers release a new DC version
  every week, and this information helps when a user/developer must find a
  good point versus a bad point based on a tested version of the display code.
  Remember from the :ref:`Display Core <amdgpu-display-core>` page that every
  week the new patches for display are heavily tested with IGT and manual
  tests.
* **The DCN version (e.g., DCN 3.2)**: The DCN block is associated with the
  hardware generation, and the DCN version conveys the hardware generation that
  the driver is currently running. This information helps to narrow down the
  code debug area since each DCN version has its files in the DC folder per DCN
  component (from the example, the developer might want to focus on the
  files/folders/functions/structs with the dcn32 label, since those are the
  ones likely to be executed). However, keep in mind that DC reuses code
  across different DCN versions; for example, it is expected to have some
  callbacks set in one DCN that are the same as those from another DCN. In
  summary, use the DCN version just as a guide.
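When collecting this information for a bug report, both values can be pulled
out of the log line in one step. The sketch below assumes the message has
exactly the format shown above and only handles the `DCN` spelling; the helper
name `dc_versions` is hypothetical::

  # Read log text on stdin and print "<DC version> <DCN version>"
  # for the first matching line.
  dc_versions() {
      sed -n 's/.*Display Core \(v[0-9.]*\) initialized on \(DCN [0-9.]*\).*/\1 \2/p' | head -n 1
  }

  # Usage:
  #   dmesg | dc_versions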

From the dmesg log, it is also possible to get the ATOM BIOS code by using::

  dmesg | grep -i 'ATOM BIOS'

This generates an output that looks like this::

  [    4.274534] amdgpu: ATOM BIOS: 113-D7020100-102

This information is useful to include in bug reports.

Avoid loading display core
--------------------------

Sometimes, it might be hard to figure out which part of the driver is causing
the issue; if you suspect that the display is not part of the problem and your
bug scenario is simple (e.g., some desktop configuration), you can try to
remove the display component from the equation. First, you need to identify
the `dm` ID from the dmesg log; for example, search for the following log::

  [    4.254295] [drm] initializing kernel modesetting (IP DISCOVERY 0x1002:0x744C 0x1002:0x0E3B 0xC8).
  [..]
  [    4.260095] [drm] add ip block number 0 <soc21_common>
  [    4.260318] [drm] add ip block number 1 <gmc_v11_0>
  [..]
  [    4.261057] [drm] add ip block number 5 <dm>

Notice from the above example that the `dm` ID is 5 for this specific hardware.
Next, you need to run the following binary operation to identify the IP block
mask::

  0xffffffff & ~(1 << [DM ID])

From our example, the IP mask is::

  0xffffffff & ~(1 << 5) = 0xffffffdf

Finally, to disable DC, you just need to set the below parameter in your
bootloader::

  amdgpu.ip_block_mask=0xffffffdf

If you can boot your system with the DC disabled and still see the issue, it
means you can rule DC out of the equation. However, if the bug disappears, you
still need to consider DC as part of the problem and keep narrowing down the
issue. In some scenarios, disabling DC is impossible since it might be
necessary to use the display component to reproduce the issue (e.g., play a
game).

**Note: This will probably lead to the absence of a display output.**
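The mask computation described in this section can be done with shell
arithmetic. This is a minimal sketch; the helper names `dm_block_id` and
`ip_block_mask` are hypothetical, and the first one assumes the log line format
shown above::

  # Extract the block number that precedes <dm> from log text on stdin.
  dm_block_id() {
      sed -n 's/.*add ip block number \([0-9][0-9]*\) <dm>.*/\1/p' | head -n 1
  }

  # Clear bit <id> in a 32-bit all-ones mask and print the result as hex.
  ip_block_mask() {
      printf '0x%08x\n' $(( 0xffffffff & ~(1 << $1) ))
  }

  # Usage:
  #   ip_block_mask "$(dmesg | dm_block_id)"

For the example log, `ip_block_mask 5` prints `0xffffffdf`, which matches the
value computed by hand.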

Display flickering
------------------

Display flickering might have multiple causes; one is the lack of proper power
to the GPU or problems in the DPM switches. A good first generic verification
is to force the GPU to its highest performance level::

  bash -c "echo high > /sys/class/drm/card0/device/power_dpm_force_performance_level"

The above command sets the GPU/APU to use the maximum power allowed, which
disables DPM switches. If forcing DPM levels high does not fix the issue, it
is less likely that the issue is related to power management. If the issue
disappears, there is a good chance that other components might be involved, and
the display should not be ignored since this could be a DPM issue. From the
display side, if the power increase fixes the issue, it is worth debugging the
clock configuration and the pipe split policy used in the specific
configuration.
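Since forcing the performance level changes power behavior until it is
reverted, it can help to save the previous level first. The sketch below
assumes the GPU is `card0`, as in the command above; the `DPM_LEVEL` variable
and the helper names are hypothetical::

  # Sysfs attribute used in the text; the card index may differ on
  # systems with more than one GPU.
  DPM_LEVEL=${DPM_LEVEL:-/sys/class/drm/card0/device/power_dpm_force_performance_level}

  # Print the current level, then force "high".
  dpm_force_high() {
      cat "$DPM_LEVEL"
      echo high > "$DPM_LEVEL"
  }

  # Write back a previously saved level (for example "auto").
  dpm_restore() {
      echo "$1" > "$DPM_LEVEL"
  }

  # Usage (as root):
  #   old=$(dpm_force_high)
  #   ... try to reproduce the flickering ...
  #   dpm_restore "$old"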

Display artifacts
-----------------

Users may see some screen artifacts that can be categorized into two different
types: localized artifacts and general artifacts. The localized artifacts
happen in some specific areas, such as around the UI window corners; if you see
this type of issue, there is a considerable chance that you have a userspace
problem, likely Mesa or similar. The general artifacts usually happen on the
entire screen. They might be caused by a misconfiguration at the driver level
of the display parameters, but the userspace might also cause this issue. One
way to identify the source of the problem is to take a screenshot or make a
desktop video capture when the problem happens; after checking the
screenshot/video recording, if you don't see any of the artifacts, it means
that the issue is likely on the driver side. If you can still see the
problem in the data collected, it is an issue that probably happened during
rendering, and the display code just received an already corrupted framebuffer.

Disabling/Enabling specific features
====================================

DC has a struct named `dc_debug_options`, which is statically initialized by
all DCE/DCN components based on the specific hardware characteristics. This
structure usually facilitates the bring-up phase since developers can start
with many disabled features and enable them individually. This is also an
important debug feature since users can change it when debugging specific
issues.

For example, dGPU users sometimes see a problem where a horizontal strip of
flickering happens in some specific part of the screen. This could be an
indication of Sub-Viewport issues; after the users have identified the target
DCN, they can set the `force_disable_subvp` field to true in the statically
initialized version of `dc_debug_options` to see if the issue gets fixed. Along
the same lines, users/developers can also try to turn off `fams2_config` and
`enable_single_display_2to1_odm_policy`. In summary, `dc_debug_options` is a
useful tool for identifying the problem.

DC Visual Confirmation
======================

Display core provides a feature named visual confirmation, which is a set of
bars added at the scanout time by the driver to convey some specific
information. In general, you can enable this debug option by using::

  echo <N> > /sys/kernel/debug/dri/0/amdgpu_dm_visual_confirm

Here, `N` is an integer that selects the specific scenario that the developer
wants to enable. You will see some of these debug cases in the following
subsections.
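When switching between modes repeatedly, a small wrapper avoids retyping the
path. This is only a sketch: the `VISUAL_CONFIRM` variable and the
`visual_confirm_set` helper are hypothetical, the DRI index `0` may differ on
your system, and it assumes that the debugfs entry can be read back and that
writing `0` turns the feature off::

  VISUAL_CONFIRM=${VISUAL_CONFIRM:-/sys/kernel/debug/dri/0/amdgpu_dm_visual_confirm}

  # Print the current mode, then switch to the mode given in $1.
  visual_confirm_set() {
      echo "previous mode: $(cat "$VISUAL_CONFIRM")"
      echo "$1" > "$VISUAL_CONFIRM"
  }

  # Usage (as root):
  #   visual_confirm_set 1
  #   ... inspect the bars on screen ...
  #   visual_confirm_set 0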

Multiple Planes Debug
---------------------

If you want to enable or debug multiple planes in a specific user-space
application, you can leverage a debug feature named visual confirm. To enable
it, use::

  echo 1 > /sys/kernel/debug/dri/0/amdgpu_dm_visual_confirm

You need to reload your GUI to see the visual confirmation. When the plane
configuration changes or a full update occurs, there will be a colored bar at
the bottom of each hardware plane being drawn on the screen.

* The color indicates the format - For example, red is AR24 and green is NV12
* The height of the bar indicates the index of the plane
* Pipe split can be observed if there are two bars with a difference in height
  covering the same plane

Consider the video playback case in which a video is played in a specific
plane, and the desktop is drawn in another plane. The video plane should
feature one or two green bars at the bottom of the video depending on pipe
split configuration.

* There should **not** be any visual corruption
* There should **not** be any underflow or screen flashes
* There should **not** be any black screens
* There should **not** be any cursor corruption
* Multiple planes **may** be briefly disabled during window transitions or
  resizing but should come back after the action has finished

Pipe Split Debug
----------------

Sometimes we need to debug if DCN is splitting pipes correctly, and visual
confirmation is also handy for this case. Similar to the multiple planes (MPO)
case, you can use the below command to enable visual confirmation::

  echo 1 > /sys/kernel/debug/dri/0/amdgpu_dm_visual_confirm

In this case, if you have a pipe split, you will see one small red bar at the
bottom of the display covering the entire display width and another bar
covering the second pipe. In other words, you will see a slightly taller bar in
the second pipe.

DTN Debug
=========

DC (DCN) provides an extensive log that dumps multiple details from our
hardware configuration. You can capture those status values with the Display
Test Next (DTN) log, which is available via debugfs::

  cat /sys/kernel/debug/dri/0/amdgpu_dm_dtn_log

Since this log is updated according to the DCN status, you can also follow the
changes in real-time by using something like::

  sudo watch -d cat /sys/kernel/debug/dri/0/amdgpu_dm_dtn_log

When reporting a bug related to DC, consider attaching this log before and
after you reproduce the bug.
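One possible way to collect the before and after logs is sketched below. The
`DTN_LOG` variable, the `dtn_snapshot` helper, and the output file names are
hypothetical; the DRI index `0` may differ on your system::

  DTN_LOG=${DTN_LOG:-/sys/kernel/debug/dri/0/amdgpu_dm_dtn_log}

  # Save a snapshot of the DTN log to the file named by $1.
  dtn_snapshot() {
      cat "$DTN_LOG" > "$1"
  }

  # Usage (as root):
  #   dtn_snapshot dtn-before.log
  #   ... reproduce the bug ...
  #   dtn_snapshot dtn-after.log
  #   diff -u dtn-before.log dtn-after.log

Attaching both files (and optionally the diff) lets developers see which parts
of the hardware state changed while the bug was reproduced.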

Collect Firmware information
============================

When reporting issues, it is important to have the firmware information since
it can be helpful for debugging purposes. To get all the firmware information,
use the command::

  cat /sys/kernel/debug/dri/0/amdgpu_firmware_info

From the display perspective, pay attention to the firmware of the DMCU and
DMCUB.
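To show only the display-related entries, the output can be filtered. The
sketch below assumes that the relevant lines contain the strings `DMCU` or
`DMCUB`; the exact line format is not described here, so the match is kept
deliberately loose. The helper name `display_firmware` is hypothetical::

  # Keep only the lines that mention DMCU or DMCUB (case-insensitive)
  # from firmware info text on stdin.
  display_firmware() {
      grep -i 'dmcu'
  }

  # Usage (as root):
  #   display_firmware < /sys/kernel/debug/dri/0/amdgpu_firmware_info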

DMUB Firmware Debug
===================

Sometimes, dmesg logs aren't enough. This is especially true if a feature is
implemented primarily in DMUB firmware. In such cases, all we see in dmesg when
an issue arises is some generic timeout error. So, to get more relevant
information, we can trace DMUB commands by enabling the relevant bits in
`amdgpu_dm_dmub_trace_mask`.

Currently, we support the tracing of the following groups:

Trace Groups
------------

.. csv-table::
   :header-rows: 1
   :widths: 1, 1
   :file: ./trace-groups-table.csv

**Note: Not all ASICs support all of the listed trace groups**

So, to enable just PSR tracing you can use the following command::

  # echo 0x8020 > /sys/kernel/debug/dri/0/amdgpu_dm_dmub_trace_mask

Then, you need to enable logging trace events to the buffer, which you can do
using the following::

  # echo 1 > /sys/kernel/debug/dri/0/amdgpu_dm_dmcub_trace_event_en

Lastly, after you are able to reproduce the issue you are trying to debug,
you can disable tracing and read the trace log by using the following::

  # echo 0 > /sys/kernel/debug/dri/0/amdgpu_dm_dmcub_trace_event_en
  # cat /sys/kernel/debug/dri/0/amdgpu_dm_dmub_tracebuffer

So, when reporting bugs related to features such as PSR and ABM, consider
enabling the relevant bits in the mask before reproducing the issue, and
attach the log that you obtain from the trace buffer to any bug reports that
you create.
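The three steps above can be wrapped into two helpers so that the sequence is
harder to get wrong. This is a sketch only: the `DRI` variable and the helper
names are hypothetical, the debugfs file names are the ones used in the
commands above, and the DRI index `0` may differ on your system::

  DRI=${DRI:-/sys/kernel/debug/dri/0}

  # Set the trace group mask given in $1 and start logging trace events.
  dmub_trace_start() {
      echo "$1" > "$DRI/amdgpu_dm_dmub_trace_mask"
      echo 1 > "$DRI/amdgpu_dm_dmcub_trace_event_en"
  }

  # Stop logging and save the trace buffer to the file named by $1.
  dmub_trace_stop() {
      echo 0 > "$DRI/amdgpu_dm_dmcub_trace_event_en"
      cat "$DRI/amdgpu_dm_dmub_tracebuffer" > "$1"
  }

  # Usage (as root), with the PSR mask from the example above:
  #   dmub_trace_start 0x8020
  #   ... reproduce the issue ...
  #   dmub_trace_stop dmub-trace.log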