.. SPDX-License-Identifier: GPL-2.0

The ``devlink`` health mechanism is targeted at real-time alerting, in
order to know when something bad has happened to a PCI device.

  * Provide alert debug information.
  * If a problem needs vendor support, provide a way to gather all the
    needed debugging information.

The main idea is to unify and centralize driver health reports in the
generic ``devlink`` instance and to allow the user to set different
attributes of the health reporting and recovery procedures.

The ``devlink`` health reporter:
A device driver creates a "health reporter" for each error/health type.
An error/health type can be known/generic (e.g. PCI error, fw error,
rx/tx error) or unknown (driver specific).
For each registered health reporter, a driver can issue error/health reports
asynchronously. All handling of health reports is done by ``devlink``.
A device driver can provide specific callbacks for each "health reporter", e.g.:

  * Diagnostics procedures
  * Object dump procedures
  * OOB initial parameters

Different parts of the driver can register different types of health reporters
with different handlers.
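
The idea of optional per-reporter callbacks can be sketched in plain
userspace C. This is only an illustrative model, not the kernel API:
``struct reporter_ops_model``, ``model_call()``, ``tx_recover()`` and
``tx_ops`` are hypothetical names, and returning ``-EOPNOTSUPP`` for a
missing callback is an assumption made for this sketch.

.. code-block:: c

    #include <errno.h>
    #include <stddef.h>

    /* Hypothetical callback table; every callback is optional. */
    struct reporter_ops_model {
        const char *name;
        int (*recover)(void *priv);
        int (*diagnose)(void *priv);
        int (*dump)(void *priv);
    };

    /* Run an optional callback, or report the operation as unsupported. */
    static int model_call(int (*cb)(void *priv), void *priv)
    {
        return cb ? cb(priv) : -EOPNOTSUPP;
    }

    /* Example driver-side recover handler: counts its invocations. */
    static int tx_recover(void *priv)
    {
        (*(int *)priv)++;
        return 0;
    }

    /* A "tx" reporter that implements recovery only. */
    static const struct reporter_ops_model tx_ops = {
        .name = "tx",
        .recover = tx_recover,
    };

With this table, ``model_call(tx_ops.recover, &counter)`` runs the handler,
while ``model_call(tx_ops.dump, &counter)`` fails cleanly because the
reporter did not supply a dump callback.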

Once an error is reported, devlink health will perform the following actions:

  * A log is sent to the kernel trace events buffer
  * Health status and statistics are updated for the reporter instance
  * An object dump is taken and saved at the reporter instance (as long as
    no other dump is already stored)
  * An auto-recovery attempt is made, depending on:

    - Auto-recovery configuration
    - Grace period vs. time passed since the last recovery
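
The sequence of actions listed above can be modelled in a few lines of
userspace C. This is a simplified sketch, not the kernel implementation:
``struct reporter_model`` and ``model_report()`` are hypothetical names,
time is passed in as plain milliseconds, and the recovery is assumed to
always succeed.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical model of one reporter; not the kernel structure. */
    struct reporter_model {
        const char *name;
        bool healthy;
        unsigned int error_count;
        unsigned int recovery_count;
        bool dump_stored;
        bool auto_recover;
        uint64_t grace_period_ms;
        uint64_t last_recovery_ms;
    };

    /* Returns true if an auto-recovery attempt was made for this report. */
    static bool model_report(struct reporter_model *r, uint64_t now_ms)
    {
        /* 1. Log the event (stands in for the kernel trace event). */
        fprintf(stderr, "health report: reporter=%s\n", r->name);

        /* 2. Update health status and statistics. */
        r->healthy = false;
        r->error_count++;

        /* 3. Take a dump only if none is stored; never overwrite one. */
        if (!r->dump_stored)
            r->dump_stored = true;

        /* 4. Auto-recover only if enabled and the grace period elapsed. */
        if (!r->auto_recover)
            return false;
        if (r->recovery_count > 0 &&
            now_ms - r->last_recovery_ms < r->grace_period_ms)
            return false;

        r->recovery_count++;
        r->last_recovery_ms = now_ms;
        r->healthy = true; /* assumption: the recovery succeeded */
        return true;
    }

In this model, a second report that arrives inside the grace period still
updates the statistics but leaves the reporter in the error state until the
user intervenes or a later report falls outside the grace period.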

The user can access/change each reporter's parameters and driver-specific
callbacks via ``devlink``, e.g. per error type (per health reporter):

  * Configure the reporter's generic parameters (like: disable/enable auto recovery)
  * Invoke the recovery procedure

.. list-table:: List of devlink health interfaces

   * - ``DEVLINK_CMD_HEALTH_REPORTER_GET``
     - Retrieves status and configuration info per DEV and reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_SET``
     - Allows reporter-related configuration setting.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_RECOVER``
     - Triggers a reporter's recovery procedure.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE``
     - Retrieves diagnostics data from a reporter on a device.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET``
     - Retrieves the last stored dump. Devlink health
       saves a single dump. If a dump is not already stored by devlink
       for this reporter, devlink generates a new dump.
       The dump output is defined by the reporter.
   * - ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR``
     - Clears the last saved dump file for the specified reporter.
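
The single-dump semantics of ``DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET`` and
``DEVLINK_CMD_HEALTH_REPORTER_DUMP_CLEAR`` described in the table can be
illustrated with a small userspace C model. It is a sketch under stated
assumptions only: ``struct dump_slot_model``, ``model_dump_get()`` and
``model_dump_clear()`` are hypothetical names, and a dump is represented by
a simple sequence number rather than real device data.

.. code-block:: c

    #include <stdbool.h>

    /* Hypothetical single-slot dump storage for one reporter. */
    struct dump_slot_model {
        bool stored;
        unsigned int current;   /* sequence number of the stored dump */
        unsigned int generated; /* total number of dumps generated */
    };

    /* DUMP_GET: return the stored dump, generating one first if needed. */
    static unsigned int model_dump_get(struct dump_slot_model *s)
    {
        if (!s->stored) {
            s->current = ++s->generated;
            s->stored = true;
        }
        return s->current;
    }

    /* DUMP_CLEAR: drop the stored dump so the next get generates a new one. */
    static void model_dump_clear(struct dump_slot_model *s)
    {
        s->stored = false;
    }

Two consecutive ``model_dump_get()`` calls return the same dump; only after
``model_dump_clear()`` does the next call generate a fresh one.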

The following diagram provides a general overview of ``devlink-health``::

                                           +--------------------------+
                                           +--------------------------+
      mlx5_core                devlink                  |recover,
     +--------+                            +--------------------------+
     |        |                            |  +---------v----------+  |
     |        |   ops execution            |  |                    |  |
     |     <----------------------------------+                    |  |
     |        |                            |  + ^------------------+  |
     |        |                            |    | request for ops     |
     |        |                            |    | (recover, dump)     |
     |        |                            |  +-+------------------+  |
     |        |     health report          |  | health handler     |  |
     |        +------------------------------->                    |  |
     |        |                            |  +--------------------+  |
     |        |     health reporter create |                          |
     |        +---------------------------->                          |
     +--------+                            +--------------------------+