This document describes the **Distributed Switch Architecture (DSA)** subsystem
design principles, limitations, interactions with other subsystems, and how to
develop drivers for this subsystem as well as a TODO for developers interested
in joining the effort.

The Distributed Switch Architecture is a subsystem which was primarily designed
to support Marvell Ethernet switches (MV88E6xxx, a.k.a. Linkstreet product line)
using Linux, but has since evolved to support other vendors as well.

The original philosophy behind this design was to be able to use unmodified
Linux tools such as bridge, iproute2, ifconfig to work transparently whether
they configured/queried a switch port network device or a regular network
device.

An Ethernet switch is typically comprised of multiple front-panel ports, and one
or more CPU or management ports. The DSA subsystem currently relies on the
presence of a management port connected to an Ethernet controller capable of
receiving Ethernet frames from the switch. This is a very common setup for all
kinds of Ethernet switches found in Small Home and Office products: routers,
gateways, or even top-of-rack switches. This host Ethernet controller will
be later referred to as "master" and "cpu" in DSA terminology and code.

The D in DSA stands for Distributed, because the subsystem has been designed
with the ability to configure and manage cascaded switches on top of each other
using upstream and downstream Ethernet links between switches. These specific
ports are referred to as "dsa" ports in DSA terminology and code. A collection
of multiple switches connected to each other is called a "switch tree".

For each front-panel port, DSA will create specialized network devices which are
used as controlling and data-flowing endpoints for use by the Linux networking
stack. These specialized network interfaces are referred to as "slave" network
interfaces in DSA terminology and code.

The ideal case for using DSA is when an Ethernet switch supports a "switch tag",
which is a hardware feature making the switch insert a specific tag into each
Ethernet frame it receives from or sends to specific ports, to help the
management interface figure out:

- what port is this frame coming from
- what was the reason why this frame got forwarded
- how to send CPU originated traffic to specific ports

The subsystem does support switches not capable of inserting/stripping tags, but
the features might be slightly limited in that case (traffic separation relies
on Port-based VLAN IDs).

Note that DSA does not currently create network interfaces for the "cpu" and
"dsa" ports because:

- the "cpu" port is the Ethernet switch facing side of the management
  controller, and as such, would create a duplication of feature, since you
  would get two interfaces for the same conduit: master netdev, and "cpu" netdev

- the "dsa" port(s) are just conduits between two or more switches, and as such
  cannot really be used as proper network interfaces either, only the
  downstream, or the top-most upstream interface makes sense with that model

Switch tagging protocols
------------------------

DSA currently supports 5 different tagging protocols, and a tag-less mode as
well. The different protocols are implemented in:

- ``net/dsa/tag_trailer.c``: Marvell's 4-byte trailer tag mode (legacy)
- ``net/dsa/tag_dsa.c``: Marvell's original DSA tag
- ``net/dsa/tag_edsa.c``: Marvell's enhanced DSA tag
- ``net/dsa/tag_brcm.c``: Broadcom's 4-byte tag
- ``net/dsa/tag_qca.c``: Qualcomm's 2-byte tag

The exact format of the tag protocol is vendor specific, but in general, they
all contain something which:

- identifies which port the Ethernet frame came from/should be sent to
- provides a reason why this frame was forwarded to the management interface
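
The two properties above can be illustrated with a small userspace sketch. The
4-byte layout used here (magic, reason, reserved, port) is purely hypothetical
and does not match any vendor's real tag; the ``toy_`` names are not kernel
symbols. It only shows the general idea of inserting a tag after the MAC
addresses on transmit and stripping it on receive.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tag layout, NOT a real vendor format:
 * byte 0: magic, byte 1: reason code, byte 2: reserved, byte 3: port number.
 * The tag sits right after the destination and source MAC addresses. */
#define TOY_TAG_LEN     4
#define TOY_TAG_OFFSET  12
#define TOY_TAG_MAGIC   0xA5

enum toy_reason { TOY_REASON_FORWARD = 0, TOY_REASON_TRAP = 1 };

/* Insert a tag into the frame held in buf, which must have room for
 * len + TOY_TAG_LEN bytes. Returns the new length, or 0 if the frame is
 * too short to carry both MAC addresses. */
size_t toy_tag_insert(uint8_t *buf, size_t len, uint8_t port, uint8_t reason)
{
    if (len < TOY_TAG_OFFSET)
        return 0;

    /* Shift everything after the MAC addresses to make room for the tag. */
    memmove(buf + TOY_TAG_OFFSET + TOY_TAG_LEN, buf + TOY_TAG_OFFSET,
            len - TOY_TAG_OFFSET);
    buf[TOY_TAG_OFFSET + 0] = TOY_TAG_MAGIC;
    buf[TOY_TAG_OFFSET + 1] = reason;
    buf[TOY_TAG_OFFSET + 2] = 0;
    buf[TOY_TAG_OFFSET + 3] = port;
    return len + TOY_TAG_LEN;
}

/* Strip the tag and report the port and reason it carried. Returns the new
 * length, or 0 if the frame does not carry a valid tag. */
size_t toy_tag_strip(uint8_t *buf, size_t len, uint8_t *port, uint8_t *reason)
{
    if (len < TOY_TAG_OFFSET + TOY_TAG_LEN ||
        buf[TOY_TAG_OFFSET] != TOY_TAG_MAGIC)
        return 0;

    *reason = buf[TOY_TAG_OFFSET + 1];
    *port = buf[TOY_TAG_OFFSET + 3];
    /* Close the gap so the frame looks like regular Ethernet again. */
    memmove(buf + TOY_TAG_OFFSET, buf + TOY_TAG_OFFSET + TOY_TAG_LEN,
            len - TOY_TAG_OFFSET - TOY_TAG_LEN);
    return len - TOY_TAG_LEN;
}
```

Real tagging protocols differ in tag placement (header or trailer), length and
field encoding, which is why each one has its own ``net/dsa/tag_*.c`` file.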

Master network devices
----------------------

Master network devices are regular, unmodified Linux network device drivers for
the CPU/management Ethernet interface. Such a driver might occasionally need to
know whether DSA is enabled (e.g.: to enable/disable specific offload features),
but the DSA subsystem has been proven to work with industry standard drivers:
``e1000e``, ``mv643xx_eth`` etc. without having to introduce modifications to these
drivers. Such network devices are also often referred to as conduit network
devices since they act as a pipe between the host processor and the hardware
Ethernet switch.

Networking stack hooks
----------------------

When a master netdev is used with DSA, a small hook is placed in the
networking stack in order to have the DSA subsystem process the Ethernet
switch specific tagging protocol. DSA accomplishes this by registering a
specific (and fake) Ethernet type (later becoming ``skb->protocol``) with the
networking stack; this is also known as a ``ptype`` or ``packet_type``. A typical
Ethernet frame receive sequence looks like this:

Master network device (e.g.: e1000e):

1. Receive interrupt fires:

   - receive function is invoked
   - basic packet processing is done: getting length, status etc.
   - packet is prepared to be processed by the Ethernet layer by calling
     ``eth_type_trans()``

2. net/ethernet/eth.c::

          eth_type_trans(skb, dev)
                  if (dev->dsa_ptr != NULL)
                          -> skb->protocol = ETH_P_XDSA

3. drivers/net/ethernet/\*::

          netif_receive_skb(skb)
                  -> iterate over registered packet_type
                          -> invoke handler for ETH_P_XDSA, calls dsa_switch_rcv()
                                  -> invoke switch tag specific protocol handler in 'net/dsa/tag_*.c'

4. net/dsa/tag_*.c:

   - inspect and strip switch tag protocol to determine originating port
   - locate per-port network device
   - invoke ``eth_type_trans()`` with the DSA slave network device
   - invoke ``netif_receive_skb()``

Past this point, the DSA slave network devices get delivered regular Ethernet
frames that can be processed by the networking stack.
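
The receive sequence can be modelled in a few lines of userspace C. This is
only a sketch of the control flow, not kernel code: the ``toy_`` structures and
functions are hypothetical stand-ins for ``struct net_device``,
``eth_type_trans()`` and ``dsa_switch_rcv()``, and the source port is passed in
directly instead of being parsed from a tag.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy protocol value standing in for the kernel's ETH_P_XDSA. */
#define TOY_ETH_P_XDSA  0x00F8
#define TOY_MAX_PORTS   12

struct toy_netdev {
    const char *name;
    const void *dsa_ptr;       /* non-NULL once the device is a DSA master */
    unsigned long rx_packets;
};

struct toy_switch {
    /* Per-port slave network devices, NULL for unused ports. */
    struct toy_netdev *ports[TOY_MAX_PORTS];
};

/* Step 2: a DSA master overrides the frame's protocol so that the DSA
 * packet_type handler gets to see every received frame. */
uint16_t toy_eth_type_trans(const struct toy_netdev *dev, uint16_t ethertype)
{
    return dev->dsa_ptr ? TOY_ETH_P_XDSA : ethertype;
}

/* Steps 3 and 4: once the tag handler has determined the source port, the
 * frame is handed to the matching slave device. Returns NULL when no slave
 * device exists for that port, in which case the frame would be dropped. */
struct toy_netdev *toy_switch_rcv(struct toy_switch *sw, int source_port)
{
    struct toy_netdev *slave;

    if (source_port < 0 || source_port >= TOY_MAX_PORTS)
        return NULL;

    slave = sw->ports[source_port];
    if (slave)
        slave->rx_packets++;
    return slave;
}
```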

Slave network devices
---------------------

Slave network devices created by DSA are stacked on top of their master network
device. Each of these network interfaces will be responsible for being a
controlling and data-flowing end-point for each front-panel port of the switch.
These interfaces are specialized in order to:

- insert/remove the switch tag protocol (if it exists) when sending traffic
  to/from specific switch ports
- query the switch for ethtool operations: statistics, link state,
  Wake-on-LAN, register dumps...
- external/internal PHY management: link, auto-negotiation etc.

These slave network devices have custom net_device_ops and ethtool_ops function
pointers which allow DSA to introduce a level of layering between the networking
stack/ethtool, and the switch driver implementation.

Upon frame transmission from these slave network devices, DSA will look up which
switch tagging protocol is currently registered with these network devices, and
invoke a specific transmit routine which takes care of adding the relevant
switch tag in the Ethernet frames.

These frames are then queued for transmission using the master network device
``ndo_start_xmit()`` function. Since they contain the appropriate switch tag, the
Ethernet switch will be able to process these incoming frames from the
management interface and deliver them to the physical switch port.

Graphical representation
------------------------

Summarized, this is basically what DSA looks like from a network device
perspective::

                |---------------------------|
                | CPU network device (eth0) |
                |---------------------------|
                | <tag added by switch      |
                |         tag added by CPU> |
        |--------------------------------------------|
        |--------------------------------------------|
                |-------|  |-------|  |-------|
                | sw0p0 |  | sw0p1 |  | sw0p2 |
                |-------|  |-------|  |-------|

In order to be able to read from and write to a PHY built into the switch, DSA
creates a slave MDIO bus which allows a specific switch driver to divert and
intercept MDIO reads/writes towards specific PHY addresses. In most
MDIO-connected switches, these functions would utilize direct or indirect PHY
addressing mode to return standard MII registers from the switch builtin PHYs,
allowing the PHY library to retrieve link status, link partner pages,
auto-negotiation results etc.

For Ethernet switches which have both external and internal MDIO busses, the
slave MII bus can be utilized to mux/demux MDIO reads and writes towards either
internal or external MDIO devices this switch might be connected to: internal
PHYs, external PHYs, or even external switches.

DSA data structures are defined in ``include/net/dsa.h`` as well as
``net/dsa/dsa_priv.h``:

- ``dsa_chip_data``: platform data configuration for a given switch device,
  this structure describes a switch device's parent device, its address, as
  well as various properties of its ports: names/labels, and finally a routing
  table indication (when cascading switches)

- ``dsa_platform_data``: platform device configuration data which can reference
  a collection of ``dsa_chip_data`` structures if multiple switches are cascaded,
  the master network device this switch tree is attached to needs to be
  referenced

- ``dsa_switch_tree``: structure assigned to the master network device under
  ``dsa_ptr``, this structure references a ``dsa_platform_data`` structure as well as
  the tagging protocol supported by the switch tree, and which receive/transmit
  function hooks should be invoked, information about the directly attached
  switch is also provided: CPU port. Finally, a collection of ``dsa_switch`` are
  referenced to address individual switches in the tree.

- ``dsa_switch``: structure describing a switch device in the tree, referencing
  a ``dsa_switch_tree`` as a backpointer, slave network devices, master network
  device, and a reference to the backing ``dsa_switch_ops``

- ``dsa_switch_ops``: structure referencing function pointers, see below for a
  full description.

Limits on the number of devices and ports
-----------------------------------------

DSA currently limits the maximum number of switches within a tree to 4
(``DSA_MAX_SWITCHES``), and the number of ports per switch to 12 (``DSA_MAX_PORTS``).
These limits could be extended to support larger configurations should this need
arise.

Lack of CPU/DSA network devices
-------------------------------

DSA does not currently create slave network devices for the CPU or DSA ports, as
described before. This might be an issue in the following cases:

- inability to fetch switch CPU port statistics counters using ethtool, which
  can make it harder to debug MDIO switches connected using xMII interfaces

- inability to configure the CPU port link parameters based on the Ethernet
  controller capabilities attached to it: http://patchwork.ozlabs.org/patch/509806/

- inability to configure specific VLAN IDs / trunking VLANs between switches
  when using a cascaded setup

Common pitfalls using DSA setups
--------------------------------

Once a master network device is configured to use DSA (``dev->dsa_ptr`` becomes
non-NULL), and the switch behind it expects a tagging protocol, this network
interface can only be used as a conduit interface. Sending packets
directly through this interface (e.g.: opening a socket using this interface)
will not make us go through the switch tagging protocol transmit function, so
the Ethernet switch on the other end, expecting a tag, will typically drop this
frame.

Slave network devices check that the master network device is UP before allowing
you to administratively bring UP these slave network devices. A common
configuration mistake is forgetting to bring UP the master network device first.

Interactions with other subsystems
==================================

DSA currently leverages the following subsystems:

- MDIO/PHY library: ``drivers/net/phy/phy.c``, ``mdio_bus.c``
- Switchdev: ``net/switchdev/*``
- Device Tree for various of_* functions

Slave network devices exposed by DSA may or may not be interfacing with PHY
devices (``struct phy_device`` as defined in ``include/linux/phy.h``), but the DSA
subsystem deals with all possible combinations:

- internal PHY devices, built into the Ethernet switch hardware
- external PHY devices, connected via an internal or external MDIO bus
- internal PHY devices, connected via an internal MDIO bus
- special, non-autonegotiated or non MDIO-managed PHY devices: SFPs, MoCA; a.k.a.
  fixed PHYs

The PHY configuration is done by the ``dsa_slave_phy_setup()`` function and the
logic basically looks like this:

- if Device Tree is used, the PHY device is looked up using the standard
  "phy-handle" property, if found, this PHY device is created and registered
  using ``of_phy_connect()``

- if Device Tree is used, and the PHY device is "fixed", that is, conforms to
  the definition of a non-MDIO managed PHY as defined in
  ``Documentation/devicetree/bindings/net/fixed-link.txt``, the PHY is registered
  and connected transparently using the special fixed MDIO bus driver

- finally, if the PHY is built into the switch, as is very common with
  standalone switch packages, the PHY is probed using the slave MII bus created
  by DSA

DSA directly utilizes SWITCHDEV when interfacing with the bridge layer, and
more specifically with its VLAN filtering portion when configuring VLANs on top
of per-port slave network devices. Since DSA primarily deals with
MDIO-connected switches, although not exclusively, SWITCHDEV's
prepare/abort/commit phases are often simplified into a prepare phase which
checks whether the operation is supported by the DSA switch driver, and a commit
phase which applies the changes.
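
The simplified two-phase model can be sketched as follows. This is a userspace
illustration under stated assumptions: ``struct toy_port`` and the ``toy_``
functions are hypothetical and only mimic the split between a prepare phase
that validates without side effects and a commit phase that applies the
change.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_VLAN_N_VID  4096

struct toy_port {
    bool hw_vlan_capable;                   /* can the hardware offload VLANs? */
    uint8_t vlan_bitmap[TOY_VLAN_N_VID / 8]; /* stands in for hardware state */
};

/* Prepare phase: only checks whether the operation can succeed, and must
 * not modify any state. */
int toy_port_vlan_prepare(const struct toy_port *p, uint16_t vid)
{
    if (!p->hw_vlan_capable)
        return -EOPNOTSUPP;   /* caller falls back to software */
    if (vid == 0 || vid >= TOY_VLAN_N_VID - 1)
        return -EINVAL;       /* VID 0 and 4095 are not usable here */
    return 0;
}

/* Commit phase: applies the change; it is not expected to fail because
 * prepare already validated the request. */
void toy_port_vlan_add(struct toy_port *p, uint16_t vid)
{
    p->vlan_bitmap[vid / 8] |= (uint8_t)(1u << (vid % 8));
}

/* What the caller of both phases does. */
int toy_vlan_add(struct toy_port *p, uint16_t vid)
{
    int err = toy_port_vlan_prepare(p, vid);

    if (err)
        return err;
    toy_port_vlan_add(p, vid);
    return 0;
}

bool toy_port_has_vlan(const struct toy_port *p, uint16_t vid)
{
    return vid < TOY_VLAN_N_VID &&
           (p->vlan_bitmap[vid / 8] & (1u << (vid % 8))) != 0;
}
```

Keeping all validation in the prepare phase means a failed request never
leaves the switch in a half-configured state.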

As of today, the only SWITCHDEV objects supported by DSA are the FDB and VLAN
objects.

DSA features a standardized binding which is documented in
``Documentation/devicetree/bindings/net/dsa/dsa.txt``. PHY/MDIO library helper
functions such as ``of_get_phy_mode()``, ``of_phy_connect()`` are also used to query
per-port PHY specific details: interface connection, MDIO bus location etc.

DSA switch drivers need to implement a ``dsa_switch_ops`` structure which will
contain the various members described below.

``register_switch_driver()`` registers this ``dsa_switch_ops`` in its internal list
of drivers to probe for. ``unregister_switch_driver()`` does the exact opposite.

Unless requested differently by setting the ``priv_size`` member accordingly, DSA
does not allocate any driver private context space.

- ``tag_protocol``: this is to indicate what kind of tagging protocol is supported,
  should be a valid value from the ``dsa_tag_protocol`` enum

- ``probe``: probe routine which will be invoked by the DSA platform device upon
  registration to test for the presence/absence of a switch device. For MDIO
  devices, it is recommended to issue a read towards internal registers using
  the switch pseudo-PHY and return whether this is a supported device. For other
  buses, return a non-NULL string

- ``setup``: setup function for the switch, this function is responsible for setting
  up the ``dsa_switch_ops`` private structure with all it needs: register maps,
  interrupts, mutexes, locks etc. This function is also expected to properly
  configure the switch to separate all network interfaces from each other, that
  is, they should be isolated by the switch hardware itself, typically by creating
  a Port-based VLAN ID for each port and allowing only the CPU port and the
  specific port to be in the forwarding vector. Ports that are unused by the
  platform should be disabled. Past this function, the switch is expected to be
  fully configured and ready to serve any kind of request. It is recommended
  to issue a software reset of the switch during this setup function in order to
  avoid relying on what a previous software agent such as a bootloader/firmware
  may have previously configured.
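
The port isolation expected from ``setup`` can be expressed as a per-port
forwarding vector. The following userspace sketch assumes a hypothetical
6-port switch and a bitmask representation of the vector; real hardware
exposes this through vendor-specific registers.

```c
#include <stdint.h>

#define TOY_NUM_PORTS 6

/* Compute, for every port, the set of ports it may exchange traffic with.
 * enabled_mask has one bit set per user port used by the platform.
 * - an enabled user port sees only itself and the CPU port
 * - the CPU port sees itself and every enabled user port
 * - an unused port gets an empty vector, i.e. it is disabled */
void toy_setup_isolation(uint32_t enabled_mask, int cpu_port,
                         uint32_t fwd_mask[TOY_NUM_PORTS])
{
    for (int port = 0; port < TOY_NUM_PORTS; port++) {
        if (port == cpu_port)
            fwd_mask[port] = enabled_mask | (1u << cpu_port);
        else if (enabled_mask & (1u << port))
            fwd_mask[port] = (1u << cpu_port) | (1u << port);
        else
            fwd_mask[port] = 0;
    }
}
```

With this layout no two user ports share a forwarding vector entry, so traffic
between them has to go through the CPU until a bridge offload changes the
vectors.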

PHY devices and link management
-------------------------------

- ``get_phy_flags``: Some switches are interfaced to various kinds of Ethernet PHYs,
  if the PHY library PHY driver needs to know about information it cannot obtain
  on its own (e.g.: coming from switch memory mapped registers), this function
  should return a 32-bit bitmask of "flags", that is private between the switch
  driver and the Ethernet PHY driver in ``drivers/net/phy/*``.

- ``phy_read``: Function invoked by the DSA slave MDIO bus when attempting to read
  the switch port MDIO registers. If unavailable, return 0xffff for each read.
  For builtin switch Ethernet PHYs, this function should allow reading the link
  status, auto-negotiation results, link partner pages etc.

- ``phy_write``: Function invoked by the DSA slave MDIO bus when attempting to write
  to the switch port MDIO registers. If unavailable return a negative error
  code.

- ``adjust_link``: Function invoked by the PHY library when a slave network device
  is attached to a PHY device. This function is responsible for appropriately
  configuring the switch port link parameters: speed, duplex, pause based on
  what the ``phy_device`` is providing.

- ``fixed_link_update``: Function invoked by the PHY library, and specifically by
  the fixed PHY driver asking the switch driver for link parameters that could
  not be auto-negotiated, or obtained by reading the PHY registers through MDIO.
  This is particularly useful for specific kinds of hardware such as QSGMII,
  MoCA or other kinds of non-MDIO managed PHYs where out of band link
  information is obtained
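
The return conventions described for ``phy_read`` and ``phy_write`` can be
sketched in userspace C as shown here. The register file and the ``toy_`` names
are hypothetical; a real driver would issue direct or indirect MDIO accesses
instead of reading an array, and ``-ENODEV`` is just one possible negative
error code.

```c
#include <errno.h>
#include <stdint.h>

#define TOY_NUM_PORTS  8
#define TOY_NUM_REGS   32

struct toy_switch {
    uint32_t phy_mask;                          /* bit N: port N has a PHY */
    uint16_t regs[TOY_NUM_PORTS][TOY_NUM_REGS]; /* pretend PHY registers */
};

static int toy_phy_present(const struct toy_switch *sw, int port, int regnum)
{
    return port >= 0 && port < TOY_NUM_PORTS &&
           regnum >= 0 && regnum < TOY_NUM_REGS &&
           (sw->phy_mask & (1u << port)) != 0;
}

/* Reads return the register value, or 0xffff when no PHY answers, which is
 * what an idle MDIO bus would read back. */
int toy_phy_read(struct toy_switch *sw, int port, int regnum)
{
    if (!toy_phy_present(sw, port, regnum))
        return 0xffff;
    return sw->regs[port][regnum];
}

/* Writes return 0 on success or a negative error code when unavailable. */
int toy_phy_write(struct toy_switch *sw, int port, int regnum, uint16_t val)
{
    if (!toy_phy_present(sw, port, regnum))
        return -ENODEV;
    sw->regs[port][regnum] = val;
    return 0;
}
```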

- ``get_strings``: ethtool function used to query the driver's strings, will
  typically return statistics strings, private flags strings etc.

- ``get_ethtool_stats``: ethtool function used to query per-port statistics and
  return their values. DSA overlays slave network devices general statistics:
  RX/TX counters from the network device, with switch driver specific statistics

- ``get_sset_count``: ethtool function used to query the number of statistics items

- ``get_wol``: ethtool function used to obtain Wake-on-LAN settings per-port, this
  function may, for certain implementations also query the master network device
  Wake-on-LAN settings if this interface needs to participate in Wake-on-LAN

- ``set_wol``: ethtool function used to configure Wake-on-LAN settings per-port,
  direct counterpart to ``get_wol`` with similar restrictions

- ``set_eee``: ethtool function which is used to configure a switch port EEE (Green
  Ethernet) settings, can optionally invoke the PHY library to enable EEE at the
  PHY level if relevant. This function should enable EEE at the switch port MAC
  controller and data-processing logic

- ``get_eee``: ethtool function which is used to query a switch port EEE settings,
  this function should return the EEE state of the switch port MAC controller
  and data-processing logic as well as query the PHY for its currently configured
  EEE settings

- ``get_eeprom_len``: ethtool function returning for a given switch the EEPROM
  length

- ``get_eeprom``: ethtool function returning for a given switch the EEPROM contents

- ``set_eeprom``: ethtool function writing specified data to a given switch EEPROM

- ``get_regs_len``: ethtool function returning the register length for a given
  switch

- ``get_regs``: ethtool function returning the Ethernet switch internal register
  contents. This function might require user-land code in ethtool to
  pretty-print register values and registers

- ``suspend``: function invoked by the DSA platform device when the system goes to
  suspend, should quiesce all Ethernet switch activities, but keep ports
  participating in Wake-on-LAN active as well as additional wake-up logic if
  supported

- ``resume``: function invoked by the DSA platform device when the system resumes,
  should resume all Ethernet switch activities and re-configure the switch to be
  in a fully active state

- ``port_enable``: function invoked by the DSA slave network device ``ndo_open``
  function when a port is administratively brought up, this function should
  fully enable a given switch port. DSA takes care of marking the port with
  ``BR_STATE_BLOCKING`` if the port is a bridge member, or ``BR_STATE_FORWARDING`` if it
  is not, and propagating these changes down to the hardware

- ``port_disable``: function invoked by the DSA slave network device ``ndo_close``
  function when a port is administratively brought down, this function should
  fully disable a given switch port. DSA takes care of marking the port with
  ``BR_STATE_DISABLED`` and propagating changes to the hardware if this port is
  disabled while being a bridge member

- ``port_bridge_join``: bridge layer function invoked when a given switch port is
  added to a bridge, this function should do what is necessary at the switch
  level to permit the joining port to be added to the relevant logical
  domain for it to ingress/egress traffic with other members of the bridge.

- ``port_bridge_leave``: bridge layer function invoked when a given switch port is
  removed from a bridge, this function should do what is necessary at the
  switch level to prevent the leaving port from exchanging ingress/egress traffic
  with the remaining bridge members. When the port leaves the bridge, it should
  be aged out at the switch hardware for the switch to (re) learn MAC addresses
  behind this port.

- ``port_stp_state_set``: bridge layer function invoked when a given switch port STP
  state is computed by the bridge layer and should be propagated to switch
  hardware to forward/block/learn traffic. The switch driver is responsible for
  computing a STP state change based on current and requested parameters and
  performing the relevant ageing based on the intersection results
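
As an illustration of what joining and leaving a bridge means for the hardware,
the sketch below recomputes per-port forwarding masks so that only ports in
the same bridge (plus the CPU port) can reach each other. It is a userspace
model with hypothetical names and a fixed CPU port; ageing of learned
addresses on leave is not modelled.

```c
#include <stdint.h>

#define TOY_NUM_PORTS  6
#define TOY_CPU_PORT   5

struct toy_switch {
    int bridge_id[TOY_NUM_PORTS];      /* 0 = standalone, else bridge id */
    uint32_t fwd_mask[TOY_NUM_PORTS];  /* ports each port may forward to */
};

/* Every port always reaches the CPU port; bridged ports additionally reach
 * the other members of the same bridge. */
static void toy_recompute_masks(struct toy_switch *sw)
{
    for (int p = 0; p < TOY_NUM_PORTS; p++) {
        uint32_t mask = 1u << TOY_CPU_PORT;

        for (int q = 0; q < TOY_NUM_PORTS; q++) {
            if (q != p && sw->bridge_id[p] != 0 &&
                sw->bridge_id[q] == sw->bridge_id[p])
                mask |= 1u << q;
        }
        sw->fwd_mask[p] = mask;
    }
}

void toy_port_bridge_join(struct toy_switch *sw, int port, int bridge_id)
{
    sw->bridge_id[port] = bridge_id;
    toy_recompute_masks(sw);
}

void toy_port_bridge_leave(struct toy_switch *sw, int port)
{
    sw->bridge_id[port] = 0;
    toy_recompute_masks(sw);
}
```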

Bridge VLAN filtering
---------------------

- ``port_vlan_filtering``: bridge layer function invoked when the bridge gets
  configured for turning on or off VLAN filtering. If nothing specific needs to
  be done at the hardware level, this callback does not need to be implemented.
  When VLAN filtering is turned on, the hardware must be programmed to
  reject 802.1Q frames which have VLAN IDs outside of the programmed allowed
  VLAN ID map/rules. If there is no PVID programmed into the switch port,
  untagged frames must be rejected as well. When turned off, the switch must
  accept any 802.1Q frames irrespective of their VLAN ID, and untagged frames are
  allowed.

- ``port_vlan_prepare``: bridge layer function invoked when the bridge prepares the
  configuration of a VLAN on the given port. If the operation is not supported
  by the hardware, this function should return ``-EOPNOTSUPP`` to inform the bridge
  code to fall back to a software implementation. No hardware setup must be done
  in this function. See ``port_vlan_add`` for this and details.

- ``port_vlan_add``: bridge layer function invoked when a VLAN is configured
  (tagged or untagged) for the given switch port

- ``port_vlan_del``: bridge layer function invoked when a VLAN is removed from the
  given switch port

- ``port_vlan_dump``: bridge layer function invoked with a switchdev callback
  function that the driver has to call for each VLAN the given port is a member
  of. A switchdev object is used to carry the VID and bridge flags.

- ``port_fdb_add``: bridge layer function invoked when the bridge wants to install a
  Forwarding Database entry, the switch hardware should be programmed with the
  specified address in the specified VLAN ID in the forwarding database
  associated with this VLAN ID. If the operation is not supported, this
  function should return ``-EOPNOTSUPP`` to inform the bridge code to fall back to
  a software implementation.

.. note:: VLAN ID 0 corresponds to the port private database, which, in the context
        of DSA, would be its port-based VLAN, used by the associated bridge device.

- ``port_fdb_del``: bridge layer function invoked when the bridge wants to remove a
  Forwarding Database entry, the switch hardware should be programmed to delete
  the specified MAC address from the specified VLAN ID if it was mapped into
  this port forwarding database

- ``port_fdb_dump``: bridge layer function invoked with a switchdev callback
  function that the driver has to call for each MAC address known to be behind
  the given port. A switchdev object is used to carry the VID and FDB info.

- ``port_mdb_prepare``: bridge layer function invoked when the bridge prepares the
  installation of a multicast database entry. If the operation is not supported,
  this function should return ``-EOPNOTSUPP`` to inform the bridge code to fall back
  to a software implementation. No hardware setup must be done in this function.
  See ``port_mdb_add`` for this and details.

- ``port_mdb_add``: bridge layer function invoked when the bridge wants to install
  a multicast database entry, the switch hardware should be programmed with the
  specified address in the specified VLAN ID in the forwarding database
  associated with this VLAN ID.

.. note:: VLAN ID 0 corresponds to the port private database, which, in the context
        of DSA, would be its port-based VLAN, used by the associated bridge device.

- ``port_mdb_del``: bridge layer function invoked when the bridge wants to remove a
  multicast database entry, the switch hardware should be programmed to delete
  the specified MAC address from the specified VLAN ID if it was mapped into
  this port forwarding database.

- ``port_mdb_dump``: bridge layer function invoked with a switchdev callback
  function that the driver has to call for each MAC address known to be behind
  the given port. A switchdev object is used to carry the VID and MDB info.
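
The ingress rules described for ``port_vlan_filtering`` boil down to a small
decision function. The following userspace sketch uses a hypothetical
``struct toy_port`` with a bitmap of allowed VLAN IDs, and treats a PVID of 0
as "no PVID programmed"; both are assumptions made for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_VLAN_N_VID  4096

struct toy_port {
    bool vlan_filtering;                  /* set via port_vlan_filtering */
    uint16_t pvid;                        /* 0 means no PVID programmed */
    uint8_t allowed[TOY_VLAN_N_VID / 8];  /* VLAN IDs this port is a member of */
};

void toy_port_allow_vid(struct toy_port *p, uint16_t vid)
{
    if (vid < TOY_VLAN_N_VID)
        p->allowed[vid / 8] |= (uint8_t)(1u << (vid % 8));
}

/* Decide whether a frame is accepted on ingress. vid is only meaningful
 * when tagged is true. */
bool toy_ingress_accept(const struct toy_port *p, bool tagged, uint16_t vid)
{
    /* Filtering off: accept everything, tagged or not. */
    if (!p->vlan_filtering)
        return true;

    /* Filtering on: untagged frames need a PVID to be classified into. */
    if (!tagged)
        return p->pvid != 0;

    /* Filtering on: tagged frames must match the allowed VLAN ID map. */
    return vid < TOY_VLAN_N_VID &&
           (p->allowed[vid / 8] & (1u << (vid % 8))) != 0;
}
```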

Making SWITCHDEV and DSA converge towards a unified codebase
-------------------------------------------------------------

SWITCHDEV properly takes care of abstracting the networking stack with offload
capable hardware, but does not enforce a strict switch device driver model. On
the other hand, DSA enforces a fairly strict device driver model, and deals with
most of the switch specific details. At some point we should envision a merger
between these two subsystems and get the best of both worlds.

- making the number of ports fully dynamic and not dependent on ``DSA_MAX_PORTS``
- allowing more than one CPU/management interface:
  http://comments.gmane.org/gmane.linux.network/365657
- porting more drivers from other vendors:
  http://comments.gmane.org/gmane.linux.network/365510