InfiniBand

  1. openfabrics.org - primary source for OFED. Somewhat convoluted, thus a bit of work to get it compiled on your platform. Compilation for MPI, MVAPICH, etc. would be needed for an HPC env. Includes ULPs (upper layer protocols) such as IPoIB.
  2. RHEL OFED - RHEL backports fixes and keeps a more usable OFED stack than the community version. Use this before the community version to avoid headaches (eg Lustre compatibility).
  3. mellanox.com - compiled binaries for major distros; used to lag about 3 months behind source, but probably the best route for supported cards.
    If updating the OS, get the matching new version of the driver and recompile for that OS; may take some 15 min to compile and install. The installer may also disable opensmd in systemd; just re-enable it.
  4. The current trend seems to be to use the OFED drivers (some compilation needed), with the same MPI/MVAPICH and ULP (eg IPoIB) notes as item 1. A hedged install sketch follows this list.
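
A hedged sketch of a typical MLNX_OFED install/rebuild after an OS or kernel update (tarball name and flags vary by release; check the docs bundled with the download):

# download the MLNX_OFED tarball matching the distro, then:
tar xzf MLNX_OFED_LINUX-*.tgz
cd MLNX_OFED_LINUX-*
sudo ./mlnxofedinstall --add-kernel-support   # rebuild kernel modules for the running kernel
sudo /etc/init.d/openibd restart              # reload the IB stack with the new modules
sudo systemctl enable opensmd                 # re-enable opensm if the installer disabled it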

InfiniBand 101

CA - Channel Adapter
HCA - Host Channel Adapter (like how FibreChannel has HBA)
LID - Local ID, unique ID of IB end points (assigned by the subnet manager).
port - a port on the HCA, like NIC ports on an ethernet quad card.
GUID - Globally Unique ID
sminfo gives the GUID of the subnet manager; can be used to determine whether two nodes are on the same fabric or in two separate islands (see the example below).
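
For example (illustrative; hostnames hypothetical), compare the SM GUID reported on two nodes -- a matching "sm guid" means they are on the same fabric:
ssh node1 sminfo
ssh node2 sminfo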

Symbol Error - think of this as a packet error in the ethernet world. It is the rate of error accumulation that should be of concern.

XmtDiscards - counter indicative of congestion (concern when the rate of increase is rapid)

Verbs - the low-level API applications use to talk to the HCA directly (post send/receive work requests, RDMA reads/writes), bypassing the kernel socket stack; MPI and other HPC libraries are built on top of verbs. (In-network reduction, eg summing counters spread across many IB hosts inside the switch itself, is a separate feature of newer Mellanox switches, SHARP, rather than verbs per se.)

PS. Mellanox ConnectX-5, 6 can take AI-branded QSFP modules (but not the other way around).

InfiniBand Generations

Speeds & modulation

SDR - Single Data Rate.             4x2.5 = 10 Gbps  (8 Gbps effective after 8b/10b encoding overhead)
DDR - Double Data Rate.             4x 5  = 20 Gbps
QDR - Quad Data Rate.       QSFP,   4x10  = 40 Gbps.  CX-2 ??
FDR - Fourteen Data Rate.   QSFP,   4x14  = 56 Gbps.  CX-3.  64b/66b encoding from FDR on.
EDR - Enhanced Data Rate.   QSFP28, 4x25  = 100 Gbps.  CX-4 ?  CX-5?  NRZ modulation.    Backward compatible with FDR.
HDR - High Data Rate.       QSFP56, 4x50  = 200 Gbps.  CX-6 can do HDR200 or Y-split 2x HDR100.  PAM4 modulation.  2018.  Backward compatible at 10/25/40/50/100/200G, but matching cables would be needed.
NDR - Next Data Rate.       OSFP?   4x100 = 400 Gbps.  CX-7.  PAM4.
XDR - eXtreme? Data Rate.   OSFP?   4x200 = 800 Gbps.

modulation (think encoding):
NRZ:   Non-Return-to-Zero: traditional binary, on/off, 1 bit per symbol.  Effectively the same thing as PAM-2.
PAM-4: Pulse Amplitude Modulation, 4 levels: 2 bits per symbol, encoded as 4 distinct voltage levels.

Arista 7280R2 - 25 Gbps SerDes?  QSFP28 ports, max out at 100 Gbps?  only does NRZ?
Arista 7280R3 - 50 Gbps SerDes?  QSFP56 ports, capable of 200 Gbps?  capable of PAM4?

ConnectX-6 can connect to slower equipment, but note that an HDR cable will negotiate PAM4.  An EDR cable would be needed to have it negotiate NRZ.
In practice, a ConnectX-6 HCA with an HDR100 cable connected to an EDR switch worked.  ibstat reports Rate: 100, so presumably it did negotiate NRZ (4x25) because of the EDR switch.
Still don't understand why HDR cables come in an IB version vs an Ethernet version (they are not interchangeable).


Ref:
  1. wikipedia EDR vs HDR table
  2. Review of ConnectX-5 by STH
  3. Mellanox CX-6 spec sheet
MCX556A-EDAT -- deciphering this model number:
First 5 = CX-5
The 6   = dual port
The D   = PCIe 4.0

Note: Models with the Lx designation run at a lower speed than the max allowed by the protocol, eg 50 Gbps instead of 100 (because they don't use all the lanes?).

Generation cross compatibility

The following note is from wonderful colleague Chris from Mellanox/Nvidia:

Each product (Adapters/Eth Switches/IB switches) has its own firmware compatibility lists. The firmware notes for each product will have compatibility cross maps.

FDR Adapters and Switches using FDR cables are compatible with QDR/FDR/FDR10/EDR/HDR* (*HDR firmware has notes about specific ports)

EDR Adapters and Switches using EDR cables are compatible with EDR/HDR/FDR** (** FDR does not officially support EDR cables due to the higher wattage and may not work in some cases)
[If an HDR cable is used in an EDR environment, the HDR cable will try to negotiate PAM4, which EDR does NOT understand (it only supports NRZ modulation)].

HDR Adapters and Switches using HDR cables are compatible with HDR only. HDR uses PAM4 encoding and will assume PAM4 when an HDR cable is used (the cable EEPROM is read).

ConnectX-6 dual port adapters will support 1/10/25/40/50/100/200 FDR/EDR/HDR when the correct cables/port adapter is used, AND ONLY 1 port is used.

If both ports are enabled on the dual port card, some internal clocks will limit it to 25/50/100/200 EDR/HDR only.

The basic rule I tell customers is to always use the cable that is compatible with the lowest speed of the 2 devices. The higher speed cables require higher power at the port, and also have an EEPROM letting the port know what speed to start the negotiation.



With Ethernet it gets very complicated: you will need to know whether the Adapter or switch is PAM4, and make sure you are using the compatible cables. Mellanox/NVIDIA equipment will also use the EEPROM in the cables to establish the correct negotiation (PAM4 or NRZ). I would also reference the firmware notes of each product to determine cross compatibilities. When using 3rd party equipment it is not a guarantee they are backwards/forwards compatible.

InfiniBand 103

opensm 	# OpenFabrics subnet manager.  Runs on one of the hosts.
	# Additional standby managers can be set up on other hosts.
	# The daemon process will only bind to ONE IB port.
	# HA fabrics should be joined into a single one.
	# Fix opensm.conf to bind to the desired GUID (port).
	# Not specifying one will prompt a choice if multiple ports are available.
	# If you have a two-port card and need TWO subnet managers, then
	# update /etc/init.d/opensm and start TWO instances of opensm.
	# Use -g GUID to specify which port each instance uses; this overrides the config in /etc/opensm/opensm.conf.  eg
	# /usr/sbin/opensm --daemon
	# /usr/sbin/opensm --daemon -g 0x506b4b03007c00c2  --logfile /var/log/opensm_c2.log
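To find the port GUIDs to feed to -g (or to set in opensm.conf), a couple of hedged options:
	# ibstat -p                       # list local port GUIDs
	# ibstat | grep -i 'port guid'    # same info, from the full ibstat output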

OFED has the IB drivers and opensm.
Mellanox compiles their own version and provides support for it: MLNX_OFED, aka MOFED.  Has the driver and opensm.  No more support for ConnectX-2 cards as of 2018.
The SL7 version of OFED is said to have less trouble with Lustre than the general public OFED.


opensm --create-config /tmp/opensm.conf 
cp -i /tmp/opensm.conf /etc/opensm/
vi /etc/opensm/opensm.conf # set GUID for the port where openSM should bind to
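
If opensm came from a packaged install (MOFED or distro), it usually ships as a service; a hedged way to enable it and confirm a master SM is present:
systemctl enable --now opensmd    # service name is opensmd with MOFED; may be "opensm" on some distros (or use chkconfig/service on older init)
sminfo                            # should now report the SM guid / master state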

Troubleshooting commands
ibstat
  - all ports should be Active.  A port becomes Active when a running fabric (subnet) manager registers it as the port comes alive.
  - If the port is connected but no Subnet Manager is running, the port state will be Initializing.
  - If the port state is Down, then there is a problem with the physical link (even when no SM is running).  Reseating the cable after the SM is running may help.
  - node guid
  - port guid
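
A quick cluster-wide sanity check (hedged example, reusing the pdsh group style used further down; dshbak just collates identical output):
pdsh -g etna0 'ibstat | grep -E "State|Rate"' | dshbak -c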

ibv_devinfo - similar info to ibstat

rds-ping
rds-info

ompi_info --all
ompi-top
ompi-ps



iblinkinfo 
iblinkinfo --node-name-map fabrics.node-name-map
	# iblinkinfo provides link info for whole fabric, very quick 
	# node-name-map converts GUID to human name
	# 0xe41d2d03004f1b40 "ib000.cf0 (Mellanox SX6025, S54-21)"

ibnetdiscover 
ibnetdiscover --node-name-map fabrics.node-name-map > ibnet.topo
	# generate a topology file 
	# switch LID would be listed above all the ports, see comment field


cat fabrics.conf | grep -v ^$ | tr '[]' '#' | awk -F\= '{print $1 " \"" $2 "\""}' > fabrics.node-name-map
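
Illustrative example of what that pipeline produces, assuming fabrics.conf holds [cluster] section headers and GUID=name lines (as described under ibcheck below); the GUID here is the one from the iblinkinfo example above:
# fabrics.conf line:      0xe41d2d03004f1b40=ib000.cf0 (Mellanox SX6025, S54-21)
# node-name-map line:     0xe41d2d03004f1b40 "ib000.cf0 (Mellanox SX6025, S54-21)"
# [cluster] headers become lines starting with #, which the node-name-map readers treat as comments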

MOFED (MLNX_OFED) for RH7
/etc/init.d/openibd status
/etc/init.d/opensmd status


ibcheck

ibacm...
        # ibacm - address and route resolution services for IB (the ACM assistant daemon)
        # not really needed by OFED, but included by yong


tools to update IB HCA firmware:
flint     (part of MFT, the Mellanox Firmware Tools package)
mstflint  (standalone equivalent; see the Troubleshooting section below)

# select output of ibnetdiscover

switchguid=0x2c90200410d02(2c90200410d02)
Switch  24 "S-0002c90200410d02"   # "MT47396 Infiniscale-III Mellanox Technologies" base port 0 lid 2 lmc 0
#                                                                               ib switch LID      ^^^

[16]    "H-00188b9097fe8e41"[1](188b9097fe8e42)     # "rac1001 HCA-1" lid 4 4xDDR
[15]    "H-00188b9097fe906d"[1](188b9097fe906e)     # "rac1002 HCA-1" lid 9 4xDDR
^^^^ ib port where host is connected            comment area and host's LID ^^^
ibclearerrors		# reboot will induce errors, this is normal.
ibclearcounters		


hca_self_test.ofed  	# sanity check. from the ofed-scripts rpm.  /usr/bin
pdsh -g etna0 TERM=vt52 /usr/bin/hca_self_test.ofed | grep -i fail
# the script uses tput and needs TERM forced when used with pdsh


perfquery		# simple way to read IB counters.  pay special attention to:
			# SymbolErrorCounter
			# LinkErrorRecoveryCounter
			# from infiniband-diags by Mellanox, installs to /sbin

ibqueryerrors -c -s XmWait


ibcheckerrors -b	# very noisy, from infiniband-diags rpm by Mellanox, /sbin



ibdiagnet		# scan fabric, takes a long time

wwibcheck by Yong Qin:
ibcheck -f fabrics.conf -C etna -E  -dd -a -b -O 20 > now.out
ibcheck -f fabrics.conf -C etna -E  -dd -a -b -O 20 > 2hrsLater.out
but can't simply diff the two to see errors; need to awk out the RcvPkts etc traffic columns (see the sketch after the header below)...

per-node errors are listed in the bottom section.
This is the header of the available counters:

   NodeDesc Port               Guid ExcBufOverrunErrors LinkDowned LinkIntegrityErrors LinkRecovers RcvConstraintErrors    RcvData RcvErrors    RcvPkts RcvRemotePhysErrors RcvSwRelayErrors SymbolErrors VL15Dropped XmtConstraintErrors    XmtData XmtDiscards    XmtPkts
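
A hedged way to make the two runs diffable: blank out the ever-changing traffic counters (RcvData, RcvPkts, XmtData, XmtPkts are columns 9, 11, 17, 19 of the header above; assumes NodeDesc has no embedded spaces):
awk '{ $9=$11=$17=$19="-"; print }' now.out       > now.diffable
awk '{ $9=$11=$17=$19="-"; print }' 2hrsLater.out > 2hrsLater.diffable
diff now.diffable 2hrsLater.diffable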




IB port on switch side

ibportstate - manage specific port on the IB switch, eg turn it on/off, etc.

ibportstate [swlid] [swPortNum] [command]	
ibportstate  2        11         disable	# turn off IB switch port a specific host is connected to
ibportstate  6        11         disable
     # can also use enable, status.  Other form allow for changing speed, etc.
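
To find which switch LID and port a given host hangs off of before disabling it, one hedged approach is to grep the iblinkinfo output (hostname here is just an example):
iblinkinfo --node-name-map fabrics.node-name-map | grep rac1001
     # the switch LID and port number appear on the switch side of the matching line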

ibcheck
ibcheck needs root to run many of its commands (ibcheck uses fabrics.conf for GUID-to-hostname resolution; etna below is the name of a cluster)
ibcheck -f ../fabrics.conf -C etna -E  -dd -a -b -O 20 > ibcheck-E-b-dd-a.etna.out.1226

ibcheck -h		# help (also pydoc ./ibcheck)

ibcheck -E      # errors and perf counter (top-like UI, h for help)

ibcheck -d -f fabrics.conf -C lrc-fabric
	-f      # specify a fabric definition file, so name displayed is looked up from fabrics.conf
	-d      # detail level 1, just a list of devices
	-dd     # detail level 2, switch and what is connected to each port [most adequate output]
		# -dd   scans all ports with a link attached, ie lists each switch and what is connected to it.
	-ddd    # detail level 3, adds vendor id, device id, guid etc to each switch or CA.
		# -ddd  scans all ports - empty ports will be explicitly listed as such
	-C      # filter by Cluster - ie display only for entries matching specified cluster
	-i      # change scan interval, in seconds.
	-b      # batch mode, avoid the top-like interface for -E
	-O      # timeout, default to 3 sec, may need it to be longer

ibcheck -f fabrics.conf -dd -C cf0 -E -i3
		# monitor Error counter for specific cluster
		# top-like interface.  use space to toggle between error-only list vs all-node list
		# -f fabrics.conf is a file maintained manually.  Uses Node GUID : name key-value pairs
		# note that nodes are defined in their cluster (eg [lr4]) and super cluster (ie [lrc-fabric])

ibcheck -l 32,33 -ddd 					# -l filter for specific LID (comma list)
ibcheck -f fabrics.conf -g 0x7cfe900300b515e0 -ddd 	# -g GUID, can be comma list
ibcheck -f fabrics.conf -t ibnetdiscover.out	 	# -t scan for changes against a saved ibnetdiscover file
ibcheck -f fabrics.conf -dd -M 				# -M also display how many subnet manager(s) are running and where.

ibcheck -E -b         -O 10 -ddd

Ref
  1. Yong Qin's IB Diag & Troubleshooting
  2. ulowa basic ib troubleshooting
  3. http://tinyurl.com/infiniband - my cache of assorted IB docs.



Y-split cable switch port config
The port on the HDR switch needs to be configured for a Y-split cable, else one of the nodes gets no connection.
HDR Y-split cable:
green pull tab is live/usable in the default (non-split) config of the switch port
blue pull tab is not usable till the switch port is configured to SPLIT_MODE

Ref: nvidia doc on y-split config
# using the switch LID worked, but a subnet manager must be active to assign a LID to the switch

ibswitches 	# scan fabric for switches and display their LIDs
         	# lid 0 seems to be the default unconfigured LID, and doesn't work

# q = query current setting
mlxconfig -d lid-912 q
mlxconfig -d lid-912 q SPLIT_MODE
mlxconfig -d lid-$ID q SPLIT_PORT
mlxconfig -d lid-999 q SPLIT_PORT[1..12]	# see config for ports 1 to 12, inclusive

# enable split mode, all ports will be split
# split port will go from x/y to have subports x/y/1 and x/y/2
mlxconfig -d lid-912 set SPLIT_MODE=1     # lid-912 as example , mlxconfig will prompt for confirmation
    flint -d lid-$ID swreset              # reboot switch, no password required! but only root can open dev path

# set specific port as non split
mlxconfig -d lid-912 set SPLIT_PORT[40]=0    # port 40 as non-split

# set specific port as split (they get the extra x/y/z treatment)
mlxconfig -d lid-912 set SPLIT_PORT[1..32]=1 # port 1 to 32 as split mode


other things to try:

sudo mst start
# which create /dev/mst device path

# set (switch?) to support SPLIT_MODE:
# but my dev is for the local CX6 .... this didn't work:

# query for current settings:
sudo mlxconfig -d /dev/mst/mt4123_pciconf0 q | grep -i SPLIT       # q=query, but split_mode is on the switch, not the HBA
sudo mlxconfig -d /dev/mst/SW_MT54000_EVB-SX-1_L00_lid-0x0001 q    # doc says switch device path could exist under /dev/mst (not for me)

applying a raw config file: there is a command to upload a raw config file (eg from support?) to the (switch? HBA?).  Anyway, it looks like this; RTFM if you really need to do this.
mlxconfig -f ./tlv_file.conf -d /dev/mst/mt4115_pciconf1 set_raw

configuring split port on managed switch (one with RJ45 serial console)
MLNX-OS ...
Split port mode for Y-cable support in managed switch

serial into switch
admin/admin is the default
	will prompt for a password change, as well as to set a password for "monitor"
enable
config term
system profile ib split-ready
(reboot)

interface ib 1/1-1/36
shutdown
module-type qsfp-split-2
(ports 1 to 36 will be in split mode and ready for use; the existing 2nd port of a Y-split would be online right away)

check status:

show interface ib status


to set back to non-split mode
module-type qsfp

save config
configuration write 	# MLNX-OS equivalent of Cisco's "copy running-config startup-config" / "write mem"; see the MLNX-OS section below

For details, see https://support.lenovo.com/us/en/solutions/ht508894-how-to-split-or-unsplit-ports-on-a-qm8700-using-mlnx-os-cli

How to split or unsplit ports on a QM8700 using MLNX-OS CLI - 
This article introduces the procedure used to split the ports in two on a QM8700 switch. QM Series switches support split mode, which allows for the 200 Gb/s quad-lane ports to be divided into two 100 Gb/s dual-lane ports.


need the Mellanox cable for serial connectivity.

IPoIB

Run IP network over Infiniband (instead of Ethernet).

# bring up interface after driver install, w/o reboot 
rmmod    ib_ipoib
modprobe ib_ipoib

/etc/sysconfig/network-scripts/ifcfg-ib0 ::
CONNECTED_MODE
no  = datagram mode (the default); MTU limited to about 2044 (4092 on a 4K-MTU fabric)
yes = connected mode; uses reliable-connected QPs and allows a much larger MTU (up to 65520).  See the example ifcfg after the refs below.
Ref:
  • RH IPoIB cli
  • RH IPoIB
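
A minimal ifcfg-ib0 sketch with connected mode turned on (illustrative values; CONNECTED_MODE and the 65520 MTU are per the RH IPoIB docs referenced above):
DEVICE=ib0
TYPE=InfiniBand
ONBOOT=yes
BOOTPROTO=none
IPADDR=192.168.207.58
NETMASK=255.255.255.0
CONNECTED_MODE=yes
MTU=65520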

    IB switch Config

    
    
    To use a Y-split cable, the switch port has to be set to the split profile.  See p. 387 of the nvidia mlnx-os user manual rev 6.6 for sw v 3.9.3124
    admin/admin
    
    config
    interface ib 1/4
    shutdown   ! shutdown the specific switch port
    
    module-type qsfp-split-2   ! set to split mode for use with Y-split cable
    module-type qsfp           ! revert, back to normal 1:1 cable
    
    
    show interfaces ib status
    
    # Note, QM8700 HDR switch has 40 ports.  
    

    IPoIB with Bonding Config

    The OFED 1.4 instructions for configuring bonding for IPoIB interfaces are to create static configs in /etc/sysconfig/network-scripts for the files ifcfg-bond0, ifcfg-ib0, ifcfg-ib1, as below:
    IB-bond for system WITHOUT ethernet bond
    $ cat /etc/sysconfig/network-scripts/ifcfg-bond0
    DEVICE=bond0
    BOOTPROTO=none
    IPADDR=10.2.2.98
    NETMASK=255.255.255.0
    NETWORK=10.2.2.0
    BROADCAST=10.2.2.255
    ONBOOT=yes
    USERCTL=no
    TYPE=Bonding
    
    $ cat /etc/sysconfig/network-scripts/ifcfg-ib0  
    DEVICE=ib0
    BOOTPROTO=none
    ONBOOT=yes
    USERCTL=no
    MASTER=bond0
    SLAVE=yes
    TYPE=InfiniBand
    
    $ cat /etc/sysconfig/network-scripts/ifcfg-ib1
    DEVICE=ib1
    BOOTPROTO=none
    ONBOOT=yes
    USERCTL=no
    MASTER=bond0
    SLAVE=yes
    TYPE=InfiniBand
    
    # relevant entries in  /etc/modprobe.conf ::
    ## added for Oracle 10/11 (per Cambridge WI)
    ## make sure that the hangcheck-timer kernel module is set to load when the system boots
    options hangcheck-timer hangcheck_tick=1 hangcheck_margin=10 hangcheck_reboot=1
    alias ib0 ib_ipoib
    alias ib1 ib_ipoib
    alias net-pf-27 ib_sdp
    ##  For IB bonding
    alias bond0 bonding
    options bond0 miimon=100 mode=1 max_bonds=1
    
    
    As of OFED 1.4.0 (circa 2009.09), the above bonding config would work: bond0 would be created correctly and disabling the IB port for, say, ib0 would cause things to fail over.

    However, fail over won't actually work if the machine also has ethernet bonding configured. The config would successfully create a bond for ib0 and ib1, but the IP would be bound to a specific interface, and when the IB port is disabled from the switch, ping and rds-ping would stop working. Maybe it has to do with some bugs in the ifcfg-* scripts in RHEL 5.3 that associate the HW "mac address" of the ibX interfaces incorrectly with the bonding interface. OFED 1.4.x doesn't support the bonding config in /etc/infiniband/openib.conf anymore. Manually creating the ib-bond after system boot works, and fail over then actually works correctly. Here is the required config:
    IB-bond for system with ethernet bond
    
    $ cat /etc/sysconfig/network-scripts/ifcfg-ib0
    DEVICE=ib0
    BOOTPROTO=none
    STARTMODE=onboot
    ONBOOT=yes
    USERCTL=no
    BROADCAST=192.168.2.255
    NETMASK=255.255.255.0
    #
    IPADDR=0.0.0.0
    #SLAVE=yes
    #MASTER=bond1
    TYPE=InfiniBand
    
    
    $ cat /etc/sysconfig/network-scripts/ifcfg-ib1
    DEVICE=ib1
    BOOTPROTO=none
    STARTMODE=onboot
    ONBOOT=yes
    USERCTL=no
    BROADCAST=192.168.2.255
    NETMASK=255.255.255.0
    #
    IPADDR=0.0.0.0
    #SLAVE=yes
    #MASTER=bond1
    TYPE=InfiniBand
    
    
    # relevant entries in  /etc/modprobe.conf ::
    options bond0 mode=balance-rr miimon=100    max_bonds=2
    
    # ethernet bonds are configured as in stock RHEL 5.3 config.
    
    
    
    #!/bin/sh
    
    # init script to start ib-bond at boot time:
    #
    # run "manual" ib-bond config
    # could not do this in /etc/sysconfig/network-scripts as bond1, ib0, ib1 scripts
    # as somehow presence of eth bond would make ib-bond fail over not to work.
    # maybe a bug in how the network-scripts are parsed...
    #
    ##  nn = startLevel  Sxx Kxx
    ##  eg start at rc3 and rc5,   start as S56, kill is omitted so no Kxx script
    ##  maybe as S28 if need to be before oracleasm
    ##
    # chkconfig: 35 28 -
    # description: ib-bond config
    #
    
    # source function library
    . /etc/rc.d/init.d/functions
    
    
    RETVAL=0
    prog="ib-bond-config"
    
    start() {
            ifconfig ib0 up 0.0.0.0/24
            ifconfig ib1 up 0.0.0.0/24
            ib-bond --bond-name bond1 --bond-ip 192.168.2.101/24 --slaves ib0,ib1 --miimon 100
    }
    
    stop() {
            ib-bond --bond-name bond1 --stop
    }
    
    status() {
            ib-bond --status-all
    }
    
    case "$1" in
            start)
                    echo -n $"Starting $prog: "
                    start
                    RETVAL=$?
                    [ "$RETVAL" = 0 ] && logger local7.info "ib-bond-config start ok" || logger local7.err "ib-bond-config start failed"
                    echo
                    ;;
            stop)
                    echo -n $"Stopping $prog: "
                    #echo 'not implemented yet'
                    stop
                    RETVAL=$?
                    [ "$RETVAL" = 0 ] && logger local7.info "ib-bond-config stop ok" || logger local7.err "ib-bond-config stop failed"
                    echo
                    ;;
            status)
                    status
                    echo
                    ;;
            *)
                    echo $"Usage: $0 {start|stop|status}"
                    RETVAL=1
    esac
    
    
    
    exit $RETVAL
    
    ## setup aid:
    ##  sudo cp -p /nfshome/sa/config_backup/lx/conf.test-4000/ib-bond-config /etc/init.d/
    ##  sudo ln -s /etc/init.d/ib-bond-config /etc/rc.d/rc3.d/S56ib-bond-config
    ##  sudo ln -s /etc/init.d/ib-bond-config /etc/rc.d/rc5.d/S56ib-bond-config
    
    
    
    There is one final catch: eth fail over works for one NIC at a time. If both eth0 and eth1 are ifdown'd, the minute eth0 is ifup'd, the machine resets. Not sure why; there is no log message, so it may not even be a kernel panic... But if both eth interfaces are down, the machine is probably screwed anyway...

    RDS and InfiniBand

    RDS stands for Reliable Datagram Socket. It was modeled/designed as a UDP replacement, but adds the reliability and in-sequence delivery characteristics traditionally only available from TCP. It was meant to be lightweight, relying on the InfiniBand hardware to do the path mapping and "virtual channel" config.

    The RDS developer mailing list has a doc indicating that RDS would not depend on IP when run on InfiniBand, so IPoIB may not actually need to be configured (eg, like how MPICH works with InfiniBand without any IP support). On the flip side, there are also discussions of implementing RDS over TCP! (eg: take 2) At any rate, rds-ping needs an IP address, rds-info is centered around the premise of IP addresses, and Oracle doesn't work unless IPs are assigned to the IB interfaces.

    So, for HA (at least for Oracle), it seems that IPoIB fail over via IB bonding needs to be configured. Oracle cluster/RAC does not have the ability to use multiple ib0, ib1 interfaces as its private network, so some sort of automatic HCA fail over would be needed.

    RDMA and InfiniBand

    RDMA = Remote DMA, ie Remote Direct Memory Access.
    It allows two hosts with IB HCAs to copy data directly from one host's memory into the other's, bypassing the many copies needed with a traditional NIC. Combined with the low latency of IB, it gives very high transfer speeds.
    However, RDMA is not implemented by RDS as of OFED 1.4. (It is for TCP? but that would be at the IPoIB layer? Or maybe for things that use IB directly like MPICH... don't know....)
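
    A quick way to see RDMA in action is the perftest tools that typically ship with OFED (hedged example; hostnames hypothetical, run the server side first):
    ib_write_bw             # on the server node; waits for a client
    ib_write_bw node1       # on the client; connection setup goes over TCP/IP (eg the IPoIB address), the test traffic itself goes over RDMA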

    Troubleshooting

    
    /usr/bin/mst start
    mst status
    
    mlxvpd    -d /dev/mst/mt4119_pciconf0 	# show nic model part number, serial
    mlxuptime -d /dev/mst/mt4119_pciconf0	# show dev cpu freq and uptime
    
    
    
    ibv_devinfo	# find firmware version
    
    /sbin/lspci -d 15b3:		# list only mellanox devices, get device id for use with mstflint
    
    mstflint -hh # detail help on burning firmware on mellanox device
    
    sudo /usr/bin/mstflint --device af:00.0 query 
    sudo /usr/bin/mstflint -d       af:00.0 q     full  # full query, but still no OPN info
    
    
    ## burn .bin firmware (not .zip).  -ir burn does some image reactivation prior to burning, which is what the example showed.
    sudo /usr/bin/mstflint --device af:00.0 --image fw-ConnectX5-rel-16_27_2008-MCX555A-ECA_Ax-UEFI-14.20.22-FlexBoot-3.5.901.bin -ir burn 
    ## but the -ir thing wasn't accepted, so skipped it and ended up doing:
    sudo /usr/bin/mstflint --device af:00.0 --image fw-ConnectX5-rel-16_28_2006-MCX555A-ECA_Ax_Bx-UEFI-14.21.17-FlexBoot-3.6.102.bin  burn 
    
    
    # can't verify again after a firmware upgrade unless the module is reloaded first.  easier to just reboot.
    sudo /usr/bin/mstflint --device af:00.0 verify
    
    
    # to check for cable problem, connectivity with switch:
    lspci | grep Mellanox # to get the pci device number
    mlxlink -d af:00.0 -m -e -c --show_fec
    mlxlink -d af:00.0 --speeds 10G,25G,50G,100G
    
    mlxlink -d af:00.0 -a TG # Toggle link up and down (to reset speed)
    
    # find cable type via Part Number:
    mlxlink -d af:00.0 -m  | grep Part
    	Vendor Part Number  : MCP1600-C002	# EDR QSFP28 DAC ## n0002sav3 no FDR, only SDR (CX-5 HBA, FDR switch)
    	Vendor Part Number  : MCP1600-E002E30	# EDR QSFP+  DAC ## n0004sav3 works as FDR (CX5 HBA, FDR switch)
    	Vendor Part Number  : MC2207130-002	# FDR QSFP+  DAC VPI 56 Gbps IB/40 Gbps Eth ## n0150sav3 
    
    the C002 version is an ethernet cable; as IB it will negotiate to 10Gb in the FDR config, and not work at all in the EDR config.  
    the E002E30 version is IB and can negotiate down to 56Gb and run at FDR rate!
    part numbers matter!!
    
    ref: 
    https://www.mellanox.com/related-docs/prod_cables/PB_MCP1600-Cxxx_100GbE_QSFP28_DAC.pdf     - Cxxx = Ethernet
    https://www.mellanox.com/related-docs/prod_cables/PB_MCP1600-E0xxEyy_100Gbps_QSFP28_DAC.pdf - E0xxEyy = IB
    https://store.mellanox.com/products/nvidia-mc2207130-002-passive-copper-cable-vpi-up-to-56gb-s-qsfp-2m.html - FDR VPI
    
    HDR100 vs 100GbE cables (Y-split version):
    both cables have the same Model number:  MCP7H50
    but a single character differ in the part number:
    MCP7H50-V002R26 - 100GbE
    MCP7H50-H002R26 - HDR100  (presumably works in IB mode "only")
    (The 100GbE marking was on the cable connector; this cable negotiated to 5 Gbps when used as IB, and ibnetdiscover reported it as SDR).
    
    
    ref:
  • mstflint doc from mellanox.

    IB to Ethernet

    mlxconfig, from the mellanox doc.
    Check if Mellanox ConnectX-n card is in IB or Eth/VPI mode
    
    mst start
    mlxconfig query | grep LINK_TY
    
    # 1 = IB
    # 2 = Eth/VPI
    
    


    change IB card to Eth mode
    
    mlxconfig --dev /dev/mst/mt4119_pciconf0 set LINK_TYPE_P1=2                  # single port NIC
    mlxconfig --dev /dev/mst/mt4119_pciconf0 set LINK_TYPE_P1=2 LINK_TYPE_P2=2   # dual   port NIC
    
    
    for Mellanox ConnectX-4 
    
    may need to run 
    mst start
    mst status
    to create the /dev/mst device 
    allegedly once in eth mode, there is no need for any of these things.
    mst in turn requires the OFED driver to be installed.
    (have not found a way to run these from a docker container, even though Mellanox seems to have something for this, maybe for some product of theirs which I have not found doc about yet).
    
    also need a Mellanox MAM1Q00A-QSA (QSFP-to-SFP+ adapter) for each port you want on ethernet, as well as SR optics.  
    
    
    
    https://community.mellanox.com/s/article/getting-started-with-connectx-5-100gb-s-adapters-for-linux
    
    
    
    Note that the LINK_TYPE_P1 and LINK_TYPE_P2 equal 1 (InfiniBand) by default.
    
     mlxconfig -d /dev/mst/mt4119_pciconf0 query
     (or simply mlxconfig query)
     --dev name depends on what "mst status" reports
    
    eg2 (connectX-3 MT27500):
    mlxconfig --dev /dev/mst/mt4099_pciconf0 query | grep LINK_TY
             LINK_TYPE_P1                        VPI(3)          
    		 LINK_TYPE_P2                        VPI(3)          
    
    eg3 (ConnectX-5 MT27800):
             LINK_TYPE_P1                        IB(1)           
    
    
    before eth mode on the CX-5 card, ip link reports:
    	2: enp1s0f0:   [NO-CARRIER,BROADCAST,MULTICAST,UP] mtu 1500 qdisc mq state DOWN group default qlen 1000
    	3: enp1s0f1:   [NO-CARRIER,BROADCAST,MULTICAST,UP] mtu 1500 qdisc mq state DOWN group default qlen 1000
    	4: enp134s0f0: [BROADCAST,MULTICAST,UP,LOWER_UP] mtu 1500 qdisc mq state UP group default qlen 1000
    	5: enp134s0f1: [NO-CARRIER,BROADCAST,MULTICAST,UP] mtu 1500 qdisc mq state DOWN group default qlen 1000
    	6: docker0:    [NO-CARRIER,BROADCAST,MULTICAST,UP] mtu 1500 qdisc noqueue state DOWN group default 
    	25: ib0:       [NO-CARRIER,BROADCAST,MULTICAST,UP] mtu 4092 qdisc mq state DOWN group default qlen 256
    
    ibdev2netdev  # from mlnx-ofa_kernel-4.7-OFED.4.7.1.0.0.1.g1c4bf42.rhel7u7.x86_64.rpm
    	mlx5_0 port 1 ... ib0 (Down)
    
    
    
    Change to Ethernet mode for dual port card:
    
    eg# mlxconfig -d /dev/mst/mt4119_pciconf0 set LINK_TYPE_P1=2 LINK_TYPE_P2=2
    
    mlxconfig --dev /dev/mst/mt4119_pciconf0 set LINK_TYPE_P1=2
    
    Device #1:
    ----------
    
    Device type:    ConnectX5       
    Name:           N/A             
    Description:    N/A             
    Device:         /dev/mst/mt4119_pciconf0
    
    Configurations:                              Next Boot       New
             LINK_TYPE_P1                        IB(1)           ETH(2)          
    
    
    a reboot is needed for the change to take effect.
    
    
    ~~~~
    https://www.servethehome.com/changing-mellanox-connectx-vpi-ports-ethernet-infiniband-linux/
    
    VPI mode automagically does Eth and IB, no need to change its mode?
    Some cards can only do IB or only Eth, and can't be changed?
    
    **no, likely still need to flip it to eth so that the OS assigns an ethernet NIC name to it. **
    
    
    # ip a | egrep UP\|inet
    2: enp1s0f0: [BROADCAST,MULTICAST,UP,LOWER_UP] mtu 1500 qdisc mq state UP group default qlen 1000
        inet 128.3.7.58/24 brd 128.3.7.255 scope global noprefixroute enp1s0f0
    3: enp1s0f1: [NO-CARRIER,BROADCAST,MULTICAST,UP] mtu 1500 qdisc mq state DOWN group default qlen 1000
    4: ib0: [BROADCAST,MULTICAST,UP,LOWER_UP] mtu 2044 qdisc mq state UP group default qlen 256
        inet 192.168.207.58/24 brd 192.168.207.255 scope global noprefixroute ib0
        inet6 fe80::e61a:4fc3:cba3:72c9/64 scope link noprefixroute 
    
    enp1s0f? are likely the onboard RJ45 ports.
    
    
    

    MLNX-OS

    MLNX-OS CLI change password https://docs.nvidia.com/networking/display/MLNXOSv381206/User+Management+and+Security+Commands
    show whoami
    
    show usernames		# show configured users.  defaults are admin monitor xmladmin xmluser, but most are disabled; need to set a password to enable
    			# Acc Status will indicate if a password is set
    
    enable
    config term
    
    username bofh  capability admin # create user, not usable till a password is set
    username bofh  password 0       # 0 = use clear text password, will prompt for it
    # or use
    username bofh  password 7     	# 7 = encrypted text (which alg?)
    
    username admin   password 0     # change password for user admin
    username monitor password 0     # def pw is monitor for this acc
    
    username admin nopassword   # erase password?
    
    
    saving config
    see: https://docs.nvidia.com/networking/display/MLNXOSv381206/Configuration+Management
    enable
    config term
    
    show configuration files	# default one is called "initial"
    
    configuration write    		# save to default? current? config file.  disable zero touch config ZTC thing on first write.  Think Cisco copy run start
    
    configuration write to NewCfFilename		# create/save to a new config file and switch to it as active config file (used at boot?)
    configuration write to NewCfFilename no-switch	# save to a new config file , but don't make it the active config (ie good for creating backup)
    
    configuration switch-to NewCfFilename		# load a new config file
    
    show running
    
    
    reset factory keep-basic			# restore to factory default
    
    
    Set management interface IP
    enable
    config terminal
    
    show interfaces mgmt0
    
    # default is dhcp, manually disable it 
    no interface mgmt0 dhcp
    
    # setting static ip would disable dhcp 
    interface    mgmt0 ip address 10.6.1.10 255.255.0.0
    
    


    [Doc URL: http://tin6150.github.io/psg/]
    [Doc URL: http://tin6150.gitlab.io/psg/]
    Last Updated: 2020-10-27
    (cc) Tin Ho. See main page for copyright info.

