warewulf

Bare-metal provisioning tool, common in HPC environments. It PXE boots machines, most often to run them as stateless nodes.


Ref

Good intro/overview
https://warewulf.readthedocs.io/en/latest/about/architecture.html

google group eg:
https://groups.google.com/a/lbl.gov/g/warewulf/c/Yx9OMqkyJKU

Sys Admin magazine articles:
Warewulf Cluster Manager
Warewulf Cluster Manager - Part 2

warewulf boot process

  1. Hardware boots; BIOS (or UEFI) invokes network PXE boot.
  2. The node broadcasts its MAC; DHCPD on the warewulf server responds with an IP. For this piece to work, the MAC address must be defined in wwsh node set.
  3. PXE fetches a linux image via TFTP, using the MAC address as identifier.
  4. The fetched linux image is provided by warewulf. The OS does a "swap" and pivots to this new image.
  5. It scans hardware and looks for network devices. The NIC driver has to be baked into the warewulf image fetched via TFTP, else it can't find a driver and things won't work. If there is no workable NIC driver at all, the warewulf boot process will reset.
  6. If there is a NIC driver (and it may be for a secondary interface that one isn't expecting), wwulf assigns that NIC eth0 and contacts the warewulf server via http, asking for an image with the MAC address of this NIC.
  7. At this stage, the IP and NIC config are the ones in the wwsh provision ifcfg-eth0. Tagged VLAN is probably not supported here.
  8. If there is an image matching the MAC address, it is served over http, and a slew of packages is then also downloaded via http.
  9. Linux boots and eventually the OS is ready: ssh login, maybe a login prompt on the tty :)

warewulf backup

warewulf has a MySQL db; running a daily db dump is recommended.
The db stores the node and provision config.
MySQL/MariaDB config is stored in /etc/warewulf/database*conf


mysqldump --add-drop-database --opt $DB 
Where $DB is one of:
mysql
warewulf

/var/chroot contents for the vnfs are NOT in the db, so a separate store for them is necessary
(eg, the vnfs build server; the pullvnfs script can presumably be run again to repopulate this).
(alt: tar up the existing chroot, transfer the file and untar)
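A minimal sketch of the daily dump, assuming bzip2 compression and the mysql_backup_DB_MMDD.sql.bz2 file naming that the restore commands in these notes expect. BACKUP_DIR and the function names are hypothetical, and MySQL credentials are assumed to come from root's ~/.my.cnf.

```shell
#!/bin/sh
# Sketch of a daily Warewulf DB dump. BACKUP_DIR and function names are
# illustrative, not part of Warewulf.
BACKUP_DIR="${BACKUP_DIR:-/var/backups/warewulf}"   # hypothetical location

# backup_name DB [MMDD]: file name in the form the restore commands expect
backup_name() {
    db="$1"
    stamp="${2:-$(date +%m%d)}"
    printf 'mysql_backup_%s_%s.sql.bz2\n' "$db" "$stamp"
}

# dump_db DB: dump one database and compress it.
# Note: a failed mysqldump is masked by the pipe, so check the file afterwards.
dump_db() {
    db="$1"
    out="$BACKUP_DIR/$(backup_name "$db")"
    mkdir -p "$BACKUP_DIR" || return 1
    mysqldump --add-drop-database --opt "$db" | bzip2 -c > "$out"
}

# dump_all: dump both databases named in these notes
dump_all() {
    for name in mysql warewulf; do
        dump_db "$name" || return 1
    done
}
```

Calling dump_all from a root cron job once a day would cover the db side; the chroot still needs its own copy.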
MySQL DB Space usage warning
VNFS images (built by pullvnfs.sh) DO get stored in MySQL. To keep the DB footprint from bloating, avoid putting versions in names: keep a test image and, once tested, copy it back to the main vnfs name. MySQL does not erase or free up space, so each new VNFS name adds to the DB footprint, while re-using an existing name overwrites it and keeps the DB the same size.
db restore
Restore command (run as root):

bunzip2 -c mysql_backup_mysql_MMDD.sql.bz2    | mysql mysql
bunzip2 -c mysql_backup_warewulf_MMDD.sql.bz2 | mysql warewulf


A few commands to check the restored databases:

mysql -e "describe host" mysql
mysql -e "describe datastore" warewulf
mysql -e "SELECT COUNT(*) FROM user" mysql
mysql -e "SELECT COUNT(*) FROM datastore" warewulf
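The restore and check steps can be wrapped roughly as below. This is a sketch assuming the backup files follow the mysql_backup_DB_MMDD.sql.bz2 pattern shown above; the function names are hypothetical.

```shell
#!/bin/sh
# db_from_backup FILE: derive the database name from mysql_backup_DB_MMDD.sql.bz2
db_from_backup() {
    base="${1##*/}"                     # strip any directory part
    db="${base#mysql_backup_}"          # strip the fixed prefix
    printf '%s\n' "${db%_*.sql.bz2}"    # strip the _MMDD.sql.bz2 suffix
}

# restore_db FILE: load one compressed dump into its database (run as root)
restore_db() {
    bunzip2 -c "$1" | mysql "$(db_from_backup "$1")"
}

# verify_restore: the row-count checks from these notes, stopping at the first failure
verify_restore() {
    mysql -e "SELECT COUNT(*) FROM user" mysql &&
    mysql -e "SELECT COUNT(*) FROM datastore" warewulf
}
```

eg: restore_db mysql_backup_warewulf_MMDD.sql.bz2 && verify_restore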

If the wwroot and wwuser somehow disappear from the warewulf database, wwinit DATABASE could fix this. The command is somewhat of a misnomer: it will NOT wipe the database to a clean state, so all previously defined objects remain intact. It really just recreates the users and credentials so that wwsh commands can interact with MySQL/MariaDB.

network troubleshooting


If nodes are not PXE booting:

Jul  9 10:40:50 wwulf dhcpd[3849]: DHCPDISCOVER from ff:7b:25:fa:4f:7e via enp0s3: network 10.15.4.0/24: no free leases
Jul  9 10:40:54 wwulf dhcpd[3849]: DHCPDISCOVER from ff:7b:25:fa:4f:7e via enp0s3: network 10.15.4.0/24: no free leases

then try
wwsh -v dhcp update
wwsh dhcp restart
wwsh pxe  update

These commands write to /etc/dhcp/dhcpd.conf
If the generated config has:

   # Evaluating Warewulf node: c0000 (DB ID:887)
   # Skipping c0000: Not on boot network (10.15.14.0)

then:
check /etc/warewulf/provision.conf
The NIC device defined there needs to be on a network that matches the IP range defined for the wwulf nodes.

Also see:
https://groups.google.com/a/lbl.gov/g/warewulf/c/ERzlDkDw2tY/m/rg-fvAoOacQJ
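Two small helpers for the checks above, as a sketch. They only parse the log and config line formats shown in these notes; the function names are hypothetical, and the syslog location (eg /var/log/messages) is an assumption that varies by distro.

```shell
#!/bin/sh
# no_lease_macs LOGFILE: unique MACs that dhcpd answered with "no free leases"
no_lease_macs() {
    grep 'DHCPDISCOVER from .*no free leases' "$1" |
        sed 's/.*DHCPDISCOVER from \([0-9A-Fa-f:]*\) via.*/\1/' |
        sort -u
}

# skipped_nodes CONF: print "node network" for every node that the
# generated dhcpd.conf marks as skipped
skipped_nodes() {
    sed -n 's/^[[:space:]]*# Skipping \([^:]*\): Not on boot network (\(.*\))$/\1 \2/p' "$1"
}
```

eg: no_lease_macs /var/log/messages ; skipped_nodes /etc/dhcp/dhcpd.conf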

wwsh bootstrap rebuild
	# recreate files needed by pxe boot, stored in /srv/warewulf/bootstrap


Manual bootstrap build cmd? 
wwbootstrap --config=/etc/warewulf/bootstrap-sl7.conf --chroot=/var/chroots/sl7-nvidia --name 3.10.0-1127.13.1.el7.x86_64-nvnew 3.10.0-1127.13.1.el7.x86_64

Notes/Ref
  • VirtualBox iPXE ROM does not support bzImage, which is required by some OSes like CoreOS (and CentOS?) because the initramfs is a cpio.bz image. This generates ipxe err 23008001, see https://github.com/coreos/tectonic-installer/issues/932
  • Have VBox use EFI instead of BIOS mode, see https://www.makeuseof.com/set-up-efi-linux-virtual-machine-virtualbox/
  • virt manager would be closer to hardware, but is a time sink to set up;
    a proxmox vm should also work fine.

    File object

    
    
    wwsh file list | grep apptainer
    apptainer.conf          :  rwxr-xr-x 1   root root            13094 /etc/apptainer/apptainer.conf
    apptainer.conf          :  rw-rw-r-- 1   root root            13094 /etc/apptainer/apptainer.conf
    
    # duplicate file names are allowed.
    # the object id is then needed to manipulate them
    
    wwsh file print apptainer.conf
    #### apptainer.conf ###########################################################
    apptainer.conf  : ID               = 839
    apptainer.conf  : NAME             = apptainer.conf
    apptainer.conf  : PATH             = /etc/apptainer/apptainer.conf
    apptainer.conf  : ORIGIN           = /etc/warewulf/files/apptainer.conf
    apptainer.conf  : FORMAT           = UNDEF
    apptainer.conf  : CHECKSUM         = 4d5e790be2ab6bebfa7cb75f19d902ea
    apptainer.conf  : INTERPRETER      = UNDEF
    apptainer.conf  : SIZE             = 13094
    apptainer.conf  : MODE             = 0755
    apptainer.conf  : UID              = 0
    apptainer.conf  : GID              = 0
    #### apptainer.conf ###########################################################
    apptainer.conf  : ID               = 1599
    apptainer.conf  : NAME             = apptainer.conf
    apptainer.conf  : PATH             = /etc/apptainer/apptainer.conf
    apptainer.conf  : ORIGIN           = /etc/warewulf/files/apptainer.conf
    apptainer.conf  : FORMAT           = UNDEF
    apptainer.conf  : CHECKSUM         = 4d5e790be2ab6bebfa7cb75f19d902ea
    apptainer.conf  : INTERPRETER      = UNDEF
    apptainer.conf  : SIZE             = 13094
    apptainer.conf  : MODE             = 0664
    apptainer.conf  : UID              = 0
    apptainer.conf  : GID              = 0
    
    
    
    
    wwsh file print  --lookup=id 839
    wwsh file delete --lookup=id 839
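To pick the right duplicate, the ID and MODE of each object can be pulled out of the print output. A sketch, assuming the "name : FIELD = value" layout shown above and file names without spaces; file_ids is a hypothetical helper name.

```shell
#!/bin/sh
# file_ids: read `wwsh file print NAME` output on stdin and print one
# "ID MODE" line per file object
file_ids() {
    awk '
        $3 == "ID"   { id = $5 }        # remember the ID of the current object
        $3 == "MODE" { print id, $5 }   # MODE follows ID within each object
    '
}
```

eg: wwsh file print apptainer.conf | file_ids
then delete the unwanted one with wwsh file delete --lookup=id ID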
    
    

    Files on boot server

    ensure these files are present. UEFI may need additional config.
    
    /var/lib/tftpboot/warewulf/
    /var/lib/tftpboot/warewulf/ipxe
    /var/lib/tftpboot/warewulf/ipxe/bin-i386-pcbios
    /var/lib/tftpboot/warewulf/ipxe/bin-i386-pcbios/undionly.kpxe
    /var/lib/tftpboot/warewulf/ipxe/bin-i386-pcbios/undionly.kpxe-old
    /var/lib/tftpboot/warewulf/ipxe/bin-x86_64-efi
    /var/lib/tftpboot/warewulf/ipxe/bin-x86_64-efi/snp.efi
    /var/lib/tftpboot/warewulf/ipxe/bin-i386-efi
    /var/lib/tftpboot/warewulf/ipxe/bin-i386-efi/snp.efi
    
    
    /srv/warewulf/initramfs/x86_64/capabilities/provision-adhoc
    /srv/warewulf/initramfs/x86_64/capabilities/provision-files
    /srv/warewulf/initramfs/x86_64/capabilities/provision-selinux
    /srv/warewulf/initramfs/x86_64/capabilities/provision-vnfs
    /srv/warewulf/initramfs/x86_64/capabilities/setup-filesystems
    /srv/warewulf/initramfs/x86_64/capabilities/transport-http
    /srv/warewulf/initramfs/x86_64/capabilities/setup-ipmi
    
    # wwsh bootstrap rebuild # should generate a series of files like:
    /srv/warewulf/bootstrap/x86_64/6/kernel
    /srv/warewulf/bootstrap/x86_64/6/cookie
    /srv/warewulf/bootstrap/x86_64/6/initfs.gz
    
    /srv/warewulf/ipxe/cfg/ac:1f:6b:a5:9c:f6
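A quick presence check for some of the files listed above, as a sketch. The list is only a subset picked for illustration and should be extended as needed; check_boot_files and its optional ROOT prefix (handy for testing against a copy) are hypothetical.

```shell
#!/bin/sh
# check_boot_files [ROOT]: report which of the expected boot files are missing.
# Returns non-zero if any file is missing.
check_boot_files() {
    root="${1:-}"
    missing=0
    for f in \
        /var/lib/tftpboot/warewulf/ipxe/bin-i386-pcbios/undionly.kpxe \
        /var/lib/tftpboot/warewulf/ipxe/bin-x86_64-efi/snp.efi \
        /srv/warewulf/initramfs/x86_64/capabilities/provision-vnfs \
        /srv/warewulf/initramfs/x86_64/capabilities/transport-http
    do
        if [ ! -e "$root$f" ]; then
            echo "MISSING: $f"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}
```

eg: check_boot_files || echo "rebuild bootstrap / reinstall ipxe files"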
    
    

    /var/chroot

    eg add link in form of /var/lib/docker -> /local/docker
    cd /var/chroot/VNFSNAME/var/lib
    ln -s /local/docker docker
    
    then the vnfs needs to be rebuilt?
    wwvnfs --chroot /var/chroots/VNFSNAME
    ?
    there is also this --hybridpath=/vnfs option?
    
    ref: 
    https://www.admin-magazine.com/HPC/Articles/warewulf_cluster_manager_completing_the_environment
    https://warewulf3.readthedocs.io/en/latest/subprojects_components_plugins/vnfs/
    
    pullvnfs does this, without the hybrid option
    wwvnfs -y --chroot=$CHROOT_BASE/$DIST-$BRANCH $DIST-$BRANCH
    
    

    special object config

    custom object
    
    
    wwsh object modify -s LASTOCTET=5   s00
    wwsh object modify -s LASTOCTET=100 n00
    wwsh object modify -s LASTOCTET=101 n01
    
    wwsh object print -p :all | egrep 'node|name|LASTOCTET'
    
    wwsh --debug file sync # debug mode prints a lot of perl code state; eg grep for ERROR or WARNING
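For more than a handful of nodes, the modify commands can be generated from a list. A sketch only: lastoctet_cmds is a hypothetical helper, and the 0-255 range check assumes LASTOCTET is the last octet of an IPv4 address, which these notes imply but do not state.

```shell
#!/bin/sh
# lastoctet_cmds: read "node octet" pairs on stdin and print the matching
# wwsh commands (printed, not executed, so they can be reviewed first)
lastoctet_cmds() {
    while read -r node octet; do
        [ -n "$node" ] || continue      # skip blank lines
        case "$octet" in
            ''|*[!0-9]*) echo "bad octet for $node: $octet" >&2; return 1 ;;
        esac
        if [ "$octet" -gt 255 ]; then   # assumption: IPv4 last octet
            echo "octet out of range for $node: $octet" >&2
            return 1
        fi
        printf 'wwsh object modify -s LASTOCTET=%s %s\n' "$octet" "$node"
    done
}
```

eg: printf 's00 5\nn00 100\nn01 101\n' | lastoctet_cmds
Pipe the output to sh once it looks right.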
    

    config troubleshooting

    wwsh file sync barfs with something like:
    WARNING in Warewulf::DataStore::SQL::BaseClass->persist()/861:  Unable to execute set lookup query: Data too long for column 'value' at row 1
    
    or
    
    wwsh file import  /etc/warewulf/files/node_exporter.service --mode=0644 --path=/etc/systemd/system/multi-user.target.wants/node_Exporter.service
    wwsh file set node_exporter.service --path=/etc/systemd/system/multi-user.target.wants/node_Exporter.service
    
    WARNING:  Unable to execute set lookup query: Data too long for column 'value' at row 1
    
    This is because the destination PATH is limited in length: the 65-char path above is already too long.  Use a shorter PATH :-\
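A pre-flight length check before wwsh file import / file set, as a sketch. The rejected path in the example above is itself 65 characters, so the limit is assumed to be 64 here; that number is inferred from these notes, not checked against the Warewulf schema. path_ok is a hypothetical helper.

```shell
#!/bin/sh
# Assumed limit: the 65-char example path in these notes is rejected,
# so 64 is taken as the maximum. Adjust if your datastore schema differs.
MAX_PATH_LEN=64

# path_ok PATH: succeed only if PATH is short enough to be stored
path_ok() {
    if [ "${#1}" -gt "$MAX_PATH_LEN" ]; then
        echo "path too long (${#1} > $MAX_PATH_LEN): $1" >&2
        return 1
    fi
}
```

eg: path_ok /etc/systemd/system/node_exporter.service && wwsh file set ...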
    
    

    tftp troubleshooting

    tftp (and dhcpd) need to be running on the warewulf server. see general_unix.html#tftp for troubleshooting info.

    PXE vs warewulf boot

    The first part of the boot is the network PXE boot, where the bootstrap image is transferred and the machine boots from it. Stage 2 is a swap to warewulf, which utilizes the VNFS: the OS is hybridized between RAM and the /var/chroots NFS mount.

    kargs

    
    --kargs=\'$KARGS\'
    with KARGS='"console=tty0 console=ttyS1,115200n8"'
    In wwsh provision print, the kargs section needs to have "" around the value if there is a space in it.
    really old wwulf can't handle spaces even with the quotes.
    
    "console=tty0 console=ttyS1,115200n8"  # this may be bad: VGA will not get the rest of the boot messages, and ipmi sol activate should obviate the need for this.
    
    acpi_irq_nobalance
    
    "net.ifnames=0 biosdevname=0"
    net.ifnames=0  # disable systemd predictable interface names; use kernel names (eth0, eth1)
    biosdevname=0  # set to 0 to use traditional names instead of the biosdevname ("Dell method") names
    # ie, use the old eth0 naming convention
    # still in use by w0000 as of 2023.02.13
    
    
    iommu=pt
    # this is needed for Mellanox CX-5, CX-6 Ethernet to work with AMD processor due to memory management
    # https://nvcrm.lightning.force.com/lightning/r/Knowledge__kav/ka08Z000000Tm5GQAS/view?ws=%2Flightning%2Fr%2FCase%2F5008Z00002Fn4VSQAZ%2Fview
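The quoting rule above (wrap in "" when there is a space) can be sketched as a small helper that builds the KARGS value. quote_kargs is a hypothetical name; the result is meant to be passed through the --kargs option shown above.

```shell
#!/bin/sh
# quote_kargs ARG...: join kernel args with spaces and wrap the result in
# double quotes only when it contains a space
quote_kargs() {
    case "$*" in
        *' '*) printf '"%s"\n' "$*" ;;
        *)     printf '%s\n' "$*" ;;
    esac
}
```

eg: KARGS=$(quote_kargs net.ifnames=0 biosdevname=0 iommu=pt)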
    
    

    TBD int ref

    
    

    httpd troubleshooting

    Once PXE boot completes via tftp, the image invokes HTTP GET to fetch files from the warewulf server. These files are expected to be stored in: /srv/warewulf
    /var/log/httpd/error_log could tell of potential problems, eg fetching a file for a MAC address that isn't expected.
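When error_log points at an unexpected MAC, one thing to check is whether a per-MAC iPXE config exists for it. A sketch, assuming the /srv/warewulf/ipxe/cfg/MAC layout listed earlier in these notes with lower-case, colon-separated names; the function names and the WW_IPXE_CFG override are hypothetical.

```shell
#!/bin/sh
# Directory holding one iPXE config file per MAC (path taken from these notes)
WW_IPXE_CFG="${WW_IPXE_CFG:-/srv/warewulf/ipxe/cfg}"

# normalize_mac MAC: lower-case the hex digits and turn dashes into colons
normalize_mac() {
    printf '%s\n' "$1" | tr 'A-F-' 'a-f:'
}

# has_ipxe_cfg MAC: succeed if a config file exists for this MAC
has_ipxe_cfg() {
    [ -f "$WW_IPXE_CFG/$(normalize_mac "$1")" ]
}
```

eg: has_ipxe_cfg AC:1F:6B:A5:9C:F6 || echo "no ipxe cfg, try wwsh pxe update"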







    [Doc URL: http://tin6150.github.io/psg/warewulf.html ]
    Last Updated: 2021-07-20
    (cc) Tin Ho. See main page for copyright info.

