Redstorm:Guide

From OSR

Building Cray SVN Tree

Checkout

All of this happens on rsc-c3. The checkout goes over an SSH tunnel to a Squid server on gold.us.cray.com, which then contacts svn.us.cray.com. Believe it or not, this is faster than doing a local checkout at Cray and then copying the entire tree to Sandia.

  • Update your ~/.subversion/servers:
               [groups]
               cray = *.cray.com
               ### Proxy to hosts inside of cray
               [cray]
               http-proxy-host = rsc-c3
               http-proxy-port = 3128
  • Do the checkout. Expect it to take anywhere from one to six hours.

You will need to use your Craypark username and password to authenticate. After the first checkout, Subversion stores the credentials in its cache and no longer asks for a password.

               svn checkout --username n12220 https://svn.us.cray.com/svn/xt/externals/full-nightly

If you want to check out a specific release, you can do it like this:

               svn co https://svn.us.cray.com/svn/xt/externals/nightly-build/tags/REL_1_4_22


Patches

  • Patch linux/initrd/roots/linuxrc-unified:
               Index: roots/linuxrc-unified
               ===================================================================
               --- roots/linuxrc-unified       (revision 1936)
               +++ roots/linuxrc-unified       (working copy)
               @@ -199,8 +199,11 @@
                
                make_devs()
                {
               -    mknod /dev/zero c 1 3
               -    mknod /dev/null c 1 5
               +    mknod /dev/zero c 1 5
               +    mknod /dev/null c 1 3
               +    mknod /dev/random c 1 8
               +    mknod /dev/tty c 5 0
               +    mknod /dev/ptmx c 5 2
                    mknod -m 666 /dev/ukbridge0 c 62 0
                    mknod -m 666 /dev/ukbridge1 c 62 1
                    mknod -m 600 /dev/console c 5 1
               @@ -437,6 +439,9 @@
                   echo "Initrd: Mounting NID specific /etc..."
                   if [ -n "${NID}" -a  -d /.shared/node/${NID}/etc ];then
                     mount -n --bind /.shared/node/${NID}/etc /etc
               +   elif [ "$harness" = 1 ]; then
               +     echo "WARNING: did not find NID specific /etc. Using default."
               +     mount -n --bind /.shared/default/etc /etc
                   else
                     echo "ERROR: did not find NID specific /etc for NID=${NID}."
                     echo "Create a specialized view for this node. See xtspec(8) and xtcloneshared(8)."
               @@ -453,6 +458,7 @@
                make_devs
                mkfifo -m 600 /dev/initctl
                mkfifo -m 640 /dev/blog
               +mount -n -t tmpfs tmpfs /tmp
                mkdir /dev/pts
                ln -sf /proc/self/fd /dev/fd
                echo "Initrd: done."
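
The first hunk swaps the minor numbers for /dev/zero and /dev/null. On a standard Linux system /dev/null is character device 1,3 and /dev/zero is 1,5, which is easy to confirm on any running box before trusting the patch. The sketch below assumes a `stat` that supports `-c` (GNU coreutils or busybox); `dev_numbers` is a made-up helper name, not something from the Cray tree.

```shell
# dev_numbers PATH: print "major minor" in decimal for a device node.
# stat reports %t (major) and %T (minor) in hex, so convert them.
dev_numbers() {
    hex=$(stat -c '%t %T' "$1") || return 1
    major=${hex% *}     # text before the space
    minor=${hex#* }     # text after the space
    echo "$((0x$major)) $((0x$minor))"
}

# Example (run by hand on a Linux machine):
#   dev_numbers /dev/null
#   dev_numbers /dev/zero
```

The same check works for the other nodes the patch adds (/dev/random, /dev/tty, /dev/ptmx) if you want to compare them against a known-good system.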


Build / Post install

  • Do a full `make`. This step might also take a few hours, depending on what else is happening on rsc-c3. It has the slowest disks of any Opteron machine out there.
  • Copy the old harness binary into `install/bin/snos32/harness`. You can copy `rsc-c3:~tbhudso/src/harness`, although there may be a way to build it from the source tree.
  • Update `install/linux/bootable/parameters` for our system. For mine, I add `harness=1` and set `bootpath=/rr/current_2.6`, `bootproto=dhcp` and `bootnodeip=192.168.1.1`. There may be a way to modify `linux/initrd/parameters.in` to do this automatically, but I haven't figured it out yet. You will need to redo this after every build of the `linux/initrd` tree. Here is my full file:
               earlyprintk=rcal0
               load_ramdisk=1
               ramdisk_size=80000
               acpi=off
               console=ttyL0
               bootnodeip=192.168.1.1
               bootproto=dhcp
               bootpath=/rr/current_2.6
               rootfs=nfs-shared
               pci=lastbus=3
               idle=poll
               harness=1
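
Since the parameters file has to be fixed up after every rebuild of `linux/initrd`, a small helper can reapply the edits. This is only a sketch: `set_param` is a hypothetical function, not part of the Cray tree, and it assumes the file consists of plain `key=value` lines like the listing above.

```shell
# set_param FILE KEY VALUE
# Replace an existing KEY=... line in FILE, or append KEY=VALUE if the
# key is not present. The file is rewritten in place via a temp copy so
# its permissions are left alone.
set_param() {
    file=$1
    key=$2
    value=$3
    tmp=$(mktemp) || return 1
    awk -v k="$key" -v v="$value" '
        BEGIN { FS = "="; found = 0 }
        $1 == k { print k "=" v; found = 1; next }
        { print }
        END { if (!found) print k "=" v }
    ' "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example (run from the top of the build tree):
#   P=install/linux/bootable/parameters
#   set_param "$P" harness 1
#   set_param "$P" bootpath /rr/current_2.6
#   set_param "$P" bootproto dhcp
#   set_param "$P" bootnodeip 192.168.1.1
```

Splitting on the first `=` means a line such as `pci=lastbus=3` is treated as key `pci`, which matches how the file above is laid out.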


Incremental builds

  • portals.ko:
       make -C linux/portals
  • rcad.ko: (can't use -C since it calls pwd)
       make -C ~/src/full-nightly/linux/rca

The harder way:

       cd RSMS/RCA/rcad
       make KERNDIR=~/src/full-nightly/linux/kernel.2.6-ss-lustre26 modules
  • Firmware:
       make -C hardware/firmware_c clean all install
  • Linux kernel (not sure about this one):
       make -C linux/kernel.2.6-ss-lustre26
  • Catamount QK for Seastar:
       make -C catamount SEASTAR=1

Redstorm:Harness setup

Booting Harness Nodes

These are also done on rscsmw2 and should be done for every reboot. Try to minimize the number of full power cycles on the cage. Some of the nodes don't seem to come up reliably, so we don't power cycle unless things really get wedged. These steps should be sufficient to get the nodes back into a sane state most of the time. When they aren't, you will need to do the full reboot as described above.

Whole cage boots

  • Init the entire cage:
               rs_init cx24y2c0
  • Boot all the qk nodes. The new `coldstart` is very verbose and will generate reams of noise that you probably don't care about.
               rs_boot --qk cx24y2c0
  • Boot the linux node. Unlike the `--qk` boot, this will give you the console for the Linux node as it boots. Once you see the final startup messages (`Starting SSHD`) it is safe to hit ^C to kill the boot process. If you don't kill it, eventually it will die on its own, but there will be an orphan `bootlinux` process out there that will continue to write the console to this terminal.
               rs_boot --l n0.cx24y2
  • Log in to the service node:
               ssh root@rsclogin108

Single node boots

If you crash a node, it is possible to reboot just that one module and have it rejoin the mesh. Each module has four nodes on it, so re-initing the node will take the other three down as well. If you are just doing single-node tests, it is not necessary to bring them all back up.

               rs_init n4.cx24y2
               rs_boot --qk n{4,5,6,7}.cx24y2
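
To work out which nodes go down together, round the node number down to a multiple of four. The sketch below assumes nodes are numbered consecutively four to a module, as in the n{4,5,6,7} example; `module_nodes` is a hypothetical helper name.

```shell
# module_nodes NODE: print the four node numbers on the same module
# as NODE (the node itself plus the three that go down with it).
module_nodes() {
    base=$(( $1 / 4 * 4 ))   # integer division rounds down to the module start
    echo "$base $((base + 1)) $((base + 2)) $((base + 3))"
}

# Example: list the node names to hand to rs_boot after crashing node 5.
#   for n in $(module_nodes 5); do echo "n$n.cx24y2"; done
```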

Appendix

Using JTAG interface

  • rsh to the L0 for the node (slot = int(node/4), module = int(node%4)):
   rsh -l root cx24y2c0s1
  • Use hdt to dump the registers from module 0
   [Compute(24,2,0.1)]# hdt -n 0 -r 0:10
   00 0x0000000000000011 # rax
   01 0x0000000011000312 # rcx
   02 0x0000000000000cfc # rdx
   03 0x000000000000001c # rbx
   04 0x0000000000000000 # rsp
   05 0x0000000000000000 # rbp
   06 0x0000000000ff8003 # rsi
   07 0x0000000000000000 # rdi
   08 0x0000000000000007 # eflags
   09 0x000000000000f3e6 # rip
  • Use hdt to dump physical memory
   hdt -n 0 -d 0x100000:1024
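
The slot and module arithmetic from the first step can be sketched in shell. The function names are made up for illustration, and `l0_host` assumes every L0 in this cage follows the `cx24y2c0s<slot>` pattern of the one hostname shown above, which may not hold elsewhere.

```shell
# node_slot NODE: slot that holds NODE (integer division by 4).
node_slot() { echo $(( $1 / 4 )); }

# node_module NODE: index of NODE within its slot (remainder mod 4);
# this is the value passed to hdt -n in the steps above.
node_module() { echo $(( $1 % 4 )); }

# l0_host NODE: guess the L0 hostname for NODE.
# ASSUMPTION: the cx24y2c0s<slot> naming is inferred from a single example.
l0_host() { echo "cx24y2c0s$(node_slot "$1")"; }

# Example: for node 4 this gives slot 1, module 0, host cx24y2c0s1.
#   rsh -l root "$(l0_host 4)"
```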

Rebooting rsc-c3

rsc-c3 is on sarpc02. Log in as crayadm (password on request):

  ssh -l crayadm sarpc02

At the prompt hit "5" for the RPC menu:

   DS-Series - F  2.05.03     (C) 2003 Bay Technical Associates
   Module Name: DS62


   Module: 1  
   Attention Character:  ;  
   Device A         (2 ,1).........1  
   Device B         (2 ,2).........2  
   Device C         (2 ,3).........3  
   Device D         (2 ,4).........4  
   DS-RPC           (5 ,1).........5  
   Logout..........................T

 Enter Request :5

It will print some details about all the ports. Tell it to reboot port 2 (rsc-c3) (1==rsc-c2, 2==rsc-c3, 3==rscsmw2):

  DS-RPC>reboot 2
  Reboot Outlet  2        (Y/N)? y

Hit ; six times to return to the main screen:

  ;;;;;;

Hit 2 to watch the console or T to log out.