Redstorm:Guide
Building Cray SVN Tree
Checkout
All of this happens on rsc-c3. The checkout runs over an SSH tunnel to a Squid proxy on gold.us.cray.com, which then contacts svn.us.cray.com. Believe it or not, this is faster than doing a local checkout at Cray and then copying the entire tree to Sandia.
- Update your ~/.subversion/servers:
[groups]
cray = *.cray.com
### Proxy to hosts inside of cray
[cray]
http-proxy-host = rsc-c3
http-proxy-port = 3128
- Do the checkout. Expect it to take anywhere from one to six hours.
You will need your Craypark username and password to authenticate. After the first checkout, Subversion caches the credentials and no longer asks for a password.
svn checkout --username n12220 https://svn.us.cray.com/svn/xt/externals/full-nightly
If you want to check out a specific release, you can do it like this:
svn co https://svn.us.cray.com/svn/xt/externals/nightly-build/tags/REL_1_4_22
Patches
- Patch linux/initrd/roots/linuxrc-unified:
Index: roots/linuxrc-unified
===================================================================
--- roots/linuxrc-unified (revision 1936)
+++ roots/linuxrc-unified (working copy)
@@ -199,8 +199,11 @@
make_devs()
{
- mknod /dev/zero c 1 3
- mknod /dev/null c 1 5
+ mknod /dev/zero c 1 5
+ mknod /dev/null c 1 3
+ mknod /dev/random c 1 8
+ mknod /dev/tty c 5 0
+ mknod /dev/ptmx c 5 2
mknod -m 666 /dev/ukbridge0 c 62 0
mknod -m 666 /dev/ukbridge1 c 62 1
mknod -m 600 /dev/console c 5 1
@@ -437,6 +439,9 @@
echo "Initrd: Mounting NID specific /etc..."
if [ -n "${NID}" -a -d /.shared/node/${NID}/etc ];then
mount -n --bind /.shared/node/${NID}/etc /etc
+ elif [ "$harness" = 1 ]; then
+ echo "WARNING: did not find NID specific /etc. Using default."
+ mount -n --bind /.shared/default/etc /etc
else
echo "ERROR: did not find NID specific /etc for NID=${NID}."
echo "Create a specialized view for this node. See xtspec(8) and xtcloneshared(8)."
@@ -453,6 +458,7 @@
make_devs
mkfifo -m 600 /dev/initctl
mkfifo -m 640 /dev/blog
+mount -n -t tmpfs tmpfs /tmp
mkdir /dev/pts
ln -sf /proc/self/fd /dev/fd
echo "Initrd: done."
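The first hunk corrects swapped minor numbers: on Linux, /dev/null is character device 1,3 and /dev/zero is 1,5. A minimal sketch for confirming this on any running Linux host is shown here. The helper name `devnum` is hypothetical, and it assumes GNU `stat`, whose `%t` and `%T` formats print the major and minor numbers in hex.

```shell
# Print "major:minor" in decimal for a device node.
# Assumes GNU stat: %t is the major and %T the minor device type, both in hex.
devnum() {
    printf '%d:%d\n' "0x$(stat -c '%t' "$1")" "0x$(stat -c '%T' "$1")"
}

# Example:
#   devnum /dev/null
#   devnum /dev/zero
```

Running the same check on a booted harness node is one way to see whether the patched `make_devs` was actually used.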
Build / Post install
- Do a full `make`. This step might also take a few hours depending on what is happening on rsc-c3. It has the slowest disks of any Opteron machine out there.
- Copy the old harness binary into `install/bin/snos32/harness`. You can copy `rsc-c3:~tbhudso/src/harness`; there may also be a way to build it from the source tree.
- Update `install/linux/bootable/parameters` for our system. For mine, I add `harness=1` and set `bootpath=/rr/current_2.6`, `bootproto=dhcp` and `bootnodeip=192.168.1.1`. There may be a way to modify `linux/initrd/parameters.in` to do this automatically, but I haven't figured it out yet. You will need to redo this after every build of the `linux/initrd` tree. Here is my full file:
earlyprintk=rcal0
load_ramdisk=1
ramdisk_size=80000
acpi=off
console=ttyL0
bootnodeip=192.168.1.1
bootproto=dhcp
bootpath=/rr/current_2.6
rootfs=nfs-shared
pci=lastbus=3
idle=poll
harness=1
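Since the parameters file has to be re-edited after every build of `linux/initrd`, the edits can be scripted. The following is a rough sketch, not something from the source tree: the function names `set_param` and `apply_site_params` are hypothetical, and it assumes the file holds one `key=value` pair per line (as in the listing above) and that GNU `sed -i` is available.

```shell
# Set key=value in a parameters file: replace the line if the key exists,
# otherwise append it. Assumes one key=value pair per line.
set_param() {
    file=$1
    key=$2
    value=$3
    if grep -q "^${key}=" "$file"; then
        # '|' as the delimiter so values such as /rr/current_2.6 need no escaping
        sed -i "s|^${key}=.*|${key}=${value}|" "$file"
    else
        echo "${key}=${value}" >> "$file"
    fi
}

# Apply the site-specific values listed above to the given file.
apply_site_params() {
    set_param "$1" harness 1
    set_param "$1" bootpath /rr/current_2.6
    set_param "$1" bootproto dhcp
    set_param "$1" bootnodeip 192.168.1.1
}

# Example, run from the top of the tree after a build:
#   apply_site_params install/linux/bootable/parameters
```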
Incremental builds
- portals.ko:
make -C linux/portals
- rcad.ko: (can't use -C since it calls pwd)
make -C ~/src/full-nightly/linux/rca
The harder way:
cd RSMS/RCA/rcad
make KERNDIR=~/src/full-nightly/linux/kernel.2.6-ss-lustre26 modules
- Firmware:
make -C hardware/firmware_c clean all install
- Linux kernel (not sure about this one)
make -C linux/kernel.2.6-ss-lustre26
- Catamount QK for Seastar:
make -C catamount SEASTAR=1
Booting Harness Nodes
These are also done on rscsmw2 and should be done for every reboot. You should try to minimize the number of full power cycles that you do on the cage. Some of the nodes don't seem to come up reliably, so we don't power cycle unless things really get wedged. These steps should be sufficient to get the nodes back into a sane state most of the time. Sometimes they don't work, in which case you will need to do the full reboot as described above.
Whole cage boots
- Init the entire cage:
rs_init cx24y2c0
- Boot all the qk nodes. The new `coldstart` is very verbose and will generate reams of noise that you probably don't care about.
rs_boot --qk cx24y2c0
- Boot the Linux node. Unlike the `--qk` boot, this will give you the console for the Linux node as it boots. Once you see the final startup messages (`Starting SSHD`) it is safe to hit ^C to kill the boot process. If you don't kill it, it will eventually die on its own, but an orphan `bootlinux` process will be left behind that continues to write the console to this terminal.
rs_boot --l n0.cx24y2
- Log in to the service node:
ssh root@rsclogin108
Single node boots
If you crash a node, it is possible to reboot that one module and have it rejoin the mesh. Each module has four nodes on it, so re-initing the node will take the other three down as well. If you are just doing single-node tests then it is not necessary to bring them all back up.
rs_init n4.cx24y2
rs_boot --qk n{4,5,6,7}.cx24y2
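To work out which nodes share a module with a crashed node, the arithmetic can be sketched as below. This is an illustration only: the helper name `module_nodes` is hypothetical, and it assumes modules are aligned groups of four consecutive node numbers (as the `n{4,5,6,7}` example suggests) and that node names follow the `n<N>.cx24y2` pattern used above.

```shell
# List the four node names on the same module as the given node number.
# Assumes each module holds nodes 4k .. 4k+3; the cage suffix defaults to cx24y2.
module_nodes() {
    node=$1
    cage=${2:-cx24y2}
    base=$(( node / 4 * 4 ))   # integer division rounds down to the module's first node
    for i in 0 1 2 3; do
        printf 'n%d.%s\n' $(( base + i )) "$cage"
    done
}

# Example: the nodes that go down when node 5's module is re-inited
#   module_nodes 5
```

The first name printed is the one to pass to `rs_init`; the rest are the nodes to bring back with `rs_boot --qk` if you need them.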
Appendix
Using JTAG interface
- rsh to the L0 for the node (slot = int(node/4), module=int(node%4))
rsh -l root cx24y2c0s1
- Use hdt to dump the registers from module 0
[Compute(24,2,0.1)]# hdt -n 0 -r 0:10
00 0x0000000000000011 # rax
01 0x0000000011000312 # rcx
02 0x0000000000000cfc # rdx
03 0x000000000000001c # rbx
04 0x0000000000000000 # rsp
05 0x0000000000000000 # rbp
06 0x0000000000ff8003 # rsi
07 0x0000000000000000 # rdi
08 0x0000000000000007 # eflags
09 0x000000000000f3e6 # rip
- Use hdt to dump physical memory
hdt -n 0 -d 0x100000:1024
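The slot and module arithmetic from the first step can be written out as a small helper. This is a sketch under assumptions: the function names `l0_host` and `node_index` are hypothetical, and the L0 hostname pattern `cx24y2c0s<slot>` is inferred from the single `cx24y2c0s1` example above.

```shell
# Name of the L0 controller for a node: slot = int(node / 4).
# The cx24y2c0 prefix is taken from the example above and may differ per cage.
l0_host() {
    node=$1
    prefix=${2:-cx24y2c0}
    echo "${prefix}s$(( node / 4 ))"
}

# Index of the node within its slot (what the note above calls "module"),
# i.e. the value to pass to hdt -n.
node_index() {
    echo $(( $1 % 4 ))
}

# Example for node 4:
#   rsh -l root "$(l0_host 4)"
#   hdt -n "$(node_index 4)" -r 0:10    # run on the L0
```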
Rebooting rsc-c3
rsc-c3 is on sarpc02. Log in as crayadm (password on request):
ssh -l crayadm sarpc02
At the prompt hit "5" for the RPC menu:
DS-Series - F 2.05.03 (C) 2003 Bay Technical Associates
Module Name: DS62
Module: 1
Attention Character: ;
Device A (2 ,1).........1
Device B (2 ,2).........2
Device C (2 ,3).........3
Device D (2 ,4).........4
DS-RPC (5 ,1).........5
Logout..........................T
Enter Request :5
It will print some details about all the ports. Tell it to reboot port 2 (rsc-c3) (1==rsc-c2, 2==rsc-c3, 3==rscsmw2):
DS-RPC>reboot 2
Reboot Outlet 2 (Y/N)? y
Hit ; six times to return to the main screen:
;;;;;;
Hit 2 to watch the console or T to logout.