Ramblings from University IT... VMware, NetApp, PowerShell, Active Directory, Exchange and Scripting.
Thursday, August 30, 2007
We've recently expanded our student kiosks from 4 to 9 around campus. These are custom kiosks I designed and had our University Fab Shop / Paint Shop build for us. They use a 19" LCD and a standard desktop computer with an add-on wireless card and a WiFi extender antenna to get better reception. The systems are really just set up for quick internet / email / registration info for students.
The kiosks have been very popular; according to our domain stats from last year, the 4 kiosks we had in place averaged 30 unique logins each per hour during normal business hours, higher than any other student stations on campus.
Tuesday, August 28, 2007
NetApp 3050c upgrade of DataOnTap 7.0.5 to DataOnTap 7.2.3
Performing a non-disruptive upgrade of our Network Appliance FAS 3050c (clustered filer configuration)
One of the benefits of having clustered filers (FAS3050c) is that I can, in most cases, perform a system upgrade without disrupting services running on either system. The process is a little complex but well worth the payoff, as in our environment I literally have thousands of students connecting to the storage at a time. Below is a slightly modified version of my notes from the upgrade (use at your own risk). I followed the directions from NetApp's upgrade guide, although I will note that their directions were not exact; I had differing outputs from commands at times, which made me a little nervous. All in all the upgrade went pretty smoothly and the systems have been running solid since.
Downloaded the following from now.netapp.com under Download Software – DataOnTap – FAS 3050c:
- new Shelf Firmware from now.netapp.com (all shelf firmware updates)
- new Disk Firmware from now.netapp.com (all disk firmware updates)
- newest release of Filer Firmware CFE 3.1
- newest GA Release of DataOnTap 7.2.3
- docs for DataOnTap 7.2.3
- Mounted \\filerA\c$
- Mounted \\filerB\c$
- Made a backup of the c$\etc\ folder on both systems (minus log files) to c$\backup\etc_8-24-2007
- Copied the shelf firmware files from the shelf zip to etc\shelf_fw on both filerA and filerB
- Copied the disk firmware files from the disk zip to etc\disk_fw on both filerA and filerB
- Login to the appliance console.
- Check current shelf firmware version ( > sysconfig -v )
- Enter Advanced privileges ( > priv set advanced )
- Start the update ( > storage download shelf )
- This will upgrade the shelf firmware on all the disk shelves in the system. (If you wish to only update the disk shelves attached to a specific adapter, enter storage download shelf adapter_number instead.)
- Accept the update: press y for yes and hit enter.
- To verify the new shelf firmware, ( > sysconfig -v )
- Exit Advanced privileges ( > priv set admin )
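For reference, here is that shelf firmware sequence condensed into a single console sketch (the filerA> prompt is just illustrative):
filerA> sysconfig -v                 (note the current shelf firmware versions)
filerA> priv set advanced
filerA*> storage download shelf      (answer y when prompted to start the update)
filerA*> sysconfig -v                (verify the new shelf firmware)
filerA*> priv set admin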
Disk firmware is automatically updated on reboot if there are updated files in the disk_fw folder. To keep the system from updating too many disks at once, set or verify the following option.
- ( > options raid.background_disk_fw_update.enable )
- If it is set to off, I recommend you change it to on
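A quick sketch of checking and, if needed, enabling that option from the console (filerA> prompt illustrative):
filerA> options raid.background_disk_fw_update.enable        (prints the current setting)
filerA> options raid.background_disk_fw_update.enable on     (enable it if it was off)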
- Downloaded the newest General Deployment Release, in this case it was Data ONTAP 7.2.3.
- Verified our system met all requirements for running the downloaded release; updates were required for disk firmware and shelf firmware (done above)
- Checked known problems and limitations of the new release to see if any would affect our environment. No potential problems found.
- Compared bug fixes from our current version of OnTap 7.0.5 to the new version, 7.2.3. There were many bug fixes that could potentially affect our environment, which made the upgrade worthwhile.
- Downloaded newest documentation for 7.2.3
With C$ mapped on both filers, I ran the downloaded OS install (self-extracting zip files) against the respective \etc directories. This is the first step and copies all the needed files over to the filers. Once that completed, we performed the procedure below from the NOW upgrade guide for Windows clients (a condensed console sketch of the whole takeover/giveback sequence follows the list).
- start the install on both systems ( > download )
- Checked the cluster status ( > cf status ) to make sure cluster failover was enabled
- Had filerB take over services for filerA ( > cf takeover )
- This causes filerA to reboot
- During the reboot of filerA, hit ( ctrl-c ) to enter maintenance mode
- From maintenance mode type ( > halt ) to do a full reboot
- Hit ( del ) during memory test to get to the CFE prompt
- start the firmware update of the filer from the CFE> prompt using ( CFE> update_flash )
- Typed ( bye ) at the console after the update finished to reboot filerA
- filerA is now in a …waiting for giveback state
- Now to give services back to filerA we have to force it using ( > cf giveback -f ) from filerB
- This is required since we are now on different versions of DataOnTap between systems in the cluster.
- Giveback successful; checked firmware and OS version on filerA using ( > sysconfig -v )
- After checking services on both systems, it was time to upgrade filerB
- Have filerA take over the services of filerB ( > cf takeover -n )
- Type ( > halt ) from filerB to reboot it
- During reboot of filerB hit ( ctrl-c ) to enter into maintenance mode
- From maintenance mode type ( > halt ) to do a full reboot
- Hit ( del ) during memory test to get to the CFE prompt
- start the firmware update of the filer from the CFE> prompt using ( CFE> update_flash )
- Typed ( bye ) at console after update was finished to reboot filerB
- filerB is now in a …waiting for giveback state
- Now to give services back to filerB we have to force it using ( > cf giveback -f ) from filerA
- This is required since we are now on different versions of DataOnTap between systems in the cluster.
- Giveback successful; checked firmware and OS version on filerB using ( > sysconfig -v )
- Both systems should now show the updated firmware and OnTap version 7.2.3
- You should also notice that any out-of-date disk firmware is automatically updated. In my case I went from NA07 to NA08 on many of the disks.
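Condensed, the takeover/giveback flow for filerA looked roughly like the console sketch below (prompts and notes are illustrative; the CFE steps happen on filerA's serial console, and the same flow is repeated with the roles swapped for filerB):
filerB> cf status                 (confirm cluster failover is enabled)
filerB> cf takeover               (filerB takes over; filerA reboots)
   (on filerA's console: ctrl-c during reboot, then halt, then del during the memory test)
CFE> update_flash                 (update the filer firmware)
CFE> bye                          (reboot filerA; it comes up ...waiting for giveback)
filerB> cf giveback -f            (forced because the heads are now on different DataOnTap versions)
filerA> sysconfig -v              (verify new firmware and DataOnTap 7.2.3 on filerA)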
My final steps were to test system connections (a quick console check is sketched after the list):
- We use the following NetApp services: CIFS, FTP, HTTP, and FCP via VMware ESX. All worked fine. I also checked our student websites and our web-based FTP software that connects to the filer.
- Checked the domain connection using cifs testdc ( filerA> cifs testdc ); everything appeared fine
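As a minimal sketch, the console side of those checks looked something like this (the CIFS, FTP, HTTP and FCP tests were done from the clients themselves):
filerA> cf status           (cluster failover enabled, both heads serving normally)
filerA> sysconfig -v        (DataOnTap 7.2.3 and updated firmware)
filerA> cifs testdc         (domain controller connectivity looks good)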
Friday, August 24, 2007
Filer Panic - NetApp FAS 3050c cluster mode
I recently encountered a panic on one of my filers. I run two FAS 3050c NetApp filers in a clustered configuration. Here's what happened. One of the guys in the server room (who shall remain nameless) was messing around behind the server rack and somehow broke one of the fiber cables connecting filerB to its disk shelves (the shelves I run as my SAN that VMware ESX is connected to). No biggie; filerA quickly noticed the outage and picked up services for filerB. I got a notice from AutoSupport about the outage and was quickly in the server room to check it out. After going over filerB back and forth and checking connections, I attempted to run a ( cf giveback ) from filerA to give filerB its services back. filerB didn't like it and quickly threw the services back to filerA. That got me worried.
So I called NetApp support, who by the way already had a case opened because of the failover that had happened. After a couple of hours with support we decided it was either a faulty ESH2 module, a faulty cable, or something else (like a bad disk not reporting itself as bad in the loop). So NetApp sent out some parts: 3 new drives, a new cable and 2 ESH2 modules. (Now I'm glad I have the hardware warranty.)
I got the parts the next day and got everything replaced. Again, everything looked to be normal, but when checking the disk / shelf status with a ( > fcadmin device_map ) from within maintenance mode we noticed the filer was not recognizing the shelves in its loop, even though services were still running fine from filerA (see below). At this point we decided we needed to wait until I could take the entire system (both head units in the cluster) down so we could do an invasive test to see which piece of hardware was having the problem. Also, running an ( > aggr status -r ) showed me the filer thought all the disks in the shelf had failed.
*> fcadmin device_map
Loop Map for channel 0a:
Translated Map: Port Count 24
7 29 28 27 25 26 23 22 21 20 16 19 18 17 24 39 38 37 36 32 35 34 33 40
Shelf mapping:
Shelf Unknown: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 32 33 34 35 36 37 38 39 40
Loop Map for channel 0b:
Translated Map: Port Count 17
7 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
Shelf mapping:
Shelf Unknown: 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
Chuck at NetApp support recommended I first try just powering off the systems, then powering off the shelves and letting them sit for a couple of minutes. (I first had to shut down all my VMs running from the SAN and shut down all services running on both systems... with thousands of users this was a pain.) He then had me power on the shelves, let them fully come up, and then power on filerB by itself into maintenance mode. We again ran ( > fcadmin device_map ) (see below) and now the filer was seeing its shelves. Apparently there is a bug in the shelf firmware version I am on (just one version back) that causes certain panics to stay in memory. Hence our problem.
*> fcadmin device_map
Loop Map for channel 0a:
Translated Map: Port Count 24
7 29 28 27 25 26 23 22 21 20 16 19 18 17 24 39 38 37 36 32 35 34 33 40
Shelf mapping:
Shelf 1: 29 28 27 26 25 24 23 22 21 20 19 18 17 16
Shelf 2: XXX XXX XXX XXX XXX 40 39 38 37 36 35 34 33 32
Loop Map for channel 0b:
Translated Map: Port Count 17
7 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
Shelf mapping:
Shelf 1: 29 28 27 26 25 24 23 22 21 20 19 18 17 16
Target SES devices on this loop:
Shelf 1: 14 15
I then shut filerB back down, brought up filerA (which was still running services for both systems), and then brought filerB up. filerB came up in a ...waiting for giveback state. I issued a ( > cf giveback ) from filerA and filerB took back its services. We're back up and running.
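Roughly, that final giveback from filerA's console (the notes in parentheses are mine, not verbatim output):
filerA> cf status            (filerB is ...waiting for giveback)
filerA> cf giveback          (filerB takes its services back)
filerA> cf status            (cluster failover back to normal on both heads)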
NetApp recommended I upgrade the firmware as soon as possible and noted that the power cycle of the shelf is a temporary fix to the memory issue.