Friday, August 24, 2007

Filer Panic - NetApp FAS 3050c cluster mode

I recently encountered a panic on one of my filers. I run two FAS 3050c NetApp filers in a clustered configuration. Here's what happened. One of the guys in the server room (who shall remain nameless) was messing around behind the server rack and somehow broke one of the fiber cables connecting filerB to its disk shelves (the shelves I run as my SAN, which VMware ESX connects to). No biggie, filerA quickly noticed the outage and picked up services for filerB. I got a notification from AutoSupport about the outage and was quickly in the server room to check it out. After going over filerB and checking all the connections, I attempted to run a ( cf giveback ) from filerA to give filerB its services back. filerB didn't like it and quickly threw the services back to filerA. That got me worried.
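
For reference, the takeover check and the giveback attempt are just a couple of commands from filerA's console. This is a rough sketch from memory, so the prompts and exact output wording may differ on your system:

filerA> cf status
   (reports the failover state, e.g. that filerA has taken over filerB)
filerA> cf giveback
   (hands filerB's services back; if filerB still can't see its shelves, it fails right back over to filerA, which is exactly what I saw)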

So I called NetApp support, who, by the way, already had a case opened up because of the failover. After a couple of hours with support we decided it was either a faulty ESH2 module, a faulty cable, or something else (like a bad disk not reporting itself as bad in the loop). So NetApp sent out some parts: three new drives, a new cable, and two ESH2 modules. (Now I'm glad I have the hardware warranty.)

I get the parts the next day and get everything replaced. Again, everything looks normal, but when we check the disk/shelf status with a ( > fcadmin device_map ) from within maintenance mode, we notice the filer is not recognizing the shelves in its loop, even though services are still running fine from filerA (see below). At this point we decide we need to wait until I can take the entire system (both head units in the cluster) down, so we can run an invasive test to see which piece of hardware is having the problem. Also, running an ( > aggr status -r ) shows the filer thinks all the disks in the shelf have failed (there's a quick sketch of that check after the output below).

*> fcadmin device_map
Loop Map for channel 0a:
Translated Map: Port Count 24
7 29 28 27 25 26 23 22 21 20 16 19 18 17 24 39 38 37 36 32 35 34 33 40
Shelf mapping:
Shelf Unknown: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 32 33 34 35 36 37 38 39 40
Loop Map for channel 0b:
Translated Map: Port Count 17
7 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
Shelf mapping:
Shelf Unknown: 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
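
And here's the kind of check I mean for the disk side of things. Just a sketch of the commands (same caveat, from memory); the actual output is long, but the failed disks are obvious when you see it:

*> aggr status -r
   (per-aggregate RAID view; every disk in the affected shelf showed up as failed)
*> sysconfig -r
   (roughly the same RAID/disk view, also usable from the normal prompt)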



Chuck at NetApp support recommends I first try just powering off the systems, then powering off the shelves and letting them sit for a couple of minutes. (I first had to shut down all my VMs running from the SAN and shut down all services running on both systems... with thousands of users this was a pain.) He then has me power on the shelves, let them fully come up, and then power on filerB by itself into maintenance mode. We again run the ( > fcadmin device_map ) (see below), and now the filer is seeing its shelves. Apparently there is a bug in the shelf firmware version I am on (just one version back) that causes certain panics to stay in memory. Hence, our problem.

*> fcadmin device_map
Loop Map for channel 0a:
Translated Map: Port Count 24
7 29 28 27 25 26 23 22 21 20 16 19 18 17 24 39 38 37 36 32 35 34 33 40
Shelf mapping:
Shelf 1: 29 28 27 26 25 24 23 22 21 20 19 18 17 16
Shelf 2: XXX XXX XXX XXX XXX 40 39 38 37 36 35 34 33 32
Loop Map for channel 0b:
Translated Map: Port Count 17
7 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
Shelf mapping:
Shelf 1: 29 28 27 26 25 24 23 22 21 20 19 18 17 16

Target SES devices on this loop:
Shelf 1: 14 15

I now shut filerB back down, bring up filerA (which is still running services for both systems), and then bring filerB up. filerB comes up in a ...waiting for giveback state. I issue a ( > cf giveback ) from filerA and filerB takes back its services. We're back up and running.
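
Once filerB has its services back, a quick sanity check from its console is worth the extra minute. Again, just a sketch from memory:

filerB> cf status
   (should report that the cluster is enabled and the partner is up)
filerB> aggr status
   (confirm the aggregates are online and no longer showing failed disks)
filerB> vol status
   (confirm the volumes are online)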

NetApp recommended I upgrade the shelf firmware as soon as possible and noted that the power cycle of the shelf is only a temporary fix for the memory issue.
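
For what it's worth, the shelf firmware update itself is pretty painless on these systems. Roughly (and this is from memory, so double-check it against the NOW site / your ONTAP docs before trusting it): drop the new firmware files into /etc/shelf_fw on the root volume, then:

filerB> sysconfig -v
   (verbose sysconfig shows the current shelf/ESH module firmware revisions per adapter)
filerB> storage download shelf
   (pushes the firmware from /etc/shelf_fw out to the attached shelves)
filerB> sysconfig -v
   (check the revisions again afterwards to confirm the update took)

Firmware downloads can disrupt I/O on some shelf module types, so check the release notes for your exact shelf/module combination before running it on a live loop.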
