I took over as the man in charge of the network at my day job a little over a year ago. Before me there were some guys who didn't have the level of IT knowledge I have. I'm not trying to toot my own horn here. They just specialized in different areas. For instance, the guy who hired me was at his core a database administrator. He didn't know anything about networking or Active Directory. He also didn't know how to set up an iSCSI SAN from NetApp.
Because of that, when they purchased our NetApp FAS2020 SAN, they had our vendor configure it for them. Well, our vendor apparently didn't know their ass from their elbow about how to configure the thing for takeover in the event of a network failure. To be honest, I didn't either, but at least I know how to test to see if it works!
Fast forward to October 28th of this year. Our data center was doing some power maintenance on one of their generators, and our 'A' power feed was cut for several hours. In theory nothing should have gone down for us, because our two iSCSI switches are on separate power, and so are the two power supplies in the NetApp. Well, when one of the switches went down, so did all of the LUNs on one of the NetApp controllers. For some reason takeover did not happen when one of the NICs lost its link.
After troubleshooting with NetApp, it appears that the reason a takeover didn't occur was that the /etc/rc files were configured incorrectly by the vendor my company had set up the NetApp. Every NIC needed the nfo (negotiated failover) option on its ifconfig line, and none of them had it. What I had to do was SSH into both filers and edit the /etc/rc file by running:
wrfile /etc/rc
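Then I pasted the following into the terminal where the cursor was. Keep in mind that wrfile replaces the whole file with whatever you type, so paste the complete contents, not just the changed lines: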
hostname filer01
ifconfig e0b `hostname`-e0b netmask 255.255.255.0 mtusize 9000 trusted -wins mediatype auto flowcontrol full nfo partner e0b
ifconfig e0a `hostname`-e0a netmask 255.255.255.0 mtusize 1500 trusted -wins mediatype auto flowcontrol full nfo partner e0a
route add default 192.168.1.1 1
routed on
options dns.domainname ns0.bauer-power.net
options dns.enable on
options nis.enable off
savecore
After that I pressed Enter, then Ctrl+C to save the file. Once that is done, you also need to make sure that cf.takeover.on_network_interface_failure is set to on by running:
options cf.takeover.on_network_interface_failure on
And you need to make sure cf.takeover.on_network_interface_failure.policy is set to any_nic by running:
options cf.takeover.on_network_interface_failure.policy any_nic
You need to make these changes on both filers. Make sure you change the hostname to match the other filer in its /etc/rc file. Also change anything else you need to fit your network.
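Since the whole problem came down to a missing nfo on the ifconfig lines, it can be worth double-checking each filer's /etc/rc before trusting it. Here is a minimal Python sketch of that check, assuming you have copied the file contents somewhere you can run Python (for example, the output of rdfile /etc/rc). The function name interfaces_missing_nfo and the SAMPLE_RC text are my own for illustration, and the sample deliberately leaves nfo off of e0a to show what a bad line looks like.

```python
def interfaces_missing_nfo(rc_text):
    """Return the interface names whose ifconfig line has no nfo option."""
    missing = []
    for line in rc_text.splitlines():
        tokens = line.split()
        # Only look at lines of the form: ifconfig <interface> <options...>
        if len(tokens) < 2 or tokens[0] != "ifconfig":
            continue
        # nfo must appear as its own word; "-nfo" turns the option off
        if "nfo" not in tokens[2:]:
            missing.append(tokens[1])
    return missing


# Illustrative sample only: e0a is missing nfo on purpose.
SAMPLE_RC = """\
hostname filer01
ifconfig e0b `hostname`-e0b netmask 255.255.255.0 mtusize 9000 trusted -wins mediatype auto flowcontrol full nfo partner e0b
ifconfig e0a `hostname`-e0a netmask 255.255.255.0 mtusize 1500 trusted -wins mediatype auto flowcontrol full partner e0a
route add default 192.168.1.1 1
"""

if __name__ == "__main__":
    print(interfaces_missing_nfo(SAMPLE_RC))
```

An empty list means every ifconfig line in the text carries nfo. This only inspects the file text; it says nothing about what the running filer currently has applied.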
Once those changes are complete, you need to manually perform a takeover of one node (cf takeover), then manually perform a giveback (cf giveback). Then do the same thing with the other node. After we did this I was able to simulate a network failure by unplugging a network cable on one node. It took about 51 seconds, but the takeover happened automatically, and we didn't really lose connections with our LUNs.
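If you want a number for how long the takeover takes instead of watching a clock, something like the following Python sketch could run on a host on the iSCSI network while you pull the cable. It is not what I used for the 51 second figure; it is just one way to measure it. It assumes the filer answers on the standard iSCSI port, TCP 3260, and the function names and the address in the usage note are made up for illustration.

```python
import socket
import time


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def longest_outage(samples):
    """samples is a time-ordered list of (timestamp_seconds, reachable).

    Returns the longest gap between the first failed probe of an outage and
    the next successful probe. An outage that never recovers is not counted.
    """
    longest = 0.0
    down_since = None
    for timestamp, reachable in samples:
        if not reachable and down_since is None:
            down_since = timestamp  # outage starts here
        elif reachable and down_since is not None:
            longest = max(longest, timestamp - down_since)
            down_since = None  # outage is over
    return longest


def watch(host, port=3260, duration=300, interval=1.0):
    """Probe host:port every interval seconds for duration seconds."""
    samples = []
    end = time.monotonic() + duration
    while time.monotonic() < end:
        samples.append((time.monotonic(), port_is_open(host, port)))
        time.sleep(interval)
    return longest_outage(samples)


if __name__ == "__main__":
    # Synthetic samples, no network needed: down at t=1, back at t=52.
    demo = [(0, True), (1, False), (30, False), (52, True), (53, True)]
    print(longest_outage(demo))
```

To use it for real you would call something like watch("192.168.1.50") with your own filer's iSCSI address, start it, then unplug the cable. The result is only as precise as the probe interval plus the connection timeout.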
Special thanks to Cecilia Thompson at NetApp Tech Support for helping me track down the root cause of this!