Network design is not a fixed process. Every time we add or change something in the network, we should check whether it is still as resilient as the original design. Let's analyze the scenario below:
Firewall - Fortigate 5.x
Core switch - Nexus 5k NX-OS 7.X
Routing between core and firewalls - static
With a direct connection between FW01-Core01 and FW02-Core02 we can detect link failures easily. The firewalls here are in HA Active-Passive mode, which means the secondary box doesn't process any traffic. In case of a Port1, Port2 or device failure, the secondary takes over and sends gratuitous ARP updates to the core switch. Likewise, when Core01 or Core02 fails, FW01/FW02 notices it and triggers a failover.
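For reference, a minimal FortiOS sketch of such an Active-Passive pair might look like the following (the group name and the heartbeat interface are assumptions for illustration, not taken from the diagram):

```
config system ha
    set group-name "edge-ha"
    set mode a-p
    set hbdev "port5" 50            # assumed dedicated heartbeat link
    set monitor "port1" "port2"     # loss of a monitored port triggers failover
    set override disable
end
```

With `set monitor`, the failure of Port1 or Port2 alone is enough to trigger a failover, which is what the first scenario below relies on.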
Now imagine you are tasked with putting an IDS between the core switches and the perimeter firewalls, as on the diagram below:
What is wrong with this scenario? Let's check whether the following failure scenarios are covered:
1) FW01/Port1/Port2 failure - with a port failure FW01 triggers a failover; with a device failure FW02 detects the lack of heartbeats, triggers a failover and updates the MAC table on the core switch. In case of an FW01 malfunction, FW02 will not see heartbeats either, so we are covered too.
Status: PASS
2) IDS01 device failure or external interface failure. FW01 can detect such an incident and triggers a failover.
Status: PASS
3) IDS01 can't process traffic (device malfunction) but its physical interfaces are up. In such a case there is nothing that could trigger a failover. Traffic will be dropped between FW01-IDS01 (ingress) and Core01-IDS01 (egress):
Status: FAIL
4) Core01 has a physical interface failure. There is no mechanism in place to trigger an FW01 failover. Traffic will be dropped by Core01:
Status: FAIL
5) Core01 can't process traffic (device malfunction). There is no mechanism in place to trigger an FW01 failover. Traffic will be dropped by Core01:
Status: FAIL
Of the five failure scenarios above, three are not resilient. The problem is that the devices are not directly connected, so they can't detect a link or device failure (scenarios 3, 4 and 5).
One way to fix the problem is the following change. We implement the SLA monitor / object tracking feature on the core switches (available on the Nexus 5k from NX-OS version 7) to monitor an interface on the firewall (Port2) and tie it to a static route. On the firewall we can use the Link Health Monitor to track the state of the remote interface (the VLAN interface on the core switch).
http://help.fortinet.com/fgt/handbook/cli52_html/index.html#page/FortiOS%25205.2%2520CLI/config_system.23.040.html
https://www.cisco.com/c/en/us/td/docs/switches/datacenter/nexus5500/sw/unicast/7_x/cisco_n5500_layer3_ucast_cfg_rel_6x/l3_object.html#92401
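As a sketch, the two sides could be configured roughly as follows (FortiOS 5.2 `link-monitor` and NX-OS IP SLA with object tracking; all IP addresses and the VLAN addressing are assumptions for illustration only):

```
# FW01 (FortiOS 5.2) - ping Vlan15 on Core01 through Port2
config system link-monitor
    edit "track-core01"
        set srcintf "port2"
        set server "10.1.15.1"           # assumed Vlan15 address on Core01
        set protocol ping
        set interval 5
        set failtime 5
        set update-static-route enable   # withdraw static routes on failure
    next
end

! Core01 (NX-OS 7.x) - probe Port2 on FW01 and tie a static route to it
feature sla sender
ip sla 10
  icmp-echo 10.1.15.2                    ! assumed Port2 address on FW01
  frequency 5
ip sla schedule 10 life forever start-time now
track 1 ip sla 10 reachability
ip route 0.0.0.0/0 10.1.15.2 track 1     ! route withdrawn when track 1 is down
```

Probing through the IDS is the key point: the ping traverses IDS01, so a silent IDS malfunction takes the probe down even though the physical links stay up.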
Even with this improvement, there is no way to trigger an HA failover based on the status of these features (object tracking on the Nexus and the Link Health Monitor on the Fortigate). They can only remove a static route so that traffic takes an alternate path. That means we need backup links:
There are four possible paths:
a) FW01 (port2) -> IDS01 -> Core01 (preferred)
b) FW01 (port3) -> IDS02 -> Core02
c) FW02 (port3) -> IDS01 -> Core01
d) FW02 (port2) -> IDS02 -> Core02
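On the firewall, this preference order can be expressed with static routes at different administrative distances; when the Link Health Monitor declares Vlan15 dead, the preferred route is withdrawn and the backup takes over. A hypothetical sketch (the prefix and gateway addresses are assumptions):

```
config router static
    edit 1
        set dst 10.0.0.0 255.0.0.0       # example internal prefix
        set gateway 10.1.15.1            # assumed Vlan15 on Core01, via IDS01
        set device "port2"
        set distance 10                  # preferred path (a)
    next
    edit 2
        set dst 10.0.0.0 255.0.0.0
        set gateway 10.1.25.1            # assumed Vlan25 on Core02, via IDS02
        set device "port3"
        set distance 20                  # backup path (b)
    next
end
```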
Let's analyze the last three scenarios, which weren't resilient in the previous design.
1) (former #3) - IDS01 can't process traffic (device malfunction) but its physical interfaces are up.
This is what happens:
- FW01 stays active, as it isn't able to detect the IDS malfunction
- The static route on FW01 via the preferred path is removed, because the link monitor on the Fortigate can't reach Vlan15 on Core01. The next available path from FW01 is via Port3 to IDS02 and then to Vlan25 on Core02
- Core01 detects the loss of connectivity to Port2 on the active firewall (FW01); the only available path is via Core02 and then IDS02 to Port3 on FW01. FW02 is in standby mode, which is why the paths via Vlan25 on Core01 and via Vlan15 on Core02 are not available
2) (former #4) - Core01 has a physical interface failure.
This is what happens:
- Core01 can't reach Port2 (via Vlan15) on FW01, so that static route is removed
- Core01 has only one available path: via Core02, then IDS02, to FW01 on Port3
- FW01 detects a problem reaching Vlan15 via Port2, and the next preferred path is via Port3 to IDS02
3) (former #5) - Core01 can't process traffic (device malfunction).
This is what happens:
- FW01 detects a problem reaching Vlan15 via Port2, and the next preferred path is via Port3 to IDS02
- Core01 is not available, so the only possible path is via Core02
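The path selection walked through above can be modeled in a few lines of purely illustrative Python. The path table mirrors the four paths a)-d); the names and the device-level failure sets are mine, not anything configured on the boxes:

```python
# Each path is a sequence of hops; a path is usable only when its firewall is
# the active HA member and none of its devices have failed.
PATHS = {
    "a": ["FW01:port2", "IDS01", "Core01"],   # preferred
    "b": ["FW01:port3", "IDS02", "Core02"],
    "c": ["FW02:port3", "IDS01", "Core01"],
    "d": ["FW02:port2", "IDS02", "Core02"],
}

def best_path(failed, active_fw="FW01"):
    """Return the most preferred path whose devices are all healthy
    and whose firewall is the active HA member."""
    for name in "abcd":
        hops = PATHS[name]
        if not hops[0].startswith(active_fw):
            continue  # standby firewall does not forward traffic
        if any(hop.split(":")[0] in failed for hop in hops):
            continue  # a device on this path has failed
        return name
    return None

print(best_path({"IDS01"}))   # former #3: IDS01 malfunction -> 'b'
print(best_path({"Core01"}))  # former #4/#5: Core01 failure -> 'b'
```

This is only a model of the decision logic, not of the route withdrawal mechanics, but it makes it easy to check that every single-device failure still leaves a working path.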
I think I went through all the possible failure scenarios. If not, please let me know. You may think all this work could be done by dynamic routing protocols; the Fortigate and the Cisco Nexus support most of them. The problem is that many organizations don't accept dynamic routing on firewalls.