High availability is mandatory in most of today's network designs. Only very small companies or branches can run their business without redundancy. When you have a FortiGate firewall in your network, you have several options to increase network availability. You can use the FortiGate Clustering Protocol (FGCP) or the Virtual Router Redundancy Protocol (VRRP). FGCP has two modes: 'override' disabled (the default) and 'override' enabled. I'm not going to explain how to set up HA, as you can find many resources on Fortinet websites:
https://cookbook.fortinet.com/high-availability-two-fortigates-56/
https://cookbook.fortinet.com/high-availability-with-fgcp-56/
Let's recap the main difference between them. The default HA setting is 'override' disabled, and the active unit is selected in this order:
1) number of monitored interfaces - the unit with more working (up) monitored interfaces wins; when both units have the same number, check the next parameter
2) HA uptime - the unit with the higher uptime wins, but the difference between them has to exceed 5 minutes; when the uptime is the same or the difference is 5 minutes or less, check the next condition
3) priority - the default value is 128 and you should set a higher value on the primary unit; when the priorities are the same, check the last condition
4) serial number - when all the conditions above are equal, the serial numbers are compared to elect the primary unit
The second option in HA is 'override' enabled, and the only difference is the order in which the parameters are checked:
1) number of monitored interfaces
2) priority
3) HA uptime
4) serial number
As you can see, the priority is compared before the HA uptime. This variation gives you more control over which unit wins and becomes the active one. It's tempting, but there are some pitfalls you should be aware of.
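To make the two orderings easier to compare, here is a minimal Python sketch of the election as described above. It is only an illustration, not Fortinet's implementation: the `Unit` fields, the `elect` function, the serial numbers and the rule that the higher serial number wins the final tie-break are my assumptions for the example.

```python
from dataclasses import dataclass

UPTIME_MARGIN_S = 300  # the 5-minute uptime margin mentioned above


@dataclass(frozen=True)
class Unit:
    name: str
    up_monitored: int  # monitored interfaces that are currently up
    ha_uptime_s: int   # HA uptime in seconds
    priority: int      # default is 128
    serial: str        # hypothetical serial number


def elect(a: Unit, b: Unit, override: bool) -> Unit:
    """Return the unit that becomes active, following the order in the text."""

    def by_interfaces():
        if a.up_monitored != b.up_monitored:
            return a if a.up_monitored > b.up_monitored else b
        return None

    def by_uptime():
        # Uptime only decides when the difference exceeds the margin.
        if abs(a.ha_uptime_s - b.ha_uptime_s) > UPTIME_MARGIN_S:
            return a if a.ha_uptime_s > b.ha_uptime_s else b
        return None

    def by_priority():
        if a.priority != b.priority:
            return a if a.priority > b.priority else b
        return None

    def by_serial():
        # Assumption for this sketch: the higher serial number wins.
        return a if a.serial > b.serial else b

    # 'override' only swaps the order of the two middle checks.
    middle = [by_priority, by_uptime] if override else [by_uptime, by_priority]
    for check in [by_interfaces, *middle, by_serial]:
        winner = check()
        if winner is not None:
            return winner
    raise AssertionError("by_serial always returns a unit")


# A repaired high-priority unit rejoins a standby that has been up for a week.
repaired = Unit("primary", 2, 60, 200, "FGT0001")
standby = Unit("secondary", 2, 7 * 24 * 3600, 100, "FGT0002")
print(elect(repaired, standby, override=False).name)  # secondary
print(elect(repaired, standby, override=True).name)   # primary
```

With 'override' disabled the long-running unit keeps the active role, while with 'override' enabled the rejoining unit takes over because of its priority. That second result is exactly what the cases below are about.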
Case #1
With 'override' enabled, your primary unit will be the active one whenever possible. Imagine that the primary unit suffers a hardware failure. The firewall fails and the secondary device becomes the active one (1st failover). Then you fix the primary unit and reconnect it to the network. Once the device is up and running, it immediately takes over the active role, causing a 2nd failover.
You should plan such an activity carefully, as it's very disruptive. When you do the same with 'override' disabled, the uptime is compared and nothing changes, as the returning unit has the lower value.
Case #2
The next problem appears when your primary device keeps rebooting, for instance every 5 minutes due to power instability. As you can imagine, every time the unit comes back, it triggers a failover. With 'override' disabled this doesn't cause any problems as long as your backup device is up and running.
Case #3
The last scenario is similar to the ones above, but now I focus on something other than the number of failovers, which are of course not desired. Imagine a situation where your primary device is broken and you wait a week or so for a power supply. That's a very long time, and in a busy environment many things may happen, including configuration changes. Assume that during that time you added some firewall policies. Finally your primary device is fixed and you connect it to your network. As you expect, the repaired device becomes the active one. Everything looks fine so far, but there is one problem. When the repaired device negotiates and rebuilds the cluster, it also synchronizes the configuration. Guess in which direction? Of course, the device that wins the election, here the one with the higher priority, sends its configuration to its peer. In our case the fixed device sends its (week old!) config to the currently active unit and overwrites the changes you added recently. You can of course prepare for this by increasing the priority on the secondary (currently active) device or lowering the value on the fixed device, and changing the value back once the cluster is in sync.
As you can see, the 'override' enabled variation should be used only in specific situations, and you should take precautions when using it. Compared with the default, 'override' enabled requires more attention and better planning.
Thanks Hubert, really useful!
Thanks for comments!
Thanks Hubert, very informative. This level of detail is perfect for Service Managers like myself who need to be able to identify possible service disruption during maintenance to be able to make informed decisions and communicate risk to the customer.
Thank you David, it's nice to see you here :)
Great article Hubert! From what I understand, override disabled brings fewer caveats or issues in many cases. Can we say that it is the recommended FGCP failover mode in most cases, as it may bring fewer headaches in the most common situations? If that is not completely true, can you give some examples of when override enabled may bring some advantages in practical terms, excluding having more control over which unit becomes active?
Fortinet doesn't recommend using the 'override enabled' feature when you don't use 'virtual clustering'. This is similar to Active-Active HA on Cisco ASA with contexts. With 'virtual clustering' the option 'override enabled' is set by default. It helps you to distribute traffic equally across both units. With uptime as the main factor you would have all VDOMs active on one unit (the one with the higher uptime). That option (virtual clustering) is not mandatory, and you can have the cluster (also with VDOMs) in Active-Passive mode where only one physical unit processes all traffic.
I know my answer may raise more questions, but for all designs where you don't need 'virtual clustering' you should use the default setting. You can trigger a failover using the command below (on the currently active unit):
diagnose sys ha reset-uptime
and by resetting the uptime the device becomes a standby one.
Maybe I will compare Fortinet and Cisco HA in my next post, as they use the same terms but with different meanings, which may be confusing.
Thanks.
Thanks! It wasn't that clear from Fortinet's documentation in which cases it would cause network disruptions.
Thank you for the tips!