
What should you know about the HA 'override enabled' setting on FortiGate?

High availability is mandatory in most of today's network designs. Only very small companies or branches can run their business without redundancy. When you have a FortiGate firewall in your network, you have several options to increase network availability. You can use the FortiGate Clustering Protocol (FGCP) or the Virtual Router Redundancy Protocol (VRRP). FGCP has two modes: 'override' disabled (the default) and 'override' enabled. I'm not going to explain how to set up HA, as you can find many resources on Fortinet websites:

https://cookbook.fortinet.com/high-availability-two-fortigates-56/

https://cookbook.fortinet.com/high-availability-with-fgcp-56/





Let's recap the main difference between them. The default HA setting is 'override' disabled, and this is the order in which the cluster selects the active unit:

1) number of monitored interfaces - the unit with more working (up) monitored interfaces wins; when both units have the same number, check the next parameter
2) HA uptime - the unit with the higher uptime wins, but the difference between them has to exceed 5 minutes. When the uptime is the same or the difference does not exceed 5 minutes, check the next condition
3) priority - the default value is 128, and you should set a higher value on the primary unit; when the priority is the same, check the last condition
4) serial number - when all the above conditions are equal, the serial numbers are compared to elect the primary unit

The second option in HA is 'override' enabled, and the only difference is the order in which the parameters are checked:

1) number of monitored interfaces
2) priority
3) HA uptime
4) serial number
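
To make the two orderings concrete, here is a minimal Python sketch of the comparison as described in the lists above. It is only a model for reasoning, not FortiOS code: the `Unit` fields, the `elect` function and the example values are hypothetical, the 5-minute margin is applied in both modes, and treating the higher serial number as the winner is an assumption (the text only says the serial numbers are compared).

```python
from dataclasses import dataclass

UPTIME_MARGIN_S = 300  # 5-minute margin from the lists above


@dataclass(frozen=True)
class Unit:
    name: str          # hypothetical label for the example
    up_monitored: int  # monitored interfaces that are up
    uptime_s: int      # HA uptime in seconds
    priority: int      # default is 128, higher wins
    serial: str        # last tie-breaker


def _cmp(x, y) -> int:
    """Return 1 if x beats y, -1 if y beats x, 0 on a tie."""
    return (x > y) - (x < y)


def _cmp_uptime(a: Unit, b: Unit) -> int:
    """Uptime only counts when the difference exceeds the margin."""
    diff = a.uptime_s - b.uptime_s
    return 0 if abs(diff) <= UPTIME_MARGIN_S else _cmp(diff, 0)


def elect(a: Unit, b: Unit, override: bool) -> Unit:
    """Pick the active unit using the order for the given mode."""
    interfaces = _cmp(a.up_monitored, b.up_monitored)
    uptime = _cmp_uptime(a, b)
    priority = _cmp(a.priority, b.priority)
    serial = _cmp(a.serial, b.serial)  # assumption: higher serial wins
    if override:
        order = [interfaces, priority, uptime, serial]
    else:
        order = [interfaces, uptime, priority, serial]
    for result in order:
        if result != 0:
            return a if result > 0 else b
    return a  # fully identical units; arbitrary choice in this sketch


# A repaired primary rejoins a cluster whose secondary has run for a week.
primary = Unit("primary", 2, 60, 200, "FG-EXAMPLE-0002")
secondary = Unit("secondary", 2, 7 * 24 * 3600, 100, "FG-EXAMPLE-0001")

print(elect(primary, secondary, override=False).name)  # secondary
print(elect(primary, secondary, override=True).name)   # primary
```

With the same two units, only the position of the priority check changes the outcome: the long-running secondary keeps the active role when 'override' is disabled, and the higher-priority primary takes it when 'override' is enabled.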

As you can see, the priority is compared before the HA uptime. In this variation you have more control over which unit wins and becomes the active one. It's tempting, but there are some pitfalls you should be aware of.

Case #1

With 'override' enabled, your primary unit will be the active one whenever possible. Imagine the primary unit suffers a hardware failure. The firewall fails and the secondary device becomes the active one (1st failover). Then you fix the primary unit and reconnect it to the network. Once the device is up and running, it immediately takes over the active role, causing the 2nd failover.
You should plan such activity carefully, as it's very disruptive. When you do the same with the 'override' disabled configuration, the uptime is compared and nothing changes, because the rejoining unit has the lower value.

Case #2

The next problem appears when your primary device reboots every 5 minutes, for instance due to power instability. As you can imagine, every time the unit comes back, it triggers a failover. With 'override' disabled this doesn't cause any problems as long as your backup device is up and running.
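
The difference between the two modes in Case #1 and Case #2 can be counted with a small Python model. This is a sketch under stated assumptions, not a simulation of FGCP: the `count_failovers` function and the event names are hypothetical, the secondary is assumed to stay up, the primary is assumed to have the higher priority, and the rejoining primary is assumed to always trail the secondary's uptime by more than 5 minutes.

```python
def count_failovers(primary_events, override):
    """Count active-role changes for a list of primary 'down'/'up' events."""
    active = "primary"
    failovers = 0
    for event in primary_events:
        if event == "down" and active == "primary":
            # The active unit failed, so the secondary takes over.
            active = "secondary"
            failovers += 1
        elif event == "up" and override and active == "secondary":
            # Override enabled: the higher-priority primary takes the role back.
            active = "primary"
            failovers += 1
        # Override disabled: the rejoining primary has the lower uptime
        # and stays standby, so nothing changes.
    return failovers


# The primary loses power and comes back three times in a row.
flapping = ["down", "up"] * 3

print(count_failovers(flapping, override=True))   # 6
print(count_failovers(flapping, override=False))  # 1
```

Under these assumptions every reboot of the primary costs two role changes with 'override' enabled, while with 'override' disabled only the very first failure moves the active role.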

Case #3

The last scenario is similar to the ones above, but now I focus on something other than the number of failovers, which are of course not desired. Imagine your primary device is broken and you wait a week or so for a power supply. That's a very long time, and in a busy environment many things may happen, including configuration changes. Assume that during that time you added some firewall policies. Finally your primary device is fixed and you connect it to your network. As you expect, the repaired device becomes the active one. Everything looks fine so far, but there is one problem. When the repaired device negotiates and rebuilds the cluster, it also synchronizes the configuration. Guess in which direction? Of course the device with the higher priority sends its configuration to its peer. In our case the fixed device sends its (week old!) config to the current active unit and overwrites the changes you added recently. You can of course prepare for it: increase the priority on the secondary (currently active) device or lower the value on the fixed device, and change the value back once the cluster is in sync.


As you can see, the 'override' enabled variation should be used only in specific situations, and you should take precautions when using it. Compared with the default, 'override' enabled requires more attention and better planning.


Comments

  1. Thanks Hubert, very informative. This level of detail is perfect for Service Managers like myself who need to be able to identify possible service disruption during maintenance to be able to make informed decisions and communicate risk to the customer.

    1. Thank you David, it's nice to see you here :)

  2. Great article Hubert! From what I understand, override disabled brings fewer caveats or issues in many cases. Can we say that it is the recommended FGCP failover mode in most cases, as it may bring fewer headaches in the most common situations? If that is not completely true, can you give some examples of when override enabled may bring advantages in practical terms, apart from having more control over which unit becomes active?

    1. Fortinet doesn't recommend using the 'override enabled' feature when you don't use 'virtual clustering'. Virtual clustering is similar to Active-Active HA on Cisco ASA with contexts. With 'virtual clustering' the option 'override enabled' is set by default. It helps you to distribute traffic equally across both units. With uptime as the main factor you would have all VDOMs active on one unit (the one with the higher uptime). Virtual clustering is not mandatory, and you can run the cluster (also with VDOMs) in Active-Passive mode, where only one physical unit processes all traffic.
      I know my answer may raise more questions, but for all designs where you don't need 'virtual clustering' you should use the default setting. You can trigger a failover using the command below (on the current active unit):

      diagnose sys ha reset-uptime

      and by resetting the uptime, the device becomes the standby one.
      Maybe I'll compare Fortinet and Cisco HA in my next post, as they use the same terms with different meanings, which may be confusing.

  3. Thanks! It wasn't that clear from Fortinet's documentation in which cases it would cause network disruptions.
    Thank you for the tips!
