Layer 2 Redundancy with STP: Palo Alto Firewall + Cisco Switches

I built a basic test lab with a Palo Alto Networks PA-200 firewall and two Cisco Catalyst 2950 switches to test the Spanning Tree Protocol (STP) as a way to achieve Layer 2 redundancy for the physical connections to and from the firewall. This post lists the configurations, the “show spanning-tree” outputs of the switches, and a few other outputs captured after several tests. Not all tests ran without problems, so I think something must be wrong with my configuration, the test sequence, the STP process, or the MAC address tables. Maybe some readers have had similar experiences?

[UPDATE] Problem solved! I had missed the Layer 2 zones. The description is at the bottom. [/UPDATE]

Although the Palo Alto firewall does not participate in STP itself, it forwards the BPDUs from the switches. In other words, the complete STP process takes place on the two switches, while the firewall acts as a simple Layer 2 forwarding device. All ports between the network devices are configured as VLAN trunk ports. I am not using the firewall as a “Layer 2 firewall” with appropriate zones, but as a Layer 3 firewall with VLAN interfaces. Consequently, there were no security policies between the Layer 2 interfaces during the tests.
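The following minimal Python sketch makes the expected forwarding behaviour concrete. It models an idealized transparent learning bridge, which is the role assumed for the PA-200 in this lab (the update at the top notes that the real firewall also needs Layer 2 zones). The bridge learns source MAC addresses and floods frames to group addresses. Group addresses include broadcasts and STP BPDUs such as Cisco's PVST+ address 01:00:0c:cc:cc:cd, so the bridge relays these frames instead of processing them. The class name, the MAC addresses and the choice of eth1/3.120 and eth1/4.120 as ports are hypothetical.

```python
class LearningBridge:
    """Idealized transparent bridge: learns source MACs, does not run STP itself."""

    def __init__(self, ports: list[str]) -> None:
        self.ports = list(ports)
        self.table: dict[str, str] = {}  # MAC address -> port it was last seen on

    @staticmethod
    def is_group_address(mac: str) -> bool:
        # The I/G bit (lowest bit of the first octet) marks multicast/broadcast.
        return int(mac.split(":")[0], 16) & 1 == 1

    def receive(self, in_port: str, src: str, dst: str) -> list[str]:
        """Learn the source MAC and return the ports the frame is sent out of."""
        self.table[src] = in_port  # a MAC seen on a new port simply moves
        if self.is_group_address(dst) or dst not in self.table:
            # Broadcasts, BPDUs and unknown unicast are flooded to all other ports.
            return [p for p in self.ports if p != in_port]
        out = self.table[dst]
        return [] if out == in_port else [out]


# Hypothetical MAC addresses; the two ports are VLAN 120 subinterfaces of the PA-200.
SW01_MAC = "00:00:00:00:01:01"
NB01 = "00:00:00:00:0b:01"
pa = LearningBridge(["eth1/3.120", "eth1/4.120"])

# A PVST+ BPDU from SW01 is relayed to the other side, not consumed.
print(pa.receive("eth1/3.120", SW01_MAC, "01:00:0c:cc:cc:cd"))  # ['eth1/4.120']

# nb01 is first learned on eth1/3.120. After a topology change it shows up
# on eth1/4.120, and the table entry moves with it.
pa.receive("eth1/3.120", NB01, "ff:ff:ff:ff:ff:ff")
pa.receive("eth1/4.120", NB01, "ff:ff:ff:ff:ff:ff")
print(pa.table[NB01])  # eth1/4.120
```

Flooding group addresses makes BPDUs pass through, so the two switches see each other's BPDUs across the firewall.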

Here is the basic laboratory installation:

Figure: PA-200 and Cisco switches STP laboratory

The Cisco Catalyst 2950 switches run IOS “12.1(22)EA14”, while the PA-200 has PAN-OS 5.0.8 installed. The two notebooks are simple Linux machines.

Layer 2 Configuration PA-200

The following screenshots show the configuration of the Palo Alto firewall. Each of the two physical Layer 2 interfaces has two subinterfaces, one for VLAN 120 and one for VLAN 125. Two VLAN interfaces (Layer 3) provide routing and are assigned to Layer 3 zones. Finally, each of the two VLANs has its subinterfaces and its VLAN interface assigned to it.

Screenshots: PA-200 Ethernet interfaces, PA-200 VLAN interfaces

Screenshots: PA-200 zones, PA-200 VLANs
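The configuration described above can also be summarized as data. This Python sketch models the two VLAN objects and their members. It then lists every interface that has no security zone, which is the gap mentioned in the update at the top. In PAN-OS, an interface must be assigned to a security zone before it processes traffic. The subinterfaces eth1/3.120 and eth1/4.120 appear in the MAC table summary further below. The .125 subinterfaces, the vlan.120/vlan.125 interface names, the VLAN object names and the zone names are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Interface:
    name: str
    layer: int                  # 2 = Layer 2 subinterface, 3 = VLAN interface
    zone: Optional[str] = None  # None = no security zone assigned


@dataclass
class VlanObject:
    name: str
    members: list[Interface] = field(default_factory=list)


def zoneless(vlans: list[VlanObject]) -> list[str]:
    """Names of all VLAN members that have no security zone."""
    return [i.name for v in vlans for i in v.members if i.zone is None]


# Hypothetical model of the original lab: Layer 2 subinterfaces without
# zones, Layer 3 VLAN interfaces with (made-up) Layer 3 zones.
lab = [
    VlanObject("vlan120", [
        Interface("eth1/3.120", 2),
        Interface("eth1/4.120", 2),
        Interface("vlan.120", 3, zone="l3-vlan120"),
    ]),
    VlanObject("vlan125", [
        Interface("eth1/3.125", 2),
        Interface("eth1/4.125", 2),
        Interface("vlan.125", 3, zone="l3-vlan125"),
    ]),
]

print(zoneless(lab))  # ['eth1/3.120', 'eth1/4.120', 'eth1/3.125', 'eth1/4.125']
```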

Switch Configuration with STP

The following listing shows the relevant Cisco commands in the test lab. It is taken from SW01; SW02 is identical except for the management IPv4 address. Note that I configured only the first two spanning-tree commands, while the other two appeared automatically.

Expected Test Results

Since the Layer 2 interfaces on the Palo Alto behave like a normal switch with no STP enabled, the whole spanning tree process should work as usual from the perspective of the two Cisco switches. That is, in the case of a loop (all cables plugged in), one switch port should be in blocking (BLK) state while the other ports still forward (FWD) traffic. In addition, the Palo Alto firewall should notice when a MAC address moves from one physical port to another, which should be visible in its MAC address table. The same is expected for the CAM tables of the switches.
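The expected STP outcome can be reproduced with a small Python model. Because the PA-200 relays the BPDUs, STP treats the path from SW01 Fa0/1 through the PA-200 to SW02 Fa0/1 as one more link between the two switches, parallel to the direct Gi0/1 cable. The model works in three steps:

1. It elects the root bridge: lowest priority, then lowest MAC.
2. On the non-root switch, it picks the port with the lowest path cost as the root port.
3. It marks the remaining port as alternate/blocking.

The model rests on these assumptions:

- Gi0/1 runs at 1 Gbit/s and Fa0/1 at 100 Mbit/s, with Cisco's default short path costs.
- The bridge MAC addresses are made up, chosen so that SW01 becomes root.
- It omits the sender bridge ID and port ID tie-breakers.

```python
from dataclasses import dataclass

# Cisco default "short" path costs (IEEE 802.1D-1998 values), keyed by Mbit/s.
PATH_COST = {100: 19, 1000: 4}


@dataclass(frozen=True)
class Bridge:
    name: str
    priority: int  # bridge priority incl. the VLAN ID (extended system ID)
    mac: str

    def bridge_id(self) -> tuple[int, str]:
        # Lower priority wins; the lower MAC address breaks a tie.
        return (self.priority, self.mac)


@dataclass(frozen=True)
class Link:
    port_a: str  # port on the first bridge
    port_b: str  # port on the second bridge
    speed: int   # link speed in Mbit/s


def port_roles(a: Bridge, b: Bridge, links: list[Link]) -> dict[str, str]:
    """STP port roles for two bridges joined by parallel links (simplified)."""
    a_is_root = a.bridge_id() < b.bridge_id()
    root, other = (a, b) if a_is_root else (b, a)

    roles: dict[str, str] = {}
    cost_via: dict[str, int] = {}  # non-root port -> root path cost through it
    for link in links:
        root_port, other_port = (
            (link.port_a, link.port_b) if a_is_root else (link.port_b, link.port_a)
        )
        roles[f"{root.name} {root_port}"] = "Desg FWD"  # root bridge: all designated
        cost_via[other_port] = PATH_COST[link.speed]

    best = min(cost_via.values())
    if list(cost_via.values()).count(best) > 1:
        # Real STP would compare sender bridge ID and port ID next.
        raise ValueError("equal-cost paths: tie-breakers not modelled here")
    for port, cost in cost_via.items():
        roles[f"{other.name} {port}"] = "Root FWD" if cost == best else "Altn BLK"
    return roles


# Hypothetical bridge MACs; default priority 32768 + VLAN 120.
sw01 = Bridge("SW01", 32768 + 120, "00:00:00:00:01:01")
sw02 = Bridge("SW02", 32768 + 120, "00:00:00:00:02:01")
links = [
    Link("Gi0/1", "Gi0/1", 1000),  # direct cable between the switches
    Link("Fa0/1", "Fa0/1", 100),   # path through the PA-200, which relays BPDUs
]
for port, role in sorted(port_roles(sw01, sw02, links).items()):
    print(port, role)
# SW01 Fa0/1 Desg FWD
# SW01 Gi0/1 Desg FWD
# SW02 Fa0/1 Altn BLK
# SW02 Gi0/1 Root FWD
```

Under these assumptions, the result matches the test #2 observation that SW02 put its Fa0/1 port into “Altn BLK” state.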

Test Sequence & Results

I created the following test sequence and captured the “show spanning-tree” output from both switches after each test. Beginning with test #11, I also captured the MAC address tables and the ARP caches of both switches and the firewall.

For the first tests, I pinged an external IPv4 address from an internal client located behind SW02. Beginning with test #11, I pinged the default gateway, i.e., the Palo Alto address in VLAN 125, from both notebooks shown in the figure above.

  1. Only the cable from the PA-200 to SW01 was plugged in.
  2. Both cables were plugged in.
  3. The cable to SW01 was unplugged. 8 ping timeouts.
  4. Both cables were plugged in again. 7 ping timeouts, then traffic OK. BUT: neither switch was accessible via SSH anymore. After I connected to one of the switches via the console and pinged the other switch, the vty connections worked again. (???)
  5. Port Fa0/1 on SW02 was shut down (“shutdown”).
  6. Port Fa0/1 on SW02 was re-enabled (“no shutdown”).
  7. Port Fa0/1 on SW01 was shut down (“shutdown”). 7 ping timeouts. BUT: the SSH connections to both switches were lost once more. No further tests.
  8. “no shutdown” on Fa0/1 of SW01. 7 ping timeouts. The SSH connections to both switches were still gone, and no new connections were possible. (The ping to the Internet still worked!) I reloaded SW02, but SSH connections were still not possible. (???) Then I unplugged the cable from port Fa0/1 on SW01. 8 ping timeouts. NOW both switches were accessible via SSH again.
  9. The cable was plugged in again. 6 ping timeouts. Both switches were NOT accessible via SSH. (???)
  10. Power-cycled both switches. Now both switches were accessible via SSH again.
  11. Setup for a new scenario: both notebooks are connected to port Fa0/8 of their respective switches. Notebook nb02 pings nb01.
  12. Unplugged the cable between both switches (Gi0/1). The ping from nb02 to nb01 stopped working.
  13. Palo Alto: “clear mac trust-server”. The ping from nb02 to nb01 still did not work.
  14. Cable between both switches plugged in again. The ping from nb02 to nb01 worked immediately! (<- this is strange!)
  15. Last scenario: both notebooks ping the default gateway, i.e., the VLAN interface of the Palo Alto firewall.
  16. Unplugged the cable between both switches (Gi0/1). The ping from nb01 to the gateway was still OK, and the ping from nb02 to the gateway timed out but came back after a while. However, the ping from nb02 to nb01 no longer worked! (???)
  17. Cable plugged in again. The ping from nb02 to nb01 worked immediately. After a while, the ping from nb02 to the gateway came back, too.
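The ping timeouts in this list can be turned into a rough outage estimate and compared with classic 802.1D timing. The Python sketch below is a back-of-the-envelope calculation, not a measurement. The client OS used for the external pings is not stated, so the sketch assumes one of two per-timeout costs:

- about 1 s per lost ping (the Linux ping default interval), or
- about 4 s per timeout (the Windows ping default timeout).

With the default timers (forward delay 15 s, max age 20 s), a blocked port needs 2 × 15 = 30 s of listening and learning. After an indirect failure it additionally waits up to 20 s of max age. Together these give the 30 to 50 s range for classic STP mentioned in the comments below.

```python
# Default IEEE 802.1D timers in seconds (Cisco uses the same defaults).
FORWARD_DELAY = 15
MAX_AGE = 20


def stp_convergence(indirect_failure: bool) -> int:
    """Worst-case classic 802.1D time until a blocked port forwards again."""
    transition = 2 * FORWARD_DELAY  # listening + learning, one forward delay each
    # After an indirect failure, the stored BPDU must age out first.
    return transition + (MAX_AGE if indirect_failure else 0)


def outage_from_pings(timeouts: int, seconds_per_timeout: float) -> float:
    """Rough outage length derived from consecutive ping timeouts."""
    return timeouts * seconds_per_timeout


print(stp_convergence(False), stp_convergence(True))  # 30 50

# Assumed cost per timeout: ~1 s (Linux ping default interval) or
# ~4 s (Windows ping default reply timeout). The client OS is not recorded.
for timeouts in (6, 7, 8):
    print(timeouts, outage_from_pings(timeouts, 1.0), outage_from_pings(timeouts, 4.0))
```

At 1 s per timeout, the 6 to 8 timeouts would mean 6 to 8 s of outage. At 4 s per timeout, they would mean 24 to 32 s, which is in the range of the 30 s outages discussed in the comments.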

The following MAC addresses were used in this laboratory:

As an example, here are the two “show spanning-tree” outputs for VLAN 120 from both switches after test #2, in which all cables were plugged in. They show that the second switch put its Fa0/1 port into “Altn BLK” state. (Since some other access ports were also plugged in, the lists of “Edge P2p” interfaces differ.)

SW01:

SW02:

 

Here are the listings after test #16 in which the ping from nb02 to nb01 did not work:

Palo Alto:

SW01:

SW02:

 

Summary of where the MAC addresses of nb01 and nb02 were learned (interface per test and device):

Test_device   nb01           nb02
15_pa         eth1/3.120     eth1/3.120
15_sw01       Fa0/8          Gi0/1
15_sw02       Gi0/1          Fa0/8
16_pa         eth1/3.120     eth1/4.120
16_sw01       Fa0/8          --> n/a <--
16_sw02       --> n/a <--    Fa0/8

That is, after test #16, neither switch had the MAC address of the notebook attached to the other switch in its MAC address table (NOT PRESENT). It should have been learned on “Fa0/1”, because the remote notebook was still reachable through the Palo Alto firewall. I do not know why this happens. Maybe the firewall does not forward all Ethernet frames?
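The reasoning behind this observation can be written down as a small Python check over the summary table. It encodes the expectation stated here: with Gi0/1 cut, each switch should learn the remote notebook on Fa0/1. For every missing entry, it looks up whether the PA-200 had learned that MAC address. The dictionary layout and the function name are illustrative only.

```python
# MAC address locations after test #16, copied from the summary table
# (None = no entry in that device's MAC address table).
observed = {
    ("pa", "nb01"): "eth1/3.120",
    ("pa", "nb02"): "eth1/4.120",
    ("sw01", "nb01"): "Fa0/8",
    ("sw01", "nb02"): None,
    ("sw02", "nb01"): None,
    ("sw02", "nb02"): "Fa0/8",
}

# Expected with Gi0/1 cut: each switch reaches the remote notebook via Fa0/1.
expected = {**observed, ("sw01", "nb02"): "Fa0/1", ("sw02", "nb01"): "Fa0/1"}


def diagnose(obs: dict, exp: dict) -> list[str]:
    """Explain every table entry that differs from the expectation."""
    findings = []
    for (device, host), want in sorted(exp.items()):
        if obs.get((device, host)) == want:
            continue
        seen_on = obs.get(("pa", host))
        if seen_on:
            findings.append(f"{device}: {host} not on {want}, but the PA-200 learned it on {seen_on}")
        else:
            findings.append(f"{device}: {host} not on {want}, and the PA-200 never saw it")
    return findings


for line in diagnose(observed, expected):
    print(line)
```

Both missing entries belong to MAC addresses that the PA-200 did learn, on eth1/4.120 and eth1/3.120 respectively. Frames from both notebooks therefore reached the firewall but apparently were not forwarded out of its other Layer 2 interface. This fits the suspicion that the firewall does not forward all Ethernet frames.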

In summary, I captured many outputs from the two switches and the firewall after each test. If someone is *really* interested in the details, the following zip file contains all outputs. The files are numbered according to the test cases, with the suffix “sw01” and “sw02” for the switches and “pa” for the Palo Alto firewall.


Conclusion

The basic STP tests showed the expected behaviour. For example, the default gateway was always reachable. The switches correctly detected the Layer 2 loop and changed the port states back to “forwarding” once the topology was loop-free. However, in some situations the switches did not learn the real egress interface for some Ethernet frames. At this point I do not know whether this is a configuration mistake on my part or a bug in one of the systems. If anyone has an idea, please leave a comment!

Problem Solved!

After some further investigation and discussions with colleagues, I understood the problem: on the Palo Alto firewall, the Layer 2 subinterfaces also need (Layer 2) security zones and an allow policy that permits intra-zone traffic! Since we are talking about a *real* firewall, this makes sense after all. Now the test lab works, especially the cases in which I pinged from nb02 to nb01, even when the cable between both switches is cut.
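The fix can be illustrated with a simplified Python model of how the firewall decides whether a frame may pass between two Layer 2 interfaces. The model applies three steps in order:

1. An interface without a zone passes no traffic.
2. Explicit rules are evaluated top-down, and the first match wins.
3. Only if no explicit rule matches do the default rules apply: same zone allowed, different zones denied.

The rule names intrazone-default and interzone-default are those shown by newer PAN-OS releases. The zone name l2-vlan120 and the rule names allow-l2-intrazone and deny-any-any are hypothetical. This sketches the decision logic only; real rules also match on addresses, applications and services.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    name: str
    src_zone: str  # "any" matches every zone
    dst_zone: str
    action: str    # "allow" or "deny"

    def matches(self, src: str, dst: str) -> bool:
        return self.src_zone in ("any", src) and self.dst_zone in ("any", dst)


def verdict(src_zone: Optional[str], dst_zone: Optional[str], rules: list[Rule]) -> str:
    """Simplified decision for a frame between two Layer 2 interfaces."""
    if src_zone is None or dst_zone is None:
        return "drop (interface has no zone)"
    for rule in rules:  # explicit rules: top-down, first match wins
        if rule.matches(src_zone, dst_zone):
            return f"{rule.action} ({rule.name})"
    # Default rules are only reached when no explicit rule matched.
    if src_zone == dst_zone:
        return "allow (intrazone-default)"
    return "deny (interzone-default)"


ZONE = "l2-vlan120"  # hypothetical zone for eth1/3.120 and eth1/4.120
deny_all = [Rule("deny-any-any", "any", "any", "deny")]
allow_first = [Rule("allow-l2-intrazone", ZONE, ZONE, "allow")] + deny_all

print(verdict(None, None, []))           # drop (interface has no zone)
print(verdict(ZONE, ZONE, []))           # allow (intrazone-default)
print(verdict(ZONE, ZONE, deny_all))     # deny (deny-any-any)
print(verdict(ZONE, ZONE, allow_first))  # allow (allow-l2-intrazone)
```

The third case shows the point raised in the comments below. An explicit “deny any any” rule is hit before the default intra-zone rule, so an explicit allow rule is then needed above it.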

Here are the new screenshots from the Palo Alto firewall with the correct layer 2 security zones, the intra-zone policy and the pings in the traffic log:

Screenshots: PA-200 Ethernet interfaces (updated), PA-200 zones (updated)

Screenshots: PA-200 security policy, PA-200 traffic log


11 thoughts on “Layer 2 Redundancy with STP: Palo Alto Firewall + Cisco Switches”

  1. Hi,

    Thanks for posting this article. I have built a very similar scenario as a POC for a customer. Things seem to work more or less as expected; however, I am seeing 30 s outages to traffic while RSTP reconverges. While hunting for answers, I came across your blog…
    I am using Meraki MS220 switches, in the same configuration as in your drawings. RSTP should converge much more quickly than I am seeing, so I am wondering if perhaps the PA is causing the delay for some reason…

    Were you able to improve on the outage times you observed? I notice you were using Rapid PVST, and I assume your 7-8 lost pings equate to the same 30 s I am seeing…

    Did you ever put this solution into production? If so, were there any unforeseen issues? I am looking to deploy this setup to 5 international sites, so any reassurances / observations you may have would be appreciated…

    Regards,
    Tim

    1. Hi Tim,

      I have not tuned the timers any further in my test lab. In theory, RSTP should converge within a few seconds (and not in 30-50 seconds like STP). Please try your network configuration without the Palo, i.e., plug the two cables directly into the switches. [Instead of Switch <-> Palo <-> Switch, use Switch <-> Switch.] You should have a loop that is blocked by RSTP. Then run your tests and compare them with the tests with the Palo in between. How do the network outages change?

      Yes, we are using a similar design at the customer’s site. I know that there was an issue with the ARP caches on the layer 2 interfaces on the Palo that was fixed in 5.0.9 or so (I don’t have any further details here, sorry). I think it is running now without any further problems. But I am not fully involved there…

      1. Thanks for the quick response Johannes.

        After I posted, I tried disabling STP on a 3560, and put it in place of the PA. The Meraki switches converged as expected with the triangle topology, discarding on the correct port. They exhibited about the same delay (25 seconds) when I dropped the active uplink to the 3560.

        I have confirmed that RSTP converges very quickly when I have the Meraki switches only, i.e., a loop between two switches.

        The convergence delay is the only issue I have noticed with the network, and the customer is more than happy to accept that 25 s interruption if it means clients on the remaining switches can continue to function if the primary uplink/switch is lost…

        Thanks again,

        Tim

  2. Hi
    I need some help. I applied the same topology in my company, but we are facing random instability in the network, and I often see “incomplete MAC address” in the PA logs. Is this related to an ARP issue, or something else?
    Note: the PA version is 6.1.4, the latest version.

    1. Hi Tamer,

      Of course, I cannot troubleshoot your issue with this little information. However, “incomplete MAC address” might show up when no ARP reply is coming back. So yes, this might be related to your Layer 2 design. Have you tested some packets “from left to right”, similar to my scenario? Do you have the Layer 2 policies in place on the Palo? Are you sure that the requested IP address is really alive (that is, it really answers the ARP request with its MAC address)?

  3. Is there any need for these rules, considering the following?

    There are two default rules on the Palo Alto Networks firewall regarding security policies:
    Deny cross zone traffic
    Allow same zone traffic

    1. The second one (allow same zone traffic) should fit. However, this only works if you DO NOT have an explicit “deny any any” rule before it. ;) Since many IT admins prefer an explicit deny any any rule, the default rules from Palo Alto at the end won’t be hit, because all traffic is already denied.

      What does your log say? You should see denied packets if the security rules are the cause.

  4. Have you tested this on PAN-OS 6.1.1? We attempted to replicate your configuration and found that the connection to the second switch does not get blocked by STP. If we connect the two switches together using the same ports, STP blocks the traffic on the duplicate path as expected. (12.2(44)SE5).

    1. No, the last PAN-OS version I had with this setup was 5.0.x. I have not done any further tests on this in the last months, sorry.
      Have you configured the layer2 policies?

      1. Yes, I’ve tried it with and without layer2 policies. A loop is created in both cases. I have a case open with support and they are trying to reproduce it in a lab. Thanks for your response.

        1. We were able to get this working on both 6.1.1 and 7.0.1. Our problem was a loop only on vlan.1, which was our tagged default VLAN. As soon as we switched our default VLAN to untagged, the loop went away.

          Thanks for the resource and the help.
