I recently upgraded my pfSense appliance to the latest code release (currently 2.4.4-RELEASE (amd64)) and since doing so I’ve had a few strange occasions where my devices are unable to access the Internet or more precisely certain ports are working while others do not. Let me explain further and then explain what I’ve done which appears to have resolved the matter.
So what exactly happened? Well, I would come home and fire up my desktop only to find applications and browsers were unable to access web resources. What was more interesting is that I could do DNS lookups just fine (well, to be fair, pfSense is my resolver and caches records) so I knew it wasn’t a matter of name resolution. Additionally I could see the heartbeat ICMP messages going from my WAN interface to the ISP endpoint so I knew the link wasn’t down. I could also connect to the pfSense web dashboard so I knew packets were reaching that just fine. To help visualise the setup here is a very simple diagram. It excludes a lot of stuff in my home setup but the important aspects for this discussion are included.
I figured I’d just disconnect and reconnect the WAN interface and see if that helped matters and behold, everything worked again! I assumed it was a transient error or some other issue that had been resolved by recycling that interface. Unfortunately the problem came back again, and then today as well.
Today I figured I’d actually spend a little more time looking into what was going on. The great thing about pfSense (there are so many) is that it is very easy to generate packet captures on any interface with various settings and then download them to view in Wireshark.
To create a capture we browse to the Diagnostics tab and then select Packet Capture from the drop down menu.
As you can see there are a number of options and selections we can make. I decided to do a promiscuous mode capture on my WAN interface and then another on my OPT1 interface which is the physical Ethernet port my desktop PC connects to. Both captures were set to a maximum of 100 packets. All other settings remained on their default values.
Once the capture was running I opened a web browser and tried to load https://www.google.co.uk. I did this for both captures. After a few refresh attempts in the browser I stopped the capture and then downloaded it to my machine to view in Wireshark.
First off let’s take a look at the WAN packet capture –
As my ISP provides a static IP I’ve chosen to obfuscate those entries from the screenshot.
What we can see is a number of ICMP heartbeats going between my WAN interface and the ISP so we know that the link is up and packets are traversing it. Additionally we can see that my pfSense appliance is making a DNS request to the Cloudflare 1.1.1.1 service. This request is encrypted and uses port 853 over a TLS session. If you haven’t looked at using Cloudflare for your upstream DNS resolution I highly recommend them, especially with the option now to encrypt your DNS lookups.
We can see that the DNS lookup worked perfectly fine: there is a SYN -> SYN-ACK -> ACK three-way handshake and then we establish the TLS session, exchange certificates etc.
This strongly suggests that traffic originating from pfSense itself is going outbound to the Internet just fine, but for some reason traffic from my PC (and all other internal devices) is failing to forward.
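For anyone curious what that port 853 exchange carries, here is a minimal Python sketch of a DNS-over-TLS lookup. It is not how pfSense's resolver is implemented; it only illustrates the shape of the traffic: a TCP connection, a TLS handshake, and then an ordinary DNS message with a two-byte length prefix. The function names, the default query ID and the timeout are all made up for this example, and the resolver address and certificate name are left as parameters for the caller to supply.

```python
import socket
import ssl
import struct

DOT_PORT = 853  # the DNS-over-TLS port seen in the WAN capture


def build_dot_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build an A-record query with the 2-byte length prefix used on stream transports."""
    # Header: ID, flags (recursion desired), 1 question, 0 answer/authority/additional.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is preceded by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    message = header + question
    return struct.pack("!H", len(message)) + message


def _read_exact(sock, count: int) -> bytes:
    """Read exactly `count` bytes or raise if the peer closes early."""
    data = b""
    while len(data) < count:
        chunk = sock.recv(count - len(data))
        if not chunk:
            raise ConnectionError("connection closed before the full reply arrived")
        data += chunk
    return data


def send_dot_query(server_ip: str, tls_name: str, name: str, timeout: float = 5.0) -> bytes:
    """Send one query over TLS and return the raw DNS reply (without the length prefix).

    server_ip and tls_name are supplied by the caller; tls_name is the name the
    resolver's certificate is validated against.
    """
    context = ssl.create_default_context()
    with socket.create_connection((server_ip, DOT_PORT), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=tls_name) as tls:
            tls.sendall(build_dot_query(name))
            (length,) = struct.unpack("!H", _read_exact(tls, 2))
            return _read_exact(tls, length)
```

The reply comes back as raw bytes; decoding the answer records is left out to keep the sketch short.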
Let’s take a look at the OPT1 interface packet capture and see what that shows us –
Hmm this is interesting – I’m getting ‘Time-to-live exceeded’ (TTL) messages and I can see the SYN packets leaving my machine with a destination of Google but we don’t ever complete the handshake. What is even more strange is that the TTL packet expiry messages are coming from 192.168.1.254 – my pfSense OPT1 interface is 192.168.2.1.
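If you would rather check for this programmatically than eyeball it in Wireshark, the test is simple enough to sketch in Python. The function below is a hypothetical helper of my own, not part of pfSense or Wireshark. It assumes you hand it a raw IPv4 packet starting at the IP header (so any Ethernet framing from the capture file has already been stripped) and it reports who sent an ICMP Time Exceeded message (type 11).

```python
import socket

ICMP_PROTOCOL = 1        # value of the IPv4 protocol field for ICMP
ICMP_TIME_EXCEEDED = 11  # ICMP type for "Time-to-live exceeded"


def time_exceeded_sender(ip_packet: bytes):
    """Return the source IP if this is an ICMP Time Exceeded packet, else None."""
    if len(ip_packet) < 20:
        return None  # too short to hold an IPv4 header
    if ip_packet[0] >> 4 != 4:
        return None  # not IPv4
    header_len = (ip_packet[0] & 0x0F) * 4  # IHL is counted in 32-bit words
    if header_len < 20 or len(ip_packet) <= header_len:
        return None  # malformed header or no payload
    if ip_packet[9] != ICMP_PROTOCOL:
        return None  # some other protocol (TCP, UDP, ...)
    if ip_packet[header_len] != ICMP_TIME_EXCEEDED:
        return None  # ICMP, but a different message type
    return socket.inet_ntoa(ip_packet[12:16])  # bytes 12-15 hold the source address
```

Any sender other than the interface you expect to be your gateway is worth a closer look.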
The IP 192.168.1.254 is actually a layer 3 switch used for my home lab and pfSense has a static route for it along with some basic routing/firewall rules in place. I’ve never had any problems with this setup but it seems like for some reason my traffic is going via that device. This is a problem: if pfSense is defaulting to the L3 switch and that switch is configured to use pfSense as its default gateway, we will simply find ourselves in a routing loop. The packets will bounce between the two devices until their TTL runs out, my machine will receive a TTL expired message and I’ll not be able to watch videos of cats or quantum physics. Indeed this is a terrible situation and many tears of lamentation were wept.
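To make the loop concrete, here is a toy Python simulation. It is only a model under my own assumptions: the router names ("pfsense", "l3-switch", "isp") are labels I made up, each router has a single default next hop, and every hop decrements the TTL by one and reports Time Exceeded when it reaches zero.

```python
def trace_packet(next_hop, first_router, ttl=64):
    """Follow a packet hop by hop and return (outcome, router, hops_forwarded).

    next_hop maps a router name to the router its default route points at.
    A router missing from the map is treated as being outside our network.
    """
    router = first_router
    hops = 0
    while True:
        ttl -= 1  # every router decrements the TTL before forwarding
        if ttl <= 0:
            # This router drops the packet and sends ICMP Time Exceeded back.
            return ("ttl_exceeded", router, hops)
        following = next_hop.get(router)
        if following is None:
            # No entry in our toy table: the packet has left the modelled network.
            return ("forwarded_upstream", router, hops)
        router = following
        hops += 1


# Suspected broken state: each device uses the other as its default gateway.
looping = {"pfsense": "l3-switch", "l3-switch": "pfsense"}

# Intended state: pfSense defaults to the WAN, the switch defaults to pfSense.
intended = {"pfsense": "isp", "l3-switch": "pfsense"}

print(trace_packet(looping, "pfsense"))
print(trace_packet(intended, "pfsense"))
```

In the looping table the packet never leaves the two devices, and with an even starting TTL such as 64 it is the switch that ends up holding the packet when the TTL hits zero, which would fit the Time Exceeded messages arriving from 192.168.1.254.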
OK so now that I think I know what the problem is we have to figure out what has changed recently and whether that change could have some bearing on this situation.
As I mentioned at the start I deployed version 2.4.4 of pfSense and it has some changes regarding gateways –
- Default Gateway Group: The default gateway may now be configured using a Gateway Group setup for failover, which replaces Default Gateway Switching.
I went into the system routing options which can be found under the ‘System’ dropdown menu. Here we see my two gateways, one for the WAN and the second for my lab layer 3 switch.
This is really interesting. There was no ‘Default gateway’ box and drop down previously on this page, so this is definitely new and aligns with the 2.4.4 update notes. I decided that instead of leaving this set to automatic I would change it to specify my WAN gateway as the default gateway.
Immediately after making the change my Internet traffic flow returned to normal and my apps and browsers were happy once more.
My intention is to see how things go over the next few weeks and if the problem ceases to rear its ugly head I shall be happy that this was indeed the cause of my problem. If I feel like it I could of course revert everything back to the default values following the 2.4.4 update and see if the issue returns, which would certainly lend weight to my argument.
If you have come across a similar situation then perhaps this post will be of use to you, otherwise it was an interesting way to spend some of my evening!