The Resilient Internet Connection Caper

I first met the internet in the nineties when my dad bought a 14.4k modem. Since then I have never lived anywhere that did not have some form of connection; in fact I got so carried away writing this intro paragraph that it turned into its very own post.

Nowadays my connection has to service two of us working from home, all manner of IoT gadgets, and endless media streaming from more devices than I can count - it's in use almost 24/7, so much so that I have to carefully schedule outages for upgrades on days when people are in the office/at school. This reliance on the internet has allowed me to justify what has turned into a slightly over-engineered solution for resilient internet.

I should start by saying that if you need truly resilient internet, e.g. you're running a business where downtime costs you real money, then there are plenty of suppliers who can sell you resilient connections and you should explore those. My write-up here simply explains how I've leveraged a particular backup option available from my particular ISP; it's not necessarily a how-to for you.

ISP: Andrews & Arnold

For years my ISP has been the excellent Andrews & Arnold. A&A is a fantastic company offering top-quality domestic and business internet connections, VoIP services and more. They offer "internet for techies": static IPv4 address(es), IPv6 as standard, technical support that is second to none, staff you can talk to on IRC, detailed control pages where you can monitor and adjust your services; the list goes on. All their products are exactly what I want from an ISP. There are no undeliverable promises and no marketing jargon in sight. What they are not is a cheap mass-market ISP, nor do they claim to be. Excellent technical service comes at a little extra cost, but for me it's absolutely worth it.

I use their Home::1 package delivered over an OpenReach Fibre to the Premises (FTTP) connection. For the hell of it (also because impatience) I have the highest speed service (1000/115Mbps down/up) available and my household usage is such that I just about need the higher 10T/month quota. Yes, that's right, there's a quota - it's not an "unlimited" package. But, I never come close to hitting 10T, so who cares; it's effectively unlimited for my needs.

Another cool service offered by A&A is L2TP. This lets customers use A&A's services/internet connection from any other ISP by connecting over an L2TP tunnel. What's even better is that they offer a lower-speed L2TP connection as a backup for all their broadband customers. If the primary connection breaks, simply swap traffic to a different connection, dial up the L2TP tunnel and presto - you're back online with the same IP addresses as you had before. Without the L2TP tunnel you'd have a different IP address on the internet when using your secondary connection. This may be fine for web browsing, but it wouldn't work when you're making use of your A&A connection's static public IP addresses (e.g. to self-host servers) because those addresses belong to A&A and would not be available via that secondary ISP.

Two Connections

So, I have two internet connections:

Connection	Supplier	Technology	Speed (down/up, Mbps)
Primary	AAISP	FTTP (OpenReach)	1000/115
Secondary	Plusnet	FTTC (OpenReach)	40/10

The secondary connection could be with any ISP; I picked Plusnet as one of the slightly better mass-market (cheap) ISPs. It's a LOT slower than the primary connection, but I survived for years with 'only' 52Mbps, so 40Mbps is fine as a backup and I'm looking to spend as little on this connection as possible.

ISP Diversity

I also needed to avoid using A&A for the secondary connection. While they're excellent (I may have mentioned that before), things can very occasionally go wrong in their core network and while, of course, they fix this as quickly as they can, I wanted to be able to guard against a longer outage by using a totally independent ISP for the secondary connection.

Physical Diversity

Both of my connections are cabled, but due to installation happenstance, OpenReach's cables for FTTP and FTTC run opposite ways along my street. So, apart from a 20m section of shared duct in the middle, I even have physical separation between my house and the exchange. This is very lucky; ensuring this sort of separation is usually only possible at enterprise-contract levels. Had it not been possible with OpenReach, I would have looked to an alternative network supplier for the secondary connection, possibly CityFibre or perhaps a 4G/5G mobile data connection.

Resilient Internet Requirements

With two internet connections to my house and A&A's L2TP backup service, my resilient connection is starting to take shape.

I set myself a target of two main goals:

Automatic failover from the primary connection to the backup tunnel running over the secondary connection (and the restoration thereof)
Secondary connection available "directly" (i.e. without the tunnel) as a 'rip cord' solution for the very rare cases when A&A is broken at the ISP level. Or for when I want to be on the internet from a different ISP - perhaps to check connectivity from the wider internet back into A&A.

L2TP Backup

A&A configures their systems such that when the L2TP client is connected it 'steals' the routing away from the usual PPP connection; the L2TP connection takes precedence. This means that the L2TP client must remain disconnected in regular operation and only connected when the primary connection breaks.

Public IP Subnets

IPv4

Internet connections are usually allocated a single IPv4 address by their ISP. This, single, IPv4 address is* your only address on the public internet. To let you use multiple devices with your internet connection, your router performs a trick known as Source Network Address Translation (S-NAT, or IP masquerading) to make multiple devices on your internal network look, to the wider internet, as if they're all using this singular external IP address. Clever.

* Usually. Technologies such as Carrier Grade NAT (CGNAT) render this untrue, but none of my connections are subjected to CGNAT so I can fortunately disregard it.

The A&A internet connection follows this design and my router's PPPoE client is allocated a single external IPv4 address when the PPP connection connects.

Now things start to get different: A&A has very nicely allocated me an additional block of public IPv4 addresses. This small block, a /29, gives me a handful of public IP addresses for my internal network that can be routed to A&A (and thence to the internet) through the PPP connection without needing any S-NAT. It means I can run servers, or multiple, totally independent, networks at home with each one having a public IP address of its own.

IPv6

IPv6 is perhaps simpler: my router's PPPoE client is allocated a link-local IPv6 address which is not routable on the public internet. My internal network is then assigned an IPv6 /64 subnet from the /48 block assigned to me by A&A and the router simply routes between this /64 subnet and the internet at the other end of the PPP connection.

Two Routers

To keep my network 'simple' (don't laugh), I don't make any use of the single IPv4 address assigned by A&A. Instead I use my /29 'internal' IPv4 subnet and one of my IPv6 subnets as my 'internet presence', using one router to present these to a switch in my comms rack. This router is dubbed the Internet Failover Router (rtr-ifr) and it is solely concerned with PPPoE connections and failovers.

Connected to the 'internet presence' switch is then a second router, performing a more traditional job of offering local network services such as NAT, firewalling and serving IP addresses via DHCP. I've dubbed this the Domestic Router (rtr-dom).

Router	External side	Internal side
"Internet Failover" rtr-ifr	PPP connections to ISPs	Public IPv4 /29 subnet; IPv6 /64 subnet
"Domestic" rtr-dom	Public IPv4 address in the /29 subnet; IPv6 address in the /64 subnet	Private (RFC1918) IPv4 addressing; further IPv6 /64 subnets from my /48 allocation

And in diagram form:

Internal IP Subnets

In IPv4, rtr-dom simply NATs the private RFC1918 addresses on my internal VLANs to the single public IPv4 address of its external interface. I use addresses from the 172.16.0.0/16 supernet, assigning a /24 per-subnet/VLAN. I've got a half-dozen or so subnets to separate devices out.

In IPv6 there is no NAT. A&A assigns me a /48 block of IPv6 addresses and I pick /64 subnets from this block to use internally. The /64 subnet used for the 'internet presence' subnet (the internal side of rtr-ifr), and the further /64 subnets used for my internal subnets/VLANs (internal side of rtr-dom) all come from this block. In the incoming direction, static routes are needed on rtr-ifr to route rtr-dom's internal IPv6 subnets via rtr-dom's external IPv6 address. In the outgoing direction, rtr-dom uses rtr-ifr's IPv6 address on the 'internet presence' subnet as its IPv6 route gateway.
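As a rough illustration of picking /64 subnets out of a /48 allocation, here is a minimal Python sketch. The prefix is the documentation range used elsewhere in this post, and the helper name `nth_subnet64` and the subnet index are assumptions made purely for the example.

```python
import ipaddress

# Documentation prefix standing in for the real /48 allocation (assumption).
ALLOCATION = ipaddress.ip_network("2001:db8:1892::/48")


def nth_subnet64(allocation: ipaddress.IPv6Network, index: int) -> ipaddress.IPv6Network:
    """Return the index-th /64 inside the allocation, counting from zero."""
    count = 2 ** (64 - allocation.prefixlen)  # how many /64s fit in the block
    if not 0 <= index < count:
        raise ValueError(f"index must be in 0..{count - 1}")
    base = int(allocation.network_address)
    # The /64 index occupies the bits just above the 64-bit interface ID.
    return ipaddress.ip_network((base + (index << 64), 64))


if __name__ == "__main__":
    print(2 ** (64 - ALLOCATION.prefixlen))  # number of /64s available
    print(nth_subnet64(ALLOCATION, 0x4001))  # hypothetical 'internet presence' subnet
```

A /48 holds 65,536 separate /64 subnets, which is why handing one to each VLAN costs nothing in practice.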

More than Two Routers

The 'internet presence' switch is actually connected to more than just rtr-dom. I have a couple of server VMs and additional routers all of which benefit from having a full public IP address (both IPv4 and IPv6) of their own. But none of this is really relevant to the resilient internet connection itself.

Owning the Hardware

I am a fan of Mikrotik routers. I have too many to count already but the needs of my new fibre (gigabit) internet connection meant I went shopping for a new model that was fast enough to handle the new speed with ease, and allow for some further expansion.

Mikrotik's RB5009 is the perfect router for me. Fanless and small, yet packs a 2.5Gbps port (which I'll use for the external connections) and seven 1Gbps ports (for secondary ISP modem and/or internal connections). Mikrotik routers can be quirky around Ethernet ports and switching but the RB5009 does away with most of this by having all ports serviced by a single switch chip, AND that chip has a 10Gbps lane to the CPU. In short, the hardware is perfectly suited to a modern 1Gbps internet connection.

Block diagram from Mikrotik's website:

I bought two of these and mounted them in a 1U slot in the rack using the rackmount kit. rtr-ifr is the lower of the pair. rtr-dom is the upper.

A photo showing a pair of Mikrotik RB5009 routers mounted in a bay, with cables sprouting liberally from the front. A joy to behold.

RouterOS / Mikrotik Configuration

Routing Tables

Using this Mikrotik help page as inspiration, I am using different routing tables and rules to control the internet traffic: to direct it to the main or backup internet connections or to force it to always exit via a specific connection.

Add routing tables:

/routing table
add disabled=no fib name=wan2-force
add disabled=no fib name=wan1-force

Not shown is the (always-present) main routing table.

PPPoE & L2TP Connections

Configure PPPoE connections for the two ISPs:

/interface pppoe-client
add allow=chap,mschap2 comment="AAISP via FTTP" disabled=no \
    interface=ether1 max-mru=1500 max-mtu=1500 name=pppoe-wan1 \
    password=******** user=********
add comment=Plusnet disabled=no \
    interface=ether2 max-mru=1492 max-mtu=1492 name=pppoe-wan2 \
    password=******** user=********

Note that the A&A connection makes use of baby jumbo frames: I can configure an MTU of 1508 on the underlying Ethernet interface, which leaves room for the 8 bytes of PPPoE overhead and means the PPPoE connection can carry full 1500-byte packets.
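The MTU figures in the two PPPoE clients follow from simple arithmetic: PPPoE adds 8 bytes (a 6-byte PPPoE header plus a 2-byte PPP protocol field) inside each Ethernet frame. A small Python sketch of that sum, with a function name invented for the example:

```python
PPPOE_OVERHEAD = 8  # 6-byte PPPoE header + 2-byte PPP protocol field


def pppoe_ip_mtu(ethernet_mtu: int) -> int:
    """Largest IP packet that fits in one PPPoE frame for a given Ethernet MTU."""
    return ethernet_mtu - PPPOE_OVERHEAD


if __name__ == "__main__":
    print(pppoe_ip_mtu(1500))  # standard Ethernet, as on the Plusnet line
    print(pppoe_ip_mtu(1508))  # baby jumbo frames, as on the A&A line
```

This gives 1492 for a standard 1500-byte Ethernet MTU and 1500 when the link carries 1508-byte frames, matching the max-mtu values configured above.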

Configure the specific-ISP routing tables with the respective PPP connections as the gateways:

/ip route
add comment="Gateway for WAN2 non-tunnelled traffic" \
    dst-address=0.0.0.0/0 gateway=pppoe-wan2 routing-table=wan2-force
add comment="Gateway for WAN1 forced traffic" \
    dst-address=0.0.0.0/0 gateway=pppoe-wan1 routing-table=wan1-force

Add the L2TP client for the A&A backup connection:

/interface l2tp-client
add allow=chap,mschap2 comment="AAISP via L2TP" \
    connect-to=194.4.172.12 max-mru=1454 max-mtu=1454 \
    name=l2tp-wan2 password=******** profile=default \
    user=********

This uses the same credentials as the PPPoE connection.

The L2TP client connects to A&A's server by IP address. This isn't totally brilliant because A&A could change the IP address, however I need to specify it by IP because I need to force traffic to this IP out through the secondary connection and RouterOS only lets me do that by IP. Additionally, ensuring DNS works if the primary connection has failed would require some fiddling. The IP address is detailed on this page.

Force the L2TP server traffic to exit via the secondary connection:

/routing rule
add action=lookup-only-in-table disabled=no \
    dst-address=194.4.172.12/32 table=wan2-force

IPv6 Routes

All IPv6 is sent to A&A; I don't think Plusnet even does IPv6, but in any case all my IPv6 subnets are A&A's addresses so could not be routed over a third-party ISP. So I don't bother doing any IPv6 configuration in the force-connection routing tables.

Internal Connection Setup

This is bog-standard RouterOS setup stuff; create a bridge interface to handle the internal subnet and assign the Ethernet ports to it. Give it the appropriate IP addresses, the first addresses in the public IP subnets concerned. Port ether1 is the primary internet connection and port ether2 is the secondary. All other ports are put into the bridge as a group of switch ports.

/interface bridge
add comment="Client devices" name=bridge1 vlan-filtering=yes

/interface bridge port
add bridge=bridge1 interface=ether3
add bridge=bridge1 interface=ether4
add bridge=bridge1 interface=ether5
add bridge=bridge1 interface=ether6
add bridge=bridge1 interface=ether7
add bridge=bridge1 interface=ether8

/ip address
add address=198.51.100.1/29 interface=bridge1

/ipv6 address
add address=2001:db8:1892:4001::1/64 interface=bridge1

Note: the IP addresses shown are for documentation purposes.
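To double-check which address counts as "the first address" in a subnet, Python's standard ipaddress module can do the sum. This is only a sketch using the documentation prefixes from this post; `first_host` is a name made up for the example.

```python
import ipaddress


def first_host(cidr: str) -> str:
    """Return the first usable host address of a subnet, with its prefix length."""
    net = ipaddress.ip_network(cidr)
    first = next(net.hosts())  # skips the network (or subnet-router anycast) address
    return f"{first}/{net.prefixlen}"


if __name__ == "__main__":
    print(first_host("198.51.100.0/29"))
    print(first_host("2001:db8:1892:4001::/64"))
```

For the /29 this yields 198.51.100.1/29, and for the /64 it yields 2001:db8:1892:4001::1/64; the .0 address of the /29 is the network address and is not assignable to an interface.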

Failover Detection

To detect whether the primary internet connection has failed, I use the Netwatch tool to ping a known-good host once every 15 seconds. When the pings fail, I deduce the internet connection has failed and the Netwatch tool runs a script to adjust the IPv4 and IPv6 gateways of the main routing table. The script also enables the L2TP client which triggers it to dial up the A&A backup connection.

Configure the gateways (the backup gateway, via the L2TP client, is disabled):

/ip route
add comment="Gateway via FAILOVER LINK" disabled=yes \
    distance=1 dst-address=0.0.0.0/0 gateway=l2tp-wan2 \
    routing-table=main
add comment="Gateway via MAIN LINK" disabled=no \
    distance=1 dst-address=0.0.0.0/0 gateway=pppoe-wan1 \
    routing-table=main
    
/ipv6 route
add comment="Gateway via FAILOVER LINK" disabled=yes \
    distance=1 dst-address=::/0 gateway=l2tp-wan2 \
    routing-table=main
add comment="Gateway via MAIN LINK" disabled=no \
    distance=1 dst-address=::/0 gateway=pppoe-wan1 \
    routing-table=main

Add the Netwatch:

/tool netwatch
add comment="Gateway check to initiate failover" disabled=no \
    down-script="/system script run switch-to-failover" \
    host=81.187.81.187 interval=15s start-delay=10s \
    thr-loss-percent=80% type=icmp \
    up-script="/system script run switch-to-primary"

I have configured the ICMP loss percentage at 80% to cause the failover to trigger if 'most' packets are being lost. I am also relying on the connection being fast enough that I'm not going to be able to saturate it (except perhaps in bursts) to the point pings are lost. If I do run into problems with Netwatch triggering when I'm hammering the connection, I will need to add some traffic prioritisation on the ICMP.
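The behaviour I'm relying on can be modelled in a few lines of Python. This is a sketch of the idea only, not of Netwatch's internals: the class and function names are invented, and it assumes that loss at or above the threshold counts as "down" and that scripts fire only when the state changes.

```python
from typing import Callable, Iterable


def loss_percent(replies: Iterable[bool]) -> float:
    """Percentage of probes in one polling round that went unanswered."""
    results = list(replies)
    if not results:
        raise ValueError("need at least one probe result")
    lost = sum(1 for ok in results if not ok)
    return 100.0 * lost / len(results)


class FailoverMonitor:
    """Tracks link state and fires a callback only on up/down transitions."""

    def __init__(self, threshold: float,
                 on_down: Callable[[], None], on_up: Callable[[], None]) -> None:
        self.threshold = threshold
        self.on_down = on_down
        self.on_up = on_up
        self.is_up = True  # assume the primary link starts healthy

    def poll(self, replies: Iterable[bool]) -> bool:
        """Feed in one round of probe results; returns the resulting state."""
        healthy = loss_percent(replies) < self.threshold
        if healthy != self.is_up:
            self.is_up = healthy
            (self.on_up if healthy else self.on_down)()
        return self.is_up


if __name__ == "__main__":
    monitor = FailoverMonitor(80.0,
                              on_down=lambda: print("switch-to-failover"),
                              on_up=lambda: print("switch-to-primary"))
    monitor.poll([True] * 10)            # healthy, nothing happens
    monitor.poll([False] * 9 + [True])   # 90% loss, fails over
    monitor.poll([True] * 10)            # recovered, fails back
```

With an 80% threshold a round that loses 7 probes in 10 is still treated as healthy, which is the "most packets lost" behaviour described above.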

The scripts are based on ones I found online (can't find the source now, I'll update this if/when I do). As I have Netwatch doing the heavy lifting of pinging and checking, all the script needs to do is to enable (or disable) gateways and interfaces.

Here's the script to switch from MAIN to BACKUP connections. The script that switches from BACKUP to MAIN connections is identical but the enable/disables are reversed:

:local intWan1 "pppoe-wan1"
:local intWan2 "l2tp-wan2"
:local intL2tp "l2tp-wan2"

:log info "Switching to FAILOVER WAN"

# Find the routes we're modifying and log the results, just in case our
# FIND matches more than one

:local hWan1 [/ip route find gateway=$intWan1 \
    dst-address=0.0.0.0/0 routing-table=main]
:local h6Wan1 [/ipv6 route find gateway=$intWan1 \
    dst-address=::/0 routing-table=main]
:local hWan2 [/ip route find gateway=$intWan2 \
    dst-address=0.0.0.0/0 routing-table=main]
:local h6Wan2 [/ipv6 route find gateway=$intWan2 \
    dst-address=::/0 routing-table=main]
:local hL [/interface l2tp-client find name=$intL2tp]

:log info ("WAN1 v4 route handle = " . $hWan1)
:log info ("WAN1 v6 route handle = " . $h6Wan1)
:log info ("WAN2 v4 route handle = " . $hWan2)
:log info ("WAN2 v6 route handle = " . $h6Wan2)
:log info ("L2TP interface handle = " . $hL)

# Disable WAN1 IPv4 route
:if ([/ip route get $hWan1 disabled] = false) do={
    :log info "Disabling WAN1 IPv4 route"
    :do {
        /ip route disable $hWan1
    } on-error={
        :log info "Error disabling WAN1 IPv4 route"
    }
}

# Disable WAN1 IPv6 route
:if ([/ipv6 route get $h6Wan1 disabled] = false) do={
    :log info "Disabling WAN1 IPv6 route"
    :do {
        /ipv6 route disable $h6Wan1
    } on-error={
        :log info "Error disabling WAN1 IPv6 route"
    }
}

# Enable WAN2 IPv4 route
:if ([/ip route get $hWan2 disabled] = true) do={
    :log info "Enabling WAN2 IPv4 route"
    :do {
        /ip route enable $hWan2
    } on-error={
        :log info "Error enabling WAN2 IPv4 route"
    }
}

# Enable WAN2 IPv6 route
:if ([/ipv6 route get $h6Wan2 disabled] = true) do={
    :log info "Enabling WAN2 IPv6 route"
    :do {
        /ipv6 route enable $h6Wan2
    } on-error={
        :log info "Error enabling WAN2 IPv6 route"
    }
}

# Enable the L2TP client
:if ([/interface l2tp-client get $hL disabled] = true) do={
    :log info "Enabling L2TP Client"
    :do {
        /interface l2tp-client set $hL disabled=no
    } on-error={
        :log info "Error enabling L2TP Client"
    }
}

:log info "Finished switching to FAILOVER WAN"

The "known-good" IP address that is being pinged by Netwatch is A&A's PPP endpoint for DSL customers, and they recommend it as the target for checks like this (details on this page). To ensure we're not accidentally checking via the L2TP tunnel in a failover situation, also force traffic to this address out of the primary connection:

/routing rule
add action=lookup-only-in-table disabled=no \
    dst-address=81.187.81.187/32 table=wan1-force
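The effect of the two destination-based routing rules can be pictured as a small lookup. The Python sketch below is a simplified model rather than how RouterOS implements it: it only considers destination addresses, and `select_table` is a name invented for the example.

```python
import ipaddress

# (destination network, routing table) pairs mirroring the two rules in this post.
RULES = [
    (ipaddress.ip_network("194.4.172.12/32"), "wan2-force"),   # A&A L2TP server
    (ipaddress.ip_network("81.187.81.187/32"), "wan1-force"),  # Netwatch target
]


def select_table(dst: str, rules=RULES) -> str:
    """Return the routing table used for a destination; first matching rule wins."""
    addr = ipaddress.ip_address(dst)
    for network, table in rules:
        if addr in network:
            return table
    return "main"  # no rule matched, so the main table decides


if __name__ == "__main__":
    for dst in ("194.4.172.12", "81.187.81.187", "198.51.100.9"):
        print(dst, "->", select_table(dst))
```

Only the two pinned addresses bypass the main table; everything else follows whichever default gateway the failover scripts have enabled there.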

Failover Conclusion

I've used routing tables and rules to force traffic to a pair of important IP addresses out through specific internet connections. I've used the Netwatch tool to ping a known-good IP address out of the primary internet connection, executing scripts to fail the default gateway of the main routing table between the two connections and enable/disable the backup L2TP client.

Having run this for a few months it definitely works. It is perhaps slightly trigger happy, inasmuch as a momentary glitch causing a few seconds' loss may trigger the failover. But if this happens, the next poll 15s later restores service and practically this doesn't cause a problem.

Job: Done. Or, is it..?

Connection Status Monitoring

I've set up Home Assistant to monitor some key statistics about the internet connections:

In/out bytes for both internet connections
Status of the PPPoE and L2TP clients; this lets me see when the primary client has failed and the L2TP client has been enabled - i.e. the failover is active
Remaining A&A monthly quota
A&A connection speed

Bytes & Interface Status

Monitored by SNMP. There is probably an SNMP poller integration for Home Assistant but I chose to write my own, which polls the router and reports status to MQTT. The byte counters are only 32-bit, so on a gigabit connection running at full tilt they will wrap around in a little over half a minute (2^32 bytes at 125MB/s is roughly 34 seconds); the raw value is therefore processed by a Home Assistant Utility Meter integration, which handles wrapping counters.
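For anyone curious about the wrap-around handling, here is a minimal Python sketch of the arithmetic. It is not my poller or the Utility Meter code; the function names are invented, and it assumes the counter wraps at most once between two polls.

```python
COUNTER_MODULUS = 2 ** 32  # a 32-bit counter rolls over to zero at this value


def counter_delta(previous: int, current: int, modulus: int = COUNTER_MODULUS) -> int:
    """Bytes transferred between two readings, allowing for a single wrap."""
    return (current - previous) % modulus


def seconds_to_wrap(bits_per_second: float, modulus: int = COUNTER_MODULUS) -> float:
    """Time for a byte counter to wrap at a sustained line rate."""
    return modulus / (bits_per_second / 8)


if __name__ == "__main__":
    print(counter_delta(4_294_967_000, 704))  # reading taken just after a wrap
    print(round(seconds_to_wrap(1e9), 1))     # sustained 1Gbps
```

The first call gives a delta of 1000 bytes even though the raw counter went backwards, and the second shows that a sustained 1Gbps fills a 32-bit byte counter in roughly 34 seconds, so the polling interval matters at high load.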

A&A Quota

The remaining quota on my A&A line is available as a JSON response from a simple URL which is easily queried by the Home Assistant Rest sensor:

- platform: rest
  unique_id: aaisp_quota.gb
  scan_interval: 300
  resource: http://quota.aa.net.uk
  name: AAISP Quota
  value_template: "{{ value_json.quota_remaining_gb }}"
  device_class: data_size
  unit_of_measurement: GB
  headers:
    Accept: application/json
    User-Agent: Home Assistant REST sensor

A&A Connection Speed

A&A limit the downstream rate of a connection (TX, as they see it), either to cap it at the speed of the service you've purchased or to back the line off slightly to optimise traffic under heavy load. When I connect PPPoE via the FTTP connection, my line is limited to 974,726,200 bit/s, which is 1Gbps less the configured backoff on my line (I can tweak this in the control panel). When I connect via the backup L2TP connection, my line is limited to 100,000,000 bit/s because L2TP backup is a 100Mbps service.

The line rate is set according to the type of the most recent connection that A&A has received.

When my FTTP line fails, therefore, the L2TP is connected and the line is limited to 100M. When the FTTP line recovers, I am relying on it re-connecting to reset the line limit. If there was some brief glitch that stopped traffic flow but didn't cause a PPPoE re-connect, the line would be limited to 100M by the L2TP connection but would not then be reset to ~1G. Things would be slow!

Monitoring the line rate means I can detect this situation. In practice, real-world failovers never seem to result in this situation, which is useful. If they did, I'd need to adjust the failover scripts to bounce the PPPoE connection upon line restoration.
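The check itself is trivial. As a sketch in Python (not something I actually run; the function name is made up, and it assumes the monitoring already knows whether the primary PPPoE link is the active path):

```python
L2TP_BACKUP_RATE = 100_000_000  # bit/s, the limit applied to the L2TP backup service


def line_rate_stuck(primary_active: bool, tx_rate: int,
                    backup_rate: int = L2TP_BACKUP_RATE) -> bool:
    """True when traffic is back on the primary link but the limit is still the backup's."""
    return primary_active and tx_rate <= backup_rate


if __name__ == "__main__":
    print(line_rate_stuck(True, 974_726_200))   # normal operation
    print(line_rate_stuck(False, 100_000_000))  # failover active, 100M is expected
    print(line_rate_stuck(True, 100_000_000))   # primary restored but limit not reset
```

Only the last case is the problem scenario: the primary link is carrying traffic again but A&A is still shaping it to the 100M backup rate.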

The connection speed value is available from A&A's line management interface Chaos. I use another Home Assistant Rest sensor to pull the value in JSON every 5 minutes:

- platform: rest
  unique_id: aaisp_line_tx.gb
  scan_interval: 300
  resource: https://chaos2.aa.net.uk/broadband/info?control_login=********&control_password=********&service=********
  name: AAISP TX Rate
  value_template: "{{ value_json.info[0].tx_rate_adjusted }}"
  device_class: data_rate
  unit_of_measurement: "bit/s"
  headers:
    Accept: application/json
    User-Agent: Home Assistant REST sensor

Connection Alerts

I originally intended to set up some Home Assistant alerts to warn me when the connection fails over, or when the line rate has not been correctly reset. But, in practice, I didn't bother. The failover seems to work so seamlessly I don't notice it and I get an email from A&A when the line drops, pleasingly reporting an outage of 1 second, and the line rate doesn't seem to get stuck.

Nice.

Secondary Connection 'Rip Cord'

I mentioned at the start of this writeup that one of my requirements was to make the secondary internet connection available "directly" (i.e. without the L2TP tunnel) as a 'rip cord' solution for the very rare cases when A&A is broken at the ISP level.

The secondary internet connection's external IP is assigned to the PPPoE interface on rtr-ifr, but the clients for this connection are on the internal side of rtr-dom in the 172.17.1.0/24 subnet. This needs a bit of plumbing between the routers:

The S-NAT (IPv4 only; there's no IPv6 on the secondary connection) must happen on rtr-ifr. I use the 172.17.0.0/16 supernet for all the 'rip cord' subnets:

/ip firewall nat
add action=masquerade chain=srcnat \
    comment="Masquerade (WAN2 local network)" \
    out-interface=pppoe-wan2 src-address=172.17.0.0/16

To connect the NAT routing of rtr-ifr to the 'rip cord' devices on the internal side of rtr-dom I use a VLAN on the internal switch fabric of rtr-ifr and a /30 point-to-point link. I create the VLAN interface on the bridge and configure port ether3 (connected to rtr-dom) to carry tagged traffic. Then I assign a /30 IPv4 address from the 'rip cord' supernet:

/interface vlan
add comment="PtP VLAN for rip-cord clients via rtr-dom" \
    interface=bridge1 name=vlan73 vlan-id=73

/interface bridge vlan
add bridge=bridge1 \
    comment="PtP VLAN for rip-cord clients via rtr-dom" \
    tagged=bridge1,ether3 vlan-ids=73

/ip address
add address=172.17.255.1/30 interface=vlan73

I add routing rules to ensure all traffic for the 'rip cord' subnets is routed by the secondary internet connection's routing table (so it never tries to leave via the primary connection) and a static route tells rtr-ifr where to direct traffic to the subnet on the other side of rtr-dom. I also needed to add an IP route to ensure traffic to the /30 point-to-point link uses the correct routing table, because by default connected routes appear only in the main table.

/routing rule
add action=lookup-only-in-table comment=\
    "Force WAN2-only traffic into the relevant table" disabled=no \
    src-address=172.17.0.0/16 table=wan2-force
add action=lookup-only-in-table disabled=no \
    dst-address=172.17.0.0/16 table=wan2-force
    
/ip route
add disabled=no distance=1 dst-address=172.17.1.0/24 \
    gateway=172.17.255.2 routing-table=wan2-force 
add disabled=no distance=1 dst-address=172.17.255.0/30 \
    gateway=vlan73 routing-table=wan2-force

rtr-dom then routes between the internal 'rip cord' VLAN's private IP /24 subnet and the point-to-point link back to rtr-ifr, again using a dedicated routing table. For brevity (this post is long enough) I haven't included that configuration here, but it's just a couple of routing rules.

rtr-dom also provides DHCP services for the clients in the 'rip cord' subnet. rtr-dom's DNS server is A&A's, but the point of having the 'rip cord' subnet is for cases when A&A is dead. So, to ensure DNS remains working, the DHCP server tells clients on the 'rip cord' subnet to use a public DNS server which will always be contactable through the secondary internet connection.

Activating the 'Rip Cord'

To put the 'rip cord' subnet into action all I need to do is to connect a laptop/other device to a switch port on my internal infrastructure that's placed in the 'rip cord' VLAN. I have ports configured accordingly available in the two rooms in which we work from home, the thinking being, we'll only really desperately need internet to work from home and media/recreational use can wait until A&A fix themselves. So to make that Really Important Teams call, we'd just need to plug in a Cat5 patch and drop the laptop off WiFi.

Technical Capers

Giles's Adventures with Tech