Thursday 24 September 2015

VXLAN Support Added to Stripe


I've been looking into VXLAN over the last couple of days and the header turns out to be a very simple fixed-length affair, so I thought I would add VXLAN decapsulation support to Stripe.
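For reference, the VXLAN header (RFC 7348) is just 8 bytes: one byte of flags, three reserved bytes, a 3-byte VNI and one more reserved byte, with the inner Ethernet frame following immediately after. Here's a minimal Python sketch of the decode - purely an illustration of the format, not necessarily how Stripe itself does it:

```python
def parse_vxlan_header(data: bytes):
    """Decode the fixed 8-byte VXLAN header (RFC 7348).

    Layout: 1 byte of flags, 3 reserved bytes, a 3-byte VNI,
    then 1 more reserved byte. Everything after that is the
    inner Ethernet frame.
    """
    if len(data) < 8:
        raise ValueError("truncated VXLAN header")
    if not data[0] & 0x08:               # the VNI-valid (I) flag must be set
        raise ValueError("VNI-valid (I) flag not set")
    vni = int.from_bytes(data[4:7], "big")
    return vni, data[8:]

# Build a header for VNI 1234 and decode it
hdr = bytes([0x08, 0, 0, 0]) + (1234).to_bytes(3, "big") + b"\x00" + b"inner-frame"
vni, inner = parse_vxlan_header(hdr)
```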

If you're not familiar with Stripe, it is a command line tool which loads in a pcap file and strips off any VLAN / MPLS / GRE / PPPoE / L2TP / GTP or VXLAN headers it finds, re-assembling any IP fragments along the way. The mechanism is explained in a previous post for anyone interested.

Stripe can be downloaded from github: https://github.com/theclam/stripe - if you try it, please share your experience (good or bad)!

Monday 21 September 2015

Bundling Ports over EoMPLS and Managed Circuits

Alongside its cloud managed services offerings, the company where I work offers various network services such as Internet feeds, VRFs and point to point layer 2 (Ethernet over MPLS). One question that regularly crops up with the layer 2 services is whether it is possible to bundle or port channel two links together to "get double the bandwidth".

I've already covered elsewhere why bundling ports doesn't always do what people might think, particularly around bandwidth multiplication, but this idea raises a more subtle set of issues.

Barrier 1 - Detecting Failures


OK, so you have a pair of layer 2 extensions. Why would you not want to bundle them?

Actually, there are a few things to consider. Ethernet over MPLS and various other types of managed circuits do not forward link loss, meaning that if a port goes down at one end then the port at the other end remains up. Imagine the case where two links are up at one end while only one is up at the other:


Or even worse, if the carrier has a meltdown and all ports are up but only one link is passing traffic:


Clearly we need a dynamic protocol to detect and mitigate this kind of problem.

Port bundles can either be configured statically (if the port is up we will use it) or dynamically (the device must negotiate over the link before it is used and must remain communicative to stay in the bundle). Never use static port bundles over any kind of WAN circuit - you are just asking for trouble. In fact, never use static port bundles anywhere. Always use a protocol, and make that protocol LACP.

Barrier 2 - Passing Control Frames


The layer 2 extensions that we provide are pure port based Ethernet over MPLS, garbage in / garbage out, so any valid Ethernet frame that comes in one end will be transported and passed out at the other end.

Specifically, control frames such as LACP and (eugh) PAgP pass through transparently so LAGs / Etherchannels can negotiate fine. With other providers and/or other service types your mileage will vary - some will terminate control frames on the attached device, others will just drop them and bundles will fail to negotiate end-to-end.

LACP will take care of all sorts of failures and mis-configurations, even over a WAN - if you're interested in the specifics see All Sorts of Things About LACP and LAGs.

Is there anything else to consider? Actually, yes:

Barrier 3 - Timers


LACP uses periodic keepalives to detect whether links are still viable. If three keepalives are lost then the link is marked as expired and will be removed from the bundle. These keepalives can be sent either in "fast" or "slow" mode using timers of 1 or 30 seconds, respectively.

On that basis, fast timers give a 3 second fault detection time, whereas slow timers give a 90 second detection time (yikes!) which most people would consider extremely inadequate. Bear in mind that during this 90 second window, statistically half the traffic in each direction will be lost. Worse than that, the traffic path between hosts is chosen independently in each direction, meaning each of the following cases is equally likely:






If we lose either the red or the green link, 3 out of the 4 possibilities will break traffic flow in at least one direction!
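The arithmetic is easy to check by enumeration - with each direction independently hashing onto one of the two links, a toy sketch shows a single link failure breaks 3 of the 4 combinations:

```python
from itertools import product

links = ["red", "green"]
failed = "red"                  # suppose the red link silently dies

# Each direction's traffic independently picks a link: (A->B link, B->A link)
combos = list(product(links, repeat=2))
broken = [c for c in combos if failed in c]

print(f"{len(broken)} of {len(combos)} combinations lose traffic in at least one direction")
```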

Things are starting to improve, but until relatively recently many supposedly decent switches (particularly from "Brand C") only supported long timers. Even current Nexus product lines for some reason default to long timers and have to be overridden to use short ones (lacp rate fast, if that's what brought you here!).

So if you are going to bundle, make sure your devices support short timers and make sure they are configured to use them!

Other Thoughts


Given all of this, do I think it's a good idea? I can see both sides.

On one hand, the detection time of fast LACP is better than spanning tree for silent failures and unidirectional links. It is much faster at detecting silent failures than Cisco Fabricpath (~30 seconds).

On the other hand, if you need the additional bandwidth provided by bundling the two circuits then you're going to be out of luck when one of them fails and you should probably reconsider your resilience.

If I were running Fabricpath over the WAN I would certainly put LACP underneath for the faster failure detection. If I were just running STP then I'd probably leave it as it is. If you're sure what you're doing then bundling can work, however if you want resilience then you need to monitor and ensure that the utilisation doesn't rise above 50%... if you're doing that then what's the benefit of bundling?

Thursday 23 July 2015

Cisco Nexus Spanning Tree History

I've been doing a fair bit of work on Nexus 5k / 6k platforms lately and while I've been less than impressed with certain aspects of the products, one thing that the Nexus is really excellent at is keeping logs, whether you ask it to or not. That comes in super-handy if you've left the buffer logging at its default don't-wake-me-unless-the-world-ends setting...

Pretty much anything that has a state is logged somewhere in the Nexus and you can get lost in a labyrinth of cryptic troubleshooting messages related to virtually any process in the switch. In this post I'm focusing on spanning tree logs as they're pretty universal.

Imagine the scenario shown below:



We have three sites connected over the WAN. We blew the budget on dark fibres out of the Cardiff site so we've had to skimp on switches and only have one per site, with the Caerphilly switch being root bridge. The link between Caerphilly and Newport is a metro Ethernet circuit which doesn't forward link loss.

Now imagine there's a failure within the carrier network which results in a total loss of traffic across the circuit between Caerphilly and Newport. No ports go down, however after a short time spanning tree will detect the fault and converge to use the indirect route via Cardiff. If the user's port is in p2p mode rather than edge, he is going to see a 30 second outage while his port transitions back to forwarding, even with RSTP.

How would you even know this had happened (aside from users complaining bitterly)? If you're really wily you may notice your traffic statistics look a bit odd, but if the primary link is restored relatively quickly that kind of thing gets lost in 5 minute roll-ups and natural variation quite easily. Since no interfaces went down, there will be nothing in your logs (by default).

Luckily, the Nexus logs every STP port state transition in its event history and keeps them seemingly forever. If the link flapped 6 months ago there's a good chance you could still prove it, as long as you haven't rebooted the switch. These logs can be retrieved using the command show spanning-tree internal event-history all - note that it's pretty verbose and you probably want to narrow it down if you have a lot of VLANs. The first section for each STP instance is the overall state history, mostly concerned with who the root is and how it is best reached:

Newport# show spanning-tree internal event-history all | begin VLAN0055
VDC01 VLAN0055
<snip>
77) Transition at 643104 usecs after Tue Jul  7 07:44:47 2015
     Root: 8037.000c.a45e.321c Cost: 0 Age:  0 Root Port: none Port: none [STP_TREE_EV_MULTI_FLUSH_LOCAL]

78) Transition at 762615 usecs after Tue Jul  7 07:44:49 2015
     Root: 8037.000c.a45e.321c Cost: 0 Age:  0 Root Port: none Port: Ethernet1/1 [STP_TREE_EV_UPDATE_TOPO_RCVD_SUP_BPDU]

79) Transition at 763013 usecs after Tue Jul  7 07:44:49 2015
     Root: 8037.000c.ac6d.43ba Cost: 4 Age:  0 Root Port: Ethernet1/1 Port: none [STP_TREE_EV_MULTI_FLUSH_LOCAL]

80) Transition at 722769 usecs after Tue Jul  7 07:44:51 2015
     Root: 8037.000c.ac6d.43ba Cost: 4 Age:  1 Root Port: Ethernet1/1 Port: Ethernet1/1 [STP_TREE_EV_MULTI_FLUSH_RCVD]

81) Transition at 832764 usecs after Tue Jul  7 07:44:51 2015
     Root: 8037.000c.ac6d.43ba Cost: 4 Age:  1 Root Port: Ethernet1/1 Port: Ethernet1/2 [STP_TREE_EV_MULTI_FLUSH_RCVD]

82) Transition at 752841 usecs after Tue Jul  7 07:44:52 2015
     Root: 8037.000c.ac6d.43ba Cost: 4 Age:  1 Root Port: Ethernet1/1 Port: Ethernet1/2 [STP_TREE_EV_MULTI_FLUSH_RCVD]

83) Transition at 782964 usecs after Tue Jul  7 07:44:53 2015
     Root: 8037.000c.ac6d.43ba Cost: 4 Age:  1 Root Port: Ethernet1/1 Port: Ethernet1/1 [STP_TREE_EV_MULTI_FLUSH_RCVD]


The logs are quite verbose but it's clear to see from the "Root Port: none" message that the primary path to the root was lost, then regained within a few seconds. Just a minor flap within the carrier network and a few seconds' impact?

Below the main state history are the individual port histories; let's look at our user's port and see what happened there:

VDC01 VLAN0055 <Ethernet1/10>
<snip>
7) Transition at 762694 usecs after Tue Jul  7 07:44:49 2015
     State: BLK  Role: Desg  Age:  2 Inc: no  [STP_PORT_MULTI_STATE_CHANGE]

8) Transition at 640356 usecs after Tue Jul  7 07:45:04 2015
     State: LRN  Role: Desg  Age:  2 Inc: no  [STP_PORT_STATE_CHANGE]

9) Transition at 642846 usecs after Tue Jul  7 07:45:19 2015
     State: FWD  Role: Desg  Age:  2 Inc: no  [STP_PORT_STATE_CHANGE]


Oh. Right at the same time as the WAN dropped out, our user's port went into blocking for 15s then learning for another 15 before finally transitioning to forwarding again. Ouch... and we never would have known were it not for the STP event history!
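The outage duration can be pulled straight out of those records. Here's a quick Python sketch that parses the two-line transition format shown above and measures the gap between the BLK and FWD transitions - the parsing assumes the exact record layout in this output, so treat it as illustrative:

```python
from datetime import datetime, timedelta
import re

history = """\
7) Transition at 762694 usecs after Tue Jul  7 07:44:49 2015
     State: BLK  Role: Desg  Age:  2 Inc: no  [STP_PORT_MULTI_STATE_CHANGE]
8) Transition at 640356 usecs after Tue Jul  7 07:45:04 2015
     State: LRN  Role: Desg  Age:  2 Inc: no  [STP_PORT_STATE_CHANGE]
9) Transition at 642846 usecs after Tue Jul  7 07:45:19 2015
     State: FWD  Role: Desg  Age:  2 Inc: no  [STP_PORT_STATE_CHANGE]
"""

def parse_transitions(text):
    """Yield (timestamp, state) pairs from STP event-history port records."""
    pat = re.compile(r"Transition at (\d+) usecs after (.+)")
    stamp = None
    for line in text.splitlines():
        m = pat.search(line)
        if m:
            usecs, when = m.groups()
            # Normalise the double space in single-digit day numbers
            stamp = datetime.strptime(" ".join(when.split()), "%a %b %d %H:%M:%S %Y")
            stamp += timedelta(microseconds=int(usecs))
        elif stamp is not None and "State:" in line:
            yield stamp, line.split()[1]
            stamp = None

events = list(parse_transitions(history))
outage = events[-1][0] - events[0][0]   # BLK -> FWD, roughly 30 seconds
```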

Side Note


You can save yourself the effort of reading the incredibly verbose event history by setting the logging level of spanning tree to something more useful, such as informational:

Newport(config)#logging level spanning-tree 6

Note that the logging level for the local buffer or syslog server will also need to be set high enough to record the newly verbose messages.

Also, user ports should be forced into edge mode to avoid STP convergence causing massive disruption to them:

Newport(config-if)#spanning-tree port type edge

The switch should "guess" correctly but it's probably best not to take the chance that a user port accidentally ends up in p2p mode.

Saturday 11 July 2015

Testing the Impact of Local Packet Capture on the Cisco 6500 Series

For a while now, many of the larger Cisco devices (such as the 6500 and 7600 series) have supported local packet capture. I've always been hesitant to use these in a production environment, primarily due to concerns about the potential performance hit it could cause. Anyway, I decided to test out the potential impact in the lab to see whether it was sometimes / always / never safe to run local captures.

Test Setup

For my test setup I used a 6504-E switch with a modest SUP32-GE-3B supervisor and a 12.2 Advanced IP Services IOS - if it works on that, it should be safe anywhere. For traffic, I used a spare server running Ubuntu with a combination of Ostinato and Scapy.
The configuration was as follows:
Running at full tilt, Ostinato was happily producing 1 Gbps of traffic which went into my node on VLAN 100, out around the loop cable, back into the node and then out of the same interface on VLAN 101. Basically the port was running at 1 Gbps in each direction - the worst possible case for mirroring a gig port.

Default Settings

The default settings for the capture are pretty conservative - a tiny 2 MB linear capture buffer with a rate limit of 10,000 frames per second. With this config, a 1 Gbps stream of 1500 byte packets fills the buffer in ~ 2.5s, triggering the capture to end. The impact of this is almost impossible to measure at all, with the capture being over so quickly you may not actually see any change in CPU on the 5 second roll-ups.
Lab-6503E#monitor capture start
*Jul 11 14:30:59.205: %SPAN-5-PKTCAP_START: Packet capture session 1 started
Lab-6503E#
*Jul 11 14:31:01.449: %SPAN-5-PKTCAP_STOP: Packet capture session 1 ended as the buffer is full, 21845 packets captured
Lab-6503E#show proc cpu hist
                                                              
                                                              
                         222223333322222               8888811
100                                                           
 90                                                           
 80                                                           
 70                                                           
 60                                                           
 50                                                           
 40                                                           
 30                                                           
 20                                                           
 10                                                    *****  
   0....5....1....1....2....2....3....3....4....4....5....5....
             0    5    0    5    0    5    0    5    0    5    
               CPU% per second (last 60 seconds)
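The numbers line up neatly if you assume each capture record takes roughly 96 bytes of the buffer - that per-record size is my guess rather than a documented figure, but it reproduces the 21845-packet figure in the log exactly:

```python
buffer_bytes = 2 * 1024 * 1024    # default 2 MB capture buffer
record_bytes = 96                 # assumed per-record size (a guess, not documented)
rate_limit_fps = 10_000           # default rate limit

capacity = buffer_bytes // record_bytes   # packets the buffer can hold
fill_time = capacity / rate_limit_fps     # seconds until the buffer is full
```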

Worst Case

OK, so the world didn't end. The next step was to see how bad it could be so I made the following changes:
  • Increased the rate limit to 100,000 fps (max)
  • Increased the packet buffer to 64 MB (max)
  • Enabled a circular buffer (why?!)
This time I set the capture to run for 60 seconds and checked the impact, which was much more severe:
Lab-6503E#monitor capture circular buffer size 65535 start for 60 sec
*Jul 11 14:45:02.953: %SPAN-5-PKTCAP_START: Packet capture session 1 started
*Jul 11 14:46:02.945: %SPAN-5-PKTCAP_STOP: Packet capture session 1 ended after the specified time, 699040 packets captured
Lab-6503E#show proc cpu hist

    2222244444444444444455555444444444444444444444444444444444
    9999999999888889999966666999999999999999999998888899999999
100
 90
 80
 70
 60                     *****
 50      *****************************************************
 40      *****************************************************
 30 **********************************************************
 20 **********************************************************
 10 **********************************************************
   0....5....1....1....2....2....3....3....4....4....5....5....
             0    5    0    5    0    5    0    5    0    5

               CPU% per second (last 60 seconds)

The impact to the supervisor in this case was much more noticeable - up to 60% CPU utilisation. The scenario is pretty unrealistic (forget circular buffers!) but suffice to say I wouldn't want to do that on a production device.
Now that we have the best and worst cases, let's look at some realistic use cases and explore some of the other capture parameters that might help us capture what we need without causing havoc on the network.

Narrowing Down the Capture

It's possible to define criteria to decide what gets captured - as the CLI points out, some of these criteria are processed in hardware while others are handled in software:
Lab-6503E(config-mon-capture)#filter ?
  access-group  Filter access-list (hardware based)
  ethertype     Matching ethertype (software based)
  length        Matching L2-packet length (software based)
  mac-address   Matching mac-address (software-based)
  vlan          Filter vlan (hardware based)
Our test traffic consists of nearly a gig of junk run alongside a small 1-per-second ping, which we will decide is "interesting" to us and we want to capture. This traffic profile makes it easy to test ACL, MAC address and length filters and their relative performance.

ACL Filter

As the CLI says, ACL filters are applied in hardware so the junk is discarded at source before it hits the CPU. In this example I set up an ACL as follows:
Lab-6503E(config)#ip access-list extended icmp-only
Lab-6503E(config-ext-nacl)#permit icmp any any
Lab-6503E(config-ext-nacl)#deny ip any any
... and applied it to my capture as follows:
Lab-6503E(config)#monitor session 1
Lab-6503E(config-mon-capture)#filter access-group icmp-only
Now, I re-ran the "worst case" test, with massively different results:
Lab-6503E#monitor capture circular buffer size 65535 start for 60 sec
*Jul 11 14:49:00.345: %SPAN-5-PKTCAP_START: Packet capture session 1 started
*Jul 11 14:50:00.337: %SPAN-5-PKTCAP_STOP: Packet capture session 1 ended after the specified time, 60 packets captured
Lab-6503E#show proc cpu hist

         22222222223333355555          11111
100
 90
 80
 70
 60
 50
 40
 30
 20
 10                     *****
   0....5....1....1....2....2....3....3....4....4....5....5....
             0    5    0    5    0    5    0    5    0    5
               CPU% per second (last 60 seconds)
Two things to note here - because this is a hardware filter, applied in the ASICs:
  • Only 1 packet per second was punted to the CPU, resulting in essentially no hit at all
  • All 60 of the ping packets were received and nothing else
In summary, ACL filters are ideal when you know what you want to pull from a big stream as you only pay a CPU penalty for the "good stuff".

MAC Filter

In contrast, the MAC filter runs in software. It's also pretty limited, only matching on source MAC. I repeated the above test but instead of an ACL match, applied a MAC filter as follows:
Lab-6503E(config)#monitor session 1
Lab-6503E(config-mon-capture)#filter mac-address 0011.2233.4455
This took us more-or-less back to the worst case:
Lab-6503E#monitor capture circular buffer size 65535 start for 60 sec
*Jul 11 15:05:55.197: %SPAN-5-PKTCAP_START: Packet capture session 1 started
*Jul 11 15:06:55.189: %SPAN-5-PKTCAP_STOP: Packet capture session 1 ended after the specified time, 5 packets captured
Lab-6503E#show proc cpu hist

      44444444443333344444333334444444444444444444444444333333
    4422222111119999900000999990000011111333331111111111888889
100
 90
 80
 70
 60
 50
 40   ********************************************************
 30   ********************************************************
 20   ********************************************************
 10   ********************************************************
   0....5....1....1....2....2....3....3....4....4....5....5....
             0    5    0    5    0    5    0    5    0    5
               CPU% per second (last 60 seconds)

Yuck. I wouldn't do that in production. This is basically because the software filters are applied after the packets have been punted to the CPU, so you pay a penalty for the garbage as well as the good stuff. You'll also notice that it only captured 5 of the 60 ping packets - more on this later, but that's another side effect of software filters.

Length Filter

The frame length filter is another software-based mechanism, which means it's pretty terrible under load, too. Our junk traffic consists of large frames, our interesting traffic is small, so let's configure the capture to only catch the short frames:
Lab-6503E(config)#monitor session 1
Lab-6503E(config-mon-capture)#filter length 0 100 
Again, the output is pretty miserable:
Lab-6503E#monitor capture circular buffer size 65535 start for 60 sec
*Jul 11 15:15:12.145: %SPAN-5-PKTCAP_START: Packet capture session 1 started
*Jul 11 15:16:12.137: %SPAN-5-PKTCAP_STOP: Packet capture session 1 ended after the specified time, 17 packets captured
Lab-6503E#show proc cpu hist

    1111444443333344444444443333333333333333333333333444443333
    4444000009999911111222229999999999999999999999999222229999
100
 90
 80
 70
 60
 50
 40     ******************************************************
 30     ******************************************************
 20     ******************************************************
 10 **********************************************************
   0....5....1....1....2....2....3....3....4....4....5....5....
             0    5    0    5    0    5    0    5    0    5
               CPU% per second (last 60 seconds)
Again, the CPU took a hammering and we only captured a few of the ping packets - 17 out of 60.

Quirks / Order of Operations

Now you may think that software filters might be OK if we just reduce the rate-limit configured on the capture:
Lab-6503E(config)#monitor session 1
Lab-6503E(config-mon-capture)#rate-limit 100
This *does* do what we want for the CPU load - here's an example with a MAC filter:
Lab-6503E#monitor capture circular buffer size 65535 start for 60 sec
*Jul 11 15:26:43.793: %SPAN-5-PKTCAP_START: Packet capture session 1 started
*Jul 11 15:27:43.785: %SPAN-5-PKTCAP_STOP: Packet capture session 1 ended after the specified time, 0 packets captured
Lab-6503E#show proc cpu hist

                    11111          11111          6666633333
100
 90
 80
 70
 60
 50
 40
 30
 20
 10                                               *****
   0....5....1....1....2....2....3....3....4....4....5....5....
             0    5    0    5    0    5    0    5    0    5
               CPU% per second (last 60 seconds)
Great - nothing to see here. But also nothing to see in the capture buffer - 0 packets captured!
Just for fun let's try the same with a hardware ACL filter:
*Jul 11 15:25:26.921: %SPAN-5-PKTCAP_STOP: Packet capture session 1 ended after the specified time, 60 packets captured
Why is this? Well, it's the order of operations. Basically the flow for hardware ACL filters is: ingress traffic → ACL filter (in the ASIC) → rate limiter → punt to CPU → capture buffer.
So the filter throws out the junk before the rate limiter, meaning that the rate limiter only counts the good stuff. If the good stuff exceeds the rate limit then you'll lose some of it, but the junk doesn't count.
Compare that to the flow for the software filters: ingress traffic → rate limiter → punt to CPU → software filter → capture buffer.
The software filters are applied after the rate limiter, so clearly when the rate limit is exceeded you throw out a mix of good and bad traffic, then pick out what's left of the good. If your traffic is overwhelmingly garbage, you may not get any of the good stuff at all!
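A toy simulation of one second of traffic makes the difference obvious - the traffic numbers here are illustrative, not measured:

```python
def capture(packets, rate_limit, filt, hardware):
    """Toy model of one second of capture.

    hardware=True:  filter runs before the rate limiter (ASIC ACL).
    hardware=False: rate limiter runs first, filter afterwards (software).
    """
    if hardware:
        packets = [p for p in packets if filt(p)]
        return packets[:rate_limit]
    packets = packets[:rate_limit]
    return [p for p in packets if filt(p)]

# ~83k junk frames per second with a single ping buried in the middle
second = ["junk"] * 83_000
second.insert(50_000, "ping")

is_ping = lambda p: p == "ping"
hw = capture(second, rate_limit=100, filt=is_ping, hardware=True)
sw = capture(second, rate_limit=100, filt=is_ping, hardware=False)
# hw keeps the ping; sw's rate limiter fills up with junk before the filter runs
```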

Summary - Play it Safe

So in answer to the question "is it safe to run a local capture on a production 6500" - packet capture on even a relatively modest SUP32-3B supervisor is pretty safe provided you are cautious. If you want to do this in a busy production environment then my message to you is:
  • Use ACLs where at all possible
  • Set the rate limit to a sensible value (the default 10,000 fps is fine for most cases)
  • Use linear buffers of a sensible size (do you really need 64MB of capture?)
  • Limit the frame count or capture duration at first (it may turn out there is a lot more "interesting" traffic than you thought!)
I'm convinced enough to do this in production but obviously I'm only testing one device, on one release of code, so don't blame me if you try it on something different and encounter a bug!

Saturday 4 July 2015

OSPF stuck in EXCHANGE / EXSTART

One problem that occasionally comes up in network troubleshooting, mainly in carrier type environments, is a situation where OSPF refuses to come up to a FULL state and instead just sits in the EXCHANGE state at one end and the EXSTART state at the other. To be fair it's one of those things you've either seen or you haven't, but it's something every network engineer should know.

TL;DR - If you don't care why and just want to fix it: it's ALWAYS an MTU mismatch!

For those who are interested, I'll explain what's happening after a quick review of the OSPF neighbour establishment process. Here's a prettified version of the state table from RFC 2328:



Up until the EXSTART state, all the packets are small and no MTU information is shared so everything works fine. Now, to move from the EXSTART state into the EXCHANGE state the two devices must agree on who is master. This is done by each device sending an empty database descriptor (DBD) packet to the other - the devices check each other's DBDs and the device with the highest router ID becomes master.

The problem here is that DBDs contain MTU information and if a DBD is received with a higher MTU than the interface on which it arrived, the DBD is silently dropped as per RFC 2328:

"If the Interface MTU field in the Database Description packet indicates an IP datagram size that is larger than the router can accept on the receiving interface without fragmentation, the Database Description packet is rejected."

So, the device with the larger interface MTU receives a DBD, sends its DBD and is happy enough to move into the EXCHANGE state. The device with the smaller MTU has sent its DBD but has effectively not received one in return so it remains in EXSTART. No matter how many times the DBD with the larger MTU is retransmitted it will never be accepted. Eventually the state times out and we go back to the beginning.
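The check itself boils down to a one-line comparison; a sketch of the asymmetry, assuming routers with MTUs of 1500 and 1492 as in the debug output later in this post:

```python
def accept_dbd(interface_mtu: int, dbd_mtu: int) -> bool:
    """RFC 2328: reject a Database Description packet whose Interface MTU
    field exceeds what we can receive without fragmentation."""
    return dbd_mtu <= interface_mtu

# Router A has interface MTU 1500, router B has 1492
a_accepts_b = accept_dbd(1500, 1492)   # A accepts and moves to EXCHANGE
b_accepts_a = accept_dbd(1492, 1500)   # B silently drops and stays in EXSTART
```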

Papering Over the Cracks


In Cisco IOS it is possible to configure ip ospf mtu-ignore under the interface, which drops the MTU check for that interface. This might seem like a good idea, however I wouldn't recommend it. Of course the best practice is to make sure MTUs are consistent across your network, there's not really an excuse to have MTU mismatched across a link! While ignoring the MTU might get the link up, you are storing up problems for later. Aside from the obvious data plane issues (black-holing large packets in one direction) you may also break the control plane.

For example, you could have a configuration that has been in place for months without change and has "always worked" but suddenly, following a link flap, is now stuck in EXCHANGE / EXSTART. Initially when you connect the devices up, the odds are that the LSDB will be small. At that point, the mismatched MTU will not cause problems and the neighbour will establish fine. Later on in life, though, the LSDBs will be full and the DBDs larger, until the device with the larger MTU has a big enough LSDB update to fill an over-sized packet which its partner can't handle. Then the state gets all screwed up and neighbours reset... bad times!

Debugging


If you're stuck in EXCHANGE / EXSTART but you're still not convinced it's MTU (or if you're trying to inter-op and your two devices use different conventions to define MTU) you can use debugs to confirm what's going on.

The key one here for Cisco IOS is "debug ip ospf adj", which produces output as shown below:

*Jul 4 22:04:23.979: OSPF-1 ADJ Fa0/0: 2 Way Communication to 10.4.4.254, state 2WAY
*Jul 4 22:04:23.983: OSPF-1 ADJ Fa0/0: Nbr 10.4.4.254: Prepare dbase exchange
*Jul 4 22:04:23.983: OSPF-1 ADJ Fa0/0: Send DBD to 10.4.4.254 seq 0x1D43 opt 0x52 flag 0x7 len 32
*Jul 4 22:04:24.011: OSPF-1 ADJ Fa0/0: Rcv DBD from 10.4.4.254 seq 0xA0CF5CC opt 0x52 flag 0x7 len 32 mtu 1500 state EXSTART
*Jul 4 22:04:24.011: OSPF-1 ADJ Fa0/0: Nbr 10.4.4.254 has larger interface MTU
*Jul 4 22:04:28.435: OSPF-1 ADJ Fa0/0: Rcv DBD from 10.4.4.254 seq 0xA0CF5CC opt 0x52 flag 0x7 len 32 mtu 1500 state EXSTART
*Jul 4 22:04:28.439: OSPF-1 ADJ Fa0/0: Nbr 10.4.4.254 has larger interface MTU
*Jul 4 22:04:28.623: OSPF-1 ADJ Fa0/0: Send DBD to 10.4.4.254 seq 0x1D43 opt 0x52 flag 0x7 len 32
*Jul 4 22:04:28.623: OSPF-1 ADJ Fa0/0: Retransmitting DBD to 10.4.4.254 [1]
[...]
*Jul 4 22:06:27.955: OSPF-1 ADJ Fa0/0: Rcv DBD from 10.4.4.254 seq 0xA0CF5CC opt 0x52 flag 0x7 len 32 mtu 1500 state EXSTART
*Jul 4 22:06:27.955: OSPF-1 ADJ Fa0/0: Nbr 10.4.4.254 has larger interface MTU
*Jul 4 22:06:28.147: OSPF-1 ADJ Fa0/0: Killing nbr 10.4.4.254 due to excessive (25) retransmissions
*Jul 4 22:06:28.147: OSPF-1 ADJ Fa0/0: 10.4.4.254 address 10.4.4.254 is dead, state DOWN
*Jul 4 22:06:28.151: %OSPF-5-ADJCHG: Process 1, Nbr 10.4.4.254 on FastEthernet0/0 from EXSTART to DOWN, Neighbor Down: Too many retransmissions
*Jul 4 22:06:28.151: OSPF-1 ADJ Fa0/0: Nbr 10.4.4.254: Clean-up dbase exchange
*Jul 4 22:06:32.555: OSPF-1 ADJ Fa0/0: Nbr 10.4.4.254 10.4.4.254 is currently ignored


On IOS-XR we have "debug ospf instance-id adj", which produces much the same output.

On Juniper JunOS we can configure "set protocols ospf traceoptions flag database-description" which produces the output below:


Jul 4 22:04:41.813313 OSPF rcvd DbD 10.4.4.99 -> 10.4.4.254 (vlan.0 IFL 69 area 0.0.0.0)
Jul 4 22:04:41.814387 Version 2, length 32, ID 10.4.4.99, area 0.0.0.0
Jul 4 22:04:41.814466 checksum 0x0, authtype 0
Jul 4 22:04:41.814566 options 0x52, i 1, m 1, ms 1, r 0, seq 0x1d43, mtu 1492
Jul 4 22:04:41.815182 RPD_OSPF_NBRUP: OSPF neighbor 10.4.4.99 (realm ospf-v2 vlan.0 area 0.0.0.0) state changed from Init to ExStart due to 2WayRcvd (event reason: initial DBD packet was received)
Jul 4 22:04:41.815388 1400 Max dbd packet
Jul 4 22:04:41.815763 OSPF sent DbD 10.4.4.254 -> 224.0.0.5 (vlan.0 IFL 69 area 0.0.0.0)
Jul 4 22:04:41.815889 Version 2, length 32, ID 10.4.4.254, area 0.0.0.0
Jul 4 22:04:41.815970 options 0x52, i 1, m 1, ms 1, r 0, seq 0xa0cf5cc, mtu 1500
Jul 4 22:04:46.254104 OSPF resend last DBD to 10.4.4.99
Jul 4 22:04:46.254753 OSPF sent DbD 10.4.4.254 -> 224.0.0.5 (vlan.0 IFL 69 area 0.0.0.0)
Jul 4 22:04:46.254861 Version 2, length 32, ID 10.4.4.254, area 0.0.0.0
Jul 4 22:04:46.254966 options 0x52, i 1, m 1, ms 1, r 0, seq 0xa0cf5cc, mtu 1500
Jul 4 22:04:46.447212 OSPF rcvd DbD 10.4.4.99 -> 10.4.4.254 (vlan.0 IFL 69 area 0.0.0.0)
Jul 4 22:04:46.447359 Version 2, length 32, ID 10.4.4.99, area 0.0.0.0
Jul 4 22:04:46.447439 checksum 0x0, authtype 0
Jul 4 22:04:46.447584 options 0x52, i 1, m 1, ms 1, r 0, seq 0x1d43, mtu 1492
Jul 4 22:04:50.313983 OSPF resend last DBD to 10.4.4.99
[...]
Jul 4 22:06:45.737775 OSPF sent DbD 10.4.4.254 -> 224.0.0.5 (vlan.0 IFL 69 area 0.0.0.0)
Jul 4 22:06:45.737882 Version 2, length 32, ID 10.4.4.254, area 0.0.0.0
Jul 4 22:06:45.738103 options 0x52, i 1, m 1, ms 1, r 0, seq 0xa0cf5cc, mtu 1500
Jul 4 22:06:50.336478 OSPF resend last DBD to 10.4.4.99
Jul 4 22:06:50.337124 OSPF sent DbD 10.4.4.254 -> 224.0.0.5 (vlan.0 IFL 69 area 0.0.0.0)
Jul 4 22:06:50.337291 Version 2, length 32, ID 10.4.4.254, area 0.0.0.0
Jul 4 22:06:50.337414 options 0x52, i 1, m 1, ms 1, r 0, seq 0xa0cf5cc, mtu 1500
Jul 4 22:06:54.868260 RPD_OSPF_NBRDOWN: OSPF neighbor 10.4.4.99 (realm ospf-v2 vlan.0 area 0.0.0.0) state changed from ExStart to Init due to 1WayRcvd (event reason: neighbor is in one-way mode)

References

RFC2328


Friday 19 June 2015

JunOS Traceroute Error Codes

The other day I was getting a strange result on a traceroute I ran from a Juniper MX router (!S), so I decided to Google what that meant. Strangely enough, it proved pretty difficult to find - there doesn't seem to be anything in the Juniper knowledge base about response codes and nobody seems to have written anything in blogs about what !S means.

Eventually I got to the bottom of it by searching for some of the other error / response codes that JunOS traceroutes throw out. It turns out that JunOS traceroute "has some commonalities" with BSD traceroute, including the error / response codes. I guess a lot of the code is probably borrowed :)

So, to paraphrase the BSD manpages, here are some response codes you could expect to see in the wild:

!H - Received Host Unreachable
!N - Received Network Unreachable
!P - Received Protocol Unreachable (host doesn't understand UDP?)
!S - Source routing failed
!M - MTU Exceeded
!A - Communication with destination network admin prohibited (usually Cisco router ACL)
!Z - Communication with destination host admin prohibited
!X - Communication administratively prohibited

And here are some far less likely ones:

!U - Destination network unknown
!W - Destination host unknown
!I - Host is isolated
!Q - Destination network unreachable for this ToS
!T - Destination host unreachable for this ToS
!V - Host precedence violation
!C - Precedence cutoff in effect
!<num> - Gives the ICMP unreachable code number for an unrecognised response

I've not verified some of these; I think my next little project will be a scapy script to generate each possible unreachable response so I can see how the Juniper displays them. I'm also not sure why the MX was trying to source route its traceroute packets. To be fair, I was telling it to override the routing table when sending, but I didn't think that would cause it to source route. I'll do a bit more investigation into that and post the results if anything interesting emerges.
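As a rough Python sketch (my own, based on the list above and the standard ICMP code assignments - not Juniper's actual logic), the code-to-annotation mapping looks something like this:

```python
# Sketch: map ICMP type 3 (destination unreachable) codes onto BSD-style
# traceroute annotations. Code 4 is shown as "!M" purely because that's how
# it is listed above; classic BSD traceroute prints "!F" for the same code.
UNREACHABLE_ANNOTATIONS = {
    0: "!N",   # network unreachable
    1: "!H",   # host unreachable
    2: "!P",   # protocol unreachable
    4: "!M",   # fragmentation needed (MTU exceeded)
    5: "!S",   # source routing failed
    6: "!U",   # destination network unknown
    7: "!W",   # destination host unknown
    8: "!I",   # host isolated
    9: "!A",   # network admin prohibited
    10: "!Z",  # host admin prohibited
    11: "!Q",  # network unreachable for this ToS
    12: "!T",  # host unreachable for this ToS
    13: "!X",  # communication administratively prohibited
    14: "!V",  # host precedence violation
    15: "!C",  # precedence cutoff in effect
}

def annotate(icmp_code):
    """Return the traceroute annotation for an ICMP unreachable code."""
    if icmp_code == 3:
        return None  # port unreachable: the normal end-of-trace response
    return UNREACHABLE_ANNOTATIONS.get(icmp_code, f"!<{icmp_code}>")
```

Anything the tool doesn't recognise falls through to the raw "!&lt;num&gt;" form, which is the last entry in the list above.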

Monday 15 June 2015

Re-assembling IP Fragments in PCAP Files

Some time ago I created "stripe", a tool for stripping back layers of encapsulation headers from PCAP files, leaving plain payload (typically IP) over Ethernet. "stripe" works with a variety of encapsulation types, from simple VLAN tags up to GRE and GTP, however one thing that stripe couldn't handle was packets that were fragmented after being encapsulated.

One user, LisbethS, suggested that I build IP fragment reassembly into stripe; my first thought was that the two should be separate utilities. As I thought about it more, though, I realised that the two functions (decapsulation and re-assembly) are actually intertwined - if you treat them as separate processes then you can't re-assemble IP that is encapsulated within something else, nor can you decapsulate GRE or GTP that has been fragmented. The key, then, was to do both re-assembly and decapsulation as part of the same process.

Reassembling Packet Fragments


RFC 815 describes a minimal way to re-assemble IP fragments which is not that complex in principle, so I thought I'd add the functionality. I soon realised it wasn't quite as straightforward as I'd hoped: you have to be very careful about the order of operations.

For example, if you have a packet that gets encapsulated then subsequently fragmented, then trying to decapsulate without reassembling first will fail (the first fragment decapsulates to a partial frame, then the subsequent fragment(s) fail to decapsulate). On the other hand, if you re-assemble first then decapsulate then you don't catch the case where a packet is fragmented before encapsulation. Neither approach can catch the case where a packet is fragmented, then encapsulated, then subsequently fragmented again.

To cut a long story short, the answer appears to be that you need to iterate: re-assemble, decapsulate, re-assemble again, decapsulate again... until there are no fragments left and everything is fully decapsulated.
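That order of operations can be sketched as a fixed-point loop (a simplification of what stripe actually does in C; `reassemble` and `decapsulate` here are hypothetical stand-ins for its real passes):

```python
def strip_and_reassemble(packets, reassemble, decapsulate):
    """Alternate re-assembly and decapsulation passes until a full pass
    changes nothing. This handles fragmentation before encapsulation,
    after encapsulation, and even fragment-encapsulate-fragment nesting."""
    while True:
        result = decapsulate(reassemble(packets))
        if result == packets:  # a full pass changed nothing: we're done
            return result
        packets = result
```

Because each pass can expose new fragments or new encapsulation headers, the loop simply runs until it stabilises.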

Anyway, stripe now does both decapsulation and IP fragment re-assembly, meaning that it can take a pcap file containing fragmented and / or encapsulated packets, strip off all the encapsulation and re-assemble the fragments and write out the result to a new pcap file.

Download


The latest version is available at https://github.com/theclam/stripe as source code (it compiles without dependencies on almost any Linux distro), and there are also Mac and Windows binaries for easy download.

UPDATE:

I've managed to recreate some of the SEGFAULTs that people have been kindly reporting to me. It turns out there was a typo / n00b mistake (I'm not sure which, most of this is coded way too late at night) which I have now corrected. If you tried before and got an error, it may be fixed now. The memory leaks have also been reduced from "raging" to "moderate" :)

Sunday 14 June 2015

Quirks of Traceroute over MPLS Networks

Working on MPLS networks, particularly with VRFs, there are a few questions that come up time and time again about traceroutes. 

In this post I'll try to provide answers to such questions as:
  • Why do only two of my service provider's hops show in a traceroute?
  • I am seeing loss / high latency on my traceroute as soon as it enters my provider's MPLS core. The provider says it's due to the destination site's link being congested but it appears way before that in the trace. What's going on?
  • My firewall is attached to an MPLS provider's PE and it is complaining about spoofed packets. When I examine the packets I see they are ICMP unreachables destined to another site on the MPLS WAN. Why would I see this? 
  • What does it mean when my traceroute shows MPLS: whatever?

To understand the answer to any of these questions, you will need to appreciate a) how traceroutes work (check Wikipedia if unsure) and b) the concept of IP / MPLS TTL propagation.

Life Without IP / MPLS TTL Propagation


IP and MPLS both have a TTL field which is used to prevent routing loops from causing packets to circulate forever. IP / MPLS TTL propagation is an optional mechanism which copies the IP TTL into the MPLS shim header upon encapsulation, and copies it back from the MPLS shim header into the IP packet when decapsulation occurs. 

In order to see why we'd ever want TTL propagation, let's take a worked example of CE1 tracing to CE2 without TTL propagation.

Hop 1


The first hop is pretty straightforward. CE1 sends a packet towards CE2 with a TTL of 1.


PE1 receives the packet and decrements the TTL, which reaches zero. PE1 sends a TTL expired message to the sender (CE1).

Hop 2


Next, CE1 sends a packet towards CE2 with an IP TTL of 2.


PE1 decrements the TTL, finds it to be non-zero (1) and decides to forward the packet on via MPLS. Since IP / MPLS TTL propagation is disabled, the MPLS shim header gets a default TTL of 255.

P1 receives the labelled frame, decrements the MPLS TTL, finds it to be non-zero (254) and label switches it.

P2 receives the labelled frame, decrements the MPLS TTL, finds it to be non-zero (253) and label switches it.

PE2 receives the labelled frame, pops the MPLS label off and looks at the IP within. When PE2 decrements the IP TTL it becomes zero, so a TTL expired message is sent to the sender. Note, since PE2 has an interface address within the VRF, the TTL expired message is sourced from that address rather than the interface where the original packet entered.

Hop 3


Finally, CE1 sends a packet towards CE2 with an IP TTL of 3.


PE1 decrements the TTL, finds it to be non-zero (2) and decides to forward the packet on via MPLS. Since IP / MPLS TTL propagation is disabled, the MPLS shim header again gets a default TTL of 255.

P1 receives the labelled frame, decrements the MPLS TTL, finds it to be non-zero (254) and label switches it.

P2 receives the labelled frame, decrements the MPLS TTL, finds it to be non-zero (253) and label switches it.

PE2 receives the labelled frame, pops the MPLS label off and looks at the IP within. PE2 decrements the IP TTL, finds it to be non-zero (1) and forwards the packet.

CE2 receives the packet, finds that it is destined to itself and responds to the sender, completing the traceroute.

The Output


As you can see, with TTL propagation disabled, the entry and exit PEs show up in the traceroute but pure label switched hops (P routers) do not:

CE1#trace 2.2.2.2

Type escape sequence to abort.
Tracing the route to lo-0.ce2.wormtail.co.uk (2.2.2.2)

  1 ge-2-1.pe1.wormtail.co.uk (192.168.1.1) 4 msec 4 msec 2 msec
  2 ge-2-1.pe2.wormtail.co.uk (192.168.2.1) 4 msec 4 msec 4 msec
  3 ge-0.ce2.wormtail.co.uk (192.168.2.100) 8 msec 7 msec 6 msec
CE1#

Sometimes that's desirable, particularly if the carrier wants to hide how many hops it takes to get from A to B.

In any case, we can answer the first question:
Why do only two of my service provider's hops show in a traceroute?

Your provider has disabled IP/MPLS TTL propagation, meaning that label-switched hops do not appear in traceroutes.

Enabling TTL Propagation


Let's consider a second worked example showing the same traceroute running over the same network but with TTL propagation enabled.

Hop 1


The first hop doesn't get as far as being MPLS encapsulated so it behaves exactly the same as with propagation disabled, i.e. CE1 sends a packet towards CE2 with an IP TTL of 1, PE1 decrements that to zero and generates a TTL expired.

Hop 2


Now CE1 sends a packet towards CE2 with an IP TTL of 2:


PE1 receives the packet and decrements the TTL to 1. This is non-zero so the packet is forwarded using MPLS. Since IP / MPLS TTL propagation is enabled on PE1, the MPLS TTL is copied from the already decremented IP TTL value.

Now, when P1 receives the labelled packet it decrements the TTL and finds zero. P1 sends a TTL expired message towards the sender.

Hop 3


Hop 3 behaves in essentially the same manner as hop 2 - the MPLS TTL decrements to zero and a TTL expired message is sent to the sender.

Hop 4


Now CE1 sends a packet towards CE2 with an IP TTL of 4:


The behaviour to note here is that when PE2 pops the MPLS label, it copies the MPLS TTL back into the IP TTL. The IP TTL is now 1, as if it had been decremented at each hop. Finally, PE2 decrements the new IP TTL, finds that it is zero and sends a TTL expired.

Hop 5


Finally, CE1 sends a packet towards CE2 with an IP TTL of 5:


Essentially, PE1 decrements and copies into the MPLS TTL, each label switch hop decrements the MPLS TTL, PE2 copies the MPLS TTL back into the IP TTL, then forwards the packet on to CE2. CE2 realises that the packet is destined to itself and sends a response to CE1.

In this case, every hop is visible in the traceroute because the TTL behaves consistently and is decremented at each hop:

CE1#trace 2.2.2.2

Type escape sequence to abort.
Tracing the route to lo-0.ce2.wormtail.co.uk (2.2.2.2)

  1 ge-2-1.pe1.wormtail.co.uk (192.168.1.1) 6 msec 4 msec 3 msec
  2 xe-1-1.p1.wormtail.co.uk (10.1.100.2) 4 msec 4 msec 4 msec
  3 xe-1-1.p2.wormtail.co.uk (10.100.101.2) 5 msec 4 msec 5 msec
  4 ge-2-1.pe2.wormtail.co.uk (192.168.2.1) 2 msec 7 msec 4 msec
  5 ge-0.ce2.wormtail.co.uk (192.168.2.100) 6 msec 6 msec 4 msec
CE1#


Ah, a beautiful traceroute with all the hops on... now we can troubleshoot stuff :)
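The two worked examples boil down to a small piece of TTL arithmetic, which can be condensed into a toy simulation (a deliberately simplified model for illustration, not real forwarding code):

```python
def probe(path, ip_ttl, propagate):
    """Return the name of the node that answers a probe sent with ip_ttl.
    path is a list of (name, role) pairs: ingress PE ('pe_in'),
    label-switching cores ('p'), egress PE ('pe_out'), destination ('dst')."""
    mpls_ttl = 0
    for name, role in path:
        if role == "pe_in":
            ip_ttl -= 1
            if ip_ttl == 0:
                return name                       # expired before encapsulation
            mpls_ttl = ip_ttl if propagate else 255
        elif role == "p":
            mpls_ttl -= 1
            if mpls_ttl == 0:
                return name                       # a P router answers
        elif role == "pe_out":
            if propagate:
                ip_ttl = mpls_ttl                 # copy MPLS TTL back into IP
            ip_ttl -= 1
            if ip_ttl == 0:
                return name
        else:
            return name                           # destination reached

def traceroute(path, propagate):
    """Collect responders for TTL = 1, 2, 3... until the destination answers."""
    hops, ttl = [], 0
    while not hops or hops[-1] != path[-1][0]:
        ttl += 1
        hops.append(probe(path, ttl, propagate))
    return hops

path = [("PE1", "pe_in"), ("P1", "p"), ("P2", "p"),
        ("PE2", "pe_out"), ("CE2", "dst")]
```

With propagation off this yields PE1, PE2, CE2 (the P routers are invisible); with it on, PE1, P1, P2, PE2, CE2 - matching the two traces above.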

Hold On!


Now, this all seems good at first glance, but we've forgotten something. One of the greatest features of MPLS VRFs is that the intelligence is only needed around the edge: the entry / exit points of the network (PEs) need to know a VRF's routes, but the P routers switching labels in the middle only need to know how to reach the PEs and don't hold any VRF routes at all.

So how do the P routers reply to traceroutes run within a VRF? They can't look in the VRF routing table to decide how to return the unreachable message to the host, as they don't have a VRF table. Maybe they see the label that is in use and somehow know what the return label should be? Nope - label-switched paths are unidirectional, so there's no way for a P node to know what labels it would need to attach to an unreachable message to get it back to the sender.

The answer is that the P routers play a clever little trick. The P router doesn't know how to reach the sender, but if it forwards the TTL expired message in the same way as it would have forwarded the original packet, the message will eventually reach the egress PE, which does know how to reach the sender. It sounds a little counter-intuitive, so let's have a diagram:




P1 receives a labelled packet, decrements the MPLS TTL and finds it to be zero. P1 doesn't know anything about the VRF where this traffic originated so it peels off all the labels and looks at the IP packet inside. It generates a TTL expired message ready to respond to the sender and copies the label stack from the original packet onto the message it has just generated. Then it label switches that instead of the original packet. This means it gets sent onwards to P2, then PE2, where the labels are removed and the IP packet can be routed. Seeing that the ICMP TTL expired message is destined for CE1, PE2 applies the appropriate label and sends it back into the network.

Indirect FECs


Now, as if that wasn't quirky enough, we will look at a very similar setup but where the destination of the traceroute is not on the PE's directly attached network but is one hop away:



For efficiency reasons, many vendors allocate a label per next-hop rather than a label per VRF. When the PE receives such a label, it automatically adds the encapsulation for the next hop device's MAC and throws the frame out of the appropriate interface without any kind of routing lookup. Hurray, we burnt a label to save a TCAM lookup.

This has the effect of extending the quirk in the previous example - because the PE does not make a routing decision for these packets, the U-turning of ICMP unreachables occurs one hop further along, at the CE:

No big deal, you might think, but remember: the difference between a PE and a CE is that the PE is normally in the provider's PoP on big fat 10 Gbps links while the CE could be out in the sticks somewhere on the end of a 2 Mbps DSL line... U-turning at the CE means that responses from label-switched hops near the beginning of a trace will appear to inherit any packet loss and / or latency being experienced by hops further along, up to and including the PE - CE link.

I am seeing loss / high latency on my traceroute as soon as it enters my provider's MPLS core. The provider says it's due to the destination site's link being congested but it appears way before that in the trace. What's going on?

When you traceroute to a prefix beyond a CE device, the unreachables from label-switched hops usually have to be U-turned by the CE device, which is on the other end of the congested link. Therefore label-switched hops will often show packet loss / high latency in a traceroute even though the issue is further downstream.
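A little arithmetic shows why every label-switched hop inherits the tail latency. Take made-up one-way link delays of 1 ms per core link and a congested 150 ms PE-CE tail; a TTL-expired message generated at either P router must ride the LSP all the way to the far CE and back:

```python
# One-way delays (ms) for CE1-PE1, PE1-P1, P1-P2, P2-PE2, PE2-CE2.
# Illustrative numbers only; the last link is the congested tail circuit.
delays = [1, 1, 1, 1, 150]

def apparent_rtt(expire_link, delays):
    """RTT a traceroute displays for a label-switched hop whose TTL-expired
    message is U-turned at the far CE (label-per-next-hop case)."""
    total = sum(delays)
    fwd = sum(delays[:expire_link])     # probe out to the expiring hop...
    return fwd + (total - fwd) + total  # ...ICMP on to the CE, then back
```

The forward leg to the expiry point plus the onward leg to the CE always sum to the full path, so every label-switched hop shows the same 2 × total figure (308 ms here) even though their true round trips are a few milliseconds.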


Weird Firewall Logs


If we consider MPLS into the datacentre, we can often tweak the example above and replace the CE router with a firewall:



Now, to recap:

  • Label-switched hops don't have routing knowledge to send a TTL expired message, so they attach the forward path label
  • The PE knows that the label corresponds to a prefix behind the firewall so it encapsulates with the firewall's MAC as the destination and throws it straight out of the interface
So the firewall receives a packet on its WAN interface which is destined for something out of its WAN interface. Most firewalls don't like that - Cisco ASA firewalls, for example, by default will not route any traffic back out of the same interface where it entered.

The traffic split-horizon rule can be overridden in config, but there is another issue to overcome - any self-respecting firewall will perform uRPF checks on incoming packets before too much else happens. Unless the firewall has a default route pointing out into MPLS land then we are going to have a problem when packets arrive from the carrier's label switched core, often sourced from a public IP.

Finally, unless your firewall rules are pretty sloppy, the traffic will be dropped by filter anyway!

The upshot is that the firewall logs show spoofed (or at least dropped) ICMP unreachable packets with a source IP of the carrier's PE and a destination IP of some device at a remote site.

So, mystery solved:

My firewall is attached to an MPLS provider's PE and it is complaining about spoofed packets. When I examine the packets I see they are ICMP unreachables destined to another site on the MPLS WAN. Why would I see this?

Your carrier is using label-per-next-hop with IP / MPLS TTL propagation enabled, and some remote device tried to traceroute towards an address behind your firewall.

Though, if you jumped to this section, you probably want to read the preceding sections to make sense of that!

MPLS Information in Traceroutes


If your router supports the feature, you may see MPLS labels quoted in a traceroute, as shown here:

PE1#trace 10.255.255.2
Type escape sequence to abort.
Tracing the route to lo-0.pe2.wormtail.co.uk (10.255.255.2)
VRF info: (vrf in name/id, vrf out name/id)
  1 xe-1-1.p1.wormtail.co.uk (10.1.100.2) [MPLS: Label 17 Exp 0] 1 msec 6 msec 2 msec
  2 xe-1-1.p2.wormtail.co.uk (10.100.101.2) [MPLS: Label 42 Exp 0] 6 msec 2 msec 2 msec
  3 xe-1-1.pe2.wormtail.co.uk (10.2.101.1) 4 msec 4 msec 3 msec
PE1#


Having the labels can be useful for troubleshooting - not that I've ever needed them. The presence of the EXP marking could be useful for debugging QoS issues, at least. I always assumed that there was some kind of MPLS trick used to relay the information back, but then I noticed you sometimes get it when you trace through an external provider's network which hands over to you on IP.

It's actually done using ICMP extensions for MPLS (RFC 4950) - basically the whole MPLS label stack as it arrived at the label switch router is wrapped up in an ICMP extension header and copied wholesale into the unreachable or TTL expired message. As long as the ICMP message gets to you, so will the label stack.
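Each label stack entry in the extension is just the 4-byte MPLS shim word (label 20 bits, EXP 3 bits, bottom-of-stack 1 bit, TTL 8 bits), so decoding one is trivial. A quick sketch:

```python
def parse_label_stack_entry(word_bytes):
    """Decode one 4-byte MPLS label stack entry as carried in an
    RFC 4950 ICMP extension object."""
    word = int.from_bytes(word_bytes, "big")
    return {
        "label": word >> 12,          # top 20 bits
        "exp": (word >> 9) & 0x7,     # 3 experimental / traffic class bits
        "bottom": (word >> 8) & 0x1,  # bottom-of-stack flag
        "ttl": word & 0xFF,           # MPLS TTL as it arrived at the LSR
    }

# e.g. label 17, EXP 0, bottom of stack, TTL 1 encodes as 0x00011101
```

That decoded label and EXP value is exactly what shows up in the [MPLS: Label 17 Exp 0] annotation above.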

Most hosts don't support the extension, so you just see a normal-looking traceroute result, but recent routers generally do and can decode the information and present it to you.

Here's a sample packet inspected in Wireshark:

So:

What does it mean when my traceroute shows MPLS: whatever?

Your traceroute has passed through an MPLS backbone somewhere; the MPLS label stack is included in the unreachable message for troubleshooting purposes.
Fairly self-explanatory, but good to know.

References


Wikipedia article on Traceroute
RFC 4950 (MPLS extensions to ICMP)
Per CE and per-VRF labelling mode on Cisco IOS-XR

Sunday 19 April 2015

Quick Build - PPPoE Client on JunOS

In this quick-build guide I'll show you how to set up a very basic JunOS-based PPPoE client. This example is from a Firefly virtual SRX firewall appliance, however the config should be identical on any JunOS platform. As usual, the build will cover the most simple common use case (no VLAN tags, dynamic AC selection, negotiated IP).

Note, if you want a PPPoE access concentrator to go with your client, you may find the Quick Build: Cisco IOS PPPoE Server with RADIUS Authentication post useful.

The Setup


The PPPoE client is set up in two config stanzas - the first being the physical interface which will connect towards the access concentrator, the second being a virtual point to point interface that will become live when the PPPoE session comes up. We'll build the physical interface first, as follows:

set interfaces ge-0/0/2 unit 0 encapsulation ppp-over-ether

In true JunOS fashion, very little config is required there - just turn the interface encapsulation dial to PPPoE :)

Next, we need to set up the point to point interface. We'll create it as unit 0 and bind it to the physical interface we just configured:


set interfaces pp0 unit 0 pppoe-options underlying-interface ge-0/0/2.0
set interfaces pp0 unit 0 pppoe-options client

Now the PPP settings:

set interfaces pp0 unit 0 ppp-options chap default-chap-secret b0dges
set interfaces pp0 unit 0 ppp-options chap local-name "user@domain"
set interfaces pp0 unit 0 ppp-options chap passive


The lines above essentially just set the CHAP local name (think username), the CHAP secret (think password) and set CHAP to passive mode (i.e. tell it not to try to get the AC to authenticate to us). All that then remains is to configure up the IP:

set interfaces pp0 unit 0 family inet negotiate-address
set routing-options static route 0.0.0.0/0 next-hop pp0.0

That really is all that you need! In real life you will probably need to add NAT and so on, but the PPPoE configuration is done and the interface should just pop up on its own.
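Since everything here authenticates with CHAP, it's worth noting what actually crosses the wire: never the secret itself, only an MD5 digest of it. A minimal sketch of the RFC 1994 response calculation (illustrative only, not Juniper's implementation):

```python
import hashlib

def chap_response(identifier, secret, challenge):
    """RFC 1994 CHAP: response = MD5(identifier || secret || challenge).
    The peer proves it knows the secret without ever transmitting it."""
    return hashlib.md5(bytes([identifier]) + secret + challenge).digest()

# The AC performs the same calculation and compares the 16-byte digests;
# a fresh challenge each time prevents simple replay.
```

This is also why both ends must hold the secret in a recoverable form - the digest has to be computed from the cleartext on each challenge.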

Debugging


Debugging PPPoE setup is best done by enabling trace logging for the PPP and PPPoE protocols as follows:

set protocols ppp traceoptions file ppp
set protocols ppp traceoptions level all
set protocols ppp traceoptions flag all
set protocols pppoe traceoptions file pppoe
set protocols pppoe traceoptions level all
set protocols pppoe traceoptions flag all


The output of these traces can then be seen using "show log ppp" and "show log pppoe" respectively. They are quite verbose and should give a strong steer on what is not working.

Quick Build - Cisco IOS PPPoE Server with RADIUS Authentication

In this guide I'll show you how to quickly set up an IOS-based PPPoE access concentrator and a RADIUS server for it to authenticate against. As part of the setup I'll include both dynamic (pool based) IP subscribers and a fixed IP subscriber, which should cover most basic use cases.

The setup I describe will look like this (we will only build the RADIUS and the AC):



If you need a client to go with it, please check out my post titled Quick build - PPPoE Client on Cisco IOS

Stage 1 - The RADIUS Server


Firstly we'll configure the RADIUS server - the starting point for this is a completely clean install of Kali Linux. It makes no difference if this is running as a VM or on a physical box but you will probably run out of "disk space" if you try to do this from a live CD boot. This assumes that you either have configured an Internet connection or have a full Kali disc / image from which to install packages.

Task 1 - Configure the Network Interface


Edit /etc/network/interfaces and add the following (adjusting interface names as necessary):

auto eth1
iface eth1 inet static
    address 10.0.0.10
    netmask 255.255.255.0


Save the file, then run:

ifdown eth1
ifup eth1


Task 2 - Install FreeRADIUS


Assuming your Internet connection / Kali disc is accessible, just run:

apt-get update
apt-get install freeradius


Task 3 - Configure FreeRADIUS


The FreeRADIUS config files are pretty big and are mostly full of examples that aren't relevant to this setup, so we'll just set them aside and create new ones from scratch. Firstly we will replace the clients.conf file which tells FreeRADIUS which devices are allowed to authenticate against it.

First, set the old one aside:
mv /etc/freeradius/clients.conf /etc/freeradius/ORIG-clients.conf

Next, edit /etc/freeradius/clients.conf and add:

client 10.0.0.100/32 {
    secret        = b0dges
    shortname    = PE1
    nastype        = cisco
}


Save the file and exit. Now do a similar thing with the users file (which is used to define how the users will be authenticated):

mv /etc/freeradius/users /etc/freeradius/ORIG-users

Edit /etc/freeradius/users and add:

DEFAULT         Auth-Type := CHAP, Cleartext-Password := "password1"
                    Framed-Protocol = PPP,
                    Framed-IP-Address = 255.255.255.254
foeh@fixed      Auth-Type := CHAP, Cleartext-Password := "password2"
                    Framed-Protocol = PPP,
                    Framed-IP-Address = 192.168.100.1


Save the file and exit. Note that the special "255.255.255.254" address above instructs the access concentrator to assign an IP locally from its pool.

Task 4 - Restart FreeRADIUS with the New Config


Simply run:

service freeradius restart


The service should restart without error. That's the difficult RADIUS config done, now onto the access concentrator!

Stage 2 - Configure the Access Concentrator


I'll go into a little more detail on this part as it's not quite as intuitive. The base device here is an IOS 12.3 router; Cisco's licensing model is complex and seems to vary wildly between platforms, so I'll let you poke around in Feature Navigator to work out which feature set and release will work on your device... It seems to work fine on a 3845 running security services, if that helps.

Task 1 - Configure Interfaces


We need a couple of interfaces configured here:
  • An interface towards RADIUS (obviously)
  • A loopback interface (used to address the "unnumbered" PPPoE tunnel interfaces)
  • A client interface (where the PPPoE users will attach)

interface FastEthernet1/0
 description To RADIUS
 ip address 10.0.0.100 255.255.255.0
 no shutdown
!
interface Loopback0

 description IP for Unnumbered Tunnel Interfaces
 ip address 1.1.1.1 255.255.255.255
!
interface FastEthernet0/1
 description To Clients
 no shutdown
!


We won't configure anything on the client interface for the moment but we will return to it momentarily...

Task 2 - Configure PPPoE


Firstly we need to configure a BBA group, which is just a way to associate a bunch of settings with a particular interface. We'll use the default "global" group and configure it to use virtual-template 1:

bba-group pppoe global
 virtual-template 1

Next, we need to define what Virtual-Template1 is. Virtual template interfaces are used to define a prototype on which to base the Virtual-Access (tunnel) interfaces which are automatically created when a PPPoE user connects. In this most simple of examples we just define where the interface's local IP address should be cloned from, where the PPPoE user's IP address should be allocated from and the authentication protocol we want to use:

interface Virtual-Template1
 ip unnumbered Loopback0
 peer default ip address pool localpool
 ppp authentication chap


The "peer default ip address" command above refers to a pool called "localpool", so we'd better create that:

ip local pool localpool 172.16.0.1 172.16.0.100

Now that we've defined all that good stuff, we need to apply the BBA group to the client interface:

interface FastEthernet0/1
 pppoe enable group global


Pretty simple so far and, in fact, that's most of the config done. All we need to do now is to...

Task 3 - Point the Router at the RADIUS


There are three relatively straightforward steps here, firstly we have to enable AAA new model on the device and then define the RADIUS server details (note, FreeRADIUS' ports differ from Cisco's defaults so we need to specify them):

aaa new-model
radius server kali
 address ipv4 10.0.0.10 auth-port 1812 acct-port 1813
 key b0dges


Then we have to tell the router to go to RADIUS when authenticating PPPoE users:

aaa authentication ppp default group radius
aaa authorization network default group radius


The second line is not necessary for dynamic IP users, but it instructs the router to allow RADIUS to dictate a user's IP address (and some other attributes). If you leave it out then users with a RADIUS-defined static IP will get one out of the pool like everyone else, so if RADIUS Framed-IP attributes are being ignored this is probably the cause.

Note: You may want to configure some local usernames for access to the CLI and add "aaa authentication login default local" or similar.

At this point, your PPPoE Access Concentrator with RADIUS authentication is ready for use!

If you're unsure how to set up a client, I've also written quick build posts for that:



Debugging


No build would be complete without a little bit of debugging. This is such a straightforward setup that, barring layer 3 issues, there's not a lot that can go wrong. Troubleshooting would pretty much be as follows:

  • Verify that you are getting PPPoE control traffic in from your client (debug pppoe packet, debug pppoe event). The sequence should be PADI, PADO, PADR, PADS. PADT indicates someone is pulling down the session, the debugs should show you who!
  • Check IP reachability to the RADIUS box using ping
  • Verify that FreeRADIUS is running (ps -aux | grep freeradius), start it if necessary (/etc/init.d/freeradius start)
  • If your client can't authenticate, check the password matches what's in FreeRADIUS (/etc/freeradius/users), not forgetting to restart FreeRADIUS if you make changes (/etc/init.d/freeradius restart)
  • If the passwords match but you are still getting authentication errors, verify that your secrets match between the router ("key" under "radius-server") and FreeRADIUS (/etc/freeradius/clients.conf), the NAS IP matches your router and that FreeRADIUS has been restarted since the last change (/etc/init.d/freeradius restart)
  •  Check your PPP is negotiating OK (debug ppp negotiation)
Some more tips that may be helpful can be found on my post about debugging Cisco PPPoE clients.

Quick build - PPPoE Client on Cisco IOS

In this quick-build guide I'll show you how to set up a very basic IOS-based PPPoE client. This example is from a Cisco 819 router, however the config is pretty much the same on most ISR type devices. As usual, the build will cover the most simple common use case (no VLAN tags, dynamic AC selection, negotiated IP).

Note, if you want a PPPoE access concentrator to go with your client, you may find the Quick Build: Cisco IOS PPPoE Server with RADIUS Authentication post useful.

The Setup



The PPPoE client is basically set up in two parts - the first being the physical interface which will connect towards the access concentrator, the second being a dialer interface that will become instantiated when the PPPoE session comes up. We'll build the physical interface first, as follows:

interface GigabitEthernet0
 description To AC
 pppoe enable
 pppoe-client dial-pool-number 1
 no shutdown
!

Pretty minimal... turn PPPoE on, and tell it which dialer pool to use. Note, in older versions of IOS the command was simply "pppoe-client dial-pool-number 1". Next, we have to configure the dialer interface, as follows:

interface Dialer1
 ip address negotiated
 encapsulation ppp
 dialer pool 1
 dialer-group 1
 ppp authentication chap callin
 ppp chap hostname user@domain
 ppp chap password 0 b0dges
!
dialer-list 1 protocol ip permit
ip route 0.0.0.0 0.0.0.0 Dialer1

This creates the dialer interface that we will use, tells it to use PPP and to pick up its IP address dynamically.

The "dialer pool" command places this dialer into the pool where the physical interface was set to look, while the "dialer-group" command specifies which dialer-list will be used to decide what traffic is interesting (i.e. will bring or keep the PPPoE session up).

The PPP commands force the authentication type to CHAP, specify that we will not make the AC authenticate to us (generally not supported) and set the CHAP hostname (think username) and password.

Finally, the dialer-list referred to in the earlier "dialer-group" command is defined to match any IP traffic at all, before a static route is used to force traffic out of the dialer interface.

That really is all that you need! In real life you will probably need to add NAT statements and you will definitely need at least one other interface, but that's the PPPoE part done.

Debugging


There's an entire post dedicated to this subject, but the short version is as follows:

  • Verify that you are getting PPPoE control traffic between your client and the server (debug pppoe packet, debug pppoe event). The sequence should be PADI, PADO, PADR, PADS. PADT indicates someone is pulling down the session, the debugs should show you who!
  • Check the static route has installed in your routing table as traffic will only trigger the PPP up if it hits the interface (show ip route)
  • Verify that there is at least one "up" IP interface on the box other than the dialer. If there's no source address usable then any test traffic will fail to encapsulate and you won't be able to bring PPP up. (show ip interfaces brief)
  • If your client can't authenticate, check the credentials (both hostname and password under the Dialer interface) and ensure that the authentication type is CHAP in "callin" mode.
  •  Check your PPP is negotiating OK (debug ppp negotiation)