TruCore : Truminds 5G UPF now supports HA
Nov 22, 2021
In this blog, we describe how we implemented High Availability (HA) in the Truminds 5G UPF, a recent addition. First of all, let’s quickly go over some key terms.
UPF stands for User Plane Function. It is a 5G network function which routes traffic from the mobile side to the Internet and vice versa. A UPF must handle packets from tens of thousands of mobiles, if not more, so it operates at scale and must be highly efficient in its processing.
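Just imagine that the mobile is towards the far left, the UPF is in the middle and the Internet is on the right side. The control plane (SMF) programs the UPF with state for every mobile: Associations, Sessions inside them, and inside each Session the rules that tell the UPF what to do with packets, namely PDRs (Packet Detection Rules), FARs (Forwarding Action Rules), QERs (QoS Enforcement Rules), URRs (Usage Reporting Rules) and a BAR (Buffering Action Rule). The resulting data model looks like a ladder, something like this: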
- |-- UPF
- ||-- Association-1
- |||-- Session-1
- ||||-- PDR-1
- |||||-- QER-ref-1
- |||||-- QER-ref-2
- |||||-- QER-ref-n
- |||||-- URR-ref-1
- |||||-- URR-ref-2
- |||||-- URR-ref-n
- ||||-- PDR-2
- ||||-- PDR-n
- ||||-- FAR-1
- ||||-- FAR-2
- ||||-- FAR-n
- ||||-- QER-1
- ||||-- QER-2
- ||||-- QER-n
- ||||-- URR-1
- ||||-- URR-2
- ||||-- URR-n
- ||||-- BAR
- |||-- Session-2
- |||-- Session-n
- ||-- Association-2
- ||-- Association-n
Typically a UPF would handle hundreds of thousands of mobiles, perhaps even millions in a big system. That’s one big ladder! It would therefore hold a correspondingly large number of Sessions in its memory, set up by the SMF, and each of those Sessions contains its PDRs, FARs, QERs and URRs. Whoa! If the UPF works flawlessly, like the Truminds UPF does, everything is hunky-dory: all the mobiles are handled correctly and with high throughput thanks to efficient software at the UPF. So you see that the UPF is a mission-critical system in a 5G network. If the UPF is there, data is shunted between the mobile and the Internet; if the UPF is not there, the mobile cannot reach the Internet.
Now imagine a fully loaded UPF, with a million Sessions programmed on it by the SMF, running live in the field. Suddenly, due to a hardware failure, the UPF machine crashes. Whoops, now a million mobiles have lost their data connectivity to the Internet. As we said earlier: no UPF, no data shunting. This is a huge issue. It might take a few minutes to spin up a new UPF on a new machine, and then several more minutes for the SMF to set up the same calls all over again on this new UPF. All this means downtime in a telecom network, a bad user experience and everything else that goes along with it.
So it is clear that no matter how good a UPF you build, there is a chance that it might come crashing down, either because of a rare software fault or an even rarer hardware fault. We need some kind of insurance policy for such a scenario. That is where the High Availability (HA) solution of the Truminds UPF comes into play. In short, the Truminds UPF supports a Standby system. In effect we have two UPFs running: the first is the Active and the second is the Standby. For all external entities like the SMF, there is still just one UPF; internally there are two of them. The Active UPF does everything described earlier. But in the rare scenario where the Active UPF goes down for any reason, there is no need to worry about downtime. The Standby UPF is ready to take over almost immediately: it switches into the Active role and continues to run the show in the network. In the meantime, the operator can resolve the problem at the UPF which went down and launch it again (e.g. on a different machine if it was a hardware failure). This newly launched UPF will automatically become Standby and will be ready to step in if the currently Active UPF ever goes down.
Effectively, the picture becomes this: two UPFs, one Active and one Standby, which the outside world sees as a single UPF.
So it takes two to tango, and that solves the problem. Good then, we can all go home happy!
But wait, how is the cooperation between the two UPFs achieved? A few questions come to mind:
- When we initially launch two UPFs, how do they determine which will be the Active and which will be the Standby?
- How do we synchronize the Standby with the Active so that the Standby has all the information about all the mobile Sessions and their PDRs, FARs, URRs, QERs etc.?
- What if the Active already has a million Sessions on it when the other UPF instance is launched? How do we initially replicate these million Sessions at the Standby to bring both instances in sync?
- How does the Standby determine that the Active instance is now dead?
So let’s unravel the answers to these questions one by one. Note that while we describe all this in the context of the 5G UPF, the paradigm is valid for any similar system which handles data using contexts held in its own memory and needs high availability.
One footnote: some of us may jump to the conclusion that the UPF can be made stateless by storing all the Session information in an external key-value store like Redis. But we are talking about the data plane here. It is fine for control plane applications to use Redis, but the data plane has to sustain a very high throughput. There is simply no time to fetch data from an external store over the network on a per-packet basis without losing throughput. The Session data must therefore live, statefully, inside the UPF’s own memory, and that is what creates the challenge of keeping state in sync between the Active and the Standby.
Alright, let’s first see how role negotiation is done between the two UPFs potentially running in the system. When a UPF is started, it already knows its own HA interface IP address and the HA IP address of the peer UPF. Essentially, a dedicated interface is used between the two UPFs for HA-related interactions. When a UPF comes up, it first has to determine whether it should become Active or Standby. It does this through the following message dialog, which plays out differently in three scenarios:
Let’s assume that the peer UPF is up and running as Active
Starting UPF to peer UPF: Are you Active?
Peer UPF to starting UPF: Yes, I am Active
Starting UPF comes up as Standby. Role negotiation over.
Let’s assume that the peer UPF is not running at all
Starting UPF to peer UPF: Are you Active?
Starting UPF waits for the answer from the peer UPF and then times out
Starting UPF comes up as Active. Role negotiation over.
Let’s assume that the starting UPF and the peer UPF are both starting at the same time
Starting UPF to peer UPF: Are you Active? (and at the same time, peer starting UPF to first starting UPF: Are you Active?)
Whoops, the questions cross each other. Now what should each one answer?
Here each starting UPF knows that it received a question while it had already sent one of its own, so it checks whether its IP address is lower than that of the peer starting UPF. If it is lower, it answers "Yes, I am Active" and becomes Active. If its IP address is higher, it remains silent and becomes Standby; the peer starting UPF becomes Active by the same logic.
So you see it is a simple state machine. Now we have not provided all the details here but it is easy to spot the mechanics of the scheme and it is intuitively clear that it is not a complex implementation at all.
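The three scenarios above can be sketched as a small decision function. This is only an illustrative sketch, not the Truminds implementation: the names `Role`, `Event` and `negotiate_role` are hypothetical, and the actual message exchange and timers are left out. One detail worth noting is that the IP addresses are compared numerically rather than as strings.

```python
import ipaddress
from enum import Enum


class Role(Enum):
    ACTIVE = "active"
    STANDBY = "standby"


class Event(Enum):
    """What happened after the starting UPF asked 'Are you Active?' (hypothetical names)."""
    PEER_SAYS_ACTIVE = 1   # the peer answered "Yes, I am Active"
    TIMEOUT = 2            # no answer arrived before the timer expired
    PEER_ASKED_TOO = 3     # the peer's own question crossed ours


def negotiate_role(my_ha_ip: str, peer_ha_ip: str, event: Event) -> Role:
    """Decide the role of a starting UPF from the outcome of its question."""
    if event is Event.PEER_SAYS_ACTIVE:
        return Role.STANDBY
    if event is Event.TIMEOUT:
        return Role.ACTIVE
    # Both UPFs are starting together: the lower IP address becomes Active.
    mine = ipaddress.ip_address(my_ha_ip)
    peer = ipaddress.ip_address(peer_ha_ip)
    if mine == peer:
        raise ValueError("the two HA IP addresses must differ")
    return Role.ACTIVE if mine < peer else Role.STANDBY
```

Because both sides run the same comparison on the same pair of addresses, exactly one of them ends up Active in the simultaneous-start case.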
Now let’s proceed to the next step. One of the biggest anxieties of the Standby UPF is to ensure that it is in ‘sync’ with the Active UPF in terms of all the data structures which map to the Sessions, along with their PDRs, FARs, URRs, QERs etc. It might be that the Active UPF is already handling a million Sessions while the Standby UPF has just come up. So how do we handle this? There are a few dimensions to this problem. The first is that the Session is a complex data structure at the UPF. It holds lists of PDRs, FARs etc., so there are pointers involved. That means we cannot simply send the raw memory from Active to Standby: the pointers on the Active UPF would not be meaningful in the address space of the Standby UPF. The data therefore has to be ‘marshalled’ before being sent to the peer, and on the peer it has to be ‘unmarshalled’ to create an equivalent data structure in which all the cross references amongst the elements are consistent again. It is a kind of deep copy plus transmission. So this interface specification has to be carefully designed and maintained as the data structures evolve over time.
Remember our ladder diagram for the data model? Bulk replication really means this: replicate the entire ladder from the Active to the Standby UPF.
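To make the marshalling idea concrete, here is a minimal sketch under stated assumptions. The classes `Qer`, `Pdr` and `Session`, the `max_kbps` field and the JSON wire format are all hypothetical choices made for illustration; a real UPF would have many more fields and most likely a compact binary encoding. The point it shows is that in-memory references (a PDR pointing at QER objects) are turned into IDs on the wire and rebuilt into references on the receiving side.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Qer:
    qer_id: int
    max_kbps: int  # hypothetical rule field


@dataclass
class Pdr:
    pdr_id: int
    qers: list = field(default_factory=list)  # in-memory references to Qer objects


@dataclass
class Session:
    session_id: int
    qers: dict = field(default_factory=dict)  # qer_id -> Qer
    pdrs: dict = field(default_factory=dict)  # pdr_id -> Pdr


def marshal(session: Session) -> bytes:
    """Flatten a Session: object references become plain IDs."""
    wire = {
        "session_id": session.session_id,
        "qers": [{"qer_id": q.qer_id, "max_kbps": q.max_kbps}
                 for q in session.qers.values()],
        "pdrs": [{"pdr_id": p.pdr_id, "qer_refs": [q.qer_id for q in p.qers]}
                 for p in session.pdrs.values()],
    }
    return json.dumps(wire).encode("utf-8")


def unmarshal(data: bytes) -> Session:
    """Rebuild a Session: IDs are resolved back into local object references."""
    wire = json.loads(data.decode("utf-8"))
    session = Session(session_id=wire["session_id"])
    for q in wire["qers"]:
        session.qers[q["qer_id"]] = Qer(q["qer_id"], q["max_kbps"])
    for p in wire["pdrs"]:
        pdr = Pdr(p["pdr_id"])
        pdr.qers = [session.qers[qer_id] for qer_id in p["qer_refs"]]
        session.pdrs[pdr.pdr_id] = pdr
    return session
```

After `unmarshal`, a QER shared by two PDRs is again a single shared object on the receiving side, which is the "consistent relationships" property described above.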
Alright, let’s say we have written the routines to marshal and unmarshal, so that’s good. The second challenge is the amount of data that must be sent from Active to Standby. A million Sessions or more at the Active means that it might take a few seconds to marshal everything and send it to the Standby over the network. If all the data were sent in one shot, the Active UPF would be blocked for that whole time. That would be really problematic, because the Active UPF is handling calls and data in a live system; it simply cannot afford to block for seconds. The way to solve this is to replicate Sessions in several bursts. In between the bursts, the Active simply goes about its normal business. E.g. if there are a million Sessions at the Active, we can schedule one burst every 100 milliseconds and replicate, say, ten thousand Sessions in each burst. We would then need 100 bursts to replicate all the million Sessions, which takes a little over 10 seconds. During this bulk replication phase, as we call it, the Active UPF runs a little slower in its normal data handling (because it is also doing the heavy lifting of bulk replication), but at least it is not blocked completely. And once the bulk replication phase is over, the Active UPF is fully back to business as usual.
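The burst arithmetic above can be checked with a short sketch. The function names and the two callbacks (`send_burst`, `do_normal_work`) are hypothetical placeholders, and a real data plane would drive the bursts from a timer rather than a simple loop; this only shows the slicing and the interleaving.

```python
from typing import Callable, Iterator, List, Sequence


def replication_bursts(session_ids: Sequence[int], burst_size: int) -> Iterator[List[int]]:
    """Yield the Session IDs in consecutive slices of at most burst_size."""
    for start in range(0, len(session_ids), burst_size):
        yield list(session_ids[start:start + burst_size])


def bulk_replication_seconds(total_sessions: int, burst_size: int,
                             burst_interval_ms: int) -> float:
    """Lower bound on the bulk phase: number of bursts times the burst interval."""
    bursts = -(-total_sessions // burst_size)  # ceiling division
    return bursts * burst_interval_ms / 1000.0


def replicate_in_background(session_ids: Sequence[int], burst_size: int,
                            send_burst: Callable[[List[int]], None],
                            do_normal_work: Callable[[], None]) -> None:
    """Alternate between sending one burst and doing normal packet handling."""
    for burst in replication_bursts(session_ids, burst_size):
        send_burst(burst)
        do_normal_work()


# The example from the text: 1,000,000 Sessions, 10,000 per burst, one burst per 100 ms.
print(bulk_replication_seconds(1_000_000, 10_000, 100))  # 10.0
```

The burst size and interval are the tuning knobs: larger bursts finish the bulk phase sooner but take more time away from normal traffic handling in each interval.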
There is one more aspect of data replication to the Standby, which applies once the bulk phase is over. Whenever the Active UPF gets a message from the Control Plane, it also replicates that message to the Standby UPF. The Standby UPF runs exactly the same logic on this message as it would have done had the message come directly from the Control Plane, except that it does not answer back to the Control Plane. This way the Standby always moves in lock step with the Active after the bulk phase. An interesting case arises: what happens to these individual messages at the Active while bulk replication is still ongoing? In this case too, the individual messages are all replicated, except the Session Creation messages. Basic validations at the Standby UPF ensure that its data structures do not go into an inconsistent state. E.g. if the Standby UPF gets an individual message about a Session which does not yet exist at the Standby, it simply ignores it. That Session will eventually be replicated, in its current state, as part of bulk replication, and everything comes back in sync. So this issue is not as problematic as it initially appears.
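A minimal sketch of these two rules follows, with hypothetical names throughout (`should_replicate`, `StandbyStore`, and the message types "create", "modify" and "delete" stand in for the real Control Plane messages). Session state is reduced to a plain dictionary to keep the example short.

```python
VALID_TYPES = {"create", "modify", "delete"}  # hypothetical message types


def should_replicate(msg_type: str, bulk_in_progress: bool) -> bool:
    """Active side: hold back Session Creation while bulk replication is running."""
    return not (bulk_in_progress and msg_type == "create")


class StandbyStore:
    """Standby side: applies replicated messages with basic validation."""

    def __init__(self):
        self.sessions = {}  # session_id -> dict holding the Session state

    def apply_bulk(self, session_id, state):
        """Install a Session received through bulk replication."""
        self.sessions[session_id] = dict(state)

    def apply_message(self, msg_type, session_id, fields=None) -> bool:
        """Apply one replicated message; return False if it was ignored."""
        if msg_type not in VALID_TYPES:
            raise ValueError(f"unknown message type: {msg_type}")
        if msg_type == "create":
            self.sessions[session_id] = dict(fields or {})
            return True
        if session_id not in self.sessions:
            # Not replicated yet: ignore it, bulk replication will bring it later.
            return False
        if msg_type == "modify":
            self.sessions[session_id].update(fields or {})
        else:  # "delete"
            del self.sessions[session_id]
        return True
```

Ignoring a message for an unknown Session is safe here only because bulk replication later delivers that Session in its up-to-date state, as described above.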
Alright, so now the Active and Standby UPFs are completely in sync and are marching boldly ahead in lock step. Now how does the Standby come to know that the Active is presumably dead and that it should take over the role of Active? That’s simple. As soon as a UPF assumes the role of Standby, amongst the other things it does, it keeps asking the Active whether the Active is still out there. When the Active stops replying, it means that the Active is dead. In that case the Standby immediately assumes the role of Active and there you go. It already has all the Sessions and is ready to handle everything from both the Control plane and Data plane perspective.
Something like this:
Standby: Are you still out there, Mr. Active?
Active: Yes sir, I am there
Standby: Are you still out there, Mr. Active?
Active: Yes sir, I am there
[Active goes down]
Standby: Are you still out there, Mr. Active?
[No answer from Active]
Standby goes Active now.
If the earlier Active is started again after software or hardware repair work, it will do the role negotiation as explained earlier and will come up as Standby. Cool, isn’t it!
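The keepalive dialog can be sketched as a small counter-based monitor. This is a sketch under assumptions: the class name `PeerMonitor` is hypothetical, and so is the `max_missed` threshold. The text only says that the Standby takes over when the Active stops replying; tolerating a configurable number of missed answers is our illustrative choice to avoid reacting to a single lost probe.

```python
class PeerMonitor:
    """Tracks keepalive answers from the Active and decides when to take over."""

    def __init__(self, max_missed: int = 3):
        if max_missed < 1:
            raise ValueError("max_missed must be at least 1")
        self.max_missed = max_missed  # assumed threshold, not from the original text
        self.missed = 0
        self.role = "standby"

    def on_probe_result(self, answered: bool) -> str:
        """Record the outcome of one 'Are you still out there?' probe."""
        if self.role == "active":
            return self.role  # already took over, nothing more to decide
        if answered:
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                self.role = "active"
        return self.role


monitor = PeerMonitor(max_missed=2)
for answered in (True, True, False, False):
    role = monitor.on_probe_result(answered)
print(role)  # active
```

The probe interval multiplied by `max_missed` gives the detection time, so both values together determine how quickly the Standby can take over.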
There is one important aspect of this whole scheme of Active and Standby UPFs. To the outside world, specifically the RAN (gNB) and the Control Plane (SMF), the two UPFs are not visible as two different UPFs. For them the UPF is just an IP address. It is our internal implementation which consists of two UPFs (Active and Standby, with their role negotiations and transitions and what have you). The RAN (gNB) and the Control Plane (SMF) couldn’t care less. This poses some challenges for the HA implementation. The first is the IP address seen by the gNB and the SMF. That is a simple matter: once the Standby becomes Active, it can plumb the relevant externally visible IP addresses on its interfaces and we are all set. However, the UPF which is now going Active will likely have different MAC addresses on its interfaces. External entities like the gNB and the SMF may keep sending to the old MAC address for quite some time before realizing that ARP has to be done again. The downtime seen by external entities might then last a few seconds, which might not be acceptable. To solve this, the UPF which is going Active not only plumbs the IP addresses on the relevant interfaces but also sends Gratuitous ARP announcements, telling the external entities to update their IP-to-MAC mappings if they have them cached. This ensures that the external entities start communicating almost instantly with the new MAC address of the UPF which is now Active, and everything then works as expected.
One might argue that there would still be some downtime while the Standby UPF detects that the Active is down and then becomes Active. But this can be brought down to under a second, and it is a small price to pay for keeping a million established calls alive, even though a few end-to-end packets might get lost during the transition. End-to-end data from the mobile that runs over TCP would survive anyway thanks to retransmissions, so overall the scheme works in practice.
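For readers who have not looked inside a Gratuitous ARP, here is a sketch that builds one as raw bytes. It follows the common ARP announcement form (an ARP request whose sender and target IP are both the announced address, sent to the Ethernet broadcast address). The function name `build_gratuitous_arp` and the example addresses are hypothetical, and this is not taken from the Truminds code. Actually transmitting the frame needs a raw socket or the data plane’s own transmit path, which is left out.

```python
import socket
import struct

ETH_BROADCAST = b"\xff" * 6
ETHERTYPE_ARP = 0x0806
ETHERTYPE_IPV4 = 0x0800
ARP_HTYPE_ETHERNET = 1
ARP_OP_REQUEST = 1


def build_gratuitous_arp(mac: bytes, ip: str) -> bytes:
    """Build an Ethernet frame announcing that `ip` now lives at `mac`."""
    if len(mac) != 6:
        raise ValueError("a MAC address must be exactly 6 bytes")
    ip_bytes = socket.inet_aton(ip)
    # Ethernet header: destination (broadcast), source, EtherType.
    eth = ETH_BROADCAST + mac + struct.pack("!H", ETHERTYPE_ARP)
    # ARP fixed part: hardware type, protocol type, address lengths, opcode.
    arp = struct.pack("!HHBBH", ARP_HTYPE_ETHERNET, ETHERTYPE_IPV4, 6, 4, ARP_OP_REQUEST)
    arp += mac + ip_bytes          # sender hardware and protocol address
    arp += b"\x00" * 6 + ip_bytes  # target: MAC left as zeros, IP equals the sender IP
    return eth + arp


frame = build_gratuitous_arp(bytes.fromhex("020000000001"), "192.0.2.10")
print(len(frame))  # 42
```

Neighbours that already have the IP address in their ARP cache update the entry to the new MAC address when they receive this frame, which is what shortens the switchover as seen from the gNB and the SMF.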
So here we are: we have described qualitatively how the Truminds UPF achieves HA. The implementation of course turned out to be quite a challenge, with all those data structures to tame, but in the end Truminds engineers did it. We now have a UPF which supports HA in all its flavours: the Linux sockets based UPF, the native DPDK application, the VPP plugin and the 6WINDGate plugin. If you want a demo, please do get in touch! Cheers till next time.
Unstoppable Domain: praupadhyaya.crypto
Blog by Prashant Upadhyaya, Vice President - Technology at Truminds.