Anycast – “Think” before you talk – Part II

This post is Part II of Anycast – “Think” before you talk.

DIRECTING USERS TO POPS

Traditional methods of routing users to the closest PoP rely on DNS-based geographic load balancing. DNS attempts to map user requests to the nearest PoP by handing out DNS records based on the requester’s estimated latitude and longitude. Crucially, the PoP’s IP address is chosen based on the IP of the resolver, not the actual client IP address.

BENEFITS OF DNS-BASED LB

One of the main advantages of DNS-based load balancing is control: administrators can direct any DNS request to any node. This is useful for traffic management purposes. It also offers deployment flexibility, as you don’t need common carriers or have to deal with the intricacies of Anycast routing on the Internet.

However, there are plenty of tradeoffs. Unlike the Anycast-based approach, DNS-based load balancing does not handle topology changes very well out of the box.

CHALLENGES OF DNS-BASED LB

Many consider it a naive approach with plenty of shortcomings and suboptimal user placement. The issues mainly arise from PoP failover, resolver proximity, low TTLs and timeouts.

Suboptimal decisions are made because the DNS mapping is based on the IP of the user’s name server, not the client’s actual IP address: the query arrives from the client’s DNS resolver, not from the client itself. This makes DNS-based load balancing an inaccurate method for client proximity routing. As a result, administrators can only ever optimise performance for the DNS resolver. This has improved in the last few years with the EDNS Client Subnet extension, but it is not implemented in all resolvers.
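
To make the resolver proximity problem concrete, here is a minimal sketch of a geographic DNS decision. The PoP names, the coordinates and the `nearest_pop` helper are all hypothetical and chosen purely for illustration; a real Geo DNS service would look locations up in a Geo IP database rather than receive them directly.

```python
import math

# Hypothetical PoP coordinates (latitude, longitude), for illustration only.
POPS = {
    "london": (51.5, -0.1),
    "new_york": (40.7, -74.0),
    "singapore": (1.35, 103.8),
}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def nearest_pop(location):
    """Return the name of the PoP geographically closest to `location`."""
    return min(POPS, key=lambda name: haversine_km(POPS[name], location))


# Assumed scenario: a client near Singapore uses a resolver near New York.
client_location = (1.3, 103.9)
resolver_location = (40.8, -73.9)

# Geo DNS only sees the resolver, so it answers for the resolver's location.
pop_seen_by_dns = nearest_pop(resolver_location)
# What the client would actually want.
pop_client_needs = nearest_pop(client_location)

print(f"DNS hands out: {pop_seen_by_dns}, client is closest to: {pop_client_needs}")
```

In this assumed scenario the DNS answer is optimal for the resolver and poor for the client, which is exactly the mismatch described above.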

DNS doesn’t fail very gracefully when you are in the middle of a session and need to be rerouted to a different PoP. Users often have to close their browsers and reopen them to start working again.

DNS TTLs can cause lag and performance issues. Whenever you decide to change your answer, you have to wait for the DNS TTL to expire. During a failover, clients only move to the new location once the TTL of the cached response has expired. Unfortunately, some applications hold on to these values for a long time. Setting a low TTL mitigates this, but at the cost of performance, as resolvers must frequently re-request the same DNS record.
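
The TTL tradeoff comes down to simple arithmetic. The sketch below assumes a resolver that honours the TTL exactly and a record that is requested continuously; the function name `ttl_tradeoff` and the sample TTL values are illustrative, not taken from any particular deployment.

```python
def ttl_tradeoff(ttl_seconds):
    """Return a pair describing both sides of the TTL tradeoff.

    First value: worst-case seconds a cached answer can outlive a change.
    Second value: upper bound on lookups per hour that one resolver sends
    upstream for one record, assuming it re-queries as soon as the TTL expires.
    """
    worst_case_stale_seconds = ttl_seconds
    max_queries_per_hour = 3600 / ttl_seconds
    return worst_case_stale_seconds, max_queries_per_hour


for ttl in (30, 300, 3600):
    stale, queries = ttl_tradeoff(ttl)
    print(f"TTL {ttl:>4}s: up to {stale}s stale, up to {queries:.0f} lookups/hour")
```

Lowering the TTL shrinks the failover window and raises the lookup rate by the same factor. Applications that ignore the TTL make the real failover window longer than this lower bound suggests.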

THE ANYCAST APPROACH

Instead of handing out a different IP per location, Anycast announces the same IP address from multiple locations. Anycast is nothing special: it is simply a route with multiple next hops. It’s not too different from Unicast; an NLRI object has multiple next hops instead of one. All the magic happens where the packet is delivered, not in the underlying network transporting the packet.

When you advertise the same destination from multiple locations, the shortest path is chosen based on where the user sits in the network. Traffic therefore lands organically where it should, as opposed to being directly controlled via Geo IP. Anycast does not rely on a stale Geo IP database; performance rests on the natural flow of the Internet.

The resolver IP is not used; instead, the client IP is used for Anycast routing. This subtle difference offers a more accurate view of where users are located. Users can use whatever resolver they want and will still get the same assignment. As a result, the client’s choice of DNS server is irrelevant: whatever the question is, the answer will be the same.

With an Anycast design, there are trade-offs between performance and stability. Anycast works best with a metro- or region-level design and with a single PoP per location. Multiple PoPs per location might run you into problems. As a general best practice, the more physical distance you have between your PoPs, the more stable the overall architecture will be.
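
As a rough illustration of how one prefix announced from several PoPs resolves to a single choice at each router, the sketch below picks the route with the shortest AS path. This is a deliberate simplification: real BGP best-path selection evaluates several attributes before and after path length. The `Route` class, the PoP names, the documentation prefix 192.0.2.0/24 and the AS numbers are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str      # the Anycast prefix being announced
    pop: str         # which PoP originated this announcement (hypothetical names)
    as_path: tuple   # AS numbers the announcement traversed to reach this router


def best_route(routes):
    """Pick the route with the shortest AS path; ties go to the first one seen."""
    return min(routes, key=lambda r: len(r.as_path))


# The same prefix as seen by one hypothetical router in Europe.
routes_seen_in_europe = [
    Route("192.0.2.0/24", "london", (64500, 64510)),
    Route("192.0.2.0/24", "new_york", (64501, 64502, 64510)),
]

chosen = best_route(routes_seen_in_europe)
print(f"Traffic lands at: {chosen.pop}")

# If the London PoP withdraws its announcement, the same router simply
# falls back to the next best path; the destination IP never changes.
remaining = [r for r in routes_seen_in_europe if r.pop != "london"]
fallback = best_route(remaining)
print(f"After withdrawal, traffic lands at: {fallback.pop}")
```

A router elsewhere would hold a different set of paths for the same prefix and could legitimately choose a different PoP, which is how users end up at a nearby location without any Geo IP lookup.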

ANYCAST ORGANIC TRAFFIC

Anycast is natively not load aware. Large volumes of inbound traffic could potentially saturate a PoP. While this is also true for Unicast traffic, DNS-based routing offers better control over PoP placement, as you can hand out specific IP blocks for specific locations.

DNS may provide a suboptimal response, but it still represents a better level of supervision for traffic management purposes. With the Anycast approach to PoP placement, organic traffic flows naturally to each PoP location; you can’t control this. Some control is given up in moving from traditional DNS-based routing to a TCP/Anycast CDN. So what’s the best course of action under these circumstances? Should you oversubscribe each PoP to account for the lack of control?

First and foremost, when it happens, you need to be aware of it. It’s not acceptable to be unaware of a flood of traffic entering your network. The right monitoring tools need to be in place, along with a responsive and active monitoring team. Much of the cause of large inbound flows lies upstream: for example, a provider breaks something. So it will happen; it’s just a matter of time. The best way to deal with it is through active monitoring and preparation.

CacheFly has the experience and monitoring in place to detect and mitigate large volumes of inbound traffic. The network architecture consists of private connections between all PoP locations, streamlining the shedding of traffic to undersubscribed PoPs as the need arises. In the event of high inbound traffic flows, CacheFly’s proactive monitoring and intelligent network design shift traffic between locations, mitigating the effects of uneven traffic flows due to the Anycast design.
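
The general idea of shedding traffic to undersubscribed PoPs can be sketched in a few lines. To be clear, this is not CacheFly’s implementation, which is not described here; it is a toy model under assumed numbers, where `load` and `capacity` are hypothetical per-PoP figures in the same unit (say, Gbps) and any PoP can hand traffic to any other over the private links.

```python
def shed_excess(load, capacity):
    """Move load above capacity from saturated PoPs to PoPs with headroom.

    Returns a new dict of per-PoP load; the input dicts are not modified.
    """
    new_load = dict(load)
    for pop in load:
        excess = new_load[pop] - capacity[pop]
        if excess <= 0:
            continue  # this PoP is not saturated
        # Offer the excess to the PoPs with the most headroom first.
        candidates = sorted(
            (p for p in new_load if p != pop),
            key=lambda p: capacity[p] - new_load[p],
            reverse=True,
        )
        for other in candidates:
            if excess <= 0:
                break
            headroom = max(capacity[other] - new_load[other], 0)
            moved = min(excess, headroom)
            new_load[other] += moved
            new_load[pop] -= moved
            excess -= moved
    return new_load


# Hypothetical figures: PoP "a" is receiving more than it can serve.
load = {"a": 120, "b": 40, "c": 70}
capacity = {"a": 100, "b": 100, "c": 100}

balanced = shed_excess(load, capacity)
print(balanced)
```

If the network as a whole is over capacity, the leftover excess stays on the saturated PoP in this model, which is the point at which monitoring and preparation matter more than any algorithm.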

BENEFITS OF ANYCAST

Anycast is generally deemed to fail over quicker than DNS, perform better and be simpler to operate. Anycast doesn’t suffer from any of the DNS correlation issues, and it doesn’t matter which DNS server you came from. The client takes the fastest path from its own location, as opposed to the fastest path from where the DNS resolver is.

Anycast is a simpler way to assign users. You’re pushing the complexity and responsibility to the routing of the upstream providers (BGP between networks and the Interior Gateway Protocol (IGP) within them), relying on the natural forwarding of the Internet to bring users to the closest PoP.

With an Anycast design, the next time you click a link on a page, or any time your browser refreshes content, you are on your way to a new PoP. Anycast is faster because traffic shifts can happen much more quickly, and you don’t have to lower users’ performance by keeping a low DNS TTL.

Upon network failure, Anycast fails over far more quickly than Unicast. Suppose you are having routing issues between location X and location Y. With Anycast, a TCP RST is received and the client immediately starts working with the new location. Without Anycast, clients will continually attempt to reach the server in location X but, as it’s not available, the client is stuck in a retry loop until either:

a) The providers converge, but until then, users are waiting, timing out, reloading and timing out over and over again.

b) If location X is down, the client has to wait for the Geo DNS to notice the failure and offer a new IP address. Clients either need to time out the IP in the application or resolver, and potentially close and reopen the web browser for things to start working again.

Anycast, on the other hand, breaks quickly and gets back to working again rapidly. With outages, traffic is seamlessly routed to the next best location without requiring browser restarts, a type of convergence not possible with traditional DNS solutions.

Anycast enables the use of a high TTL, as the actual IP address of the endpoints never changes. This allows resolvers to cache a response for longer, improving overall end-user experience and network efficiency.

It’s also a great tool in a DDoS mitigation solution. With botnet armies reaching terabit-per-second scale attacks, the only cost-effective defence is to distribute your architecture, naturally absorbing the attack with an Anycast network.

EVERYTHING IS DEBATABLE

Anycast does require some form of stickiness, so that all packets of a flow get the same forwarding treatment. As a result, per-packet load balancing can break Anycast. Per-packet load balancing is rarely seen these days, but there is a chance it still exists somewhere in a far-flung ISP. Generally speaking, we are designing better networks these days.
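
The difference between per-flow and per-packet load balancing is easy to show in code. The sketch below is an illustration only: real routers use hardware hash functions rather than SHA-256, and the function names, addresses and next-hop labels are assumptions made for the example.

```python
import hashlib
import itertools


def pick_next_hop(flow, next_hops):
    """Per-flow balancing: hash the 5-tuple so one flow always maps to one hop.

    `flow` is (src_ip, src_port, dst_ip, dst_port, protocol).
    """
    key = "|".join(map(str, flow)).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


def per_packet_round_robin(next_hops):
    """Per-packet balancing: each packet takes the next hop in turn."""
    return itertools.cycle(next_hops)


next_hops = ["toward_pop_a", "toward_pop_b"]
flow = ("198.51.100.7", 50123, "192.0.2.10", 443, "tcp")

# Every packet of the flow hashes to the same next hop, so the TCP
# session keeps arriving at the same Anycast PoP.
per_flow_choices = {pick_next_hop(flow, next_hops) for _ in range(100)}

# Round robin sprays consecutive packets of the same flow across both
# hops, so they can arrive at PoPs that hold no state for the session.
rr = per_packet_round_robin(next_hops)
per_packet_choices = {next(rr) for _ in range(4)}

print(len(per_flow_choices), len(per_packet_choices))
```

With per-flow hashing the set of chosen hops for one flow has a single member, while round robin uses every available hop, which is why the latter can break a TCP session on an Anycast address.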

TCP/IP uses a different protocol, ICMP, for out-of-band signalling. As a result, these messages (for example, those used by Path MTU discovery) may receive different forwarding treatment and may not reach the intended receiver. Technically this is still an issue, but it is not widely a problem on TCP/Anycast networks.

Anycast endpoint selection is based on hop count (in BGP terms, AS-path length). That does not mean it’s routing based on lowest latency or best-performing links. Fewer hops do not mean lower latency. Some destinations may be one hop away, but that hop could be a high-latency intercontinental link. More often than not, traffic doesn’t have to traverse intercontinental links to reach its final destination. With intelligent PoP placement, content is placed close to the user in the specified regions.

Anycast does take control away from the administrator and puts it in the hands of the Internet. As user requests organically land at the closest PoP, the strict supervision of where users land is removed, potentially leading to capacity management issues at each edge location. As already discussed, this is overcome with experienced monitoring teams, which is another reason why you shouldn’t go with a DIY CDN.
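
A tiny worked example shows why fewer hops and lower latency are not the same thing. The two paths and their per-link latencies below are invented for illustration; only the comparison between the two selection rules matters.

```python
# Hypothetical paths to two PoPs, each a list of per-link latencies in ms.
paths = {
    "pop_far": [140.0],            # a single intercontinental hop
    "pop_near": [4.0, 6.0, 5.0],   # three short regional hops
}


def fewest_hops(paths):
    """Selection by hop count, which is what path-length routing approximates."""
    return min(paths, key=lambda p: len(paths[p]))


def lowest_latency(paths):
    """Selection by total latency, which is what the user actually experiences."""
    return min(paths, key=lambda p: sum(paths[p]))


print(f"Fewest hops picks: {fewest_hops(paths)}")
print(f"Lowest latency picks: {lowest_latency(paths)}")
```

The two rules disagree here, which is the case that careful PoP placement is meant to make rare.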

SUMMARY

People overestimate how unreliable the Internet is regarding broad events, underestimate the impact of those events on Unicast and overestimate their impact on Anycast. The unreliability of the Internet is built into its design. The Internet is designed to fail! However, we tend to assume that if we are using TCP/Anycast and, under a failure, the application terminates at the wrong place, the world stops and everything else breaks.

If, during an intermittent failure or misconfiguration event, a TCP segment belonging to an established HTTP session with Server X lands on Server Y, then Server Y, which has no active TCP session for it, will send an RST back to the client, as it should. But if your application doesn’t handle network interactions very well, you really shouldn’t be running it on the Internet.
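
Handling a reset gracefully on the client side does not take much. The sketch below is one possible approach, assuming the request is idempotent and therefore safe to repeat; `fetch_with_retry`, its parameters and the `flaky_fetch` stand-in are hypothetical names introduced for this example.

```python
import time


def fetch_with_retry(fetch, attempts=3, backoff_seconds=0.1):
    """Call `fetch()` and retry on connection resets, aborts and timeouts.

    A new attempt opens a new connection, which Anycast routing delivers
    to whichever PoP is currently the best path.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except (ConnectionResetError, ConnectionAbortedError, TimeoutError) as exc:
            last_error = exc
            # Exponential backoff between attempts.
            time.sleep(backoff_seconds * (2 ** attempt))
    raise last_error


# Stand-in for a real request: the first call is reset, the second succeeds.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionResetError("connection reset by peer")
    return "200 OK"


result = fetch_with_retry(flaky_fetch, backoff_seconds=0)
print(result, "after", calls["count"], "attempts")
```

Non-idempotent requests need more care than this, since blindly repeating them can apply the same change twice.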

Networks are built to fail, and they will fail! If you are looking for 100% network reliability and the application can’t handle failures, then you should maybe look to rebuild the application.

This guest contribution is written by Matt Conran, Network Architect for Network Insight. Matt Conran has more than 17 years of networking industry experience with entrepreneurial start-ups, government organisations and others. He is a lead Network Architect and has successfully delivered major global greenfield service provider and data centre networks.

Image Credit: Pixabay
