Why DNS Based GSLB Doesn’t Work, Part II

Pete Tenereillo

4/9/04

Copyright Tenereillo, Inc. 2004

Abstract

As the part 1 of this paper shows, DNS based Global Server Load Balancing (GSLB) does not work, for browser based clients, if combined with the common practice of returning multiple A records. Part 2 ignores the issue of multiple A records for high availability, and covers some other pitfalls associated with DNS based GSLB.

Overview of GSLB DNS resolution with GSLB

The following diagram gives an overview of DNS resolution as used with GSLB devices. (If you’ve already read the first document, “Why DNS Based GSLB Doesn’t Work”, please skip ahead, as this section is identical).

Site A in Los Angeles has a virtual IP address (VIP) of 1.1.1.1, and Site B in NYC has a VIP of 2.2.2.2. A GSLB device is acting as the authoritative name server for www.trapster.net. Upon a DNS query for www.trapster.net, the job of the GSLB is to determine whether to return the IP address 1.1.1.1 or 2.2.2.2.

1) The stub resolver (a software program running on the client computer) makes a request to the assigned local DNS server, which in this example is in the client’s Internet Service Provider (ISP) DNS server farm in Atlanta, Georgia. The client must receive either an answer, or an error. This is called a “recursive” query. Note: the stub resolver program is not capable of “digging” through the Internet to find the answer. That is the job of a DNS server.

2) The client’s DNS server performs an “iterative” resolution on behalf of the client, querying the root name servers and eventually ending up at the authoritative name server for www.trapster.net. In this case the GSLB device is that authoritative name server.

3) The GSLB device performs some sort of communications with software or devices at each site, gathering information such as site health, number of connections, and response time.

4) The software or device at each site optionally performs some sort of dynamic performance measurement, such as a round trip time (RTT), or topographical footrace, or BGP hop count, back to the client’s DNS server.

5) Using the information gathered in steps 3 and 4, the GSLB device makes a determination as to the preferred site, and returns the answer to the client’s DNS server. The answer is either IP address 1.1.1.1 or IP address 2.2.2.2. If the time to live (TTL) is not set to zero, the answer is cached at the client’s DNS server, so that other clients that share the server will make use of the previous calculation (and not repeat steps 2 through 4).

6) The DNS answer is returned to the client’s stub resolver.

After DNS resolution is complete, the client makes a TCP connection to the preferred site.

The issues with topographical path measurement at DNS time

Problem 1: Clients are often not topographically “close” to their caching nameservers

At no point in the resolution process (above) is the GSLB device able to determine the IP address or location of the actual client. This is because the only device that communicates directly with the GSLB is the client’s caching nameserver. The DNS based GSLB device must assume that the client is topographically “close” to its assigned caching nameserver. Unfortunately, that’s a bad assumption.

ISPs and caching nameservers

The diagram below shows an example with the locations of two clients and their respective caching nameservers:

The author is located in Carlsbad, CA. The author’s client normally is assigned to a primary DNS server (caching nameserver) which is located about 20 miles away, in San Diego, CA. The author’s family member also lives in Carlsbad, but uses a service that hosts all of its caching nameservers in Atlanta, GA, about 2000 miles away. This example speaks in terms of geographic proximity (rather than topographic proximity), but nevertheless shows the variation in actual associations.

The Web links in the “Related material” section below site specific research regarding proximity between caching nameservers and clients, showing a weak correlation. Here are some quotes from the papers:

“the average distance over all pairs was 7:6 hops, with a median of 8. Some clients were

as far as 15 hops from their nameservers. The average client-to-nameserver round-trip latency was 234 ms, though this was dominated by the average first-hop latency which was 188 ms. These results show that even when considering direct distances, clients and nameservers are often topologically quite far apart.”

“In general, the correlation between nameserver latency and actual

client latency was quite low.”

VPNs and caching nameservers

The diagram below shows a topology for a remote worker in a Fortune 500 company:

As a remote worker, the client, based in San Diego, CA, typically selects Santa Clara as his preferred VPN termination site, however the VPN equipment at the Santa Clara site is often at maximum connection capacity, and rejects the connection. The user is therefore forced to select an alternative site. Sometimes he selects the site in Texas, sometimes the site in Massachusetts. The selection process is manual, therefore the user continues to use the selected site until either the previously selected site becomes overloaded, or he opts to switch back to the Santa Clara site or some other site.

In this example, the user VPNs to the Massachusetts site, and through DHCP, is dynamically assigned an IP address, and primary, and secondary DNS servers.

Most VPN client software supports something called “split tunneling”. Split tunneling allows traffic that is not required to flow over the VPN to go directly to the Internet via the client’s ISP. This reduces load on corporate VPN equipment. With split tunneling, however, DNS requests go over the VPN connection regardless of whether the request is for a resource in the corporate network, or on the greater Internet.

Back to the example, the user should access the fictitious site www.trapster.net at Site A, Los Angeles.

The diagram above has the same steps as those in the diagram from the above section “Overview of DNS Resolution With GLSB” (and are therefore not enumerated). The purpose of this diagram is to show that the topographical proximity calculation will likely direct the client to Site B in New York, when in fact the client should have been directed to Site A in Los Angeles.

Now, how important is this issue? Reports show that the majority of e-commerce and online financial business (and even pornographic site access!) happens during work hours. Given the large number of telecommuting workers, and the fact that workers usually leave VPN connections open continuously, this issue is of utmost importance.

The VPN issue is fundamentally the same as the client-nameserver proximity issue discussed in the previous section, and the papers in the “Related material” section below.

It’s not just about performance

Dynamic proximity measurement is primarily intended to select a site that is topographically closest. The reason is to allow the shortest, fastest, path, to improve the user experience. As mentioned above, the path to the actual client is often significantly different than the path to the caching nameserver:

Again on the topic of high availability, in the case of a widespread outage (power outage, hacker attack on routers, etc.) it is entirely possible that although a path exists between a given datacenter and a caching nameserver, no route exists between the actual client and that datacenter. In this example, if the GSLB device selects Site B, the San Diego client may not be able to connect at all (when in fact the client could have had a healthy path to Site A).

Problem 2: Calculation of proximity can itself significantly degrade the user experience

The fact that the actual client is topographically close to its caching nameserver at one point it time does not imply that it will be topographically close a few minutes later. The very nature of the Internet is that routes frequently change, and traffic is routed around congestion. For this reason, if GSLB is to be useful, samples must be taken at frequent intervals.

The frequency of GSLB sampling is controlled by the TTL, as configured on the GSLB device. GSLB manufacturer recommendations vary as to the recommended (or default) TTL setting. Some GSLB manufacturers recommend a low TTL (such as 10 or 20 seconds), and some recommend (and default to) a TTL of zero seconds. The problem with short TTLs is that they slow the overall performance of DNS. In fact, sites that set TTLs to low values are considered bad DNS citizens.

Sites that host content on different FQDNs (such as those that use Content Delivery Networks, or CDNs) are most significantly impacted, as a DNS lookup is required for each FQDN.

The research papers in the “Related material” below site extensive research on the impact of low TTLs.

Problem 3: Mismatched expectations

Most people know that DNS based proximity is not “perfect”, but the extent of the imperfections is not well known. For example, one of the myths about GSLB proximity is that it will reliably select at least to the correct continent, however even this is often not true.

Static proximity

Many GSLB devices support static site preferences based on the Internet Assigned Number Authority (IANA) tables. At one point the IANA tables could be used to predict which continent an IP address might reside on, but over time the reassignment of IP addresses by ISPs has made this method unreliable.

Geotargeting

There are several services that can be subscribed to, which keep fairly accurate lists of IP address to geography correlations. These services use a combination of information sources such as active probes on common dial up services, and relationships with ISPs that are willing to share information related to IP address assignment and reassignment on a regular basis. While such services are often quite accurate, they are subject to the same issues mentioned above if the information is to be used at DNS GSLB site selection time. The following diagram will show why:

1) The client makes a DNS request for the IP address of www.trapster.net

2) The client’s caching nameserver performs an iterative resolution, ending up at the GSLB device (which is acting as the authoritative nameserver for www.trapster.net).

3) The GSLB device is integrated with a geotargeting database/service, and passes the IP address of the caching nameserver (3.3.3.3) to that service. The geotargeting service correctly identifies the location of IP address 3.3.3.3 as Atlanta, Georgia, US, zip code 30301, and returns the IP address (or some related association) of the site that is geographically closest to Atlanta.

4) In this case, the IP address 2.2.2.2 is returned to the caching nameserver.

5) The caching nameserver passes the answer to the client.

6) The client connects to the site that is closest to Atlanta.

While the service itself may correctly identify the location of a client’s caching nameserver, that result will likely have little or no correlation to the location of the actual client. Clearly this is an issue if the intent is to perform some specific operation such as target marketing (e.g. provide an advertisement that is relevant for a particular ZIP code).

Note: In addition to collecting information about the geographic location of IP addresses, some of these services collect information about the proximity between caching nameservers and their respective clients. In the example above, the geotargeting service may recognize that the caching nameserver with IP address 3.3.3.3 is not a regional nameserver, and therefore that proximity to the actual client can not be determined with a reasonable level of accuracy. In this case it could inform the GSLB to essentially “punt” on the geotargeting operation, but that does not in any way solve the issue at hand. The papers in the “Related material” section will show that if such intelligence were accurately implemented, the GSLB solution would end up “punting” more often than not.

To be sure, geotargeting services can be quite accurate, and can produce useful results if used outside of the context of DNS based GSLB. This section pertains only to the combination of geotargeting and DNS based GSLB.

Surprising results on path performance

Given the size and performance of the modern Internet backbone, the topographically closest or otherwise best path may in fact be quite distant. For example, one marquee Internet site has data centers throughout the World. After some extensive testing, it was determined that clients originating from AOL’s Internet connection in Virginia were topographically closer to the London datacenter than to any of the US datacenters!

Now, from one perspective, if AOL Virginia clients are topographically closer to London, all other things being equal, it makes sense to just allow those users to access the London datacenter. That would provide the best performance and user experience.

There may, however, be more factors to consider than path-related performance. Content providers will often build out datacenters for capacity in line with the percentage of their users on that continent. From the example above, if the London datacenter was not built out with the expectation that much or all of AOL/Virginia would be accessing it, the content provider may be a victim of mismatched expectations. Also, content providers often put different variations of content at different datacenters, such as different advertisements, and different news links, expecting that users will be accessing a site that is geographically close – at least on the same continent.

Related material

A research paper and associated slide presentation from IBM:

http://www.ieee-infocom.org/2001/paper/806.pdf

http://www.research.ibm.com/people/a/aashaikh/slides/infocom01-slides.pdf

Another presentation:

http://sahara.cs.berkeley.edu/jan2002-retreat/morley-poster1.ppt