Pete Tenereillo
4/9/04
Copyright Tenereillo, Inc. 2004
As Part 1 of this paper shows, DNS based Global Server Load Balancing (GSLB) does not work for browser based clients if combined with the common practice of returning multiple A records. Part 2 sets aside the issue of multiple A records for high availability, and covers other pitfalls associated with DNS based GSLB.
The following diagram gives an overview of DNS resolution as used with GSLB devices. (If you’ve already read the first document, “Why DNS Based GSLB Doesn’t Work”, please skip ahead, as this section is identical).
The fictitious site www.trapster.net is hosted at two datacenters, Site A and Site B.
1) The stub resolver (a software program running on the client computer) makes a request to its assigned local DNS server, which in this example is in the client's Internet Service Provider (ISP) DNS server farm.
2) The client’s DNS server performs an “iterative” resolution on behalf of the client, querying the root name servers and eventually ending up at the authoritative name server for www.trapster.net. In this case the GSLB device is that authoritative name server.
3) The GSLB device performs some sort of communications with software or devices at each site, gathering information such as site health, number of connections, and response time.
4) The software or device at each site optionally performs some sort of dynamic performance measurement, such as a round trip time (RTT), or topographical footrace, or BGP hop count, back to the client’s DNS server.
5) Using the information gathered in steps 3 and 4, the GSLB device makes a determination as to the preferred site, and returns the answer to the client’s DNS server. The answer is either IP address 1.1.1.1 or IP address 2.2.2.2. If the time to live (TTL) is not set to zero, the answer is cached at the client’s DNS server, so that other clients that share the server will make use of the previous calculation (and not repeat steps 2 through 4).
6) The DNS answer is returned to the client’s stub resolver.
After DNS resolution is complete, the client makes a TCP connection to the preferred site.
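The site-selection logic in steps 3 through 5 can be sketched in a few lines. The following is a hypothetical simulation only: the site names, metrics, and weighting are illustrative assumptions, not any vendor's actual algorithm. The key point to notice is which address the RTT is measured against.

```python
# Hypothetical sketch of GSLB site selection (steps 3-5 above).
# Metrics and weighting are illustrative, not any vendor's algorithm.

SITES = {
    "Site A": {"ip": "1.1.1.1", "healthy": True, "connections": 4200, "rtt_ms": 35},
    "Site B": {"ip": "2.2.2.2", "healthy": True, "connections": 1800, "rtt_ms": 90},
}

def select_site(sites):
    """Return the A record for the 'preferred' site.

    Note: rtt_ms is measured to the client's caching nameserver,
    NOT to the actual client -- the core problem this paper describes.
    """
    candidates = [s for s in sites.values() if s["healthy"]]
    if not candidates:
        raise RuntimeError("no healthy site")

    def score(site):
        # Lower score is better: weigh RTT heavily, load lightly.
        return site["rtt_ms"] + site["connections"] / 1000.0

    return min(candidates, key=score)["ip"]

print(select_site(SITES))  # -> 1.1.1.1 (Site A: lower RTT to the nameserver)
```

Whatever the weighting, the answer reflects measurements to the caching nameserver, which is why the assumption discussed below matters so much.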
At no point in the resolution process (above) is the GSLB device able to determine the IP address or location of the actual client. This is because the only device that communicates directly with the GSLB is the client’s caching nameserver. The DNS based GSLB device must assume that the client is topographically “close” to its assigned caching nameserver. Unfortunately, that’s a bad assumption.
The diagram below shows an example with the locations of two clients and their respective caching nameservers:
The author is located in Carlsbad, CA. The author's client is normally assigned a primary DNS server (caching nameserver) located about 20 miles away, in San Diego, CA. The author's family member lives in the same area, yet is assigned a caching nameserver that is much farther away.
The Web links in the “Related material” section below cite specific research regarding proximity between caching nameservers and clients, showing a weak correlation. Here are some quotes from the papers:
“the average distance over all pairs was 7.6 hops, with a median of 8. Some clients were as far as 15 hops from their nameservers. The average client-to-nameserver round-trip latency was 234 ms, though this was dominated by the average first-hop latency which was 188 ms. These results show that even when considering direct distances, clients and nameservers are often topologically quite far apart.”
“In general, the correlation between nameserver latency and actual
client latency was quite low.”
The diagram below shows a topology for a remote worker in a Fortune 500 company:
As a remote worker, the client is based in one city, far from the corporate headquarters. In this example, the user establishes a VPN connection to the corporate headquarters.
Most VPN client software supports something called “split tunneling,” in which only traffic destined for the corporate network travels over the VPN tunnel, while ordinary Internet traffic exits locally. For security reasons, however, many companies disable split tunneling, so all of the remote worker's traffic, including DNS queries and Web traffic, enters the Internet at the corporate headquarters.
Back to the example: the user should access the fictitious site www.trapster.net at Site A, the site nearest the actual client's location.
The diagram above follows the same steps as the diagram in the section “Overview of DNS Resolution With GSLB” above (the steps are therefore not enumerated). The purpose of this diagram is to show that the topographical proximity calculation will likely direct the client to Site B, near the corporate headquarters and its caching nameserver, rather than to Site A, which is closest to the actual client.
Now, how important is this issue? Reports show that the majority of e-commerce and online financial business (and even pornographic site access!) happens during work hours. Given the large number of telecommuting workers, and the fact that workers usually leave VPN connections open continuously, this issue is of utmost importance.
The VPN issue is fundamentally the same as the client-nameserver proximity issue discussed in the previous section, and the papers in the “Related material” section below.
Dynamic proximity measurement is primarily intended to select the topographically closest site, so that the shortest, fastest path improves the user experience. As mentioned above, the path to the actual client is often significantly different from the path to the caching nameserver:
Again on the topic of high availability: in the case of a widespread outage (power outage, hacker attack on routers, etc.) it is entirely possible that although a path exists between a given datacenter and a caching nameserver, no route exists between the actual client and that datacenter. In this example, if the GSLB device selects Site B based on measurements to the caching nameserver, the actual client may be unable to reach Site B at all.
The fact that the actual client is topographically close to its caching nameserver at one point in time does not imply that it will be topographically close a few minutes later. The very nature of the Internet is that routes frequently change, and traffic is routed around congestion. For this reason, if GSLB is to be useful, samples must be taken at frequent intervals.
The frequency of GSLB sampling is controlled by the TTL, as configured on the GSLB device. GSLB manufacturers vary in their recommended (and default) TTL settings. Some recommend a low TTL (such as 10 or 20 seconds), and some recommend (and default to) a TTL of zero seconds. The problem with short TTLs is that they slow the overall performance of DNS. In fact, sites that set TTLs to low values are considered bad DNS citizens.
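The trade-off between TTL and sampling frequency can be illustrated with a toy caching resolver. This is a simplified sketch (real resolvers add negative caching, prefetching, and other behavior): with a nonzero TTL, all clients sharing the nameserver reuse one cached answer; with a TTL of zero, every request repeats the full authoritative lookup.

```python
# Toy caching nameserver, showing how the GSLB-configured TTL
# controls how often the authoritative lookup (steps 2-4) repeats.

class CachingResolver:
    def __init__(self):
        self.cache = {}              # name -> (ip, expires_at)
        self.authoritative_queries = 0

    def _authoritative_lookup(self, name):
        # Stands in for iterative resolution ending at the GSLB device,
        # which performs its site-selection calculation on each query.
        self.authoritative_queries += 1
        return "1.1.1.1"

    def resolve(self, name, now, ttl):
        cached = self.cache.get(name)
        if cached and now < cached[1]:
            return cached[0]         # served from cache; GSLB not consulted
        ip = self._authoritative_lookup(name)
        if ttl > 0:
            self.cache[name] = (ip, now + ttl)
        return ip

r = CachingResolver()
for t in range(10):                  # ten requests, one per second
    r.resolve("www.trapster.net", now=t, ttl=20)
print(r.authoritative_queries)       # -> 1  (one lookup, cached for 20 s)

r0 = CachingResolver()
for t in range(10):
    r0.resolve("www.trapster.net", now=t, ttl=0)
print(r0.authoritative_queries)      # -> 10 (TTL 0: every request repeated)
```

A TTL of zero gives the GSLB device fresh samples, but at the cost of a full resolution round trip on every single client request.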
Sites that host content on different FQDNs (such as those that use Content Delivery Networks, or CDNs) are most significantly impacted, as a DNS lookup is required for each FQDN.
The research papers in the “Related material” section below cite extensive research on the impact of low TTLs.
Most people know that DNS based proximity is not “perfect,” but the extent of the imperfections is not well known. For example, one myth about GSLB proximity is that it will reliably select at least the correct continent, yet even this is often not true.
Many GSLB devices support static site preferences based on the Internet Assigned Number Authority (IANA) tables. At one point the IANA tables could be used to predict which continent an IP address might reside on, but over time the reassignment of IP addresses by ISPs has made this method unreliable.
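A static continent mapping of this sort amounts to a prefix table keyed on address blocks. The sketch below uses made-up example blocks (not real IANA data) and shows both the mechanism and its weakness: once a block is reassigned to an ISP on another continent, the table silently returns the wrong answer.

```python
import ipaddress

# Hypothetical static table mapping address blocks to continents,
# in the spirit of the old IANA allocation tables. Blocks are made up.
STATIC_TABLE = [
    (ipaddress.ip_network("24.0.0.0/8"),  "North America"),
    (ipaddress.ip_network("62.0.0.0/8"),  "Europe"),
    (ipaddress.ip_network("202.0.0.0/7"), "Asia-Pacific"),
]

def continent_of(ip):
    addr = ipaddress.ip_address(ip)
    for net, continent in STATIC_TABLE:
        if addr in net:
            return continent
    return "unknown"

print(continent_of("62.4.5.6"))  # -> Europe; correct only for as long as
                                 # the /8 is still assigned as the table says
```

The table itself never changes, so every reassignment of address space after the table was built is an error waiting to happen.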
There are several services that can be subscribed to, which keep fairly accurate lists of IP address to geography correlations. These services use a combination of information sources such as active probes on common dial up services, and relationships with ISPs that are willing to share information related to IP address assignment and reassignment on a regular basis. While such services are often quite accurate, they are subject to the same issues mentioned above if the information is to be used at DNS GSLB site selection time. The following diagram will show why:
1) The client makes a DNS request for the IP address of www.trapster.net
2) The client’s caching nameserver performs an iterative resolution, ending up at the GSLB device (which is acting as the authoritative nameserver for www.trapster.net).
3) The GSLB device is integrated with a geotargeting database/service, and passes the IP address of the caching nameserver (3.3.3.3) to that service. The geotargeting service correctly identifies the location of IP address 3.3.3.3 as Atlanta, Georgia, US, zip code 30301, and returns the IP address (or some related association) of the site that is geographically closest to Atlanta.
4) In this case, the IP address 2.2.2.2 is returned to the caching nameserver.
5) The caching nameserver passes the answer to the client.
6) The client connects to the site that is closest to the caching nameserver's location (Atlanta), regardless of where the actual client is located.
While the service itself may correctly identify the location of a client’s caching nameserver, that result will likely have little or no correlation to the location of the actual client. Clearly this is an issue if the intent is to perform some specific operation such as target marketing (e.g. provide an advertisement that is relevant for a particular ZIP code).
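The disconnect can be made concrete with a small sketch. The lookup below is keyed on whatever source IP address the GSLB device sees on the DNS query, which is the caching nameserver's address, so the client's true location never enters the computation. The database entries and client address are hypothetical.

```python
# Hypothetical geotargeting lookup as seen from the GSLB device.
# Entries are illustrative; a real service holds millions of records.
GEO_DB = {
    "3.3.3.3": ("Atlanta, GA, US", "30301"),   # the caching nameserver
    "5.5.5.5": ("Seattle, WA, US", "98101"),   # the actual client
}

def gslb_geotarget(dns_query_source_ip):
    # The GSLB only ever sees the source of the DNS query -- the
    # caching nameserver -- never the client hidden behind it.
    return GEO_DB.get(dns_query_source_ip, ("unknown", None))

caching_nameserver_ip = "3.3.3.3"   # what the GSLB actually receives

location, zip_code = gslb_geotarget(caching_nameserver_ip)
print(location)  # -> Atlanta, GA, US: accurate for the nameserver,
                 # irrelevant to a client sitting in Seattle
```

However precise the database, the function is answering the wrong question: where the nameserver is, not where the client is.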
Note: In addition to collecting information about the geographic location of IP addresses, some of these services collect information about the proximity between caching nameservers and their respective clients. In the example above, the geotargeting service may recognize that the caching nameserver with IP address 3.3.3.3 is not a regional nameserver, and therefore that proximity to the actual client can not be determined with a reasonable level of accuracy. In this case it could inform the GSLB to essentially “punt” on the geotargeting operation, but that does not in any way solve the issue at hand. The papers in the “Related material” section will show that if such intelligence were accurately implemented, the GSLB solution would end up “punting” more often than not.
To be sure, geotargeting services can be quite accurate, and can produce useful results if used outside of the context of DNS based GSLB. This section pertains only to the combination of geotargeting and DNS based GSLB.
Given the size and performance of the modern Internet backbone, the topographically closest or otherwise best path may in fact be quite distant. For example, one marquee Internet site has datacenters throughout the world. After some extensive testing, it was determined that clients originating from AOL's Internet connection in Virginia were best served by a datacenter on another continent.
Now, from one perspective, if AOL Virginia clients are topographically closer to that distant datacenter, directing them there is arguably exactly the desired behavior.
There may, however, be more factors to consider than path-related performance. Content providers will often build out datacenter capacity in line with the percentage of their users on that continent. From the example above, if the distant datacenter was sized for the users on its own continent, directing the large AOL client population to it could overwhelm its capacity.
Related material
A research paper and associated slide presentation from IBM:
http://www.ieee-infocom.org/2001/paper/806.pdf
http://www.research.ibm.com/people/a/aashaikh/slides/infocom01-slides.pdf
Another presentation:
http://sahara.cs.berkeley.edu/jan2002-retreat/morley-poster1.ppt