Downes.ca ~ Content Delivery Networks

Content Delivery Networks

Jan 12, 2010
By Stephen Downes

Originally posted on Half an Hour, January 12, 2010.

Summary of a tutorial workshop from Bruce Maggs (Duke University and Akamai).

Services and Design

The basic service to the first customers was the provision of access to static content to websites. For example, cbs.com used Akamai to provide static images for the website. Also, a lot of static objects are delivered to web pages â€“ they donâ€™t appear on web pages. For example, software updates for Microsoft, or virus updates for Symantic. Yahoo uses Akamai to run their DNS service. Apple uses Akamai for realtime streaming of Quicktime. And the FBI uses Akamai to deliver all its content on fbi.gov through Akamai, so thereâ€™s no way to access the FBI servers through the site.

Today Akamai has roughly 65,000 servers (was 40,000 servers last summer). There are 1450 points of presence (POPs), 950 networks or ISPs in 67 countries. The first round of servers were typically co-located with ISPs (thatâ€™s what OLDaily does too). But thatâ€™s expensive; once we got enough traffic we were able to co-locate in universities, ISP, etc to lower the bandwidth coming in. â€œWeâ€™ll serve only your users, but weâ€™re not going to pay for bandwidth, electricity, etc.â€

From 30,000 domains Akamai serves 1.1 terabytes per second, 6,419 terabytes per day. Thatâ€™s 274 billion hits per day to 274 million unique client addresses per day.

Static Content

The simplest service example occurs when xyz.com decides to outsource its image sourcing. To do this, it may change the URL of its images from www.xyz.com to ak.xyz.com. This results in a domain lookup process that points to the Akamai servers instead of xyz.comâ€™s servers. Image [1] (below) displayâ€™s the domain lookup process that happens. In essence, what happens is that the DNS process tells the system that ak.xyz.com is actually a212g.akamai.net (or some such thing, depending on how the load is distributed). These are called â€˜domain delegation responsesâ€™.

You may wonder how we get better performance by including all these steps. The answer is that they rarely perform all 16 steps. The DNS responses are cached, and the â€˜time to liveâ€™ tells the servers how long the data will be valid. At the early stages, or for much-used domains, the time-to-live may be a few minutes or even seconds. But the more stable, static time-to-live values may be several days.

So, the system maps the IP address of the clientâ€™s name server and the type of content being requested to an Akamai cluster (Akamai has clusters designed for specific types of content, eg., Quicktime video, etc; Akamai also has servers reserved for specific users in a community, eg., at a particular university). Where the content is not served by an â€˜Akamai Accelerated Network Partnerâ€™ the request is subjected to a more general â€˜core point analysisâ€™.

Note that this is based on the name server address, not the clientâ€™s address. It assumes that whatâ€™s good for the name server is good for the client. Some ISPs were serving all clients from centralized name servers; weâ€™ve asked them not to do this.

What if the user is using a centralized DNS, like Googleâ€™s (8.8.8.8)? Thereâ€™s a system called â€˜IPanycastâ€™ â€“ the idea is that while many clients name use the same NS IP address, the reality is that IP address is in many different locations â€“ my request to Googleâ€™s server will go to the closest Google NS, but Akamai will get some information about the location of that particular name server (NS).

Now â€“ how does Akamai pick a specific server within a cluster? Remember, the client is mapped to a cluster based on the clientâ€™s name server IP address. You need an algorithm to assign clients to specific servers based on specific type of content requested. Algorithms:
- stable marriage with multi-dimensional hierarchy constraints (for load balancing)
- consistent hashing (for content types)
Letâ€™s look first at consistent hashing.

Again, Akamai puts specific types of content on specific servers. The content needs to be spread around the servers, to put the same content type on the same server, and to allocate more server space to more popular (types of) content.

Hereâ€™s how the hashing works: you have a set U of web objects, each with a serial number, and you have a set of B buckets, where each bucket is a web server. So the function assigns h:U->B. For example, you might have a random allocation function, h(x)=(((a x+b) mod P) mod |B|), where P is prime and greater than |U|, a and b are chosen randomly, and where x is a serial number.

But this wonâ€™t work, because you have a difficulty changing the number of buckets. For example, a server might crash. In consistent hashing, instead of mapping objects directly to buckets, you map both objects and buckets to the unit circle. Then you assign the object to the next bucket in the unit circle. When a bucket is added or subtracted, the only objects affected are the objects mapped to the specific bucket that has changed.

One complication occurs with the use of multiple low-level DNS servers that act independently, since each DNS will have a slightly different view of the hash. The consistent hashing algorithm is sufficiently robust to handle this.

Properties of consistent hashing:
- balance â€“ objects are assigned to buckets random
- monotonicity â€“ when a bucket is added or removed, the only objects affected are those mapped to the bucket
- load â€“ objects are assigned to the buckets evenly over a set of views
- spread â€“ an object is mapped to a small number of buckets over a spread of views

Now, how itâ€™s really done is that each object has a different view of the unit circle. Then it is assigned to the first open server in the circle. They have different views because, if one goes down, you donâ€™t want to assign all the objects to the same new server. Rather, because each view of the unit circle is unique, each is assigned to a (potentially) different server.

Now, what about huge events, such as â€˜flash crowdsâ€™ (a huge crowd where there is no warning) or planned crowds (such as a software release)? For flash crowds, the system is designed so that, in either case, we donâ€™t do anything. Itâ€™s designed to be resilient.

Real-Time Streaming Content

How you pay for bandwidth: typically, you sign a â€™95.5 contractâ€™ â€“ you sign a contract saying that â€˜as measured at 95.5 percent of usage Iâ€™m going to pay so much per megabyteâ€™. But of course many people have less than 95.5 â€“ and you can play games with it, to get the 95.5 number lower (serve some content from elsewhere, eg.). Some huge events change the calculations â€“ big events were 9-11, Steve Jobs keynotes, the Obama inaguration (which doubled the normal peak traffic for the day).

Streaming media has, in the last few years, become the major source of bits that are delivered. Streaming media is becoming viable, itâ€™s happening, but slowly. This is mostly U.S.-based, but there are other locations â€“ Korea, for example. Some sports events have been huge â€“ world cup soccer, NCAA basketball games. The photos below show the network before and after the inauguration (yellow is slow, and red is dead). Everybodyâ€™s performance on the internet was degraded (Akamaiâ€™s belief is that itâ€™s capacity is as big as the internetâ€™s capacity).

There may be redundant data streaming into clusters, to improve fault-tolerance. The actual serving of the stream is typically through proprietary servers, such as the Quicktime server. Up until the proprietary servers, this is a format-agnostic server system. It just sends data â€“ there is very little error-checking or buffering, because it would create too much of a delay at the client end (eg., for stock market conference calls).

A study from 2004 â€“ 7 percent of streams were video, 71 percent were audio, and 22 percent one or the other (too close to tell by bitrate). There were tons of online radio stations, for example. Streams will often (24 percent of the time) create flash crowds, as shows start, for example. Audio streams (in 2004) often ran 24 hours, but none of the video streams did back then. An analysis of users showed that there was a lot of repeat viewership â€“ the number of new users should go down â€“ but it doesnâ€™t go down very fast. There was a lot of experimenting â€“ people would come and go very quickly. In almost all events, 50 percent of the clients show up one day only and never come back. But if you exclude the one-timers, most people watch for a long time, the full extent of the duration.

What makes for a quality stream? Here are criteria that were used: How often does it fail to start? How long does it take to start? How often does it stop and start up again? You could ask about packet loss, but this is a bit misleading â€“ you have to ask about packets that arrive on time to be used (some packets arrive and are therefore not lost but arrive too late to be useful and are thrown away).

FirstPoint DNS

This is the service that is used to manage the traffic into mirrored websites. Itâ€™s a DNS service, eg., for Yahoo. We field the request to send people, eg. To the east coast to the west coast. It directs the browser to the optimal mirror. It may be only two mirrors (which is actually hard to figure out) or as many as 1500. Today, content providers typically want to manage their own servers, and only offload embedded content. This creates a mapping problem: how to direct requests to servers to optimize end-user experience. You want to reduce latency (especially for small objects), reduce loss, and reduce jitter.

So, how do it? We could measure â€˜closenessâ€™, but this changes all the time. We could measure latency, frequently. But you get a lot of complaints, and thereâ€™s too many clients to ping. What they do: on any given day there are 500,000 distinct name server requests on any given day (not individual web servers, higher level requests; eg., university asks Akamai, whereâ€™s Yahoo?). Thatâ€™s way too many to measure.

There are two major approaches:

First, network topology. This is basically a way of creating a map of the internet. Topologies are relatively static, changing in BGP (border gateway protocol - itâ€™s basically a way of the systems to manage rtoutes, exchanging route information to each other) time.

Second, congestion, This is much more dynamic. You have to measure changes in round-trip time in order of milliseconds. So we do accurate measurements to intermediary point. You donâ€™t measure all the way to the end, because any differences are really out of anyoneâ€™s control â€“ we can only control different route to â€˜proxy pointsâ€™ (or â€˜core pointsâ€™). The 500,000 name servers reduce to about 90,000 proxy points. Thatâ€™s still too many, but at least theyâ€™re major routers.

Now if we look at these name servers and sort them according to how many requests we get, if we look at 7,000 of them we are getting 95 percent of all requests from them. So we ping these most often. 30,000 proxy points results in 98.8 percent of every request these are pinged every 6 minutes. We use latency and packet loss to guide the algorithm. We also use other data in mapping â€“ about 800 BGP feeds, traceroute to 1,000,000 per week, constraint-based geolocation, and other data.

SiteShield

There have been attacks that have shut down servers. Siteshield redirects traffic from affected servers. Eg. Yahoo was down once on the east coast, and didnâ€™t know until we told them.

What we want to do is prevent DDOS attacks â€“ this is where the attackers take over innocent user computers (called â€˜xombiesâ€™) which then launch the attack on the service. Akami basically stands in between the content provider and the attacker â€“ this way the providerâ€™s IP address is shielded, and the attacked server can be swapped out. Akamai has a lot of capacity to handle flash crowds, etc., and so can do the load balancing and can resurrect crashed servers very quickly.

Dealing With Failure

Wherever possible try to build in redundancy, decentralization, self-assessment, fail-over at multiple levels, and robust algorithms.

At the OS level, Akamai started with Red Hat Linux and, over time, have built in Linux performance optimizations and eventually created a â€˜secure OSâ€™ derived from 2003 Linux that is â€˜battle hardenedâ€™. Windows was later added for the Windows content, because Windows Media Server runs on no other platform.

To optimize security on the server: disk and disk cached are managed directly. The network kernel is optimized for short transactions. Services are run in user mode as much as possible. And the only way to get into the machine is through ssh.

Akamai relies a lot on GNU (GPL licensed) software. Akamai only runs the code on its own machines. The license says the act of running the program is not restricted â€“ if you want to distribute the program you have to reveal your own code, but they never do that; they never reveal their own code. (Akamai is close to the line in terms of what they are doing with GPL.)

So what kind of failures are there? Hardware failures, network failures, software failures, configuration failure, misperceptions, and attacks. Of these, configuration failures can be the most severe. Hardware and network you expect to fail, and you build around them. Software can be a problem, because itâ€™s difficult to debug, and it is really easy to get feature creep in your system, so that youâ€™re configuring it in real-time.

Harware can fail for a variety of reasons. The simplest way to respond to it is the buddy system â€“ each server in a cluster is buddied with another. You can also suspend an entire cluster if more than one fails. To recover from a hardware failure, you try to restart, and if it doesnâ€™t work, you replace the server.

Network failures result mostly from congestion and connectivity problems. This was discussed above; it is dealt with basically by pinging the proxy points.

For software, an engineering methodology is adopted, Everything is coded in C, complied with gcc. There is a reliance on open source code. There are large distributed testing systems, and there is an â€˜invisible systemâ€™ burned in â€“ that is, you try the system without any customers on it to see whether it causes crashes, etc. (itâ€™s sort of like a shadow system, it uses real data, but doesnâ€™t actually serve customers). The rollout is staged â€“ first you start with the ghost shadow system, then to a few customers, then system-wide. The software is always backward-compatible with the previous version. In principle itâ€™s possible to roll back to a previous version, but in practice itâ€™s not a good idea, because youâ€™d have completely clean the server.

But â€“ this, which is state of the art, is still not good enough. Itâ€™s still fraught with risk. But there isnâ€™t a better answer yet. There are maybe some long term solutions â€“ safer languages, code you can â€˜proveâ€™ is correct, etc. You can prove some things, face the halting problem, but there are some things that are still complex.

Perceived failures â€“ a lot of people see Akamai in their firewalls and think itâ€™s attacking them, because they never requested anything from Akamai. Other misperceptions are created from reporting software, customer-side problems, and third-party measurements (eg. Keynote (http://www.keynote.com), which was swamping servers with too many tests â€“ and now there are servers set up close to keynote for speed measurement, and keynote was told their own servers are overloaded). Or â€“ there was another case where the newspaperâ€™s own internet access was down, and they couldnâ€™t receive Akamai-served content.

Attacks â€“ are usually intended to simply overwhelm a site with too many requests. Hacker-based attacks are usually from individuals. Sometimes there are weird hybrid attacks where volunteers manually click on the websiet (eg. To attack the World Economic Forum (WEF)). Maggs summarized some attacks â€“ the most interesting was a BGP attack, because when the BGP link is broken, they flush all data and reloaded the addresses from scratch, which creates a huge increase in traffic.

War Stories

The Packet of Death â€“ a router in Malaysia was taking servers down. Servers negotiate the â€˜maximum sized packetâ€™ that can be sent. But someone had configured a server in Malaysia that chopped servers at some arbitrary number â€“ and the Linux server had a bug that failed on exactly that number. But, of course, every time a server went down, a new server was used â€“ eventually sending every server to that router in Malaysia!

Lost in Space â€“ a server started receiving authenticated but improperly formatted packets from an unknown host. They were being discarded because they were unformatted. It turns out, the â€˜attackâ€™ was coming from an old server that had been discarded and was trying to come back into the network. We probably have servers in our rack somewhere that have been running for 10 years that we donâ€™t know about. (Of course, this does raise the question of where you keep your secret data, like keys, if servers can be â€˜lostâ€™).

Steve Canâ€™t See the New Powerbook â€“ a certain â€˜Steveâ€™ who is very famous had a problem â€“ his assistance Eddie explained that Steveâ€™s new computer canâ€™t see the pictures. Went through the logs, found no evidence he had tried to access the images. Eddie snuck into Steveâ€™s office to try to access the images. No image appeared, no request was sent. Eddie was â€˜not allowedâ€™ to tell what OS and browser were being used (Safari on OS X). It turned out that the Akamai urls were so long the web designer put in a line feed. Browsers like Internet Explorer compensate for this, but the new browser didnâ€™t.

David is a Night Owl â€“ one person would start doing experiments at 1:00 a.m. pacific time, and would send a message (at about 4:00 eastern) saying that the servers arenâ€™t responding. But he was using the actual IP address. He asks, â€œwhy donâ€™t you support half-closed connectionsâ€. It will be out in two weeks. What about â€œtransactional TCPâ€? We will not support transactional TCP â€“ because it starts and finishes the transaction in a single packet. By sending one packet to a server you can blast data and spoof your source address.

The Magg Syndrome â€“ got a call one day saying Akamai was â€˜hijackingâ€™ a website. â€œI became the most hated person on the internet.â€ It was when people went to Gomez.com they got the Akamai server, and they would get an Akamai error page. It was a major problem, because Gomez was covering the internet boom and was popular. â€œWe were getting 100,000 requests a day for websites that have nothing to do with us.â€ First â€“ debrand the website (changed to weberrors.com). Then track where people were trying to go. The number kept going up and up and up. The problem was â€“ when you issue a DNS request to your local server, you provide the name you want resolved, an identifier for your request, and the port where you want to receive the answer. All the problems were happening on one (the most popular) operating system â€“ the operating system wasnâ€™t checking that the name in the answer was the name in the request.

Whatâ€™s Coming

Itâ€™s now possible to run websphere applications on Akamai servers. Most customers, though, depend on a backend database, which is not provided by Akamai. So there needs to be a way to provide database caching at the server level.

Physical security is becoming more of an issue. This is increasing some costs.

There is also the issue of isolation between customer applications (ie., keeping customer applications separate from each other). Akamai is not providing virtual machines, not providing a general purpose plan (so everyone is running under the same server).

Energy management is also an increasing issue. The servers consume more energy than everything else in the company combined. It would be nice to save energy, but thatâ€™s difficult, because people are reluctant to turn a server off when itâ€™s idle â€“ it may take time to restart, it may be missing updates, it creates wear and tear on a server (from temperature changes). But even if Akamai canâ€™t save energy, it can save money on energy or use greener sources of energy. Eg. You can switch to servers that are using off-peak energy, etc., or adapt server load to exploit spot-energy prices in the U.S.