Consider the case of DNS (the Domain Name System). This innocuous-seeming protocol, which merely translates hostnames (like www.csoandy.com) into IP addresses (like 96.17.149.33), is such an important foundation of the Internet that, without it, humans would have had a serious challenge building and engaging meaningfully with such a vast distributed system. The web as we know it would not exist without DNS. The beauty of DNS is in its simplicity: you ask a question, and you get a response.
But if you can’t get a response, then nothing else works — your web browser can’t load up news for you, your email client can’t fetch email, and your music player can’t download the latest songs. It’s important to ensure the DNS system is highly available; counterintuitively, we built it assuming that its components are likely to fail. And from that we can learn how to build other survivable infrastructures.
First, DNS primarily uses UDP, rather than TCP, for transport. UDP (the User Datagram Protocol, often nicknamed the Unreliable Datagram Protocol) is more like sending smoke signals than having a conversation; the sender has no idea if their message reached the recipient. A client sends a UDP query to a name server; the name server sends back a UDP answer (had this been implemented in TCP, the conversation would instead have taken several round trips, since the connection must be set up before the question can even be asked).
Because DNS gets no reliability guarantee from UDP (and, frankly, the guarantee that TCP would have provided isn’t worth much, though it does at least notify the sender of failure), the implementations had to assume, correctly, that failure would happen, and happen regularly. So failure was planned for. If the client does not get a response within a set time window, it will try again, and on the retry it may query a different server IP address. Because the DNS query and its response each fit within a single packet, there is no need for server stickiness.
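The timeout-and-retry behavior described above can be sketched in a few lines of Python. This is a simplified illustration, not how any particular resolver is written: the names `send_udp` and `query_with_failover`, the two-second timeout, and the two rounds of attempts are all assumptions, and real clients use their own schedules.

```python
import socket


def send_udp(query: bytes, server: str, timeout: float, port: int = 53) -> bytes:
    """Send one datagram and wait for one reply; raises on silence."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(query, (server, port))
        # A real client would also check that the reply ID matches the query.
        data, _ = sock.recvfrom(4096)
        return data


def query_with_failover(query, servers, send=send_udp, timeout=2.0, rounds=2):
    """Try each server in turn; on timeout or error, move on to the next."""
    for _ in range(rounds):
        for server in servers:
            try:
                return server, send(query, server, timeout)
            except OSError:  # socket timeouts are a kind of OSError
                continue  # no state to clean up, so just ask someone else
    raise TimeoutError("no name server answered")
```

Because each attempt is a self-contained question, the retry can go to any server in the list; nothing about the failed attempt needs to be carried over.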
An array of DNS servers can be placed behind a single IP address with only simple stateless load balancing; no complex stateful load balancers are required (higher-end DNS systems can even use IP anycasting to have one IP address respond from multiple geographic regions, with no shared state between the sites). Clients can and do learn which servers are highly responsive, and preferentially use those.
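One plausible way for a client to learn which servers are responsive is to keep a smoothed response time per server and prefer the lowest. The sketch below is an assumption about how this could be done, not a description of any specific resolver; the class name `ServerSelector`, the smoothing factor, and the timeout penalty are all hypothetical.

```python
class ServerSelector:
    """Prefer the server with the lowest smoothed response time (sketch)."""

    def __init__(self, servers, alpha=0.3):
        self.alpha = alpha  # weight given to the newest measurement
        # Untried servers start at 0.0 so that each one gets tried early.
        self.srtt = {server: 0.0 for server in servers}

    def best(self):
        return min(self.srtt, key=self.srtt.get)

    def record(self, server, seconds):
        # Exponentially weighted moving average of observed response times.
        old = self.srtt[server]
        self.srtt[server] = (1 - self.alpha) * old + self.alpha * seconds

    def record_timeout(self, server, penalty=5.0):
        # Treat a lost query as a very slow answer, pushing the server down.
        self.record(server, penalty)
```

A server that stops answering accumulates penalties and drops out of favor, while a later fast answer pulls its average back down.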
DNS also has other means of reliability built in, like the TTL (time to live). This is a setting associated with every DNS response which indicates how long the response may be cached and treated as valid. A client holding a cached answer therefore does not need to repeat the query until the TTL expires; if a name server fails, a client may not notice for hours.
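A client-side cache that honors the TTL might look like the following Python sketch. The class name `TTLCache` and its injectable clock are assumptions made for illustration and for easy testing; real resolvers cache whole record sets rather than a single address per name.

```python
import time


class TTLCache:
    """Remember answers until their TTL runs out (illustrative sketch)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # hostname -> (address, expiry time)

    def put(self, name, address, ttl):
        self._entries[name] = (address, self._clock() + ttl)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        address, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[name]  # stale: the caller must query again
            return None
        return address
```

While an entry is live, `get` answers without touching the network, so a name server outage shorter than the TTL goes unnoticed by that client.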
On top of this failure-prone infrastructure — an unreliable transport mechanism, servers that might fail at any time, and an Internet that has an unfortunate tendency to lose packets — a highly survivable system begins to emerge, with total DNS outages a rare occurrence.