Lecture 5
- Welcome!
- Protocols
- Internet Protocol
- TCP
- Routing
- DNS
- DHCP
- UDP
- IP Address Exhaustion
- NAT
- IPv6
- HTTP
- URLs
- Components of a URL
- Scaling
- Cloud Computing
- Virtualization and Containerization
- High Availability
- Summing Up
Welcome!
- In our previous session, we explored artificial intelligence, learning about game playing, Minimax, reinforcement learning, neural networks, and transformers.
- This week, we will be exploring the Internet. Among our goals are to understand what it is from the bottom up and, ultimately, how we can build applications and services on top of it.
- The Internet has changed our lives in so many ways. Our communication is considerably better than it ever was before. It helps us learn, it helps us do business together, it helps us interact with our friends and family.
- We will be looking at a lot of the technologies that underlie the Internet and some of the design decisions made along the way that help bring the quality of the Internet from where it was 30 years ago to where it is today.
Protocols
- What are some other forms of electronic communication that brought us to this point? Initially, we had the telegraph, where we had electronic communications going over a wire, typically to a trained operator who would translate, interpret, and transcribe the message.
- Fast forward a little bit, we got the telephone. We still have electronic communications going over a wire, but now we’re able to actually hear audio and hear what the sender is saying.
- Moving forward, we got radio. Now we’re not necessarily using wires, we’re using airwaves, sending messages far and wide. But ultimately, some device takes that data and presents it to a human being who interprets the content.
- How is the Internet different? With the Internet, we have a brand new paradigm. The computers themselves are able to intercommunicate without necessarily needing a human being to interpret it in real-time. Computers can have conversations amongst themselves.
- How might that happen? We start with protocols. What is a protocol? In basic simple language, we have some constraints or conventions that are going to govern and structure some type of interaction.
- We have day-to-day examples of protocols. For example, meetings: When you see someone, you likely say hello, how are you? Culturally, we may have different examples: Do we shake hands, hug and kiss, do the fist bump or elbow bump? We’ve come up with different ways to greet one another.
- Dining: We may see that in formal occasions, there are expectations of where dishware or silverware may be laid out, whereas at home on a casual meal, we might not follow those same rules.
- When we address letters, different countries may have their own standards, but there’s a specific location where we’re expected to write down fields like address, postal code, city, and region.
- With electronic communications, this is largely the same, only a bit more specific. With human protocols, there’s room for ambiguity. We don’t always have to follow the rules. With computers, they aren’t as good with ambiguity, so we need to be more specific.
- We will look at machine-to-machine communication protocols that are more specific. We’ll examine IP (Internet Protocol), TCP, DNS, DHCP, HTTP, and other protocols. These are some of the important ones you use every day as you browse the Internet.
Internet Protocol
- What does Internet Protocol (IP) actually do? For us to send information from one machine to another, we need to be able to address or reference a specific machine.
- If I’m in a room with a few hundred people and I’d like to speak to one person, it would be very helpful if I knew their name. In human communications, two different people might have the same name, and there’s some ambiguity. With computers, this ambiguity doesn’t work, so we need more specificity.
- We’ve come up with a system in IP version 4 (IPv4), the first widely deployed version of the protocol, where we have 32 bits of information that uniquely identify a specific computer somewhere in the world.
- With that address, in theory, any computer with an address should be able to speak to any other computer with an address.
- Once they’re addressed, we still need to know how to actually get there. If we’re sending data, it’s not enough to say it’s going next door or across the street. If we want to get this across the world, we need some routing mechanism to transfer these pieces of information over a series of wires or airwaves to ultimately reach their destination.
- We also have a problem when we start sending bigger pieces of information over long distances. We don’t want one of those to fail and have to do the whole thing over again. So we break down our messages into smaller pieces, send those small pieces, and reassemble them on the other side.
- That way, if one piece fails, we only have to send that small piece again without resending the entire message. This allows for more safety and reliability in getting messages delivered.
- The IP protocol has data headers, fields of different types of data placed at the front of the content we’re sending. Every broken-down packet will have one of these headers.
- Because we agree in advance that these fields will always be in the same order and the same length, that structure is the definition of this protocol. A computer can compare this against the expectation and know what each field means without a human needing to get involved.
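The fixed-field idea above can be sketched in code. Below is a simplified parse of just the fixed 20-byte IPv4 header (real headers can carry options, which this sketch ignores), with a hand-built sample packet header rather than real network traffic:

```python
import struct

def parse_ipv4_header(data: bytes) -> dict:
    # First 20 bytes, always in this order: version/IHL, type of service,
    # total length, identification, flags/fragment offset, TTL, protocol,
    # checksum, source address, destination address.
    ver_ihl, tos, total_len, ident, flags_frag, ttl, proto, checksum, src, dst = \
        struct.unpack("!BBHHHBBH4s4s", data[:20])
    return {
        "version": ver_ihl >> 4,
        "header_length": (ver_ihl & 0x0F) * 4,    # in bytes
        "total_length": total_len,
        "ttl": ttl,
        "protocol": proto,                        # 6 = TCP, 17 = UDP
        "source": ".".join(str(b) for b in src),
        "destination": ".".join(str(b) for b in dst),
    }

# A hand-built sample header: version 4, 20-byte header, TTL 64, TCP,
# from 192.168.1.10 to an example destination.
sample = struct.pack("!BBHHHBBH4s4s", (4 << 4) | 5, 0, 40, 1, 0, 64, 6, 0,
                     bytes([192, 168, 1, 10]), bytes([93, 184, 216, 34]))
print(parse_ipv4_header(sample))
```

Because both sides agreed on the field order and sizes in advance, `struct.unpack` can recover every field mechanically, with no human interpretation needed.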
TCP
- We have a way to get small packets delivered to an address and we’ve broken messages into smaller pieces, but that can lead to problems.
- Let’s say I’d like to send the message “let’s meet for coffee tomorrow at noon.” If we break this down into smaller pieces, the recipient might receive “for lets coffee tomorrow at meet noon.” There’s no guarantee with IP that the order of messages will be received the same way they were sent.
- What if some packets don’t arrive at all? If I send “the buried treasure is beneath the flower pot” and you receive “the buried treasure is beneath the,” that’s a critical piece of missing information.
- Enter Transmission Control Protocol (TCP). This is an additional layer added to IP that allows these problems to be solved.
- First, it lets us order these packets. We give each a number and know how many packets were expected. So the recipient, even if they receive them out of order, can see this is seven of ten, this is three of ten, this is one of ten. Once all are received, they can be put back in order.
- There’s also a mechanism for packets to be resent if they didn’t arrive. The recipient will acknowledge each individual packet and send that acknowledgment back to the sender. Anything without an acknowledgment gets sent again.
- We also check for errors. We have mathematical trickery to make sure the data we receive matches the data sent. Otherwise, we might not know if the message was damaged along the way.
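That “mathematical trickery” is, in simplified form, the ones’-complement Internet checksum. This is a sketch of the core arithmetic only; real TCP also covers a pseudo-header and places the checksum at a fixed word-aligned position, which this example glosses over by using an even-length message:

```python
def internet_checksum(data: bytes) -> int:
    # Sum the data as 16-bit words, folding any overflow carry back in,
    # then invert the result.
    if len(data) % 2:
        data += b"\x00"           # pad odd-length data with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return ~total & 0xFFFF

message = b"buried treasure!"     # even length, so alignment stays simple
check = internet_checksum(message)

# The receiver recomputes over message + checksum and expects exactly 0;
# any flipped bit along the way makes this nonzero.
print(internet_checksum(message + check.to_bytes(2, "big")))   # 0
```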
- Finally, TCP gives us the concept of port numbers. This is a way to separately number each conversation happening on a machine, grouped by the type of software running.
- Port 80 or port 443 are commonly used for HTTP or website traffic. Port 25 or 587 might be used for email. We can take that port number and determine what software should handle this particular message.
- The same machine could receive dozens of different messages from all different sources, and we can group them by functionality to make sure they get to the correct piece of software.
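A minimal sketch of ports in action: a tiny echo server listens on a `(host, port)` pair on the local machine, and a client connects to that same pair. TCP (`SOCK_STREAM`) handles the ordering, acknowledgment, and retransmission behind the scenes; the port number is what steers the bytes to this particular program:

```python
import socket
import threading

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))        # port 0: let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]

def echo_once():
    conn, _ = server.accept()
    conn.sendall(conn.recv(1024))    # echo whatever arrives back to the sender
    conn.close()

t = threading.Thread(target=echo_once)
t.start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))
client.sendall(b"let's meet for coffee tomorrow at noon")
reply = client.recv(1024)
client.close()
t.join()
server.close()
print(reply)
```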
- Both IP and TCP headers will be included in a full stream of messages. These protocols trace back to ARPANET, the network that became the eventual Internet. Started by the Advanced Research Projects Agency, ARPANET began as a small computer network on a few college campuses in the late 1960s, sending very basic small information; TCP/IP itself was adopted as its standard protocol suite in the early 1980s.
Routing
- We’re going to start a theme looking at protocols specifically designed to solve problems. The Internet was much smaller in the late 1960s than it is today.
- If we have computers close by on a local network, it’s fairly easy for them to speak to one another. But what if we want to send a message across town, across the region, across a country, or overseas?
- The further away we get, the more wires and geography we have to traverse. How do machines know how to get their message from where we sit to the other side of the world?
- Do they need to know the path? If I’m getting in a car and driving somewhere, I need to know the directions. But your computer actually doesn’t need to know. And do we need to follow the same directions every time? The answer is absolutely not.
- This is actually a design decision: If a wire gets cut, something gets damaged, or a natural disaster causes an outage, we can get there a different way by taking a different route.
- We have a device called a router. It does what it says: It routes things. It takes packets and figures out: Do I know how to route this packet? Am I responsible for this network or not?
- If it’s not responsible for that network, the router forwards the packet to its gateway, another router with slightly larger responsibility or broader scope. That gateway might also not know where this should go, so it forwards again, upstream, until some machine knows: This is a network I’m familiar with, you need to go in this direction.
- This is largely how the postal service works. We have postal codes on letters. When you send a letter, it goes to a sorting station. The station asks: Is this an address I service? If no, it goes to a larger distribution facility. It might travel across the country until ultimately one facility says: I’m responsible for this geographical area.
- Routers work almost the same way. We go wider and wider into distribution facilities until we recognize where to send the packet, then we zoom in closer and closer to the destination.
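The forwarding decision described above can be sketched with the standard library: check each network the router knows about and pick the most specific (longest-prefix) match, falling back to the default route otherwise. The networks and next hops here are made-up examples:

```python
import ipaddress

routes = [
    (ipaddress.ip_network("192.168.1.0/24"), "local"),
    (ipaddress.ip_network("10.0.0.0/8"), "10.0.0.1"),
    (ipaddress.ip_network("0.0.0.0/0"), "upstream-gateway"),  # default route
]

def next_hop(destination: str) -> str:
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in routes if addr in net]
    # Longest prefix wins: the most specific network this router knows.
    return max(matches, key=lambda m: m[0].prefixlen)[1]

print(next_hop("192.168.1.42"))   # a network we're responsible for: "local"
print(next_hop("8.8.8.8"))        # not ours: forward to "upstream-gateway"
```

Any destination the router doesn’t recognize matches only the `0.0.0.0/0` default route, which is exactly the “forward it upstream to the gateway” behavior.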
DNS
- We have a different problem to solve. We’re getting more and more computers on the Internet. Now that we’re routing, we can talk to computers all over the world. But we don’t have a phone book. How do we know someone’s address?
- Who’s responsible for that information? Who can edit it? What if there are disagreements? Our address book is getting bigger. We need a way to deal with this.
- DNS (Domain Name System) is a protocol that does exactly this. We have an organization called ICANN, the Internet Corporation for Assigned Names and Numbers. They are globally recognized as the entity responsible for assigning temporary ownership of a specific name to an owner.
- You can pay for the rights to a domain, and while you control it, you are the authority for all addresses at that domain.
- When I say domain, we have something like harvard.edu or google.com. We have a name, and then a top-level domain or suffix like .edu, .com, .net, .biz, .uk. These suffixes are all coordinated through the same organization, and the owner of a domain manages all records in their zone.
- If we’re looking for the address of harvard.edu, we need to ask that question. DNS is a protocol where we phrase a question in a specific way and pose it to a DNS server, expecting a response.
- The server decides: Do I know the address or don’t I? If yes, it answers with the address. If no, we go specifically to the authority responsible for that domain and ask them.
- After the answer comes back, that server will remember the answer for a while, maybe a day or a couple of hours. If the question comes up again, we don’t have to keep searching. Eventually, we forget in case someone changes the information.
- We can ask these questions of any DNS server anywhere in the world, and it will get an answer. It either knows or it looks the answer up.
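The caching behavior described above can be sketched in a few lines. The `AUTHORITATIVE` table and the address in it are made-up stand-ins for the real DNS hierarchy, not live records:

```python
import time

AUTHORITATIVE = {"harvard.edu": "23.185.0.2"}   # example record, not live data

class CachingResolver:
    def __init__(self, ttl_seconds: float = 3600):
        self.ttl = ttl_seconds
        self.cache = {}                  # name -> (address, expiry time)

    def resolve(self, name: str) -> str:
        entry = self.cache.get(name)
        if entry and entry[1] > time.time():
            return entry[0]              # remembered: no need to search again
        address = AUTHORITATIVE[name]    # otherwise ask the authority
        # Remember the answer for a while, then forget it in case the
        # owner changes the record.
        self.cache[name] = (address, time.time() + self.ttl)
        return address

resolver = CachingResolver()
print(resolver.resolve("harvard.edu"))   # first time: asks the authority
print(resolver.resolve("harvard.edu"))   # answered from cache this time
```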
DHCP
- DHCP (Dynamic Host Configuration Protocol) solves a different problem. We have more computers coming onto our network, and each needs a unique address.
- Where do these addresses come from? How do I know which address a computer should get? How do I update my list every time a computer turns on or off?
- We need to provide a mechanism to give out addresses and return them to a pool when a computer stops using them. We also need to ensure there are no conflicts. Each computer must be uniquely identified.
- If two computers have the same address and someone sends a message to that address, which computer gets it? We have problems if that happens.
- We came up with a way for computers, when they boot or connect to a network, to ask another computer: Can you please assign me an address that is safe to use?
- There might be a different computer that keeps track of all the numbers already in use and says: Here is one that is free to use, no other computer is using it. It assigns a lease.
- That’s a DHCP client and DHCP server interaction. The client asks, the server dishes out numbers, makes sure they’re safe, assigns a lease. When the lease expires, usually after a day or so, the computer either continues using the same number or asks for a new one.
- In this way, we don’t have to manually track all day-to-day changes every time someone turns a computer on or off. At scale, this saves an awful lot of time managing overhead.
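The server side of that DHCP interaction can be sketched as a lease table over a pool of addresses. This is a simplified model (real DHCP is a four-step broadcast exchange over UDP), with made-up client identifiers:

```python
import ipaddress

class DhcpServer:
    def __init__(self, network: str):
        # Build the pool of usable host addresses; skip the first one,
        # assuming the router itself holds it.
        hosts = list(ipaddress.ip_network(network).hosts())[1:]
        self.free = [str(h) for h in hosts]
        self.leases = {}                          # client id -> address

    def request(self, client_id: str) -> str:
        if client_id in self.leases:              # renewing: keep same number
            return self.leases[client_id]
        address = self.free.pop(0)                # guaranteed not in use
        self.leases[client_id] = address
        return address

    def release(self, client_id: str) -> None:
        # Lease over: return the address to the pool for someone else.
        self.free.append(self.leases.pop(client_id))

server = DhcpServer("192.168.1.0/24")
print(server.request("aa:bb:cc:dd:ee:01"))   # 192.168.1.2
print(server.request("aa:bb:cc:dd:ee:02"))   # 192.168.1.3
```

Because one table tracks every number in use, no two clients can ever be handed the same address, which is exactly the conflict we need to avoid.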
UDP
- We talked about TCP solving problems with fragmentation and ordering. If you miss some packets, you don’t know where the buried treasure is. But solving that with TCP has costs.
- We have to acknowledge every single packet. If you’re sending me a thousand packets, I have to send back a thousand acknowledgments. I also have to wait until every packet is received before I can reorder the message.
- For mission-critical data where I need exact contents, that’s fine. But that’s not always the case.
- Imagine a web conferencing system or watching video or audio. We don’t necessarily care if we lose a piece of data from four seconds ago. If the audio blipped a little during a Zoom call, I don’t need to get that packet back and replay that tiny fragment because the conversation has moved on.
- UDP (User Datagram Protocol) is not better than TCP; it’s meant for a different use case. A lot of TCP overhead is eliminated. We’re not looking for guarantees in delivery or proper ordering.
- We accept that most of the time packets will come through in order and we’ll play what we get. There may be duplicates, but we let those go. We focus on what is coming next.
- You may be familiar with audio or video glitches when streaming. That’s okay with this protocol. We focus on getting the newest data and continuing from where we are. We don’t need to go backwards.
- It is preferable to drop or lose packets than to delay or buffer and wait until we have the entire message before playing smoothly.
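A sketch of UDP’s fire-and-forget model on the local machine: no connection setup, no acknowledgments. Each `sendto` is an independent datagram that, in general, may arrive late, duplicated, or not at all (the loopback interface used here is reliable, so the example completes):

```python
import socket

receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))       # port 0: let the OS pick a free port
port = receiver.getsockname()[1]

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# No handshake, no acknowledgment: just send and move on to the next frame.
sender.sendto(b"video frame 1", ("127.0.0.1", port))

data, _ = receiver.recvfrom(1024)
print(data)
sender.close()
receiver.close()
```

Compare this with the TCP example earlier: no `connect`, no `accept`, and nothing would retransmit the datagram if it were lost, which is precisely the trade-off streaming applications want.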
IP Address Exhaustion
- It’s been years since the 1960s. The Internet has gotten bigger, and more computers all over the world are joining.
- IP addresses in version 4 are 32 bits. That’s roughly 4.3 billion unique addresses. A long time ago, that seemed like a tremendous amount.
- Nowadays, there are seven-plus billion people on the planet. Think about how many devices are Internet-connected: PCs, laptops, phones, tablets. Yet more: Washing machines, dishwashers, refrigerators, security cameras, and other peripherals.
- A single person may be responsible for dozens of Internet-connected products. And what are those computers talking to? There are servers and public machines on the other side of those conversations.
- We simply have way more computers than addresses. We have run out of addresses allowed in IP version 4.
NAT
- There are a couple of ways we’ve tried to deal with this. A short-term solution is NAT (Network Address Translation).
- We take several IP ranges out of those four billion addresses and call them private. They’re not publicly routable, meaning you’ll never assign a computer that IP address and have another computer in the world contact you directly.
- These addresses live behind devices like routers. It’s possible that my organization uses several of these IP numbers, and your organization uses those same numbers, but they’re not in conflict because we’re not on the same network.
- We’ve all agreed that these numbers will not be publicly routable, that no router will offer a non-local connection to one of these addresses.
- There are ranges for different sizes of organizations: The 10 range (10.0.0.0 to 10.255.255.255) for large networks, the 172.16 to 172.31 block for medium networks, and the 192.168 block for smaller networks like homes and small businesses.
- When you make a request from one of these private computers, your router rewrites your return address from your private IP to your public IP. When the response returns, it does the same in reverse.
- We’re hiding and reusing several of these numbers in smaller networks, then translating them. A big organization might take just one of those four billion IP addresses rather than hundreds.
- Consider what this might look like. You have a modem or fiber terminal where the Internet comes in. From there, you have a router, which has two interfaces with two different IP addresses.
- One is your one-in-four-billion public IP address that the world can route to. The second is a private IP only accessible by machines that router is responsible for.
- Your computer, laptop, printer, phone, smart TV, dishwasher, all get private addresses from the router. Every request going to the public is translated to appear as having the public IP address.
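A sketch of the translation table a home router maintains: outgoing packets are rewritten to carry the router’s public address and a fresh public port, and the table maps replies back to the right private machine. The addresses are examples (203.0.113.0/24 is a documentation range):

```python
import itertools

class NatRouter:
    def __init__(self, public_ip: str):
        self.public_ip = public_ip
        self.ports = itertools.count(40000)   # next unused public port
        self.table = {}                       # public port -> (private ip, port)

    def outbound(self, private_ip: str, private_port: int):
        public_port = next(self.ports)
        self.table[public_port] = (private_ip, private_port)
        # The packet leaves looking as if it came from the router itself.
        return (self.public_ip, public_port)

    def inbound(self, public_port: int):
        # A reply arrives at the public address: translate back in reverse.
        return self.table[public_port]

router = NatRouter("203.0.113.7")
src = router.outbound("192.168.1.10", 51515)
print(src)                   # ('203.0.113.7', 40000)
print(router.inbound(src[1]))   # ('192.168.1.10', 51515)
```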
- You can test this at home. Go to a site like whatismyip or ipchicken.com from several different computers in your house. You will likely see the same number.
IPv6
- NAT is not the only solution and not the best solution. The logical next protocol is IP version 6.
- Instead of using 32 bits, we use 128 bits, which without going through the math, is a lot more. To shorten these numbers, we write them in base 16 (hexadecimal) rather than base 10, using the letters A through F to stand for the values 10 through 15.
- How many unique addresses are possible with 128 bits? It’s an enormous number, around 3.4 x 10^38, that I won’t attempt to pronounce. We will not be running out of those in anyone’s lifetime.
- You may have seen some IPv6 addresses. They’re still less common visually than IPv4 addresses, but we are using more every day.
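The standard library can show the notation directly. The address below is from the IPv6 documentation range (2001:db8::/32), not a real host:

```python
import ipaddress

addr = ipaddress.ip_address("2001:0db8:0000:0000:0000:0000:0000:0001")
print(addr)             # 2001:db8::1 -- runs of zeros compress to "::"
print(addr.exploded)    # 2001:0db8:0000:0000:0000:0000:0000:0001

# The size of each address space: 128 bits versus 32 bits.
print(2 ** 128)         # about 3.4 x 10^38 IPv6 addresses
print(2 ** 32)          # roughly 4.3 billion IPv4 addresses
```

The `::` shorthand is why most IPv6 addresses you see in practice look much shorter than eight full groups of hex digits.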
- The trouble is getting an entire planet to agree on the exact rules of the next version. Everyone is opinionated with strong reasons. It takes a long time to get agreement.
- The proposal for IPv6 came out a long time ago, and we’re still working on upgrading machines to support it. It will eventually replace IPv4, but even though our addresses have been exhausted, we still use IPv4 every single day.
- Going forward, more IPv6 addresses are being adopted. Eventually, IPv4 will be fully retired and we’ll have more address space than we know what to do with.
HTTP
- We’ve talked about communications and packets going back and forth, but we haven’t talked about the most important piece you likely use daily: Websites.
- It’s common to conflate the Internet with the World Wide Web, which is a collection of websites or data that can be interchanged in a web browser. They’re not necessarily the same thing.
- To make websites happen, we use HTTP (Hypertext Transfer Protocol). That defines all the rules for interacting with what you know as a website. We talk to a web server, ask questions, get responses, and a website appears on your screen.
- Let’s say we go to https://nytimes.com. What happens between hitting Enter and that website appearing?
- The journey starts with a web browser. What is a web browser? It’s a program like any other. It specifically speaks HTTP. It writes HTTP requests to send out, receives HTTP responses, and renders them so you can interact with them.
- The browser translates that behind-the-scenes computer-speak into a beautiful, functional website. First, we draft an HTTP request, asking a computer somewhere to send us the homepage.
- The request has a method: GET means we want to get a file, to receive something. We have a path we’re requesting, in this case just slash for the homepage. And the protocol and version: HTTP/2.
- We also have potentially many headers with values. For simplicity, the host we’re looking for, like nytimes.com, is certainly required.
- We send that request over the Internet. All the other protocols are still in play. We use DNS to figure out what address we’re sending to. We break it into packets with TCP/IP and use routing tools to send it around the world.
- The server receives our message and needs to do something with it. Because we requested an HTTPS website (encrypted HTTP), this came over on port 443. The computer forwards this to software called a web server.
- The web server sees someone would like to GET a specific file, particularly the homepage. There are other request types. GET is simple: We’re downloading a file. We could also send information, like when checking out on a shopping website with address and payment information.
- Finally, we draft the HTTP response and send it back. It includes the version (HTTP/2), a status code like 200, headers like the date and content-type.
- Content-type gives the browser a hint about what type of file we’re sending: Text/html for a homepage, or perhaps PDF, Excel, images, video, audio. The browser knows how to delegate based on the type.
- And of course, we send the file itself, the actual content of the homepage.
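The exchange above can be shown as the text the two sides actually trade. HTTP/1.1 is used here because it is human-readable plain text; HTTP/2 carries the same information in a binary framing. The response is a hand-written sample, not a real reply from nytimes.com:

```python
# The request: method, path, protocol version, then headers.
request = (
    "GET / HTTP/1.1\r\n"       # GET the homepage ("/")
    "Host: nytimes.com\r\n"    # required header: which site we want
    "\r\n"                     # blank line ends the headers
)

# A sample response: status line, headers, blank line, then the file itself.
sample_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html>...</html>"
)

# The browser splits the response into status line, headers, and body.
head, body = sample_response.split("\r\n\r\n", 1)
status_line, *header_lines = head.split("\r\n")
headers = dict(line.split(": ", 1) for line in header_lines)
print(status_line)               # HTTP/1.1 200 OK
print(headers["Content-Type"])   # text/html: render this as a web page
print(body)
```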
URLs
- That 200 code we mentioned is one of many status codes. These are ways for the server to respond and give the browser an idea of how the request was handled.
- There are numerous status codes, including:
  - 200 OK
  - 301 Moved Permanently
  - 304 Not Modified
  - 401 Unauthorized
  - 403 Forbidden
  - 404 Not Found
  - 418 I'm a teapot
  - 500 Internal Server Error
  - 503 Service Unavailable
- The most common is 200, meaning everything worked fine, here’s your content. Other codes appear when something goes wrong.
- 301 Moved Permanently is a redirect. You’ll see this if you access a non-encrypted website and get forwarded to the encrypted version.
- 404 Not Found is when you type a URL that doesn’t exist anymore. Maybe the page was deleted or renamed. The 400-range numbers are client-related errors.
- The 500-range numbers are server-related errors. Internal Server Error means something happened on the server.
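Python’s standard library knows these codes and their standard phrases, which makes for a quick way to look one up:

```python
from http import HTTPStatus

print(HTTPStatus(404).phrase)        # Not Found
print(HTTPStatus(301).phrase)        # Moved Permanently
print(HTTPStatus(500).phrase)        # Internal Server Error

# The codes group by hundreds: 4xx are client errors, 5xx are server errors.
print(HTTPStatus(404).value // 100)  # 4
print(HTTPStatus(500).value // 100)  # 5
```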
- The browser receives the response and renders it, presenting an experience with articles, headlines, pictures, and ads that you can interact with.
- One thing we’ll see is references to additional files. We requested one file, the homepage, but it might tell us to download JavaScript files, stylesheets, images, videos. Each of those starts a new HTTP request.
- In HTTP version 1 or 1.1, each file needed to be requested individually. Loading something like New York Times that references hundreds of files means creating hundreds of individual requests, making latency painful.
- With HTTP/2 or HTTP/3, we have features like multiplexing, sending multiple files at once as a package rather than unique requests, consolidating overhead into quicker transactions.
Components of a URL
- Let’s talk about the components of a URL. When we look at a URL, we have several components.
- First, the scheme, which is the protocol. Most of the time now, it’s HTTPS (encrypted HTTP). We can specify this before the colon-slash-slash.
- Next, the subdomain. Initially, we expected it to be WWW (World Wide Web), a dated term now. The Internet and World Wide Web are somewhat different things. The Internet is the connected network; the World Wide Web refers to websites and their interconnected nature.
- Nowadays, we use subdomains for different business units, brands, or subdepartments.
- The domain itself is the primary way of saying this is my area that I control: Harvard.edu, Google, Facebook.
- The top-level domain or suffix: Commonly .com, .net, .org. Much like IPv4 addresses, we ran out of these. People wanted every misspelling and version of their company name. We’ve added many more, and different countries have their own suffixes.
- Beyond that, we have the file and folder structure. After the slash, we specify what folder and file name we’re requesting. There could be multiple folders: Folder A inside folder B inside folder C.
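The components above can be pulled apart with the standard library. The URL here is a made-up example for illustration:

```python
from urllib.parse import urlparse

url = "https://www.example.edu/courses/networking/index.html"
parts = urlparse(url)
print(parts.scheme)   # https -- the protocol, before the colon-slash-slash
print(parts.netloc)   # www.example.edu -- subdomain + domain + suffix
print(parts.path)     # /courses/networking/index.html -- folders and file
```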
Scaling
- With all this, we’ve talked about many protocols developed over 50 years to make the Internet what it is today. Now that we have reliable communications, let’s talk about how we use them from a business standpoint and how scaling works.
- Our progress is always constrained by processing power and data storage. If we look at what we expected 10, 20, 30 years ago, there has been enormous improvement.
- The computer for the Apollo 11 mission had about 12,250 FLOPS (floating-point operations per second). The Cray supercomputer in the 1980s did 1.9 billion FLOPS and weighed about 2,500 kilograms.
- Fast forward to 2020: The iPhone 12 has 11 trillion FLOPS and weighs 164 grams. Such a huge improvement, and it will continue.
- Consider a scenario where we want to scale Internet operations. Let’s say we’re a power utility with an outage map for customers.
- On a normal day, maybe 100 users visit that map. Very small reason for anyone to check if they have power.
- But worst case: A natural disaster causes massive damage, power lines are out everywhere, millions of people without power. Everyone on their phones, checking when their power will return.
- The site that had 100 visitors daily now potentially has millions of people hitting it repeatedly. The servers get overwhelmed.
- We have options. We could optimize for 100 users and not care if we go over. But if millions are without power, that won’t work. People will get upset.
- We could spend a lot of money on more powerful servers or way more servers, ready for a million people. That works, but it’s underutilized hardware most of the time, sitting idle while we pay for and maintain it.
- We could scale vertically: Make a machine more powerful. Buy processor B instead of processor A to get more FLOPS, or buy more processors, bigger hard drives.
- We could scale horizontally: Instead of adding power, add quantity. Get several cheaper computers and use a lot of them. When one gets busy, start giving load to the second, third, fourth. Spread the load.
- That’s not too bad, but there are still limits. Vertical scaling: Cost typically outpaces marginal performance gains. Eventually, there is a fastest computer. Beyond that, we have to scale horizontally. Also, more power requirements and cooling requirements.
- Horizontal scaling: More physical machines means more maintenance, software updates, data backups, power backups, hardware that may fail. Physical space constraints too.
Cloud Computing
- We can consider co-location. Instead of buying a massive computer and installing it in your server closet, rent space from a data center whose sole job is to maintain a facility with redundant power, redundant Internet, cooling, and other overhead.
- Pay them to do that, rent the computing you need. Let them handle hardware and software upgrades, physical security, fire suppression, cooling. Focus on performance, not overhead.
- But going back to our electricity example, we only need massive capacity 1% of the time or less. There’s got to be a better way than buying or even renting all this hardware.
- How can we scale when we need it but scale back down when we don’t? This brings us to cloud computing.
- Using concepts called virtualization and containerization, we can simulate a computer or applications on another computer.
- We get massive data centers, like co-location, but rather than pay for hardware itself, we pay to have our hardware simulated or virtualized.
- We take an entire machine, operating system and all, and turn it into a file. We execute that file, and now we have a virtual computer running. If we want to stop, we shut it down and stop putting resources into it.
- We run these simulations on very powerful machines, more powerful than we’d pay for. We share that power with other customers, not the same simulation, but the machines they run on.
- We only pay for the time we’re executing. We see storms coming up the coast, we know we’ll get hit. We start spinning up extra simulated computers. As requests come in, we have machines waiting to accept the influx.
- As load goes back down and residents get power back, we shut down those extra simulations. We kept costs to a minimum, still fulfilled requirements, still communicated with customers.
Virtualization and Containerization
- Virtualization and containerization are subtly different.
- With virtualization, the operating system of a usually powerful computer is a hypervisor. Its job is specifically to run simulations. Each virtual computer has its own operating system. You could virtualize Windows or Linux, installing software just as on physical hardware.
- Containerization is different. Instead of running on a hypervisor and recreating guest operating systems, we run a containerization package like Docker on a normal operating system.
- Docker simulates just the apps themselves in their own little sandbox. They don’t affect your larger computer, and we can create as many or as few simulations as we want without redoing the operating system.
- When we virtualize, every operating system, especially Windows, requires licensing. We pay for that over and over. We don’t need to reproduce all the operating system overhead if we only care about spinning up the application.
- In a containerized situation, we only simulate the application, keeping it sandboxed. We can scale across multiple container hosts. If we need more, we spin more up. Slightly different approaches, but the idea is simulating execution as if it were actual hardware.
- There are several large industrial players that support this market. It is really how major compute is being handled nowadays.
High Availability
- Ultimately, the goal is to have our services available as often as possible. We are trying to eliminate single points of failure.
- Any one device where if somebody trips over the cable and it unplugs and everything goes down: That’s a single point of failure. We want at least two of everything. If something fails, something else can pick up the slack with minimal downtime.
- What goes with that is load balancing: Taking requests that might overwhelm us and distributing usage across as many machines as necessary so nothing crashes.
- If one machine dies, we take it out of the pool and direct to one fewer machine. No user realizes whether servers are added or removed as long as requests are filled in a timely way.
- We want to distribute load as best we can and avoid overloading any one node. How we implement that takes trial and error, understanding the usage we have.
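A sketch of round-robin load balancing with failure handling, using made-up server names: requests rotate across the pool, and a server marked down simply stops receiving traffic, with no user-visible change:

```python
import itertools

class LoadBalancer:
    def __init__(self, servers):
        self.servers = list(servers)
        self._cycle = itertools.cycle(list(servers))

    def pick(self) -> str:
        # Rotate through the pool, skipping any server marked down.
        while True:
            server = next(self._cycle)
            if server in self.servers:
                return server

    def mark_down(self, server: str) -> None:
        self.servers.remove(server)   # one fewer machine; traffic continues

lb = LoadBalancer(["app1", "app2", "app3"])
print([lb.pick() for _ in range(4)])   # requests spread across all three
lb.mark_down("app2")
print([lb.pick() for _ in range(4)])   # app2 no longer appears
```

Real load balancers add health checks and smarter policies (least-connections, weighting), but the core idea is the same: no single node should be a point of failure.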
- But once we do that, we have tools with cloud computing concepts to simulate an entire network and make sure all of it is redundant. If any piece goes down, even an entire data center due to natural disaster, we can automatically shift to a completely different data center somewhere else in the world because everything is virtualized.
- This has come an awful long way from getting a few packets from a computer lab across one university to another.
Summing Up
In this lesson, you learned how the Internet is put together. Specifically, you learned…
- About protocols as constraints and conventions that govern and structure interactions between computers.
- How Internet Protocol (IP) provides unique 32-bit addresses to identify computers and breaks messages into smaller packets for reliable delivery.
- How TCP (Transmission Control Protocol) adds ordering, acknowledgment, error checking, and port numbers to ensure reliable communication.
- About routing and how routers forward packets through gateways until they reach their destination, much like the postal service.
- How DNS (Domain Name System) acts as the Internet’s phone book, translating domain names into IP addresses.
- How DHCP (Dynamic Host Configuration Protocol) automatically assigns IP addresses to devices on a network.
- About UDP (User Datagram Protocol) for use cases like streaming video where speed matters more than guaranteed delivery.
- About IP address exhaustion and how NAT (Network Address Translation) and IPv6 address this problem.
- How HTTP (Hypertext Transfer Protocol) governs website interactions, including requests, responses, and status codes.
- About the components of URLs: Scheme, subdomain, domain, top-level domain, and path.
- About scaling strategies including vertical scaling, horizontal scaling, co-location, and cloud computing.
- How virtualization and containerization allow us to simulate computers and applications for flexible, on-demand computing.
- About high availability and load balancing to ensure services remain accessible even when individual components fail.
See you next time!