Distributed Systems – Network Communication and Remote Procedure Calls (RPCs)

June 20, 2022

Distributed Systems System Design Interview

Table of Contents

The problem of communication

A process on Host A wants to talk to the process on Host B
A and B must agree on the meaning of the bits being sent and received at many different levels, including:
- How many volts is a 0 bit, a 1 bit?
- How does the receiver know which is the last bit?
- How many bits long is a number?

Re-implement every application for every new underlying transmission medium?
Change every application on any change to an underlying transmission medium?
No! But how does the Internet design avoid this?

Solution: Layering

Intermediate layers provide a set of abstractions for applications and media
New apps or media need only implement for the intermediate layer’s interface

Layering in the Internet

Transport: Provide end-to-end communication between processes on different hosts
Network: Deliver packets to destinations on other (heterogeneous) networks
Link: Enables end hosts to exchange atomic messages with each other
Physical: Moves bits between two hosts connected by a physical link

Logical communication between layers

How to forge an agreement on the meaning of bits exchanged b/w two hosts?
Protocol: Rules that govern the format, contents, and meaning of messages
Each layer on a host interacts with its peer host’s corresponding layer via the protocol interface

Physical communication

Communication goes down to the physical network
Then from network peer to peer
Then up to the relevant application

Communication between peers

How do peer protocols coordinate with each other?
The layer attaches its own header (H) to communicate with peer
Higher layers’ headers, data encapsulated inside the message
Lower layers don’t generally inspect higher layers’ headers

Network socket-based communication

Socket: The interface the OS provides to the network
- Provides inter-process explicit message exchange
Can build distributed systems atop sockets: send(), recv()
- e.g.: put(key,value) -> message

Socket programming: still not great

// Create a socket for the client
if ((sockfd = socket (AF_INET, SOCK_STREAM, 0)) < 0) {
 perror(”Socket creation");
 exit(2);
}

// Set server address and port
memset(&servaddr, 0, sizeof(servaddr));
servaddr.sin_family = AF_INET;
servaddr.sin_addr.s_addr = inet_addr(argv[1]);
servaddr.sin_port = htons(SERV_PORT); // to big-endian

// Establish TCP connection
if (connect(sockfd, (struct sockaddr *) &servaddr, sizeof(servaddr)) < 0) {
 perror(”Connect to server");
 exit(3);
}

// Transmit the data over the TCP connection
send(sockfd, buf, strlen(buf), 0);

Lots for the programmer to deal with every time
- How to separate different requests on the same connection?
- How to write bytes to the network/read bytes from the network?
  - What if Host A’s process is written in Go and Host B’s process is in C++?
- What to do with those bytes?
Still pretty painful… have to worry a lot about the network

Why RPC?

The typical programmer is trained to write single-threaded code that runs in one place
Goal: Easy-to-program network communication that makes client-server communication seem transparent
- Retains the “feel” of writing centralized code
- Programmers needn’t think (much) about the network

Everyone uses RPCs

Google gRPC
Facebook/Apache Thrift
Twitter Finagle
…

What’s the goal of RPC?

Within a single program, running in a single process, recall the well-known notion of a procedure call:
- Caller pushes arguments onto the stack,
  - jumps to address of callee function
- Callee reads arguments from the stack,
  - executes, puts return value in a register,
  - returns to the next instruction in the caller

RPC’s Goal: make communication appear like a local procedure call: way less painful than sockets

RPC issues

Heterogeneity
- The client needs to rendezvous with the server
- The server must dispatch to the required function
  - What if the server is a different type of machine?
Failure
- What if messages get dropped?
- What if the client, server, or network fails?
Performance
- Procedure call takes ≈ 10 cycles ≈ 3 ns
- RPC in a data center takes ≈ 10 μs (10^3 slower)
  - In the wide area, typically 10^6 slower

Problem: Differences in data representation

Not an issue for local procedure calls
For a remote procedure call, a remote machine may:
- Run process written in a different language
- Represent data types using different sizes
- Use a different byte ordering (endianness)
- Represent floating point numbers differently
- Have different data alignment requirements
  - e.g., 4-byte type begins only on a 4-byte memory boundary

Solution: Interface Description Language

Mechanism to pass procedure parameters and return values in a machine-independent way
Programmer may write an interface description in the IDL
- Defines API for procedure calls: names, parameter/return types
Then runs an IDL compiler which generates:
- Code to marshal (convert) native data types into machine-independent byte streams (and vice-versa, called unmarshaling)
- Client stub: Forwards local procedure call as a request to server
- Server stub: Dispatches RPC to its implementation

A day in the life of an RPC

Client calls stub function (pushes parameters onto the stack)

Stub marshals parameters to a network message

OS sends a network message to the server

Server OS receives the message, sends it up to stub

Server stub unmarshals params, calls server function

The server function runs, returns a value

Server stub marshals the return value, sends message

Server OS sends the reply back across the network

Client OS receives the reply and passes it up to stub

Client stub unmarshals return value, returns to the client

What could possibly go wrong?

All of these may look the same to the client
- The client may crash and reboot
- Packets may be dropped
  - Some individual packet loss on the Internet
  - Broken routing results in many lost packets
- The server may crash and reboot
- The network or server might just be very slow

Summary

Layers are our friends!
RPCs are everywhere
Necessary issues surrounding machine heterogeneity
Subtle issues around failures