More or less all the big APIs are RESTful these days. Yeah, you can quibble about what “REST” means (and I will, a bit) but the assertion is broadly true. Is it going to stay that way forever? Seems unlikely. So, what’s next?

What we talk about when we talk about “REST” · These days, it’s used colloquially to mean any API that is HTTP-based. In fact, the vast majority of them offer CRUD operations on things that have URIs, embed some of those URIs in their payloads, and thus are arguably RESTful in the original sense; although these days I’m hearing the occasional “CRUDL” where L is for List.
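To make that concrete, here's a minimal sketch of how the CRUDL operations typically map onto HTTP verbs and URIs; the example.com service and its "widgets" resource are entirely made up.

```python
import requests

BASE = "https://api.example.com"   # hypothetical service

# Create: POST to the collection; assume the new resource's URI
# comes back in a Location header.
r = requests.post(f"{BASE}/widgets", json={"name": "sprocket"})
widget_uri = r.headers["Location"]

# Read, Update, Delete: plain HTTP verbs against that URI.
requests.get(widget_uri)
requests.put(widget_uri, json={"name": "flange"})
requests.delete(widget_uri)

# List: the "L" in CRUDL, usually just a GET on the collection.
requests.get(f"{BASE}/widgets")
```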

At AWS where I work, we almost always distinguish, for a service or an app, between its “control plane” and its “data plane”. For example, consider our database-as-a-service RDS; the control plane APIs are where you create, configure, back up, start, stop, and delete databases. The data plane is SQL, with connection pools and all that RDBMS baggage.
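Here's a sketch of that split, using boto3 for the control plane and a generic SQL driver for the data plane; the instance identifiers, endpoint, and credentials are all made up.

```python
import boto3
import psycopg2  # or any other SQL driver

# Control plane: create/configure/delete the database *instance* itself.
rds = boto3.client("rds")
rds.create_db_instance(
    DBInstanceIdentifier="example-db",            # hypothetical identifiers
    Engine="postgres",
    DBInstanceClass="db.t3.micro",
    MasterUsername="dbadmin",
    MasterUserPassword="not-a-real-password",
    AllocatedStorage=20,
)

# Data plane: a long-lived SQL connection; nothing RESTful about it.
conn = psycopg2.connect(
    host="example-db.abc123.us-east-1.rds.amazonaws.com",  # made-up endpoint
    user="dbadmin",
    password="not-a-real-password",
    dbname="postgres",
)
with conn.cursor() as cur:
    cur.execute("SELECT now()")
    print(cur.fetchone())
```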

It’s interesting to note that the control plane is RESTful, but the data plane isn’t at all. (This isn’t necessarily a database thing: DynamoDB’s data plane is pretty RESTful.)

I think there’s a pattern there: The control plane for almost anything online has a good chance of being RESTful because, well, that’s where you’re going to be creating and deleting stuff. The data plane might be a different story; my first prediction here is that whatever starts to displace REST will start doing it on the data plane side, if only because control planes and REST are such a natural fit.

RESTful imperfections · What are some reasons we might want to move beyond REST? Let me list a few:

Latency · Setting up and tearing down an HTTP connection for every little operation you want to do is not free. A couple of decades of effort have reduced the cost, but still.

For example, consider two messaging systems built by people who sit close to me: Amazon SQS and MQ. SQS has been running for a dozen years, can handle millions of messages per second and, assuming your senders and receivers are reasonably well balanced, can be really freaking fast — in fact, I’ve heard stories of messages actually being received before they were sent; the long-polling receiver grabbed the message before the sender side got around to tearing down the SendMessage HTTP connection. The MQ data plane, on the other hand, doesn’t use HTTP; it uses nailed-up TCP/IP connections with its own framing protocols, so you can get astonishingly low latencies for transmit and receive operations. But your throughput is limited by the number of messages the “message broker” terminating those connections can handle. A lot of people who use MQ are pretty convinced that one of the reasons they’re doing this is that they don’t want a RESTful interface.

Coupling · In the wild, most REST requests (like most things labeled as APIs) operate synchronously; that is to say, you call them (GET, POST, PUT, whatever) and you stall until you get your result back. Now (speaking HTTP lingo) your request might return 202 Accepted, in which case you’d expect either to have sent a URI along to be called back as a webhook, or to get one in the response that you can poll. But in all these cases, the coupling is still pretty tight; you (the caller) have to maintain some sort of state about the request until the server is done with it, whether that’s now or later.
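Here's a minimal sketch of the poll-a-status-URI variant, using Flask and an in-memory job table (the endpoints and job store are hypothetical). Notice that the caller walks away holding a URI it has to keep track of; that's exactly the state the coupling forces on it.

```python
import uuid
from flask import Flask, jsonify, request

app = Flask(__name__)
jobs = {}  # in-memory stand-in for a real job store

@app.route("/reports", methods=["POST"])
def create_report():
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "PENDING", "request": request.get_json()}
    # 202: "got it, come back later" — the caller now holds this URI
    # until the work is actually done.
    return jsonify({"status_uri": f"/reports/{job_id}"}), 202

@app.route("/reports/<job_id>", methods=["GET"])
def poll_report(job_id):
    return jsonify(jobs.get(job_id, {"status": "UNKNOWN"}))
```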

Which sort of sucks. In particular when it’s one microservice calling another and the client service is sending requests at a higher rate than the server-side one can handle; a situation that can lead to acute pain very quickly.

Short life · Handling some requests takes milliseconds. Handling others — a citizenship application, for example — can take weeks and involve orchestrating lots of services, and occasionally human interactions. The notion of keeping a thread hanging around waiting for something to happen is ridiculous.

A word on GraphQL · It exists, basically, to handle the situation where a client has to assemble several flavors of information to do its job — for example, a mobile app building an information-rich display. Since RESTful interfaces tend to do a good job of telling you about a single resource, this can lead to a wasteful flurry of requests. So GraphQL lets you cherry-pick an arbitrary selection of fields from multiple resources in a single request. Presumably, the server-side implementation issues that request flurry inside the data center where those calls are cheaper, then assembles your GraphQL output, but anyhow that’s no longer your problem.
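From the client side it looks something like this sketch, with a made-up schema: a user plus their recent orders, fetched in a single round trip instead of several.

```python
import requests

# Hypothetical GraphQL endpoint and schema; fields from "user" and its
# related "orders" are cherry-picked in one request.
query = """
{
  user(id: "42") {
    name
    avatarUrl
    orders(last: 3) {
      total
      status
    }
  }
}
"""
resp = requests.post("https://api.example.com/graphql",
                     json={"query": query})
print(resp.json()["data"])
```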

I observe that lots of client developers like GraphQL, and it seems like the world has a place for it, but I don’t see it as that big a game-changer. To start with, it’s not as though client developers can compose arbitrary queries, limited only by the semantics of GraphQL, and expect to get uniformly decent performance. (To be fair, the same is true of SQL.) Anyhow, I see GraphQL as a convenience feature designed to make synchronous APIs run more efficiently.

A word on RPC · By which, these days, I guess I must mean gRPC. I dunno, I’m old enough that I saw generation after generation of RPC frameworks fail miserably; brittle, requiring lots of configuration, and failing to deliver the anticipated performance wins. Smells like making RESTful APIs more tightly coupled, to me, and it’s hard to see that as a win. But I could be wrong.

Post-REST: Messaging and Eventing · This approach is all over, and I mean all over, the cloud infrastructure that I work on. The idea is you get a request, you validate it, maybe you do some computation on it, then you drop it on a queue (or bus, or stream, or whatever you want to call it) and forget about it; it’s not your problem any more.

The next stage of request handling is implemented by services that read the queue and either route an answer back to the original requester or pass it on to another service stage. Now for this to work, the queues in question have to be fast (which these days, they are), scalable (which they are), and very, very durable (which they are).
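Here's a sketch of that first stage using SQS; the queue name and the validation rule are made up.

```python
import json
import boto3

sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName="orders")["QueueUrl"]  # hypothetical queue

def handle_request(request: dict) -> dict:
    # Validate, maybe compute a little, then hand the work off durably.
    if "order_id" not in request:
        raise ValueError("bad request")
    sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(request))
    # Once the queue has it, it's no longer this service's problem.
    return {"status": "accepted"}
```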

There are a lot of wins here: To start with, transient query surges are no longer a problem. Also, once you’ve got a message stream you can do fan-out and filtering and assembly and subsetting and all sorts of other useful stuff, without disturbing the operations of the upstream message source.

Post-REST: Orchestration · This gets into workflow territory, something I’ve been working on a lot recently. Where by “workflow” I mean a service tracking the state of computations that have multiple steps, any one of which can take an arbitrarily long time, can fail, can need to be retried, and whose behavior and output affect the choice of subsequent steps and their behavior.

An increasing number of (for example) Lambda functions are, rather than serving requests and returning responses, executing in the context of a workflow that provides their input, waits for them to complete, and routes their output further downstream.
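As a sketch of what that looks like, here's a hypothetical two-step AWS Step Functions workflow; the state-machine definition, not the Lambda functions, owns the retries and the in-between state. Every name and ARN below is made up.

```python
import json
import boto3

# Amazon States Language definition: step ordering and retry policy
# live in the workflow, not in the functions it invokes.
definition = {
    "StartAt": "ValidateApplication",
    "States": {
        "ValidateApplication": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:validate",
            "Retry": [{"ErrorEquals": ["States.TaskFailed"], "MaxAttempts": 3}],
            "Next": "IssueDecision",
        },
        "IssueDecision": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:decide",
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
machine = sfn.create_state_machine(
    name="citizenship-application",                     # made-up name
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/sfn-role",  # made-up role
)
sfn.start_execution(
    stateMachineArn=machine["stateMachineArn"],
    input=json.dumps({"applicant": "example"}),
)
```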

Post-REST: Persistent connections · Back a few paragraphs I talked about how MQ message brokers work, maintaining a bunch of nailed-up network connections, and pumping bytes back and forth across them. It’s not hard to believe that there are lots of scenarios where this is a good fit for the way data and execution want to flow.

Now, we’re already partway there. For example, SQS clients routinely use “long polling” (for up to 20 seconds) to receive messages. That means they ask for messages and, if there aren’t any, the server doesn’t say “no dice”; it holds the connection open for a while and, if some messages come in, shoots them back to the caller. If you have a bunch of threads (potentially on multiple hosts) long-polling an SQS queue, you can get massive throughput and low latency and really reduce the cost of using HTTP.
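A minimal long-polling consumer loop with boto3 looks like this sketch; the queue name and handler are hypothetical, and 20 seconds is the maximum wait SQS allows.

```python
import boto3

sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName="orders")["QueueUrl"]  # hypothetical queue

def process(body: str) -> None:
    print("got message:", body)      # stand-in for real work

while True:
    # Long poll: the server holds the connection open for up to 20 seconds
    # rather than immediately answering "no messages".
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,
    )
    for msg in resp.get("Messages", []):
        process(msg["Body"])
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```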

The next two steps forward are pretty easy to see, too. The first is HTTP/2, already widely deployed, which lets you multiplex multiple HTTP requests across a single network connection. Used intelligently, it can buy you quite a few of the benefits of a permanent connection. But it’s still firmly tied to TCP, which has some unfortunate side-effects that I’m not going to deep-dive on here, partly because it’s not a thing I understand that deeply. But I expect to see lots of apps and services get good value out of HTTP/2 going forward; in part because, as far as clients can tell, they’re still making, and responding to, the same old HTTP requests they were before.
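Here's a sketch using the httpx library (its HTTP/2 support is an optional extra you have to install, and the URLs are made up). To the application it's just twenty ordinary GETs; the single underlying connection carries them all.

```python
import asyncio
import httpx

async def fetch_all():
    # One connection, many concurrent requests multiplexed over it.
    async with httpx.AsyncClient(http2=True) as client:
        tasks = [client.get(f"https://example.com/items/{i}") for i in range(20)]
        return await asyncio.gather(*tasks)

responses = asyncio.run(fetch_all())
print({r.http_version for r in responses})   # e.g. {"HTTP/2"}
```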

The next step after that is QUIC (Quick UDP Internet Connections) which abandons TCP in favor of UDP, while retaining HTTP semantics. This is already in production on a lot of Google properties. I personally think it’s a really big deal; one of the reasons that HTTP was so successful is that its connections are short-lived and thus much less likely to suffer breakage while they’re at work. This is really good because designing an application-level protocol which can deal with broken connections is super-hard. In the world of HTTP, the most you have to deal with at one time is a failed request, and a broken connection is just one of the reasons that can happen. UDP makes the connection-breakage problem go away by not really having connections.

Of course, there’s no free lunch. If you’re using UDP, you’re not getting the TC in TCP, Transmission Control I mean, which takes care of packetizing and reassembly and checksumming and throttling and loads of other super-useful stuff. But judging by the evidence I see, QUIC does enough of that well enough to support HTTP semantics cleanly, so once again, apps that want to go on using the same old XMLHttpRequest calls like it was 2005 can remain happily oblivious.

Brave New World! · It seems inevitable to me that, particularly in the world of high-throughput high-elasticity cloud-native apps, we’re going to see a steady increase in reliance on persistent connections, orchestration, and message/event-based logic. If you’re not using that stuff already, now would be a good time to start learning.

But I bet that for the foreseeable future, a high proportion of all requests to services are going to have (approximately) HTTP semantics, and that for most control planes and quite a few data planes, REST still provides a good clean way to decompose complicated problems, and its extreme simplicity and resilience will mean that if you want to design networked apps, you’re still going to have to learn that way of thinking about things.



Contributions


From: Chad Brewbaker (Nov 19 2018, at 11:50)

At Global Day of Code Retreat last weekend, the easiest Game of Life implementation was actually in SQL. Surprised me a bit.

Most transactional data is in SQL data stores. Perhaps the problem is on the database end? Both failure of SQL databases to have REST style discoverability, and our failure to give SQL schemas for the data views offered by "REST" APIs.

Blowing away data schemas at the server application level has hurt more than helped for the sake of Type erasure "flexibility".


From: Mike Bannister (Nov 19 2018, at 14:13)

>Anyhow, I see GraphQL as a convenience feature designed to make synchronous APIs run more efficiently.

I was curious if you meant to use the word "synchronous" here or not? Doesn't seem right but maybe I'm misunderstanding?


From: Alastair Houghton (Nov 20 2018, at 10:12)

It's a shame that SCTP hasn't taken off outside of telecoms; it solves a lot of the problems with TCP and it would make a lot of sense to use it instead of reimplementing TCP-like semantics over UDP or coming up with complicated multi-channel application protocols that can work over TCP (these have significant problems when running on top of TCP, as the TCP layer enforces potentially unnecessary ordering constraints that can cause head-of-line blocking and other problems).


From: Santiago Gala (Nov 20 2018, at 13:29)

Not completely unrelated. :)

Re: latency and persistent connections, I had an "aha!" moment when I recently experimented with Wireguard as compared with my usual L2TP/IPsec VPNs: when I suspend the laptop, move elsewhere and resume it, 2 roundtrips, prompted by the first packet routed, is all it takes to have the connection back. It is virtually stateless and maintenance free. When I compare with L2TP/IPsec, it takes minimum 5 roundtrips, to get a security association going, even more if the L2TP tunnel dies and needs to be restarted.


From: Lewis Cowles (Nov 20 2018, at 22:49)

I'm not so sure that these millions-of-requests-per-second monoliths are needed (beyond superficial cases). After recently using GraphQL as the sole source of truth for an application, I hate it with a passion, and consider it antithetical to achieving high throughput.

Problems GraphQL introduces which REST / single-object RPC do not

- Traversing the leaf nodes of the graph that can be n-levels deep (assuming we're not expecting junk we ignore)

- Aggregation of data from systems which do not (or should not) store hierarchical data (SQL, KV)

- Separation of data to systems which do not (or should not) store hierarchical data (SQL, KV)

- What to do in case of a potentially increased space of conflicts

We recently had an error in one of our applications. In comes a request to update just one field of a resource that was "designed to use deep-structures". It's a JSON API endpoint, and it simply sets approval for a person. Somewhere in that nest of code (which I did not author), it triggers a cascade effect which wipes out skills, which can be sent with a resource, but have not in this case been sent.

If we instead had REST endpoints for attached resources for all write-operations, we'd have no problems apart from selecting the appropriate endpoint for our requests, reducing debugging, error count, software complexity.

I don't pretend that the problem lies solely with graph-inspired API's, but I do think if people are honest, having RPC and REST APIs as well as a few graph endpoints might be more the future than "REST is dead, long live graph"


From: Matthew Pava (Nov 27 2018, at 08:36)

I don't think we should be spending so much time focusing on a million-requests-per-minute infrastructure. The Internet was supposed to be distributed. It should be unlikely that a million endpoints are accessing a single server for any purpose. It's time to go back to our roots and focus on truly distributed computing. The problem becomes simpler to solve.


From: Nox (Dec 03 2018, at 18:39)

On coupling and Post-REST Messaging and Eventing. Is the understanding that the client/caller is required to maintain the state of the call on both cases?

