Semantic Contracts - The Unwritten API Agreements

This post discusses the implicit semantic contracts between services and their clients. When such a contract is broken, the breakage is harder to notice than a broken explicit data contract. The post also describes a method to deploy semantic contract changes safely.

Introduction

An API contract is an agreement between a service and its clients that determines how they communicate. However, the API contract alone isn’t enough to explain the observable behavior of the service. Consider a simple calculator service that exposes an API to return the sum of two integers. The API contract for this service can be as follows:

message SumRequest {
  int32 a = 1;
  int32 b = 2;
}

message SumResponse {
  int32 result = 1;
}

service Calculator {
  rpc Sum (SumRequest) returns (SumResponse);
}

(“API” and “RPC” are used interchangeably here to allow easy demonstration.)

The API contract for the Sum API states it will accept two integers as arguments and return an integer. However, it doesn’t dictate the semantics of the return value. It could be that today Sum(1, 2) returns 3, but tomorrow, it might start returning 4 due to a bug. The API contract isn’t breached, but the semantic contract is.
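To make the distinction concrete, here is a minimal Go sketch. The `SumRequest` and `SumResponse` structs are hand-written stand-ins for the generated protobuf types, and `sumCorrect` and `sumBuggy` are hypothetical names introduced only for illustration. Both functions have the same signature, so the compiler and the API contract accept either one; only the semantic contract tells them apart.

```go
package main

import "fmt"

// SumRequest and SumResponse are hand-written stand-ins for the generated
// protobuf types; they exist only to keep this sketch self-contained.
type SumRequest struct{ A, B int32 }
type SumResponse struct{ Result int32 }

// sumCorrect honors both the API contract and the semantic contract.
func sumCorrect(in SumRequest) SumResponse {
	return SumResponse{Result: in.A + in.B}
}

// sumBuggy honors the API contract (same types in and out) but breaks the
// semantic contract by returning one more than the sum.
func sumBuggy(in SumRequest) SumResponse {
	return SumResponse{Result: in.A + in.B + 1}
}

func main() {
	req := SumRequest{A: 1, B: 2}
	fmt.Println(sumCorrect(req).Result) // 3
	fmt.Println(sumBuggy(req).Result)   // 4
}
```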

How can a semantic contract be specified?

One indicator of the semantic contract of an API is its name. In this case, the name Sum indicates that the API returns the sum of the two integer arguments it is invoked with. We can also document this explicitly by adding a comment.

 service Calculator {
+  // Sum returns the sum of two input integers.
   rpc Sum (SumRequest) returns (SumResponse);
 }

The users of the service will rely on this semantic contract. Based on the observed behavior of the API, they can also form assumptions that the contract doesn’t guarantee. For example, the above contract makes no performance guarantees. However, based on the current performance, clients can come to assume that the p95 latency of the Sum API is (and will remain) 0.1s.

While the above approach improves the status quo, it doesn’t go far enough.

Quoting Hyrum’s law from “Software Engineering at Google”, page 8:

“With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.

We can mitigate it, but we know that it can never be eradicated.”

How to detect a semantic contract change?

Let’s consider the problem described previously: a bug is introduced in the Sum API’s implementation, causing Sum(1, 2) to return 4 instead of 3. This bug will be caught before deployment if the API is tested sufficiently. However, returning 4 might be intentional: the semantic contract of the API is being modified to return one more than the sum of the integer arguments. The individual making this change can fix the tests, deploy the new API, and call it a day. However, clients who rely on the older semantic contract will break. They will need to adjust their implementation to the new contract, after which the clients and the service will again agree on the semantics of the API.

How can we avoid breaking the clients in the first place?

Implications of breaking a test

If changing or adding functionality causes an existing test to fail, it implies that all clients who depend directly or indirectly on the tested behavior will break once the change is deployed. A failing test is therefore a good enough litmus test for identifying a semantic contract change.

There is an important detail here: the tests represent the semantic contract of the service only if they exercise enough of its features. Writing good and maintainable tests is an art and is as important as writing good code.
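As a rough illustration of tests acting as the semantic contract, the sketch below records the expected behavior of Sum as a table of cases and reports every case an implementation violates. The names `sumFunc`, `contractCases`, and `checkSumContract` are hypothetical, and a plain function is used instead of the `testing` package only to keep the example self-contained; in a real codebase this would more likely be a table-driven test in a `_test.go` file.

```go
package main

import "fmt"

// sumFunc is the shape of the behavior under test (hypothetical name).
type sumFunc func(a, b int32) int32

// contractCases records the behavior that clients are assumed to rely on.
var contractCases = []struct {
	a, b, want int32
}{
	{1, 2, 3},
	{0, 0, 0},
	{-4, 4, 0},
	{-2, -3, -5},
}

// checkSumContract returns one message per case the implementation violates.
func checkSumContract(sum sumFunc) []string {
	var violations []string
	for _, c := range contractCases {
		if got := sum(c.a, c.b); got != c.want {
			violations = append(violations,
				fmt.Sprintf("Sum(%d, %d) = %d, want %d", c.a, c.b, got, c.want))
		}
	}
	return violations
}

func main() {
	current := func(a, b int32) int32 { return a + b }
	changed := func(a, b int32) int32 { return a + b + 1 }

	fmt.Println(len(checkSumContract(current))) // 0
	fmt.Println(len(checkSumContract(changed))) // 4
}
```

A non-empty result for a proposed change is the signal discussed above: some client relying on the recorded behavior would likely observe the difference.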

Safely deploying a semantic contract change

The following diff highlights a mechanism to deploy such contract changes using versioning. By default, it doesn’t impact any downstream clients. However, it allows the clients to switch to the new version (that has the new functionality) at their own pace.

 message SumRequest {
   int32 a = 1;
   int32 b = 2;
+  // TODO: remove when all downstream clients have moved to the new version.
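+  enum Version {
+    // proto3 requires the first enum value to be zero. Requests that leave
+    // the version unset decode to this value and get the old behavior.
+    unspecified = 0;
+    v1 = 1;
+    v2 = 2;
+  }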
+  Version version = 3;
 }

 message SumResponse {
   int32 result = 1;
 }

 service Calculator {
   rpc Sum (SumRequest) returns (SumResponse);
 }

 func (s *server) Sum(ctx context.Context, in *api.SumRequest) (*api.SumResponse, error) {
+   switch in.GetVersion() {
+   case api.SumRequest_v2:
+     return &api.SumResponse{
+       Result: in.A + in.B + 1,
+     }, nil
+   default:
+     // Resort to old behavior until all downstream clients have moved to the new version.
+     return &api.SumResponse{
+       Result: in.A + in.B,
+     }, nil
+   }
 }
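The opt-in behavior of the diff above can be sketched without the generated code. In the example below, the `Version` type, its constants, and the `sum` function are hand-written stand-ins (assumptions of this sketch, not the generated `api` package). It shows that a client that never sets the version, or pins the old one, keeps getting the old result, while a client that has migrated gets the new one.

```go
package main

import "fmt"

// Version mirrors the enum from the diff above. The zero value stands for
// "the client did not set a version" (an assumption of this sketch).
type Version int32

const (
	VersionUnset Version = 0
	V1           Version = 1
	V2           Version = 2
)

// SumRequest is a stand-in for the generated request type.
type SumRequest struct {
	A, B    int32
	Version Version
}

// sum applies the new semantics only when the client explicitly asks for V2.
func sum(in SumRequest) int32 {
	switch in.Version {
	case V2:
		return in.A + in.B + 1
	default:
		// Unset and V1 both keep the old behavior.
		return in.A + in.B
	}
}

func main() {
	legacy := SumRequest{A: 1, B: 2}                // client unaware of versions
	pinned := SumRequest{A: 1, B: 2, Version: V1}   // client pinned to the old contract
	migrated := SumRequest{A: 1, B: 2, Version: V2} // client that opted in

	fmt.Println(sum(legacy), sum(pinned), sum(migrated)) // 3 3 4
}
```

Routing every unknown or unset version to the old behavior is what keeps the change invisible to clients that have not migrated.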

At DevRev, we have been actively using the above approach to deploy implicit contract changes.

Quoting from “Software Engineering at Google”, page 12:

“Your organization’s codebase is sustainable when you are able to change all of the things that you ought to change, safely, and can do so for the life of your codebase. Hidden in the discussion of capability is also one of costs: if changing something comes at inordinate cost, it will likely be deferred. If costs grow superlinearly over time, the operation clearly is not scalable.”

The upcoming posts in this series will be case studies of some interesting contract breaches we identified and mitigated at DevRev. As it turns out, none of them were easy.
