Improving GraphQL Federation Resiliency: Investigating Failed Schema Updates

Christian Ernst
Booking.com Engineering
12 min read · Oct 24, 2022

Primer: What are GraphQL and the Federation Gateway?

At Booking.com, we are creating a unified data access layer for our accommodation services — a single entry point for accessing all relevant data, regardless of what resource it comes from. In order to make this a reality, we are using GraphQL.

If you are unfamiliar with GraphQL, you can think of it simply as a query language for APIs. GraphQL allows us to map these APIs to a schema file, which defines the structure and types (such as boolean or int) of available fields. The key benefit here is that a GraphQL query, similar to a database query, allows the user to request only the fields they need, reducing bloated API responses. These fields are resolved by the GraphQL server itself, pulling in the data from a variety of sources such as databases or services, and using mapper functions to resolve this data from those downstream responses.
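
To make that concrete, here is a minimal sketch of a GraphQL service in TypeScript using Apollo Server. The schema and resolver are purely illustrative (a made-up hotel type), not Booking.com’s actual graph:

```typescript
import { ApolloServer } from '@apollo/server';
import { startStandaloneServer } from '@apollo/server/standalone';

// Hypothetical schema: a tiny slice of an accommodation-style graph, for illustration only.
const typeDefs = `#graphql
  type Hotel {
    id: ID!
    name: String!
    reviewScore: Float
  }

  type Query {
    hotel(id: ID!): Hotel
  }
`;

// Resolvers map schema fields to data; here a stub stands in for a real database or service call.
const resolvers = {
  Query: {
    hotel: (_parent: unknown, args: { id: string }) => ({
      id: args.id,
      name: 'Example Hotel',
      reviewScore: 8.7,
    }),
  },
};

const server = new ApolloServer({ typeDefs, resolvers });
const { url } = await startStandaloneServer(server, { listen: { port: 4000 } });
console.log(`GraphQL server ready at ${url}`);
```

A query such as `{ hotel(id: "1") { name } }` would then return only the `name` field, leaving `reviewScore` out of the response entirely.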

Many GraphQL layers start out as a single service calling downstream services for data, and Booking.com’s was no exception. As our single GraphQL service grew, it became harder to maintain: the GraphQL team had dozens of merge requests a day to review for schema changes and service mappings, frequent service deployments to manage, and merge conflicts to help resolve across multiple teams. It also made it difficult for service owners to have strong ownership of their data and schema, since both were heavily tied to the single service. To alleviate this, we migrated our GraphQL layer to a federated GraphQL layer using a service we built called the Federation Gateway.

The Federation Gateway

Instead of one service resolving and mapping all the requested data by calling downstream services via their transport method of choice, we let the services themselves define their own GraphQL endpoint and let them function as their own data resolvers. These services each define their own schema, which is then composed together and sent to the Federation Gateway.

A diagram showing the difference in request flow between a single graph and a federated graph
In a single graph, one service is in charge of the schema and of mapping service data to that schema. In a federated graph, each service owns its slice of the schema and maps its data itself.

The Federation Gateway knows which services can resolve which fields; it routes the requested fields from a client to those services and combines all the requested data back into a single response. This also makes it possible to have more complex types that span multiple services. For example, think of a hotel type containing reviews data and facilities data: the Federation Gateway understands where to get each piece of the data and merges it together for the client response.
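
As a rough illustration of how a type can span services, the sketch below shows what a reviews subgraph might look like using Apollo’s federation tooling; the types, fields and resolvers are assumptions for the example, not our real schema:

```typescript
import gql from 'graphql-tag';
import { ApolloServer } from '@apollo/server';
import { startStandaloneServer } from '@apollo/server/standalone';
import { buildSubgraphSchema } from '@apollo/subgraph';

// Hypothetical "reviews" subgraph: it contributes a reviews field to the Hotel entity
// that another subgraph owns. All names here are illustrative.
const typeDefs = gql`
  type Hotel @key(fields: "id") {
    id: ID!
    reviews: [Review!]!
  }

  type Review {
    score: Float!
    text: String
  }
`;

const resolvers = {
  Hotel: {
    // Called when the gateway asks this subgraph to resolve a Hotel it learned about elsewhere.
    __resolveReference: (ref: { id: string }) => ({
      id: ref.id,
      reviews: [{ score: 9.1, text: 'Great stay' }],
    }),
  },
};

const server = new ApolloServer({ schema: buildSubgraphSchema({ typeDefs, resolvers }) });
const { url } = await startStandaloneServer(server, { listen: { port: 4001 } });
console.log(`Reviews subgraph ready at ${url}`);
```

The `@key` directive tells the gateway which fields identify a `Hotel`, so it can fetch the base hotel data from one subgraph, the reviews from this one, and merge the two in the client response.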

What is Apollo and why do we use it?

As with most things in life, building a basic GraphQL service is easy(ish). Booking.com even has something similar, based on some of the same ideas as GraphQL, but written in Perl. The issue is that building it to work at our scale and in a federated manner is more complicated: that older legacy system only works with Perl and is tightly coupled to our legacy monolith, which makes any expansion or separation of it next to impossible.

To build a reliable GraphQL gateway for supporting multiple subgraphs, we need to consider the following:

  • Composing a schema with all the types from all the subgraphs
  • Validating that schema, including a process to override conflicts
  • Monitoring field usage to understand when it’s ok to remove a field
  • Assisting in handling the GraphQL n+1 problem
  • Developing libraries to build Federated Gateways and the subgraphs
  • Processing request traces to gain insight into each request
  • Increasing general technical expertise in GraphQL as we build our team

These reasons and many more are why we work with Apollo, a third-party GraphQL provider. Apollo provides the base libraries for building the Federation Gateway and the GraphQL services, as well as the tools that give us insight into GraphQL queries and the data being requested.

What’s the problem?

Apollo and their tools sound pretty great, and they are! However, using third-party services and tools always results in dependencies. Those dependencies can fall into a few categories: strong or weak, as well as runtime or build time.

The Federated Gateway has one strong runtime dependency, and that is schema fetching. The schema is a file that includes service information and types for the graph as well as custom directives and some other metadata.

Schema fetching happens every 10 seconds by default. An ID based on the timestamp of the build is sent as part of the schema check, and if it is older than the ID of the current build on Apollo’s side, the updated schema is sent back from Apollo.
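
Conceptually, the polling loop looks something like the sketch below. The JSON payload and field names are illustrative assumptions; the real gateway speaks Apollo Uplink’s GraphQL-based protocol:

```typescript
// Simplified, hypothetical sketch of the schema polling behaviour.
let currentId: string | null = null;
let currentSchema = '';

async function pollSchema(uplinkUrl: string): Promise<void> {
  const response = await fetch(uplinkUrl, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({ afterId: currentId }), // ID based on the last build timestamp we received
  });
  const result = await response.json();

  if (result.status === 'unchanged') {
    return; // nothing newer on Apollo's side, keep serving the current schema
  }

  currentId = result.id;
  currentSchema = result.supergraphSdl;
  // In the real gateway, this is also where the new schema gets reprocessed for query planning.
  console.log(`Loaded schema ${currentId} (${currentSchema.length} characters)`);
}

// Default behaviour: check for a new schema every 10 seconds.
setInterval(() => pollSchema('https://uplink.example.internal/graphql'), 10_000);
```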

If Apollo goes down, we cannot fetch a schema until it is back online. Any currently running process will continue to use the most recently fetched schema, but if a process or pod goes down it will stay down until Apollo is back up. This inability to pull a schema also means no rollouts, and hoping the chaos monkeys don’t take down an internal data center while Apollo is down.

This is an obvious issue, but Apollo runs on two different cloud providers with an uptime guarantee of 99.95%, and its service health dashboards backed this up, so we did not consider it a top priority to mitigate. That changed when we received a few reports from our subgraph partners (services downstream from the gateway) that, when testing in production, the same request would sometimes succeed and other times fail with invalid field errors. These errors showed that not all processes were getting the updated schema: the schemas were inconsistent across processes, and either the 99.95% uptime figure was not accurate or something else was going on.

Investigation

The first thing we did was to look at the Apollo service dashboard and make sure that everything looked good on their end. In this case, the dashboard looked great with 99.9% uptime for the past 30 days.

99.9% Uptime from Apollo’s side

We then took a look at the logs, and this is where the first issues were found. We saw a few types of unique errors being logged: the first being a `503` response, the second being TLS mismatch, and — last but most frequent — an error with the following message in the response body:

UplinkFetcher failed to update supergraph with the following error: RETRY_LATER: Internal Server Error: If this continues please contact support

This is where we reached out to Apollo to get some help with the investigation.

While we waited for a follow-up on Apollo’s investigation, we worked with our internal networking teams and determined there was nothing wrong on our side, at least as far as our network was concerned. We also tried staggering the schema poll rate from the initial 10 seconds to a randomized interval between 10 and 30 seconds, thinking the bursts of requests might be overloading one of the systems in the request chain, but this did not resolve or even reduce the problem. That meant something was wrong either with Apollo’s uplink server or with the library code that fetches the schema.

Apollo was as confused as we were about what could be going on. Their monitoring showed all green on their end. They had a couple of recommendations for us to try:

  • Enable TLS debugging and hope the issue would show its face (spoiler: it did not.)
  • Override the schema request URLs (uplink URLs) so that one points to their AWS cluster and another one points to a GCP cluster.

Neither step shed light on the issue or reduced how often it occurred. We also did not have great observability into the problem, because our log aggregator collapsed the same error across multiple entries, which made it difficult to get an accurate count. Luckily, in an update Apollo added support for a custom logger, which we wired up to our internal events system for improved insight. This logging gave us much better observability into the errors we were seeing and into when the schema was actually updated. What we found in terms of errors was vastly more than what we expected.
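
Wiring the custom logger into our events pipeline looked roughly like the sketch below; `emitEvent` stands in for the internal events client, and the whole snippet is a best-guess illustration rather than the exact code we run:

```typescript
import { ApolloGateway } from '@apollo/gateway';

// Hypothetical adapter: forward gateway log lines to an internal events pipeline.
const emitEvent = (level: string, message: unknown) =>
  console.log(JSON.stringify({ source: 'federation-gateway', level, message }));

const eventLogger = {
  debug: (message?: unknown) => emitEvent('debug', message),
  info: (message?: unknown) => emitEvent('info', message),
  warn: (message?: unknown) => emitEvent('warn', message),
  error: (message?: unknown) => emitEvent('error', message), // e.g. UplinkFetcher failures
};

// The gateway accepts a custom logger, so schema-fetch errors land in our own tooling
// with an accurate count instead of being lost in aggregated log entries.
const gateway = new ApolloGateway({ logger: eventLogger });
```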

Currently we have nearly 1,000 instances of our Gateway running in production, which means a schema check and fetch is made approximately 8,300,000 times a day. In our events system we saw the same two distinct errors being recorded while fetching the schema, but this time with an accurate count and some nice graphs to go with it. Within a 24-hour period we saw 2,000 `503` responses and 181,668 occurrences of the RETRY_LATER error message, which indicated a total failure rate of ~2.4%, considerably higher than the expected 0.05%.

We reported our findings to Apollo, who then started to take a deeper look into the issue. Apollo messaged us to let us know they had several updates planned in order to hopefully alleviate the issue but had no timeline for completion.

Solutions

Fix 1: The Big Fix

Although we knew Apollo would mitigate the issue (and as of this posting they have, in v2.1.0 of the Gateway library), we knew that this dependency could cause other issues down the line, so we needed a mitigation strategy in the meantime.

The general solution our team came up with was to host the schema file locally, so that Federation could pull the schema from within our own data center (DC), where connection issues are far less likely.

In order to store and host the schema locally we built the Federation Schema Caching Service, or FSCS for short. (We’re developers, not copywriters.)

FSCS works by receiving a webhook from Apollo that includes a URL to an updated schema file. This webhook is sent after every build and feeds into each of our data centers.

FSCS listens in each DC for the message, and once the message is received it fetches the schema via the URL provided in the webhook and stores it safely in persistent storage. This means that when the service restarts it can pull the latest schema even if Apollo is down.
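
A stripped-down sketch of that webhook flow might look like the following; the route, payload fields and file paths are illustrative assumptions rather than the real FSCS implementation:

```typescript
import express from 'express';
import { promises as fs } from 'fs';

const app = express();
app.use(express.json());

const SCHEMA_PATH = '/var/data/fscs/supergraph.graphql';
const META_PATH = '/var/data/fscs/supergraph.json';

// Webhook receiver: Apollo calls this after every successful schema build.
app.post('/webhook/schema-updated', async (req, res) => {
  const { url, timestamp } = req.body; // assumed payload shape

  // Fetch the freshly built supergraph SDL from the URL in the webhook.
  const schema = await (await fetch(url)).text();

  // Persist schema and metadata so a restarted process can serve the latest
  // schema even if Apollo happens to be unreachable at that moment.
  await fs.writeFile(SCHEMA_PATH, schema);
  await fs.writeFile(META_PATH, JSON.stringify({ id: timestamp }));

  res.sendStatus(204);
});

app.listen(8080);
```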

Currently, we use FSCS as the third uplink endpoint, which gives us triple redundancy in schema fetching. Building a service to support 1,000 instances of another service brought challenges in keeping the schema in sync across processes and DCs, in capacity planning, and in storage.

Capacity Planning for FSCS

Calculating the capacity for FSCS took a bit of math but was straightforward. We have approximately 1,000 instances of our Federation Gateway running, and the schema fetcher in Federation polls every 10 seconds. So, in any 10-second span, we can expect about 1,000 requests for the schema, or 100 requests per second (RPS) on average, when the gateway uses only our service to fetch the schema. Since the processes are not started all at once but rolled out gradually, requests tend to be spread over time, but that cannot be guaranteed: nothing forces each request to land exactly on a 10-second boundary, so, like two metronomes, the processes could drift into sync. To be safe, we assumed a worst-case peak of 1,000 RPS. Although that RPS is relatively low, the schema is nearly 350 KB in size, which means almost 350 MB of data could be sent across the wire in that short burst.

Schema Reprocessing and keeping things in sync

Although we would have liked everything to work perfectly, during our initial test run we saw degradation in query processing time as well as in CPU and memory usage. At first we were confused, as all we were effectively doing was adding another endpoint to fetch the schema from.

We took a look at our events system and the issue stood out plain as day: the schema was being reprocessed every time it was fetched. Schema processing is the step where the Gateway takes the updated schema and creates an in-memory abstract syntax tree from it, which is then used to validate incoming requests in a performant way. This is a problem because schema processing is expensive, especially when there are thousands of types to process. We were not sure at first what was going on, but assumed there was some odd mismatch in how the Gateway was deciding whether the schema was new.
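
For a sense of what schema processing involves, the sketch below parses an SDL string into an AST and builds a schema object with graphql-js. It only approximates the Gateway’s internal work (which also includes building query-planning structures), but it is the kind of step that becomes expensive with thousands of types:

```typescript
import { parse, buildASTSchema } from 'graphql';

// Illustration only: turn SDL text into an abstract syntax tree, then into a
// schema object that can be used to validate incoming requests.
function processSchema(supergraphSdl: string) {
  const ast = parse(supergraphSdl);
  return buildASTSchema(ast);
}
```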

In order to test this mismatch theory, we configured the feature to use only our endpoint. After restarting some instances to see the impact (there is unfortunately no way to change endpoints at runtime), we saw the same schema reprocessing each time, so we knew there must be something more going on.

To figure that out, we went through the schema fetcher source and realized there were a couple of things we had missed. The first was that it doesn’t send the whole schema every time; if nothing has changed, the server sends a response indicating that no changes were made. It is up to the server to determine this, and since we did not have the server’s source code, we could not know exactly what was going on server-side. A little more digging into the request itself provided more information: Federation sets an `afterID` field in the request, and with a little dev tools debugging this field turned out to be an ISO timestamp.

To test whether this was what the server was looking for, we made a curl request using that timestamp, and it sent back the “no change” response. To test whether it needed the exact timestamp, we adjusted the timestamp and watched the responses: unless it was exactly the same, the server would send the full schema. This led to a problem: how do you coordinate a timestamp between three different services? The answer was in the webhook itself.

We discovered that the timestamp field in the webhook is exactly the timestamp the uplink endpoints compare against. To keep things simple, we serialize the data from the webhook into a JSON file, which is loaded into the FSCS process whenever it changes. We then added a similar check to our endpoint so that it emits the same “no change” response when the `afterID` timestamp shows no change.
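
Continuing the hypothetical FSCS sketch from earlier, the “no change” check on our endpoint could look roughly like this (again, the JSON shape stands in for the real Uplink protocol, which is GraphQL-based):

```typescript
import express from 'express';
import { promises as fs } from 'fs';

const app = express();
app.use(express.json());

const SCHEMA_PATH = '/var/data/fscs/supergraph.graphql';
const META_PATH = '/var/data/fscs/supergraph.json';

// Uplink-style endpoint (assumed shape). If the caller's afterId matches the timestamp
// we stored from the webhook, answer "unchanged" so the Gateway skips reprocessing.
app.post('/uplink', async (req, res) => {
  const { afterId } = req.body;
  const { id } = JSON.parse(await fs.readFile(META_PATH, 'utf8'));

  if (afterId === id) {
    res.json({ status: 'unchanged', id });
    return;
  }

  const supergraphSdl = await fs.readFile(SCHEMA_PATH, 'utf8');
  res.json({ status: 'updated', id, supergraphSdl });
});

app.listen(8081);
```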

Once we rolled out these changes to FSCS and enabled the uplink feature, the constant reprocessing of the schema went away and query processing overhead returned to normal.

Fix 2: Apollo improves its running form

As FSCS was being built, Apollo released its first update to the schema fetcher itself. The update was rather large and improved how the schema fetcher works internally. First, it now uses a state management engine, so it knows when a schema request is already in flight and does not send another one in the meantime. Second, it now uses a round-robin approach for the uplink URLs, cycling through each URL instead of randomly plucking one from the list. With the changes to the state engine, the `RETRY_LATER` error disappeared, as it had been caused by a bug in the schema fetcher itself. This drastically reduced the number of errors we saw in our event reporter.

The update also had an unexpected effect on the performance of the Federation Gateway itself. The number of open sockets dropped dramatically, from ~500 to ~100, because overlapping requests were no longer keeping sockets open. This freed up resources for the Federation Gateway to focus on what it does best: processing queries. Our heaviest queries halved their processing time, as the event loop was no longer being held up by all the other active sockets.

Future changes

As we continue to monitor Federation, we expect to see an end to different processes running different schemas. To verify this, we plan to add a metric to our reporting that shows the schema version of each process. Unfortunately, there is currently no way to get the timestamp-based ID at the moment the schema is fetched, but we have asked Apollo to expose this data so we can make this happen.
