The Secret Serverless Computing Service in Azure

Lessons learned from designing a cost-effective containerized data processing solution on Azure

12 min read · Sep 5, 2023

Written by: Johannes Schmidt

As a small team of data engineers and data scientists, we often work on projects that involve designing and implementing data processing solutions for various customers. Recently, we had an interesting challenge: one of our customers had developed an optimisation algorithm that they wanted to run on a container-based platform in the cloud. The algorithm was designed to solve a complex optimisation problem that required some specific libraries installed in a Linux environment. The customer wanted us to design and implement an architecture that could run a data processing job on a daily basis, using their optimisation algorithm as the core component. The job would take about 10 to 20 minutes to complete, and the customer expected to run about five jobs per day. It all had to be extremely cost-effective and secure, meaning that only authorized users or systems could create job runs and view the results. Once rolled out, we would have to maintain the architecture, so monitoring and observability were also important aspects for us.

This article describes how we approached this problem and what solution we came up with. We also discuss some of the challenges we faced while designing and implementing the architecture, and how we solved them. We hope that this article will provide some insights and lessons learned for anyone who is interested in building container-based data processing solutions.

TL;DR: We designed and implemented a cost-effective, asynchronous, container-based data processing architecture for a customer on Azure that runs an optimisation algorithm on a daily basis. The architecture is summarised in the following diagram:

The user/system interacts with the Databricks service for containerised jobs via a secure REST API. The result is written to a storage location once completed. — Image by author

Overall we used the following Azure services:

  • Azure App Service (REST API)
  • Azure Databricks (containerised job runs)
  • Azure Container Registry (container images)
  • Azure Key Vault (credentials & secrets)
  • Azure Automation (failure notifications)
  • Azure Application Insights (logging & monitoring)
  • Azure Storage Account (job results)

Costs for the customer per month: ~15€–30€

Considerations

We decided to use Azure as our cloud provider since we were already familiar with its services and features.

Given the requirements mentioned earlier, we decided to split the workload between two services. One service runs the optimisation algorithm in a serverless fashion, while the other provides a secure API for users to manage job runs and other tasks.

After careful consideration, we chose to implement a REST API,
which offered us ease of development and use, compatibility with other services and platforms, and security and monitoring capabilities.

A very simple REST API architecture could look like this:

Simple REST API architecture where the API stands between the user/system and the resource responsible for the optimisation job — Image by author

Due to the lengthy duration of job runs (approximately 10 minutes), we could not use a synchronous approach to trigger them.
Instead, we opted for an asynchronous solution, where the users consume the REST API, which triggers the job runs and then monitors them.
The results of the job runs are stored in a storage account (blob), and the API provides a way to download them.

Job submits are forwarded to the Optimisation job service — Image by author
Running the optimisation algorithm takes ~10 min — Image by author
Accessing the result is also mediated through the REST API — Image by author
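From the caller's perspective, the whole flow could look roughly like the sketch below. The endpoint names, payloads and status values are illustrative, not the actual implementation:

```python
import time

import requests

API = "https://<app-name>.azurewebsites.net"       # hypothetical App Service URL
HEADERS = {"Authorization": "Bearer <aad-token>"}  # EasyAuth expects an AAD token

# 1) Trigger a job run with some input parameters
response = requests.post(f"{API}/runs", json={"param": "value"}, headers=HEADERS)
run_id = response.json()["run_id"]

# 2) Poll the run status until it reaches a terminal state (~10-20 min)
while True:
    status = requests.get(f"{API}/runs/{run_id}", headers=HEADERS).json()["status"]
    if status in ("SUCCESS", "FAILED"):
        break
    time.sleep(30)

# 3) Download the result, which the API reads from blob storage
result = requests.get(f"{API}/runs/{run_id}/result", headers=HEADERS)
result.raise_for_status()
```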

Please note that an event-driven architecture would also be possible,
especially for asynchronous communication (fire-and-forget).
However, it would likely be more complex and time-consuming to implement.

REST API

With the REST API, the users should be able to create new job runs, check the status of the existing job runs, and download the results.
Additionally, the API should support some other functionalities that are not relevant for this article.

We wanted to make our architecture as cost-effective as possible, so we considered using serverless Azure Functions as our API endpoints. Azure Functions are very convenient for building REST APIs, as they are charged per execution, easy to scale, and easy to integrate with other services.
Moreover, we did not care much about the response times, so we were not worried about the cold starts that may occur with serverless functions.
However, we encountered a major drawback: Azure Functions do not support Linux containers in the Consumption plan, only in the Premium plan, which costs about 130€/month. This was too expensive for the customer's budget, so we decided to use Azure App Service instead.

Azure App Service allows us to run our API in a containerised environment, and it only costs about 10€/month for an App
Service Plan (B1). Furthermore, Azure App Service keeps our API always on, so we do not have to deal with cold starts at all.
Both Azure Functions and Azure App Service (and Azure Container Apps) offer a built-in authentication feature called EasyAuth, which we used to secure our API.

The API was developed quickly using FastAPI, a modern, high-performance web framework for building APIs with Python 3.6+. We chose FastAPI because we have plenty of experience with it, and it is very easy to use and develop with. Together with the built-in EasyAuth feature, we were able to provide a secure REST API on Azure in no time.
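A trimmed-down sketch of what the service could look like. The endpoint names and the helper are ours, not the project's actual code; EasyAuth's injected principal header, however, is a documented App Service feature:

```python
from typing import Optional

from fastapi import FastAPI, Header, HTTPException

app = FastAPI(title="Optimisation Job API")


def submit_databricks_run(params: dict) -> str:
    """Hypothetical helper that wraps the Databricks Jobs API (see below)."""
    raise NotImplementedError


@app.post("/runs", status_code=202)
def create_run(
    params: dict,
    # With EasyAuth enabled, App Service authenticates the request before it
    # reaches the container and injects the caller's identity as headers.
    x_ms_client_principal_name: Optional[str] = Header(default=None),
):
    if x_ms_client_principal_name is None:
        raise HTTPException(status_code=401, detail="Not authenticated")
    return {"run_id": submit_databricks_run(params)}
```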

Serverless Compute Service

The most difficult part of this project was to find a suitable compute service for running the optimisation algorithm. We consulted the Microsoft documentation on how to choose a compute service for Azure, and we initially considered four options:

  • Azure Functions
  • Azure Container Apps
  • Azure Container Instances
  • Azure Virtual Machines

We needed a serverless service, as we did not want to pay for a VM running 24/7. We also needed a service that could run containers, as our optimisation algorithm required a Linux environment with some specific libraries. Moreover, we needed a service that could accept input parameters, monitor the job runs, and provide enough memory and CPU for the algorithm to work properly and deliver results in a reasonable timeframe.

Azure Functions were ruled out pretty quickly, as they do not support containers in the consumption plan, which also imposes a 10-minute execution time limit.

Azure Container Apps was also not an option, as it only offers 2 cores per instance in the consumption plan, which was not enough for our use case. With better quotas, it could have been a good option for both the REST API and the compute service.

Azure Container Instances (ACI) was a better option, as it offers up to 16 GB of memory and 4 cores, depending on the region, and has some built-in monitoring features that we could use. But it also has drawbacks: although ACI does accept input parameters via environment variables or by overriding the Docker entrypoint, running a container requires deploying the resource itself, for example via the Azure CLI. So whenever a job run with some input parameters is triggered via our REST API, the API deploys an ACI resource into the specified resource group. We would have preferred to roll out the ACI via IaC (Terraform) once and then merely start the container via a REST API.
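To illustrate the drawback: with the Python management SDK, every job run boils down to deploying a fresh container group. This is a sketch; names, sizes and the image are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.containerinstance import ContainerInstanceManagementClient
from azure.mgmt.containerinstance.models import (
    Container, ContainerGroup, EnvironmentVariable,
    OperatingSystemTypes, ResourceRequests, ResourceRequirements,
)

client = ContainerInstanceManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Each job run means deploying a whole new container group resource
container = Container(
    name="optimisation-job",
    image="myacr.azurecr.io/optimiser:latest",  # placeholder image
    resources=ResourceRequirements(requests=ResourceRequests(cpu=4.0, memory_in_gb=16.0)),
    environment_variables=[EnvironmentVariable(name="INPUT_PARAM", value="foo")],
)
client.container_groups.begin_create_or_update(
    "<resource-group>",
    "optimisation-job-run-42",  # a new resource per run
    ContainerGroup(
        location="westeurope",
        os_type=OperatingSystemTypes.LINUX,
        containers=[container],
        restart_policy="Never",
        # image_registry_credentials for the private ACR omitted for brevity
    ),
)
```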

Azure Virtual Machines could be used to run containers in an on-demand fashion, but this requires some additional configuration and is not as straightforward as the other options. For example, we could combine Azure Logic Apps and Azure Automation to start an Azure VM and use a Run Command to run scripts inside the VM.

A serverless VM solution with Azure Logic Apps — Image by author

However, this approach requires three services: Azure Logic Apps, Azure Automation, and Azure Virtual Machines. And when it comes to setting up alerts for failed job runs, the execution of the Run Command would have to be monitored, for example by continuously polling the runbook or the VM.
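For illustration, the Run Command part of such a setup could look roughly like this with the Python SDK (resource names and the image are placeholders):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

compute = ComputeManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Start the (deallocated) VM, then run the container inside it
compute.virtual_machines.begin_start("<resource-group>", "<vm-name>").result()
poller = compute.virtual_machines.begin_run_command(
    "<resource-group>",
    "<vm-name>",
    {
        "command_id": "RunShellScript",
        "script": ["docker run --rm myacr.azurecr.io/optimiser:latest"],  # placeholder
    },
)
# The poller has to be awaited (or polled) to learn whether the script succeeded
print(poller.result())
```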

Not ideal.

Solution

Fortunately, while working on another project for a different customer, we had experimented quite extensively with a popular data engineering tool: Databricks.

It can be described as a managed Spark cluster service. And surprisingly, it can be used in a serverless fashion with custom Docker containers, as you will see!

On its own, it provides a lot of the features we needed: a REST API for job runs, user and access management, an intuitive UI, a vast number of cluster configurations for different use cases, and a way to monitor the job runs.

If there are no special requirements, I’m sure that Databricks can be used as a standalone service for similar use cases. However, in our case, a REST API in front of Databricks was needed because users and other systems should be able to call a simple API endpoint with defined input parameters and not have to deal with the Databricks REST API.

Databricks comes in a standard and a premium tier, and you are only billed for the compute resources that you use. This is great for testing and development, but also when it is used as a cost-effective, serverless option for running containers. The difference between these tiers is well explained here. In essence, DBUs (the units used to measure compute consumption) are billed at a lower rate in the standard tier, but that tier lacks some features, such as RBAC and AAD passthrough authentication. And if you don't run anything, you don't pay anything, regardless of the tier. That's great! On top of that, Databricks is available on AWS, Azure and GCP, so it's easy to switch between cloud providers. Don't underestimate this advantage, as it helps you avoid vendor lock-in.

A noticeable downside with custom containers in Databricks, however, is that the container needs to have the Databricks runtime installed.
So Spark will be installed in your container, even if you don’t need it. But that was a small price to pay for us.

Anyway, we decided to try Databricks as our compute solution, containerised the optimisation algorithm, and defined an entrypoint script that would allow us to run the algorithm within the container. Explaining how to build and run custom Docker container images in Databricks would be out of scope for this article, but it is explained in detail in this blog post: Running Python Wheel Tasks in Custom Docker Containers in Databricks

The image was stored in an Azure Container Registry (ACR), which is a private Docker registry that allows us to store and manage container images.

All we had to do was write some code for our REST API to communicate with the Databricks REST API. Databricks has a well-documented REST API, which made it easy to create and trigger the job runs.
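A condensed sketch of what such a call can look like against the Jobs API 2.1, using a Python wheel task as in the blog post mentioned above. Cluster size, image name and task details are placeholders, not the customer's actual configuration:

```python
import requests

DATABRICKS_HOST = "https://adb-<workspace-id>.azuredatabricks.net"
TOKEN = "<databricks-token>"  # e.g. a PAT or AAD token for the workspace

payload = {
    "run_name": "optimisation-job",
    "tasks": [{
        "task_key": "optimise",
        "python_wheel_task": {
            "package_name": "optimiser",   # hypothetical wheel baked into the image
            "entry_point": "run",
            "parameters": ["--input", "foo"],
        },
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            # single-node job cluster to keep DBU consumption low
            "num_workers": 0,
            "spark_conf": {
                "spark.databricks.cluster.profile": "singleNode",
                "spark.master": "local[*]",
            },
            "custom_tags": {"ResourceClass": "SingleNode"},
            "docker_image": {
                "url": "myacr.azurecr.io/optimiser:latest",
                "basic_auth": {"username": "<client-id>", "password": "<client-secret>"},
            },
        },
    }],
}
response = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/runs/submit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
run_id = response.json()["run_id"]
```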

However, we noticed that starting job runs with custom Docker containers the secure way (AAD passthrough authentication) is only available in the premium tier. With the standard tier, username and password authentication is used. We wanted to keep the DBU costs as low as possible, so we decided to start with the standard tier. Hence, the REST API needs these credentials to start a job run. This is not ideal, but it is the only way to do it with the standard tier. The "username" and "password" were in fact the client id and client secret of a service principal that we created. Such credentials should be stored securely; in our case, we stored them in an Azure Key Vault. The REST API can then retrieve the credentials from the Key Vault and use them to authenticate with Databricks.
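Retrieving the credentials in the API is only a few lines with the Azure SDK (the secret names are illustrative):

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# The App Service runs with a managed identity that has read access to the vault
secrets = SecretClient(
    vault_url="https://<vault-name>.vault.azure.net",
    credential=DefaultAzureCredential(),
)
sp_client_id = secrets.get_secret("databricks-sp-client-id").value
sp_client_secret = secrets.get_secret("databricks-sp-client-secret").value
```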

Deploying the whole infrastructure with Terraform (CDKTF Python) was straightforward, so we got this up and running in no time, and it worked like a charm pretty much from the beginning.
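A trimmed-down sketch of what such a stack can look like with the prebuilt azurerm provider bindings. The resource names are ours, and exact module paths depend on the provider version:

```python
from constructs import Construct
from cdktf import App, TerraformStack

# Prebuilt azurerm provider bindings; module paths may differ between versions
from cdktf_cdktf_provider_azurerm.provider import AzurermProvider
from cdktf_cdktf_provider_azurerm.resource_group import ResourceGroup
from cdktf_cdktf_provider_azurerm.service_plan import ServicePlan


class ApiStack(TerraformStack):
    def __init__(self, scope: Construct, id: str):
        super().__init__(scope, id)
        AzurermProvider(self, "azurerm", features={})
        rg = ResourceGroup(self, "rg", name="rg-optimisation", location="westeurope")
        ServicePlan(
            self,
            "plan",
            name="asp-optimisation",
            location=rg.location,
            resource_group_name=rg.name,
            os_type="Linux",
            sku_name="B1",  # the ~10€/month plan mentioned above
        )
        # App Service, Databricks workspace, Key Vault, ACR etc. omitted


app = App()
ApiStack(app, "optimisation-api")
app.synth()
```

All that was left to do was to provide proper logging and monitoring for the REST API and the job runs.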

Monitoring

Logging and monitoring are important aspects of any architecture.
They allow us to keep track of the health of our system and to detect and fix problems as soon as possible.

REST API

For the REST API this was not a big deal as Azure App Service provides built-in logging and monitoring features. Metrics such as CPU, memory, and network usage are available out of the box, and we also enabled application logging and request tracing with Azure Application Insights.
To track the health of the REST API, we made use of the Health Check feature of the App Service.

Enabling “Health check” with the Azure App Service — Image by author
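The health check simply pings a configured path and expects a 2xx response, so on the API side this can be as small as the following, reusing the FastAPI `app` from the sketch above (the path name is our choice):

```python
@app.get("/health")
def health() -> dict:
    # App Service pings this path periodically; any 2xx response counts as healthy
    return {"status": "ok"}
```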

Databricks

For the job runs however, we encountered some challenges. Databricks allows you to get notified when a job run starts, succeeds or fails with System Notifications.

System notifications are messages that tell you when your workflow experiences a run event (start, success, and failure). By default, notifications are sent to user email addresses, but admins can configure alternate notification destinations using webhooks. This allows you to build event-driven integrations with Databricks.

Creating a new system notification/notification destination in Databricks — Image by author

Our notification destination was Microsoft Teams, and we wanted to get notified immediately when a job run fails. However, as it turned out, Teams was not supported as a destination for job run notifications (as of 2023), probably because the message sent from Databricks does not conform to the Teams webhook schema. So we had to find another way to get notified about failed job runs until Teams is supported.

All we had to find was a service with a webhook that could be triggered when a job run fails. Ideally, this webhook would execute a script that sends a message to a Teams channel. This is exactly what Azure Automation can do: external services, such as Databricks, can use a webhook via a single HTTP request to execute a runbook in Azure Automation. This runbook can process the data and invoke the webhook of a Teams channel.

So our solution looked like this:

Databricks events (start, success, failure) can be used for notifications — Image by author
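The runbook itself can stay small. The Python sketch below rests on two assumptions on our part: that the webhook payload is handed to the runbook as a JSON string in sys.argv with the original Databricks message in a "RequestBody" field, and that the Databricks event carries run_id and run_page_url; the Teams URL is the channel's incoming webhook:

```python
# Azure Automation Python runbook (sketch)
import json
import sys

import requests

# Incoming webhook of the Teams channel (placeholder URL)
TEAMS_WEBHOOK_URL = "https://<tenant>.webhook.office.com/webhookb2/<...>"

# Assumption: the webhook payload arrives as a JSON string in sys.argv,
# with the original Databricks message in the "RequestBody" field
webhook_data = json.loads(sys.argv[1])
event = json.loads(webhook_data["RequestBody"])

run = event.get("run", {})  # assumed Databricks payload fields
requests.post(
    TEAMS_WEBHOOK_URL,
    json={"text": f"Databricks job run {run.get('run_id')} failed: {run.get('run_page_url')}"},
)
```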

It’s not ideal as it requires an additional service, but it works for now. Azure Automation offers 500 minutes of free job run time per month, so
there will be no costs for occasional notifications.

One thing worth mentioning, though, is a manual step we could not avoid: once the Azure Automation account and runbook were created via IaC, we could not find an automated way to create a system notification in Databricks that uses the webhook of the Azure Automation runbook, so we had to create it manually. Once created, we stored the id of the system notification in the Key Vault so that the REST API can use it when creating job runs.
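With the notification id at hand, attaching it to a run is one more field in the runs/submit payload from the earlier sketch (secret name again illustrative):

```python
# Stored once after the manual creation step, then read back like the SP credentials
notification_id = secrets.get_secret("databricks-notification-id").value

payload["webhook_notifications"] = {
    "on_failure": [{"id": notification_id}],
}
```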

But we are looking forward to the day when Teams is supported for job runs and Databricks supports the creation of system notifications via REST API. Until then, we have notifications for failed job runs in place that look like this:

Microsoft Teams notifications from Databricks events — Image by author

Conclusion

While working on this project, we realised that providing a cost-effective container solution on Azure is quite a challenge. Only by looking into another service (Databricks) were we able to find a solution that met our requirements. We learned that Databricks is a very powerful service that can be used in many ways beyond the usual Spark or data science use cases. It allowed us to run custom Docker containers in a serverless fashion, which was a very neat way to run the customer's optimisation algorithm while keeping costs low! There are plenty of cluster sizes to choose from, even for very demanding computations, and the billing is transparent, which, among other reasons, makes it a good compute service option.

However, as already mentioned, there are some downsides to using Databricks like this that should be considered:

  • The standard tier of Azure Databricks only supports username and password authentication and offers no RBAC. If the premium tier is used, DBUs are more expensive.
  • The Docker images are required to include the Databricks runtime (including Spark), which is not always wanted, especially for small images.
  • There are cold starts due to cluster provisioning (~5 minutes). Alternatively, cluster pools can be used, which would reduce the cold start time significantly but increase the costs: in this case, Databricks does not charge DBUs while instances are idle in the pool, but instance provider billing does apply.
  • Setting up notifications for failed job runs is not yet supported for Microsoft Teams, so a workaround, e.g. with Azure Automation, is needed.

Overall, we noticed that using Databricks as a cost-effective serverless compute resource does not seem to be very well known or popular, as we couldn't find many people using Databricks in this way. Maybe that's because it's not mentioned in the compute decision tree, or maybe there are downsides we haven't considered.

Anyway, we wanted to share this with the community, as we think that this is a very interesting approach that could be a good fit for some use cases.

What do you think about this solution? Could this be something for you? Or do you have any other ideas or suggestions?
