Automatically Managing Data Pipeline Infrastructures With Terraform

I know the manual work you did last summer

João Pedro
Towards Data Science


Photo by EJ Yao on Unsplash

Introduction

A few weeks ago, I wrote a post about developing a data pipeline using both on-premise and AWS tools. This post is part of my recent effort to bring more cloud-oriented data engineering posts.

However, when mentally reviewing this post, I noticed a big problem: the manual work.

Whenever I develop a new project, whether real or fictional, I always try to reduce the friction of configuring the environment (installing dependencies, configuring folders, obtaining credentials, etc.), and that’s why I always use Docker. With it, I just pass you a docker-compose.yaml file plus a few Dockerfiles, and you can recreate exactly the same environment as mine with a single command — docker compose up.

However, when we want to develop a new data project with cloud tools (S3, Lambda, Glue, EMR, etc.), Docker can’t help us, as the components need to be instantiated in the provider’s infrastructure. There are two main ways of doing this: manually in the UI or programmatically through the service APIs.

For example, you can access the AWS console in your browser, search for S3, and create a new bucket manually, or write Python code that creates the same bucket by making a request to the AWS API.

In the post mentioned earlier, I described, step by step, how to create the needed components MANUALLY through the AWS web interface. The result? Even trying to summarize as much as possible (and even omitting parts!), the post ended up as a 17-minute read, 7 minutes longer than my usual, full of screenshots of which screen you should access, where you should click, and which settings to choose.

In addition to being a costly, confusing, and time-consuming process, it is still susceptible to human error, which ends up bringing more headaches and possibly even bad surprises in the monthly bill. Definitely an unpleasant process.

And this is exactly the kind of problem that Terraform comes to solve.

not sponsored.

What is Terraform?

Terraform is an IaC (Infrastructure as Code) tool that manages infrastructure in cloud providers in an automatic, programmatic manner.

In Terraform, the desired infrastructure is described using a declarative language called HCL (HashiCorp Configuration Language), in which the components are specified, e.g. an S3 bucket named “my-bucket” or an EC2 server running Ubuntu 22 in the us-east-1 region.
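For instance, a minimal sketch of what such a description looks like in HCL (the resource names and the AMI id below are just placeholders, not part of this project):

# An S3 bucket named "my-bucket" and an Ubuntu EC2 server, declared in HCL
resource "aws_s3_bucket" "example" {
  bucket = "my-bucket"
}

resource "aws_instance" "ubuntu_server" {
  ami           = "ami-00000000000000000" # placeholder for an Ubuntu 22.04 AMI in us-east-1
  instance_type = "t2.micro"
}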

The described resources are materialized by Terraform through calls to the cloud provider’s service APIs. Beyond creation, it is also capable of destroying and updating the infrastructure, adding/removing only the resources needed to move from the current state to the desired state, e.g. if 4 EC2 instances are requested and 2 already exist, it will create only 2 new ones. This behavior is possible because Terraform stores the current state of the infrastructure in state files.

Because of this, it's possible to manage a project’s infrastructure in a much more agile and secure way, as it removes the manual work of configuring each individual resource.

Terraform aims to be a cloud-agnostic IaC tool, so it uses a standardized language to mediate the interaction with the cloud providers’ APIs, removing the need to learn how to interact with each of them directly. Along the same lines, HCL also supports variable manipulation and a certain degree of flow control (if-statements and loops), allowing conditionals and loops in resource creation, e.g. creating 100 EC2 instances.
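As a quick illustration of that, the count meta-argument is one way to express such a loop (a sketch, with placeholder values):

# Sketch: create 100 identical EC2 instances with the count meta-argument
resource "aws_instance" "workers" {
  count         = 100
  ami           = "ami-00000000000000000" # placeholder AMI id
  instance_type = "t2.micro"

  tags = {
    Name = "worker-${count.index}" # worker-0, worker-1, ..., worker-99
  }
}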

Last but not least, Terraform also allows infrastructure versioning, as its plain-text files can easily be tracked with git.

The implementation

As mentioned earlier, this post seeks to automate the process of infrastructure creation of my previous post.

To recap, the project aimed at creating a data pipeline to extract questions from the Brazilian ENEM (National High School Exam, in a literal translation) tests using the PDFs available on the MEC (Ministry of Education) website.

The process involved three steps, controlled by a local Airflow instance. These steps included downloading and uploading the PDF file to S3 storage, extracting texts from the PDFs using a Lambda function, and segmenting the extracted text into questions using a Glue Job.

Note that, for this pipeline to work, many AWS components have to be created and correctly configured.

0. Setting up the environment

All the code used in this project is available in this GitHub Repository.

You’ll need a machine with Docker and an AWS account.

The first step is configuring a new AWS IAM user for Terraform; this will be the only step executed in the AWS web console.

Create a new IAM user with FullAccess to S3, Glue, Lambda, and IAM, and generate programmatic access credentials for it.

This is a lot of permissions for a single user, so keep the credentials safe.

I’m using FullAccess permissions because I want to keep things simple for now, but always consider the ‘least privilege’ approach when dealing with credentials.

Now, back to the local environment.

In the same folder as the docker-compose.yaml file, create a .env file and write your credentials in it:

AWS_ACCESS_KEY_ID=<YOUR_ACCESS_KEY_ID>
AWS_SECRET_ACCESS_KEY=<YOUR_SECRET_ACCESS_KEY>

These variables will be passed to the docker-compose file to be used by Terraform.

version: '3'
services:
  terraform:
    image: hashicorp/terraform:latest
    volumes:
      - ./terraform:/terraform
    working_dir: /terraform
    command: ["init"]
    environment:
      - TF_VAR_AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
      - TF_VAR_AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
      - TF_VAR_AWS_DEFAULT_REGION=us-east-1

1. Create the Terraform file

Still in the same folder, create a new directory called terraform. Inside it, create a new file named main.tf; this will be our main Terraform file.

This folder will be mapped inside the container when it runs, so the internal Terraform will be able to see this file.
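At this point, the local project layout should look roughly like this (the Lambda and Glue files will be added in later steps):

.
├── docker-compose.yaml
├── .env
└── terraform/
    └── main.tf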

2. Configure the AWS Provider

The first thing we need to do is to configure the cloud provider used.

terraform {
  required_version = ">= 0.12"

  required_providers {
    aws = ">= 3.51.0"
  }
}

variable "AWS_ACCESS_KEY_ID" {
  type = string
}

variable "AWS_SECRET_ACCESS_KEY" {
  type = string
}

variable "AWS_DEFAULT_REGION" {
  type = string
}

provider "aws" {
  access_key = var.AWS_ACCESS_KEY_ID
  secret_key = var.AWS_SECRET_ACCESS_KEY
  region     = var.AWS_DEFAULT_REGION
}

This is what a Terraform configuration file looks like — a set of blocks with different types, each one with a specific function.

The terraform block pins the versions required for Terraform itself and for the AWS provider.

A variable is exactly what the name suggests — a value assigned to a name that can be referenced throughout the code.

As you probably already noticed, our variables don’t have values assigned to them, so what’s going on? The answer is back in the docker-compose.yaml file: the values of these variables were set using environment variables. When a variable’s value is not defined, Terraform looks for an environment variable named TF_VAR_<var_name> and uses its value. I’ve opted for this approach to avoid hard-coding the keys.
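As an alternative (not used in this project), a variable can also carry a default value, which Terraform uses when neither a direct assignment nor a TF_VAR_ environment variable is present. A sketch for the region variable:

variable "AWS_DEFAULT_REGION" {
  type    = string
  default = "us-east-1" # used only if TF_VAR_AWS_DEFAULT_REGION is not set
}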

The provider block is also self-explanatory — it references the cloud provider we’re using and configures its credentials. We set the provider’s arguments (access_key, secret_key, and region) with the variables defined earlier, referenced with the var.<var_name> notation.

With this block defined, run:

docker compose run terraform init 

to set up Terraform.

3. Creating our first resource: The S3 bucket

Terraform uses the resource block to reference infrastructure components such as S3 buckets and EC2 instances, as well as actions like granting permissions to users or uploading files to a bucket.

The code below creates a new S3 bucket for our project.

resource "aws_s3_bucket" "enem-bucket-terraform-jobs" {
bucket = "enem-bucket-terraform-jobs"
}

A resource definition follows the syntax:

resource <resource_type> <resource_name> {
  argument_1 = "blah blah blah blah"
  argument_2 = "blah blah blah"
  argument_3 {
    ...
  }
}

In the case above, “aws_s3_bucket” is the resource type and “enem-data-bucket” is the resource name, used to reference this resource in the Terraform files (it is not the bucket name in the AWS infrastructure). The argument bucket = “enem-data-bucket” is what assigns the actual bucket name.

Now, with the command:

docker compose run terraform plan

Terraform will compare the current state of the infrastructure and infer what needs to be done to achieve the desired state described in the main.tf file.

Because this bucket still does not exist, Terraform will plan to create it.

To apply Terraform’s plan, run

docker compose run terraform apply

And, with only these few commands, our bucket is already created.

Easy, right?

To destroy the bucket, just type:

docker compose run terraform destroy

And Terraform takes care of the rest.

These are the basic commands that will follow us until the end of the post: plan, apply, destroy. From now on, all that we’re going to do is configure the main.tf file, adding the resources needed to materialize our data pipeline.

4. Configuring the Lambda Function part I: Roles and permissions

Now, on to the Lambda function definition.

This was one of the trickiest parts of my previous post because, by default, Lambda functions already need a set of basic permissions and, on top of that, we also had to give it read and write permissions to the S3 bucket previously created.

First of all, we must create a new IAM role.

# CREATE THE LAMBDA FUNCTION
# ==========================

# CREATE A NEW ROLE FOR THE LAMBDA FUNCTION TO ASSUME
resource "aws_iam_role" "lambda_execution_role" {
  name = "lambda_execution_role_terraform"
  assume_role_policy = jsonencode({
    # This is the policy document that allows the role to be assumed by Lambda;
    # other services cannot assume this role
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "lambda.amazonaws.com"
        }
      }
    ]
  })
}

When developing these things, I strongly suggest that you first ask ChatGPT, GitHub Copilot, or any other LLM friend for what you want, and then check the provider’s documentation on how this type of resource works.

The code above creates a new IAM role and allows AWS Lambda Functions to assume it. The next step is to attach the Lambda Basic Execution policy to this role to allow the Lambda Function to execute without errors.

# ATTACH THE BASIC LAMBDA EXECUTION POLICY TO THE ROLE lambda_execution_role
resource "aws_iam_role_policy_attachment" "lambda_basic_execution" {
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
  role       = aws_iam_role.lambda_execution_role.name
}

The nice thing to note in the code above is that we can reference resource attributes and pass them as arguments in the creation of new resources. In the case above, instead of hard-coding the ‘role’ argument with the name of the previously created role ‘lambda_execution_role_terraform’, we can reference this attribute using the syntax:
<resource_type>.<resource_name>.<attribute>

If you take some time to look into the Terraform documentation of a resource, you’ll note that it has arguments and attributes. Arguments are what you pass in order to create/configure a new resource and attributes are read-only properties about this resource available after its creation.

Because of this, attributes are used by Terraform to implicitly manage dependencies between resources, establishing the appropriate order of their creation.

The code below creates a new access policy for our S3 bucket, allowing basic CRUD operations on it.

# CREATE A NEW POLICY FOR THE LAMBDA FUNCTION TO ACCESS S3
resource "aws_iam_policy" "s3_access_policy" {
  name = "s3_access_policy"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:PutObject",
          "s3:DeleteObject"
        ]
        # object-level actions apply to the objects inside the bucket, hence the /*
        Resource = "${aws_s3_bucket.enem-data-bucket.arn}/*"
      }
    ]
  })
}

# ATTACH THE S3 ACCESS POLICY TO THE ROLE lambda_execution_role
resource "aws_iam_policy_attachment" "s3_access_attachment" {
  name       = "s3_and_lambda_execution_access_attachment"
  policy_arn = aws_iam_policy.s3_access_policy.arn
  roles      = [aws_iam_role.lambda_execution_role.name]
}

Again, instead of hard-coding the bucket’s ARN, we reference it with aws_s3_bucket.enem-data-bucket.arn (with /* appended so the permissions apply to the objects inside the bucket).

With the Lambda role correctly configured, we can finally create the function itself.

# CREATE A NEW LAMBDA FUNCTION
resource "aws_lambda_function" "lambda_function" {
  function_name = "my-lambda-function-aws-terraform-jp"
  role          = aws_iam_role.lambda_execution_role.arn
  handler       = "lambda_function.lambda_handler"
  runtime       = "python3.8"
  filename      = "lambda_function.zip"
}

The lambda_function.zip file is a compressed folder that must contain a lambda_function.py file with a lambda_handler(event, context) function inside. It must be in the same path as the main.tf file.

# lambda_function.py
def lambda_handler(event, context):
    return "Hello from Lambda!"

5. Configuring the Lambda Function part II: Attaching a trigger

Now, we need to configure a trigger for the Lambda Function: It must execute every time a new PDF is uploaded to the bucket.

# ADD A TRIGGER TO THE LAMBDA FUNCTION BASED ON S3 OBJECT CREATION EVENTS
# https://stackoverflow.com/questions/68245765/add-trigger-to-aws-lambda-functions-via-terraform

resource "aws_lambda_permission" "allow_bucket_execution" {
  statement_id  = "AllowExecutionFromS3Bucket"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.lambda_function.arn
  principal     = "s3.amazonaws.com"
  source_arn    = aws_s3_bucket.enem-data-bucket.arn
}

resource "aws_s3_bucket_notification" "bucket_notification" {
  bucket = aws_s3_bucket.enem-data-bucket.id

  lambda_function {
    lambda_function_arn = aws_lambda_function.lambda_function.arn
    events              = ["s3:ObjectCreated:*"]
    filter_suffix       = ".pdf"
  }

  depends_on = [aws_lambda_permission.allow_bucket_execution]
}

This is a case where we must specify an explicit dependency between resources, as the “bucket_notification” resource needs to be created after the “allow_bucket_execution” permission.

This can be easily achieved by using the depends_on argument.

And we’re done with the lambda function, just run:

docker compose run terraform apply

And the Lambda Function will be created.

6. Adding a module for the Glue job

Our main.tf file is getting pretty big, and remember that this is just a simple data pipeline. To enhance the organization and reduce its size, we can use the concept of modules.

A module is a set of resources grouped in a separate file that can be referenced and reused by other configuration files. Modules enable us to abstract complex parts of the infrastructure to make our code more manageable, reusable, organized, and modular.

So, instead of coding all the resources needed to create our Glue job in the main.tf file, we’ll put them inside a module.

In the ./terraform folder, create a new folder ‘glue’ with a glue.tf file inside it.

Then add a new S3 bucket resource in the file:

# INSIDE GLUE.TF
# Create a new bucket to store the job script
resource "aws_s3_bucket" "enem-bucket-terraform-jobs" {
  bucket = "enem-bucket-terraform-jobs"
}

Back in main.tf, just reference this module with:

module "glue" {
source = "./glue"
}

And reinitialize terraform:

docker compose run terraform init

Terraform will reinitialize its backend and register the new module.

Now, if we run terraform plan, it should include this new bucket in the creation list.

Using this module, we’ll be able to encapsulate all the logic of creating the job in a single external file.

A requirement of AWS Glue jobs is that their job files are stored in an S3 bucket, and that’s why we created “enem-bucket-terraform-jobs”. Now, we must upload the job’s file itself.

In the terraform folder, I’ve included a myjob.py file; it is just an empty file used to simulate this behavior. To upload a new object to a bucket, just use the “aws_s3_object” resource:

# UPLOAD THE SPARK JOB FILE myjob.py to s3
resource "aws_s3_object" "myjob" {
  bucket = aws_s3_bucket.enem-bucket-terraform-jobs.id
  key    = "myjob.py"
  source = "myjob.py"
}

From now on, it is just a matter of implementing the Glue role and creating the job itself.

# CREATE A NEW ROLE FOR THE GLUE JOB TO ASSUME
resource "aws_iam_role" "glue_execution_role" {
  name = "glue_execution_role_terraform"
  assume_role_policy = jsonencode({
    # This is the policy document that allows the role to be assumed by Glue;
    # other services cannot assume this role
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "glue.amazonaws.com"
        }
      }
    ]
  })
}

# ATTACH THE BASIC GLUE EXECUTION POLICY TO THE ROLE glue_execution_role
resource "aws_iam_role_policy_attachment" "glue_basic_execution" {
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
  role       = aws_iam_role.glue_execution_role.name
}

Not so fast. We must ensure that this job has the same read and write permissions to the bucket “enem-data-bucket” as the Lambda function, i.e. we need to attach the aws_iam_policy.s3_access_policy to its role.

But, because this policy was defined in the main file, we cannot reference it directly in our module.

# THIS WILL RESULT IN AN ERROR!!!!
# ATTACH THE S3 ACCESS POLICY s3_access_policy TO THE ROLE glue_execution_role
resource "aws_iam_policy_attachment" "s3_access_attachment_glue" {
  name       = "s3_and_glue_execution_access_attachment"
  policy_arn = aws_iam_policy.s3_access_policy.arn
  roles      = [aws_iam_role.glue_execution_role.name]
}

In order to achieve this behavior, we must pass the access policy ARN as an argument to the module, and that’s pretty simple.

First, in the glue.tf file, create a new variable to receive the value.

variable "enem-data-bucket-access-policy-arn" {
type = string
}

Go back to the main file and, in the module reference, pass a value to this variable.

module "glue" {
source = "./glue"
enem-data-bucket-access-policy-arn = aws_iam_policy.s3_access_policy.arn
}

Finally, in the glue file, use the value of the variable in the resource.

# ATTACH THE S3 ACCESS POLICY s3_access_policy TO THE ROLE glue_execution_role
resource "aws_iam_policy_attachment" "s3_access_attachment_glue" {
  name       = "s3_and_glue_execution_access_attachment"
  policy_arn = var.enem-data-bucket-access-policy-arn
  roles      = [aws_iam_role.glue_execution_role.name]
}

Now, take a minute to think about the power of what we have just done. With modules and arguments, we can create fully parametrized, complex infrastructures.

The code above doesn’t just create a specific job for our pipeline. By just changing the value of the enem-data-bucket-access-policy-arn variable, we can create a new job to process data from an entirely different bucket.

And that logic applies to anything you want. It’s possible, for example, to simultaneously create the complete infrastructure of a project for the development, testing, and production environments, using just variables to switch between them.
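A rough sketch of that idea, assuming the module also received an environment variable used to suffix its resource names (that variable and the per-environment policies below are hypothetical, not part of this project’s code):

module "glue_dev" {
  source                             = "./glue"
  environment                        = "dev"
  enem-data-bucket-access-policy-arn = aws_iam_policy.s3_access_policy_dev.arn
}

module "glue_prod" {
  source                             = "./glue"
  environment                        = "prod"
  enem-data-bucket-access-policy-arn = aws_iam_policy.s3_access_policy_prod.arn
}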

Without further ado, all that remains is to create the Glue job itself, and there is no novelty in it:

# CREATE THE GLUE JOB
resource "aws_glue_job" "myjob" {
  name         = "myjob"
  role_arn     = aws_iam_role.glue_execution_role.arn
  glue_version = "4.0"

  command {
    script_location = "s3://${aws_s3_bucket.enem-bucket-terraform-jobs.id}/myjob.py"
  }

  default_arguments = {
    "--job-language"        = "python"
    "--job-bookmark-option" = "job-bookmark-disable"
    "--enable-metrics"      = ""
  }

  depends_on = [aws_s3_object.myjob]
}

And our infrastructure is done. Run terraform apply to create the remaining resources.

docker compose run terraform apply

And terraform destroy to get rid of everything.

docker compose run terraform destroy

Conclusion

I came across Terraform a few days after publishing my second post about creating data pipelines using cloud providers, and it blew my mind. I instantly thought about all the manual work I had done to set up the project, all the screenshots captured to showcase the process, and all the undocumented details that would haunt my nightmares when I needed to reproduce it.

Terraform solves all these problems. It is simple, easy to set up, and easy to use: all it needs is a few .tf files along with the provider’s credentials, and we’re ready to go.

Terraform tackles the kind of problem that people usually aren’t so excited to think about. When developing data products, we all think about performance, optimization, latency, quality, accuracy, and other data-specific or domain-specific aspects of our product.

Don’t get me wrong, we all study to apply our best mathematical and computational knowledge to solve these problems, but we also need to think about critical aspects of our product’s development process, like reproducibility, maintainability, documentation, versioning, integration, modularization, and so on.

These are aspects that our software engineering colleagues have been concerned about for a long time, so we don’t have to reinvent the wheel, just learn a thing or two from their best practices.

That’s why I always use Docker in my projects, and that’s also why I will probably add Terraform to my basic toolset.

I hope this post helped you understand this tool — Terraform — including its objectives, basic functionality, and practical benefits. As always, I’m not an expert in any of the subjects addressed in this post, and I strongly recommend further reading; see some references below.

Thank you for reading! ;)

References

All the code is available in this GitHub repository.
Data used — ENEM PDFs, [CC BY-ND 3.0], MEC-Brazilian Gov.
All the images are created by the Author, unless otherwise specified.

[1] Add trigger to AWS Lambda functions via Terraform. Stack Overflow. Link.
[2] AWSLambdaBasicExecutionRole — AWS Managed Policy. Link.
[3] Brikman, Y. (2022, October 11). Terraform tips & tricks: loops, if-statements, and gotchas. Medium.
[4] Create Resource Dependencies | Terraform | HashiCorp Developer. Link.
[5] TechWorld with Nana. (2020, July 4). Terraform explained in 15 mins | Terraform Tutorial for Beginners [Video]. YouTube.
[6] Terraform Registry. AWS provider. Link.
