Easily Validate User-Generated Data Using Pydantic

How to us Pydantic to validate Excel data

Charles Mendelson

Published in

Towards Data Science

6 min readFeb 5, 2023

Using Pydantic to validate Excel data

As a data engineer, I frequently encounter situations where I have built pipelines and other automations based on user-generated data from Excel. Excel’s flexibility allows it to be used by a wide variety of users, but unfortunately, that flexibility leads to invalid data entering the pipeline. Before I discovered Pydantic, I wrote incredibly complex Pandas functions to check and filter data to make sure it was valid.

What is Pydantic?

Pydantic is a Python library that lets you define a data model in a Pythonic way, and use that model to validate data inputs, mainly through using type hints.

Sample data:

Before we get going, let’s examine our sample data; a spreadsheet of RPG characters I created using random name generators:

Column headers in bold indicate required fields, and we can see that some of the required fields are missing. Also, some of our text fields, like Race, Class and Gender should be restricted to specific words instead of being free-text fields.

Starting our data model

Pydantic has a number of starting points for a data model, but ours is pretty simple so we are going to use pydantic.BaseMode l:

import pydantic

class RpgCharacterModel(pydantic.BaseModel):
    DATE: datetime
    NAME: str
    GENDER: str
    RACE: str
    CLASS: str
    HOME: str
    GUILD: str
    PAY: int

The syntax is pretty simple. After defining the class and inheriting from our base model, we enter each of our field names and provide a type hint.

While this is a good start, this model has some severe limitations. Right now it will accept any string for a number of fields that we would prefer to be restricted to certain values.

Enums to control string fields

Type hints in Pydantic are more powerful when working with other class types in Python. One of these powerful features is being able to limit string entries by defining Enums and passing the Enum as a type hint.

Now we define those choices as Enums:

import enum

class GenderEnum(enum.Enum):
    M = 'M'
    F = 'F'
    NB = 'NB'


class ClassEnum(enum.Enum):
    Druid = 'Druid'
    Fighter = 'Fighter'
    Warlock = 'Warlock'
    Ranger = 'Ranger'
    Bard = 'Bard'
    Sorcerer = 'Sorcerer'
    Paladin = 'Paladin'
    Rogue = 'Rogue'
    Wizard = 'Wizard'
    Monk = 'Monk'
    Barbarian = 'Barbarian'
    Cleric = 'Cleric'


class RaceEnum(enum.Enum):
    Human = 'Human'
    Dwarf = 'Dwarf'
    Halfling = 'Halfling'
    Elf = 'Elf'
    Dragonborn = 'Dragonborn'
    Tiefling = 'Tiefling'
    Half_Orc = 'Half-Orc'
    Gnome = 'Gnome'
    Half_Elf = 'Half-Elf'

Now, we can use these enums to be much more specific with our data model:

class RpgCharacterModel(pydantic.BaseModel):
    DATE: datetime
    NAME: str
    GENDER: GenderEnum
    RACE: RaceEnum
    CLASS: ClassEnum
    HOME: str
    GUILD: str
    PAY: int

Our data model is looking better, but there are two more things we need to do to really tighten it up.

Specify required fields
Specify a range for pay. Numbers are a common place for data entry mistakes

Pydantic offers an optional function that lets us define even more specificity in our data model.

Field() for even greater specificity.

When defining a field in our data model, we can call the function Field() to specify additional options, including whether or not a field is required, and setting up limitations on numeric inputs.

import pydantic

class RpgCharacterModel(pydantic.BaseModel):
    DATE: datetime
    NAME: str = pydantic.Field(...)
    GENDER: GenderEnum
    RACE: RaceEnum = pydantic.Field(...)
    CLASS: ClassEnum = pydantic.Field(...)
    HOME: str
    GUILD: str
    PAY: int = pydantic.Field(..., ge=1, le=500)

Passing ... as the first argument to Field indicates that it is required.

the ge keyword means Greater than or equal to
the le keyword means Less than or equal to

Using the model

Now we’ve done all the work to define the model, we need to use it.

Pydantic models expect to receive JSON-like data, so any data we pass to our model for validation must be a dictionary.

For our data validation, we need to do the following:

Receive a DataFrame as an input
Convert it to a list of dictionaries (one dictionary per row)
Run each row through data validation
Append successfully validated rows to one list
Append failed rows to another list, with the row number and the error message

import pandas as pd
import pydantic

def validate_df_data(df: pd.DataFrame, model: pydantic.BaseModel, index_offset: int = 2) -> tuple[list, list]:
    # Python index starts at 0, excel at 1, and 1 row for the header in Excel

    #capturing our good data and our bad data
    good_data = []
    bad_data = []
    df_rows = df.to_dict(orient='records')
    for index, row in enumerate(df_rows):
        try:
            model(**row)  #unpacks our dictionary into our keyword arguments
            good_data.append(row)  #appends valid data to a new list of dictionaries
        except pydantic.ValidationError as e:
            # Adds all validation error messages associated with the error
            # and adds them to the dictionary
            row['Errors'] = [error_message['msg'] for error_message in e.errors()]

            row['Error_row_num'] = index + index_offset
            bad_data.append(row)  #appends bad data to a different list of dictionaries

    return (good_data, bad_data)

Using this function, we can process the good rows of data, and return the bad rows of data for QA, modification, and resubmission.

Full code

# Standard Library imports
from datetime import datetime
import enum

# 3rd Party package imports
import pandas as pd
import pydantic

# Enums for limiting string data in our model

class GenderEnum(enum.Enum):
    M = 'M'
    F = 'F'
    NB = 'NB'


class ClassEnum(enum.Enum):
    Druid = 'Druid'
    Fighter = 'Fighter'
    Warlock = 'Warlock'
    Ranger = 'Ranger'
    Bard = 'Bard'
    Sorcerer = 'Sorcerer'
    Paladin = 'Paladin'
    Rogue = 'Rogue'
    Wizard = 'Wizard'
    Monk = 'Monk'
    Barbarian = 'Barbarian'
    Cleric = 'Cleric'


class RaceEnum(enum.Enum):
    Human = 'Human'
    Dwarf = 'Dwarf'
    Halfling = 'Halfling'
    Elf = 'Elf'
    Dragonborn = 'Dragonborn'
    Tiefling = 'Tiefling'
    Half_Orc = 'Half-Orc'
    Gnome = 'Gnome'
    Half_Elf = 'Half-Elf'

class RpgCharacterModel(pydantic.BaseModel):
    DATE: datetime
    NAME: str = pydantic.Field(...)
    GENDER: GenderEnum
    RACE: RaceEnum = pydantic.Field(...)
    CLASS: ClassEnum = pydantic.Field(...)
    HOME: str
    GUILD: str
    PAY: int = pydantic.Field(..., ge=1, le=500)

def validate_df_data(df: pd.DataFrame, model: pydantic.BaseModel, index_offset: int = 2) -> tuple[list, list]:
    # Python index starts at 0, excel at 1, and 1 row for the header in Excel

    #capturing our good data and our bad data
    good_data = []
    bad_data = []
    df_rows = df.to_dict(orient='records')
    for index, row in enumerate(df_rows):
        try:
            model(**row)  #unpacks our dictionary into our keyword arguments
            good_data.append(row)  #appends valid data to a new list of dictionaries
        except pydantic.ValidationError as e:
            # Adds all validation error messages associated with the error
            # and adds them to the dictionary
            row['Errors'] = [error_message['msg'] for error_message in e.errors()]
            # Python index starts at 0, excel at 1, and 1 row for the header in excel
            row['Error_row_num'] = index + index_offset
            bad_data.append(row)  #appends bad data to a different list of dictionaries

    return (good_data, bad_data)

df = pd.read_excel('sample_dnd_character_data.xlsx')
valid_data, invalid_data = validate_df_data(df, RpgCharacterModel, index_offset=2)

All code plus sample data can be found in my GitHub repo

Conclusion and next steps

Pydantic is exceptionally powerful, and while this was a simple example, it can handle complex nested models. This really allows a lot of granularity with data validation without writing a ton of code.

As an additional benefit, modeling the data really helps you understand it, rather than just taking whatever is thrown at you.

While this tutorial focused on Pandas, you can use Pydantic to validate most forms of data inputs with Python.

About

Charles Mendelson is a Seattle-based Data Engineer, with delusions of software engineering. In addition to working as a DE, he also assists teaching Python at the University of Washington’s School of Professional and Continuing Education. The best way to get in touch with him is via LinkedIn.

Originally published at https://charlesmendelson.com on February 5, 2023.