Data Science Better Practices, Part 2 — Work Together

You can’t just throw more data scientists at this model and expect the accuracy to magically increase.

Shachaf Poran
Towards Data Science


Photo by Joseph Ruwa: https://www.pexels.com/photo/set-of-chess-pieces-in-daylight-4038397/

(Part 1 is here)

Not all data science projects were created equal.

The vast majority of data science projects I’ve seen and built were born as a throw-away proof-of-concept. Temporary one-off hacks to make something tangentially important work.

Some of these projects might end up becoming something else, perhaps a bit bigger or more central in helping the organization’s goal.

Only a select few get to grow and mature over long periods of time.

These special projects are usually those that solve a problem of special interest to the organization. For example, a CTR predictor for an online advertising network, or an image segmentation model for a visual effects generator, or a profanity detector for a content filtering service.

These are also the ones that will see considerable company resources spent on optimizing them, and rightly so. When even a minor improvement in some accuracy metric can be directly responsible for higher revenue, or can make or break product launches and funding rounds, the organization should spare no expense.

The resource we’re talking about in this post is Data Scientists.

If you’ve never managed a project, a team, or a company, it might sound strange to treat people as a “resource”. However, keep in mind that these are experts with limited time to offer, and we use that time to accomplish tasks that benefit the organization.

Now take note: resources have to be managed, and their use should be optimized.

Once a model becomes so big and central that more than a couple of Data Scientists work on improving it, it’s crucial to make sure that they can work on it without stepping on each other’s toes, blocking each other, or otherwise impeding each other’s work. Rather, team members should be able to help each other easily, and build on each other’s successes.

The common practice I witnessed in various places is that each member in the team tries their own “thing”. Depending on the peculiarities of the project, that may mean different models, optimization algorithms, deep learning architectures, engineered features, and so on.

This mode of work may seem orthogonal between members: each of them can work separately, and no dependencies are created that might impede or block anyone’s progress.

However, that’s not entirely the case, as I’ve ranted before.

For example, if a team member strikes gold with a particularly lucrative feature, other members might want to try and use the same feature in their models.

At some point in time a specific model might show a leap in performance, and quite quickly we’ll have branched versions of that best model, each slightly different from the next. This is because optimization processes tend to search for better optima in the vicinity of the current optimum, not only with gradient descent but also with human invention.

This scenario will probably lead to much higher coupling and more dependencies than previously anticipated.

Even if we do make sure that not all Data Scientists converge this way, we should still try to standardize their work, perhaps enforcing a contract with downstream consumers to ease deployment as well as to save Machine Learning Engineers time.

The premise

We would like the Data Scientists to work on the same problem in a way that allows independence on the one hand, and reuse of each other’s work on the other.

For the sake of example, we’ll assume we are members of a team working on the Iris flower data set. This means the training data will be small enough to hold in a pandas DataFrame in memory, though the tools we come up with may be applied to data of any type and size.

We would like to allow creative freedom, which means that each member is at full liberty to choose their modeling framework — be it scikit-learn, Keras, Python-only logic, etc.

Our main tools will be abstraction of the process using OOP principles, and normalization of individuals’ work into a unified language.

Disclaimer

In this post, I am going to exemplify how one could abstract the Data Science process to facilitate teamwork. The main point is not the specific abstraction we’ll come up with. The main point is that data science managers and leaders should strive to facilitate data scientists’ work, be it by abstraction, protocols, version control, process streamlining, or any other method.

This blog post is in no way promoting reinventing the wheel. The choice between an off-the-shelf product, open-source tools, or an in-house solution should be made together with the data science and machine learning engineering teams relevant to the project.

Now that this is out of the way, let’s cut to the chase.

Start from the end

When we’re done, we’d like to have a unified framework to take our model through the entire pipeline from training to prediction. So, we start by defining the common pipeline:

  1. First we get training data as input.
  2. We might want to extract additional features to enhance the dataset.
  3. We create a model and train it repeatedly until we’re satisfied with its loss or metrics.
  4. We then save the model to disk or any other persisting mechanism.
  5. We later need to load the model back into memory.
  6. Then we can run predictions on new, unseen data.

Let’s declare a basic structure (aka interface) for a model according to the above pipeline:

class Model:
    def add_features(self, x):
        ...

    def train(self, x, y, train_parameters=None):
        ...

    def save(self, model_dir_path):
        ...

    @classmethod
    def load(cls, model_dir_path):
        ...

    def predict(self, x):
        ...

Note that this is not much more than the interfaces we’re used to from existing frameworks. However, each framework has its own little quirks, for example in naming (“fit” vs. “train”) or in the way it persists models on disk. Encapsulating the pipeline within a uniform structure saves us from having to handle these implementation details elsewhere, for example when using the different models in a deployment environment.

Now, once we’ve defined our basic structure, let’s discuss how we’d expect to actually use it.

System design

Features

We’d like to have “features” as elements that can be easily passed around and added to different models. We should also acknowledge that there may be multiple features used for each model.

We’ll try to implement a sort of plugin infrastructure for our Feature class. We’ll have a base class for all features and then we can have the Model class materialize the different features sequentially in memory when it gets the input data.

Encapsulated models

We’d also like the actual models that we encapsulate in our system to be transferable between team members. However, we would like to keep the option to change model parameters without writing a lot of new code.

We’ll abstract them in a different class and name it ModelInterface to avoid confusion with our Model class. The latter will in turn delegate the relevant method invocations to the former.

Features

Our features can be regarded as functions with a pandas dataframe as an input.

If we give each feature a unique name and encapsulate it with the same interface as the others, we can allow the reuse of these features quite easily.

Let’s define a base class:

from abc import ABC, abstractmethod


class Feature(ABC):
    @abstractmethod
    def add_feature(self, data):
        ...

And let’s create an implementation, for example sepal diagonal length:

class SepalDiagonalFeature(Feature):
    def add_feature(self, data):
        data['SepalDiagonal'] = (data.SepalLength ** 2 +
                                 data.SepalWidth ** 2) ** 0.5

We will use an instance of this class, and so I create a separate file where I store all features:

sepal_diagonal = SepalDiagonalFeature()

This specific implementation already presents a few decisions we made, whether conscious or not:

  • The name of the output column is a literal within the function code, and is not saved elsewhere. This means that we can’t easily construct a list of known columns.
  • We chose to add the new column to the input dataframe within the add_feature function rather than return the column itself and then add it in an outer scope.
  • We do not know, other than by reading the function code, which columns this feature depends on. If we did, we could have constructed a DAG to decide on feature creation order.

At this point these decisions are easily reversible. Later, however, when we have dozens of features built this way, we may have to refactor all of them to apply a change to the base class. This is to say that we should decide in advance what we expect from our system, and be aware of the implications of each choice.
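To make the last two bullets concrete, here is a hedged sketch of an alternative, hypothetical base class that declares its input and output columns up front; with such declarations we could list all known columns and order feature creation with a dependency DAG. None of this exists in the post’s code, it is purely illustrative:

from abc import ABC, abstractmethod


class DeclaredFeature(ABC):
    """Hypothetical Feature variant that declares the columns it reads and writes."""

    input_columns: tuple = ()   # columns this feature depends on
    output_column: str = None   # column this feature creates

    @abstractmethod
    def add_feature(self, data):
        ...


class SepalDiagonalDeclaredFeature(DeclaredFeature):
    input_columns = ('SepalLength', 'SepalWidth')
    output_column = 'SepalDiagonal'

    def add_feature(self, data):
        data[self.output_column] = (data.SepalLength ** 2 +
                                    data.SepalWidth ** 2) ** 0.5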

Let’s expand on our Model base class by implementing the add_features function:

    def __init__(self, features: Sequence[Feature] = tuple()):
        self.features = features

    def add_features(self, x):
        for feature in self.features:
            feature.add_feature(x)

Now anyone can take the sepal_diagonal feature and use it when creating a model instance.
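For example, a minimal sketch of how this might look with the Model class as defined so far (the toy DataFrame and its values are my own; only the column names follow the Iris naming used above):

import pandas as pd

# A tiny illustrative DataFrame with the Iris-style columns used in this post.
df = pd.DataFrame({
    'SepalLength': [5.1, 4.9],
    'SepalWidth': [3.5, 3.0],
})

model = Model(features=[sepal_diagonal])
model.add_features(df)  # df now also contains a 'SepalDiagonal' column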

If we didn’t facilitate reusing these features with our abstraction, Alice might choose to copy Bob’s logic and change it around a bit to fit with her preprocessing, applying different naming on the way, and generally inflating technical debt.

A question that may arise is: “What about common operations, like addition? Do we need to implement a new feature class each time we want to use one?”

The answer is no. For this we may use the instance fields through the self parameter:

from dataclasses import dataclass


@dataclass
class AdditionFeature(Feature):
    col_a: str
    col_b: str
    output_col: str

    def add_feature(self, data):
        data[self.output_col] = data[self.col_a] + data[self.col_b]

So if, for example, we want to add petal length and petal width, we’ll create an instance with petal_sum = AdditionFeature('PetalLength', 'PetalWidth', 'PetalSum').

For each operator/function you might have to implement a class, which may seem intimidating at first, but you will quickly find that the list is quite short.
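To give a sense of how short such a class tends to be, here is one more sketch, a hypothetical RatioFeature that is not part of the post’s feature set but follows the exact same pattern:

from dataclasses import dataclass


@dataclass
class RatioFeature(Feature):
    numerator_col: str
    denominator_col: str
    output_col: str

    def add_feature(self, data):
        data[self.output_col] = data[self.numerator_col] / data[self.denominator_col]

For example, sepal_ratio = RatioFeature('SepalLength', 'SepalWidth', 'SepalRatio') would add a sepal length-to-width ratio column.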

Model interface

Here is the abstraction I use for model interfaces:

from abc import ABC, abstractmethod
from pathlib import Path


class ModelInterface(ABC):
    @abstractmethod
    def initialize(self, model_parameters: dict):
        ...

    @abstractmethod
    def train(self, x, y, train_parameters: dict):
        ...

    @abstractmethod
    def predict(self, x):
        ...

    @abstractmethod
    def save(self, model_interface_dir_path: Path):
        ...

    @classmethod
    def load(cls, model_interface_dir_path: Path):
        ...

And here’s an example implementation that wraps a scikit-learn model:

from pathlib import Path

from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.preprocessing import LabelBinarizer


class SKLRFModelInterface(ModelInterface):
    def __init__(self):
        self.model = None
        self.binarizer = None

    def initialize(self, model_parameters: dict):
        forest = RandomForestClassifier(**model_parameters)
        self.model = MultiOutputClassifier(forest, n_jobs=2)

    def train(self, x, y, train_parameters: dict = None):
        # Translate the target into a one-hot encoded vector before fitting.
        self.binarizer = LabelBinarizer()
        y = self.binarizer.fit_transform(y)
        return self.model.fit(x, y)

    def predict(self, x):
        # Translate predictions back from one-hot vectors to the original labels.
        return self.binarizer.inverse_transform(self.model.predict(x))

    def save(self, model_interface_dir_path: Path):
        ...

    def load(self, model_interface_dir_path: Path):
        ...

As you can see, the code is mainly about delegating the different actions to the ready-made model. In train and predict we also translate the target back and forth between an enumerated value and a one-hot encoded vector, effectively translating between our business need and scikit-learn’s interface.

We can now update our Model class to accommodate a ModelInterface instance. Here it is in full:

from pathlib import Path
from typing import Sequence


class Model:
    def __init__(self, features: Sequence[Feature] = tuple(), model_interface: ModelInterface = None,
                 model_parameters: dict = None):
        model_parameters = model_parameters or {}

        self.features = features
        self.model_interface = model_interface
        self.model_parameters = model_parameters

        model_interface.initialize(model_parameters)

    def add_features(self, x):
        for feature in self.features:
            feature.add_feature(x)

    def train(self, x, y, train_parameters=None):
        train_parameters = train_parameters or {}
        self.add_features(x)
        self.model_interface.train(x, y, train_parameters)

    def predict(self, x):
        self.add_features(x)
        return self.model_interface.predict(x)

    def save(self, model_dir_path: Path):
        ...

    @classmethod
    def load(cls, model_dir_path: Path):
        ...

Once again, I create a file to curate my models and have this line in it:

best_model_so_far = Model([sepal_diagonal], SKLRFModelInterface(), {})

This best_model_so_far is a reusable instance; note, however, that it is not trained. To have a reusable trained model instance, we’ll need to persist the model.
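For illustration, here is a minimal sketch of exercising the whole pipeline end to end. Loading Iris through scikit-learn and the exact column names are my assumptions and are not part of the curated models file:

import pandas as pd
from sklearn.datasets import load_iris

# Build a DataFrame with the column naming used throughout this post.
iris = load_iris()
x = pd.DataFrame(iris.data,
                 columns=['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth'])
y = pd.Series(iris.target)

best_model_so_far.train(x, y)          # adds features, then delegates to the wrapped model
predictions = best_model_so_far.predict(x.head())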

Save and load

I choose to omit the specifics of save and load from this post, as it is getting wordy, but feel free to check out my clean data science GitHub repository for a fully operational example.
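For completeness, one possible (hypothetical) approach, not necessarily the one used in the repository, is to let each ModelInterface persist its own artifacts; for the scikit-learn wrapper, joblib is a common choice:

import joblib
from pathlib import Path


class SKLRFModelInterface(ModelInterface):
    # ... __init__, initialize, train and predict as shown above ...

    def save(self, model_interface_dir_path: Path):
        model_interface_dir_path.mkdir(parents=True, exist_ok=True)
        joblib.dump(self.model, model_interface_dir_path / 'model.joblib')
        joblib.dump(self.binarizer, model_interface_dir_path / 'binarizer.joblib')

    @classmethod
    def load(cls, model_interface_dir_path: Path):
        instance = cls()
        instance.model = joblib.load(model_interface_dir_path / 'model.joblib')
        instance.binarizer = joblib.load(model_interface_dir_path / 'binarizer.joblib')
        return instance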

Summary

The framework proposed in this post is definitely not a one-size-fits-all solution to the problem of standardizing a Data Science team’s work on a single model, nor should it be treated as one. Each project has its own nuances and niches that should be addressed.

Rather, the framework proposed here should merely be used as a basis for further discussion, putting the subject of facilitating Data Scientist work in the spotlight.

Streamlining the work should be a goal set by Data Science team leaders and managers in general, and abstractions are just one item in the toolbox.

FAQ

Q: Shouldn’t you use a Protocol instead of ABC if all you need is a specific functionality from your subclasses?
A: I could, but this is not an advanced Python class. There’s a Hebrew saying “The pedant can’t teach”. So, there you go.

Q: What about dropping features? That’s important too!
A: Definitely. And you may choose where to drop them! You may use a parameterized Feature implementation to drop columns or have it done in the ModelInterface class, for example.
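For example, a quick sketch of what a parameterized column-dropping feature might look like (a hypothetical class, not part of the post’s code):

from dataclasses import dataclass
from typing import Sequence


@dataclass
class DropColumnsFeature(Feature):
    columns: Sequence[str]

    def add_feature(self, data):
        # Drop the given columns in place, mirroring how other features mutate the DataFrame.
        data.drop(columns=list(self.columns), inplace=True)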

Q: What about measuring the models against each other?
A: It will be awesome to have some higher-level mechanism to track model metrics. That’s out of scope for this post.

Q: How do I keep track of trained models?
A: This could be a list of paths where you saved the trained models. Make sure to give them meaningful names.

Q: Shouldn’t we also abstract the dataset creation (before we pass it to the train function)?
A: I was going to get around to it, but then I took an arrow in the knee. But yeah, it’s a swell idea to have different samples of the full dataset, or just multiple datasets that we can pass around like we do with features and model interfaces.

Q: Aren’t we making it hard on data scientists?
A: We should weigh the pros and cons on this matter. Though it takes some time to get used to the restrictive nature of this abstraction, it may save loads of time down the line.
