Build your data pipelines like the Toyota Way

If there is only one book to read about lean manufacturing, this is the one. It is the kind of book you can read again and again and still learn something new about your current context.

It is also a book you can read whatever your industry: you will always find situations it covers.

Today, we are going to apply these principles to data pipelines.

“The right process will produce the right results” – The Toyota Way (Section II)

Of the 14 Toyota Way principles, 7 are devoted to the process. This is where you find all the main tools to improve manufacturing processes. The idea is to transpose these 7 principles to data pipelines, knowing that:

  • Data pipelines are 100% flexible: if you have the skills, you build the pipeline you want.
  • Data pipelines are virtual: no people are involved, it is a computer workload.
  • Data pipelines are “unlimited”: with the right system, you can run thousands of data pipelines.

So it looks like quite an easy activity, where a good design on the right system should be enough. In fact it is not: every day, tons of data pipelines break around the world. A tech team can spend 40% of its time just fixing them. It can be horribly expensive too, with software license costs and a very large infrastructure. And of course, the data is often late or down (the famous “data downtime” promoted by Barr Moses).

What does a bad data pipeline process look like?

Before going deeper into the 7 principles, I want to give some real-life examples to illustrate what a very bad (or poorly optimized) data pipeline process looks like.

  1. We only have batch processes: we receive all the files at the same time and we have to process them in a very short window of time. We are mainly focused on these files and less on the end result. We do not have time to check the dashboards or reports produced from this data.
  2. The other applications send us files, but we never know if the shipment is complete. If it is late, we can’t wait: we start our “batch” anyway.
  3. We have upgraded our main servers many times, but capacity often falls short, especially when we have to do catch-ups.
  4. We have 2 teams: one builds the pipelines and the other maintains them. When something breaks, the team that builds the pipelines is too busy with their projects and does not want to fix it.
  5. Because we have different squads (see my previous article on why it is not a good idea), we have many ways of building the same data pipelines!
  6. We tried the latest open source project because it was easier, but we did not check the support.
  7. We do not really know what is happening (see again data observability by Barr Moses).

Let’s do it like the Toyota Way

#1 “Create continuous process flow to bring problems to the surface”

Capturing the data continuously creates a flow, and it is much harder because you will have to anticipate all the problems. And if an unexpected problem shows up, you will have to be quick to fix it! It is also much more customer oriented: the customer no longer needs to wait every day for his report, it can be refreshed all the time. The book is very focused on reducing waste (muda), and this transposes well to data pipelines: no need to store files before they are processed, and no need for a powerful server to manage all the files to be ingested at once.
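
To make the contrast concrete, here is a minimal sketch of one-piece flow applied to events. `poll_events` and the print inside `process` are hypothetical stand-ins for your real broker and load step:

```python
import time

def poll_events():
    """Hypothetical event source standing in for a real message broker
    (e.g. a Kafka topic); yields events as they arrive."""
    for order_id in range(3):
        time.sleep(0.1)                       # simulate arrival over time
        yield {"order_id": order_id, "amount": 10.0 * order_id}

def process(event):
    """Validate and load one event: a bad record surfaces immediately,
    on its own, instead of hiding inside a huge nightly batch."""
    if event.get("amount") is None:
        raise ValueError(f"bad event: {event}")
    print(f"loaded {event}")                  # stand-in for the real load step

for event in poll_events():                   # one-piece flow, no file pile-up
    process(event)
```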

#2 “Use ‘pull’ systems to avoid overproduction”

I guess that if you ask any BI manager, he will tell you “it’s better if you can push me the data”, because then he has nothing to do and the responsibility sits on the source side. He is right, but if you want to control your destiny, pull is the better philosophy. You know when you are ready to do it, you can put checks on the sequence because you know where you stand, and in the end, you are always the one responsible if you did not deliver.
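
Here is a minimal sketch of the pull philosophy, assuming the source publishes a manifest of what it shipped. The in-memory `REMOTE` dict is a hypothetical stand-in for an SFTP server or object store:

```python
# Minimal pull-based ingestion sketch: the consumer decides when to
# fetch and refuses to start on an incomplete shipment (bad example #2).

REMOTE = {
    "manifest": ["orders.csv", "customers.csv"],        # what the source says it sent
    "files": {"orders.csv": b"...", "customers.csv": b"..."},
}

def ingest_when_ready(remote: dict) -> list[str]:
    expected = set(remote["manifest"])
    available = set(remote["files"])
    missing = expected - available
    if missing:
        # We pull, so we decide: wait and retry later rather than
        # starting the batch on an incomplete shipment.
        raise RuntimeError(f"shipment incomplete, missing: {sorted(missing)}")
    return sorted(expected)                             # safe to fetch, at our own pace

print(ingest_when_ready(REMOTE))
```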

#3 “Level out the workload”

This is the consequence of #1. If you have a continuous flow, you can also aggregate the data continuously and calculate your KPIs. This is a very common functionality in real-time processing. Doing that, you avoid any overload and make sure everything runs smoothly.
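
As an illustration, a tumbling-window aggregation spreads the KPI computation over time instead of one nightly spike. This is a toy in-memory sketch, not a real streaming engine:

```python
from collections import defaultdict
from datetime import datetime

def tumbling_window_kpi(events, window_minutes=5):
    """Aggregate a stream into fixed time windows, so the KPI is
    built up a little at a time rather than in one big batch."""
    totals = defaultdict(float)
    for ts, amount in events:
        bucket_minute = (ts.minute // window_minutes) * window_minutes
        bucket = ts.replace(minute=bucket_minute, second=0, microsecond=0)
        totals[bucket] += amount
    return dict(totals)

events = [(datetime(2024, 1, 1, 9, m), 10.0) for m in range(0, 12, 3)]
print(tumbling_window_kpi(events))   # two 5-minute buckets, workload leveled
```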

#4 “Build a culture of stopping to fix problems, to get quality right the first time”

It could have been a DevOps principle too: “you build it, you run it”. But clearly, you should stop any developer when there is a bug. Organisations with projects on one side and run on the other are doomed: you will never get the quality. Stop everything, fix the bug, and I am sure the devs will get better.
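
Translated to code, “stopping the line” can be as simple as a quality gate that halts the whole run instead of letting bad data flow downstream. The rules below are purely illustrative:

```python
class StopTheLine(Exception):
    """Raised to halt the whole pipeline: bad data must never flow downstream."""

def quality_gate(rows: list[dict]) -> None:
    # Illustrative rules; real ones would come from your data contracts.
    for i, row in enumerate(rows):
        if row.get("amount") is None or row["amount"] < 0:
            raise StopTheLine(f"row {i} failed the quality check: {row}")

def run_pipeline(rows: list[dict]) -> None:
    quality_gate(rows)                        # pull the andon cord here
    print(f"loading {len(rows)} clean rows")  # load only runs on clean data

run_pipeline([{"amount": 10.0}, {"amount": 5.5}])
```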

#5 “Standardized tasks are the foundation for continuous improvement and employee empowerment”

Again, this could be a DevOps principle! It is not very far from continuous integration / continuous deployment. How can you standardize so that each pipeline has the same structure, generates the same metadata, and exposes the same observability? I agree about the employee empowerment too: spend more time on the business logic than on developing these data pipelines.
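
One way to get there is a shared skeleton that every pipeline inherits, so structure, metadata and observability come for free. A minimal sketch, with `StandardPipeline` being my own hypothetical naming:

```python
import json
import time
from abc import ABC, abstractmethod

class StandardPipeline(ABC):
    """One shared skeleton: every pipeline gets the same structure,
    metadata and observability, so teams spend their time on the
    business logic inside transform()."""

    name: str = "unnamed"

    @abstractmethod
    def transform(self, records: list[dict]) -> list[dict]:
        ...

    def run(self, records: list[dict]) -> list[dict]:
        start = time.time()
        out = self.transform(records)
        # Identical metadata emitted by every pipeline, ready for a log store.
        print(json.dumps({"pipeline": self.name,
                          "rows_in": len(records),
                          "rows_out": len(out),
                          "seconds": round(time.time() - start, 3)}))
        return out

class OrdersPipeline(StandardPipeline):
    name = "orders"
    def transform(self, records):
        return [r for r in records if r.get("amount", 0) > 0]

OrdersPipeline().run([{"amount": 10}, {"amount": -1}])
```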

#6 “Use visual control so no problems are hidden”

This is maybe the most important point about your data pipelines: what kind of visibility do you have into what is happening? When I hear that it is only about reporting, I think that is a very narrow way to look at this topic. We can do much better than a red or green light. Process mining is a source of inspiration for visualizing your data pipelines.
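
For instance, each pipeline step can emit the case id / activity / timestamp triplet that process-mining tools expect. A minimal sketch:

```python
import csv
import io
from datetime import datetime, timezone

def log_step(event_log: list, run_id: str, activity: str) -> None:
    """Record case id / activity / timestamp: the classic event-log
    shape that process-mining tools consume."""
    event_log.append({"case_id": run_id,
                      "activity": activity,
                      "timestamp": datetime.now(timezone.utc).isoformat()})

event_log: list = []
for step in ("extract", "validate", "transform", "load"):
    log_step(event_log, run_id="run-2024-01-01", activity=step)

# Dump as CSV so the run can be replayed later in a process-mining tool.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["case_id", "activity", "timestamp"])
writer.writeheader()
writer.writerows(event_log)
print(buf.getvalue())
```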

Conclusion

I am still amazed that these principles dedicated to manufacturing work so well for data pipelines. You can also see the similarities with DevOps principles. The zoom here was on “processes”, but there are 3 other interesting sections too:

  • Having a long-term philosophy,
  • Adding value to the organization by developing people,
  • And solving root problems to drive organizational learning.

This book is a gem because all of its principles can fit your data teams. You should not see data pipelines as a purely technical subject; see them more broadly, as a system that provides tools for people to continually improve their work!
