Developing Production Level Databricks Pipelines.

A question that comes up often … “How do I develop Production Level Databricks Pipelines?” Or maybe someone just has a nagging feeling that living in Notebooks all day long is expensive and an unreliable way to produce Databricks Spark + Delta Lake pipelines that run well … without error.

It isn’t really that hard and revolves around a few core ideas.

  • You must have a good Development Lifecycle
    • Local Development
    • Local Testing
    • Deploy to the Development Environment (see the environment-parameterized entrypoint sketch after this list)
      • CI/CD
      • Testing
    • Deploy to the Production Environment
      • CI/CD
  • You need to use Docker and Docker-compose
    • With Spark and Delta Lake installed + whatever else.
    • Run code locally and unit test locally (see the unit-test sketch after this list).
  • You need to invest in CI/CD, automated testing, and automated deployments
    • Nothing should be done manually; the entire process should be automated
    • Learn bash and tools like CircleCI or GitHub Actions
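
To make the “run code locally and unit test locally” idea concrete, here is a minimal sketch of a pytest unit test that spins up a local Spark session with Delta Lake configured, the kind of thing you would run inside the Docker container. It assumes pyspark, delta-spark, and pytest are installed in the image; the filter_active_orders transform and its columns are made up for illustration.

```python
# test_transforms.py
# A minimal local-testing sketch, assuming pyspark, delta-spark, and pytest
# are installed in the Docker image. The transform and columns are hypothetical.
import pytest
from delta import configure_spark_with_delta_pip
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


def filter_active_orders(df: DataFrame) -> DataFrame:
    """Example transform under test: drop cancelled orders."""
    return df.where(F.col("status") != "cancelled")


@pytest.fixture(scope="session")
def spark():
    # Local Spark session with the Delta Lake extensions wired in.
    builder = (
        SparkSession.builder.master("local[1]")
        .appName("unit-tests")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    session = configure_spark_with_delta_pip(builder).getOrCreate()
    yield session
    session.stop()


def test_filter_active_orders(spark):
    source = spark.createDataFrame(
        [("1", "shipped"), ("2", "cancelled")], ["order_id", "status"]
    )
    result = filter_active_orders(source)
    assert [row.order_id for row in result.collect()] == ["1"]
```

Because this only needs docker-compose up and pytest, the same command can run in CI without ever touching a Databricks cluster.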
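
For the Development and Production deploy steps, a common pattern is to keep one entrypoint and parameterize it by environment, so CI/CD ships the exact same code to each workspace. This is only an illustrative sketch; the --env flag, bucket paths, and deduplication logic are hypothetical, not from the video.

```python
# main.py
# A hypothetical environment-parameterized pipeline entrypoint; the paths
# and transform logic are illustrative only.
import argparse

from pyspark.sql import SparkSession

BASE_PATHS = {
    "dev": "s3://my-bucket/dev",
    "prod": "s3://my-bucket/prod",
}


def run(env: str) -> None:
    spark = SparkSession.builder.appName(f"orders-pipeline-{env}").getOrCreate()
    base = BASE_PATHS[env]
    # Read raw Delta data, apply the transform, write the cleaned Delta table.
    orders = spark.read.format("delta").load(f"{base}/orders_raw")
    cleaned = orders.dropDuplicates(["order_id"])
    cleaned.write.format("delta").mode("overwrite").save(f"{base}/orders_clean")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--env", choices=["dev", "prod"], default="dev")
    run(parser.parse_args().env)
```

Locally and in CI this runs against the dev paths; promoting to production is just the same artifact invoked with --env prod, which is what keeps manual steps out of the picture.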

Watch the video below for the full details.

4 replies
  1. Costas says:

    Hey great video!

    What’s your opinion on running notebook tests against production tables (using fewer rows) as part of the CI?

