Working as a data scientist, I had only heard of containerization until I ended up on a project that required shipping model training and deployment as containers on one of the cloud platforms. That’s how I came across Docker, and it has been an integral part of my toolbox ever since.

I’ve mentioned some excellent resources for getting started with Docker below.

In this series, I’ll cover Docker from a data science and data engineering perspective and show why it is such a handy asset in your data science or software development toolkit.

Introduction

Before we learn what Docker is, there’s an important question to answer: why do we need Docker?

Docker lets us create a completely reproducible environment. We specify the libraries we need along with their exact versions, and Docker gives us an environment that runs the same way regardless of the system it runs on. This can save a lot of time in projects with multiple environments, developers, testers, and so on.
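
As a minimal illustration (the python:3.10-slim tag below is just an example), pinning an exact image means that everyone who runs the following command gets the same Python build, no matter what is installed on their machine:

# --rm removes the container again once the command finishes
docker run --rm python:3.10-slim python --version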

What are containers?

I think the official Docker documentation does a very good job of explaining what containers are:

A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another. A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings. Container images become containers at runtime and in the case of Docker containers - images become containers when they run on Docker Engine. Available for both Linux and Windows-based applications, containerized software will always run the same, regardless of the infrastructure. Containers isolate software from its environment and ensure that it works uniformly despite differences for instance between development and staging.

How can Docker help with data science?

Let’s talk about an end-to-end data science solution. A data science project often involves a whole team of data scientists, data engineers, and software architects, frequently working alongside other software development teams to create a viable solution. You can end up in a situation where different data scientists are working with different versions of a library, only to realize after hours of debugging that the real problem is small differences in their environments. Docker lets you give every data scientist and data engineer the same, consistent environment and avoid these situations altogether.

Okay, so we now know how Docker helps maintain a consistent environment, but what if you work alone? Should you invest time in learning and using Docker for small-scale projects? The answer is yes, and I’ll explain why.

  1. Ease of model building - Dockerizing your data science project can help set things up faster. I’ll demonstrate this with an example below.
  2. Deployment - You’ve created a state-of-the-art model with data science libraries and now you wish to turn it into a solution. Deploying a data science solution is much easier with Docker in place. I’ll demonstrate an example of this as well in later posts in this series.

Let’s look at some examples

Let’s use Docker to set up a data science environment for ourselves. We’ll use an image that gives us a Jupyter notebook with Python, R, and Julia in just a few commands. The image is developed and maintained by the Jupyter team and can be found here. Run the following commands from a shell on your system:

docker pull jupyter/datascience-notebook:latest
docker run -p 8888:8888 jupyter/datascience-notebook:latest

Navigate to localhost:8888 in your browser, copy the auth token from the shell output, and there you have it: a data science environment ready to run R, Python, and Julia code.
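
If you want your notebooks to outlive the container, you can also mount a local folder into it. Here is a minimal sketch; the /home/jovyan/work path is the convention used by the Jupyter Docker stacks, so adjust it if your image differs:

# Mount the current directory so notebooks are saved on the host rather than inside the container
docker run -p 8888:8888 -v "$PWD":/home/jovyan/work jupyter/datascience-notebook:latest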

Let’s look at a slightly more technical use case. We’ll implement a server with FastAPI. For more details, please refer to the FastAPI documentation. Let’s see how easy it is to set up a FastAPI server on your local machine. Run the following commands from a shell in a new folder:

python -m venv venv
source venv/bin/activate
pip install fastapi
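
Optionally, run a quick sanity check that FastAPI is available inside the virtual environment:

# Should print the installed FastAPI version
python -c "import fastapi; print(fastapi.__version__)"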

Now, create a Dockerfile with the following code

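# Base image that bundles Python 3.7 with FastAPI, Uvicorn and Gunicorn preinstalled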
FROM tiangolo/uvicorn-gunicorn-fastapi:python3.7

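# Copy our local app folder (containing main.py) into the image, where the base image expects to find it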
COPY ./app /app

Now, create a folder named app and, inside it, a script named main.py. Add the following code to main.py:

from fastapi import FastAPI

app = FastAPI()


@app.get("/")
def read_root():
    return {"Hello": "World"}


@app.get("/items/{item_id}")
def read_item(item_id: int, q: str = None):
    return {"item_id": item_id, "q": q}

Now, go to your shell and run the following from the project root

docker build -t myimage .
docker run -d --name mycontainer -p 80:80 myimage
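
Since the container runs detached (-d), you can confirm that it actually started and inspect its logs with the standard Docker CLI commands below:

# List the running container and check its logs
docker ps --filter name=mycontainer
docker logs mycontainer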

Navigate to http://127.0.0.1/docs to see the Swagger UI docs for the API.

Navigate to http://127.0.0.1/redoc for the ReDoc-based docs.

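You can also hit the two endpoints directly from the shell, assuming the container from the previous step is still running on port 80:

# Should return {"Hello":"World"}
curl http://127.0.0.1/

# Should return {"item_id":42,"q":"somequery"}
curl "http://127.0.0.1/items/42?q=somequery"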

In a later post, I’ll demonstrate how to serve your machine learning model with FastAPI, but for now, we have a FastAPI server running in a Docker container, ready to be shipped anywhere.

In this post, I demonstrated how Docker can be used in data science projects. In my next post, I’ll go into more detail on Docker terminology and the Docker commands that can speed up your development process. Thanks for reading!

This post is also available on DEV.