Zero-overhead scalable machine learning, Part 2

By Peter Zhokhov, Senior Data Scientist

A common obstacle in sharing machine learning research results with industry data scientists is the reproducibility of experiments. The research community often has its own dedicated machines and clusters, whereas industry often relies on cloud compute providers such as AWS EC2 or Google Compute Engine. As a result, even though GitHub provides a turnkey solution for code sharing, actually re-running the experiments is preceded by manual, often rather laborious and error-prone setup of the environment. This process has intrinsically nothing to do with data science or machine learning research and is an artifact of varying compute environments; but it can and should be automated.


To illustrate the statements above, let us try to reproduce the recently released Facebook paper on Fader Networks (presented at NIPS 2017). The code is commendably well open-sourced – there is a very clear description of dependencies (the only missing dependency that I have found is matplotlib, needed for generating images with interpolated attributes), set-up instructions, running examples, etc. Is it easy to reproduce?


It depends. Do you have a computer with a GPU? Does your computer have 150 Gb of RAM? What if you have a computer with a GPU, but you'd like to use a more powerful one and / or run multiple experiments in parallel? My laptop got disqualified in the first round of system requirements; what do I do next? Fortunately, tools like Amazon SageMaker and Studio.ML are there to help us provision hardware and run machine learning experiments at scale. Let us try training a fader network using both of them.


Training data for the fader network (the CelebA dataset) is available as a zip file with images and a text file with attributes for each image. The dataset size is 1.3 Gb; however, if we were to save the pre-processed data (resized and converted into a 202560 x 3 x 256 x 256 tensor), it would take ~150 Gb even at single precision. Even if we reduce the image size to 128 x 128 (I had to do that in order to fit the data into the memory of p2.xlarge EC2 instances and not use p2.8xlarge instances that are an order of magnitude more expensive), it is still 37 Gb, which is fairly hefty to move around. The obvious solution is to do the pre-processing on the fly – fortunately, it does not take much time in comparison to the actual training.


Here’s how we can do it. Note that the full version of the code compatible with SageMaker or Studio.ML is available in my fork of the fader networks repo. Briefly, the outline is as follows:

  1. Move the preprocessing script into the root directory (this is not necessary, but eliminates the need to deal with python packaging schemes)
  2. Add code that unzips the images from a location specified via an environment variable (we’ll use that to point to the data in the SageMaker Dockerfile) or from a studioml artifact – this is simpler than it sounds.
  3. Include an “import preprocess” statement in “src/” so that preprocessing always happens before the data is loaded
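Step 2 can be sketched roughly as follows (the artifact name "data" matches the studio examples later in this post; the helper name and fallback default are mine, not the fork's exact code):

```python
import os

def resolve_data_location(default=""):
    """Find the dataset zip: prefer a studioml 'data' artifact when running
    under studio, otherwise fall back to the IMG_ZIP_PATH environment
    variable set in the SageMaker Dockerfile."""
    try:
        from studio import fs_tracker  # only usable under studioml runs
        path = fs_tracker.get_artifact("data")
        if path:
            return path
    except Exception:
        pass  # studio not installed or not running under studio
    return os.environ.get("IMG_ZIP_PATH", default)
```

Either way, the training code itself never needs to know whether it is running locally, in a SageMaker container, or under studio.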


The training process consists of two stages: 1) training of the evaluation classifier (optional) and 2) the actual training. The evaluation classifier is needed to track training progress – during training the discriminator loss is not informative, because the discriminator is being trained as well. In the current implementation the classifier has an architecture similar to the encoder, but with a dense layer connecting the latent-space representation to the class labels. As such, it is indicative of how well the fader network can be trained to remove the attributes – i.e. if the classifier does not train (which happened in practice when we tried “Male-Female” attribute removal for men’s and women’s trainers), the adversarial component of the fader network won’t work either.


0. Prerequisites

  1. In order to run the fader network locally (even just the interpolation stage), you’ll need pyTorch installed. Please follow the installation instructions for your platform
  2. Clone / fork my fork of the fadernetworks repo (there are some changes to the original code to make it play nicer with SageMaker and Studio.ML)
  3. Install the dependencies (I added a requirements.txt file with the dependencies that worked for me; install with pip install -r requirements.txt)
  4. To use SageMaker, we’ll need docker installed (community edition is fine)
  5. To use SageMaker, it is handy to configure default AWS region and credentials via running `aws configure`
  6. To use Studio.ML we’ll need – you guessed it – Studio.ML installed (pip install studioml)


1.1. Training the classifier in SageMaker

First, we’ll need to move the data into a bucket on s3. I have modified the data-reading code to take a single zip file in which the attributes.txt file has a list of relative paths to images and their attributes (this makes swapping datasets for your experiments much easier – simply replace the zip file). The zip file with attributes and aligned-cropped images from celebA can be downloaded (over http or s3) from our bucket on s3: s3://peterz-sagemaker-east/data/


Note that the file has to be placed in a bucket in the same region as your SageMaker notebook (otherwise, you’ll get an error).


Then we need to prepare a container that SageMaker will run, and upload it to Amazon ECR (Elastic Container Registry). To simplify the process a bit, I have written a script that does it for you (I can’t take full credit though – most of it comes from a SageMaker tutorial).


In the FaderNetworks folder, run ./ – this should build and upload the containers for classifier training and the actual fader network training.


To take a peek at what’s happening in that script, let’s look at Dockerfile-base (the Dockerfile for the image with layers common to the classifier, training and interpolation images):


FROM nvidia/cuda:8.0-cudnn6-runtime-ubuntu16.04

# install python, pytorch, opencv and matplotlib
# (the original pins a CUDA 8.0 pytorch wheel URL here; the exact URL is
# platform-specific, so a generic pip install is shown instead)
RUN apt-get update && apt-get -y install python-dev python-pip && \
    pip install torch && \
    pip install torchvision && \
    pip install matplotlib opencv-python

# install system libraries for rendering and other tools
RUN apt-get install -y libxrender1 libsm6 libglib2.0 libxext6 unzip

COPY . /opt/code

ENV PATH=/opt/code:$PATH

# disable python output buffering so that training logs show up (see below)

ENV MODELS_PATH=/opt/ml/model
ENV IMG_ZIP_PATH=/opt/ml/input/data/training/


Basically, we take the ubuntu16.04 image with cuda and cudnn installed, download and install torch and the other dependencies, and then configure environment variables to point to the data and the model directory. Note the “ENV PYTHONUNBUFFERED=True” line – it is very important to disable python output buffering, because otherwise you will not see the output of training (the documentation says this variable is needed to see the output faster; I ran classifier training for a day and did not see any output with buffering enabled, so I’ll rephrase “faster” as “at all”). Side note – I later discovered that the logs can also be seen via the SageMaker console and CloudWatch either way, and even with buffering disabled the pipe into the notebook is brittle in the long term (it won’t stay active, for example, for a day), so going to the logs via the console and CloudWatch may be the preferred solution one way or another.


With all the code and data mounted, all we have to do is specify the correct entrypoint for the classifier image – this is how our Dockerfile-cls (the Dockerfile for the classifier image) will look:


ENTRYPOINT cd /opt/code && python


Note that here we go a bit against SageMaker’s suggested paradigm of having the same image for both training and inference. Normally, SageMaker calls the image with a “train” or “serve” argument, so the method suggested in the “Bring your own” section of the SageMaker examples is to create executable scripts called “train” and “serve” and put them in one of the folders the PATH env variable points to. Here, we hijack that process by overriding the entrypoint – this way our command gets executed no matter which argument the image was started with.
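For reference, the conventional scheme we are bypassing can be sketched as a tiny dispatcher placed on the PATH under the names train/serve (the function bodies below are placeholders, not SageMaker API):

```python
import sys

def dispatch(argv):
    """SageMaker starts the container as 'docker run <image> train'
    (or 'serve'); a conventional image routes on that argument."""
    mode = argv[1] if len(argv) > 1 else "train"
    if mode == "train":
        return "training"   # placeholder: launch the training script here
    if mode == "serve":
        return "serving"    # placeholder: start the inference server here
    raise ValueError("unknown mode: %s" % mode)
```

Overriding ENTRYPOINT short-circuits this dispatch entirely, which is fine here because we never serve from the training image.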


As mentioned above, the ./ script builds a docker image using this dockerfile and uploads it to Amazon ECR (it uses awscli to log into your aws account, so you’ll need it installed; but if you followed the prerequisites, you should have it by now).


Finally, we are ready to do some training! Create a new notebook in SageMaker, and put the following into it (I’ll type the code out so you can copy/paste it; the output is shown in the screenshot):


from sagemaker import get_execution_role
import sagemaker as sage
role = get_execution_role()
sess = sage.Session()


# note that you’d have to use an ECR name prefixed with your AWS account number here
cls_image = ""

# location of the data zip on s3 (see above)
data_location = "s3://peterz-sagemaker-east/data/"

# sketch of the elided estimator set-up; the instance type is the one used in the text
cls = sage.estimator.Estimator(
    cls_image, role,
    train_instance_count=1,
    train_instance_type="ml.p2.xlarge",
    sagemaker_session=sess){"training": data_location})

Once the images are unzipped, the classifier training begins. One epoch takes about 3.5 minutes on SageMaker ml.p2.xlarge instances. To me the epoch number feels a bit misleading (but that is a question for the authors of the fader networks code), given that the “epoch size” is set more or less arbitrarily to 50000 by default. After a little while of training you should have a classifier that does a relatively decent job of telling smiling and non-smiling faces apart. The default attribute is “Smiling”, but you can change that via the command-line argument --attr. Note that it will have to go into the last line of Dockerfile-cls, which forms the “train” command. Naturally, if you make any changes to the training routine, don’t forget to rebuild and push the images.


Sometimes the notebook gets disconnected from the pipe with the training logs (which is strange, given that SageMaker notebooks are also hosted on AWS). When that happens, you can still access the logs of your job by navigating to the job in the SageMaker console (note the pagination – old jobs may be on later pages) and clicking “View logs” in the “Monitor” section.
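If you prefer pulling logs programmatically instead of clicking through the console, something along these lines should work (a sketch assuming boto3 is configured; the log group name is the one SageMaker uses for training jobs):

```python
LOG_GROUP = "/aws/sagemaker/TrainingJobs"  # group SageMaker writes training logs to

def stream_prefix(job_name):
    # each job's log streams are named "<job name>/algo-..."
    return job_name + "/"

def print_training_logs(job_name, region="us-east-1"):
    import boto3  # imported here so the helper above works without it
    logs = boto3.client("logs", region_name=region)
    streams = logs.describe_log_streams(
        logGroupName=LOG_GROUP, logStreamNamePrefix=stream_prefix(job_name))
    for stream in streams["logStreams"]:
        events = logs.get_log_events(
            logGroupName=LOG_GROUP, logStreamName=stream["logStreamName"])
        for event in events["events"]:
            print(event["message"])
```

This reads the same CloudWatch streams the “View logs” button shows, so it keeps working even after the notebook pipe dies.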


Okay, so now we have an (at least partially) trained classifier (just as the README on the github page of the fader network code specifies, the classifier does not have to be great, it just has to be ok). The next step is training the fader network.


1.2. Training fader network with SageMaker

Naturally, we’ll follow the same procedure as in the previous section – create the docker image, upload it to ECR (in fact, the ./ script creates and pushes both the classifier and training containers), and run training. One problem is that while SageMaker saves the model folder (where the weights and the architecture of the classifier network were written) on s3, it does so as a tar archive. This is a bit of a nasty surprise – dealing with tar files from python code is possible, but definitely low on my list of favorite things – but we can easily work around it by including the decompression command in the ENTRYPOINT of the docker image. Also, the classifier was trained to classify smiling vs non-smiling faces (i.e. for the attribute “Smiling”) only, while training defaults to both the “Smiling” and “Male” attributes. We’ll simply set --attr=Smiling for the training to be able to use the classifier from section 1.1. The resulting Dockerfile will look something like this:



ENTRYPOINT cd /opt/ml/input/data/classifier &&  \
           tar --verbose -xf model.tar.gz && \
           cd /opt/code && \
           python --eval_clf=/opt/ml/input/data/classifier/best.pth --attr=Smiling


Next, we configure training. Note, however, that unlike the logs, the output of the job (i.e. the saved model) is not piped continuously to the proper s3 location – it is only saved at the end (which means that in order to see how the model is doing midway, you’ll have to stop the job – a very quantum mechanical approach :)). Using two data channels (one for the data, one for the classifier model) is just as straightforward as using one (I keep using the same notebook as for the classifier training):


# again, use your ECR prefix
trn_image = ""
# job name from the previous step has to go into this path
cls_location = ""

# sketch of the elided estimator set-up, mirroring the classifier one
fader = sage.estimator.Estimator(
    trn_image, role,
    train_instance_count=1,
    train_instance_type="ml.p2.xlarge",
    sagemaker_session=sess){"training": data_location, "classifier": cls_location})

1.3. Why so serious? Attribute interpolations with SageMaker

The interpolation is a fairly small job, so we could do it directly from a notebook… except the notebook only gets 4.5 Gb of disk storage (+ 8 Gb of temp storage). The decompressed tensor with images takes more disk space than that, so this won’t work. We could modify the code so that the image tensor is not saved to disk (besides, we only need a few images to run the interpolation on). Two other options are to generate yet another container with the interpolation job, or to download the files and generate the interpolations locally. The latter approach is boring from an infrastructure point of view (and has very similar steps to section 2.3), so I’ll defer it and try the former. The sequence of steps should be familiar by now – we define a Dockerfile (Dockerfile-int) that looks like this:



ENTRYPOINT cd /opt/ml/input/data/fader &&  \
           tar --verbose -xf model.tar.gz && \
           cd /opt/code && \
           python  --model_path=/opt/ml/input/data/fader/best_rec_ae.pth \
                   --output_path=/opt/ml/model/output.png --alpha_min=5 --alpha_max=5


Just as in the case of training we mounted the classifier model at /opt/ml/input/data/classifier, for interpolation we’ll mount the trained fader network checkpoint folder at /opt/ml/input/data/fader, untar the archive and use the “best_rec_ae.pth” checkpoint (the one that provides the best reconstruction, from what I understand). Another point to note is that we crank up the alpha range for extra visual effect (the --alpha_min=5 --alpha_max=5 arguments). This docker image also needs to be built and pushed to ECR (both steps are handled by the script).


We then initiate interpolation job from the notebook:


int_image = ""
# trained fader network job name goes into this path
trn_location = ""

# sketch of the elided estimator set-up, same pattern as before
interpolator = sage.estimator.Estimator(
    int_image, role,
    train_instance_count=1,
    train_instance_type="ml.p2.xlarge",
    sagemaker_session=sess){"training": data_location, "fader": trn_location})


Once the interpolation is done, it would be cool to display the image in our notebook (are we using jupyter or not, after all? Where are the pretty pictures?). While the somewhat more natural way is probably to download the file from s3 using awscli, unpack it via a bash command, and then read it from python – where’s the fun in that? :) A more elegant and pythonic way is to pipe the s3 file into python’s tarfile module and then into an Image object:


import boto3
import tarfile
import io
from PIL import Image

fader_job = 'fadernetwork-int-2018-02-25-12-15-32-595'
# SageMaker puts the job output at <job name>/output/model.tar.gz in the bucket
s3file = boto3.resource('s3').Object('peterz-sagemaker-east',
                                     fader_job + '/output/model.tar.gz')
with'Body'], mode='r|*') as tarobj:
    for tarmember in tarobj:
        # go through all members in the tar and extract the first regular file
        if tarmember.isreg():
            imagedata = tarobj.extractfile(tarmember).read()
            image =
            break


The jupyter notebook with all the steps can be found in my fork (note, however, that it will not work locally, only on a SageMaker notebook instance).


2.1. Training classifier with Studio.ML

The assumption with Studio.ML is that you should be able to test the code locally before proceeding with cloud training. As such, I would recommend installing pytorch locally. I would also highly recommend using a clean virtualenv for the local fadernetwork experiments, because conflicting python dependencies are something that haunts me in nightmares after less than a year of python development. Okay, say you have pytorch, studioml and the extra requirements installed in a clean virtualenv. Can we just run the classifier training script locally with python to see if things work? Yes, but we need to tell it where the data is. You’ll need to download the data zip file locally, and set the IMG_ZIP_PATH environment variable to point to it. For instance:


export IMG_ZIP_PATH=/path/to/


After that, the script should indeed start image preprocessing and classifier training locally. If your local machine has not been consuming steroids, the training may take a while, so you can simply stop it with Ctrl-C.


Okay, so where does studio come in? We can run the same script (and abstract away the data location) via:


studio run 


The command-line option --capture-once tells studio to present the data file as an artifact “data” to the code (within the code, the location of the data can be accessed in an execution-context-independent way via the studio.fs_tracker.get_artifact(“data”) call). An s3:// uri can also be used, provided you have access to the file.


If prompted about logging in, run `studio ui` in another window and go to http://localhost:5000 in your browser to log in.


Studio handles downloading and caching the data locally – so while downloading 1.3 Gb the first time will take a little while, the next time you run it studio will use the cached copy. You can also point to the data using an s3:// link. But, again, running locally may be slow, and the central theme here is cloud compute anyway. To run in the cloud with studio ml, add the --cloud=[ec2|gcloud] --gpus=1 flags after studio run (make sure that you have proper credentials for the respective cloud – aws access keys configured, or a google application credentials json file pointed at by the GOOGLE_APPLICATION_CREDENTIALS env variable). The whole command line now reads:


studio run --cloud=ec2 --gpus=1 --hdd=256g 


This will reserve an on-demand instance (we could use spot instances with --cloud=ec2spot, but the training code is not preemption-tolerant, so let’s not do that) with 1 gpu and 256 Gb of hard drive space, and run the classifier training on it.


The progress of training (once it starts) can be seen either at http://localhost:5000 (if you have not stopped the ui server) or by going to the central experiment repo (I have also added the --experiment=fader_celeba_classifier_smiling option to the studio run call to give the experiment a human-readable name). The output of the experiment is shown when you click on the experiment key.


In practice, when training the first few times on ec2, I’d recommend adding --ssh_keypair=<your_aws_key_pair>. This will let you ssh into the actual instance running the training and may save you some trouble – for instance, if models are accidentally saved in the wrong place, it will help salvage and debug things. (I’m speaking from experience here – it was really painful when I ran a container in SageMaker for a day, just to find that the env variables defining where the data should be saved had the wrong names in the Dockerfile, so no models were exported.) When running in google cloud compute engine, ssh-ing into instances can be done through the web interface by default.


2.2. Training fader network with Studio.ML

Once the classifier is trained sufficiently, we are ready for some actual fader network training. How do we pass the classifier model to the training routine, though? The hacky way would be to download the model and pass it as data; however, Studio.ML allows you to pass artifacts saved by one experiment into another. Namely,


studio run --cloud=ec2 --gpus=1 --hdd=256g \
--reuse=fader_celeba_classifier_smiling/modeldir:eval_clf \
--capture-once=s3://peterz-sagemaker-east/data/ \
--eval_clf=best.pth --attr=Smiling


should do the trick and start the training. The --reuse command-line option makes the modeldir artifact of the experiment fader_celeba_classifier_smiling accessible to the training script under the name eval_clf (i.e. from within python, the “fs_tracker.get_artifact(‘eval_clf’)” call will return the local path to the artifact). The --eval_clf=best.pth option tells the training code to pick the classifier model with the highest validation accuracy.
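Inside the training code this resolves to something like the following sketch (the helper name is mine; only fs_tracker.get_artifact is the actual API):

```python
import os

def classifier_checkpoint(filename="best.pth"):
    """Locate the reused classifier: under studio the 'eval_clf' artifact
    passed via --reuse is resolved by fs_tracker; in a plain local run we
    just expect the checkpoint next to the script."""
    try:
        from studio import fs_tracker
        return os.path.join(fs_tracker.get_artifact("eval_clf"), filename)
    except Exception:  # studio not installed / not running under studio
        return filename
```

This is what makes the same training script work unchanged whether the classifier comes from another experiment or from a local file.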


Again, by navigating to the local ui or the central experiment repo you’ll be able to track the progress of your experiment once it starts. Total training time for 1000 epochs on google cloud with K80s was ~6 days.


2.3. Attribute interpolations with Studio.ML

Time to make celebrities smile! We’ll run the interpolations locally (I have modified the code a little so that it does not require cuda and will work on machines without a GPU if the env variable FADER_NOCUDA is set; to set it, run export FADER_NOCUDA=true in the command line). Using the exact same recipe as above for training, we compose a studio run command line pointing to the data and the trained fader network model:


studio run --reuse=fader_celeba_train_gc/modeldir:model \
--capture-once=s3://peterz-sagemaker-east/data/ \
--model_path=best_rec_ae.pth --output_path=$(pwd)/output.png --alpha_min=5 --alpha_max=5


If you have not run the classifier or trainer before, the first run may take a little while due to the dataset download. The dataset is cached locally, so any subsequent request for the data with the same url will be fast. The --output_path=$(pwd)/output.png option ensures that the output file shows up in the current directory (Studio.ML may decide to run the script in a copy of the workspace even for local runs, in which case output.png would end up in that copy of the workspace by default).


Finally, the result of our long labours:

The first column is the original image, the second is the decoder output, and the third through last columns are the results of interpolating the smiling attribute from -5 to 5.


3. Conclusion

We have tried reproducing a cutting-edge research neural network using two modern frameworks for ML experiment management and cloud computing – Amazon SageMaker and Studio.ML. Both required small changes to the original code (mainly having to do with passing in the location of the data). Pros and cons of each (as encountered in this particular exercise, not a full feature comparison) are summarized below.



SageMaker:
  • Well-integrated with AWS
  • Easy job monitoring
  • Models can be pushed directly into production-grade services (we did not really use that here, but it may be the next logical step – a “make your buddy smile” service)
  • Containerization requires an extra step in porting the code; local testing may be tricky (you need to mount the data manually)
  • Passing parameters via the command line is difficult (they have to go into the containers)
  • Experiment pipelines (i.e. re-using artifacts of one experiment in another) are hacky to implement
  • No direct access to instances, so debugging / salvaging an experiment is difficult
  • Limited storage space on notebook instances (not really a limitation with proper discipline, but who has that :))



Studio.ML:
  • Open-source, not tied to a particular cloud (currently runs on AWS and Google Cloud, with Azure coming soon)
  • No explicit containerization required
  • Experiment pipelines are supported naturally
  • Can be run locally without extra steps
  • Less mature than SageMaker (less support, has bugs)