Tutorial

This tutorial targets Ubuntu 16.04, but any other OS on which Docker can be installed will work.

This tutorial focuses on distributed training and on deploying the trained model, both running on a cluster of cryptocurrency mining systems (aka miners).

1. Background

The open-source Horovod framework is used for distributed deep learning training.

In this tutorial, we will run distributed training of the MNIST classification example, which is based on Keras and Horovod, and then deploy the trained model as a Flask app.
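
To make "distributed training" concrete: in Horovod-style data parallelism, each worker computes a gradient on its own shard of the data, and the gradients are averaged across workers before every update. The following is a minimal pure-Python sketch of that averaging step. It is not Horovod code; the function names, the one-parameter model and the data are illustrative assumptions.

```python
# Toy illustration of data-parallel training (NOT Horovod's API).
# Model: y = w * x, loss = mean((w * x - y)^2).

def local_gradient(w, shard):
    """Gradient of the mean squared error w.r.t. w on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def allreduce_average(values):
    """Stand-in for an allreduce: every worker ends up with the mean."""
    return sum(values) / len(values)


def distributed_step(w, shards, lr):
    """One synchronous SGD step across all workers."""
    grads = [local_gradient(w, shard) for shard in shards]  # one per worker
    return w - lr * allreduce_average(grads)


if __name__ == "__main__":
    # Data generated from y = 3x, split evenly across 4 hypothetical workers.
    data = [(float(x), 3.0 * x) for x in range(1, 9)]
    shards = [data[i::4] for i in range(4)]
    w = 0.0
    for _ in range(200):
        w = distributed_step(w, shards, lr=0.01)
    print(round(w, 3))  # w approaches the true slope of 3
```

With equally sized shards, the averaged gradient equals the gradient over the full batch, which is why adding workers does not change what is learned, only how fast each step is computed. In a real Horovod script this averaging is handled by wrapping the optimizer (for example with hvd.DistributedOptimizer) rather than written by hand.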

2. Requirements

Let’s start off by installing requirements.

3. Run User Docker Container

sudo docker run -it aicrypto/user

Docker automatically downloads the aicrypto/user image if it is not already on the system.

4. Login

aicctrl login

Then, enter your MetaMask Ethereum wallet address.

5. Create Deep Learning Cluster

We will create a Deep Learning Cluster to run distributed deep learning training.

aicctrl create   

After you run the command above, you are prompted to choose a Gateway, as shown below.

[aicctrl] gateway lists below.  
   1. name: test_gateway, available miners: 8  
[aicctrl] choose gateway :   

A Gateway is a network access point made up of Miners. Miners behind different gateways cannot take part in the same distributed training run. This is the beta, and only one Gateway is provided at the moment, so type 1 and press Enter to select Gateway 1.

[aicctrl] Input cluster name :   

Choose a cluster name.

[aicctrl] Input miner counts(max == 12) :   

Enter the number of miners you wish to include in your cluster. max == 12 means that you can select up to 12 miners. Let’s choose 8 for now.

[aicctrl] No ssh key pairs.  
[aicctrl] Input ssh key name to be created :   

Next, the deep learning cluster needs an SSH key pair. Enter the name you want for the key pair; it is created and attached to the cluster automatically.

[aicctrl] No data storage.  
[aicctrl] Input data storage name to be created :   

Then, enter a name for the data storage that will be created and mounted to your cluster.

[aicctrl] Now 8 miners are creating.  
[aicctrl] Successfully created 8 miners  

You are finished: you have now created a DL Cluster with 8 miners.
(It takes approximately 1 minute for the actual miners to be ready.)

6. Connect to a Miner on the DL Cluster

aicctrl connect  

A screen like below will show up.

[aicctrl] cluster lists below.  
   1. name: test-cluster, storage: test-storage, minerCount: 8  
[aicctrl] choose cluster :   

When you enter 1 and press Enter,

index, hostname  
1 m6  
2 m16  
3 m4  
4 m13  
5 m17  
6 m5  
7 m8  
8 m9  
[aicctrl] enter miner index to connect :   

a screen like the one above appears. Enter any index to connect. Here we enter 4 to connect to miner m13.

Warning: Permanently added '[125.188.51.150]:43531' (ECDSA) to the list of known hosts.  
Welcome to Ubuntu 16.04.5 LTS (GNU/Linux 4.15.0-43-generic x86_64)  
  
 * Documentation:  https://help.ubuntu.com  
 * Management:     https://landscape.canonical.com  
 * Support:        https://ubuntu.com/advantage  
  
The programs included with the Ubuntu system are free software;  
the exact distribution terms for each program are described in the  
individual files in /usr/share/doc/*/copyright.  
  
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by  
applicable law.  
  
root@m13:~#   

A screen like the one above appears. You have successfully connected to miner m13 over SSH.

exit  

The command above closes the SSH connection.

Next, let’s connect over SSH directly, without using the CLI. A manual SSH connection requires an SSH key, which you can download from the server with the command below.

aickey save  

A screen like below will show up.

[aickey] ssh key pair lists below.  
   1. name: test-key  
[aickey] choose ssh key pair :   

Enter 1 to download test-key.

[aickey] Successfully saved ssh key pair, filename .test-key on current folder.  

When output like the above appears, the download is complete. The key is saved as .keyname (here, .test-key) in the directory where you ran the CLI.

aicctrl list

View the miner’s IP and Port by using the command above, and connect like below.

ssh root@125.188.51.150 -p 43531 -i .test-key  
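
If you want to script the manual connection, the command above can be assembled from the host, port and key file reported by the CLI. The sketch below is only an illustration: the helper names are hypothetical, and whether aickey already saves the key with owner-only permissions is an assumption we do not rely on (OpenSSH refuses private keys that other users can read, so the sketch includes a helper to restrict them).

```python
import os
import stat


def restrict_key_permissions(key_path):
    """Make the private key readable and writable by its owner only (0600)."""
    os.chmod(key_path, stat.S_IRUSR | stat.S_IWUSR)


def build_ssh_command(host, port, key_path, user="root"):
    """Return the argument list for an ssh call like the one above."""
    return ["ssh", f"{user}@{host}", "-p", str(port), "-i", key_path]


if __name__ == "__main__":
    # Example values from this tutorial; replace them with the IP and port
    # that `aicctrl list` reports for your own miner.
    print(" ".join(build_ssh_command("125.188.51.150", 43531, ".test-key")))
```

The returned list can be passed to subprocess.run to open the session from a script.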

7. Upload Horovod examples to the remote storage

Download the Horovod examples.

apt-get install -y --no-install-recommends subversion  
svn checkout https://github.com/uber/horovod/trunk/examples  
rm -rf examples/.svn  

The examples are downloaded into the examples directory under the current path. From these, we will upload keras_mnist.py to the remote data storage. Enter the command below.

aicdata upload  

Then, select 1 + Enter.

[aicdata] ssh key pair lists below.  
    1. name: test-storage  
[aicdata] choose data storage :  

Then, enter examples/keras_mnist.py.

[aicdata] Input source path on your system :   

Then,

Warning: Permanently added '[125.188.51.150]:5500' (ECDSA) to the list of known hosts.  
[aicdata] Uploaded.  

Let’s check that the file was uploaded correctly. Enter the command below.

aicdata download  

Then, Select 1.

[aicdata] ssh key pair lists below.  
    1. name: test-storage  
[aicdata] choose data storage :  

Enter . and press Enter to download into the current working directory.

[aicdata] Input download path on your system :  

Then, enter . and press Enter to download all the files in the root path of the Data Storage.

[aicdata] Input source path on your storage :  

Then,

[aicdata] Downloaded.  

8. Train

Distributed training can be started with the command below.

aicrun keras_mnist.py  

(aicrun takes a script argument, which must be the path of the script on the Data Storage.)

Then, press 1 to select the Cluster created before.

[aicrun] Start aicrun  
[aicrun] cluster lists below.  
    1. name: test-cluster, storage: test-storage, minerCount: 8  
[aicrun] choose cluster :   

Then

[aicrun] How many miners that you use on this training? (all == 0)(max == 8) :   

If you enter 0, all the miners in the cluster are used for training. If you enter 5, only 5 miners are used. For now, we will enter 0 to train on all miners.

Input 0 + Enter, then

[aicrun] Input gpu counts that you use per miner :   

In the current beta, every miner has 6 GPUs. If you enter 3, each miner uses 3 GPUs for training. We will enter 3 and press Enter.
Since we selected all 8 miners, a total of 24 GPUs (8 miners × 3 GPUs) will be used.
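
The two prompts above combine into a simple calculation. As a small sketch (the function names are hypothetical, and the limit of 6 GPUs per miner is the beta value stated above):

```python
def resolve_miner_count(requested, cluster_size):
    """Interpret the aicrun prompt: 0 selects every miner in the cluster."""
    if requested < 0 or requested > cluster_size:
        raise ValueError(f"miner count must be between 0 and {cluster_size}")
    return cluster_size if requested == 0 else requested


def total_gpus(miners, gpus_per_miner, gpus_available=6):
    """Total GPUs used for the run; 6 per miner is the current beta limit."""
    if not 1 <= gpus_per_miner <= gpus_available:
        raise ValueError(f"GPUs per miner must be between 1 and {gpus_available}")
    return miners * gpus_per_miner


if __name__ == "__main__":
    miners = resolve_miner_count(0, cluster_size=8)  # 0 means "all miners"
    print(total_gpus(miners, gpus_per_miner=3))  # prints 24
```

Assuming one training process runs per GPU, which is the usual Horovod arrangement, this total is also the number of workers taking part in the run.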

[aicrun] Start machine learning  

Then, training will begin.

[aicrun] End machine learning  
[aicrun] Deeplearning execution time -> 0:5:11 seconds.  
  
[aicrun] creating training  
[aicrun] done  
[aicrun] Finished  

When training finishes, the total execution time is shown.
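
If you want to compare runs with different miner or GPU counts, the reported time can be converted to seconds. The sketch below assumes the value is laid out as hours:minutes:seconds, which the sample output suggests but the CLI does not state; the function name is hypothetical.

```python
import re

# Matches the time in a line such as:
# [aicrun] Deeplearning execution time -> 0:5:11 seconds.
_TIME_PATTERN = re.compile(r"execution time -> (\d+):(\d+):(\d+)")


def execution_seconds(log_line):
    """Extract the run time in seconds, assuming an H:M:S layout."""
    match = _TIME_PATTERN.search(log_line)
    if match is None:
        raise ValueError("no execution time found in log line")
    hours, minutes, seconds = (int(part) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds


if __name__ == "__main__":
    line = "[aicrun] Deeplearning execution time -> 0:5:11 seconds."
    print(execution_seconds(line))  # prints 311
```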

9. Download the result

Although training has finished, nothing has changed on the local system, because keras_mnist.py saves its result on the first miner. Let’s connect to that miner, copy the result file to the remote storage, and then download it to the local system.
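
Why only the first miner? Horovod training scripts commonly guard checkpointing with a rank check so that workers do not overwrite each other's files. Assuming keras_mnist.py follows that pattern, the behaviour can be pictured with this small simulation (no Horovod involved; the function name and the worker count are illustrative):

```python
def checkpoint_files(rank, epochs):
    """Files one worker would write: only rank 0 saves a checkpoint per epoch."""
    if rank != 0:
        return []
    return [f"checkpoint-{epoch}.h5" for epoch in range(1, epochs + 1)]


if __name__ == "__main__":
    # 24 hypothetical workers, 1 epoch: only worker 0 leaves a file behind.
    for rank in range(24):
        files = checkpoint_files(rank, epochs=1)
        if files:
            print(rank, files)  # prints: 0 ['checkpoint-1.h5']
```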

Let’s connect to the first miner by the command below.

aicctrl connect  

Then,

[aicctrl] cluster lists below.  
    1. name: test-cluster, storage: test-storage, minerCount: 8  
[aicctrl] choose cluster :   

Select 1,

index, hostname  
1 m6  
2 m16  
3 m4  
4 m13  
5 m17  
6 m5  
7 m8  
8 m9  
[aicctrl] enter miner index to connect :   

and select first miner. Then,

Welcome to Ubuntu 16.04.5 LTS (GNU/Linux 4.15.0-43-generic x86_64)  
  
 * Documentation:  https://help.ubuntu.com  
 * Management:     https://landscape.canonical.com  
 * Support:        https://ubuntu.com/advantage  
root@m6:~#   

upon successful connection, a screen like the one above appears. Let’s list the files with the command below.

ls  

Results will show like below.

checkpoint-1.h5  storage  

Use the command below to move the checkpoint-1.h5 file into the storage directory.

mv checkpoint-1.h5 storage/

Then, use the exit command to close the SSH connection.

Use the command below to download the result file.

aicdata download  

Select 1 to choose test-storage.

[aicdata] ssh key pair lists below.  
    1. name: test-storage  
[aicdata] choose data storage :   

Then, enter . and press Enter to download into the current directory.

[aicdata] Input download path on your system :  

Then, enter . to download all the files in the root path of the remote data storage.

[aicdata] Input source path on your storage :   

A message like below will show up when the file is successfully downloaded.

[aicdata] Downloaded.  

10. Deploy

We will first remove the cluster that we have created above.

aicctrl remove  

Then, select 1 + Enter.

[aicctrl] cluster lists below.  
    1. name: test-cluster, storage: test-storage, minerCount: 8  
[aicctrl] choose cluster :  

aicdeploy automatically deploys a Flask project with the structure below, served via Nginx and uWSGI.

project/
    app.py             (The Flask app file must be named app.py.)
    requirements.txt   (List the required Python packages in requirements.txt format.)
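
Before deploying, it can help to confirm that a project directory matches this layout. The following check is a hypothetical helper, not part of aicdeploy; it only looks for the two files named above.

```python
from pathlib import Path

# The two files the layout above requires.
REQUIRED_FILES = ("app.py", "requirements.txt")


def missing_project_files(project_dir):
    """Return the required files that are absent from project_dir."""
    root = Path(project_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_project_files("project")
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("project layout looks complete")
```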

Let’s make the Flask App to deploy the trained model above.

First, create a directory named deploy_project, and move the result file named checkpoint-1.h5 to the directory.

mkdir deploy_project  
mv checkpoint-1.h5 deploy_project/  

Then, create app.py in that directory and paste in the code below.

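import tempfile

import numpy as np

from flask import Flask, jsonify, request, Response
from PIL import Image
from keras.models import load_model

# Load the trained model once at startup so every request reuses it.
model = load_model('checkpoint-1.h5')
# Build the prediction function before the first request arrives
# (private API of older Keras releases).
model._make_predict_function()

app = Flask(__name__)


@app.route('/', methods=['POST'])
def predict():
    if request.method == 'POST':
        # The image must be sent as a multipart file field named 'image'.
        if 'image' not in request.files:
            return Response('image file not attached.', status=400)

        with tempfile.TemporaryFile() as tmp:
            tmp.write(request.files['image'].read())
            image = Image.open(tmp)
            # Expects a 28x28 single-channel image, as in MNIST;
            # reshape to a batch of one and scale pixels to [0, 1].
            image_arr = np.array(image) \
                .reshape(1, 28, 28, 1) \
                .astype('float32')
            image_arr /= 255

        # The predicted digit is the index of the highest class score.
        return jsonify({
            'result': int(model.predict(image_arr)[0].argmax(axis=0))
        })

    else:
        return 'only post is permitted'


if __name__ == '__main__':
    app.run()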
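
The value returned in the result field is the index of the largest of the ten class scores the model produces for the image. As a dependency-free sketch of that last step (the scores are made up, and predicted_digit is a hypothetical helper that mirrors what argmax does for one sample):

```python
def predicted_digit(probabilities):
    """Index of the largest score, i.e. what argmax returns for one sample."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


if __name__ == "__main__":
    # Hypothetical model output for one image: one score per digit 0-9.
    scores = [0.01, 0.02, 0.01, 0.05, 0.01, 0.02, 0.01, 0.82, 0.03, 0.02]
    print({"result": predicted_digit(scores)})  # prints {'result': 7}
```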
  

Let’s deploy. Enter the command below.

aicdeploy create  

Then, select 1.

[aicdeploy] gateway lists below.  
    1. name: test_gateway, available miners: 11  
[aicdeploy] choose gateway :   

Now, input the name of the cluster. We will use test-deploy.

[aicdeploy] Input cluster name :   

Enter the number of miners. We will choose 3.

[aicdeploy] Input miner counts(max == 11) :   

Then, input the path of the Flask Project. The path can be absolute or relative.
For now, we will enter deploy_project.

[aicdeploy] Input deploy project path on your system :   

The deployment has succeeded when output like the following appears.

[aicdeploy] Now compressing project on /root/deploy_project  
[aicdeploy] Now 3 miners are creating.  
[aicdeploy] Successfully created 3 miners  
[aicdeploy] API address : http://125.188.51.150:31782  

Let’s verify that the deployment succeeded.

First, download the test image.

wget https://pool.aicrypto.ai/images/test.jpg  

Then, install the requests Python package.

pip install requests  

Save the following code as test.py.

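import requests

# Open the image in binary mode; text mode cannot read JPEG bytes reliably.
with open('test.jpg', 'rb') as image_file:
    r = requests.post('http://125.188.51.150:31782', files={'image': image_file})

if r.status_code == 502:
    # The miner is still starting up, so there is no JSON body to print yet.
    print('Miner is now activating. Please try again 10 secs later.')
else:
    # Raise for any other error status; otherwise print the prediction.
    r.raise_for_status()
    print(r.json())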


Run test.py. A JSON response containing the predicted result confirms that the deployment succeeded.
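
Because a freshly created miner may answer with status 502 for a short while, you may prefer to retry automatically instead of rerunning the script by hand. The sketch below is one possible approach, not part of the aicdeploy tooling; call_with_retry and its parameters are hypothetical, and the 10 second delay simply follows the hint printed by test.py.

```python
import time


def call_with_retry(send, attempts=6, delay=10.0, sleep=time.sleep):
    """Call send() until it returns a status other than 502.

    send must return an object with a status_code attribute, such as the
    response from requests.post. The last response is returned either way.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        response = send()
        if response.status_code != 502:
            break
        if attempt < attempts - 1:
            sleep(delay)  # give the miner time to finish activating
    return response
```

To use it, wrap the requests.post call in a function that reopens test.jpg in binary mode on every attempt, and pass that function as send.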

If an error occurs, you can print the uWSGI log with the command below.

aicdeploy logs