ML + DevOps Automation (Task 3)

Original article was published on Deep Learning on Medium


Jenkins Jobs

  • Job 1 : This job fetches the code from GitHub as soon as the developer commits it and downloads it into its workspace, using the “GITScm polling” trigger.

Jenkins then runs a series of commands that first check for the Docker container dl_train_env. If the container is running, it copies the code from the workspace to the container's target folder and runs Main_build.py, which trains the model with low default hyperparameters; this is the first build. If the container is not running, it starts the container, copies the code to the target folder, and then runs Main_build.py. Either way, this job starts the first build, that is, the first training run of our model.
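The check-then-run logic of Job 1 can be sketched roughly as follows. This is only an illustration, not the author's actual build step: the target folder path, the use of docker start, docker cp and docker exec, and the helper names are all assumptions.

```python
import subprocess

CONTAINER = "dl_train_env"      # container name from the article
TARGET_DIR = "/root/dl_code"    # hypothetical target folder inside the container


def container_is_running(name, run=subprocess.run):
    """Return True if `docker ps` lists a running container with exactly this name."""
    result = run(
        ["docker", "ps", "--filter", f"name={name}", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    # The name filter matches substrings, so compare whole names here.
    return name in result.stdout.split()


def job1_commands(running, workspace):
    """Build the ordered docker commands for Job 1 (assumed commands, see lead-in)."""
    commands = []
    if not running:
        # Assumption: the container already exists and only needs starting.
        commands.append(["docker", "start", CONTAINER])
    # Copy the contents of the Jenkins workspace into the container.
    commands.append(["docker", "cp", f"{workspace}/.", f"{CONTAINER}:{TARGET_DIR}"])
    # Start the first build (training with the default hyperparameters).
    commands.append(
        ["docker", "exec", CONTAINER, "python3", f"{TARGET_DIR}/Main_build.py"]
    )
    return commands


def run_job1(workspace, run=subprocess.run):
    """Check the container, then run each command, stopping at the first failure."""
    for command in job1_commands(container_is_running(CONTAINER, run), workspace):
        run(command, check=True)
```

Keeping the command list separate from its execution makes the branching easy to check without a Docker daemon. Job 2 would presumably follow the same pattern with Rebuild.py in place of Main_build.py.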

  • Job 2 : This job first checks the status of the container dl_train_env. If the container is running, it runs Rebuild.py, which starts the second build of our model with an extra dense layer and tweaked hyperparameters, as needed to reach the desired accuracy. If the container is not running, it first starts a new container and then runs Rebuild.py. This job is triggered by the job “job_mail_main_fail”, which is described further below.
  • Job 3 : This job’s only task is to monitor the Docker container dl_train_env, checking its status every minute.

For monitoring I have used a little trick with the exit command. The job checks the status of the container every minute. If the container is running, it executes exit 1, which reports a failure to Jenkins, so the job keeps failing; that is good for us, because it means our container is running properly. As soon as the container stops or fails, the job executes exit 0, which reports success to Jenkins, so the job finishes successfully for the first time.

Once the job finishes successfully, it triggers Job 1, which starts the whole process again.
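The inverted exit status can be sketched as below. The article describes the trick with plain exit commands; this Python version only mirrors the same logic. The use of docker inspect and the function names are assumptions. The one-minute interval would typically come from a periodic Jenkins schedule such as `* * * * *`.

```python
import subprocess

CONTAINER = "dl_train_env"  # container name from the article


def is_running(name):
    """Ask Docker whether the container is running (assumed check via docker inspect)."""
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.Running}}", name],
        capture_output=True, text=True,
    )
    # A non-zero return code means the container does not exist at all.
    return result.returncode == 0 and result.stdout.strip() == "true"


def monitor_exit_code(container_running):
    """Inverted status: 1 (Jenkins failure) while running, 0 (success) once stopped."""
    return 1 if container_running else 0


def main():
    """Exit code for the build step; call as sys.exit(main()) in the Jenkins job."""
    return monitor_exit_code(is_running(CONTAINER))
```

Because Jenkins treats a non-zero exit code as a failed build, a post-build trigger that fires only on success runs exactly when the container has gone down.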

  • Job_mail_main_success : This job is started by a remote trigger issued from Main_build.py, and only if the model reaches the desired accuracy. It runs a Python program that sends the developer a multimedia email about the training status, with graph images of the model training attached.
  • Job_mail_main_fail : This job is started by a remote trigger issued from Main_build.py when the model does not reach the desired accuracy. It runs a Python program that sends the developer a multimedia email about the training status, with graph images of the model training attached, and it also triggers Job 2 through a post-build action, which starts the rebuilds.
  • Job_mail_rebuild_success : This job is started by a remote trigger issued from Rebuild.py, and only if the model reaches the desired accuracy. It runs a Python program that sends the developer a multimedia email about the training status, with graph images of the model training attached.
  • Job_mail_rebuild_fail : This job is started by a remote trigger issued from Rebuild.py when the model still does not reach the desired accuracy, meaning the cause is something other than the layers and hyperparameters that were tweaked. It runs a Python program that sends the developer a multimedia email about the training status, with graph images of the model training attached.
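How the training scripts might pick and fire the right mail job can be sketched as follows. This is a minimal illustration under stated assumptions: the Jenkins address, the token, the accuracy threshold and the helper names are hypothetical, and the job names use the lowercase spelling from the article's inline references. The URL shape is the one Jenkins uses for its "Trigger builds remotely" option; a secured Jenkins may additionally require authentication.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen

JENKINS_URL = "http://localhost:8080"  # hypothetical Jenkins address
TOKEN = "my-token"                     # hypothetical remote-trigger token
TARGET_ACCURACY = 0.90                 # hypothetical desired accuracy


def mail_job_for(script, accuracy, target=TARGET_ACCURACY):
    """Pick the mail job name from the calling script and the achieved accuracy."""
    stage = {"Main_build.py": "main", "Rebuild.py": "rebuild"}[script]
    outcome = "success" if accuracy >= target else "fail"
    return f"job_mail_{stage}_{outcome}"


def trigger_url(job, base=JENKINS_URL, token=TOKEN):
    """Build the remote-trigger URL for a job: <base>/job/<name>/build?token=<token>."""
    return f"{base}/job/{quote(job)}/build?{urlencode({'token': token})}"


def trigger(job):
    """Send the trigger request and return the HTTP status code."""
    with urlopen(trigger_url(job), timeout=10) as response:
        return response.status
```

At the end of training, Main_build.py could then call something like trigger(mail_job_for("Main_build.py", accuracy)), with accuracy taken from the final evaluation.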

Extra Step only in job_mail_main_fail…