Access Data from AWS S3-Python


Download data from cloud storage for your Python-based applications, like machine learning

Someone, somewhere said, "Data is the new oil."

It stands true given how data will run our lives in the near future through machine-driven, widely accessible solutions. With our increasing dependency on data to build these solutions, the volume, quality, interpretability, complexity and relatability of data all play a major role in their successful deployment. Volume is definitely the forerunner among them: with lots of data flowing in daily from a growing number of sources, the cost and maintenance of storing it keep rising. This is where cloud-based storage services like Amazon S3, Microsoft Azure Blob Storage and Google Cloud Storage come to the rescue.

Background

Recently I was working on one of my own Python-based machine learning projects in computer vision, where I had to fetch voluminous image data from AWS S3 to build my model. Considering the volume and the approach for future incoming data, I had to keep the data in a cloud-based storage service and fetch it whenever it was needed to build the model. Initially I was torn between AWS S3 and GCP storage, but later decided to proceed with S3 due to the availability of a free tier. Since I am working with Google Colab to build the model (not a great idea to integrate across platforms given the difficulty of use, but cost still matters), I had to use an SDK to access S3.

A little introduction to S3

S3 is Amazon's cloud-based platform for storing your data. AWS enables free tier access for new users for up to one year across many of its services, S3 included. The S3 free tier includes the following each month, free of cost, for up to one year:

  • 5 GB of storage
  • 20,000 GET requests
  • 2,000 PUT requests
  • 15 GB of data transfer in and 15 GB of data transfer out

S3 is an object-based storage service. It contains buckets (think of them as folders) in which objects (your data) are stored.
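For instance, every object is addressed by the name of its bucket plus a key; the slashes in the key are what the S3 console displays as folders. The bucket and key names below are made up purely for illustration:

s3://my-training-data/images/cats/cat_001.jpg
bucket: my-training-data
key: images/cats/cat_001.jpg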

Boto3

When you are looking to access a cloud-based service like S3 from Colab, it becomes really difficult to work with the APIs across platforms, which is where I stumbled upon an SDK named Boto3. Boto3 is the AWS SDK for Python: it allows users to access AWS services and retrieve data from Python-based applications, helping us integrate them with AWS.

Installing Boto3

pip install boto3
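Once installed, a quick way to check that Boto3 can reach your account is to list your buckets. A minimal sketch; the credential strings are placeholders you generate from the AWS console:

import boto3

# Placeholder credentials - generate these from the AWS IAM console
s3 = boto3.resource('s3',
                    aws_access_key_id='your access key',
                    aws_secret_access_key='your secret key')

# Print the name of every bucket these credentials can see
for bucket in s3.buckets.all():
    print(bucket.name)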

Below is a snippet of code that will allow you to use Boto3 to download objects from your S3 bucket and feed them to your application as and when required.

# Initialising
import boto3
import tempfile
import imageio as im   # assuming imageio is the image library behind im.imread below

l = []
ACCESS_KEY = 'your access key generated from AWS'
SECRET_ACCESS_KEY = 'your secret id generated from AWS'

# Code to download the images from S3 using Boto3
s3 = boto3.resource('s3',
                    aws_access_key_id=ACCESS_KEY,
                    aws_secret_access_key=SECRET_ACCESS_KEY)
bucket = s3.Bucket('your bucket name here')

# Collect the key of every object under the given folder (prefix)
for i in bucket.objects.filter(Prefix='your folder name here/'):
    l.append(i.key)

print('Total Images in S3 bucket: '+str(len(l)))
# Download each image to a temporary file and read it in
total = 0
for key in l:
    if total == len(l):
        break
    else:
        name = 'img_' + str(total)   # label for the current image
        obj = bucket.Object(key)
        tmp = tempfile.NamedTemporaryFile()
        with open(tmp.name, 'wb') as f:
            obj.download_fileobj(f)
        img = im.imread(tmp.name)
        total += 1   # advance the counter so each image gets a new label
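If you would rather avoid temporary files, each object can also be read straight into memory. A minimal sketch, assuming the same bucket and key list l as above, and that your image library (imageio here) accepts file-like objects:

import io

images = []
for key in l:
    # Fetch the raw bytes of the object and decode them without touching disk
    body = bucket.Object(key).get()['Body'].read()
    images.append(im.imread(io.BytesIO(body)))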