Integrating Boto3 Into Streamlit Apps

over 3 years ago

I have been experimenting with Boto3 a little more, with the goal of integrating its capabilities into Streamlit Apps. Boto3 is the AWS SDK for Python and allows us to create, configure, and manage AWS services programmatically. In my case, I am using it for S3, which stores objects/files in the cloud so they can be accessed programmatically from within Apps. To learn more about Boto3, feel free to read my previous post on the topic - Getting Started With Boto3 - AWS SDK for Python.

Streamlit is a framework that makes deploying Python projects as web Apps very easy, without the need for extensive web development skills or effort. Some Streamlit projects may involve working with various files.

In this experiment, let's assume our Streamlit App will be used by multiple team members who share their files. The data in these files needs to be programmatically extracted, analyzed, and used to create new report files that are shared with the same team members. Since the files are located on different machines, we need a central hub where all members can store and access them. There are other solutions for such purposes already. However, these files need further data processing that is done by our Streamlit App. That's why having one cloud storage that can be accessed by the App is useful.

We are trying to achieve two goals in this project. First, to store uploaded files in our S3 bucket. Second, to access and read these files from our Streamlit App. Once we achieve these goals, we can integrate these functions into our main Streamlit Apps. In this experiment we will focus on pdf files, although the same approach works with other file formats as well.

import streamlit as st
import boto3
import pdfplumber
from io import BytesIO

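# Placeholder credentials; never commit real keys to source control
access_key_id = 'XXXXXXXXXXXXXXXXXXXXX'
secret_access_key = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
# Placeholder name; real S3 bucket names cannot contain underscores
bucket_name = 'project_bucket'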

To get started we need to import our dependencies. The first two are self-explanatory: streamlit and boto3. The pdfplumber module will be used later to extract data from pdf files. BytesIO is needed to wrap the bytes we read from S3 in a file-like object that pdfplumber can process.

Secret keys must be kept hidden. One way is to use environment variables, or Config Vars on Heroku. Since we are in testing mode, the secrets are hardcoded here for simplicity.
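
As a rough sketch of the environment variable approach, the settings could be read like this. AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are the standard names that boto3 itself also looks for when no keys are passed explicitly. S3_BUCKET_NAME and the load_aws_settings() helper are hypothetical names made up for this illustration.

```python
import os

# The first two names are the standard AWS ones; S3_BUCKET_NAME is a
# hypothetical variable name chosen for this sketch.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "S3_BUCKET_NAME")


def load_aws_settings(env=None):
    """Return the required settings as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    # Treat unset and empty values the same way
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Failing early with a clear message is a design choice here: a missing variable is easier to diagnose at startup than as an authentication error in the middle of an upload.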

Next, we move on to the main section of the code, which starts our Streamlit app.

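def main():
    # Sidebar widget that accepts a single pdf file
    fl = st.sidebar.file_uploader('Upload new pdf file:', type='pdf', accept_multiple_files=False)

    # Save the file to S3 as soon as one is selected
    if fl is not None:
        save_file(fl)

    # Radio list of the files already stored in the bucket
    s3_filename = get_files_list()

    # The radio widget may return None when the bucket is empty, so guard against it
    if s3_filename is not None and st.sidebar.button('Get File - ' + s3_filename):
        get_file(s3_filename)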

if __name__ == '__main__':
    st.set_page_config(page_title="Reports", layout="wide", initial_sidebar_state="expanded")
    main()

The script starts with the if __name__ == '__main__': block of code. In the script file itself, this block sits at the bottom, after the functions it calls have been defined. I like to keep the initial configuration of the Streamlit app here and then keep the main sections of the code within the main() function. When configuring the Streamlit app, we can set the title that will be displayed in the browser tab, the layout of the app, and whether we would like the sidebar to be expanded when the app is launched. We can also include our custom icon; otherwise the app will use Streamlit's default icon.

In the main function, we can see how easy it is to upload a file from a local machine. If a new file is selected, we use the save_file() function to store it in our S3 cloud storage.

Then we use the get_files_list() function to retrieve the names of all uploaded files from S3. These are displayed as a radio list, so we can choose one of the files for further processing.

The last function we use is get_file(filename), which extracts the data from the file.

We will also need a helper function that will create an S3 instance that will be used within other functions.

def get_s3():
    s3 = boto3.resource('s3', aws_access_key_id=access_key_id, aws_secret_access_key=secret_access_key)  
    return s3

This will create an S3 resource instance that connects with our secret credentials.

Now let's take a look at the functions that make things happen.

def save_file(fl):
    file_name = fl.name
    s3 = get_s3()
    s3.Bucket(bucket_name).put_object(Key=file_name, Body=fl)
    st.write('Success! File Saved!')

This function receives fl as an argument, which is the file object that we uploaded. As soon as we select a file, the save_file() function is called. It then connects to S3 and saves the file in our project_bucket.
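
One thing to keep in mind is that Streamlit reruns the script on every interaction, and the uploader keeps the selected file between reruns, so the same file can be sent to S3 more than once. A minimal sketch of one way to avoid that is below. The upload_once() helper and the uploaded_names set are hypothetical names for this illustration. The bucket argument is assumed to be a boto3 Bucket resource like s3.Bucket(bucket_name) above.

```python
def upload_once(bucket, fl, uploaded_names):
    """Upload fl to bucket unless its name was already uploaded this session.

    Returns True if an upload happened, False if it was skipped.
    """
    if fl.name in uploaded_names:
        return False
    # Same call as in save_file(): the key is the original file name
    bucket.put_object(Key=fl.name, Body=fl)
    uploaded_names.add(fl.name)
    return True


# Possible usage inside the Streamlit app (assumed wiring, not run here):
#   if 'uploaded_names' not in st.session_state:
#       st.session_state['uploaded_names'] = set()
#   upload_once(get_s3().Bucket(bucket_name), fl, st.session_state['uploaded_names'])
```

Tracking by name only is a simplification. A new file with the same name would be skipped until the session ends.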

Next, we want to read all the files in our S3 bucket and list them as radio options for us to select from.

def get_files_list():
    s3 = get_s3()
    file_names = []
    bucket = s3.Bucket(bucket_name)
    for s3_obj in bucket.objects.all():
        file_names.append(s3_obj.key)
    s3_filename = st.sidebar.radio('Select a file', file_names, index=0)
    return s3_filename

We connect to S3 again, then create a bucket instance to access the bucket we need. Once we have our bucket instance, we can iterate through all the objects stored in the bucket. s3_obj.key is the name (key) of the object and will be used to either read the object/file or download it, depending on what we are planning to do with it.

In our case, we are just getting the filenames, storing them in a simple list, then using this list to create a radio options widget in Streamlit. This radio list is then displayed in our app with the first filename as the default choice. The reason we are using a radio list is that we only want to choose one file for further processing. Streamlit provides other widgets as well, such as checkboxes or a multiselect, if we need to work with multiple options or files.
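
If the bucket also holds files other than pdfs, the listing step could be narrowed down. The sketch below is one possible variation, not part of the app above. list_pdf_keys() is a hypothetical helper, and the bucket argument is assumed to be a boto3 Bucket resource, whose objects collection offers both all() and filter(Prefix=...).

```python
def list_pdf_keys(bucket, prefix=None):
    """Return the sorted keys of all pdf objects in the bucket.

    If prefix is given, only keys starting with it are considered.
    """
    if prefix:
        objects = bucket.objects.filter(Prefix=prefix)
    else:
        objects = bucket.objects.all()
    # Compare in lower case so that 'REPORT.PDF' is matched as well
    return sorted(obj.key for obj in objects if obj.key.lower().endswith('.pdf'))
```

The returned list can be passed to the radio widget in the same way as file_names.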

With these two simple functions already, we can see how easy it is to upload and save files in our S3 cloud storage and then read the files and filenames. If you are interested in knowing how to download files, please read my previous post on getting started with Boto3.

However, for this App, we are not interested in downloading the files from S3. What we really want to do is read the files so we can apply further processing. That takes us to our last function - get_file(filename).

I actually had some difficulty getting this part to work on the first try. I had to try multiple different ways and look up whether anybody else had run into similar issues and how they solved them.

In this function we are going to use the pdfplumber module to read the contents of a pdf file to make sure everything is working as intended. Normally, when we upload a file to our Streamlit app, pdfplumber has no issues reading and extracting the data. But to read pdf files from S3, and perhaps other formats as well, we need to use BytesIO from the io module.

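def get_file(s3_filename):
    s3 = get_s3()
    obj = s3.Object(bucket_name, s3_filename)
    # get() returns a dict; 'Body' is a stream and read() returns its bytes
    fl = obj.get()['Body'].read()
    # Wrap the bytes in BytesIO so pdfplumber receives a file-like object
    with pdfplumber.open(BytesIO(fl)) as pdf:
        page = pdf.pages[0]
        # extract_text() may return nothing for a page without a text layer
        text = (page.extract_text() or '').split('\n')
        for txt in text:
            st.write(txt)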

We create an S3 instance with our secret keys and connect to our cloud storage. Then we access our object using our bucket name and the filename. Using the Object's .get() method we can get the object/file, but this returns a dictionary. To access the actual contents of the file, we need the 'Body' key in the dictionary, which holds a stream. Calling .read() on that stream gives us the raw bytes, which is necessary but not yet enough for pdfplumber.
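
The chain of calls described above can be isolated into a small helper, sketched here. object_to_buffer() is a hypothetical name, and s3_object is assumed to be a boto3 Object resource like the one returned by s3.Object(bucket_name, s3_filename). My understanding is that pdf parsers need to jump around in the file, which a one-way network stream does not allow, and that is why the bytes are wrapped in BytesIO.

```python
from io import BytesIO


def object_to_buffer(s3_object):
    """Read a whole S3 object into memory and return it as a seekable buffer."""
    response = s3_object.get()       # dict of metadata plus the 'Body' stream
    data = response['Body'].read()   # drain the stream into a bytes object
    return BytesIO(data)             # file-like wrapper that supports seek()
```

This keeps the entire file in memory, which is fine for typical report pdfs but worth reconsidering for very large objects.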

To read the contents of the pdf file we use pdfplumber. However, we need to wrap the bytes in BytesIO(fl) so that pdfplumber receives a file-like object. After that, it is very simple to extract the text from the first page of the pdf file and display it line by line.
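
get_file() only looks at the first page. If the whole document is needed, the same idea extends to every page. The sketch below assumes only what the code above already uses: an opened pdf with a pages list, where each page has an extract_text() method. extract_all_lines() is a hypothetical helper name.

```python
def extract_all_lines(pdf):
    """Collect the text lines of every page of an opened pdf, in page order."""
    lines = []
    for page in pdf.pages:
        # Pages without extractable text are skipped instead of raising
        text = page.extract_text() or ''
        if text:
            lines.extend(text.split('\n'))
    return lines


# Possible usage (assumed wiring, not run here):
#   with pdfplumber.open(BytesIO(fl)) as pdf:
#       for line in extract_all_lines(pdf):
#           st.write(line)
```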

Now that uploading and reading S3 files from our account works within Streamlit, all we need to do is pass the files/objects to other functions of our Streamlit app to process the data we have. This is how we can integrate Boto3 into our Streamlit apps.

14 comments

@themarkymark 81

over 3 years ago

These posts are great for STEMGeeks, if you use the tag #technology it will fall under STEMGeeks as well and you will earn STEM tokens in addition to your other tokens.

@geekgirl 80

over 3 years ago

Thank you for the reminder. I keep forgetting.

@steemit-fairy 51

over 3 years ago

Good to see another great blog from you. I need to learn more about Python because it's getting popular as a programming language. Thanks for sharing, ma'am @geekgirl

@jfang003 77

over 3 years ago

This is nice to know but I wonder if you are just trying to collect text data. Would you really need to dump contents into a pdf file rather than just making a text file?

@geekgirl 80

over 3 years ago

In this example the original files are pdfs that contain a lot of data. The text is extracted so that more work can be done with the data to create more useful reports.
