Introduction

In today’s data-driven world, extracting text from various document formats is essential for businesses to analyze and utilize information effectively. Amazon Textract is a service that automatically extracts text and data from scanned documents such as PDFs, TIFF files, and images. By setting up an automated pipeline, we can streamline the text extraction process, save the output to Amazon S3, and store metadata in Amazon DynamoDB for easy retrieval and management.

In this blog post, we’ll walk through the steps to set up this automated pipeline using Amazon Textract, AWS Lambda, S3, and DynamoDB.

Prerequisites

Before we dive into the setup, ensure you have the following:

  1. AWS Account: An active AWS account with appropriate permissions.
  2. AWS CLI: Installed and configured on your local machine.
  3. IAM Role: With permissions for Textract, S3, and DynamoDB.

Step 1: Create an S3 Bucket

First, we need an S3 bucket to store the input files and the extracted text. Follow these steps in the console (or use the boto3 sketch after the list):

  1. Open the S3 console in the AWS Management Console.
  2. Click on “Create bucket.”
  3. Enter a unique bucket name and choose the appropriate region.
  4. Configure options as per your requirements and click “Create bucket.”
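
If you prefer to script this step, here is a minimal boto3 sketch. The bucket name and region are placeholders; note that buckets outside us-east-1 require a CreateBucketConfiguration.

import boto3

# Placeholder values; replace with your own bucket name and region
BUCKET_NAME = 'my-textract-input-bucket'
REGION = 'us-east-1'

s3 = boto3.client('s3', region_name=REGION)

# Buckets in us-east-1 are created without a LocationConstraint;
# every other region requires one.
if REGION == 'us-east-1':
    s3.create_bucket(Bucket=BUCKET_NAME)
else:
    s3.create_bucket(
        Bucket=BUCKET_NAME,
        CreateBucketConfiguration={'LocationConstraint': REGION}
    )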

Step 2: Set Up DynamoDB Table

Next, we’ll create a DynamoDB table to store the metadata of the processed documents (a boto3 equivalent follows the list).

  1. Open the DynamoDB console.
  2. Click on “Create table.”
  3. Enter a table name (e.g., DocumentMetadata) and set the primary key (e.g., DocumentID).
  4. Configure the table settings and click “Create.”
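
The same table can be created with boto3. This sketch assumes DocumentID is a string partition key and uses on-demand (pay-per-request) billing:

import boto3

dynamodb = boto3.client('dynamodb')

dynamodb.create_table(
    TableName='DocumentMetadata',
    AttributeDefinitions=[
        {'AttributeName': 'DocumentID', 'AttributeType': 'S'}
    ],
    KeySchema=[
        {'AttributeName': 'DocumentID', 'KeyType': 'HASH'}
    ],
    BillingMode='PAY_PER_REQUEST'
)

# Wait until the table is active before writing to it
dynamodb.get_waiter('table_exists').wait(TableName='DocumentMetadata')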

Step 3: Create an IAM Role

Create an IAM role that the Lambda function will assume. It needs permissions to call Textract, read the input S3 bucket, write the extracted text back to S3, and write items to the DynamoDB table (see the boto3 sketch after the list).

  1. Open the IAM console.
  2. Click on “Roles” and then “Create role.”
  3. Select “Lambda” as the service that will use this role, since the role is attached to the Lambda function in Step 5.
  4. Attach policies granting Textract, S3, and DynamoDB access (plus the AWSLambdaBasicExecutionRole managed policy so the function can write CloudWatch Logs).
  5. Complete the role creation process.
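
For reference, here is a boto3 sketch of the same role: a trust policy that lets Lambda assume it, plus AWS managed policies for Textract, S3, DynamoDB, and CloudWatch Logs. The role name is a placeholder, and the managed policies shown are broader than necessary; in production, scope them down to your specific bucket and table.

import json
import boto3

iam = boto3.client('iam')

# Trust policy so the Lambda service can assume this role
trust_policy = {
    'Version': '2012-10-17',
    'Statement': [{
        'Effect': 'Allow',
        'Principal': {'Service': 'lambda.amazonaws.com'},
        'Action': 'sts:AssumeRole'
    }]
}

iam.create_role(
    RoleName='TextractPipelineLambdaRole',  # placeholder name
    AssumeRolePolicyDocument=json.dumps(trust_policy)
)

# Broad managed policies for brevity; prefer a scoped inline policy in production
for policy_arn in [
    'arn:aws:iam::aws:policy/AmazonTextractFullAccess',
    'arn:aws:iam::aws:policy/AmazonS3FullAccess',
    'arn:aws:iam::aws:policy/AmazonDynamoDBFullAccess',
    'arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole',
]:
    iam.attach_role_policy(
        RoleName='TextractPipelineLambdaRole',
        PolicyArn=policy_arn
    )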

Step 4: Upload Files to S3

Upload the documents (PDFs, TIFFs, and images) you want to process to the S3 bucket created in Step 1.
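
You can upload test documents through the console, the AWS CLI, or boto3 as in this sketch (the bucket name and file path are placeholders; Textract accepts PDF, TIFF, JPEG, and PNG):

import boto3

s3 = boto3.client('s3')

# Placeholder local file and bucket name
s3.upload_file('invoice-001.pdf', 'my-textract-input-bucket', 'invoice-001.pdf')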

Step 5: Set Up Lambda Function for Automation

We will use an AWS Lambda function to automate the text extraction when a new file is uploaded to the S3 bucket (a scripted alternative follows the list).

  1. Open the Lambda console.
  2. Click on “Create function.”
  3. Choose “Author from scratch,” give your function a name, and select a current Python runtime (e.g., Python 3.12).
  4. Under “Permissions,” choose the IAM role created in Step 3.
  5. Click “Create function.”
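
If you prefer to create the function with boto3, the sketch below assumes you have zipped the handler file (lambda_function.py with the code from Step 7) into function.zip, and that the role ARN comes from Step 3; the function name, account ID, and paths are placeholders.

import boto3

lambda_client = boto3.client('lambda')

# function.zip must contain lambda_function.py with the handler from Step 7
with open('function.zip', 'rb') as f:
    zipped_code = f.read()

lambda_client.create_function(
    FunctionName='TextractPipelineFunction',  # placeholder name
    Runtime='python3.12',
    Role='arn:aws:iam::123456789012:role/TextractPipelineLambdaRole',  # role from Step 3
    Handler='lambda_function.lambda_handler',
    Code={'ZipFile': zipped_code},
    Timeout=120,      # Textract calls can exceed the 3-second default
    MemorySize=256
)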

Step 6: Configure S3 Event Trigger

Configure the S3 bucket to trigger the Lambda function whenever a new object is created (see the boto3 sketch after the list).

  1. In the S3 console, navigate to the bucket.
  2. Go to the “Properties” tab and find the “Event notifications” section.
  3. Click “Create event notification” and configure it to trigger on “All object create events.”
  4. Select the Lambda function created in Step 5.
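
Doing the same outside the console takes two calls: grant S3 permission to invoke the function (the console adds this automatically), then attach the notification configuration. The bucket name, account ID, and ARNs below are placeholders.

import boto3

lambda_client = boto3.client('lambda')
s3 = boto3.client('s3')

BUCKET = 'my-textract-input-bucket'  # placeholder
FUNCTION_ARN = 'arn:aws:lambda:us-east-1:123456789012:function:TextractPipelineFunction'

# Allow S3 (from this bucket only) to invoke the Lambda function
lambda_client.add_permission(
    FunctionName='TextractPipelineFunction',
    StatementId='s3-invoke-permission',
    Action='lambda:InvokeFunction',
    Principal='s3.amazonaws.com',
    SourceArn=f'arn:aws:s3:::{BUCKET}'
)

# Trigger the function on all object-create events
s3.put_bucket_notification_configuration(
    Bucket=BUCKET,
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': FUNCTION_ARN,
            'Events': ['s3:ObjectCreated:*']
        }]
    }
)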

Step 7: Write Lambda Function Code

Add the following code to your Lambda function to handle the text extraction and save the output and metadata (replace your-output-bucket with the name of your output bucket).

import json
import os
import urllib.parse

import boto3

s3 = boto3.client('s3')
textract = boto3.client('textract')
dynamodb = boto3.client('dynamodb')

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    # Object keys arrive URL-encoded in S3 event notifications
    document = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])

    # Call Textract to extract text, tables, and form data.
    # Note: the synchronous AnalyzeDocument API handles images and single-page
    # PDFs/TIFFs; multi-page documents require the asynchronous
    # StartDocumentAnalysis/GetDocumentAnalysis APIs.
    response = textract.analyze_document(
        Document={'S3Object': {'Bucket': bucket, 'Name': document}},
        FeatureTypes=['TABLES', 'FORMS']
    )

    # Collect the line-level text from the Textract response
    text = ''
    for item in response['Blocks']:
        if item['BlockType'] == 'LINE':
            text += item['Text'] + '\n'

    # Save the extracted text to S3 (use a separate output bucket, or an event
    # filter, so the output object does not re-trigger this function)
    output_bucket = 'your-output-bucket'
    output_key = f'extracted/{os.path.splitext(document)[0]}.txt'
    s3.put_object(Bucket=output_bucket, Key=output_key, Body=text)

    # Save metadata to DynamoDB
    dynamodb.put_item(
        TableName='DocumentMetadata',
        Item={
            'DocumentID': {'S': document},
            'S3Bucket': {'S': bucket},
            'S3Key': {'S': document},
            'OutputS3Bucket': {'S': output_bucket},
            'OutputS3Key': {'S': output_key}
        }
    )

    return {
        'statusCode': 200,
        'body': json.dumps('Text extraction completed successfully.')
    }
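
To see what the handler receives, here is a trimmed-down S3 event you can paste in as a test event in the Lambda console or use to invoke the handler manually; the bucket and key are placeholders, and a real notification carries many more fields than the handler reads.

# Minimal test event mimicking an S3 object-created notification
test_event = {
    'Records': [{
        's3': {
            'bucket': {'name': 'my-textract-input-bucket'},  # placeholder
            'object': {'key': 'invoice-001.pdf'}             # placeholder
        }
    }]
}

# With valid AWS credentials and the resources above in place, the handler
# can be exercised locally:
# from lambda_function import lambda_handler
# lambda_handler(test_event, None)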

Step 8: Test the Setup

Upload a new document to the S3 bucket and verify that the Lambda function processes the file. Check the output S3 bucket for the extracted text file and the DynamoDB table for the metadata.
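
A quick boto3 check of both outputs might look like the sketch below; the bucket names, object key, and table name match the placeholders used in the earlier steps.

import boto3

s3 = boto3.client('s3')
dynamodb = boto3.client('dynamodb')

# Confirm the extracted text landed in the output bucket
listing = s3.list_objects_v2(Bucket='your-output-bucket', Prefix='extracted/')
for obj in listing.get('Contents', []):
    print(obj['Key'])

# Confirm the metadata item was written (DocumentID is the original object key)
item = dynamodb.get_item(
    TableName='DocumentMetadata',
    Key={'DocumentID': {'S': 'invoice-001.pdf'}}
)
print(item.get('Item'))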

Conclusion

By setting up this automated pipeline, you can efficiently extract text from PDFs, TIFFs, and images using Amazon Textract, store the extracted text in S3, and manage metadata in DynamoDB. This solution enhances document processing workflows, saving time and improving data accessibility.