In today's data-driven world, an automated data pipeline that runs without human intervention is crucial for processing documents and extracting valuable information from them efficiently. AWS offers a powerful combination of services for building such a pipeline: one that ingests multi-page PDF documents from an S3 bucket and processes them with Amazon Textract, AWS Lambda, and AWS Step Functions. This blog post will guide you through setting up this pipeline step by step.

Prerequisites

Before we begin, ensure you have the following:

  • An AWS account
  • Basic knowledge of AWS services (S3, Lambda, Step Functions, Textract)
  • AWS CLI installed and configured

Step 1: Set Up the S3 Bucket

First, create an S3 bucket to store your PDF documents. This bucket will act as the source from which our pipeline will ingest documents.

  1. Go to the S3 console.
  2. Click on “Create bucket.”
  3. Enter a unique bucket name and choose a region.
  4. Configure the bucket settings as per your requirements.
  5. Click “Create bucket.”

Upload your multi-page PDF documents to this S3 bucket.
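
You can upload from the console, or script it with a minimal boto3 sketch like the one below (the bucket name and file paths are placeholders for your own values):

import boto3

s3 = boto3.client('s3')

# Upload a local multi-page PDF to the ingestion bucket.
s3.upload_file(
    'documents/sample-report.pdf',    # placeholder local path
    'my-textract-pipeline-bucket',    # placeholder bucket name
    'incoming/sample-report.pdf'      # placeholder object key
)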

Step 2: Create an IAM Role for Lambda

Create an IAM role that grants the Lambda function the permissions it needs to access S3 and Textract. (If you prefer to script this step, see the sketch after the list.)

  1. Go to the IAM console.
  2. Click on “Roles” and then “Create role.”
  3. Select “Lambda” as the trusted entity and click “Next.”
  4. Attach the following policies: AmazonS3ReadOnlyAccess, AmazonTextractFullAccess, and AWSLambdaBasicExecutionRole.
  5. Review and create the role. Note the role ARN for later use.
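
The same role can be created with boto3 if you are scripting the setup; the role name below is a placeholder, and the policy ARNs are the AWS managed policies listed above:

import json

import boto3

iam = boto3.client('iam')

# Trust policy that allows the Lambda service to assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}

role = iam.create_role(
    RoleName='textract-pipeline-lambda-role',  # placeholder name
    AssumeRolePolicyDocument=json.dumps(trust_policy)
)

# Attach the three managed policies named above.
for policy_arn in [
    'arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess',
    'arn:aws:iam::aws:policy/AmazonTextractFullAccess',
    'arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole',
]:
    iam.attach_role_policy(
        RoleName='textract-pipeline-lambda-role',
        PolicyArn=policy_arn
    )

print(role['Role']['Arn'])  # note this ARN for later steps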

Step 3: Create a Lambda Function

Create a Lambda function that will be triggered when a PDF is uploaded to the S3 bucket. This function will start the Textract job to extract data from the PDF.

  1. Go to the Lambda console.
  2. Click on “Create function.”
  3. Choose “Author from scratch.”
  4. Enter a function name and select Python as the runtime.
  5. Choose the IAM role created in the previous step.
  6. Click “Create function.”

In the function code, add the following to start an asynchronous Textract text-detection job (multi-page PDFs require the asynchronous API):

import json

import boto3
from urllib.parse import unquote_plus

textract = boto3.client('textract')

def lambda_handler(event, context):
    # Pull the bucket and object key out of the S3 event notification.
    # Keys arrive URL-encoded, so decode them before calling Textract.
    bucket = event['Records'][0]['s3']['bucket']['name']
    document = unquote_plus(event['Records'][0]['s3']['object']['key'])

    # Start an asynchronous text-detection job for the uploaded PDF.
    response = textract.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': bucket,
                'Name': document
            }
        }
    )

    # Return the JobId so the rest of the pipeline can fetch the results.
    return {
        'statusCode': 200,
        'body': json.dumps({
            'message': 'Textract job started successfully!',
            'job_id': response['JobId']
        })
    }
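
Note that start_document_text_detection returns a JobId, and the processing Lambda in Step 6 expects to receive it as event['job_id']. The steps below do not show this wiring explicitly, so here is one hedged way to do it: once the state machine from Step 5 exists, start an execution from this same handler, passing the JobId as input (the state machine ARN is a placeholder):

import json

import boto3

stepfunctions = boto3.client('stepfunctions')

def start_workflow(job_id):
    # Hand the Textract JobId to the Step Function as execution input;
    # the key name matches event['job_id'] in the processing Lambda.
    stepfunctions.start_execution(
        stateMachineArn='arn:aws:states:region:account-id:stateMachine:textract-pipeline',  # placeholder ARN
        input=json.dumps({'job_id': job_id})
    )

With this in place, call start_workflow(response['JobId']) just before the return statement.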

Step 4: Set Up S3 Event Notifications

Configure your S3 bucket to trigger the Lambda function whenever a new PDF document is uploaded.

  1. Go to the S3 console and select your bucket.
  2. Navigate to the “Properties” tab.
  3. Scroll to “Event notifications” and click “Create event notification.”
  4. Enter a name for the notification, select “All object create events,” add a suffix filter of .pdf so only PDF uploads trigger the pipeline, and specify the Lambda function created earlier (a programmatic alternative is sketched after this list).
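
If you would rather configure the notification programmatically, a boto3 sketch might look like this (bucket name and function ARN are placeholders). Note that, unlike the console, this call does not automatically grant S3 permission to invoke the function:

import boto3

s3 = boto3.client('s3')

# Trigger the Lambda only when objects ending in .pdf are created.
s3.put_bucket_notification_configuration(
    Bucket='my-textract-pipeline-bucket',  # placeholder bucket name
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:region:account-id:function:start-textract-job',  # placeholder ARN
            'Events': ['s3:ObjectCreated:*'],
            'Filter': {
                'Key': {
                    'FilterRules': [
                        {'Name': 'suffix', 'Value': '.pdf'}
                    ]
                }
            }
        }]
    }
)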

Step 5: Create a Step Function

Create a Step Functions state machine to manage the workflow of checking the Textract job and processing its results.

  1. Go to the Step Functions console.
  2. Click on “Create state machine.”
  3. Choose “Author with code snippets” and select “Next.”
  4. Define your state machine in Amazon States Language (JSON). Here is a basic example:

{
  "Comment": "A simple AWS Step Functions state machine that processes Textract results",
  "StartAt": "CheckTextractJob",
  "States": {
    "CheckTextractJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "arn:aws:lambda:region:account-id:function:function-name",
        "Payload.$": "$"
      },
      "End": true
    }
  }
}

Replace arn:aws:lambda:region:account-id:function:function-name with the ARN of the Lambda function that checks the Textract job status and processes the results (you will create it in the next step).
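
If you are scripting the setup, the same definition can be registered with boto3; the state machine name and role ARN below are placeholders, and the role must allow Step Functions to invoke your Lambda function:

import boto3

sfn = boto3.client('stepfunctions')

# Read the definition shown above from a local file.
with open('state_machine.json') as f:
    definition = f.read()

sfn.create_state_machine(
    name='textract-pipeline',  # placeholder name
    definition=definition,
    roleArn='arn:aws:iam::account-id:role/StepFunctionsLambdaRole'  # placeholder ARN
)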

Step 6: Integrate Step Function with Lambda

Create (or modify) the Lambda function that the state machine invokes so that it checks the Textract job status and processes the extracted text:

import json

import boto3

textract = boto3.client('textract')

def lambda_handler(event, context):
    # The state machine passes its input straight through as the payload,
    # so the JobId supplied when the execution started arrives here.
    job_id = event['job_id']

    response = textract.get_document_text_detection(JobId=job_id)

    if response['JobStatus'] == 'SUCCEEDED':
        # Results for multi-page documents are paginated, so follow
        # NextToken until all blocks have been retrieved.
        blocks = response['Blocks']
        next_token = response.get('NextToken')
        while next_token:
            response = textract.get_document_text_detection(
                JobId=job_id, NextToken=next_token)
            blocks.extend(response['Blocks'])
            next_token = response.get('NextToken')

        # Process the extracted text, one LINE block per line of the PDF.
        for block in blocks:
            if block['BlockType'] == 'LINE':
                print(block['Text'])

        return {
            'statusCode': 200,
            'body': json.dumps('Textract job processed successfully!')
        }
    else:
        # The job may still be running, or it may have failed; surface
        # the actual status rather than assuming it is in progress.
        return {
            'statusCode': 200,
            'body': json.dumps('Textract job status: ' + response['JobStatus'])
        }
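
Printing the lines is enough for a walkthrough, but in a real pipeline you would persist the output somewhere. Here is a minimal sketch, assuming you write the extracted lines back to the same bucket under an output/ prefix (a naming choice for illustration, not something the steps above prescribe):

import boto3

s3 = boto3.client('s3')

def save_results(lines, bucket, document_key):
    # Store the extracted text next to the source document,
    # one line of the PDF per line of the output file.
    s3.put_object(
        Bucket=bucket,
        Key='output/' + document_key + '.txt',
        Body='\n'.join(lines).encode('utf-8')
    )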

Step 7: Test the Data Pipeline

Upload a multi-page PDF document to your S3 bucket and monitor the execution of the Lambda function and Step Function. Check the Textract results and ensure the extracted data is processed as expected.
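
To confirm the workflow ran, you can also list recent executions of the state machine with boto3 (the state machine ARN is a placeholder):

import boto3

sfn = boto3.client('stepfunctions')

# Show the status of the five most recent executions.
executions = sfn.list_executions(
    stateMachineArn='arn:aws:states:region:account-id:stateMachine:textract-pipeline',  # placeholder ARN
    maxResults=5
)

for execution in executions['executions']:
    print(execution['name'], execution['status'])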

Conclusion

By following these steps, you have set up an automated data pipeline that ingests multi-page PDF documents from an S3 bucket and processes them using Textract, Lambda, and Step Functions. Depending on your specific use case, this solution can be further extended with more complex processing, data storage, and analysis.