Introduction to Amazon Textract

Amazon Textract is an AWS machine learning service that automatically extracts text, handwriting, and data from scanned documents. Unlike traditional OCR (Optical Character Recognition) solutions, Textract goes beyond simple text extraction to identify the contents of form fields and the information stored in tables. This makes it a useful tool for businesses looking to digitize and automate their document processing workflows.

Practical Use Case

Imagine a company receiving thousands of invoices and purchase orders in PDF format every month. Manually processing these documents to extract relevant data, such as dates, amounts, and line items, can be time-consuming and error-prone. By using Amazon Textract, the company can automate this process, significantly reducing manual effort and improving accuracy. This guide will walk you through the steps to implement table data extraction using Amazon Textract in a Node.js application.

Detailed Implementation Steps

Setting Up Your AWS Account

Before using Amazon Textract, you need to set up an AWS account if you don’t already have one. Here’s a quick guide to get you started:

  1. Go to the AWS Management Console.
  2. Click on “Create a new AWS account.”
  3. Follow the on-screen instructions to complete the registration process.
  4. Once your account is set up, sign in to the AWS Management Console.

AWS Configuration

After setting up your AWS account, you need to configure your AWS credentials. You can do this by installing the AWS CLI and running the aws configure command:

$ aws configure

You’ll be prompted to enter your AWS Access Key ID, Secret Access Key, region, and output format. Your Node.js application will use these credentials to interact with AWS services.
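One detail worth knowing here: version 2 of the AWS SDK for JavaScript picks up credentials from the shared credentials file written by aws configure, but it reads the region from the shared config file only when the AWS_SDK_LOAD_CONFIG environment variable is set. A minimal sketch of choosing the region explicitly is shown below. The resolveRegion helper and the us-west-2 fallback are illustrative assumptions, not part of the SDK.

```javascript
// resolveRegion is a hypothetical helper, not an SDK function.
// It prefers AWS_REGION, then AWS_DEFAULT_REGION (the variable the AWS CLI uses),
// and finally the fallback supplied by the caller.
function resolveRegion(env = process.env, fallback = 'us-west-2') {
    return env.AWS_REGION || env.AWS_DEFAULT_REGION || fallback;
}

console.log('Using region:', resolveRegion());
```

The result could then be passed to the client constructor, for example new AWS.Textract({ region: resolveRegion() }).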

Command Line Application Development

We’ll build a command line application in Node.js that uses Amazon Textract to extract table data from documents. Let’s start by creating a new Node.js project:

$ mkdir textract-app

$ cd textract-app

$ npm init -y

Installing Required Packages

Next, install the required package:

$ npm install aws-sdk

The aws-sdk package (version 2 of the AWS SDK for JavaScript) allows your Node.js application to interact with AWS services. The fs module used to read files is built into Node.js, so it does not need to be installed.

Document Scanning Process

Now, create a script to scan a document and extract table data using Amazon Textract. Create a new file called index.js:

const AWS = require('aws-sdk');
const fs = require('fs');

// Create the Textract client
const textract = new AWS.Textract({ region: 'us-west-2' });

// Extract table data from a local document
const extractTableData = async (filePath) => {
    // Read the document into a Buffer. The SDK encodes the bytes for
    // transport itself, so no manual base64 conversion is needed.
    const document = fs.readFileSync(filePath);

    // Ask Textract to analyze tables only
    const params = {
        Document: {
            Bytes: document
        },
        FeatureTypes: ['TABLES']
    };

    // Call Textract and print the raw response
    try {
        const response = await textract.analyzeDocument(params).promise();
        console.log('Textract Response:', JSON.stringify(response, null, 2));
    } catch (error) {
        console.error('Error:', error);
    }
};

// Path to your document
const documentPath = './path/to/your/document.pdf';

// Extract table data from the document
extractTableData(documentPath);

Modifying Your index.js File

To make our application more flexible, we can modify the index.js file to accept the document path as a command-line argument:

const AWS = require('aws-sdk');
const fs = require('fs');

// Create the Textract client
const textract = new AWS.Textract({ region: 'us-west-2' });

// Extract table data from a local document
const extractTableData = async (filePath) => {
    // Read the document into a Buffer
    const document = fs.readFileSync(filePath);

    // Ask Textract to analyze tables only
    const params = {
        Document: {
            Bytes: document
        },
        FeatureTypes: ['TABLES']
    };

    // Call Textract and print the raw response
    try {
        const response = await textract.analyzeDocument(params).promise();
        console.log('Textract Response:', JSON.stringify(response, null, 2));
    } catch (error) {
        console.error('Error:', error);
    }
};

// Get the document path from the command-line arguments
const documentPath = process.argv[2];

// Stop with a usage message if no path was given
if (!documentPath) {
    console.error('Usage: node index.js <path-to-document>');
    process.exit(1);
}

// Extract table data from the document
extractTableData(documentPath);

Now, you can run your application from the command line and pass the document path as an argument:

$ node index.js ./path/to/your/document.pdf
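Before sending a file, it can help to check that Textract is likely to accept it. As far as the AWS documentation describes, the synchronous analyzeDocument operation takes JPEG, PNG, PDF, and TIFF input, with a size limit of 10 MB and a single page for PDF and TIFF files. The sketch below checks the extension and size only. The checkDocument helper and both constants are assumptions to verify against the current Textract quotas.

```javascript
const path = require('path');

// Assumed limits for synchronous Textract calls; verify against current AWS quotas.
const SUPPORTED_EXTENSIONS = ['.png', '.jpg', '.jpeg', '.pdf', '.tif', '.tiff'];
const MAX_SYNC_BYTES = 10 * 1024 * 1024;

// checkDocument is a hypothetical helper. It returns a list of problems;
// an empty list means the file passes both checks.
function checkDocument(filePath, sizeInBytes) {
    const problems = [];
    const extension = path.extname(filePath).toLowerCase();
    if (!SUPPORTED_EXTENSIONS.includes(extension)) {
        problems.push(`Unsupported file type: ${extension || '(none)'}`);
    }
    if (sizeInBytes > MAX_SYNC_BYTES) {
        problems.push(`File is ${sizeInBytes} bytes; the assumed limit is ${MAX_SYNC_BYTES}`);
    }
    return problems;
}

console.log(checkDocument('./invoice.docx', 2048));
```

In index.js, the size argument could come from fs.statSync(documentPath).size. This check does not count pages, so a multi-page PDF would still pass it.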

Testing and Validation

To test your implementation, use a sample document containing tables. Run the application and verify that the extracted table data is correct. Then, process the JSON response to extract specific data fields and integrate them into your workflow.
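The response is a flat list of blocks rather than ready-made rows. A TABLE block lists its CELL blocks as children, and each CELL block lists the WORD blocks it contains. A minimal sketch of turning that structure into rows of text is shown below. The tablesFromBlocks helper and the sample data are hypothetical, and merged cells and selection elements such as checkboxes are not handled.

```javascript
// tablesFromBlocks is a hypothetical helper that turns the Blocks array of an
// analyzeDocument response into one 2D array of cell text per table.
function tablesFromBlocks(blocks) {
    // Index every block by Id so relationships can be followed quickly
    const byId = new Map(blocks.map((block) => [block.Id, block]));

    // Resolve the blocks referenced by a block's CHILD relationships
    const childrenOf = (block) =>
        (block.Relationships || [])
            .filter((rel) => rel.Type === 'CHILD')
            .flatMap((rel) => rel.Ids)
            .map((id) => byId.get(id))
            .filter(Boolean);

    // Join the WORD children of a cell into a single string
    const cellText = (cell) =>
        childrenOf(cell)
            .filter((child) => child.BlockType === 'WORD')
            .map((word) => word.Text)
            .join(' ');

    return blocks
        .filter((block) => block.BlockType === 'TABLE')
        .map((table) => {
            const rows = [];
            for (const cell of childrenOf(table)) {
                if (cell.BlockType !== 'CELL') continue;
                // RowIndex and ColumnIndex are 1-based in Textract responses
                const r = cell.RowIndex - 1;
                const c = cell.ColumnIndex - 1;
                rows[r] = rows[r] || [];
                rows[r][c] = cellText(cell);
            }
            return rows;
        });
}

// Hand-written sample in the shape of a Textract response (not real output)
const sampleBlocks = [
    { Id: 't1', BlockType: 'TABLE', Relationships: [{ Type: 'CHILD', Ids: ['c1', 'c2', 'c3', 'c4'] }] },
    { Id: 'c1', BlockType: 'CELL', RowIndex: 1, ColumnIndex: 1, Relationships: [{ Type: 'CHILD', Ids: ['w1'] }] },
    { Id: 'c2', BlockType: 'CELL', RowIndex: 1, ColumnIndex: 2, Relationships: [{ Type: 'CHILD', Ids: ['w2'] }] },
    { Id: 'c3', BlockType: 'CELL', RowIndex: 2, ColumnIndex: 1, Relationships: [{ Type: 'CHILD', Ids: ['w3', 'w4'] }] },
    { Id: 'c4', BlockType: 'CELL', RowIndex: 2, ColumnIndex: 2, Relationships: [{ Type: 'CHILD', Ids: ['w5'] }] },
    { Id: 'w1', BlockType: 'WORD', Text: 'Item' },
    { Id: 'w2', BlockType: 'WORD', Text: 'Amount' },
    { Id: 'w3', BlockType: 'WORD', Text: 'Office' },
    { Id: 'w4', BlockType: 'WORD', Text: 'chair' },
    { Id: 'w5', BlockType: 'WORD', Text: '120.00' }
];

console.log(JSON.stringify(tablesFromBlocks(sampleBlocks)));
// prints [[["Item","Amount"],["Office chair","120.00"]]]
```

In index.js, the same function could be called as tablesFromBlocks(response.Blocks) in place of printing the raw response.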

Conclusion

By following this guide, you’ve learned how to set up access to Amazon Textract, configure a Node.js application, and extract table data from documents. This implementation can be expanded to handle different document formats, integrate with databases, and automate other parts of document processing.
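As one example of such an expansion, multi-page PDFs go through the asynchronous startDocumentAnalysis and getDocumentAnalysis operations, which read the document from Amazon S3 rather than from request bytes. The sketch below takes the Textract client as a parameter and polls until the job finishes. The analyzeFromS3 helper, the polling interval, and the choice to treat every status other than SUCCEEDED as a failure are assumptions made for illustration.

```javascript
// analyzeFromS3 is a hypothetical helper. It starts an asynchronous analysis job
// for a document already uploaded to S3, polls until the job finishes, and
// returns the blocks from every page of results.
async function analyzeFromS3(textract, bucket, key, pollMs = 5000) {
    const { JobId } = await textract.startDocumentAnalysis({
        DocumentLocation: { S3Object: { Bucket: bucket, Name: key } },
        FeatureTypes: ['TABLES']
    }).promise();

    const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
    const blocks = [];
    let nextToken;

    for (;;) {
        // Only send NextToken once a previous page has returned one
        const params = nextToken ? { JobId, NextToken: nextToken } : { JobId };
        const page = await textract.getDocumentAnalysis(params).promise();

        if (page.JobStatus === 'IN_PROGRESS') {
            await sleep(pollMs);
            continue;
        }
        // This sketch treats anything other than SUCCEEDED as an error
        if (page.JobStatus !== 'SUCCEEDED') {
            throw new Error(`Textract job ended with status ${page.JobStatus}`);
        }

        blocks.push(...(page.Blocks || []));
        if (!page.NextToken) return blocks;
        nextToken = page.NextToken;
    }
}

// Example call with placeholder bucket and key names:
// analyzeFromS3(new AWS.Textract({ region: 'us-west-2' }), 'my-bucket', 'invoices/scan.pdf')
//     .then((blocks) => console.log(blocks.length, 'blocks'));
```

Polling keeps the example short. For production workloads, Textract can also publish job completion to an Amazon SNS topic, which avoids repeated status calls.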
