Introduction

Safeguarding Protected Health Information (PHI) is a core obligation for any organization that handles patient data. Under regulations like HIPAA, organizations must keep PHI secure and, in many cases, de-identify it before sharing. AWS Comprehend can detect PHI in free text so that it can be de-identified, helping organizations stay compliant while still putting valuable data to use. This blog post walks through using AWS Comprehend to detect and de-identify PHI in text.

What is AWS Comprehend?

AWS Comprehend (officially Amazon Comprehend) is a natural language processing (NLP) service that uses machine learning to find insights and relationships in text. It can identify entities, key phrases, language, and sentiment. For healthcare text, AWS offers a companion service, Amazon Comprehend Medical, designed to detect and extract health data, including PHI, from text.

Understanding PHI

Protected Health Information (PHI) refers to any information in a medical record that can be used to identify an individual and that was created, used, or disclosed during the provision of a health care service. Under HIPAA's Safe Harbor method, this covers 18 types of identifiers:

  • Names: Full names, initials, or any part of the name.
  • Geographic data: Addresses smaller than a state, including street address, city, county, precinct, ZIP code, and equivalent geocodes.
  • Dates: All elements of dates (except the year) directly related to an individual, such as birthdates, admission and discharge dates, and date of death, plus all ages over 89.
  • Phone numbers: Contact numbers, including mobile, home, and work numbers.
  • Fax numbers
  • Email addresses
  • Social Security numbers
  • Medical record numbers
  • Health plan beneficiary numbers
  • Account numbers
  • Certificate/license numbers
  • Vehicle identifiers and serial numbers, including license plate numbers.
  • Device identifiers and serial numbers
  • Web URLs
  • IP addresses
  • Biometric identifiers: Including finger and voice prints.
  • Full-face photos and comparable images
  • Any other unique identifying number, characteristic, or code

Why De-identify PHI?

De-identification of PHI is crucial for several reasons:

  1. Compliance: Adhering to regulations like HIPAA to avoid legal repercussions.
  2. Privacy: Protecting patient information from unauthorized access.
  3. Data Utilization: Safely sharing data for research and analytics without compromising privacy.

Steps to Detect and De-identify PHI Using AWS Comprehend

Step 1: Setting Up AWS Comprehend

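  1. Create an AWS Account: If you don’t already have one, sign up at aws.amazon.com.
  2. Access AWS Comprehend: Navigate to the AWS Management Console and open Comprehend Medical. To call the API from code, also configure AWS credentials and a region where the service is available.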

Step 2: Detecting PHI

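  1. Prepare Your Text Data: Short documents can be passed directly in the API call. For large batches, upload the files to Amazon S3 and run an asynchronous job instead.
  2. Use Comprehend Medical: Use the DetectEntitiesV2 API to identify entities in your text. It returns PHI such as names and dates (under the PROTECTED_HEALTH_INFORMATION category) alongside clinical entities such as medical conditions and medications. If you only need PHI, the DetectPHI API returns just the PHI entities.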

import boto3

# Assumes AWS credentials and a default region are already configured.
client = boto3.client('comprehendmedical')

response = client.detect_entities_v2(Text="Patient John Doe was diagnosed with diabetes on 2020-01-01 and prescribed Metformin.")
entities = response['Entities']

# Print every detected entity with its category and type.
for entity in entities:
    print(f"Entity: {entity['Text']}, Category: {entity['Category']}, Type: {entity['Type']}")
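Because DetectEntitiesV2 returns PHI mixed in with clinical entities, it can help to separate the two before redacting. The following is a minimal sketch, not an official pattern: the helper name select_phi, the 0.5 threshold, and the sample data are all illustrative assumptions. The sample list is hand-written to resemble the response shape and is not real service output.

```python
from typing import Any

PHI_CATEGORY = "PROTECTED_HEALTH_INFORMATION"


def select_phi(entities: list[dict[str, Any]], min_score: float = 0.5) -> list[dict[str, Any]]:
    """Keep only PHI entities whose confidence score meets the threshold.

    min_score is a hypothetical default; pick a value that suits your data.
    """
    return [
        e for e in entities
        if e.get("Category") == PHI_CATEGORY and e.get("Score", 0.0) >= min_score
    ]


# Hand-written sample shaped like response['Entities']; not real output.
sample_entities = [
    {"Text": "John Doe", "Category": PHI_CATEGORY, "Type": "NAME",
     "Score": 0.99, "BeginOffset": 8, "EndOffset": 16},
    {"Text": "diabetes", "Category": "MEDICAL_CONDITION", "Type": "DX_NAME",
     "Score": 0.97, "BeginOffset": 36, "EndOffset": 44},
    {"Text": "2020-01-01", "Category": PHI_CATEGORY, "Type": "DATE",
     "Score": 0.30, "BeginOffset": 48, "EndOffset": 58},
]

for e in select_phi(sample_entities):
    print(f"PHI: {e['Text']} ({e['Type']}, score {e['Score']:.2f})")
```

For de-identification, a lower threshold is usually the safer choice: over-redacting costs some data utility, while under-redacting can leak PHI.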

Step 3: De-identifying PHI

  1. Redact Identified PHI: Replace the detected PHI with non-identifiable placeholders or remove it altogether.
  2. Example Code:

def deidentify_text(text, entities):
    deidentified_text = text
    for entity in entities:
        # Redact only PHI; leave clinical entities such as conditions and medications intact.
        if entity['Category'] == 'PROTECTED_HEALTH_INFORMATION':
            deidentified_text = deidentified_text.replace(entity['Text'], '[REDACTED]')
    return deidentified_text

original_text = "Patient John Doe was diagnosed with diabetes on 2020-01-01 and prescribed Metformin."
deidentified_text = deidentify_text(original_text, entities)
print(deidentified_text)
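Replacing by string match, as above, is simple but can also hit unrelated text that happens to contain the same characters. Each entity in the response carries BeginOffset and EndOffset character positions, so an alternative is to redact by position and use the entity type as the placeholder. This is a sketch under stated assumptions: the function name redact_by_offsets is hypothetical, the entity list is hand-written sample data, and the spans are assumed not to overlap.

```python
def redact_by_offsets(text, phi_entities):
    """Replace each PHI span with a [TYPE] placeholder, using character offsets.

    Assumes the spans do not overlap.
    """
    redacted = text
    # Work from right to left so earlier offsets stay valid after each replacement.
    for e in sorted(phi_entities, key=lambda e: e["BeginOffset"], reverse=True):
        placeholder = f"[{e['Type']}]"
        redacted = redacted[:e["BeginOffset"]] + placeholder + redacted[e["EndOffset"]:]
    return redacted


text = "Patient John Doe was diagnosed with diabetes on 2020-01-01 and prescribed Metformin."

# Hand-written sample of PHI entities; not real service output.
phi = [
    {"Text": "John Doe", "Type": "NAME", "BeginOffset": 8, "EndOffset": 16},
    {"Text": "2020-01-01", "Type": "DATE", "BeginOffset": 48, "EndOffset": 58},
]

print(redact_by_offsets(text, phi))
# Patient [NAME] was diagnosed with diabetes on [DATE] and prescribed Metformin.
```

Typed placeholders such as [NAME] and [DATE] keep the redacted text readable for downstream analytics, which a uniform [REDACTED] marker does not.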

Benefits of Using AWS Comprehend for PHI De-identification

  1. Accuracy: Detects many types of PHI and returns a confidence score with each entity, so you can tune how aggressively to redact.
  2. Scalability: Process large volumes of text data efficiently.
  3. Compliance: Helps maintain compliance with regulations like HIPAA.
  4. Security: AWS provides robust security features to protect your data.
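On the scalability point, one practical detail is that the synchronous API limits how much text a single request can carry, so long documents have to be split and the returned offsets shifted back to positions in the full text. The sketch below shows one way to do that. The names split_text and detect_in_chunks, and the 5000-character default, are illustrative assumptions rather than service values; check the current quota for the API you call. The detect argument is any callable that takes a chunk and returns a list of entities, which keeps the logic testable without AWS access.

```python
from typing import Any, Callable

Entity = dict[str, Any]


def split_text(text: str, max_chars: int) -> list[tuple[int, str]]:
    """Split text into (start_offset, chunk) pairs, breaking at a space when possible."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer to break just after the last space inside the window.
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut + 1
        chunks.append((start, text[start:end]))
        start = end
    return chunks


def detect_in_chunks(text: str,
                     detect: Callable[[str], list[Entity]],
                     max_chars: int = 5000) -> list[Entity]:
    """Run detect on each chunk and shift offsets to refer to the full text."""
    entities = []
    for offset, chunk in split_text(text, max_chars):
        for e in detect(chunk):
            shifted = dict(e)
            shifted["BeginOffset"] = e["BeginOffset"] + offset
            shifted["EndOffset"] = e["EndOffset"] + offset
            entities.append(shifted)
    return entities


# With boto3 the detector could be, for example:
#   detect = lambda chunk: client.detect_phi(Text=chunk)["Entities"]
```

Breaking at an arbitrary space can split a multi-word entity such as a full name across two chunks, so a sentence-aware splitter would be a safer choice for production use. For very large collections, the asynchronous batch jobs that read from Amazon S3 are likely a better fit than chunked synchronous calls.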

Conclusion

Using AWS Comprehend to detect and de-identify PHI helps healthcare organizations put valuable data to use while maintaining compliance and protecting patient privacy. By following the steps outlined in this blog post, you can implement an effective solution for managing PHI in your text data.