Introduction: What is Write-Audit-Publish (WAP)?

The Write-Audit-Publish (WAP) pattern is a data management approach that ensures data quality and consistency in production environments. Using WAP, you can write data into an isolated branch, perform thorough audits and quality checks, and then publish the data to the production table only if it passes all validations. Apache Iceberg, a high-performance table format for massive analytic datasets, supports WAP natively, making it an excellent choice for managing data lakes on AWS.

Prerequisites for Using WAP

Before implementing WAP with Apache Iceberg on AWS, ensure you have the following:

  1. An AWS account with appropriate permissions.
  2. Apache Iceberg set up in your AWS environment.
  3. A working knowledge of Spark and its configuration.
  4. A SparkSession configured to work with Iceberg.
  5. Data Quality (DQ) tools for auditing data.

SparkSession Configuration for Iceberg

To get started with Iceberg, configure your SparkSession. This configuration enables Spark to use Iceberg as the table format.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("IcebergWAP") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.spark_catalog.type", "hadoop") \
    .config("spark.sql.catalog.spark_catalog.warehouse", "s3://your-warehouse-path/") \
    .getOrCreate()

Understanding Spark Configurations for Iceberg

It’s essential to understand the Spark configurations specific to Iceberg. These configurations define how Spark interacts with Iceberg tables and handles various aspects of data management.

Key configurations include:

  • spark.sql.extensions: Enables Iceberg extensions in Spark.
  • spark.sql.catalog.spark_catalog: Specifies the catalog implementation.
  • spark.sql.catalog.spark_catalog.type: Defines the catalog type (e.g., hadoop, hive).
  • spark.sql.catalog.spark_catalog.warehouse: Sets the warehouse path for table storage.
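
On AWS, many deployments register an Iceberg catalog backed by the AWS Glue Data Catalog rather than a Hadoop catalog. As a rough sketch only (the catalog name glue_catalog and the bucket path are placeholders, and the Iceberg AWS bundle is assumed to be on the classpath), the equivalent configuration looks like this:

glue_spark = SparkSession.builder \
    .appName("IcebergWAPGlue") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://your-warehouse-path/") \
    .getOrCreate()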

Creating a Production Iceberg Table

Create a production table in Iceberg to store your data.

CREATE TABLE spark_catalog.db.production_table (
  id bigint,
  data string,
  ts timestamp
) USING iceberg;
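
Iceberg gates write-audit-publish behavior behind a table property; in the documented flow, write.wap.enabled is set to true on the table before any writes are staged. A minimal sketch for the table above, issued through the existing SparkSession:

# Enable write-audit-publish on the production table before staging WAP writes.
spark.sql("""
    ALTER TABLE spark_catalog.db.production_table
    SET TBLPROPERTIES ('write.wap.enabled' = 'true')
""")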

Reading Data from Source Table

Read the source data that you want to write into the Iceberg table.

source_df = spark.read.format("parquet").load("s3://your-source-path/")
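
Before writing, it can be worth confirming that the source columns line up with the production table's schema; a quick, illustrative check:

# Compare the source schema against the target table's schema (illustrative only).
source_df.printSchema()
spark.table("spark_catalog.db.production_table").printSchema()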

WAP Implementation

Writing Data to a Branch

Start by writing the data to an isolated branch so that the main branch of the production table is unaffected. In this guide the branch is named audit_branch; create it first, as shown below, and then either append to it directly using the branch identifier or stage writes on it through the spark.wap.branch setting covered in the following sections.
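
A minimal sketch of creating the branch with Spark SQL (audit_branch is simply a placeholder name used throughout this guide):

# Create the audit branch on the production table before writing to it.
spark.sql("ALTER TABLE spark_catalog.db.production_table CREATE BRANCH audit_branch")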

# Append directly to the audit branch by referencing the branch_<name> identifier.
source_df.writeTo("spark_catalog.db.production_table.branch_audit_branch").append()

Verifying Data in the Audit Branch

Verify that the data has been written to the branch without affecting the main production table.

branch_df = spark.read.format("iceberg").option("branch", "audit_branch").load("spark_catalog.db.production_table")

branch_df.show()
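
To confirm that the main branch is untouched, read the table without a branch option and compare; a minimal sketch:

# A plain read (no branch option) resolves to the main branch, which should not yet
# contain the newly written rows.
main_df = spark.read.format("iceberg").load("spark_catalog.db.production_table")
print("main rows:", main_df.count(), "audit branch rows:", branch_df.count())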

Setting spark.wap.branch in SparkSession

Set the branch in SparkSession to ensure subsequent writes use the WAP branch.

spark.conf.set("spark.wap.branch", "audit_branch")

Writing Source Data into Production Table

Write the source data into the production table. Because spark.wap.branch is set, the append is staged on the audit branch rather than on main.

source_df.writeTo("spark_catalog.db.production_table").append()
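
To see which branches exist and which snapshot each one points to, you can query the table's refs metadata table (available in recent Iceberg releases); a small sketch:

# List the table's branches and tags along with the snapshot each reference points to.
spark.sql("SELECT name, type, snapshot_id FROM spark_catalog.db.production_table.refs").show()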

Auditing Data

Perform data quality checks on the data written to the audit branch.

Running DQ Checks on Audit Branch Data

Use your preferred DQ tools to run checks on the audit branch data. Ensure the data meets all quality standards before publishing.
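
Dedicated tools such as Great Expectations or Deequ are common choices here. Purely as an illustrative stand-in, a few plain PySpark assertions against the audit branch might look like this (the rules and thresholds are hypothetical):

# Read the data as staged on the audit branch.
audit_df = spark.read.format("iceberg") \
    .option("branch", "audit_branch") \
    .load("spark_catalog.db.production_table")

# Illustrative checks: non-empty, no null keys, unique ids.
assert audit_df.count() > 0, "audit branch is empty"
assert audit_df.filter("id IS NULL").count() == 0, "null ids found"
assert audit_df.count() == audit_df.select("id").distinct().count(), "duplicate ids found"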

Publishing Data

If the data passes all audits, publish it by making the audited snapshot visible on the main branch. In the branch-based WAP flow used here, publishing is performed by fast-forwarding the main branch to the audit branch, as shown in the next section.

Fast Forwarding the Main Branch to the Audit Branch

Fast-forward the main branch to the head of the audit branch so that the audited data becomes visible to readers of the production table.

CALL spark_catalog.system.fast_forward('db.production_table', 'main', 'audit_branch')

Unsetting the WAP Branch Configuration

Unset the WAP branch configuration to revert to the default behavior.

spark.conf.unset("spark.wap.branch")
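
With the WAP setting cleared, an ordinary read of the production table now reflects the published data; a quick sanity check:

# After publishing and unsetting spark.wap.branch, a plain read returns the audited data.
spark.read.format("iceberg").load("spark_catalog.db.production_table").show()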

Dropping the Audit Branch

Drop the audit branch after publishing to clean up.

ALTER TABLE spark_catalog.db.production_table DROP BRANCH audit_branch;

Conclusion

Implementing the Write-Audit-Publish (WAP) pattern with Apache Iceberg on AWS provides a robust mechanism for ensuring data quality and consistency in your production environments. Following the steps outlined in this guide, you can effectively manage and audit your data before it reaches production, minimizing the risk of data quality issues.
