Introduction: What is Write-Audit-Publish (WAP)?
The Write-Audit-Publish (WAP) pattern is a data management approach that ensures data quality and consistency in production environments. Using WAP, you can write data into an isolated branch, perform thorough audits and quality checks, and then publish the data to the production table only if it passes all validations. Apache Iceberg, a high-performance table format for massive analytic datasets, supports WAP natively, making it an excellent choice for managing data lakes on AWS.
Prerequisites for Using WAP
Before implementing WAP with Apache Iceberg on AWS, ensure you have the following:
- An AWS account with appropriate permissions.
- Apache Iceberg set up in your AWS environment.
- A working knowledge of Spark and its configuration.
- A SparkSession configured to work with Iceberg.
- Data Quality (DQ) tools for auditing data.
SparkSession Configuration for Iceberg
To get started with Iceberg, configure your SparkSession. This configuration enables Spark to use Iceberg as the table format.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("IcebergWAP") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.spark_catalog.type", "hadoop") \
    .config("spark.sql.catalog.spark_catalog.warehouse", "s3://your-warehouse-path/") \
    .getOrCreate()
Understanding Spark Configurations for Iceberg
It’s essential to understand the Spark configurations specific to Iceberg. These configurations define how Spark interacts with Iceberg tables and handles various aspects of data management.
Key configurations include:
- spark.sql.extensions: Enables Iceberg extensions in Spark.
- spark.sql.catalog.spark_catalog: Specifies the catalog implementation.
- spark.sql.catalog.spark_catalog.type: Defines the catalog type (e.g., hadoop, hive).
- spark.sql.catalog.spark_catalog.warehouse: Sets the warehouse path for table storage.
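The configuration above uses a hadoop catalog type with an S3 warehouse path. On AWS, a common alternative is to back the catalog with the AWS Glue Data Catalog. A minimal sketch of that variant, assuming the Iceberg AWS bundle (with its Glue and S3 dependencies) is on the classpath; the catalog name and S3 path are placeholders:

# Sketch: pointing the Iceberg catalog at AWS Glue instead of a Hadoop warehouse.
# "glue_catalog" and the S3 path are placeholders; adjust for your environment.
spark = SparkSession.builder \
    .appName("IcebergWAP") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://your-warehouse-path/") \
    .getOrCreate()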
Creating a Production Iceberg Table
Create a production table in Iceberg to store your data.
CREATE TABLE spark_catalog.db.production_table (
id bigint,
data string,
ts timestamp
) USING iceberg;
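Iceberg also exposes a write.wap.enabled table property for write-audit-publish workflows. It is used by the snapshot-based (spark.wap.id) variant of WAP and is commonly set even in the branch-based flow shown here, although it may not be strictly required for it; a sketch of enabling it on the table just created:

# Optional: enable Iceberg's write-audit-publish table property.
# Commonly set in WAP setups; may not be strictly required for branch-based WAP.
spark.sql("""
    ALTER TABLE spark_catalog.db.production_table
    SET TBLPROPERTIES ('write.wap.enabled'='true')
""")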
Reading Data from Source Table
Read data from the source table you want to write to the Iceberg table.
source_df = spark.read.format("parquet").load("s3://your-source-path/")
WAP Implementation
Writing Data in a Branch
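Iceberg branch writes generally expect the target branch to exist; whether spark.wap.branch creates it implicitly depends on your Iceberg version, so creating the audit branch explicitly is the safer option. A minimal sketch, using branch_name as a placeholder:

# Create the audit branch on the production table before writing to it.
spark.sql("ALTER TABLE spark_catalog.db.production_table CREATE BRANCH branch_name")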
Start by writing the data to the audit branch so the main branch of the production table is not affected.
# Append to the audit branch via Iceberg's branch identifier: the table name
# followed by "branch_" plus the branch name.
source_df.writeTo("spark_catalog.db.production_table.branch_branch_name").append()
Verifying Data in the Audit Branch
Verify that the data has been written to the branch without affecting the main production table.
branch_df = spark.read.option("branch", "branch_name").format("iceberg").load("spark_catalog.db.production_table")
branch_df.show()
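To confirm that the main branch is still unaffected, compare it with the audit branch; a quick sketch:

# The main branch should not yet include the rows staged on the audit branch.
main_df = spark.read.format("iceberg").load("spark_catalog.db.production_table")
print("rows on main branch:", main_df.count())
print("rows on audit branch:", branch_df.count())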
Setting spark.wap.branch in SparkSession
Set the branch in SparkSession to ensure subsequent writes use the WAP branch.
spark.conf.set("spark.wap.branch", "branch_name")
Writing Source Data into Production Table
Write the source data into the production table under the specified branch.
source_df.writeTo("spark_catalog.db.production_table").append()
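Because spark.wap.branch is set, this append is staged on the audit branch rather than on main. If your Spark and Iceberg versions support branch names in VERSION AS OF, you can double-check with a branch-qualified query; a sketch:

# Query the audit branch explicitly to confirm the staged rows are present.
spark.sql("""
    SELECT count(*) AS staged_rows
    FROM spark_catalog.db.production_table VERSION AS OF 'branch_name'
""").show()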
Auditing Data
Perform data quality checks on the data written to the audit branch.
Running DQ Checks on Audit Branch Data
Use your preferred DQ tools to run checks on the audit branch data. Ensure the data meets all quality standards before publishing.
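As a minimal illustration (not a substitute for a dedicated DQ framework), simple checks can be expressed directly in PySpark against the audit branch; the pass/fail rules below are hypothetical examples:

from pyspark.sql import functions as F

# Read the audit branch and run a few illustrative checks.
audit_df = spark.read.option("branch", "branch_name").format("iceberg") \
    .load("spark_catalog.db.production_table")

row_count = audit_df.count()
null_ids = audit_df.filter(F.col("id").isNull()).count()
duplicate_ids = row_count - audit_df.select("id").distinct().count()

# Hypothetical pass/fail criteria; adapt them to your own quality standards.
checks_passed = row_count > 0 and null_ids == 0 and duplicate_ids == 0
print(f"rows={row_count}, null_ids={null_ids}, duplicates={duplicate_ids}, passed={checks_passed}")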
Publishing Data
If the data passes all audits, publish it by making the audited snapshot visible on the main branch of the production table.
Fast Forwarding the Main Branch to the Audit Branch
Publishing is done by fast forwarding the main branch to the head of the audit branch, so that readers of the table immediately see the audited data.
CALL spark_catalog.system.fast_forward('db.production_table', 'main', 'branch_name');
Unsetting Configuration/Property for WAP
Unset the WAP branch configuration to revert to the default behavior.
spark.conf.unset("spark.wap.branch")
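With spark.wap.branch unset, default reads target the main branch, which now contains the published data; a quick sanity check:

# Reads without a branch option target the main branch by default.
print(spark.read.format("iceberg").load("spark_catalog.db.production_table").count())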
Dropping Audit Branch
Drop the audit branch after publishing to clean up.
ALTER TABLE spark_catalog.db.production_table DROP BRANCH branch_name;
Conclusion
Implementing the Write-Audit-Publish (WAP) pattern with Apache Iceberg on AWS provides a robust mechanism for ensuring data quality and consistency in your production environments. Following the steps outlined in this guide, you can effectively manage and audit your data before it reaches production, minimizing the risk of data quality issues.