Project report

The customer

The customer is a financial sector company with approximately 5,000 employees that processes and analyzes large amounts of data from various sources to support business decisions.

The challenge

The customer was faced with the challenge of implementing effective ETL (Extract, Transform, Load) processes to efficiently process data and prepare it for analysis. The specific challenges in this context were:

Data extraction: Regularly retrieving data from different sources such as databases, APIs or files posed a significant challenge. It had to be ensured that the data was retrieved completely, consistently and in an up-to-date state in order to create a reliable basis for further processing steps.

Data cleansing and transformation: After extraction, the raw data needed to be converted into a format suitable for analysis. This included cleaning up incomplete or incorrect data and standardizing different data formats. This step was often complex and required special rules and methods to transform the data correctly.

Data loading: Finally, another challenge was to load the transformed data into Amazon RDS (Relational Database Service), which acted as a data warehouse or data mart. This required careful planning to store the data efficiently and optimize access for analysis purposes.

The solution

Data stored in S3 buckets, Aurora databases and FTPS shares is processed at regular intervals. Management and orchestration of the data processing workflows takes place in an MWAA environment (Amazon Managed Workflows for Apache Airflow). The destination of the data transport is a data warehouse; within the data warehouse, the transformed data is stored in a PostgreSQL database.

The DAGs (Directed Acyclic Graphs) that control the data processing are deployed via GitLab and loaded into an S3 bucket. CloudWatch monitors the entire process and provides log data to track status and possible errors. The credentials and passwords required to access the various data sources and destinations are stored encrypted in AWS Secrets Manager. An Application Load Balancer (ALB) provides access to the Airflow UI for managing the environment.
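To illustrate how a task could retrieve credentials at runtime, the following is a minimal sketch using boto3. The secret names, region and JSON structure are placeholders, not details taken from the customer's setup:

```python
import json

import boto3


def get_secret(secret_id: str, region: str = "eu-central-1") -> dict:
    """Fetch a JSON-formatted secret from AWS Secrets Manager and decode it."""
    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])


# Example usage: credentials for the FTPS share and the Aurora source database.
# The secret names below are assumptions for illustration only.
ftps_credentials = get_secret("etl/ftps")
aurora_credentials = get_secret("etl/aurora")
```

Keeping credentials in Secrets Manager rather than in the DAG code means they never end up in the GitLab repository or the S3 bucket that holds the DAGs.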

How it works

The data processing is triggered daily by the Airflow DAG. First, the extract_data function loads the data from the various sources (S3 bucket, Aurora, FTPS). Then the transform_data function prepares the raw data, for example by removing headers or converting data types. Finally, the load_data function loads the transformed data into the data warehouse, where it is stored in the PostgreSQL database.
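The original article does not show the DAG itself, but a minimal sketch of such a pipeline could look as follows, assuming Airflow 2.4 or later; the dag_id and the function bodies are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_data(**context):
    # Placeholder: pull raw data from the S3 bucket, Aurora and the FTPS share.
    ...


def transform_data(**context):
    # Placeholder: clean the raw data, e.g. drop header rows, convert data types.
    ...


def load_data(**context):
    # Placeholder: write the transformed data into the PostgreSQL data warehouse.
    ...


with DAG(
    dag_id="daily_etl_pipeline",      # name is a placeholder
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                # the pipeline runs once per day
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_data", python_callable=extract_data)
    transform = PythonOperator(task_id="transform_data", python_callable=transform_data)
    load = PythonOperator(task_id="load_data", python_callable=load_data)

    extract >> transform >> load
```

The `extract >> transform >> load` dependency chain is what makes the graph acyclic: each step runs only after the previous one has completed successfully.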

The benefits

The solution offers the customer several advantages by using various AWS services:

  • Automation: The MWAA environment automates recurring ETL (Extract, Transform, Load) processes.
  • Scalability: Airflow is easy to scale, eliminating the need for manual adjustments as data volumes increase.
  • Low maintenance: Since MWAA, CloudWatch, S3 and Secret Manager are managed, maintenance and administration efforts are eliminated, allowing the focus to be placed on core processes.
  • Monitoring: CloudWatch enables detailed monitoring to detect anomalies early and improve system reliability.
  • Integration: The architecture supports various data sources (S3 Bucket, Aurora, FTPS), which enables flexible and efficient data processing.
  • Modularity: The architecture is modular so that it can be easily expanded to include new functions or data sources.

Overall, the architecture provides a cost-effective, secure and flexible solution that simplifies and automates the data processing process while ensuring reliable monitoring and troubleshooting.

Source: https://www.protos-technologie.de/2024/11/05/apache-airflow-und-aws-eine-leistungsstarke-kombination-fuer-datenpipelines/
