The purpose of this capstone project is to demonstrate various data engineering skills acquired during the nanodegree. This capstone project mainly focuses on the following key areas:
- Developing ETL/ELT pipelines using Apache Airflow
- Constructing Data Lakes in S3 for raw data or for staging purposes
- Using AWS EMR and PySpark to transform raw data
- Constructing a Redshift DWH from the staging/schematized data in AWS S3
- Analyzing the data using PySpark in a Jupyter Notebook
This project is for demonstration purposes and makes use of the following data sources:
- I94 Immigration Data: This data comes from the US National Tourism and Trade Office found here. Each report contains international visitor arrival statistics by world regions and select countries (including top 20), type of visa, mode of transportation, age groups, states visited (first intended address only), and the top ports of entry (for select countries)
- U.S. City Demographics Data: This dataset contains information about the demographics of all US cities and census-designated places with a population greater than or equal to 65,000. The dataset comes from OpenSoft, found here.
- Airport Codes: This is a simple table of airport codes and corresponding cities. The airport codes may refer to either the IATA airport code, a three-letter code used in passenger reservation, ticketing and baggage-handling systems, or the ICAO airport code, a four-letter code used by ATC systems and for airports that do not have an IATA airport code (from Wikipedia). It comes from here
I have enriched the dataset with the following data sources to make the analysis more useful and informative:
- Port of Entry/Nationality Codes: This dataset contains the port of entry codes used to join against the I94 immigration data; the nationality codes contain the country abbreviations (see Port Codes, Nationality Codes and Port of Entry Codes)
- US Visa Types: This data is extracted from the US DHS and Wikipedia and gives information on the various visa types offered by the US (see US Non-immigrant Visa Types and US Immigrant Visa Types)
- Airline Codes: This data source consists of airline IATA abbreviations and the country of origin (see Airline Codes)
The following project architecture is used for developing the Data Lake in AWS:
Using the data from the Data Lake in S3, which is treated as the staging bucket here, the following DWH star schema is finalized:
In this project I decided to use AWS to build the Data Lake and DWH, because of the large number of available services and the elastic, appropriately priced resources at scale. I used AWS EMR for processing massive amounts of data and AWS S3 for storing large amounts of data.
In order to run this project the following resources are needed:
- AWS EC2 - an `m4.xlarge` instance is needed for executing Apache Airflow
- Apache Airflow - for orchestrating the ETL pipeline
- AWS RDS - for Airflow to store its metadata
- AWS S3 - for creating the Data Lake
- AWS EMR - a cluster with Apache Spark and Apache Livy is needed to perform transformation tasks on the raw data
To make the provisioning of the above resources easier, I have used AWS CloudFormation to create the stack needed for running this project. The resources are already configured in the `airflow_server.yaml` template file.
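Conceptually, the stack creation boils down to a couple of boto3 calls, sketched below (a hypothetical sketch only: the real logic lives in `create_resources.py`, and the region, stack name and template parameters shown here are assumptions):

```python
import boto3

REGION = "us-west-2"                 # assumption: use your own region
RAW_BUCKET = "dend-capstone-data"    # example raw-data bucket name from this README
TEMPLATE = "airflow_server.yaml"

s3 = boto3.client("s3", region_name=REGION)
cf = boto3.client("cloudformation", region_name=REGION)

# Upload the CloudFormation template into the raw-data bucket
s3.upload_file(TEMPLATE, RAW_BUCKET, TEMPLATE)

# Create the stack (EC2 instance, security groups, IAM roles, ...) from the template
cf.create_stack(
    StackName="capstone-airflow-server",                              # assumed name
    TemplateURL=f"https://{RAW_BUCKET}.s3.amazonaws.com/{TEMPLATE}",
    Capabilities=["CAPABILITY_NAMED_IAM"],
    Parameters=[{"ParameterKey": "KeyName", "ParameterValue": "airflow_key_pair"}],  # assumed parameter
)

# Block until the stack has been fully created
cf.get_waiter("stack_create_complete").wait(StackName="capstone-airflow-server")
```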
To start creating the necessary resources, the following steps are to be performed:
- Create an AWS account and provide the necessary billing information (if needed)
- Create a user from AWS IAM
- For the user, make sure the following permissions are provided in IAM
- Finish the user creation step and download the AWS KEY ID and AWS SECRET into a csv file
- Create an EC2 Key Pair with the name `airflow_key_pair` for accessing the EC2 instance used for Airflow
- Create an S3 bucket for raw data and upload the airflow_server.yaml file into the bucket
- Fill the dwh.cfg file with your AWS details and also fill in the dwh_airflow.cfg file
- Run `python create_resources.py` to create the Airflow EC2 instance. This script:
  - Uploads the raw data into the raw data bucket, named e.g. `dend-capstone-data` (created in step 1). Python scripts are also uploaded into the raw data bucket (see the S3 folder structure below)
  - Creates a staging bucket in S3, named e.g. `dend-capstone-staging`
  - Creates an AWS EC2 instance with Airflow installed on it
- Once the EC2 instance is created, log in to the instance using PuTTY and run `bash ~/start_airflow.sh` to start Airflow at `<ec2_instance_ip>:8080`
- Use your browser to open Airflow at `<ec2_instance_ip>:8080`; the Airflow home page will list the following DAGs:
  - `metadata_transform_dag` - extracts all the information needed for creating the dimension tables in the DWH
  - `immigration_transform_dag` - extracts the I94 immigration SAS data and stores the processed files in parquet format in S3
  - `create_DWH_dag` - uses the staging bucket in S3 to create a DWH in AWS Redshift
- Fill in the new `aws_credentials` connection on the Connections page as shown in the picture below:
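As an alternative to the UI, the same connection could be registered programmatically; below is a hypothetical sketch assuming Airflow 1.10-style APIs (the key and secret placeholders come from the csv file downloaded earlier):

```python
from airflow import settings
from airflow.models import Connection

# Register the aws_credentials connection without going through the UI
conn = Connection(
    conn_id="aws_credentials",
    conn_type="aws",
    login="YOUR_AWS_KEY_ID",      # placeholder: AWS KEY ID from the IAM user csv
    password="YOUR_AWS_SECRET",   # placeholder: AWS SECRET from the IAM user csv
)

session = settings.Session()
if not session.query(Connection).filter(Connection.conn_id == conn.conn_id).first():
    session.add(conn)
    session.commit()
session.close()
```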
The following files are used in the data pipeline:
Filename | Location | Purpose | Execution Platform |
---|---|---|---|
create_resources.py | Udacity | uploads data to S3 | Udacity |
extras/data_scraper.py | Udacity | scrapes data from the internet to add additional data | Udacity |
extras/nearest_city.py | Udacity | adds nearest city to the airports csv file | Udacity |
dags/lib/emr_cluster_provider.py | AWS EC2 | Provisions EMR cluster | AWS EC2 |
dags/lib/emr_session_provider.py | AWS EC2 | Provides Livy REST Session | AWS EMR |
dags/transforms/transforms_codes.py | AWS EC2 | Transforms all the csv files with the -codes suffix | AWS EMR + Livy Session |
dags/transforms/tranforms_demographics.py | AWS EC2 | Transforms demographics file | AWS EMR + Livy Session |
dags/transforms/transforms_i94_meta_data.py | AWS EC2 | Transforms all the meta data extracted from SAS label file | AWS EMR + Livy Session |
dags/transforms/tranforms_immigration.py | AWS EC2 | Transforms immigration SAS data | AWS EMR |
dags/transforms/tranforms_visa.py | AWS EC2 | Transforms csv files related to visa | AWS EMR + Livy Session |
dags/transforms/data_lake_quality_check.py | AWS EC2 | Performs Data Lake Quality check | AWS EMR + Livy Session |
dags/capstone_meta_data_dag.py | AWS EC2 | ELT pipeline for transforming CSV meta data to S3 parquet format | Airflow |
dags/capstone_immigration_dag.py | AWS EC2 | ELT pipeline for transforming SAS immigration to S3 parquet format with partition over month+year | Airflow |
dags/capstone_DWH_dag.py | AWS EC2 | ETL pipeline for building DWH in Redshift with Staging Data in S3 | Airflow |
emr_bootstrap/boostrap_action.sh | AWS S3 | copies transform_immigration.py and data_lake_quality_check.py to EMR when the cluster bootstraps | AWS EMR |
plugins/helpers/dwh_airflow.cfg | AWS EC2 | Used by airflow to get all the resource paths | Airflow |
plugins/helpers/sql_queries.py | AWS EC2 | SQL queries for DWH | Airflow |
plugins/operators/copy_redshift.py | AWS EC2 | Copies Parquet data from S3 staging to AWS Redshift | Airflow |
plugins/operators/create_table.py | AWS EC2 | Creates empty tables in AWS Redshift | Airflow |
plugins/operators/DWH_data_quality.py | AWS EC2 | Checks data quality in all the tables in AWS Redshift | Airflow |
plugins/operators/emr_add_steps.py | AWS EC2 | Adds emr steps to execute transformations | Airflow |
plugins/operators/S3_Data_check.py | AWS EC2 | Checks if data is present in S3 | Airflow |
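To give an idea of what one of these custom operators could look like, here is a condensed, hypothetical sketch of an S3 data check (assuming Airflow 1.10-style imports; the actual plugins/operators/S3_Data_check.py may differ):

```python
from airflow.hooks.S3_hook import S3Hook
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class S3DataCheckOperator(BaseOperator):
    """Fails the task if the expected prefix is missing from the S3 bucket."""

    ui_color = "#89DA59"

    @apply_defaults
    def __init__(self, aws_conn_id="aws_credentials", bucket_name="", prefix="", *args, **kwargs):
        super(S3DataCheckOperator, self).__init__(*args, **kwargs)
        self.aws_conn_id = aws_conn_id
        self.bucket_name = bucket_name
        self.prefix = prefix

    def execute(self, context):
        hook = S3Hook(aws_conn_id=self.aws_conn_id)
        # check_for_prefix returns False when no objects exist under the prefix
        if not hook.check_for_prefix(prefix=self.prefix, delimiter="/", bucket_name=self.bucket_name):
            raise ValueError(f"No data found under s3://{self.bucket_name}/{self.prefix}")
        self.log.info("Found data under s3://%s/%s", self.bucket_name, self.prefix)
```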
This is the S3 folder structure for both the RAW and STAGING buckets:
<RAW DATA BUCKET>
│ airflow_server_v3.yaml
│
└───awsemr/
│ │ bootstrap_action.sh
│
└───python_apps/
│ │ data_lake_quality_check.py
│ │ transform_immigration.py
│
└───raw/
│ └───codes/
│ │ airlines-codes.csv
│ │ airport-codes.csv
│ │ nationality-codes.csv
│ │ port-of-entry-codes.csv
│ └───i94_immigration_data/
│ │ *.sas7bdat
│ └───sas_meta_data/
│ │ I94_SAS_Labels_Descriptions.SAS
│ └───us-demogrpahics/
│ │ us-cities-demographics.csv
│ └───us-visa/
│ │ visa-type.csv
│ │ visa-issuing-ports.csv
<STAGING DATA BUCKET>
└───lake/
│ └───codes/
│ │ airlines-codes/*.parquet
│ │ airport-codes/*.parquet
│ │ nationality-codes/*.parquet
│ │ port-of-entry-codes/*.parquet
│ └───demogrpahics/
│ │ *.parquet
│ └───i94_meta_data/
│ │ ... extracted sas labels *.parquet
│ └───immigration/
│ │ month_year=apr_16/*.parquet
│ │ .....
│ └───visa-type/
│ │ *.parquet
│ └───visa-issue-post/
│ │ *.parquet
- Use the Airflow UI to trigger the DAGs
- First trigger the `metadata_transform_dag`. This DAG spins up an EMR cluster and uses the Apache Livy REST interface to perform the necessary transformations and data quality checks (see the sketch after this list)
- Then execute the `immigration_transform_dag`. This DAG also spins up an EMR cluster, but uses the EMR add-steps method to add the transformation steps defined for obtaining the transformed SAS immigration data. This DAG has a monthly schedule and can be run from one point in time to another
- Once both DAGs are completed, the S3 staging area (Data Lake) is ready and the DWH can be created in Redshift
- Execute the `capstone_DWH_dag` to fill the DWH with the tables and data
- The Data Dictionary for the DWH tables can be found here
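For reference, the Livy interaction driven by `metadata_transform_dag` can be approximated with plain REST calls; the sketch below is hypothetical (the EMR master DNS is a placeholder, and the real session handling lives in `dags/lib/emr_session_provider.py`):

```python
import json
import time

import requests

LIVY_URL = "http://<emr_master_dns>:8998"   # placeholder: Livy endpoint on the EMR master node
HEADERS = {"Content-Type": "application/json"}

# 1. Open a PySpark session on the EMR cluster
resp = requests.post(f"{LIVY_URL}/sessions", data=json.dumps({"kind": "pyspark"}), headers=HEADERS)
session_url = f"{LIVY_URL}/sessions/{resp.json()['id']}"

# 2. Wait until the session becomes idle
while requests.get(session_url, headers=HEADERS).json()["state"] != "idle":
    time.sleep(10)

# 3. Submit one of the transformation scripts as a code statement
code = open("dags/transforms/transforms_codes.py").read()
stmt = requests.post(f"{session_url}/statements", data=json.dumps({"code": code}), headers=HEADERS)
stmt_url = f"{session_url}/statements/{stmt.json()['id']}"

# 4. Poll the statement until its result is available, then close the session
while requests.get(stmt_url, headers=HEADERS).json()["state"] != "available":
    time.sleep(10)
requests.delete(session_url, headers=HEADERS)
```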
Using the Redshift DWH, we can explore statistics from the SAS data as follows:
- The following plot shows the number of visitors to California in 2016 by various transport modes
SELECT a.entry_month, b.transportation_mode, COUNT(cicid) AS count
FROM immigration a INNER JOIN i94mode b ON a.mode_id = b.mode_id
WHERE a.state_code = 'CA'
GROUP BY a.entry_month, b.transportation_mode
ORDER BY COUNT(cicid) DESC;
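Run from a notebook, the result of this query can be plotted along these lines (a hypothetical sketch; the cluster endpoint and credentials are placeholders):

```python
import matplotlib.pyplot as plt
import pandas as pd
import psycopg2

# Connect to the Redshift DWH (placeholders: use the values from dwh.cfg)
conn = psycopg2.connect(
    host="your-cluster.redshift.amazonaws.com", dbname="dwh",
    user="analyst", password="...", port=5439,
)

query = """
SELECT a.entry_month, b.transportation_mode, COUNT(cicid) AS count
FROM immigration a INNER JOIN i94mode b ON a.mode_id = b.mode_id
WHERE a.state_code = 'CA'
GROUP BY a.entry_month, b.transportation_mode
ORDER BY COUNT(cicid) DESC;
"""

# Pivot to one column per transportation mode and plot visitors per month
df = pd.read_sql(query, conn)
df.pivot(index="entry_month", columns="transportation_mode", values="count").plot(kind="bar")
plt.ylabel("Number of visitors")
plt.show()
```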
- All the airports around the world can be viewed using geopandas as follows
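One way such a map could be produced is sketched below (hypothetical: it assumes the `coordinates` column in airport-codes.csv holds longitude/latitude pairs and uses the naturalearth_lowres dataset bundled with older geopandas releases):

```python
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
from shapely.geometry import Point

# Build point geometries from the "coordinates" column (assumed "lon, lat")
airports = pd.read_csv("airport-codes.csv")
lon_lat = airports["coordinates"].str.split(",", expand=True).astype(float)
gdf = gpd.GeoDataFrame(airports, geometry=[Point(xy) for xy in zip(lon_lat[0], lon_lat[1])])

# Plot all airports on top of a world basemap
world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
ax = world.plot(color="white", edgecolor="black", figsize=(15, 8))
gdf.plot(ax=ax, markersize=1, color="red")
plt.title("Airports around the world")
plt.show()
```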
- Top 10 ports of entry used by immigrants entering the US
SELECT TOP 10 b.port_city, b.state_code_or_country, COUNT(cicid) AS count
FROM immigration a INNER JOIN i94ports b ON a.port_id=b.port_code
GROUP BY b.port_city, b.state_code_or_country
ORDER BY COUNT(cicid) DESC;
- Visa type distribution based on gender in 2016
SELECT im.gender, im.visa_type, count(im.cicid)
FROM immigration as im
JOIN visa_type ON im.visa_type = visa_type.visa_type
GROUP BY im.gender, im.visa_type;
The project rubric asked how alternative scenarios may be tackled. A discussion of these is included below.
Data increased by 100x:
- I have used AWS EMR (elastic) with Spark, which helps when huge volumes of data need to be processed.
- The SAS data is already partitioned by month and year, which helps with querying in Spark (see the sketch after this list).
- The remaining metadata is stored in parquet format, which can be processed with Spark in a distributed way.
- I am using the parquet format for all the tables in the S3 Data Lake, so the transfer from staging to the DWH is faster.
- I have used S3 for storing the staging and raw data; the main advantages are S3's scalability and its accessibility from other AWS services.
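The sketch below illustrates the partitioning point: because the immigration data is laid out by `month_year`, Spark only reads the partitions a query touches (the bucket and column names follow the examples in this README and are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("immigration-analysis").getOrCreate()

# Read the partitioned staging data; month_year is exposed as a column
immigration = spark.read.parquet("s3a://dend-capstone-staging/lake/immigration/")

# Partition pruning: only the apr_16 partition is scanned, even at 100x the volume
apr_2016 = immigration.filter(immigration.month_year == "apr_16")
apr_2016.groupBy("visa_type").count().show()
```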
The pipelines would be run on a daily basis by 7 am every day:
- In this scenario an additional Airflow DAG should be implemented to transfer the data from the Udacity workspace to AWS S3 in order to meet the SLA. The existing Airflow DAGs would have to change their schedule interval from monthly to daily (sketched below).
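A minimal sketch of that schedule change is shown below (the DAG id and arguments are illustrative; the real DAG definitions live under dags/):

```python
from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    "owner": "capstone",
    "start_date": datetime(2016, 1, 1),
    "sla": timedelta(hours=7),          # tasks must finish by 7 am
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# Run daily at midnight so the transforms complete before the 7 am SLA
dag = DAG(
    "capstone_immigration_dag",
    default_args=default_args,
    schedule_interval="0 0 * * *",
    catchup=False,
)
```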
Database needed to be accessed by 100+ people:
- I have already chosen Redshift as the DWH, which has auto-scaling capabilities and good read performance. To manage user access efficiently, different user groups should be defined accordingly; this restricts write access for users who only need read access, while admin users are given full access (sketched below).
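A hypothetical sketch of that user-group setup, executed from Python with psycopg2 (the cluster endpoint, user names and passwords are placeholders):

```python
import psycopg2

# Connect as an admin user (placeholders: use the values from dwh.cfg)
conn = psycopg2.connect(
    host="your-cluster.redshift.amazonaws.com", dbname="dwh",
    user="admin", password="...", port=5439,
)

with conn, conn.cursor() as cur:
    # Read-only group for analysts; admin users keep full privileges
    cur.execute("CREATE GROUP analysts;")
    cur.execute("CREATE USER analyst_1 PASSWORD 'Example_Pa55word' IN GROUP analysts;")
    cur.execute("GRANT USAGE ON SCHEMA public TO GROUP analysts;")
    cur.execute("GRANT SELECT ON ALL TABLES IN SCHEMA public TO GROUP analysts;")

conn.close()
```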
This capstone project was used to demonstrate the steps involved in building a Data Lake and DWH in AWS:
- Creating AWS resources using AWS CloudFormation (IaC)
- Creating S3 buckets, Redshift Cluster, IAM Roles, Security groups
- Developing an ELT Pipeline that transforms raw data from S3 buckets into staging buckets in S3 using AWS EMR
- Developing an ETL pipeline to copy data from S3 staging area to AWS Redshift
- Using Airflow and Python to automate ETL pipelines
- Writing custom operators to perform tasks such as S3 data checks, copying data from S3 to Redshift, and adding EMR steps
- Transforming data from various sources into a star schema optimized for immigration analysts' use cases.