
Udacity Data Engineering Nanodegree Capstone Project



Project Summary

The objective of this project was to create an ETL pipeline for the I94 immigration, global land temperatures, and US demographics datasets to form an analytics database on immigration events. A use case for this analytics database is to find immigration patterns to the US. For example, we could try to answer questions such as: do people from countries with warmer or colder climates immigrate to the US in larger numbers?

Data and Code

All the data for this project was loaded into S3 prior to commencing the project. The exception is the i94res.csv file, which was loaded into the HDFS filesystem on Amazon EMR.

In addition to the data files, the project workspace includes:


The project follows the following steps:

Step 1: Scope the Project and Gather Data

Step 2: Explore and Assess the Data

Step 3: Define the Data Model

Project Scope

To create the analytics database, the following steps will be carried out:

The technologies used in this project are Amazon S3 and Apache Spark. Data will be read and staged from the customer's repository using Spark.

Refer to the Jupyter notebook for the exploratory data analysis.

3.1 Conceptual Data Model

Database schema

The country dimension table is made up of data from the global land temperatures by city and the immigration datasets. The combination of these two datasets allows analysts to study correlations between global land temperatures and immigration patterns to the US.

The US demographics dimension table comes from the demographics dataset and links to the immigration fact table at US state level. This dimension allows analysts to get insights into migration patterns into the US based on demographics, as well as the overall population of states. We could ask questions such as: do populous states attract more visitors on a monthly basis? One could envision a dashboard based on this data model, with drill-downs into granular information on visits to the US. Such a dashboard could foster a culture of data-driven decision making within tourism and immigration departments at the state level.

The visa type dimension table comes from the immigration datasets and links to the immigration fact table via the visa_type_key.

The immigration fact table is the heart of the data model. This table's data comes from the immigration datasets and contains keys that link to the dimension tables. The data dictionary of the immigration dataset contains detailed information on the data that makes up the fact table.

3.2 Mapping Out Data Pipelines

The pipeline steps are as follows:

Step 4: Run Pipelines to Model the Data

4.1 Create the Data Model

Refer to the Jupyter notebook for the data dictionary.

4.2 Running the ETL pipeline

The ETL pipeline is defined in the etl.py script, and this script uses the utility.py and etl_functions.py modules to create a pipeline that creates final tables in Amazon S3.

spark-submit --packages saurfang:spark-sas7bdat:2.0.0-s_2.10 etl.py


An automated ETL data pipeline for immigration, temperature and demographics information

This project is maintained by HakbilenBerk

Udacity Data Engineering Capstone Project: Automated-Data-Pipeline

Project by Berk Hakbilen

Data pipeline for immigration, temperature and demographics information

Goal of the Project

In this project, US immigration information is extracted from SAS files, along with temperature and demographics information for the cities from CSV files. The datasets are cleaned and rendered to JSON datasets on AWS S3. The JSON data is then loaded into staging tables on Redshift and transformed into star-schema tables, which also reside on Redshift. The whole pipeline is automated using Airflow.

The database schema was constructed for the following use cases:

Database Model


The data is stored on a Redshift cluster on AWS. The cluster contains the staging tables, which hold the data from the JSON files in the S3 bucket. It also contains a fact table called immigrations, where the immigration/travel information of individuals is listed. I used immigrations as the fact table: through the city attribute, queries can be joined to the other dimension tables, such as demographics, temperature, and visa details, to obtain further correlations with the immigration/travel information. I opted for the star schema to optimize the database for Online Analytical Processing (OLAP).

Database Dictionary

Please refer to “Airflow/data/I94_SAS_Labels_Descriptions.SAS” for abbreviations.

Immigration Fact Table

Visa Details Dimension Table

Temperature Dimension Table

Demographics Dimension Table

Tools and Technologies Used

The tools and technologies used:

Source Datasets

The datasets used and sources include:

Airflow: DAG Representation


The convert_to_json.py file, used to convert the source data to JSON data on the S3 bucket, is intended as a one-time process. Therefore, it is not implemented in the DAG as a task.


Firstly, clone this repo.

Create a virtual environment called venv and activate it

You can access Airflow UI on your browser by going to:

Use user= user and password= password to login to Airflow.

Fill in your AWS access and secret keys in the aws.cfg file under Airflow/ directory.

Create an S3 Bucket on AWS and fill in the S3_Bucket in aws.cfg with your bucket address (s3a://your_bucket_name)

Create a new folder called “data_sas” under airflow/data. Download the SAS data files from here and copy them in the folder you just created. Also download “GlobalLandTemperaturesByCity.csv” from here and copy it under airflow/data along with other csv files (Because these data files are too large, they are not included in this repo. Therefore you need to download them manually).

Admin tab

Under Connections, select Create.



How to handle possible scenarios

DEV Community


Gavi Schneider

Posted on Nov 12, 2020 • Originally published at Medium on Nov 12, 2020

My Capstone Project for Udacity’s Cloud DevOps Engineer Nanodegree


After three months of various DevOps related courses and smaller projects, I had reached the end of my Nanodegree, and it was time to build out my capstone project.

My project can be broken down into two parts: the application itself, and the infrastructure that deploys and hosts it.

The Application: Random Song

Random Song is a simple web app built using TypeScript, Node.js and Express. It serves as a web service that can send you a random song, using the Musixmatch API. To test out the app, simply go to the /random route, and you'll receive a random song object in JSON.

Going to the / route will return:

And going to the /random route will return a random song:

The Infrastructure

After the application was built, the next task was deploying it. In this project I decided to go with a Rolling Deployment. My goal was to write out the necessary configuration files and required build commands, and then create a pipeline to automate the process of actually building the application and deploying the infrastructure. This way, it could be executed in the exact same manner every time I added new code or infrastructure to the project. I needed a server to host Jenkins, my CI/CD technology of choice for this project. After provisioning an AWS EC2 instance and installing Jenkins, it was time to start defining the tasks that I’d want Jenkins to run. After accessing my application’s code, here are the tasks that I created for Jenkins to run:

In the end, the app is deployed to the cluster and accessible to users. Random songs for days.

Unfortunately, the app is not currently deployed due to EKS not being a cheap service for a student to continuously pay for. However, I’m planning on taking the Random Song application and turning it into something that will be more permanently hosted in a future project. As far as the infrastructure goes, these are also things that can be repurposed in future projects — Docker containers, Kubernetes clusters and Jenkins pipelines are tools that can help build any software related project.

If you’d like to see the code, you can take a look at the project’s repo on GitHub .


Notes on Neural Nets


Udacity Nanodegree Capstone Project

The Udacity Self-Driving Car Nanodegree has been a great experience. Together with the Intro to Self-Driving Cars course, I have spent the last 9 months learning all about Computer Vision, Convolutional Nets, Bayesian probability, and Sensor Fusion. The method used by Udacity was much to my liking, with a series of projects where you learn by doing as you go. The last project was especially in-depth, using ROS to implement a full autonomous vehicle. A big part of the fun in this last project was doing it in a group, meeting great people and learning from them in lively discussions.

Our autonomous car is designed to follow pre-defined waypoints along a road, recognize the traffic light status from camera images, stop on red and restart driving on green. This system is tested on the Udacity simulator and was tested on the real Udacity self-driving car (Carla) on project delivery.

Members of Team “Always straight to the point!”

Table of Contents

1. Overview

To complete the project, we programmed the different ROS nodes in Python. The basic structure is well described in the Udacity Walkthrough, and our implementation follows the given guidelines. This implements the basic functionality of loading the waypoints that the car has to follow, controlling the car's movement along these waypoints, and stopping the car upon encountering a red traffic light. The details of the ROS system are described in System architecture.

After laying out the basic ROS functionality, much focus was given to implementing the traffic light detection from images collected by the camera, both in the simulator and on the real testing lot. We decided in favor of using a Tensorflow model pre-trained on the general task of object detection. To prepare for this part of the project we read the previous work of Alex Lechner on the Udacity Nanodegree, as well as the Medium post of Vatsal Srivastava. We also used their datasets for testing and validation.

To fine-tune this model to our task of recognizing traffic lights (red, yellow, and green), we generated thousands of labeled training images. This process is described in Datasets.

The training on those images was done using the Tensorflow Object Detection API and Google Cloud Platform, as described in Traffic Light Classifier .

The integration of our Tensorflow Traffic Light Classifier into the ROS system is described in Final Integration .

However, before getting into the details, we describe a workaround we needed to finish our tests satisfactorily and solve a latency problem in the simulator when the camera is switched on.

2. Workaround to avoid simulator latency issue with camera on

After implementing the basic ROS functionality, the car can complete a full lap in the simulator without issues. However, to fully implement traffic light recognition with a classifier we need to activate the camera in the simulator. With this, the simulator begins to send image data to the /image_color topic. This data processing seems to overload our system, and latency appears, delaying the updating of the waypoints relative to the position of our car. The waypoints begin to appear behind the car and, as the car tries to follow these waypoints, the control subsystem gets erratic and the car drives off the road.

We found this problem both in the virtual machine and in a native Linux installation. It has also been observed by many Udacity Nanodegree participants, as seen in these GitHub issues: Capstone Simulator Latency #210 and turning on the camera slows car down so auto-mode gets messed up #266 . An example of the issue on our side is shown below:

[Example video of the latency issue]

We implemented a workaround with a small modification to one of the files provided by Udacity in the ROS system, bridge.py . This module builds the node styx_server , which creates the topics responsible for transmitting different data out of the simulator. We first tried to process only some of the images received via /image_color , but it seemed the origin of the delay was the presence of these images in the topic in the first place. Thus, we implemented the skipping logic in the topic itself, and the issue was finally solved. The code modifies the publish_camera() function:
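A minimal sketch of this kind of frame-skipping logic follows; the class name, publisher interface, and skip ratio are illustrative, not the actual bridge.py code:

```python
class CameraBridge:
    """Publish only every Nth camera frame to /image_color to reduce load.

    Sketch of the workaround; the real bridge.py differs in detail.
    """

    SKIP_FRAMES = 4  # illustrative: publish 1 out of every 4 frames

    def __init__(self, publisher):
        self.publisher = publisher  # e.g. a rospy.Publisher for /image_color
        self.frame_count = 0

    def publish_camera(self, image):
        # Count every incoming frame, forward only every SKIP_FRAMES-th one.
        self.frame_count += 1
        if self.frame_count % self.SKIP_FRAMES == 0:
            self.publisher.publish(image)
```

The key point is that frames are dropped before they ever reach the topic, so downstream nodes never pay the processing cost.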

Furthermore, the waypoints queue was reduced from 200 to 20, which also proved to speed up the simulator considerably before implementing this workaround.

However, this method only allowed us to get rid of the latency in a native Linux installation. On a virtual machine under Windows and on the Udacity Web Workspace, the latency improved, perhaps with increased values of skipped frames, but still showed up after some time.

3. ROS System Architecture

The ROS system can be divided in three main subsystems:

The diagram below shows the subsystem division, as well as the ROS nodes and topics.

[Diagram of the ROS nodes and topics]

i. Perception (tl_detector.py)

This node subscribes to four topics:

This node finds the waypoint of the closest traffic light in front of the car. This point is described by its index counted from the car (e.g., the 12th waypoint ahead of the car's position). Then, the state of the traffic light is acquired from the camera image in /image_color using the classifier implementation in tl_classifier.py . If the traffic light is red, the node publishes the waypoint index to the /traffic_waypoint topic. This information is taken by the Planning subsystem to define the desired velocity at the next sequence of waypoints.

ii. Planning (waypoint_updater.py)

This node subscribes to the topics:

It publishes a list of waypoints in front of our car to the topic /final_waypoints . The waypoint data also includes the desired velocity of the car at the given waypoint. If a red traffic light is detected in front of the car, we modify the desired velocities in /final_waypoints so that the car slowly stops at the right place.

The number of waypoints is defined by the parameter LOOKAHEAD_WPS . If this parameter is too big, there is a large latency in updating the waypoints, to the point that the car gets ahead of the list of waypoints. This confuses the control of the car, which tries to follow the waypoints. We settled on a value of 20 to get rid of this latency while still having enough data to properly control the car.
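The lookahead selection itself is simple list slicing; a sketch (function name is illustrative, and the real node also handles wrapping at the end of the track and velocity adjustment):

```python
LOOKAHEAD_WPS = 20  # number of waypoints published ahead of the car

def next_waypoints(base_waypoints, closest_idx):
    """Return the LOOKAHEAD_WPS waypoints starting at the closest one ahead.

    Sketch of the waypoint_updater logic; the real node also adjusts the
    target velocity of each waypoint when a red light is ahead, and wraps
    around at the end of the track.
    """
    return base_waypoints[closest_idx:closest_idx + LOOKAHEAD_WPS]
```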

iii. Control (dbw_node.py)

In the control subsystem, Udacity provides the Autoware software waypoint_follower.py . After /final_waypoints is published, this software publishes twist commands to the /twist_cmd topic, which contain the desired linear and angular velocities.

dbw_node.py subscribes to /twist_cmd , /current_velocity , and /vehicle/dbw_enabled . It passes the messages in these topics to the Controller class from twist_controller.py . Here we implemented the control of the car, using the provided Yaw Controller, PID Controller, and LowPass Filter.

It is important to perform the control only when /vehicle/dbw_enabled is true. When this topic message is false, it means the car is under manual control. In this condition the PID controller would mistakenly accumulate error.
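The reset-on-manual behavior can be sketched with a minimal PID (illustrative only, not the Udacity-provided controller):

```python
class SimplePID:
    """Minimal PID controller that resets its integral term on manual mode.

    Illustrative sketch; the project uses the controllers provided by Udacity.
    """

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.last_error = 0.0

    def reset(self):
        # Called when /vehicle/dbw_enabled is false (manual control),
        # so the integral term does not accumulate stale error.
        self.integral = 0.0
        self.last_error = 0.0

    def step(self, error, dt):
        # Standard PID update: proportional + accumulated + derivative terms.
        self.integral += error * dt
        derivative = (error - self.last_error) / dt
        self.last_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

Without the `reset()` call, error accumulated while a human drives would produce a throttle spike the moment autonomous mode re-engages.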

The calculated throttle, brake, and steering are published to the topics:

4. Datasets

(By Andrei Sasinovich )

After we got our dataset of images from the simulator and the rosbag, we started to think about how to label it. The first option was to label it by hand, but when we looked at the number of collected images (more than 1k) we decided that was not a good way, as we had other work to do 😊

We decided to generate a dataset instead. We cropped out 10-15 traffic lights of each color.


Using OpenCV we generated thousands of images by resizing the traffic lights and changing contrast and brightness. As every traffic light was applied onto a background by our script, we could generate the coordinates of the drawn bounding boxes as well.
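The contrast/brightness jitter amounts to a clamped linear transform of each pixel, the same alpha*x + beta operation OpenCV's convertScaleAbs performs; a pure-Python sketch with illustrative parameter values:

```python
def adjust_pixel(value, alpha=1.2, beta=10):
    """Apply contrast (alpha) and brightness (beta) to one channel value,
    clamped to the valid 0-255 range. Parameter values are illustrative."""
    return max(0, min(255, int(round(alpha * value + beta))))

def adjust_image(image, alpha=1.2, beta=10):
    """image: nested lists [rows][cols][channels] of 0-255 ints."""
    return [[[adjust_pixel(v, alpha, beta) for v in px] for px in row]
            for row in image]
```

In the actual pipeline OpenCV applies this per-channel over whole numpy arrays; randomizing alpha and beta per generated image yields the brightness/contrast variety described above.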


The TFRecord file was created on the fly by packing all our info into TF format using the Tensorflow function tf.train.Example .

5. Traffic Light Classifier

The state of the traffic light in front of the car has to be extracted from the camera's images, both on the simulator and at the real site. Different methods of image recognition can be used. We decided to use Deep Learning in the form of a model pre-trained on the general task of object detection. Whereas earlier in this Udacity Nanodegree we defined a model from scratch and trained it for traffic sign classification, object detection also includes the capability of locating an object within an image and delimiting its position with a bounding box. Only this way can we extract from the camera image the state of one or several traffic lights within the landscape in front of us.

Several Deep Learning methods for object detection have been developed by researchers. Two of the most popular methods are R-CNN (Regions with CNN), and SSD (Single Shot Detector). While R-CNN performs with higher accuracy than SSD, the latter is faster. Improved versions have been developed (Fast R-CNN, Faster R-CNN) but they are still slower than SSD.

The Google’s Tensorflow Object Detection API provides a great framework to implement our traffic light classifier. This is a collection of pre-trained models and high level subroutines that facilitate the use and fine-tuning of these models. The models are compiled in Tensorflow detection model zoo , belonging mainly to the SSD and Faster R-CNN methods.

Although the goal of the API is to facilitate the fine-tuning of these models, there are still a lot of installation and configuration steps that are not trivial at all. In fact, by the time you have fully trained a model for your purposes you will have gone through a convoluted series of steps, and probably several errors. There is extensive information in the API Readme . However, this information is general and in some parts lacks detail for our concrete task. So, we found it useful to include below a detailed tutorial describing our experience.

On a high level, the steps to take are:

i. Tensorflow Object Detection API Installation

ii. Choose and test a model from the model zoo

iii. Configure the pipeline.config file

iv. Test the training process locally

v. Train with GPUs using Google Cloud Platform (GCP)

vi. Export and test the final graph

(You find the official reference here)

Install TensorFlow:

pip install tensorflow

Create a new directory tensorflow

Clone the entire models GitHub repository from the tensorflow directory.

This will take 1.2 GB on disk, as it contains models for many different tasks (NLP, GAN, ResNet…). Our model is found in

and many of the commands below will be run from /tensorflow/models/research/

Once installed, the API provides the following tools and scripts that we will use to fine-tune a model with our data:

In the model zoo you find a list of pre-trained models to download, as well as some basic stats regarding accuracy and speed. These models are pre-trained with datasets like the COCO dataset or the Open Images dataset . The COCO dataset, for example, consists of more than 200K images, with 1.5 million object instances labeled within them, belonging to 80 different object categories.

Each pre-trained model contains:

The pre-trained model can already be tested for inference. As it is not fine-tuned for our requirements (detect traffic lights and classify them into red, yellow, or green), the results will not be satisfactory for us. However, it is a good exercise to get familiar with the inference script in the Object Detection Demo Jupyter notebook. This notebook downloads the model automatically for you. If you download it manually to a directory of your choice, as you will need to do when fine-tuning, you can comment out the lines in the "Download Model" section and input the correct local path in

As it happens, the COCO dataset includes "Traffic Light" as an object category, so when we run the inference script on one of our images, this class will probably be recognized. However, the model as it is will not be able to classify the traffic light state. Below you can see the result on a general picture and on one of our pictures from the simulator.

The parameters to configure the training and evaluation process are described in the pipeline.config file. When you download the model you get the configuration corresponding to the pre-training.

The pipeline.config file has five sections: model{…}, train_config{…}, train_input_reader{…}, eval_config{…}, and eval_input_reader{…}. These sections contain parameters pertaining to the model training (dropout, dimensions…), and the training and evaluation process and data.

This can be used for our fine-tuning with some modifications:

If you follow this tutorial, you will first set your paths to your local folders to train the model locally. However, when you go on to train the model on the cloud, you have to remember to change the paths to your GCP buckets, as described below.

Training without a GPU would take far too long to be practical and, even though I have a GPU on my computer, I didn't figure out how to use it with this API. Anyway, running locally is useful to test your setup, as there are lots of things that can go wrong, and the latency when sending a job to the cloud can slow down your debugging considerably.

Training is done using the script model_main.py in

The script needs the training and evaluation data, as well as the pipeline.config , as described above. You will pass some parameters to the script on the command line, so first set the following environment variables from the terminal.

MODEL_DIR points to a new folder where you want to save your new fine-tuned model. The path to the pre-trained model is already specified in your pipeline.config under the fine_tune_checkpoint parameter.

Feel free to set a lower number for NUM_TRAIN_STEPS , as you will not have the patience to run 50,000. On my system 1,000 was a good number for testing purposes.

Finally, you can run the script using Python from the command line:

Although you can use any cloud service, like Amazon's AWS or Microsoft's Azure, we thought Google Cloud would have better compatibility with Google's Tensorflow API. To use it you first need to set up your own GCP account. You will get $200 of credit with your new GCP account, which will be more than enough to do all the work in this tutorial.

The details of setting up your GCP account (not trivial) are out of the scope of this tutorial, but you basically need to create a project, where your work will be executed, and a bucket, where your data will be stored. Check the following official documentation:

After this, running a training work on the cloud is very similar to running it locally, with the following additional steps:

You are now almost ready to test your fine-tuned model! First download the new model in gs://${MODEL_DIR} to your computer. From this model you will create the frozen graph frozen_inference_graph.pb , which will be the new input to the Object Detection Demo Jupyter notebook.

Exporting is done using the script export_inference_graph.py in

This script also needs the following parameters to be passed on the command line.

You now have in your ${EXPORT_DIR} the frozen graph frozen_inference_graph.pb . This file, together with your new label_map.pbtxt , is the input to the Jupyter notebook as described in section ii. Choose and test a model from the model zoo . We got the following results:

As you can see, now instead of “traffic light” we get the traffic light status as defined in our label_map.pbtxt :)

6. Final Integration

The ROS system provided by Udacity reserves a space to implement the classifier. You find our implementation in tl_classifier.py .

The classifier can be implemented here with the same logic as the Object Detection Jupyter Notebook discussed above. However our implementation resembles more closely the Udacity Object Detection Lab . The two implementations are equivalent, but the latter is simpler, easier to read, and quicker.

The fine-tuned model outputs several bounding boxes, classes, and scores corresponding to the different objects detected in the image. The scores reflect the confidence level of each detected object. We first filter the objects using a confidence threshold of 70% applied to the scores. We then take the remaining box with the highest score as the traffic light state present in the picture.
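That selection logic can be sketched in plain Python (function and constant names are illustrative, not the actual tl_classifier.py code):

```python
SCORE_THRESHOLD = 0.7  # confidence threshold described above

def pick_traffic_light(classes, scores):
    """Keep detections above the threshold and return the class of the
    highest-scoring one, or None if nothing passes.

    Illustrative sketch of the tl_classifier.py post-processing.
    """
    survivors = [(s, c) for s, c in zip(scores, classes) if s >= SCORE_THRESHOLD]
    if not survivors:
        return None
    # max over (score, class) tuples picks the highest-scoring detection
    return max(survivors)[1]
```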

Two libraries were used to process images, both in the dataset generation and in the ROS system: PIL and CV2. These libraries use different image formats: PIL works in RGB, and CV2 in BGR. To correct this discrepancy we reverse the channel order in the image numpy array.
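Swapping between PIL's RGB and OpenCV's BGR is just reversing the channel order of each pixel; a pure-Python sketch (with numpy the same operation is `image[:, :, ::-1]`):

```python
def rgb_to_bgr(image):
    """Reverse the channel order of every pixel.

    image: nested lists [rows][cols][r, g, b].
    The conversion is its own inverse, so it also maps BGR back to RGB.
    """
    return [[px[::-1] for px in row] for row in image]
```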

Our code also includes a conditional section to save the annotated images to disk for debugging purposes.

i. Model Evaluation

We trained three different models from the model zoo: ssd_mobilenet_v1_coco, ssd_inception_v2_coco, and faster_rcnn_inception_v2_coco. As detailed in the model zoo table, the more accurate the model, the longer its evaluation time. In our case SSD-Mobilenet is the fastest and Faster-RCNN the most accurate.

We trained our models both for the simulator environment and for the real images at the Udacity site contained in the Rosbag file.

For the simulator images the SSD-Mobilenet model was quite accurate and, being the fastest, we chose it for our frozen graph. A final video with our results on the simulator and the annotated images, as well as the ROS console output, is included in Udacity_Capstone_video .

For the real site, however, the accuracy was not as high, and we finally decided to use Faster-RCNN despite the higher evaluation time. To speed up the evaluation we processed only the upper half of the image, as the lower part contains the car's hood, which is not relevant for the task. Example videos of the classification accuracy of the different models are included in SSD-Mobilenet , SSD-Inception , and Faster-RCNN .



A site about data science, machine learning, big data, and various applications.

Udacity Data Engineering Capstone Project

In this long post I present the project I developed for Udacity's Data Engineering Nanodegree (DEND). What to build was left to the developer's choice, provided certain criteria were met, for example working with a database of at least 3 million records.

This is the first notebook of the project; the second contains examples of queries that can be run on the data lake.

Data Lake with Apache Spark

Data Engineering Capstone Project

Project Summary

The Organization for Tourism Development (OTD) wants to analyze migration flows in the USA, in order to find insights to significantly and sustainably develop tourism in the USA.

To support their core idea, they have identified a set of analyses/queries they want to run on the available raw data.

The project deals with building a data pipeline to go from the raw data to the data insights on the migration flows.

The raw data are gathered from different sources, saved in files and made available for download.

The project shows the execution and decisional flow, specifically:

1. Scope of the Project

The OTD wants to run pre-defined queries on the data on a periodic schedule.

They also want to maintain the flexibility to run different queries on the data, using BI tools connected to an SQL-like database.

The core data is the dataset of I94 forms (requests of admission to the USA) filed with US government agencies.

They also have other lower-value data available that are not part of the core analysis and whose use is still unclear; these are stored in the data lake for possible future use.

1.1 What Data

The following datasets are used in the project:

1.2 What Tools

Because of the nature of the data and of the analyses to be performed (not time-critical; monthly or weekly batches), the choice fell on a cheaper S3-based data lake with on-demand, on-the-fly analytical capability: an EMR cluster with Apache Spark , and optionally Apache Airflow for scheduled execution (not implemented here).

The architecture shown below has been implemented.


1.3 The I94 Immigration Data

The data are provided by the US National Tourism and Trade Office . It is a collection of all I94 forms filed in 2016.

1.3.1 What is an I94?

To give some context, it is useful to explain what an I94 form is.

From the government website : “The I-94 is the Arrival/Departure Record, in either paper or electronic format, issued by a Customs and Border Protection (CBP) Officer to foreign visitors entering the United States.”

1.3.2 The I94 Dataset

Each record contains these fields:

More details in the file I94_SAS_Labels_Descriptions.SAS

1.3.3 The SAS Date Format

SAS represents any date D0 as the number of days between D0 and 1 January 1960.
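In Python, assuming that convention, the conversion is a simple offset from the 1960 epoch:

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)

def sas_to_date(days):
    """Convert a SAS numeric date (days since 1960-01-01) to a date.

    Returns None for missing (null) values.
    """
    if days is None:
        return None
    return SAS_EPOCH + timedelta(days=int(days))
```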

1.3.4 Loading I94 SAS Data

The package saurfang:spark-sas7bdat:2.0.0-s_2.11 and the dependency parso-2.0.8 are needed to read SAS data format.

To load them, use the config option spark.jars and give the URLs of the repositories, as Spark itself wasn't able to resolve the dependencies.
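A sketch of that session setup; the jar URLs below are placeholders, not the actual repository paths:

```python
from pyspark.sql import SparkSession

# The two jar URLs are illustrative placeholders -- substitute the actual
# repository locations of spark-sas7bdat and its parso dependency.
spark = (SparkSession.builder
         .appName("I94-data-lake")
         .config("spark.jars",
                 "https://example.org/jars/spark-sas7bdat-2.0.0-s_2.11.jar,"
                 "https://example.org/jars/parso-2.0.8.jar")
         .getOrCreate())
```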

1.4 World Temperature Data

The dataset is from Kaggle. It can be found here .

The dataset contains temperature data:

land temp

1.5 Airport Codes Data

This is a table of airport codes and information on the corresponding cities, such as GPS coordinates, elevation, country, etc. It comes from the Datahub website .

airport codes

1.6 U.S. City Demographic Data ¶

The dataset comes from OpenSoft. It can be found here .

us city demo

2. Data Exploration ¶

In this chapter we identify data quality issues, like missing values, duplicate data, etc.

The purpose is to define the steps of the data pipeline that programmatically correct these issues.

In this step we work on local data.

2.1 The I94 dataset ¶

2.2 I94 SAS data load ¶

To read SAS data format I need to specify the com.github.saurfang.sas.spark format.

Most columns are categorical data; this means the information is coded. For example, in I94CIT=101 , 101 is the country code for Albania.

Other columns represent integer data.

There is clearly no need for these fields to be defined as double => let's cast them to integer

Verifying the schema is correct.

These fields come in a simple string format. To be able to run time-based queries, they are converted to date type

A date in SAS format is simply the number of days between the chosen date and the reference date (01-01-1960)
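As a minimal sketch of this conversion (pure Python here; in the project the same arithmetic runs inside Spark, and the function name sas_to_date is mine):

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # the SAS reference date

def sas_to_date(sas_days: int) -> date:
    """Convert a SAS date (days since 1960-01-01) to a Python date."""
    return SAS_EPOCH + timedelta(days=int(sas_days))

print(sas_to_date(0))      # 1960-01-01
print(sas_to_date(20454))  # 2016-01-01
```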

2.3 Explore I94 data ¶

I want to know the 10 most represented nations

The i94res code 135, where the highest number of visitors come from, corresponds to the United Kingdom, as can be read in the accompanying file I94_SAS_Labels_Descriptions.SAS

New York City port registered the highest number of arrivals.

2.4 Cleaning the I94 dataset ¶

These are the steps to perform on the I94 database:

The count of rows containing nulls equals the total number of rows. It means there is at least one null in each row of the dataframe.

There are many nulls in many columns.

The question is whether there is a need to correct/fill those nulls.

Looking at the data, it seems like some fields have been left empty for lack of information.

Because these are categorical data there is no use, at this step, in assigning arbitrary values to the nulls.

The nulls are not going to be filled a priori, but only if a specific need comes up.
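The null-counting logic behind this analysis can be sketched in plain Python (in the notebook it runs as a Spark aggregation; the sample rows below are illustrative, not real I94 records):

```python
# Illustrative rows; in the project these come from the I94 Spark dataframe.
rows = [
    {"i94cit": 101, "occup": None, "gender": "M"},
    {"i94cit": 135, "occup": None, "gender": None},
    {"i94cit": None, "occup": "STU", "gender": "F"},
]

# Nulls per column
nulls_per_column = {
    col: sum(1 for r in rows if r[col] is None) for col in rows[0]
}

# Rows containing at least one null
rows_with_nulls = sum(1 for r in rows if any(v is None for v in r.values()))

print(nulls_per_column)  # {'i94cit': 1, 'occup': 2, 'gender': 1}
print(rows_with_nulls)   # 3 -> equals len(rows): every row has a null
```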

Dropping duplicate rows

Checking whether the number changed

No rows have been dropped => no duplicate rows

This gives confidence in the consistency of the data

2.5 Store I94 data as parquet ¶

I94 data are stored in parquet format in an S3 bucket, partitioned by the fields year and month

2.6 The Airport codes dataset ¶

A snippet of the data

How many records?

There are no duplicates

We discover there are some null fields:

The nulls are in these columns:

No action taken to fill the nulls

Finally, let’s save the data in parquet format in our temporary folder mimicking the S3 bucket.

3. The Data Model ¶

The core of the architecture is a data lake , with S3 storage and EMR processing.

The data are stored into S3 in raw and parquet format.

Apache Spark is the tool chosen for analytical tasks, therefore all data are loaded into Spark dataframes using a schema-on-read approach.

For SQL-style queries on the data, Spark temporary views are generated.

3.1 Mapping Out Data Pipelines ¶

data lineage

4. Run Pipeline to Model the Data ¶

4.1 Provision the AWS S3 infrastructure ¶

Reading credentials and configuration from file

Create the bucket if it does not exist

4.2 Transfer raw data to S3 bucket ¶

Transfer the data from current shared storage (currently Udacity workspace) to S3 lake storage.

A naive metadata system is implemented. It uses a json file to store basic information on each file added to the S3 bucket:
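A naive version of this metadata registration can be sketched as follows (the json layout, the field names, and the helper register_upload are my illustration, not the exact project code):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

METADATA_FILE = Path("metadata.json")  # illustrative path

def register_upload(bucket: str, key: str, source: str) -> dict:
    """Append basic information on a file added to the S3 bucket."""
    entries = json.loads(METADATA_FILE.read_text()) if METADATA_FILE.exists() else []
    entry = {
        "bucket": bucket,
        "key": key,
        "source": source,
        "uploaded_at": datetime.now(timezone.utc).isoformat(),
    }
    entries.append(entry)
    METADATA_FILE.write_text(json.dumps(entries, indent=2))
    return entry

entry = register_upload("my-lake-bucket", "raw/i94/jan16.sas7bdat", "Udacity workspace")
```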

These datasets are moved to the S3 lake storage:

4.3 EMR cluster on EC2 ¶

An EMR cluster on EC2 instances with Apache Spark preinstalled is used to perform the ELT work.

A 3-node cluster of m5.xlarge instances is configured by default in the config.cfg file.

If the performance requires it, the cluster can be scaled up to use more nodes and/or bigger instances.
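As an illustration, the relevant fragment of config.cfg could look like this (the section and key names are assumptions; the actual file may use different ones):

```ini
[EMR]
; 3-node cluster of m5.xlarge instances by default; scale up here if needed
INSTANCE_TYPE = m5.xlarge
INSTANCE_COUNT = 3
RELEASE_LABEL = emr-5.28.1
```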

After the cluster has been created, the steps that execute the Spark cleaning jobs are added to the EMR job flow; the steps live in separate .py files. These steps are added:

The cluster is set to auto-terminate by default after executing all the steps.

4.3.1 Provision the EMR cluster ¶

Create the cluster using the code emr_cluster.py [Ref. 3] and emr_cluster_spark_submit.py and set the steps to execute spark_script_1 and spark_script_2 .

These scripts have already been previously uploaded to a dedicated folder in the project’s S3 bucket, and are accessible from the EMR cluster.

The file spark_4_emr_codes_extraction.py contains the code for the following paragraph 4.3.1

The file spark_4_emr_I94_processing.py contains the code for the following paragraphs 4.3.2, 4.3.3, 4.3.4

4.3.2 Coded fields: I94CIT and I94RES ¶

I94CIT, I94RES contain codes indicating the country where the applicant is born (I94CIT), or resident (I94RES).

The data is extracted from I94_SAS_Labels_Descriptions.SAS . This can be done sporadically, or every time a change occurs, for example when a new code has been added.

The conceptual flow below was implemented.

data transform

The first steps are to define credentials to access S3A, then load the data into a dataframe as a single row

Find the section of the file where I94CIT and I94RES are specified.

It starts with I94CIT & I94RES and finishes with the semicolon character.

To match the section, it is important to have the complete text in a single row; I did this using the option wholetext=True in the previous dataFrame read operation

Now I can split it into a dataframe with multiple rows

I filter the rows with the structure code = 'country'

And then create two different columns, code and country

I can finally store the data in a single file in json format
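The extraction flow above can be sketched with plain-Python regexes (in the project it runs on Spark dataframes; the excerpt of the SAS labels file below is illustrative, and the entry format code = 'COUNTRY' is assumed from the description above):

```python
import re

# Illustrative excerpt of I94_SAS_Labels_Descriptions.SAS, read as a single string
sas_labels_text = """
/* I94CIT & I94RES - country codes */
value i94cntyl
   101 =  'ALBANIA'
   135 =  'UNITED KINGDOM'
   236 =  'AFGHANISTAN'
;
"""

# 1. Match the section starting at "I94CIT & I94RES" and ending at the semicolon
section = re.search(r"I94CIT & I94RES(.*?);", sas_labels_text, re.DOTALL).group(1)

# 2. Keep the rows with the structure  code = 'country'  and split them
pairs = re.findall(r"(\d+)\s*=\s*'([^']*)'", section)

codes = {int(code): country for code, country in pairs}
print(codes[135])  # UNITED KINGDOM
```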

4.3.3 Coded field: I94PORT ¶

The I94PORT codes are extracted similarly

The complete code for codes extraction is in spark_4_emr_codes_extraction.py

4.3.4 Data cleaning ¶

The cleaning steps have already been shown in section 2; here they are only summarized

4.3.5 Save clean data (parquet/json) to S3 ¶

The complete code, refactored and modularized, is in spark_4_emr_I94_processing.py

As a side note, saving the test file as parquet takes about 3 minutes on the provisioned cluster. The complete script execution takes 6 minutes.

4.3.6 Loading, cleaning and saving airport codes ¶

4.4 Querying data on-the-fly ¶

The data in the data lake can be queried in place. That is, the Spark cluster on EMR operates directly on the S3 data.

There are two possible ways to query the data:

We see examples of both programming styles.

These are some typical queries that are run on the data:

The queries are collected in the Jupyter notebook Capstone project 1 – Querying the data lake.ipynb

4.5 Querying data using the SQL querying style ¶

4.6 Data quality checks ¶

The query-in-place concept implemented here uses a very short pipeline: data are loaded from S3 and, after a cleaning process, are saved as parquet. Quality of the data is guaranteed by design.
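Even with quality ensured by design, the kind of basic checks that could run after each load can be sketched as follows (pure Python over row counts; the function names are mine, and the counts shown are illustrative):

```python
def check_non_empty(row_count: int, name: str) -> None:
    """The table must contain data after the load."""
    if row_count <= 0:
        raise ValueError(f"Quality check failed: {name} is empty")

def check_no_duplicates(total_rows: int, distinct_rows: int, name: str) -> None:
    """Dropping duplicates must not change the row count."""
    if total_rows != distinct_rows:
        raise ValueError(f"Quality check failed: {name} has duplicate rows")

# Illustrative counts, as they would come from df.count() and
# df.dropDuplicates().count() on the Spark dataframe.
check_non_empty(1000, "i94")
check_no_duplicates(1000, 1000, "i94")
```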

5. Write Up ¶

The project has been set up with scalability in mind. All components used, S3 and EMR, offer a high degree of scalability, both horizontal and vertical.

The tool used for the processing, Apache Spark, is the de facto tool for big data processing.

To achieve such a level of scalability we sacrificed processing speed. A data warehouse solution with a Redshift database or an OLAP cube would have been faster at answering the queries. However, nothing forbids adding a DWH to stage the data in case of a more intensive, real-time responsive usage of the data.

An important part of an ELT/ETL process is automation. Although it has not been touched on here, I believe the code developed here can be automated with reasonably small effort. A tool like Apache Airflow can be used for the purpose.

Scenario extension ¶

In an increased-data scenario, the EMR hardware needs to be scaled up accordingly. This is done by simply changing the configuration in the config.cfg file. Apache Spark is the tool for big data processing, and is already used as the project's analytic tool.

In this case an orchestration tool like Apache Airflow is required. A DAG that triggers Python scripts and Spark job executions needs to be scheduled for daily execution at 7 am.

The results of the queries for the dashboard can be saved in a file.

A proper database wasn't used; instead, Amazon S3 is used to store the data and query it in place. S3 is designed with massive scale in mind and is able to handle sudden traffic spikes. Therefore, access to the data by many people shouldn't be an issue.

The code used in the project provisions an EMR cluster for any user that plans to run queries. 100+ EMR clusters are probably going to be expensive for the company. A more efficient sharing of processing resources must be realized.

6. Lessons learned ¶

EMR 5.28.1 uses Python 2 as default ¶

Adding jars packages to Spark ¶

For some reason, adding the packages in the Python program when instantiating the SparkSession doesn't work (error message: package not found). This doesn't work:

The packages must be added in the spark-submit:
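For illustration, the working invocation would be along these lines (the package coordinates come from section 1.3.4; the repository URL and the script name are assumptions):

```shell
spark-submit \
  --repositories https://repos.spark-packages.org \
  --packages saurfang:spark-sas7bdat:2.0.0-s_2.11 \
  spark_4_emr_I94_processing.py
```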

Debugging Spark on EMR ¶

While everything works locally, it doesn't necessarily mean that it is going to work on the EMR cluster. Debugging the code is easier with SSH on EMR.

Reading an S3 file from Python is tricky ¶

While reading with Spark is straightforward (one just needs to give the address s3://…. ), with plain Python boto3 must be used.
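A sketch of the boto3 read (the helper read_s3_text is mine; it takes the client as a parameter, and the real boto3 call is shown commented out so the sketch stays self-contained):

```python
def read_s3_text(s3_client, bucket: str, key: str) -> str:
    """Read an S3 object's body as text using a boto3-style client."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return response["Body"].read().decode("utf-8")

# Real usage (requires boto3 and AWS credentials):
# import boto3
# text = read_s3_text(boto3.client("s3"), "my-lake-bucket", "raw/i94res.csv")
```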

Transferring files to S3 ¶

During the debugging phase, when the code on S3 must be changed many times, using the web interface is slow and impractical ( permanently delete ). Memorize this command: aws s3 cp <local file> <s3 folder>

Removing the content of a directory from Python ¶

import shutil

dirPath = 'metastore_db'
shutil.rmtree(dirPath)

7. References ¶
