Udacity Data Engineering Nanodegree Capstone Project

Modingwa/Data-Engineering-Capstone-Project
Data Engineering Capstone Project
Project Summary
The objective of this project was to create an ETL pipeline for the I94 immigration, global land temperatures and US demographics datasets to form an analytics database on immigration events. A use case for this analytics database is to find immigration patterns to the US. For example, we could try to find answers to questions such as: do people from countries with warmer or colder climates immigrate to the US in larger numbers?
Data and Code
All the data for this project was loaded into S3 prior to commencing the project. The exception is the i94res.csv file, which was loaded into the Amazon EMR HDFS filesystem.
In addition to the data files, the project workspace includes:
- etl.py - reads data from S3, processes that data using Spark, and writes processed data as a set of dimensional tables back to S3
- etl_functions.py and utility.py - these modules contain the functions for creating fact and dimension tables, data visualizations and cleaning.
- config.cfg - contains configuration that allows the ETL pipeline to access AWS EMR cluster.
- Jupyter Notebook - the notebook that was used for building the ETL pipeline.
Prerequisites
- AWS EMR cluster
- Apache Spark
- configparser; Python 3 is needed to run the Python scripts (a usage sketch is shown below).
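As an illustration, here is a minimal sketch (assuming hypothetical section and key names in config.cfg; the real file may differ) of how the scripts might read their configuration with configparser:

```python
import configparser

# Read the project configuration file (section and key names below are hypothetical)
config = configparser.ConfigParser()
config.read('config.cfg')

# Expose the AWS credentials so that Spark and boto3 can pick them up
aws_key = config['AWS']['AWS_ACCESS_KEY_ID']
aws_secret = config['AWS']['AWS_SECRET_ACCESS_KEY']

print('Loaded credentials for access key ending in', aws_key[-4:])
```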
The project follows these steps:
- Step 1: Scope the Project and Gather Data
- Step 2: Explore and Assess the Data
- Step 3: Define the Data Model
- Step 4: Run ETL to Model the Data
- Step 5: Complete Project Write Up
Project Scope
To create the analytics database, the following steps will be carried out:
- Use Spark to load the data into dataframes.
- Exploratory data analysis of I94 immigration dataset to identify missing values and strategies for data cleaning.
- Exploratory data analysis of demographics dataset to identify missing values and strategies for data cleaning.
- Exploratory data analysis of global land temperatures by city dataset to identify missing values and strategies for data cleaning.
- Perform data cleaning functions on all the datasets.
- Create the immigration calendar dimension table from the I94 immigration dataset; this table links to the fact table through the arrdate field (a sketch of this step follows the list).
- Create the country dimension table from the I94 immigration and the global temperatures datasets. The global land temperatures data was aggregated at country level. The table links to the fact table through the country of residence code, allowing analysts to understand correlations between the climate of the country of residence and immigration to US states.
- Create the US demographics dimension table from the US cities demographics data. This table links to the fact table through the state code field.
- Create fact table from the clean I94 immigration dataset and the visa_type dimension.
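As a rough illustration, here is a minimal PySpark sketch of how the immigration calendar dimension could be derived from the arrival date (the arrdate column name comes from the dataset; everything else is a simplifying assumption, not the project's exact code):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("capstone-sketch").getOrCreate()

# Toy stand-in for the cleaned I94 dataframe; arrdate is in SAS date format (days since 1960-01-01)
df_immigration = spark.createDataFrame([(20551.0,), (20552.0,)], ["arrdate"])

calendar_dim = (
    df_immigration
    .select("arrdate")
    .distinct()
    # convert the SAS day count to a proper date, then derive the calendar attributes
    .withColumn("arrival_date", F.expr("date_add(to_date('1960-01-01'), int(arrdate))"))
    .withColumn("day", F.dayofmonth("arrival_date"))
    .withColumn("week", F.weekofyear("arrival_date"))
    .withColumn("month", F.month("arrival_date"))
    .withColumn("year", F.year("arrival_date"))
    .withColumn("weekday", F.dayofweek("arrival_date"))
)
calendar_dim.show()
```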
The technologies used in this project are Amazon S3 and Apache Spark. Data will be read and staged from the customer's repository using Spark.
Refer to the jupyter notebook for exploratory data analysis
3.1 Conceptual Data Model

The country dimension table is made up of data from the global land temperatures by city and the immigration datasets. The combination of these two datasets allows analysts to study correlations between global land temperatures and immigration patterns to the US.
The US demographics dimension table comes from the demographics dataset and links to the immigration fact table at US state level. This dimension allows analysts to get insights into migration patterns into the US based on demographics as well as the overall population of states. We could ask questions such as: do populous states attract more visitors on a monthly basis? One envisions a dashboard based on the data model with drill-downs into granular information on visits to the US. Such a dashboard could foster a culture of data-driven decision making within tourism and immigration departments at state level.
The visa type dimension table comes from the immigration datasets and links to the immigration fact table via the visa_type_key.
The immigration fact table is the heart of the data model. This table's data comes from the immigration datasets and contains keys that link to the dimension tables. The data dictionary of the immigration dataset contains detailed information on the data that makes up the fact table.
3.2 Mapping Out Data Pipelines
The pipeline steps are as follows:
- Load the datasets
- Clean the I94 Immigration data to create Spark dataframe for each month
- Create visa_type dimension table
- Create calendar dimension table
- Extract clean global temperatures data
- Create country dimension table
- Create immigration fact table
- Load demographics data
- Clean demographics data
- Create demographic dimension table
Step 4: Run Pipelines to Model the Data
4.1 Create the data model
Refer to the jupyter notebook for the data dictionary.
4.2 Running the ETL pipeline
The ETL pipeline is defined in the etl.py script, and this script uses the utility.py and etl_functions.py modules to create a pipeline that creates final tables in Amazon S3.
spark-submit --packages saurfang:spark-sas7bdat:2.0.0-s_2.10 etl.py
DataEng-Automated-Data-Pipeline-Project
An automated ETL data pipeline for immigration, temperature and demographics information
Udacity Data Engineering Capstone Project: Automated-Data-Pipeline
Project by Berk Hakbilen
Data pipeline for immigration,temperature and demographics information
Goal of the Project
In this project, the US immigration information is extracted from SAS files, along with temperature and demographics information of the cities from CSV files. The datasets are cleaned and written as JSON datasets to AWS S3. The JSON data is then loaded into staging tables on Redshift and transformed into star schema tables, which also reside on Redshift. The whole pipeline is automated using Airflow.
The database schema was constructed for the following use cases:
- Get information regarding which cities and states were popular destinations for travelers/immigrants, the type of travel (sea/air etc.), age and gender information of travelers/immigrants.
- Obtain correlation between popular immigrations/travel destinations and the temperature and demographic information of the destination.
- Visa details such as visa type, visa expiration, travel purpose of the individuals.
Database Model

The data is stored on a Redshift cluster on AWS. Redshift contains the staging tables, which hold the data from the JSON files in the S3 bucket. The Redshift cluster also contains a fact table called immigrations, where the immigration/travel information of individuals is listed. I used immigrations as the fact table; through the city attribute, queries can be extended to the other dimension tables, such as demographics, temperature and visa details, to obtain further correlations with the immigration/travel information. I opted for the star schema to optimize the database for Online Analytical Processing (OLAP).
Database Dictionary
Please refer to “Airflow/data/I94_SAS_Labels_Descriptions.SAS” for abbreviations.
Immigration Fact Table
Visa Details Dimension Table
Temperature Dimension Table
Demographics Dimension Table
Tools and Technologies Used
The tools and technologies used:
- Apache Spark - Spark was used to read in the large data from the SAS and CSV source files, clean them, and rewrite them to the S3 bucket as JSON files.
- Amazon Redshift - The staging tables as well as fact and dimension tables are created on Redshift.
- Amazon S3 - S3 is used to store the large amounts of JSON data created.
- Apache Airflow - Airflow was used to automate the ETL pipeline.
Source Datasets
The datasets used and sources include:
- I94 Immigration Data : This data comes from the US National Tourism and Trade Office. A data dictionary is included in the workspace. This is where the data comes from.
- World Temperature Data : This dataset came from Kaggle. You can read more about it here .
- U.S. City Demographic Data : This dataset came from Kaggle. You can read more about it here .
- Airport Code Table : This is a simple table of airport codes and corresponding cities. It comes from here.
Airflow: DAG Representation
The convert_to_json.py file used to convert the source data to JSON data in the S3 bucket is intended as a one-time process. Therefore, it is not implemented in the DAG as a task.
Installation
Firstly, clone this repo.
Create a virtual environment called venv and activate it
- If you are using Linux to run Airflow, make sure to install all required python packages: $ pip install -r requirements.txt
- If you are using Windows, you need to launch Airflow using Docker. The required Docker configuration is included in the repo: $ docker-compose up
You can access Airflow UI on your browser by going to:
Use user=user and password=password to log in to Airflow.
Fill in your AWS access and secret keys in the aws.cfg file under Airflow/ directory.
Create an S3 Bucket on AWS and fill in the S3_Bucket in aws.cfg with your bucket address (s3a://your_bucket_name)
Create a new folder called “data_sas” under airflow/data. Download the SAS data files from here and copy them in the folder you just created. Also download “GlobalLandTemperaturesByCity.csv” from here and copy it under airflow/data along with other csv files (Because these data files are too large, they are not included in this repo. Therefore you need to download them manually).
- Change to the airflow/tools directory and run convert_to_json.py. This will read the source files into Spark, transform them to JSON and write them to the given S3 bucket (see the sketch below). python convert_to_json.py
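For orientation, a heavily simplified sketch of what convert_to_json.py might do (file names, bucket path and options are placeholders; refer to the actual script in the repo):

```python
from pyspark.sql import SparkSession

# The sas7bdat reader package is assumed to be supplied via spark-submit --packages
spark = SparkSession.builder.appName("convert_to_json").getOrCreate()

S3_BUCKET = "s3a://your_bucket_name"  # in the real project this comes from aws.cfg

# Read one of the SAS immigration files and the temperature CSV (paths are placeholders)
immigration = (spark.read
               .format("com.github.saurfang.sas.spark")
               .load("data_sas/i94_apr16_sub.sas7bdat"))
temperature = spark.read.csv("data/GlobalLandTemperaturesByCity.csv", header=True)

# Write both datasets to the S3 bucket as JSON, ready for the Redshift staging step
immigration.write.mode("overwrite").json(S3_BUCKET + "/immigration")
temperature.write.mode("overwrite").json(S3_BUCKET + "/temperature")
```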

Under Connections, select Create.
- Conn Id: Enter aws_credentials.
- Conn Type: Enter Amazon Web Services.
- Login: Enter your Access key ID from the IAM User credentials.

- Conn Id: Enter redshift.
- Conn Type: Enter Postgres.
- Host: Enter the endpoint of your Redshift cluster without the port at the end.
- Schema: Enter the Redshift database you want to connect to.
- Password: Enter the password you created when launching your Redshift cluster.

- Run main_dag under DAGS by toggling its switch from off to on.
How to handle possible scenarios
- Because the data is already large, Spark was our choice to read in and process the data. However, if the data increases even more, deploying Spark on an EMR cluster would be a good solution.
- DAGs can be set to run according to a schedule. So the main_dag in Airflow can be set to run every day at 7 am.
- Distribute the fact table and one dimension table on their common columns using DISTKEY.
- Using elastic resize.


Posted on Nov 12, 2020 • Originally published at Medium on Nov 12, 2020
My Capstone Project for Udacity’s Cloud DevOps Engineer Nanodegree

After three months of various DevOps related courses and smaller projects, I had reached the end of my Nanodegree, and it was time to build out my capstone project.
My project can be broken down into two parts: the application itself, and the infrastructure that deploys and hosts it.
The Application: Random Song
Random Song is a simple web app built using TypeScript, Node.js and Express. It serves as a web service that can send you a random song, using the Musixmatch API. To test out the app, simply go to the /random route, and you'll receive a random song object in JSON.
Going to the / route will return:
And going to the /random route will return a random song:
The Infrastructure
After the application was built, the next task was deploying it. In this project I decided to go with a Rolling Deployment. My goal was to write out the necessary configuration files and required build commands, and then create a pipeline to automate the process of actually building the application and deploying the infrastructure. This way, it could be executed in the exact same manner every time I added new code or infrastructure to the project. I needed a server to host Jenkins, my CI/CD technology of choice for this project. After provisioning an AWS EC2 instance and installing Jenkins, it was time to start defining the tasks that I’d want Jenkins to run. After accessing my application’s code, here are the tasks that I created for Jenkins to run:
- Install Node dependencies: Simply running npm install would do the trick.
- Build the application: My application is written in TypeScript, so I needed to run npm run build to build out the JavaScript distribution folder.
- Lint the code: Running npm run lint to make sure everything is up to tslint's standards.
- Build the Docker image: Here Jenkins would build a Docker container based on the Dockerfile that I created. It was based off a simple Node image, and would copy my application code into the container and start it.
- Upload the container to Docker Registry: After being containerized, my application would then be uploaded to the Docker Registry for further availability.
- Create the Kubernetes configuration file: Here I needed to create a Kubernetes deployment file that would be used in the next step to actually deploy my application into a cluster. I used Kubernetes via AWS EKS.
- Deploy application: With the help of my Kubernetes deployment file and my Docker container that I uploaded to the registry, I was now able to deploy my application to my AWS EKS cluster. I also ran kubectl get pods and kubectl get services to make sure everything was running as expected.
In the end, the app is deployed to the cluster and accessible to users. Random songs for days.
Unfortunately, the app is not currently deployed due to EKS not being a cheap service for a student to continuously pay for. However, I’m planning on taking the Random Song application and turning it into something that will be more permanently hosted in a future project. As far as the infrastructure goes, these are also things that can be repurposed in future projects — Docker containers, Kubernetes clusters and Jenkins pipelines are tools that can help build any software related project.
If you’d like to see the code, you can take a look at the project’s repo on GitHub .
Notes on Neural Nets

Udacity Nanodegree Capstone Project
The Udacity Self-Driving Car Nanodegree has been a great experience. Together with the Intro to Self-Driving Cars program, I have spent the last 9 months learning all about Computer Vision, Convolutional Nets, Bayesian probability, and Sensor Fusion. The method used by Udacity was much to my liking, with a series of projects where you learn by doing as you go. The last project was especially in-depth, using ROS to implement a full autonomous vehicle. A big part of the fun in this last project was doing it in a group, meeting great people and learning from them in lively discussions.
Our autonomous car is designed to follow pre-defined waypoints along a road, recognize the traffic light status from camera images, stop on red and restart driving on green. This system was tested on the Udacity simulator and on the real Udacity self-driving car (Carla) upon project delivery.
Members of Team “Always straight to the point!”
Table of contents.
- Workaround to avoid simulator latency issue with camera on
- Perception (tl_detector.py)
- Planning (waypoint_updater.py)
- Control (dbw_node.py)
- Tensorflow Object Detection API Installation
- Choose and test a model from the Model Zoo
- Configure the pipeline.config file
- Test the training process locally
- Train with GPUs using Google Cloud Platform (GCP)
- Export and test the final graph
- Model Evaluation
1. Overview
In order to complete the project we programmed the different ROS nodes in Python. The basic structure is well described in the Udacity Walkthrough, and our implementation follows the given guidelines. This implements the basic functionality of loading the waypoints that the car has to follow, controlling the car's movement along these waypoints, and stopping the car upon encountering a red traffic light. The details of the ROS system are described in System architecture .
After laying out the basic ROS functionality, much focus was given to implementing the traffic light detection from images collected by the camera, both in the simulator and on the real testing lot. We decided in favor of using a Tensorflow model pre-trained on the general task of object detection. To prepare for this part of the project we read the previous work of Alex Lechner on the Udacity Nanodegree, as well as the Medium post of Vatsal Srivastava . We also used their datasets for test and validation.
To fine-tune this model to our task of recognizing traffic lights (red, yellow, and green) we generated thousands of labeled training images. This process is described in Datasets .
The training on those images was done using the Tensorflow Object Detection API and Google Cloud Platform, as described in Traffic Light Classifier .
The integration of our Tensorflow Traffic Light Classifier into the ROS system is described in Final Integration .
However, before getting into the details we describe a workaround we needed to use to finish our tests satisfactorily, and solve a latency problem in the simulator when the camera is switched on.
2. Workaround to avoid simulator latency issue with camera on
After implementing the basic ROS functionality the car can complete a full lap in the simulator without issues. However, to fully implement the traffic light recognition with a classifier we need to activate the camera in the simulator. With this, the simulator begins to send image data to the /image_color topic. This data processing seems to overload our system, and a latency appears that delays the updating of the waypoints relative to the position of our car. The waypoints begin to appear behind the car and, as the car tries to follow these waypoints, the control subsystem gets erratic and the car drives off the road.
We found this problem both in the virtual machine and in a native Linux installation. It is also observed by many Udacity Nanodegree participants, as seen in these GitHub issues: Capstone Simulator Latency #210 and turning on the camera slows car down so auto-mode gets messed up #266 . An example of the issue on our side is shown below:

We implemented a workaround with a small modification to one of the files provided by Udacity in the ROS system, bridge.py . This module builds the node styx_server , which creates the topics responsible for transmitting different data out of the simulator. We first tried to only process some of the images received via /image_color , but it seemed the origin of the delay was the presence of these images in the topic in the first place. Thus, we implemented the skipping logic in the topic itself, and the issue finally got solved. The code modifies the publish_camera() function:
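The exact diff is not reproduced here, but the idea is sketched below (the counter, the skip factor, and the wrapper class are illustrative assumptions, not the team's exact code):

```python
# Illustrative sketch of the frame-skipping idea added around publish_camera() in bridge.py
CAMERA_SKIP_FRAMES = 4   # forward roughly 1 out of every 5 camera frames


class CameraThrottle(object):
    """Wraps the existing image publishing call and drops most camera frames."""

    def __init__(self, publish_fn):
        self.publish_fn = publish_fn   # the original publisher call used by bridge.py
        self.frame_count = 0

    def publish_camera(self, image_msg):
        self.frame_count += 1
        # Only forward every (CAMERA_SKIP_FRAMES + 1)-th frame to /image_color
        if self.frame_count % (CAMERA_SKIP_FRAMES + 1) == 0:
            self.publish_fn(image_msg)
```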
Furthermore, the waypoints queue was reduced from 200 to 20, which also proved to speed up the simulator considerably before implementing this workaround.
However, this method only allowed us to get rid of the latency in a native Linux installation. On a virtual machine under Windows and on the Udacity Web Workspace, the latency got better, maybe with increased values of skipped frames, but still showed up after some time.
3. ROS System Architecture
The ROS system can be divided in three main subsystems:
- Perception: detects traffic lights and classifies them into red, yellow, green, or unknown
- Planning: loads the circuit waypoints and updates the waypoint list in front of our car
- Control: makes the car actually move along the waypoints by controlling the car's throttle, steering, and brake using a PID controller and a low-pass filter
The diagram below shows the subsystem division, as well as the ROS nodes and topics.
- tl_detector: in the perception subsystem.
- waypoint_updater: in the planning subsystem
- dbw_node: in the control subsystem
- topics: are named buses over which nodes send and receive messages, by subscribing or publishing to them.

i. Perception (tl_detector.py)
This node subscribes to four topics:
- /base_waypoints : provides the complete list of waypoints for the course.
- /current_pose : determines the vehicle’s location.
- /image_color : provides an image stream from the car’s camera.
- /vehicle/traffic_lights : provides the (x, y, z) coordinates of all traffic lights.
This node will find the waypoint of the closest traffic light in front of the car. This point will be described by its index counted from the car (e.g. the 12th waypoint ahead of the car's position). Then, the state of the traffic light will be acquired from the camera image in /image_color using the classifier implementation in tl_classifier.py . If the traffic light is red, it will publish the waypoint index to the /traffic_waypoint topic. This information will be used by the Planning subsystem to define the desired velocity at the next sequence of waypoints. A stripped-down sketch of this flow is shown below.
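Here is that sketch (message types and helper names are simplified assumptions, not the project's exact code):

```python
import rospy
from std_msgs.msg import Int32


class TrafficLightDetectorSketch(object):
    """Simplified illustration of the publishing logic in tl_detector.py."""

    def __init__(self, classifier, find_closest_light_waypoint):
        self.classifier = classifier                          # wraps tl_classifier.py
        self.find_closest_light_waypoint = find_closest_light_waypoint
        self.traffic_pub = rospy.Publisher('/traffic_waypoint', Int32, queue_size=1)

    def image_cb(self, image_msg):
        # Index of the waypoint of the nearest traffic light ahead of the car
        light_wp = self.find_closest_light_waypoint()
        state = self.classifier.get_classification(image_msg)  # e.g. 'red', 'yellow', 'green'
        # Publish the waypoint index only for red lights; -1 otherwise
        self.traffic_pub.publish(Int32(light_wp if state == 'red' else -1))
```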
ii. Planning (waypoint_updater.py)
This node subscribes to the topics:
- /base_waypoints : list of all waypoints for the track
- /current_pose : the current position coordinates of our car
- /traffic_waypoint : the waypoint index of the next red traffic light on our circuit
It publishes a list of waypoints in front of our car to the topic /final_waypoints . The waypoint data also includes the desired velocity of the car at the given waypoint. If a red traffic light is detected in front of the car, we modify the desired velocities of the /final_waypoints leading up to it so that the car slowly stops at the right place.
The number of waypoints is defined by the parameter LOOKAHEAD_WPS . If this parameter is too big, there is a large latency when updating the waypoints, to the point that the car gets ahead of the list of waypoints. This confuses the control of the car, which tries to follow the waypoints. We settled on a value of 20, to get rid of this latency while still having enough data to properly control the car.
iii. Control (dbw_node.py)
In the control subsystem, Udacity provides the Autoware software waypoint_follower.py . After /final_waypoints is published, this software publishes twist commands to the /twist_cmd topic, which contain the desired linear and angular velocities.
dbw_node.py subscribes to /twist_cmd , /current_velocity , and /vehicle/dbw_enabled . It passes the messages in these nodes to the Controller class from twist_controller.py . We implemented here the control of the car, using the provided Yaw Controller, PID Controller, and LowPass Filter.
It is important to perform the control only when /vehicle/dbw_enabled is true. When this topic message is false, it means the car is under manual control. In this condition the PID controller would mistakenly accumulate error.
The calculated throttle, brake, and steering are published to the topics:
- /vehicle/throttle_cmd
- /vehicle/brake_cmd
- /vehicle/steering_cmd
4. Datasets
(By Andrei Sasinovich )
After we got our dataset of images from the simulator and the rosbag, we started to think about how to label it. The first option was to label it by hand, but when we looked at the number of collected images (more than 1k) we decided that it was not a good way to go, as we had other work to do 😊
We decided to generate a dataset instead. We cropped 10-15 traffic lights of each color.

Using OpenCV we generated thousands of images by resizing the traffic lights and changing contrast and brightness. As every traffic light was applied onto a background by our script, we could generate the coordinates of the drawn bounding boxes as well (see the sketch below).
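A condensed sketch of that generation idea (file paths, sizes and jitter ranges are arbitrary examples, not the exact script):

```python
import random

import cv2
import numpy as np


def paste_light(background_path, light_path):
    """Paste a cropped traffic light onto a background at a random size and position,
    and return the augmented image together with the known bounding box."""
    bg = cv2.imread(background_path)
    light = cv2.imread(light_path)          # assumed much smaller than the background

    # Random resize plus brightness/contrast jitter
    scale = random.uniform(0.5, 1.5)
    light = cv2.resize(light, None, fx=scale, fy=scale)
    alpha = random.uniform(0.7, 1.3)        # contrast
    beta = random.randint(-30, 30)          # brightness
    light = cv2.convertScaleAbs(light, alpha=alpha, beta=beta)

    # Random position; since we place the light ourselves, the box coordinates are exact
    h, w = light.shape[:2]
    y = random.randint(0, bg.shape[0] - h)
    x = random.randint(0, bg.shape[1] - w)
    bg[y:y + h, x:x + w] = light
    bbox = (x, y, x + w, y + h)             # xmin, ymin, xmax, ymax
    return bg, bbox
```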

A TFRecord file was created on the fly by packing all our info into the TF format using the TensorFlow function tf.train.Example .
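Roughly, the packing looks like this (a minimal sketch using only the encoded image and one box; the real records also carry image sizes, filenames and class text per the Object Detection API conventions):

```python
import tensorflow as tf


def to_tf_example(encoded_jpeg, xmin, ymin, xmax, ymax, label_id):
    """Pack one generated image and its bounding box into a tf.train.Example."""
    feature = {
        'image/encoded': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[encoded_jpeg])),
        'image/object/bbox/xmin': tf.train.Feature(
            float_list=tf.train.FloatList(value=[xmin])),
        'image/object/bbox/ymin': tf.train.Feature(
            float_list=tf.train.FloatList(value=[ymin])),
        'image/object/bbox/xmax': tf.train.Feature(
            float_list=tf.train.FloatList(value=[xmax])),
        'image/object/bbox/ymax': tf.train.Feature(
            float_list=tf.train.FloatList(value=[ymax])),
        'image/object/class/label': tf.train.Feature(
            int64_list=tf.train.Int64List(value=[label_id])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# writer = tf.python_io.TFRecordWriter('train.record')   # TF 1.x API used at the time
# writer.write(to_tf_example(...).SerializeToString())
```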
5. Traffic Light Classifier
The state of the traffic light in front of the car has to be extracted from the camera's images, both in the simulator and at the real site. Different methods of image recognition can be used. We decided to use Deep Learning in the form of a model pre-trained on the general task of object detection. While previously in this Udacity Nanodegree we defined a model from scratch and trained it for traffic sign classification, object detection also includes the capability of locating an object within an image and delimiting its position with a bounding box. Only this way can we extract from the camera image the state of one or several traffic lights within the landscape in front of us.
Several Deep Learning methods for object detection have been developed by researchers. Two of the most popular methods are R-CNN (Regions with CNN), and SSD (Single Shot Detector). While R-CNN performs with higher accuracy than SSD, the latter is faster. Improved versions have been developed (Fast R-CNN, Faster R-CNN) but they are still slower than SSD.
The Google’s Tensorflow Object Detection API provides a great framework to implement our traffic light classifier. This is a collection of pre-trained models and high level subroutines that facilitate the use and fine-tuning of these models. The models are compiled in Tensorflow detection model zoo , belonging mainly to the SSD and Faster R-CNN methods.
Although the goal of the API is to facilitate the fine-tuning of these models, there are still a lot of installation and configuration steps that are not trivial at all. Actually, by the time you have fully trained a model for your purposes you will have gone through a convoluted series of steps, and probably several errors. There is extensive information in the API Readme . However, this information is general and in some parts lacks detail for our concrete task. So, we found it useful to include below a detailed tutorial describing our experience.
On a high level, the steps to take are:
i. Tensorflow Object Detection API Installation
ii. Choose and test a model from the Model Zoo
iii. Configure the pipeline.config file
iv. Test the training process locally
v. Train with GPUs using Google Cloud Platform (GCP)
vi. Export and test the final graph
(You find the official reference here)
Install TensorFlow:
pip install tensorflow
- Install required libraries:
sudo apt-get install protobuf-compiler python-pil python-lxml python-tk
pip install --user Cython
pip install --user contextlib2
pip install --user jupyter
pip install --user matplotlib
Create a new directory tensorflow
Clone the entire models GitHub repository from the tensorflow directory.
This will take 1.2 GB on disk, as it contains models for many different tasks (NLP, GAN, ResNet…). Our model is found under tensorflow/models/research/object_detection,
and many of the commands below will be run from tensorflow/models/research/
- Install COCO API
# From tensorflow/models/research/
git clone https://github.com/cocodataset/cocoapi.git
cd cocoapi/PythonAPI
make
cp -r pycocotools <path_to_tensorflow>/models/research/
- Compile Protobuf
# From tensorflow/models/research/
protoc object_detection/protos/*.proto --python_out=.
- Add libraries to PYTHONPATH
# From tensorflow/models/research/
export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim
- If you run without errors until here, test your installation with python object_detection/builders/model_builder_test.py
Once installed, the API provides the following tools and scripts that we will use to fine-tune a model with our data:
- An inference script in the form of a Jupyter Notebook, to detect objects on an image from a “frozen_inference_graph.pb” ( Object Detection Demo )
- Tools to create TFRecord files from original data ( dataset tools )
- A training script to fine-tune a pre-trained model with our own dataset, locally or in Google Cloud ( model_main.py )
- A script to export a new “frozen_inference_graph.pb” from a fine-tuned model ( export_inference_graph.py )
In the model zoo you find a list of pre-trained models to download, as well as some basic stats regarding accuracy and speed. These models are pre-trained with datasets like the COCO dataset or the Open Images dataset . The COCO dataset, for example, consists of more than 200K images, with 1.5 million labeled object instances belonging to 80 different object categories.
Each pre-trained model contains:
- a checkpoint ( model.ckpt.data-00000-of-00001 , model.ckpt.index , model.ckpt.meta )
- a frozen graph ( frozen_inference_graph.pb ) to be used for out of the box inference
- a config file ( pipeline.config )
The pre-trained model can already be tested for inference. As it is not fine-tuned for our requirements (detect traffic lights and classify them into red, yellow, or green), the results will not be satisfactory for us. However, it is a good exercise to get familiarized with the inference script in the Object Detection Demo Jupyter notebook. This notebook downloads the model automatically for you. If you download it manually to a directory of your choice, as you will need to work with it when fine-tuning, you can comment out the lines in the "Download Model" section and input the correct local path in the notebook.
As it happens that the COCO dataset includes “Traffic Light” as an object category, when we run the inference script with one of our images, this class will probably be recognized. However, the model as it is will not be able to classify the traffic light state. Below you can see the result on a general picture and on one of our pictures out of the simulator.
The parameters to configure the training and evaluation process are described in the pipeline.config file. When you download the model you get the configuration corresponding to the pre-training.
The pipeline.config file has five sections: model{…}, train_config{…}, train_input_reader{…}, eval_config{…}, and eval_input_reader{…}. These sections contain parameters pertaining to the model training (dropout, dimensions…), and the training and evaluation process and data.
This can be used for our fine-tuning with some modifications:
- Change num_classes: 90 to the number of classes that we are going to train the model on. In our case these are the four described in our label_map.pbtxt , (‘red’, ‘green’, ‘yellow’, ‘unknown’)
- max_detections_per_class: 100 and max_total_detections: 300 to max_detections_per_class: 10 and max_total_detections: 10
- Set fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt" to the directory where you placed the pre-trained model
- num_steps: 200000 to num_steps: 20000
- Set num_examples to the number of images in your evaluation data
- Set the PATH_TO_BE_CONFIGURED placeholders in input_path and label_map_path to your .record files and label_map.pbtxt
If you follow this tutorial, you will first set your paths to your local folders to train the model locally. However, when you go to train the model on the cloud, you have to remember to change the paths to your GCP buckets, as described below.
Training without a GPU will take far too long to be practical and, even though I have a GPU on my computer, I didn't figure out how to use it with this API. Anyway, running locally is useful to test your setup, as there are lots of things that can go wrong, and the latency when sending a job to the cloud can significantly delay your debugging.
Training is done using the script model_main.py in the object_detection directory.
The script needs the training and evaluation data, as well as the pipeline.config , as described above. You will send some parameters to the script in the command line. So, first set the following environment variables from the terminal.
MODEL_DIR points to a new folder where you want to save your new fine-tuned model. The path to the pre-trained model is already specified in your pipeline.config under the fine_tune_checkpoint parameter.
Feel free to set a lower number for NUM_TRAIN_STEPS , as you will not have the patience to run 50000. In my system 1000 was a good number for testing purposes.
Finally, you can run the script using Python from the command line:
Although you can use any cloud service like Amazon's AWS or Microsoft's Azure, we thought Google Cloud would have better compatibility with Google's Tensorflow API. To use it you first need to set up your own GCP account. You will get $200 of credit with your new GCP account, which will be more than enough to do all the work in this tutorial.
The details of setting up your GCP account (not trivial) are out of the scope of this tutorial, but you basically need to create a project, where your work will be executed, and a bucket, where your data will be stored. Check the following official documentation:
- Getting started
- Create a Linux VM
- Create a bucket
- Train a TensorFlow Model
After this, running a training work on the cloud is very similar to running it locally, with the following additional steps:
- Packaging: The scripts are currently stored on your computer, not on the cloud. At run time you will send them to the cloud and run them there. They are sent in the form of packages that you create with:
# From tensorflow/models/research/
bash object_detection/dataset_tools/create_pycocotools_package.sh /tmp/pycocotools
python setup.py sdist
(cd slim && python setup.py sdist)
- Create a YAML configuration file: this file describes the GPU setup you will use on the cloud. You can just create a text file with the following content:
trainingInput:
  runtimeVersion: "1.12"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 9
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard
- Upload your data and pre-trained model to your bucket: you can either use the command line with gsutil cp ... or the web GUI on your buckets page.
- Modify and upload your pipeline.config : change the paths for the model and data to the corresponding location in your bucket in the form gs://PRE-TRAINED_MODEL_DIR and gs://DATA_DIR
- Define or redefine the following environment variable in your terminal: PIPELINE_CONFIG_PATH={path to pipeline config file} MODEL_DIR={path to fine-tuned model directory} NUM_TRAIN_STEPS=50000 SAMPLE_1_OF_N_EVAL_EXAMPLES=1
- Send the training job to the cloud with the command: # From tensorflow/models/research/ gcloud ml-engine jobs submit training object_detection_`date +%m_%d_%Y_%H_%M_%S` \ --runtime-version 1.12 \ --job-dir=gs://${MODEL_DIR} \ --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,/tmp/pycocotools/pycocotools-2.0.tar.gz \ --module-name object_detection.model_main \ --region us-central1 \ --config ${PATH_TO_LOCAL_YAML_FILE} \ -- \ --model_dir=gs://${MODEL_DIR} \ --pipeline_config_path=gs://${PIPELINE_CONFIG_PATH} --num_train_steps=${NUM_TRAIN_STEPS} --sample_1_of_n_eval_examples=$SAMPLE_1_OF_N_EVAL_EXAMPLES --alsologtostderr
You are now almost ready to test your fine-trained model! First download the new model in gs://${MODEL_DIR} to your computer. From this model you will create the frozen graph frozen_inference_graph.pb that will be the new input to the Object Detection Demo Jupyter notebook.
Exporting is done using the script export_inference_graph.py in the object_detection directory.
This script also needs the following parameters to be sent in the command line.
You now have in your ${EXPORT_DIR} the frozen graph frozen_inference_graph.pb . This file, together with your new label_map.pbtxt , is the input to the Jupyter notebook as described in section ii. Choose and test a model from the Model Zoo . We got the following results:
As you can see, now instead of “traffic light” we get the traffic light status as defined in our label_map.pbtxt :)
6. Final Integration
The ROS system provided by Udacity reserves a space to implement the classifier. You find our implementation in tl_classifier.py .
The classifier can be implemented here with the same logic as the Object Detection Jupyter Notebook discussed above. However our implementation resembles more closely the Udacity Object Detection Lab . The two implementations are equivalent, but the latter is simpler, easier to read, and quicker.
The fine-tuned model outputs several bounding boxes, classes, and scores corresponding to the different objects detected in the image. The scores reflect the confidence level of each detected object. We first filter the objects using a confidence threshold of 70% applied to the scores. Then we take the remaining box with the highest score as the traffic light state present in the picture.
Two libraries were used to process images, both in the dataset generation and in the ROS system: PIL and CV2. These libraries use different image formats: PIL works in RGB and CV2 in BGR. To correct this discrepancy we reverse the channel order of the image numpy array (see the sketch below).
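Both steps amount to a few lines; a small sketch (the threshold and class names follow the description above, the array layout is assumed):

```python
import numpy as np


def pick_light_state(scores, classes, threshold=0.7):
    """Keep detections above the confidence threshold and return the class of the best one."""
    scores = np.asarray(scores)
    keep = scores >= threshold
    if not np.any(keep):
        return 'unknown'
    best = int(np.argmax(scores * keep))   # highest-scoring detection among the kept ones
    return classes[best]                   # e.g. 'red', 'yellow' or 'green'


def bgr_to_rgb(image_bgr):
    """CV2 delivers BGR while PIL (and the trained model) expect RGB: reverse the channels."""
    return image_bgr[:, :, ::-1]
```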
Our code also includes a conditional section to save the annotated images to disk for debugging purposes.

i. Model Evaluation
We trained three different models from the model zoo: ssd_mobilenet_v1_coco, ssd_inception_v2_coco, and faster_rcnn_inception_v2_coco. As detailed in the model zoo table the more accurate the model is, the longer the evaluation time is. In our case SSD-Mobilenet is the fastest and Faster-RCNN the most accurate.
We trained our models both for the simulator environment and for the real images at the Udacity site contained in the Rosbag file.
For the simulator images the SSD-Mobilenet model was quite accurate and, being the fastest, we chose it for our frozen graph. A final video with our results on the simulator and the annotated images, as well as the ROS console output, is included in Udacity_Capstone_video .
For the real site, however, the accuracy was not as high, and we finally decided to use Faster-RCNN despite the longer evaluation time. To speed up the evaluation we processed only the upper half of the image, as the lower part contains the car's hood, which is not necessary for the task. Example videos of the classification accuracy of the different models are included in SSD-Mobilenet , SSD-Inception , and Faster-RCNN .

In this long post I present the project I developed for Udacity's Data Engineering Nanodegree (DEND). What to develop was the developer's free choice, provided that some criteria were met, for example working with a database of at least 3 million records.
This is the first notebook of the project; the second contains examples of queries that can be run on the data lake.
Data lake with Apache Spark ¶
Data engineering capstone project ¶
Project summary ¶
The Organization for Tourism Development ( OTD ) wants to analyze migration flows into the USA, in order to find insights to significantly and sustainably develop tourism in the USA.
To support their core idea they have identified a set of analysis/queries they want to run on the raw data available.
The project deals with building a data pipeline, to go from raw data to the data insights on the migration flux.
The raw data are gathered from different sources, saved in files and made available for download.
The project shows the execution and decisional flow, specifically:
- Describe the data and how they have been obtained
- Answer the question “how to achieve the target?”
- What infrastructure (storage, computation, communication) has been used and why
- Explore the data
- Check the data for issues, for example null, NaN, or other inconsistencies
- Why this data model has been chosen
- How it is implemented
- Load the data from S3 into the SQL database, if any
- Perform quality checks on the database
- Perform example queries
- Documentation of the project
- Possible scenario extensions
- 1. Scope of the Project
- 1.1 What data
- 1.2 What tools
- 1.3 The I94 immigration data
- 1.3.1 What is an I94?
- 1.3.2 The I94 dataset
- 1.3.3 The SAS date format
- 1.3.4 Loading I94 SAS data
- 1.4 World Temperature Data
- 1.5 Airport Code Table
- 1.6 U.S. City Demographic Data
- 2. Data Exploration
- 2.1 The I94 dataset
- 2.2 I94 SAS data load
- 2.3 Explore I94 data
- 2.4 Cleaning the I94 dataset
- 2.5 Store I94 data as parquet
- 2.6 Airport codes dataset: load, clean, save
- 3. The Data Model
- 3.1 Mapping Out Data Pipelines
- 4. Run Pipelines to Model the Data
- 4.1 Provision the AWS S3 infrastructure
- 4.2 Transfer raw data to S3 bucket
- 4.3 EMR cluster on EC2
- 4.3.1 Provision the EMR cluster
- 4.3.2 Coded fields: I94CIT and I94RES
- 4.3.3 Coded field: I94PORT
- 4.3.4 Data cleaning
- 4.3.5 Save clean data (parquet/json) to S3
- 4.3.6 Loading, cleaning and saving airport codes
- 4.4 Querying data on-the-fly
- 4.5 Querying data using the SQL querying style
- 4.6 Data Quality Checks
- Lesson learned
1. Scope of the Project ¶
The OTD wants to run pre-defined queries on the data, with periodic timing.
They also want to maintain the flexibility to run different queries on the data, using BI tools connected to an SQL-like database.
The core data is the dataset of requests for access into the USA (I94 form) provided by US government agencies.
They also have other lower-value data available that are not part of the core analysis and whose use is unclear; these are stored in the data lake for possible future use.
1.1 What data ¶
Following datasets are used in the project:
- I94 immigration data for year 2016 . Used for the main analysis
- World Temperature Data
- Airport Code Table
- U.S. City Demographic Data
1.2 What tools ¶
Because of the nature of the data and of the analysis to be performed (not time-critical, monthly or weekly batches), the choice fell on a cheaper S3-based data lake with on-demand, on-the-fly analytical capability: an EMR cluster with Apache Spark , and optionally Apache Airflow for scheduled execution (not implemented here).
The architecture shown below has been implemented.

- Starting from a common storage solution (currently Udacity workspace) where both the OTD and its partners have access, the data is then ingested into an S3 bucket , in raw format
- To ease future operations, the data is immediately processed, validated and cleansed using a Spark cluster and stored into S3 in parquet format. Raw and parquet data formats coexist in the data lake.
- By default, the project doesn't use a costly Redshift cluster; instead, data are queried in place on the S3 parquet data.
- The EMR cluster serves the analytical needs of the project. SQL based queries are performed using Spark SQL directly on the S3 parquet data
- A Spark job can be triggered monthly, using the Parquet data. The data is aggregated to gain insights on the evolution of the migration flows
1.3 The I94 immigration data ¶
The data are provided by the US National Tourism and Trade Office . It is a collection of all I94 that have been filed in 2016.
1.3.1 What is an I94? ¶
To give some context, it is useful to explain what an I-94 form is.
From the government website : “The I-94 is the Arrival/Departure Record, in either paper or electronic format, issued by a Customs and Border Protection (CBP) Officer to foreign visitors entering the United States.”
1.3.2 The I94 dataset ¶
Each record contains these fields:
- CICID, unique number of the file
- I94YR, 4 digit year of the application
- I94MON, Numeric month of the application
- I94CIT, country where the applicant was born
- I94RES, country where the applicant is resident
- I94PORT, location (port) where the application is issued
- ARRDATE, arrival date in USA in SAS date format
- I94MODE, how the applicant arrived in the USA
- I94ADDR, US state where the port is
- DEPDATE is the Departure Date from the USA
- I94BIR, age of applicant in years
- I94VISA, what kind of VISA
- COUNT, used for summary statistics, always 1
- DTADFILE, date added to I-94 Files
- VISAPOST, Department of State where the visa was issued
- OCCUP, occupation that will be performed in U.S.
- ENTDEPA, arrival Flag
- ENTDEPD, departure Flag
- ENTDEPU, update Flag
- MATFLAG, match flag
- BIRYEAR, 4 digit year of birth
- DTADDTO, date to which admitted to U.S. (allowed to stay until)
- GENDER, non-immigrant sex
- INSNUM, INS number
- AIRLINE, airline used to arrive in USA
- ADMNUM, admission Number
- FLTNO, flight number of Airline used to arrive in USA
- VISATYPE, class of admission legally admitting the non-immigrant to temporarily stay in USA
More details in the file I94_SAS_Labels_Descriptions.SAS
1.3.3 The SAS date format ¶
Any date D0 is represented as the number of days between D0 and the 1st of January 1960.
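For example, in plain Python:

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)

def sas_to_date(days):
    """Convert a SAS date (days since 1960-01-01) to a Python date."""
    return SAS_EPOCH + timedelta(days=int(days))

print(sas_to_date(20566))   # 2016-04-22
```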
1.3.4 Loading I94 SAS data ¶
The package saurfang:spark-sas7bdat:2.0.0-s_2.11 and its dependency parso-2.0.8 are needed to read the SAS data format.
To load them, use the config option spark.jars and give the URLs of the jars, as Spark itself wasn't able to resolve the dependencies.
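A sketch of the session setup (the jar URLs are placeholders that must point to the spark-sas7bdat and parso jars mentioned above):

```python
from pyspark.sql import SparkSession

# Placeholder URLs: replace <repo> with the actual repository hosting the two jars
SAS_JARS = ("https://<repo>/spark-sas7bdat-2.0.0-s_2.11.jar,"
            "https://<repo>/parso-2.0.8.jar")

spark = (SparkSession.builder
         .appName("i94-load")
         .config("spark.jars", SAS_JARS)   # explicit jar URLs, since dependency resolution failed
         .getOrCreate())

df_i94 = (spark.read
          .format("com.github.saurfang.sas.spark")
          .load("i94_apr16_sub.sas7bdat"))   # path is a placeholder
```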
1.4 World temperature data ¶
The dataset is from Kaggle. It can be found here .
The dataset contains temperature data:
- Global Land and Ocean-and-Land Temperatures (GlobalTemperatures.csv)
- Global Average Land Temperature by Country (GlobalLandTemperaturesByCountry.csv)
- Global Average Land Temperature by State (GlobalLandTemperaturesByState.csv)
- Global Land Temperatures By Major City (GlobalLandTemperaturesByMajorCity.csv)
- Global Land Temperatures By City (GlobalLandTemperaturesByCity.csv)

1.5 Airport codes data ¶
This is a table of airport codes, and information on the corresponding cities, like gps coordinates, elevation, country, etc. It comes from Datahub website .

1.6 U.S. City Demographic Data ¶
The dataset comes from OpenSoft. It can be found here .

2. Data Exploration ¶
In this chapter we proceed identifying data quality issues, like missing values, duplicate data, etc.
The purpose is to identify the flow in the data pipeline to programmatically correct data issues.
In this step we work on local data.
2.1 The I94 dataset ¶
- How many files are in the I94 dataset?
- What is the size of the files?
2.2 I94 SAS data load ¶
To read SAS data format I need to specify the com.github.saurfang.sas.spark format.
- Let’s see the schema Spark applied on reading the file
Most columns contain categorical data; this means the information is coded. For example, in I94CIT=101 , 101 is the country code for Albania.
Other columns represent integer data.
It is clear that there is no need for these fields to be defined as double => let's change those fields to integer
Verifying the schema is correct.
- convert string columns dtadfile and dtaddto to date type
These fields come in a simple string format. To be able to run time-based queries they are converted to date type
- convert columns arrdate and depdate from SAS-date format to a timestamp type.
A date in SAS format is simply the number of days between the chosen date and the reference date (01-01-1960)
- print final schema
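A compact sketch of the conversions described in this section (the column subset and the string date formats are assumptions; df_i94 is the dataframe loaded above):

```python
import pyspark.sql.functions as F

# Doubles to integers (subset of columns, for illustration)
int_cols = ["cicid", "i94yr", "i94mon", "i94cit", "i94res", "i94mode", "i94bir", "i94visa"]
for c in int_cols:
    df_i94 = df_i94.withColumn(c, F.col(c).cast("integer"))

df_i94 = (df_i94
          # simple string dates to date type (formats assumed)
          .withColumn("dtadfile", F.to_date("dtadfile", "yyyyMMdd"))
          .withColumn("dtaddto", F.to_date("dtaddto", "MMddyyyy"))
          # SAS dates: days since 1960-01-01
          .withColumn("arrdate", F.expr("date_add(to_date('1960-01-01'), int(arrdate))"))
          .withColumn("depdate", F.expr("date_add(to_date('1960-01-01'), int(depdate))")))

df_i94.printSchema()
```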
2.3 Explore I94 data ¶
- How many rows does the I94 database have?
- Let’s see the gender distribution of the applicants
- Where are the I94 applicants coming from?
I want to know the 10 most represented nations
The i94res code 135, where the highest number of visitors come from, corresponds to the United Kingdom, as can be read in the accompanying file I94_SAS_Labels_Descriptions.SAS
- What port registered the highest number of arrivals?
New York City port registered the highest number of arrivals.
2.4 Cleaning the I94 dataset ¶
These are the steps to perform on the I94 database:
- Identify null and NaN values. Remove duplicates ( quality check ).
- Find errors in the records ( quality check ) for example dates not in year 2016
- Counting how many NaN there are in each column, excluding the date type columns dtadfile , dtaddto , arrdate , depdate because the isnan function works only on numerical types
- How many rows of the I94 database have null value?
The number of nulls equals the number of rows. It means there is at least one null in each row of the dataframe.
- Now we can count how many nulls there are in each column
There are many nulls in many columns.
The question is, if there is a need to correct/fill those nulls.
Looking at the data, it seems like some fields have been left empty for lack of information.
Because these are categorical data there is no use, at this step, in assigning arbitrary values to the nulls.
The nulls are not going to be filled a priori, but only if a specific need comes up (a sketch of the per-column count is shown below).
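For reference, a sketch of such a per-column count (a slightly more general variant that applies isnan only to floating-point columns, since it is undefined for the other types):

```python
import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType, FloatType

# NaN only makes sense for floating-point columns; everything else is checked for null only
float_cols = [f.name for f in df_i94.schema.fields
              if isinstance(f.dataType, (DoubleType, FloatType))]

null_counts = df_i94.select([
    F.count(F.when(
        F.col(c).isNull() | (F.isnan(c) if c in float_cols else F.lit(False)), c
    )).alias(c)
    for c in df_i94.columns
])
null_counts.show()
```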
- Are there duplicated rows?
Dropping duplicate rows
Checking if the number changed
No row has been dropped => no duplicated rows
- Verify that all rows have the i94yr column equal to 2016
This gives confidence in the consistency of the data
2.5 Store I94 data as parquet ¶
I94 data are stored in parquet format in an S3 bucket, partitioned using the year and month fields.
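In PySpark this is a one-liner (the bucket path is a placeholder; i94yr and i94mon hold the year and month):

```python
# Write the cleaned I94 data to S3 as parquet, partitioned by year and month
(df_i94.write
 .mode("overwrite")
 .partitionBy("i94yr", "i94mon")
 .parquet("s3a://<your-bucket>/i94/parquet/"))
```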
2.6 The Airport codes dataset ¶
A snippet of the data
How many records?
There are no duplicates
We discover there are some null fields:
The nulls are in these columns:
No action taken to fill the nulls
Finally, let’s save the data in parquet format in our temporary folder mimicking the S3 bucket.
3. The Data Model ¶
The core of the architecture is a data lake , with S3 storage and EMR processing.
The data are stored into S3 in raw and parquet format.
Apache Spark is the tool chosen for analytical tasks, therefore all data are loaded into Spark dataframes using a schema-on-read approach.
For SQL-style queries on the data, Spark temporary views are generated.
3.1 Mapping Out Data Pipelines ¶
- Provision the AWS S3 infrastructure
- Transfer data from the common storage to the S3 lake storage
- Provision an EMR cluster. It runs 2 steps and then auto-terminates: 3.1 Run a Spark job to extract codes from the file I94_SAS_Labels_Descriptions.SAS and save them to S3; 3.2 Data cleaning: find NaN, nulls and duplicates, and save the clean data to parquet files
- Generate reports using Spark query on S3 parquet data
- On-the-fly queries with Spark SQL

4. Run Pipeline to Model the Data ¶
4.1 Provision the AWS S3 infrastructure ¶
Reading credentials and configuration from file
Create the bucket if it’s not existing
4.2 Transfer raw data to S3 bucket ¶
Transfer the data from current shared storage (currently Udacity workspace) to S3 lake storage.
A naive metadata system is implemented. It uses a json file to store basic information on each file added to the S3 bucket (see the sketch after this list):
- file name: file being processed
- added by: user logged as | aws access id
- date added: timestamp of date of processing
- modified on: timestamp of modification time
- notes: any additional information
- access granted to (role or policy): admin | anyone | I94 access policy | weather data access policy |
- expire date: 5 years (default)
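An example of one such metadata entry, as it might be appended by a small helper (all values are illustrative):

```python
import json
from datetime import datetime, timedelta

entry = {
    "file name": "i94_apr16_sub.sas7bdat",
    "added by": "admin | AKIAXXXXXXXX",            # illustrative, not a real access key id
    "date added": datetime.utcnow().isoformat(),
    "modified on": datetime.utcnow().isoformat(),
    "notes": "raw I94 data for April 2016",
    "access granted to": "I94 access policy",
    "expire date": (datetime.utcnow() + timedelta(days=5 * 365)).isoformat(),
}

# Append the entry to the naive metadata file, one JSON object per line
with open("s3_metadata.json", "a") as f:
    f.write(json.dumps(entry) + "\n")
```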
These datasets are moved to the S3 lake storage:
- I94 immigration data
- airport codes
- US cities demographics
4.3 EMR cluster on EC2 ¶
An EMR cluster on EC2 instances with Apache Spark preinstalled is used to perform the ELT work.
A 3-node cluster of m5.xlarge instances is configured by default in the config.cfg file.
If the performance requires it, the cluster can be scaled up to use more nodes and/or bigger instances.
After the cluster has been created, the steps that execute the Spark cleaning jobs are added to the EMR job flow; the steps are in separate .py files. The following steps are added:
- extract I94res, i94cit, i94port codes
- save the codes in a json file in S3
- load I94 raw data from S3
- change schema
- data cleaning
- save parquet data to S3
The cluster is set to auto-terminate by default after executing all the steps.
4.3.1 Provision the EMR cluster ¶
Create the cluster using the code in emr_cluster.py [Ref. 3] and emr_cluster_spark_submit.py and set the steps to execute spark_script_1 and spark_script_2 .
These scripts have already been previously uploaded to a dedicated folder in the project’s S3 bucket, and are accessible from the EMR cluster.
The file spark_4_emr_codes_extraction.py contains the code for the following paragraph 4.3.1
The file spark_4_emr_I94_processing.py contains the code for the following paragraphs 4.3.2, 4.3.3 and 4.3.4
4.3.2 Coded fields: I94CIT and I94RES ¶
I94CIT, I94RES contain codes indicating the country where the applicant is born (I94CIT), or resident (I94RES).
The data is extracted from I94_SAS_Labels_Descriptions.SAS . This can be done sporadically or every time a change occurs, for example when a new code has been added.
The conceptual flow below was implemented.

The first steps are to define the credentials to access S3a and then load the data into a dataframe, as a single row
Find the section of the file where I94CIT and I94RES are specified.
It starts with I94CIT & I94RES and finishes with the semicolon character.
To match the section, it is important to have the complete text in a single row; I did this using the option wholetext=True in the previous dataFrame read operation
Now I can split it into a dataframe with multiple rows
I filter the rows with the structure code = 'country'
And then create 2 different columns with code and country
I can finally store the data in a single file in json format (see the sketch below)
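One way to implement that flow is sketched below (here a plain regex is applied to the collected single-row text; the project's script does the splitting and filtering with dataframe operations instead, and the bucket path is a placeholder):

```python
import re

from pyspark.sql import Row

# Read the whole label file as one string (wholetext=True gives a single-row dataframe)
raw = spark.read.text("I94_SAS_Labels_Descriptions.SAS", wholetext=True).first()[0]

# Isolate the I94CIT & I94RES section: it ends at the closing semicolon
section = re.search(r"I94CIT & I94RES(.*?);", raw, re.DOTALL).group(1)

# Lines look like:   101 =  'ALBANIA'
pairs = re.findall(r"(\d+)\s*=\s*'([^']*)'", section)

codes_df = spark.createDataFrame([Row(code=int(c), country=n.strip()) for c, n in pairs])
codes_df.coalesce(1).write.mode("overwrite").json("s3a://<your-bucket>/codes/i94res/")
```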
4.3.3 Coded field: I94PORT ¶
We proceed similarly to extract the I94PORT codes
The complete code for codes extraction is in spark_4_emr_codes_extraction.py
4.3.4 Data cleaning ¶
The cleaning steps have already been shown in section 2; here they are only summarized
- Load dataset
- Numeric fields: double to integer
- Fields dtadfile and dtaddto : string to date
- Fields arrdate and depdate : sas to date
- Handle nulls: no fill is set by default
- Drop duplicate
4.3.5 Save clean data (parquet/json) to S3 ¶
The complete code, refactored and modularized, is in **spark_4_emr_I94_processing.py**
As a side note, saving the test file as parquet takes about 3 minutes on the provisioned cluster. The complete script execution takes 6 minutes.
4.3.6 Loading, cleaning and saving airport codes ¶
4.4 Querying data on-the-fly ¶
The data in the data lake can be queried in place. That is, the Spark cluster on EMR operates directly on the S3 data.
There are two possible ways to query the data:
- using Spark dataframe functions
- using SQL on tables
We see examples of both programming styles.
These are some typical queries that are run on the data:
- For each port, in a given period, how many arrivals are there each day?
- Where are the I94 applicants coming from, in a given period?
- In the given period, what port registered the highest number of arrivals?
- Number of arrivals in a given city for a given period
- Travelers genders
- Is there a city where the difference between male and female travelers is higher?
- Find most visited city (the function)
The queries are collected in the Jupyter notebook Capstone project 1 – Querying the data lake.ipynb
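For instance, the first query in the list above might look like this in the dataframe style (column names follow the I94 field list; the date window is just an example, and df_i94 is assumed to have arrdate already converted to a date):

```python
import pyspark.sql.functions as F

arrivals_per_port_per_day = (
    df_i94
    .filter(F.col("arrdate").between("2016-04-01", "2016-04-30"))
    .groupBy("i94port", "arrdate")
    .agg(F.count("cicid").alias("arrivals"))
    .orderBy("i94port", "arrdate")
)
arrivals_per_port_per_day.show(20)
```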
4.5 Querying data using the SQL querying style ¶
4.6 Data quality checks ¶
The query-in-place concept implemented here uses a very short pipeline, data are loaded from S3 and after a cleaning process are saved as parquet. Quality of the data is guaranteed by design.
5. Write Up ¶
The project has been set up with scalability in mind. All components used, S3 and EMR, offer a high degree of scalability, both horizontal and vertical.
The tool used for the processing, Apache Spark, is the de facto tool for big data processing.
To achieve such a level of scalability we sacrified processing speed. A data warehouse solution with a Redshift database or an OLAP cube would have been faster answering the queries. Anyway nothing forbids to add a DWH to stage the data in case of a more intensive, real-time responsive, usage of the data.
An important part of an ELT/ETL process is automation. Although it has not been touched here, I believe the code developed here is prone to be automatized with a reasonable small effort. A tool like Apache Airflow can be used for the purpose.
Scenario extension ¶
- The data was increased by 100x.
In an increased-data scenario, the EMR hardware needs to be scaled up accordingly. This is done by simply changing the configuration in the config.cfg file. Apache Spark is the tool for big data processing and is already used as the project's analytics tool.
- The data populates a dashboard that must be updated on a daily basis by 7am every day.
In this case an orchestration tool like Apache Airflow is required. A DAG that triggers the Python scripts and Spark job executions needs to be scheduled for daily execution at 7 am (see the DAG sketch after this list).
The results of the queries for the dashboard can be saved to a file.
- The database needed to be accessed by 100+ people.
A traditional database was not used; instead, Amazon S3 stores the data and the queries run in place. S3 is designed with massive scale in mind and can handle sudden traffic spikes, so having many people access the data should not be an issue.
As written, the project provisions an EMR cluster for any user who plans to run queries. 100+ EMR clusters would probably be too expensive for the company, so a more efficient sharing of processing resources would have to be put in place.
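For the dashboard scenario above, a minimal Airflow DAG could look like the sketch below; the DAG id, script paths and commands are placeholders, not the project's actual setup.

# Illustrative Airflow DAG for a daily 7 am refresh (ids, paths and commands are placeholders)
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="i94_daily_refresh",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 7 * * *",   # every day at 7:00
    catchup=False,
) as dag:
    process_i94 = BashOperator(
        task_id="process_i94",
        bash_command="spark-submit s3://my-bucket/scripts/spark_4_emr_I94_processing.py",
    )
    refresh_dashboard = BashOperator(
        task_id="refresh_dashboard",
        bash_command="python3 /opt/jobs/export_dashboard_queries.py",
    )
    process_i94 >> refresh_dashboard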
6. Lessons learned ¶
EMR 5.28.1 uses Python 2 as default ¶
- As a consequence, important Python packages like pandas are not installed by default for Python 3.
- Install packages for Python 3: python3 -m pip install \
Adding JAR packages to Spark ¶
For some reason, adding the packages in the Python program when instantiating the SparkSession does not work (error message: package not found). This does not work:
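The failing variant was along these lines (a sketch; the package coordinate is only an example, not necessarily the one the project used):

# Setting spark.jars.packages at SparkSession creation: the pattern described as failing here
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("i94-processing")
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11")  # example coordinate
    .getOrCreate()
)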
The packages must instead be passed to spark-submit:
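That is, something along the lines of spark-submit --packages saurfang:spark-sas7bdat:2.0.0-s_2.11 spark_4_emr_I94_processing.py, where the package coordinate is again only an example.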
Debugging Spark on EMR ¶
Even if everything works locally, it does not necessarily mean it will work on the EMR cluster. Debugging the code is easier over SSH on the EMR cluster.
Reading an S3 file from Python is tricky ¶
While reading with Spark is straightforward (one just needs to provide the s3://…. address), in plain Python boto3 must be used.
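A minimal boto3 read looks something like this (bucket and key are placeholders):

# Reading a small S3 object from plain Python with boto3 (bucket/key are placeholders)
import boto3

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-bucket", Key="raw/i94res.csv")
content = obj["Body"].read().decode("utf-8")
print(content[:200])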
Transferring files to S3 ¶
During the debugging phase, when the code on S3 must be changed many times, using the web interface is slow and impractical (each change requires permanently deleting the old file). Memorize this command: aws s3 cp <local file> <s3 folder>
Removing the content of a directory from Python ¶
import shutil

dirPath = 'metastore_db'
shutil.rmtree(dirPath)
7. References ¶
- AWS CLI Command Reference
- EMR provisioning is based on: GitHub repo Boto-3 provisioning
- Boto3 Command Reference