DS 401 Capstone Projects
Selected projects from 2022:
P1: Public Health - Overdose Data Dashboard
P2: USDA Commodity Dashboard(s)
P3: Classification and Analysis Pipeline of Political Video Advertisements - Dashboard with Google Cloud
P4: CSAFE - Assessing and modeling the quality of 3D topographic scans of fired bullets
P5: Department of Residence - The Impact of Living on Campus on Student Success
The Public Science Collaborative at ISU is looking for advanced data science students to join a research project focusing on the engineering and visualization of public health data. The key task for the spring semester will be to build an opioid overdose data dashboard similar to this one in California. DS 401 interns will work in a supervised, collaborative team-science environment to clean, analyze, and visualize data from four data sets: a) vital statistics mortality data, b) emergency department overdose data, c) substance abuse treatment episodes data, and d) the Iowa Youth Survey dataset. We are a pluralistic coding environment and welcome students using Python, R, Stata, SAS, and other data management and analytic platforms.
Because this project is funded by Overdose Data to Action, a grant from the Centers for Disease Control and Prevention, students who are accepted to the project will have the opportunity to pair a funded research assistantship with their DS 401 internship. This opportunity would be an especially good fit for students who are interested in data visualization and data science communication.
Project advisor: Shawn F. Dorius - Associate Professor of Sociology
Students selecting this project will develop a series of dashboards using Tableau. These dashboards will utilize data from the USDA Agricultural Census to show trends in production for selected commodities (such as apples, cheese, grapes, dairy, pork, lettuce, tomatoes, potatoes, strawberries, bees, honey, or wine). Trends may include the monthly or annual quantity, the number of producers, acres in production, total sales, and other metrics at multiple geographies (county, state, nation). Students will also incorporate demographic data for selected areas of interest that highlight the potential regional market and the market and consumption profile (food expenditures, farmers' market density, schools with farm-to-school programs, etc.). Students will be provided with access to Tableau and Tableau Server and will utilize R's tidycensus package to acquire data from the American Community Survey (ACS).
Christopher J. Seeger, PLA, GISP - Professor, GIS Specialist and Director of Extension Indicators Program and 2022 DSPG Chair
Bailey Hanson, GISP - GIS Specialist; leads the GIS program and the Data for Decision Maker program. Her background includes a Master's in Human-Computer Interaction.
Using public data from the Google Transparency Report, this project will create a pipeline for extracting, processing, classifying, and visualizing political ads data on the Google Cloud computing platform.
Campaign advertising through social media platforms has been growing at a high rate, creating a large volume of content on the Internet. To increase transparency in federal campaign advertising, Google created the Google Transparency Report (GTR). GTR provides a website and searchable databases covering federal election campaign ads aired on Google and its partners' platforms. According to GTR, political advertisers have spent around $800M on election campaigns since May 2018.
This project built a platform for collecting video ads aired on YouTube and for automated content analysis. The platform can: 1) automatically classify a video ad as political or non-political, 2) label predicted political ads with one of the types of interest to political science scholars: promote, attack, or contrast, 3) extract issues of interest for political science research, 4) determine the polarity and subjectivity of a given ad, and 5) create various visualization charts from the preceding analyses.
Adisak Sukul - Associate Teaching Professor, Computer Science; instructor for Data Science courses; Google Cloud Faculty Expert
A large part of a forensic examiner’s job is to visually compare evidence to decide whether two pieces of evidence come from the same source (e.g. bullets fired from the same barrel, prints from the same shoe, the same finger).
3D digital microscopy provides a basis for bringing in algorithms to make comparisons of evidence objective and to quantify similarities (or dissimilarities). The high-resolution microscopy lab at Iowa State has acquired scans of bullet lands.
Good-quality scans are essential for assessing the similarity of the striations (the marks engraved on the bullet as it passes through the barrel).
In this project, the goal is to derive features capturing (aspects of) the quality of scans and build a model to predict a quality indicator. Ideally, this feedback will be given at the time of scanning, such that a lack of quality can be addressed immediately.
Students will work under the guidance of Dr. Heike Hofmann to derive features capturing scan quality, work on a model incorporating these scan analytics, and depending on time, design an app for giving feedback to scanning personnel.
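The modeling step described above could be sketched roughly as follows. This is a hypothetical illustration, not the project's actual method: the feature names (fraction of missing height values, surface-height variance) and the synthetic quality indicator are invented stand-ins for whatever scan analytics the team derives, and a random forest is just one plausible model choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200
# Hypothetical per-scan features standing in for real scan analytics:
# fraction of missing height values and variance of surface heights.
missing_frac = rng.uniform(0.0, 0.3, n)
height_var = rng.uniform(0.1, 2.0, n)
X = np.column_stack([missing_frac, height_var])
# Synthetic quality indicator: fewer missing values -> higher quality.
y = 1.0 - 2.0 * missing_frac - 0.1 * height_var + 0.05 * rng.normal(size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(round(model.score(X_test, y_test), 2))  # held-out R^2
```

A model like this, trained on scans hand-labeled for quality, could be wrapped in an app that scores a scan immediately after capture, which is the feedback loop the project description envisions.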
Heike Hofmann, Professor, and Professor in Charge of the Data Science Program - Department of Statistics
Final R Package: https://github.com/heike/DS401
Project Description: The Department of Residence is interested in understanding how living on campus, both in the first year and in subsequent years, impacts student success measures such as graduation and retention. We're also looking to understand whether those impacts are the same or different for different sub-groups of students (such as students of color, first-generation students, etc.). The audience for this data is non-technical, with a limited background in understanding and analyzing data. The data file is already compiled and will be provided to the team. There is no preference for analysis software.
This project contains sensitive and private information. All of the findings from this project will remain private.
Dr. Elizabeth Housholder serves as the Senior Research Analyst for the Department of Residence.
Learner at Simplilearn-Purdue University | Data Scientist | ML | DL | AI | NLP | Email: [email protected]
Hi, I am a Data Scientist pursuing my passion for AI, ML, DL, NLP, and Computer Vision. I am an Electronics & Instrumentation Engineer by qualification and have more than 18 years of domain experience in the Power/Energy/Infra/Railway sector. I have diversified experience in Business Development, Tendering, Bid Management, Costing & Estimation, Procurement, Operations, Team Management, Strategic Planning, and Tie-ups & Joint Ventures. Some of the companies I have worked with are Skipper Electricals India Ltd., KEC International Limited, and Gepdec Infratech Limited. Presently I am working with Wipro as Principal Consultant - Technology & Implementation.
I am passionate about Data Analysis, Machine Learning, Deep Learning, Natural Language Processing, and Artificial Intelligence. I have worked on some interesting machine learning projects covering Regression, Classification, Clustering, and NLP problems, and I am open to collaboration on any interesting assignment.
Data Science, Machine Learning, Deep Learning, Natural Language Processing, Python, R Programming, SQL, HTML, Flask/ Django. Google Cloud AutoML Table, Google Cloud/ Amazon Web Services (AWS), Tableau, Dashboarding and Data Visualization
- Python for Data Science - IBM
- R Programming for Data Science - IBM
- AWS Certified Cloud Practitioner - Amazon Web Services
- PGP in Data Science - Simplilearn
Project No. 1: HEALTHCARE PGP (BINARY CLASSIFICATION)
Model deployed at: https://python-flask-ml.herokuapp.com/
Introduction: NIDDK (National Institute of Diabetes and Digestive and Kidney Diseases) research creates knowledge about and treatments for the most chronic, costly, and consequential diseases. The dataset used in this project is originally from NIDDK. The objective was to predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.
Objective: This is my first capstone project and was part of the final assessment for the PGP in Data Science course from Simplilearn-Purdue University. My task was to analyze patient data from NIDDK, which consists of several medical predictor variables and one target variable (Outcome). Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and more. I created and trained multiple machine learning models using various classification algorithms, then compared all models on their evaluation metrics for the given data. Finally, the model was deployed on Heroku.
I performed the following tasks in this project:
- Data Cleaning
- Data Transformation
- Data Modeling - various classification models
- Data Modeling - performance evaluation using various metrics
- Data Reporting - dashboarding in Tableau
- Model Deployment - model deployed on Heroku
Tools used: Python, Pandas, Numpy, Logistic Regression, Decision Tree, Random Forest, KNN, SVM, Gradient Boosting, Scikit-learn, Matplotlib, Seaborn, Data Preprocessing, Data Transformation, Flask, Tableau.
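The model-comparison step described above can be sketched with scikit-learn. This is a minimal illustration rather than the project's actual code: `make_classification` stands in for the NIDDK diabetes file (which also has 8 diagnostic features and a binary outcome), and only three of the listed algorithms are shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the NIDDK data: 8 features, binary outcome.
X, y = make_classification(n_samples=768, n_features=8,
                           n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Scale-sensitive models get a StandardScaler in their pipeline.
models = {
    "logistic": make_pipeline(StandardScaler(), LogisticRegression()),
    "random_forest": RandomForestClassifier(random_state=42),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}
scores = {name: accuracy_score(y_test, m.fit(X_train, y_train).predict(X_test))
          for name, m in models.items()}
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```

In practice the comparison would also look at precision, recall, and ROC-AUC, since accuracy alone can mislead on an imbalanced medical outcome.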
Project No. 2: RETAIL PGP (CUSTOMER SEGMENTATION)
Introduction: Customer segmentation is the practice of segregating the customer base into groups of individuals based on common characteristics such as age, gender, interests, and spending habits. It is a way for organizations to understand their customers. Knowing the differences between customer groups makes it easier to make strategic decisions regarding business growth and marketing campaigns. Implementing customer segmentation leads to plenty of new business opportunities, and a business can do a lot of optimization in budgeting, product design, promotion, marketing, customer satisfaction, etc. The opportunities to segment are endless and depend mainly on how much customer data you have at your disposal. There are many machine learning algorithms, each suitable for a specific type of problem. One very common algorithm suitable for customer segmentation problems is k-means clustering, which I used for this project. There are other clustering algorithms as well, such as DBSCAN, Agglomerative Clustering, and BIRCH.
Objective: This is my second capstone project and was part of the final assessment for the PGP in Data Science course from Simplilearn-Purdue University. My job was to analyze transactional data for an online UK-based retail company and create a customer segmentation so that the company can run effective marketing campaigns. The data set contains all the transactions that occurred between 01/12/2010 and 09/12/2011. The company mainly sells unique all-occasion gifts.
- Data Modeling - RFM (Recency Frequency Monetary) model
- Data Modeling - K-means clustering algorithm
- Data Reporting - Dashboarding in Tableau
K-means clustering, an unsupervised algorithm, is one of the techniques useful for customer segmentation. The basic idea underlying k-means is to group similar data points into the same cluster.
Tools used: Python, Pandas, Numpy, K-means clustering, Scikit-learn, Matplotlib, Seaborn, Data Preprocessing, Data Transformation, Tableau.
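The RFM-plus-k-means approach described above can be sketched as follows. The transactions here are invented toy values, not the actual UK retail data, and the real project would choose k via the elbow method or silhouette scores rather than hard-coding two segments.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy transactions standing in for the UK retail data (hypothetical values).
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3, 4],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-05", "2011-11-20", "2011-03-10",
         "2011-10-01", "2011-11-15", "2011-12-01", "2011-06-30"]),
    "Amount": [50.0, 20.0, 300.0, 15.0, 25.0, 10.0, 120.0],
})
snapshot = tx["InvoiceDate"].max() + pd.Timedelta(days=1)

# RFM: days since last purchase, number of purchases, total spend.
rfm = tx.groupby("CustomerID").agg(
    recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    frequency=("InvoiceDate", "count"),
    monetary=("Amount", "sum"),
)

# Standardize the three features, then cluster customers with k-means.
X = StandardScaler().fit_transform(rfm)
rfm["segment"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(rfm)
```

Scaling matters here: without it, the monetary column (hundreds) would dominate the distance calculation over recency and frequency.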
My Data Science Portfolio
Assignments & Coursework
Project No. 1: TEXT GENERATION USING DEEP LEARNING
Introduction: Deep Learning is the most exciting part of Data Science and the next stage after learning Machine Learning. I worked on this project, titled “Text Generation Using Deep Learning”, to apply Deep Learning to a major problem in Natural Language Processing known as Language Modeling. I used the Keras library to create a Recurrent Neural Network model and train it on our dataset to predict headline text.
Objective: Language Modelling is the core problem for a number of natural language processing tasks such as speech-to-text, conversational systems, and text summarization. A trained language model learns the likelihood of occurrence of a word/character based on the previous sequence of words/characters in the text. Language models can operate at the character level, n-gram level, sentence level, or even paragraph level. We will create a language model for predicting the next word by implementing and training state-of-the-art Recurrent Neural Networks.
Tools used: Python, Pandas, Numpy, NLP, Deep Learning, Tensorflow Keras, Recurrent Neural Network (RNN), Long Short Term Memory networks (LSTM).
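The core idea of language modeling, estimating the likelihood of the next word from the preceding words, can be illustrated without a neural network at all. This stdlib sketch uses a simple bigram counter on a few invented headlines; the project itself trained a Keras LSTM, which generalizes this same next-word objective to longer contexts.

```python
from collections import Counter, defaultdict

# Tiny stand-in corpus of headlines (hypothetical); the real project
# trained an LSTM on a much larger headline dataset.
headlines = [
    "stocks rally as markets rebound",
    "stocks fall as markets tumble",
    "markets rebound after stocks rally",
]

# Count word bigrams: P(next | current) is proportional to count(current, next).
bigrams = defaultdict(Counter)
for line in headlines:
    words = line.split()
    for cur, nxt in zip(words, words[1:]):
        bigrams[cur][nxt] += 1

def predict_next(word):
    """Most likely next word given the previous one."""
    counts = bigrams.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("stocks"))  # "rally" follows "stocks" twice, "fall" once
```

An LSTM replaces the raw counts with learned representations, so it can condition on the whole preceding sequence instead of just the last word.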
Project No. 2: TOPIC MODELING
Introduction: I have worked on this project titled “Topic Modeling” to fulfil mandatory criteria towards “ Natural Language Processing ” module of my PGP in Data Science course from Simplilearn. I used NLTK library to perform various text preprocessing, POS tagging, Lemmatization and Gensim Library to create LDA model for topic identification and finally pyLDAvis to visualize model created.
Objective: Lenovo, a popular mobile phone brand, has launched its budget smartphone in the Indian market. The client wants to understand the VOC (voice of the customer) on the product. This will be useful not just to evaluate the current product, but also to get some direction for developing the product pipeline. The client is particularly interested in the different aspects that customers care about. Product reviews by customers on a leading e-commerce site should provide a good view. The task was to perform analysis via POS tagging, topic modeling using LDA, and topic interpretation.
Tools used: Python, Pandas, Numpy, Regular Expression, NLTK, POS Tagging, Lemmatization, Gensim, LDA, pyLDAvis.
Project No. 3: WIKIPEDIA TOXICITY
Introduction: I have worked on this project titled “Wikipedia Toxicity” to fulfil mandatory criteria towards “ Natural Language Processing ” module of my PGP in Data Science course from Simplilearn. I performed text preprocessing using various functions such RegEx, Tokenization, Stopwords and Punctuation removals followed by domain stopwords removal and Lemmatization. I also performed Class balancing since datapoints in target variable were imbalanced. Hyperparameter tuning using GridSearch and StratifiedKFold was done to optimize model. SVM Classifier from Scikit-learn library was used to train model with train data and then prediction on test data.
Objective: Wikipedia is the world's largest and most popular reference work on the internet, with about 500 million unique visitors per month. It also has millions of contributors who can make edits to pages. The Talk edit pages are the key community forum where contributors discuss and debate changes pertaining to a particular topic. Wikipedia continuously strives to help online discussion become more productive and respectful. My task was to help Wikipedia build a predictive model that identifies toxic comments in a discussion and marks them for cleanup, using NLP and machine learning, and then to help identify the top terms from the toxic comments.
Tools used: Python, Pandas, Numpy, WordCloud, Regular Expression, NLTK, TfidfVectorizer, Scikit-learn, Support Vector Machine (SVM), Hyperparameter tuning using GridSearch and StratifiedKFold
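The TF-IDF-plus-SVM core of the pipeline above can be sketched in a few lines. The six labeled comments here are invented toy data standing in for the real Wikipedia Talk-page corpus, and the preprocessing, class balancing, and GridSearch tuning the project applied are omitted for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny labeled stand-in for the Talk-page comments (hypothetical).
comments = [
    "thanks for the helpful edit", "great work on this article",
    "you are an idiot", "this edit is garbage and so are you",
    "appreciate the detailed sources", "shut up and stop vandalizing",
]
toxic = [0, 0, 1, 1, 0, 1]

# TF-IDF features feeding a linear SVM classifier.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(comments, toxic)
print(clf.predict(["thanks again for the sources", "stop it you idiot"]))
```

The "top terms" part of the task then falls out of the same representation: a linear SVM's largest positive coefficients point at the TF-IDF terms most indicative of toxicity.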
Project No. 4: INCOME QUALIFICATION
Introduction: I have worked on this project titled “Income Qualification” to fulfil mandatory criteria towards “ Machine Learning ” module of my PGP in Data Science course from Simplilearn. I used RandomForestClassifier in Python to create model for predicting income level and used GridSearchCV to improve model performance. I also used Matplotlib and Seaborn to visualize data.
Objective: Many social programs have a hard time ensuring that the right people are given enough aid. It's tricky when a program focuses on the poorest segment of the population, because this segment often can't provide the necessary income and expense records to prove that they qualify. The Inter-American Development Bank (IDB) believes that new methods beyond traditional econometrics, based on a dataset of Costa Rican household characteristics, might help improve the performance of the Proxy Means Test (PMT). My task was to identify the level of income qualification needed for families in a social welfare program in Latin America using the given dataset.
Tools used: Python, Pandas, Numpy, Matplotlib, Seaborn, RandomForest Classifier, GridSearchCV.
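The RandomForest-plus-GridSearchCV combination named above can be sketched as follows. The synthetic four-class data and the parameter grid are illustrative stand-ins, not the project's actual features or tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the household-characteristics data:
# four income-qualification levels (hypothetical).
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

# Cross-validated search over a small hypothetical parameter grid.
param_grid = {"n_estimators": [50, 100], "max_depth": [4, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

`GridSearchCV` refits the best parameter combination on the full data, so `search` can be used directly as the final classifier afterwards.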
Project No. 5: COMCAST TELECOM CONSUMER COMPLAINTS
Introduction: I worked on this project titled “Comcast Telecom Consumer Complaints” to fulfil the mandatory criteria for the “Data Science with Python” module of my PGP course from Simplilearn. I did not create any machine learning model in this project. I performed data analysis in Python using various tools such as Pandas, Numpy, Matplotlib, Seaborn, and WordCloud.
Objective: Comcast is an American global telecommunications company. The given dataset serves as a repository of public customer complaints filed against Comcast. My task was to analyse the given data, provide a trend chart for the number of complaints at monthly and daily granularity levels, and present further insights.
Tools used: Python, Pandas, Numpy, Matplotlib, Seaborn, WordCloud.
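The monthly trend aggregation behind a chart like that can be sketched with pandas. The dates and complaint categories below are invented toy rows, not the real Comcast file, and the column names are assumptions.

```python
import pandas as pd

# Toy stand-in for the Comcast complaints file (hypothetical rows).
df = pd.DataFrame({
    "Date": pd.to_datetime(["2015-04-01", "2015-04-15", "2015-05-02",
                            "2015-05-02", "2015-06-10", "2015-06-11",
                            "2015-06-20"]),
    "Customer Complaint": ["billing", "speed", "billing", "outage",
                           "billing", "speed", "outage"],
})

# Monthly complaint counts -- the series plotted in the trend chart.
monthly = df.set_index("Date").resample("MS").size()
print(monthly)
```

Switching `resample("MS")` to `resample("D")` gives the daily granularity version of the same trend, and `monthly.plot()` would render it with Matplotlib.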
Project No. 6: RETAIL ANALYSIS WITH WALMART DATA
Introduction: I worked on this project titled “Retail Analysis with Walmart Data” to fulfil the mandatory criteria for the “Data Science with R” module of my PGP course from Simplilearn. I created a Linear Regression model using the R programming language and performed hypothesis testing and statistical analysis using various R libraries.
Objective: Walmart is one of the leading retail chains in the US. The business was facing a challenge due to unforeseen demand and would sometimes run out of stock. My task was to predict sales and demand accurately. Certain events and holidays impact sales on each day, and sales data is available for 45 Walmart stores. Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labour Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge in this project was modelling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data. The model also ingests factors reflecting economic conditions, including the CPI and the Unemployment Index.
Tools used: R programming language, dplyr, tidyr.
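The five-times holiday weighting described above amounts to a weighted mean absolute error. Although the project itself was done in R, the metric can be sketched in Python (the language of the other examples here) with toy numbers:

```python
# Weighted mean absolute error: holiday weeks count five times as much,
# mirroring the evaluation rule described above.
def wmae(actual, predicted, is_holiday):
    weights = [5 if h else 1 for h in is_holiday]
    errors = [w * abs(a - p) for w, a, p in zip(weights, actual, predicted)]
    return sum(errors) / sum(weights)

# Toy weekly sales (hypothetical values); the third week is a holiday week.
actual = [100.0, 120.0, 300.0]
predicted = [110.0, 115.0, 280.0]
is_holiday = [False, False, True]
print(wmae(actual, predicted, is_holiday))
```

Under this metric, a 20-unit miss on the holiday week contributes as much as a 100-unit miss spread over five ordinary weeks, which is why holiday modelling dominates the evaluation.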
- Coming soon…
Replicating Research
Project No. 1: STOCK MARKET TRADE DATA ANALYSIS
Objective: I was given a dataset (570 rows × 6 columns) containing historical trade details such as entry price, exit price, and P&L, along with some technical indicators: atr_perc, roc, and rsi. The objective was to analyze the dataset in Python and find the ranges of atr_perc, roc, and rsi that maximize the sum of P&L.
| Indicator | Min. Value | Max. Value |
| --- | --- | --- |
| RSI | 58.02 | 89.72 |
| ATR_perc | 0.274223035 | 0.427709112 |
| ROC | -77.72927145 | 2633.788326 |
Final P&L when taking an entry only while the given technical indicators were within the above ranges: Rs. 65773
Time taken by program to process complete data and generate output: 2.03 seconds
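The range search described above can be sketched as a brute-force optimization. The five trades below are invented toy rows standing in for the 570-row dataset, and candidate range endpoints are simply taken from the observed indicator values, one plausible approach, not necessarily the one used.

```python
import itertools

# Toy trades standing in for the real dataset (hypothetical values):
# (atr_perc, roc, rsi, pnl)
trades = [
    (0.30, 10.0, 60.0,  500.0),
    (0.35, 50.0, 70.0,  800.0),
    (0.20, -5.0, 45.0, -300.0),
    (0.40, 80.0, 85.0,  200.0),
    (0.50, 30.0, 55.0, -150.0),
]

def pnl_in_range(atr_rng, roc_rng, rsi_rng):
    """Sum P&L of trades whose indicators all fall in the given ranges."""
    return sum(pnl for atr, roc, rsi, pnl in trades
               if atr_rng[0] <= atr <= atr_rng[1]
               and roc_rng[0] <= roc <= roc_rng[1]
               and rsi_rng[0] <= rsi <= rsi_rng[1])

# Brute-force search over candidate endpoints taken from the data itself.
atrs = sorted({t[0] for t in trades})
rocs = sorted({t[1] for t in trades})
rsis = sorted({t[2] for t in trades})
best = max(
    ((a_lo, a_hi, r_lo, r_hi, s_lo, s_hi)
     for a_lo, a_hi in itertools.combinations(atrs, 2)
     for r_lo, r_hi in itertools.combinations(rocs, 2)
     for s_lo, s_hi in itertools.combinations(rsis, 2)),
    key=lambda r: pnl_in_range(r[0:2], r[2:4], r[4:6]),
)
print(best, pnl_in_range(best[0:2], best[2:4], best[4:6]))
```

On a 570-row dataset this cube of candidate ranges grows quickly, so vectorizing the filter with NumPy or pruning the candidate endpoints is what keeps runtimes at the seconds level reported above.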
Blogs & Vlogs:
Graphical Visualization of LinkedIn Network
Introduction: We all use LinkedIn to make connections with professionals in our own industry and beyond. On LinkedIn we see only a flat list of our connections, so it's hard to visualize the network as a whole. I applied my Python data science skills to explore my LinkedIn connections, surface some interesting information, and visualize it as interactive graphs.
Check out the following links to see interactive graphs showing details of my LinkedIn connections:
Companies of my LinkedIn connections: https://datapane.com/u/mgupta/reports/J35lWVA/linkedin-connection-company/
Positions of my LinkedIn connections: https://datapane.com/u/mgupta/reports/j3LQnv7/linkedin-connection-positions/
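The aggregation behind charts like these can be sketched with the standard library. The rows below are hypothetical stand-ins mimicking the Company/Position columns of LinkedIn's exported connections CSV; a real run would load that file with pandas first.

```python
from collections import Counter

# Hypothetical rows mimicking LinkedIn's exported connections CSV.
connections = [
    {"Company": "Wipro", "Position": "Consultant"},
    {"Company": "KEC International", "Position": "Manager"},
    {"Company": "Wipro", "Position": "Data Scientist"},
    {"Company": "Skipper Electricals", "Position": "Engineer"},
]

# Count connections per company -- the data series behind the chart.
company_counts = Counter(c["Company"] for c in connections)
for company, n in company_counts.most_common():
    print(company, n)
```

Feeding these counts into a plotting library (or a hosted report like the Datapane links above) turns the tally into the interactive chart.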
Copyright (c) Manish Gupta