Data Science

DS: 401 Capstone Projects

Selected projects from 2022:

P1: Public Health - Overdose Data Dashboard  

P2: USDA Commodity Dashboard(s)

P3: Classification and Analysis Pipeline of Political Video Advertisements - Dashboard with Google Cloud

P4: CSAFE - Assessing and Modeling Quality of 3D Topographic Scans of Fired Bullets

P5: Department of Residence - The Impact of Living on Campus on Student Success

The Public Science Collaborative at ISU is looking for advanced data science students to join a research project focusing on the engineering and visualization of public health data. The key task for the spring semester will be to build an opioid overdose data dashboard similar to this one in California. DS 401 interns will work in a supervised, collaborative team science environment to clean, analyze, and visualize data from four data sets: a) vital statistics mortality data, b) emergency department overdose data, c) substance abuse treatment episodes data, and d) the Iowa Youth Survey dataset. We are a pluralistic coding environment and welcome students using Python, R, Stata, SAS, and other data management and analytic platforms.

Because this project is funded by Overdose Data to Action, a grant from the Centers for Disease Control, students who are accepted to the project will have the opportunity to pair a funded research assistantship with their DS 401 internship. This opportunity would be an especially good fit for students who are interested in data visualization and data science communication.

Project advisors: Shawn F. Dorius - Associate Professor of Sociology

DS401-S22-P1-Poster_PublicHealth.pdf

Students selecting this project will develop a series of dashboards using Tableau. These dashboards will use data from the USDA Agricultural Census to show trends in production for selected commodities (such as apples, cheese, grapes, dairy, pork, lettuce, tomatoes, potatoes, strawberries, bees, and honey or wine). Trends may also include the monthly or annual quantity, the number of producers, acres in production, total sales, and other metrics at multiple geographies (county, state, nation). Students will also incorporate demographic data for selected areas of interest that highlight the potential regional market and the market and consumption profile (food expenditures, farmers' market density, schools with farm-to-school programs, etc.). Students will be provided with access to Tableau and Tableau Server and will use R’s tidycensus package to acquire data from the American Community Survey (ACS).

Project advisors:

Christopher J. Seeger, PLA, GISP - Professor, GIS Specialist and Director of Extension Indicators Program and 2022 DSPG Chair

Bailey Hanson, GISP - GIS Specialist; leads the GIS program and the Data for Decision Maker program. Her background includes a master's degree in Human-Computer Interaction.

DS401-S22-P2-Poster_USDA.pdf

Using public data from the Google Transparency Report, this project will create a pipeline for extracting, processing, classifying, and visualizing political ads data on the Google Cloud computing platform.

Campaign advertising through social media platforms has been growing rapidly, creating a large volume of content on the Internet. To increase transparency in federal campaign advertising, Google Inc. created the Google Transparency Report (GTR). GTR provides websites and searchable databases about federal election campaign ads aired on Google and partners’ platforms. According to GTR, political advertisers have spent around $800M on election campaigns since May 2018.

This project built a platform for collecting video ads aired on YouTube and for automated content analysis. It is able to 1) automatically classify a video ad as either political or non-political, 2) classify predicted political ads into one of the types of interest to political science scholars: promote, attack, or contrast, 3) extract issues of interest for political science research, 4) determine the polarity and subjectivity of a given ad, and 5) create various visualization charts from the previous analysis.

Project advisor: Adisak Sukul - Associate Teaching Professor, Computer Science; instructor for Data Science courses; Google Cloud Faculty Expert

DS401-S22-P3-Poster_GoogleTransparencyReport.pdf

A large part of a forensic examiner’s job is to visually compare evidence to decide whether two pieces of evidence come from the same source (e.g. bullets fired from the same barrel, prints from the same shoe, the same finger).

3D digital microscopy provides a basis for bringing in algorithms in an attempt to make comparisons of evidence objective and to quantify similarities (or dissimilarities). The high-resolution microscopy lab at Iowa State has acquired scans of bullet lands.

Good-quality scans are essential for assessing the similarity of the striations (the marks engraved on the bullet as it passes through the barrel). 

In this project, the goal is to derive features capturing (aspects of) the quality of scans and build a model to predict a quality indicator. Ideally, this feedback will be given at the time of scanning, such that a lack of quality can be addressed immediately.

Students will work under the guidance of Dr. Heike Hofmann to derive features capturing scan quality, work on a model incorporating these scan analytics, and depending on time, design an app for giving feedback to scanning personnel.

Preferred skills: proficiency in R; knowledge of HTML/JavaScript would be a plus.

Heike Hofmann, Professor, and Professor in Charge of the Data Science Program - Department of Statistics

Final R Package:  https://github.com/heike/DS401

DS401-S22-P4-Poster_CSAFE.pdf

Project Description: The Department of Residence is interested in understanding how living on campus, both in the first year and in subsequent years, impacts student success measures such as graduation and retention. We’re also looking to understand whether those impacts are the same or different for different sub-groups of students (such as students of color, first-generation students, etc.). The audience for this data would be considered a non-technical audience, with a limited background in understanding and analyzing data. The data file is already compiled and will be provided to this team. No preference for analysis software.

This project contains sensitive and private information. All of the findings from this project will remain private.

Dr. Elizabeth Housholder serves as the Senior Research Analyst for the Department of Residence.

DS401-S22-P5-Poster_Residence.pdf

Manish Gupta

Learner at Simplilearn-Purdue University. Data Scientist | ML | DL | AI | NLP

Hi, I am a Data Scientist pursuing my passion for AI, ML, DL, NLP, and Computer Vision. I am an Electronics & Instrumentation Engineer by qualification and have more than 18 years of domain experience in the Power/Energy/Infra/Railway sector. I have diverse experience in Business Development, Tendering, Bid Management, Costing & Estimation, Procurement, Operations, Team Management, Strategic Planning, and Tie-ups & Joint Ventures. Some of the companies I have worked with are Skipper Electricals India Ltd., KEC International Limited, and Gepdec Infratech Limited. Presently I am working with Wipro as Principal Consultant - Technology & Implementation.

I am passionate about Data Analysis, Machine Learning, Deep Learning, Natural Language Processing, and Artificial Intelligence. I have worked on several interesting machine learning projects covering regression, classification, clustering, and NLP problems, and I am open to collaboration on any interesting assignment.

Data Science, Machine Learning, Deep Learning, Natural Language Processing, Python, R Programming, SQL, HTML, Flask/Django, Google Cloud AutoML Table, Google Cloud/Amazon Web Services (AWS), Tableau, Dashboarding, and Data Visualization

Certifications:

CAPSTONE PROJECTS

Project No. 1: HEALTHCARE PGP (BINARY CLASSIFICATION)

Model deployed at: https://python-flask-ml.herokuapp.com/

Introduction: NIDDK (National Institute of Diabetes and Digestive and Kidney Diseases) research creates knowledge about and treatments for the most chronic, costly, and consequential diseases. The dataset used in this project is originally from NIDDK. The objective was to predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.

Objective: This is my first capstone project and was part of the final assessment for the PGP in Data Science course from Simplilearn-Purdue University. My task was to analyze patient data from NIDDK, which consists of several medical predictor variables and one target variable (Outcome). The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and more. I created and trained multiple machine learning models using various classification algorithms. All models were then compared by evaluating their metrics on the given data. Finally, the model was deployed on Heroku.

I performed following tasks in this project:-

Tools used: Python, Pandas, Numpy, Logistic Regression, Decision Tree, Random Forest, KNN, SVM, Gradient Boosting, Scikit-learn, Matplotlib, Seaborn, Data Preprocessing, Data Transformation, Flask, Tableau.
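The model comparison described in the objective can be illustrated with a minimal sketch. It assumes each trained classifier has already produced 0/1 predictions on a held-out set; the labels, the predictions, and the names `model_a` and `model_b` are hypothetical and are not taken from the project. The sketch uses only the standard library rather than the Scikit-learn metrics the project lists.

```python
from typing import Dict, List


def classification_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for binary labels (1 = has diabetes)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical held-out labels and predictions (illustration only).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "model_a": [1, 0, 0, 1, 0, 1, 1, 0],
    "model_b": [1, 0, 1, 1, 0, 0, 0, 0],
}

# Score every model with the same metrics, then pick the best by F1.
scores = {name: classification_metrics(y_true, y_pred) for name, y_pred in predictions.items()}
best_model = max(scores, key=lambda name: scores[name]["f1"])
```

Ranking by F1 is an assumption made here for illustration; on a medical screening task, recall on the positive class may matter more than overall accuracy.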

Project No. 2: RETAIL PGP (CUSTOMER SEGMENTATION)

Introduction: Customer segmentation is the practice of dividing the customer base into groups of individuals based on common characteristics such as age, gender, interests, and spending habits. It is a way for organizations to understand their customers. Knowing the differences between customer groups makes it easier to make strategic decisions about business growth and marketing campaigns. Customer segmentation opens up new business opportunities and lets a business optimize budgeting, product design, promotion, marketing, and customer satisfaction. The opportunities to segment are endless and depend mainly on how much customer data you have at your disposal. There are many machine learning algorithms, each suitable for a specific type of problem. One very common algorithm suited to customer segmentation is k-means clustering, which I used for this project. Other clustering algorithms include DBSCAN, Agglomerative Clustering, and BIRCH.

Objective: This is my second capstone project and was part of the final assessment for the PGP in Data Science course from Simplilearn-Purdue University. My job was to analyze transactional data for a UK-based online retail company and create customer segments so that the company can run effective marketing campaigns. This is a transnational data set which contains all the transactions that occurred between 01/12/2010 and 09/12/2011. The company mainly sells unique all-occasion gifts.

K-means clustering, an unsupervised algorithm, is one of the techniques that are useful for customer segmentation. The basic idea underlying k-means is to group data points into clusters so that points in the same cluster are more similar to one another than to points in other clusters.
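The idea can be sketched in a few lines of plain Python. This is a minimal illustration of the standard k-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points), not the Scikit-learn implementation the project used. The customer values below (days since last purchase, total spend) are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points: List[Point], k: int, iterations: int = 50, seed: int = 0):
    """Return (labels, centroids) after running the basic k-means loop."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return labels, centroids


# Hypothetical customers: (days since last purchase, total spend).
customers = [(5, 900), (7, 950), (6, 870), (90, 40), (85, 60), (95, 55)]
labels, centroids = k_means(customers, k=2)
```

On real transaction data the features would normally be standardized first, since an unscaled spend column can dominate the distance calculation.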

Tools used: Python, Pandas, Numpy, K-means clustering, Scikit-learn, Matplotlib, Seaborn, Data Preprocessing, Data Transformation, Tableau.

My Data Science Portfolio

Assignments & Coursework:

Project No. 1: TEXT GENERATION USING DEEP LEARNING

Introduction: I find Deep Learning the most exciting part of Data Science and the natural next stage after learning Machine Learning. I worked on this project, titled “Text Generation Using Deep Learning”, to apply my Deep Learning skills to a major problem in Natural Language Processing known as Language Modeling. I used the Keras library to create a Recurrent Neural Network model and trained it on a dataset to predict text for headlines.

Objective: Language modeling is the core problem for a number of natural language processing tasks such as speech-to-text, conversational systems, and text summarization. A trained language model learns the likelihood of occurrence of a word or character based on the previous sequence of words or characters in the text. Language models can operate at the character level, n-gram level, sentence level, or even paragraph level. The goal was to create a language model for predicting the next word by implementing and training a Recurrent Neural Network.
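To make "learning the likelihood of a word given the previous words" concrete, here is a minimal count-based bigram model. It is a deliberately simpler stand-in for the RNN/LSTM used in the project: it conditions on only one previous word and uses raw counts instead of learned weights. The headlines are hypothetical.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Iterable, Optional


def train_bigram_model(sentences: Iterable[str]) -> DefaultDict[str, Counter]:
    """Count how often each word follows each other word."""
    model: DefaultDict[str, Counter] = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for previous, current in zip(words, words[1:]):
            model[previous][current] += 1
    return model


def predict_next_word(model: DefaultDict[str, Counter], word: str) -> Optional[str]:
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical headlines (illustration only).
headlines = [
    "markets rally as rates fall",
    "markets rally after earnings",
    "markets slump as rates rise",
]
model = train_bigram_model(headlines)
next_word = predict_next_word(model, "markets")
```

A recurrent network replaces the lookup table with a learned function of the whole preceding sequence, which lets it generalize to word sequences it has not seen verbatim.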

Tools used: Python, Pandas, Numpy, NLP, Deep Learning, Tensorflow Keras, Recurrent Neural Network (RNN), Long Short Term Memory networks (LSTM).

Project No. 2: TOPIC MODELING

Introduction: I worked on this project, titled “Topic Modeling”, to fulfil the mandatory criteria of the “Natural Language Processing” module of my PGP in Data Science course from Simplilearn. I used the NLTK library for text preprocessing, POS tagging, and lemmatization, the Gensim library to create an LDA model for topic identification, and finally pyLDAvis to visualize the model.

Objective: A popular mobile phone brand, Lenovo, has launched its budget smartphone in the Indian market. The client wants to understand the VOC (voice of the customer) on the product. This will be useful not just to evaluate the current product, but also to get some direction for developing the product pipeline. The client is particularly interested in the different aspects that customers care about. Product reviews by customers on a leading e-commerce site should provide a good view. The task was to perform the analysis using POS tagging, topic modeling with LDA, and topic interpretation.

Tools used: Python, Pandas, Numpy, Regular Expression, NLTK, POS Tagging, Lemmatization, Gensim, LDA, pyLDAvis.

Project No. 3: WIKIPEDIA TOXICITY

Introduction: I worked on this project, titled “Wikipedia Toxicity”, to fulfil the mandatory criteria of the “Natural Language Processing” module of my PGP in Data Science course from Simplilearn. I performed text preprocessing using regular expressions, tokenization, and stopword and punctuation removal, followed by domain-specific stopword removal and lemmatization. I also performed class balancing, since the classes in the target variable were imbalanced. Hyperparameter tuning with GridSearch and StratifiedKFold was done to optimize the model. An SVM classifier from the Scikit-learn library was trained on the training data and then used to predict on the test data.

Objective: Wikipedia is the world’s largest and most popular reference work on the internet, with about 500 million unique visitors per month. It also has millions of contributors who can make edits to pages. The Talk pages are the key community forum, where contributors interact, discuss, and debate changes pertaining to a particular topic. Wikipedia continuously strives to help online discussion become more productive and respectful. My task was to help Wikipedia build a predictive model, using NLP and machine learning, that identifies toxic comments in the discussion and marks them for cleanup. After that, I was to identify the top terms in the toxic comments.

Tools used: Python, Pandas, Numpy, WordCloud, Regular Expression, NLTK, TfidfVectorizer, Scikit-learn, Support Vector Machine (SVM), Hyperparameter tuning using GridSearch and StratifiedKFold
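The last step of the objective, finding the top terms in toxic comments, can be sketched with a hand-rolled TF-IDF score. This is an assumption-laden illustration: the project lists Scikit-learn's TfidfVectorizer, whose weighting differs in detail from the plain `tf * log(N / df)` used here, and the comments and the tiny stopword list are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"you", "are", "an", "the", "for", "what", "about", "a", "is", "this"}


def tokenize(text: str) -> List[str]:
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def top_terms(target_docs: List[str], all_docs: List[str], n: int = 3) -> List[str]:
    """Rank terms in target_docs by term frequency times inverse document frequency.

    target_docs must be a subset of all_docs so every term has a document frequency.
    """
    doc_freq = Counter()
    for doc in all_docs:
        doc_freq.update(set(tokenize(doc)))
    term_freq = Counter()
    for doc in target_docs:
        term_freq.update(tokenize(doc))
    total = len(all_docs)
    scores = {t: tf * math.log(total / doc_freq[t]) for t, tf in term_freq.items()}
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:n]


# Hypothetical comments (illustration only).
toxic = ["you are an idiot", "what an idiot edit", "this edit is garbage"]
clean = ["thanks for the edit", "you are right about the source", "the source is reliable"]
top_toxic_terms = top_terms(toxic, toxic + clean, n=2)
```

Words such as "edit" appear in both toxic and clean comments, so the inverse document frequency pushes them below terms that are concentrated in the toxic class.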

Project No. 4: INCOME QUALIFICATION

Introduction: I worked on this project, titled “Income Qualification”, to fulfil the mandatory criteria of the “Machine Learning” module of my PGP in Data Science course from Simplilearn. I used RandomForestClassifier in Python to create a model for predicting income level and used GridSearchCV to improve model performance. I also used Matplotlib and Seaborn to visualize the data.

Objective: Many social programs have a hard time ensuring that the right people are given enough aid. It’s tricky when a program focuses on the poorest segment of the population, because this segment can’t provide the necessary income and expense records to prove that they qualify. The Inter-American Development Bank (IDB) believes that new methods beyond traditional econometrics, based on a dataset of Costa Rican household characteristics, might help improve the performance of the Proxy Means Test (PMT). My task was to identify the level of income qualification needed for families for a social welfare program in Latin America using the given dataset.

Tools used: Python, Pandas, Numpy, Matplotlib, Seaborn, RandomForest Classifier, GridSearchCV.

Project No. 5: COMCAST TELECOM CONSUMER COMPLAINTS

Introduction: I worked on this project, titled “Comcast Telecom Consumer Complaints”, to fulfil the mandatory criteria of the “Data Science with Python” module of my PGP course from Simplilearn. I did not create any machine learning model in this project. I performed data analysis in Python using tools such as Pandas, Numpy, Matplotlib, Seaborn, and WordCloud.

Objective: Comcast is an American global telecommunication company. The given dataset serves as a repository of public customer complaints filed against Comcast. My task was to analyse the given data, provide trend charts for the number of complaints at monthly and daily granularity, and present further insights.

Tools used: Python, Pandas, Numpy, Matplotlib, Seaborn, WordCloud.
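The monthly trend described in the objective comes down to counting complaints per calendar month. A minimal standard-library sketch follows; the project itself used Pandas, and the dates and the `%d-%m-%Y` format here are assumptions for illustration rather than details from the dataset.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable


def complaints_per_month(date_strings: Iterable[str], date_format: str = "%d-%m-%Y") -> Dict[str, int]:
    """Count complaints per calendar month, keyed as 'YYYY-MM' in chronological order."""
    months = Counter(
        datetime.strptime(d, date_format).strftime("%Y-%m") for d in date_strings
    )
    return dict(sorted(months.items()))


# Hypothetical complaint dates (illustration only).
sample_dates = ["22-04-2015", "04-08-2015", "18-04-2015", "05-07-2015", "26-05-2015"]
monthly_counts = complaints_per_month(sample_dates)
```

Switching the `strftime` pattern to `"%Y-%m-%d"` gives the daily granularity instead; either dictionary can be passed straight to a line or bar chart.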

Project No. 6: RETAIL ANALYSIS WITH WALMART DATA

Introduction: I worked on this project, titled “Retail Analysis with Walmart Data”, to fulfil the mandatory criteria of the “Data Science with R” module of my PGP course from Simplilearn. I created a Linear Regression model in the R programming language for this project. I also performed hypothesis testing and statistical analysis using various R libraries.

Objective: Walmart is one of the leading retail stores in the US. The business was facing a challenge due to unforeseen demand and would sometimes run out of stock. My task was to predict sales and demand accurately. Certain events and holidays impact sales on each day. Sales data is available for 45 Walmart stores. Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labour Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge in this project was modelling the effects of markdowns on these holiday weeks in the absence of complete or ideal historical data. The model is intended to predict demand while taking into account factors such as economic conditions, including CPI and the Unemployment Index.
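The five-times weighting of holiday weeks can be made concrete with a weighted error metric. The text does not name the metric, so the weighted mean absolute error below is an assumption; the project itself was written in R, and the sales figures are hypothetical.

```python
from typing import Sequence


def weighted_mae(
    actual: Sequence[float],
    predicted: Sequence[float],
    is_holiday: Sequence[bool],
    holiday_weight: float = 5.0,
) -> float:
    """Mean absolute error in which holiday weeks count `holiday_weight` times."""
    weights = [holiday_weight if holiday else 1.0 for holiday in is_holiday]
    total_error = sum(w * abs(a - p) for w, a, p in zip(weights, actual, predicted))
    return total_error / sum(weights)


# Hypothetical weekly sales (illustration only); the second week is a holiday week.
actual = [100.0, 200.0, 150.0]
predicted = [110.0, 180.0, 150.0]
is_holiday = [False, True, False]
score = weighted_mae(actual, predicted, is_holiday)
```

Here the absolute errors are 10, 20, and 0. The plain average is 10, but with the holiday week weighted five times the score is (10 + 5 × 20 + 0) / 7 ≈ 15.7, so a miss on a holiday week costs far more than the same miss on an ordinary week.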

Tools used: R programming language, dplyr, tidyr.

Stand-alone Projects:

Replicating Research:

Project No. 1: STOCK MARKET TRADE DATA ANALYSIS

Objective: I was given a dataset (570 rows × 6 columns) containing historical trade details such as entry price, exit price, P&L, and technical indicators such as atr_perc, roc, and rsi. The objective was to analyze the dataset in Python and find the range of values for atr_perc, roc, and rsi that maximizes the sum of P&L.

Indicator …….. Min. Value …….. Max. Value

Final P&L if we take entry only when given technical indicators were within above range: Rs. 65773

Time taken by program to process complete data and generate output: 2.03 seconds
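One way to search for such a range is brute force over the observed indicator values. The sketch below handles a single indicator and is only an assumption about how the search could be done; the author's actual method is not described, and the `rsi` and `pnl` values are hypothetical.

```python
from itertools import combinations_with_replacement
from typing import Sequence, Tuple


def best_range(values: Sequence[float], pnl: Sequence[float]) -> Tuple[float, float, float]:
    """Return (low, high, total) maximizing the P&L sum of trades with low <= value <= high."""
    candidates = sorted(set(values))
    best = None
    # Every (low, high) pair of observed values with low <= high is a candidate range.
    for low, high in combinations_with_replacement(candidates, 2):
        total = sum(p for v, p in zip(values, pnl) if low <= v <= high)
        if best is None or total > best[2]:
            best = (low, high, total)
    return best


# Hypothetical trades: indicator value at entry and resulting P&L (illustration only).
rsi = [30, 45, 50, 62, 70, 80]
pnl = [-200, 150, 300, -50, 400, -500]
low, high, total = best_range(rsi, pnl)
```

For three indicators the same function could be applied one indicator at a time, or the loops could be nested to search the ranges jointly, at a much higher cost as the number of distinct values grows.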

Blogs & Vlogs:

Graphical Visualization of LinkedIn Network

Introduction: We all use LinkedIn to connect with professionals in our own industry and in others. LinkedIn shows only a list of connections, so it is hard to visualize the entire network. I applied my Python and Data Science skills to explore my LinkedIn connections, find some interesting information, and visualize it as interactive graphs.

Check out the following links to see interactive graphs showing details of my LinkedIn connections:

Companies of my LinkedIn connections: https://datapane.com/u/mgupta/reports/J35lWVA/linkedin-connection-company/

Positions of my LinkedIn connections: https://datapane.com/u/mgupta/reports/j3LQnv7/linkedin-connection-positions/

Competitions:

Copyright (c) Manish Gupta
