Friday, December 18, 2020

Interactive Tableau Viz: Heart Disease Prediction based on Logistic Regression Model

Hi Peeps,

I am glad to share my Tableau viz on Heart Disease Prediction, built by applying a machine learning model (logistic regression) through Python scripts.

Integrating Tableau and Python:
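Behind such a dashboard, Tableau typically reaches Python through TabPy, and the Python side reduces to scoring records with the fitted logistic model. A minimal sketch of that scoring step (the weights, bias, and feature choices here are hypothetical illustrations, not the actual model behind the viz):

```python
import math

def predict_risk(weights, bias, features):
    """Logistic regression score: sigmoid(w . x + b) = probability of heart disease."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for (age, cholesterol, max heart rate):
p = predict_risk([0.04, 0.005, -0.02], -2.0, [54, 240, 150])
# p is a probability in (0, 1); a threshold (commonly 0.5) turns it into a class label.
```

In a TabPy setup, a function like this would be wrapped in a SCRIPT_REAL calculated field so the dashboard can score records interactively.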





Happy Learning!





Tuesday, October 27, 2020

Tableau Viz: Apparel Exports to the US

Hi Analysts,

I analyzed US apparel imports from different countries and how COVID-19 adversely affected the trade.

The exploratory data analysis includes donut charts and map plots, along with a growth-rate comparison.

For more details, visit the below link...

https://public.tableau.com/profile/urmisha.patel8137#!/vizhome/MakeOverMonday_Week43_2020/Dashboard1



Happy Learning!


Friday, October 23, 2020

Data Viz in Tableau: Corona Virus Symptoms

 Hi Peeps,

I took up a topic that has been hovering over all our minds lately: "CORONA VIRUS".

Find my Tableau viz at this link:

https://public.tableau.com/profile/urmisha.patel8137#!/vizhome/CoronaVirusSymptomsAnalysis/Dashboard1


Happy Learning!


Data Viz in Tableau: Mental Disorders by Sex at Different Stages of Life

 Hello Analysts,

Today, I analyzed an interesting topic: mental illness by sex. I created a dashboard with pictogram-style bar charts and a scatterplot to explore the relationships between various attributes.

Visit my Tableau Public: 

https://public.tableau.com/profile/urmisha.patel8137#!/vizhome/MentalDisorderBySexMakeOverMonday_Week27_2020/MentalDisorder_Analysis


Happy Learning!


Thursday, October 15, 2020

Data Viz in Tableau: America's Top 30 Food Giants

 

Hi Peeps,

I performed exploratory data analysis on America's top 30 food giants in Tableau. This includes segmentation and trendlines based on the performance of the food giants.

Visit the below link: 

https://public.tableau.com/profile/urmisha.patel8137#!/vizhome/AmericasFoodChain/USAChain



Happy Learning!


Monday, October 5, 2020

Oil Price 2020 Data Analysis and Forecasting in Tableau

 Hi BI Folks,

I analyzed the Oil Price_2020 dataset in Tableau, performed exploratory data analysis, and forecast future prices using an exponential smoothing model with trend.
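Tableau's forecasting is based on exponential smoothing; the "with trend" variant corresponds to Holt's linear method. A pure-Python sketch of the idea (the smoothing parameters here are illustrative, not the ones Tableau estimates):

```python
def holt_forecast(series, alpha=0.8, beta=0.2, horizon=3):
    """Holt's linear-trend exponential smoothing.
    alpha smooths the level, beta smooths the trend; returns point forecasts."""
    level, trend = series[0], series[1] - series[0]
    for x in series[1:]:
        prev_level = level
        level = alpha * x + (1 - alpha) * (level + trend)          # update the level
        trend = beta * (level - prev_level) + (1 - beta) * trend   # update the trend
    return [level + (h + 1) * trend for h in range(horizon)]

# On a perfectly linear price series the method simply extrapolates the line:
forecasts = holt_forecast([10, 12, 14, 16, 18, 20], horizon=2)  # approximately [22, 24]
```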

Visit below link:

https://public.tableau.com/profile/urmisha.patel8137#!/vizhome/OilPrice_2020/FuelPricingAnalysis




Happy Learning!



Saturday, October 3, 2020

Time Series Model (ARIMA) to Forecast Air Passengers in the Aviation Industry during COVID Times

                             


In light of the rapidly spreading disease named COVID-19, the International Civil Aviation Organization (ICAO) is actively monitoring its economic impact on civil aviation and regularly publishes updated reports and adjusted forecasts.

The analytical timeframe has now been extended from 2020 to Q1 2021. Let's see what it states:



This shows how hard COVID-19 has hit the aviation industry.


It is crucial to forecast how many passengers will travel by air, for the very survival of the aviation industry. Let's try to predict the number of passengers travelling in the coming months.

I have sample data on passengers travelling between 1949 and 1960. I know this data is decades old, but I am using it to show how time series forecasting methods help predict the future and inform business decisions.

Time series forecasting divides into three major stages:

1. Exploring the data

2. Preprocessing & Estimating

3. Apply Forecasting model





Let's start performing the initial stage in SAS.

Code:

options validvarname=V7;

proc import datafile='AirPassengers.csv' dbms=csv out=airpassengers replace;
run;

data airpassengers_ts;
set airpassengers;
format Month date9.;
rename _Passengers=Number_of_Passengers Month=Date;
run;

/* Total observations = 144 */

/* Stage1 */

title 'Scatterplot Passengers vs. Date';
proc gplot data = airpassengers_ts;
plot Number_of_Passengers*Date;
run;
quit;



The above graph shows the data is highly non-stationary: neither the mean nor the variance is constant, and there is a clear upward trend.

proc sgplot data = airpassengers_ts;
histogram Number_of_Passengers;
density Number_of_Passengers;
title "Histogram-Density Plot on #Passengers";
run;


The distribution looks right-skewed (mean > median).

proc sgplot data = airpassengers_ts;
series X = Date Y = Number_of_Passengers;
title 'Average quarterly Passengers in Flight';
run; 


This line chart clearly depicts an upward trend: with the passage of time, the number of travelers increased. Seasonality also prevails, as the pattern changes within each year.

data airpassengers_ts1;
set airpassengers_ts;
Month=month(Date);
Year=year(Date);
run;

proc boxplot data=airpassengers_ts1;
plot (Number_of_Passengers)*Year;
run;


The box-and-whisker plot above shows the uptrend in the number of air travelers; the year 1953 appears to be a turning point for the aviation industry. Median > mean.


/* To identify the correlation structure, use the IDENTIFY statement (nlag=24 by default) */

proc arima data = airpassengers_ts1;
identify var = Number_of_Passengers nlag=24;
run;
quit;


Let's examine the autocorrelation check table for white noise.

In this case, the white noise hypothesis is rejected very strongly, which is expected since the series is nonstationary. The p-value for the test of the first six autocorrelations is printed as <0.0001, meaning the p-value is less than 0.0001.




ACF: A plot of the autocorrelation of a time series by lag is called the Autocorrelation Function. Confidence intervals are drawn as a cone, by default at the 95% level, so correlation values outside the cone are very likely real correlations and not statistical flukes. Here, correlation values at lags 1-15 fall outside the confidence interval.

PACF: A partial autocorrelation summarizes the relationship between an observation and observations at prior time steps, with the relationships of the intervening observations removed. Lags 2 and 13 show relationships outside the interval.

ACF and PACF give intuition about the AR and MA components of a time series by separating these direct and indirect correlations. For an AR(k) series, we expect the ACF to be strong out to lag k, with the inertia of that relationship carrying on to subsequent lags and trailing off as the effect weakens. For an MA(k) process, we expect the ACF to show strong correlation up to lag k, then a sharp decline to little or no correlation; for its PACF, we expect a strong relationship at the lag and a trailing off of correlation from the lag onwards.
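The sample ACF described above is simple to compute directly from its definition; a pure-Python sketch (an illustration of the formula, not the PROC ARIMA output):

```python
def acf(series, nlags):
    """Sample autocorrelation for lags 0..nlags."""
    n = len(series)
    mean = sum(series) / n
    c0 = sum((x - mean) ** 2 for x in series)  # n times the lag-0 autocovariance
    return [
        sum((series[t] - mean) * (series[t + k] - mean) for t in range(n - k)) / c0
        for k in range(nlags + 1)
    ]

# A trending series shows the slow ACF decay that signals non-stationarity:
r = acf(list(range(1, 21)), 3)   # r[0] = 1.0 and r[1], r[2], r[3] decay only slowly
```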


/* Check whether the time-series is stationary or not? */ 
proc arima data = airpassengers_ts1;
identify var = Number_of_Passengers stationarity = (ADF=(1));
run;
quit;



Null hypothesis: the series is non-stationary.
Alternative hypothesis: the series is stationary.

The Dickey-Fuller test statistic can be computed under three specifications:

Zero Mean - no intercept; the series is a random walk without drift.
Single Mean - includes an intercept; the series is a random walk with drift.
Trend - includes an intercept and trend; the series is a random walk with linear trend.

Here, p-value > 0.05, so we fail to reject the null hypothesis: the series is non-stationary.

/* Stage 2 */

The above test clearly signifies that the series is not stationary. To apply a forecasting model, it is critical to convert the non-stationary series into a stationary one.

There are two methods to convert non-stationary series to stationary: Differencing & Detrending.
Below steps will perform differencing. 

proc arima data=airpassengers_ts1;
identify var=Number_of_Passengers(1);
run;

Total observations = 144. Differencing once eliminates one observation, so 143 observations remain.

Still, p < 0.0001, so the white noise hypothesis is again strongly rejected and the series remains non-stationary.
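The bookkeeping behind differencing is easy to see in code; a pure-Python sketch of lag differencing (illustrative only — the analysis itself stays in SAS):

```python
def difference(series, lag=1):
    """Difference a series at the given lag; the first `lag` observations are lost."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

monthly = list(range(144))          # stand-in for the 144 monthly observations
d1 = difference(monthly)            # first difference: 143 observations remain
d1_12 = difference(d1, lag=12)      # add a seasonal difference: 131 remain (13 lost in total)
```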




The above ACF shows a slight change in the series: it is moving from non-stationary toward stationary. Correlation values at lags 1-7 are outside the confidence interval in the ACF, while in the PACF the values at lags 1-13 are outside the confidence interval.

Let us perform differencing again. 

proc arima data = airpassengers_ts1;
identify var = Number_of_Passengers(2);
run;
quit;



After second-order differencing the series is still not stationary, but it is in a much better position than after first-order differencing. The ACF plot shows the presence of an MA component: we would expect the ACF of an MA(k) process to show strong correlation up to lag k, then a sharp decline to little or no correlation.

proc arima data = airpassengers_ts1;
identify var = Number_of_Passengers(1,1);
run;
quit;




Here, the first parameter differences the series at lag 1, and the second parameter differences it again at lag 1.

The ACF and PACF plots change after this second differencing pass.

proc arima data = airpassengers_ts;
identify var = Number_of_Passengers(1,12);
run;
quit;



Here, the first parameter differences at lag 1 and the second differences at the seasonal lag 12. After differencing, 13 observations in total are removed from the series (1 + 12). The series now looks stationary.

Here, p < 0.05, so the white noise hypothesis is rejected: there is still autocorrelation structure left to model.


Let's check again whether the series is now stationary by performing the Augmented Dickey-Fuller unit root test.

proc arima data = airpassengers_ts1;
identify var = Number_of_Passengers stationarity = (ADF=(1,12,24));
run;
quit;



Examine the Single Mean p-values:
For ADF(1): 0.3240 > 0.05, so the series is not stationary.
For ADF(12): 0.0010 < 0.05, so the series is stationary.
For ADF(24): 0.0907 > 0.05, so the series is not stationary.

This clearly shows that first-order differencing detrends the series, and differencing at lag 12 removes the seasonal component; together they make the series stationary.

proc arima data=airpassengers_ts1;
identify var=Number_of_Passengers(1,12) minic scan esacf stationarity=(adf=(12,24));
run;

Differencing removes the non-stationarity. PROC ARIMA also suggests candidate p and q values: the SCAN, ESACF, and MINIC options in the IDENTIFY statement report the BIC associated with the (p, q) combinations recommended by the SCAN and ESACF tables. The lower the BIC, the better the model. MINIC stands for Minimum Information Criterion.
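"Lower BIC is better" because the criterion trades goodness of fit against model size. A pure-Python sketch of the Gaussian BIC, up to an additive constant (an illustration of the idea, not SAS's exact MINIC computation):

```python
import math

def bic(sse, n, k):
    """Gaussian BIC up to a constant: n*ln(SSE/n) + k*ln(n).
    The k*ln(n) penalty discourages parameters that barely reduce the SSE."""
    return n * math.log(sse / n) + k * math.log(n)

# With roughly 131 usable observations after differencing, an extra parameter
# must cut the SSE noticeably to pay for its penalty (values here are made up):
simple = bic(sse=1000.0, n=131, k=1)
richer = bic(sse=990.0, n=131, k=2)   # slightly better fit, one more parameter
```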


This table provides vital information about the best possible p and q values for the series. p = 1 and q = 0 gives the lowest BIC value, making it the best fit for the model.

/* To check for the presence of outliers */
proc arima data = airpassengers_ts1;
identify var = Number_of_Passengers(1,12);
estimate P = 1 Q = 0;
outlier;
run;
quit;


p-value > 0.05 for the residual autocorrelation checks, so the residuals behave like white noise, although some outliers remain in the series even though it looks stationary.





The Q-Q plot shows a roughly linear relationship: deviation is minimal, the distribution is almost normal, and only a few outliers appear in the series.


There are approximately 3 outliers in the series (p < 0.05), but they will not have any significant effect on the model.

/* Stage 3: Forecasting for next 12 observations */
proc arima data = airpassengers_ts1;
identify var = Number_of_Passengers(1, 12);
estimate P = 1 Q = 0;
forecast lead = 12;
run;
quit;


The above table forecasts the next 12 observations, with ranges given as 95% confidence intervals.



The above graph forecasts the number of passengers over the next 12 months, i.e., average monthly passengers for the year 1961. The values range within a 95% confidence interval.

Total observations in the dataset = 144, so the forecast observations are numbered from id 145 onward.




After analyzing the trend and seasonality in the data, I applied an ARIMA model to forecast the number of passengers in the coming months. This helps in understanding the patterns and in gauging how likely the aviation industry is to come out of its huge losses.

Conclusion: 

Looking at the travel history, there is a spike in passengers around May, and the numbers keep increasing until September. From October onward, the number of passengers decreases. Around Christmas and New Year (December-January) there is again a slight increase in the rush.
It seems people love to travel in warm, humid weather.

From this, we can remain optimistic about a positive change for the aviation industry during the 2nd and 3rd quarters of 2021.

I hope this article is useful for learning time-series forecasting in SAS and applying the ARIMA model.


Happy Learning!


Friday, July 17, 2020

Survey Question Solved Using Analytics: Is the True Mean Body Temperature 98.6?

                 

A pandemic is a large-scale outbreak of infectious disease that can greatly increase morbidity and mortality over a wide geographic area. Significant gaps and challenges exist in global pandemic preparedness.
And one of its early symptoms is FEVER.


This FEVER raised questions about the notion of a true mean body temperature of 98.6. Should it be considered the same for both men and women? Is this temperature ideal for everyone? Are we at risk? What is the true mean body temperature?


An article published in the Journal of the American Medical Association in 1992 surveyed 65 males and 65 females to check whether the mean body temperature of females is the same as that of males, and generalized the pattern to the overall population.

Today, this problem can be addressed using the power of statistics. The data here comes from that survey.


Let's get started with the analysis of the data...


The data description is as follows:

# Id: 
# Body Temperature 
# Gender 
# Heart Rate

Numerical variables: body_temperature, heart_rate
Categorical variable: Gender

A glimpse of the data: the first 10 of 130 observations.




Steps:

First, find the distribution of the data to check whether it is normally distributed. Drilling down on the continuous variables, figure out the mean, the standard deviation, and the spread of the data.
Second, perform a t-test to determine whether the mean body temperature is 98.6, and produce a confidence interval plot of body temperature.


Code Snippet:


%let interval=BodyTemp HeartRate;

proc univariate data=temp noprint;
    var &interval;
    histogram &interval / normal kernel;
    inset n mean std / position=ne;
    title "Interval Variable Distribution Analysis";
run;


title;



The blue curve shows the ideal bell-shaped distribution. The red curve, i.e., the distribution of BodyTemp, looks nearly identical to it.

N = 130, which means there are no missing values.
Mean (μ) = 98.25
Standard deviation (stdev) = 0.73


This red curve also closely resembles a bell-shaped Gaussian curve.


N = 130, which means there are no missing values.
Mean (μ) = 73.76
Standard deviation (stdev) = 7.06

This depicts a normal distribution of the data.


We perform a one-sample t-test to determine whether the mean body temperature is 98.6.

Hypothesis Statement:
Null Hypothesis(H0): The mean body temperature is 98.6
Alternative Hypothesis(Ha): The mean body temperature is not 98.6


Code Snippet:

proc ttest data=temp h0=98.6
           plots(only shownull)=interval;
   var BodyTemp;
   title 'Testing Whether the Mean Body Temperature = 98.6';
run;


title;



The above table clearly depicts that the observed temperatures range between 96.3 and 100.8 F.
According to the 95% confidence interval, the average temperature lies between 98.12 and 98.38.
The standard deviation does not show much variation either.
Here, alpha is taken as 0.05.

Degrees of freedom = N - 1 = 130 - 1 = 129

Here, the p-value is highly significant. Since the p-value is low, we reject the null hypothesis.

Therefore, the mean body temperature is not 98.6. 
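That rejection can be sanity-checked by hand from the summary statistics alone; a pure-Python recomputation using the mean, standard deviation, and n reported above (1.98 is the approximate t critical value for df = 129):

```python
import math

def t_from_summary(mean, sd, n, mu0):
    """One-sample t statistic computed from summary statistics."""
    return (mean - mu0) / (sd / math.sqrt(n))

t = t_from_summary(98.25, 0.73, 130, 98.6)   # about -5.47, far beyond the +/-1.98 cutoff

se = 0.73 / math.sqrt(130)
ci = (98.25 - 1.98 * se, 98.25 + 1.98 * se)  # about (98.12, 98.38), matching PROC TTEST
```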





From the above plot, the 95% confidence interval lies in the range 98.1 to 98.4, so there is 95% confidence that the true mean body temperature lies within this range.



According to this analysis, the true mean body temperature is 98.2 F.

Conclusion:

The average body temperature is 98.2 F, but per the 95% confidence interval, the mean temperature may range between 98.1 F and 98.4 F.
If your body temperature is well beyond 98.4 F, you might be infected.

Happy Learning!

                                                                                   

Monday, June 15, 2020

Which Machine Learning Algorithms Require Feature Scaling (Standardization and Normalization)? And Which Do Not?

Hi folks,

Feature scaling is one of the most important steps in data preparation. Whether to use feature scaling depends on the algorithm you are using.

Many of us still wonder: why is feature scaling required? Why do we need to scale the variables?

1. Having features on the same scale lets them contribute equally to the result, which can enhance the performance of machine learning algorithms.

2. If you don't scale features, large-scale variables will dominate small-scale ones.
Example: suppose the dataset contains an X variable (say, a 2-digit number) and a Y variable (say, a 5-6 digit number). There is a huge gap in scale, and we don't want our algorithm to be biased towards one feature. This would hurt the model's accuracy and the algorithm's performance, and could produce wrong predictions.
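The dominance effect is easy to demonstrate with a Euclidean distance, the workhorse of distance-based algorithms (the records and scaling ranges below are hypothetical):

```python
import math

def euclid(p, q):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Two hypothetical customers: (age in years, income in dollars)
a, b = (25, 50_000), (45, 52_000)
raw = euclid(a, b)          # about 2000.1: the 20-year age gap barely registers

# After min-max scaling each feature to [0, 1] (assuming ages 20-60, incomes 20k-120k):
a_s, b_s = (0.125, 0.30), (0.625, 0.32)
scaled = euclid(a_s, b_s)   # about 0.50: both features now get a fair say
```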

Certain machine learning algorithms, such as distance-based algorithms, curve-based algorithms, matrix factorization, decomposition, dimensionality reduction, and gradient-descent-based algorithms, are sensitive to feature scaling (standardization and normalization of numerical variables).

Certain tree-based algorithms, on the other hand, are insensitive to feature scaling because they are rule-based, such as classification and regression trees, random forests, and gradient-boosted decision trees.


















Cons of feature scaling: you lose the original values while transforming them to other values, so there is a loss of interpretability.


Standardization vs. Normalization

Standardization:
The idea behind using standardization before applying a machine learning algorithm is to transform your data such that its distribution has a mean of 0 and a standard deviation of 1:

μ = 0
σ = 1

Normalization:
This method rescales the data into the range between 0 and 1, so it is also called min-max scaling.

Cons: this method can lose some information in the data, such as outliers.

For most applications, standardization performs better than normalization.
**Note: For the best possible results, fit the model on the raw (default), normalized, and standardized data, and compare the results.
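Both transforms fit in a few lines; a pure-Python sketch of the two definitions above:

```python
import statistics

def standardize(xs):
    """Z-score: subtract the mean, divide by the (sample) standard deviation."""
    mu, sd = statistics.mean(xs), statistics.stdev(xs)
    return [(x - mu) / sd for x in xs]

def min_max(xs):
    """Min-max scaling: rescale the values into the [0, 1] range."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

data = [2, 4, 6, 8]
z = standardize(data)   # mean 0, standard deviation 1
m = min_max(data)       # smallest value maps to 0, largest to 1
```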


I hope this is useful.

Happy Learning!