Question 1. What Are The Essential Steps In A Predictive Modeling Project?
Answer :
It consists of the following steps:
- Establish business objective of a predictive model
- Pull Historical Data – Internal and External
- Select Observation and Performance Window
- Create newly derived variables
- Split Data into Training, Validation and Test Samples
- Clean Data – Treatment of Missing Values and Outliers
- Variable Reduction / Selection
- Variable Transformation
- Develop Model
- Validate Model
- Check Model Performance
- Deploy Model
- Monitor Model
Question 2. What Are The Applications Of Predictive Modeling?
Answer :
Predictive modeling is mostly used in the following areas –
- Acquisition – Cross Sell / Up Sell
- Retention – Predictive Attrition Model
- Customer Lifetime Value Model
- Next Best Offer
- Market Mix Model
- Pricing Model
- Campaign Response Model
- Probability of Customers defaulting on loan
- Segment customers based on their homogeneous attributes
- Demand Forecasting
- Usage Simulation
- Underwriting
- Optimization – Optimize Network
Question 3. Explain The Problem Statement Of Your Project. What Are The Financial Impacts Of It?
Answer :
Cover the objective or main goal of your predictive model. Compare the monetary benefits of using the predictive model versus having no model. Also highlight the non-monetary benefits (if any).
Question 4. Difference Between Linear And Logistic Regression?
Answer :
The two main differences are as follows:
Linear regression requires the dependent variable to be continuous, i.e. numeric values (no categories or groups), while binary logistic regression requires the dependent variable to be binary, with two categories only (0/1). Multinomial or ordinal logistic regression can handle a dependent variable with more than two categories.
Linear regression is based on least squares estimation, which chooses the regression coefficients that minimize the sum of the squared distances between each observed response and its fitted value. Logistic regression is based on Maximum Likelihood Estimation, which chooses the coefficients that maximize the likelihood, i.e. the probability of the observed Y given X.
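The two estimation approaches can be contrasted in a minimal sketch. It assumes a single predictor, made-up toy data, and plain gradient ascent as the optimizer for the likelihood (statistical packages typically use other optimizers); the function names are illustrative only.

```python
import math

def fit_linear(x, y):
    """Least squares: closed-form intercept and slope for y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(a, b, x, y):
    """Log of P(Y | X) under a logistic model with intercept a and slope b."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = sigmoid(a + b * xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total

def fit_logistic(x, y, learning_rate=0.1, steps=5000):
    """Maximum likelihood via gradient ascent on the average log-likelihood."""
    a = b = 0.0
    n = len(x)
    for _ in range(steps):
        errors = [yi - sigmoid(a + b * xi) for xi, yi in zip(x, y)]
        a += learning_rate * sum(errors) / n
        b += learning_rate * sum(e * xi for e, xi in zip(errors, x)) / n
    return a, b

# Toy data (assumed for illustration): one predictor, two kinds of target.
x = [1, 2, 3, 4, 5, 6]
y_continuous = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]  # linear regression target
y_binary = [0, 0, 1, 0, 1, 1]                    # logistic regression target

lin_a, lin_b = fit_linear(x, y_continuous)
log_a, log_b = fit_logistic(x, y_binary)
print("linear slope:", round(lin_b, 3))
print("logistic slope:", round(log_b, 3))
print("log-likelihood at fit:", round(log_likelihood(log_a, log_b, x, y_binary), 3))
```

The linear fit has a closed-form solution, while the logistic fit has to be found iteratively because the likelihood equations have no closed form.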
Question 5. How To Handle Missing Values?
Answer :
We fill (impute) missing values using one of the following methods, or treat missing values as a separate category.
- Mean Imputation for Continuous Variables (No Outlier)
- Median Imputation for Continuous Variables (If Outlier)
- Cluster Imputation for Continuous Variables
- Imputation with a random value that is drawn between the minimum and maximum of the variable [Random value = min(x) + (max(x) – min(x)) * ranuni(SEED)]
- Impute Continuous Variables with Zero (Require business knowledge)
- Conditional Mean Imputation for Continuous Variables
- Other Imputation Methods for Continuous – Predictive mean matching, Bayesian linear regression, Linear regression ignoring model error etc.
- WOE for missing values in categorical variables
- Decision Tree, Random Forest, Logistic Regression for Categorical Variables
- Decision Tree and Random Forest work for both Continuous and Categorical Variables
- Multiple Imputation Method
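Three of the simpler methods above (mean, median, and random value between minimum and maximum) can be sketched as follows. This is an illustration only: missing values are assumed to be represented as `None`, the `impute` function name is hypothetical, and Python's `random.Random` stands in for the `ranuni(SEED)` call in the formula.

```python
import random
import statistics

def impute(values, method="mean", seed=0):
    """Return a copy of values with None entries filled in."""
    observed = [v for v in values if v is not None]
    if method == "mean":
        fill = statistics.mean(observed)
        return [fill if v is None else v for v in values]
    if method == "median":
        fill = statistics.median(observed)
        return [fill if v is None else v for v in values]
    if method == "random":
        # min(x) + (max(x) - min(x)) * uniform(0, 1), one draw per missing value
        rng = random.Random(seed)
        low, high = min(observed), max(observed)
        return [low + (high - low) * rng.random() if v is None else v
                for v in values]
    raise ValueError(f"unknown method: {method}")

# Toy column (assumed): 300 is an outlier, so mean and median differ a lot.
data = [10, 12, None, 11, 300, None]
print(impute(data, "mean"))
print(impute(data, "median"))
print(impute(data, "random"))
```

With the outlier present, the mean fill is pulled far above the typical values while the median fill is not, which is why the median is preferred when outliers exist.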
Question 6. How To Treat Outliers?
Answer :
There are several methods to treat outliers –
- Percentile Capping
- Box-Plot Method
- Mean plus minus 3 Standard Deviation
- Weight of Evidence
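The first three methods can be sketched as capping rules. The cut-offs used here (1st/99th percentiles, 1.5 times the interquartile range, 3 standard deviations) are common defaults and should be read as assumptions rather than fixed rules; the function names are illustrative.

```python
import statistics

def clip(values, low, high):
    """Cap every value to the interval [low, high]."""
    return [min(max(v, low), high) for v in values]

def cap_percentile(values, lower_pct=1, upper_pct=99):
    """Percentile capping at the given lower and upper percentiles."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return clip(values, cuts[lower_pct - 1], cuts[upper_pct - 1])

def cap_boxplot(values, k=1.5):
    """Box-plot method: cap at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return clip(values, q1 - k * iqr, q3 + k * iqr)

def cap_three_sigma(values):
    """Cap at mean plus/minus 3 sample standard deviations."""
    mean, sd = statistics.mean(values), statistics.stdev(values)
    return clip(values, mean - 3 * sd, mean + 3 * sd)

# Toy data (assumed): 20 ordinary values and one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 14, 13, 12] * 2 + [200]
print(max(cap_percentile(data)))
print(max(cap_boxplot(data)))
print(max(cap_three_sigma(data)))
```

Note that the mean and standard deviation are themselves inflated by the outlier, so the 3-standard-deviation rule tends to cap less aggressively than the box-plot rule on small samples.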
Question 7. Explain Dimensionality / Variable Reduction Techniques?
Answer :
Unsupervised Method (No Dependent Variable)
- Principal Component Analysis (PCA)
- Hierarchical Variable Clustering (Proc Varclus in SAS)
- Variance Inflation Factor (VIF)
- Remove zero and near-zero variance predictors
- Mean absolute correlation. Removes the variable with the largest mean absolute correlation.
Supervised Method (With Respect to Dependent Variable):
For Binary / Categorical Dependent Variable
- Information Value
- Wald Chi-Square
- Random Forest Variable Importance
- Gradient Boosting Variable Importance
- Forward/Backward/Stepwise – Variable Significance (p-value)
- AIC / BIC score
For Continuous Dependent Variable
- Adjusted R-Square
- Mallows’ Cp Statistic
- Random Forest Variable Importance
- AIC / BIC score
- Forward / Backward / Stepwise – Variable Significance
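As one example from the supervised list, Information Value for a binary dependent variable can be sketched as below. The sketch assumes an already binned or categorical predictor, the convention WOE = ln(% non-events / % events), and that every category contains at least one event and one non-event; the `information_value` name and the toy data are illustrative.

```python
import math
from collections import Counter

def information_value(feature, target):
    """IV = sum over categories of (%non-events - %events) * WOE."""
    events = Counter(f for f, t in zip(feature, target) if t == 1)
    non_events = Counter(f for f, t in zip(feature, target) if t == 0)
    total_events = sum(events.values())
    total_non_events = sum(non_events.values())
    iv = 0.0
    for category in set(feature):
        pct_event = events[category] / total_events
        pct_non_event = non_events[category] / total_non_events
        # Assumes both percentages are non-zero; empty cells need smoothing.
        woe = math.log(pct_non_event / pct_event)
        iv += (pct_non_event - pct_event) * woe
    return iv

# Toy data (assumed): 10 events followed by 10 non-events.
target = [1] * 10 + [0] * 10
strong = ["A"] * 8 + ["B"] * 2 + ["A"] * 2 + ["B"] * 8  # separates the classes
flat = ["A", "B"] * 10                                  # same mix in both classes

iv_strong = information_value(strong, target)
iv_flat = information_value(flat, target)
print(round(iv_strong, 4), round(iv_flat, 4))
```

A predictor whose categories have the same event mix as the whole population gets an IV of zero, so variables can be ranked by IV and the weakest ones dropped.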
Question 8. What Is Multicollinearity And How To Deal With It?
Answer :
Multicollinearity means high correlation among the independent variables. The absence of multicollinearity is one of the assumptions of linear and logistic regression. It can be identified by looking at the VIF score of each variable. VIF > 2.5 implies a moderate collinearity issue, and VIF > 5 is considered high collinearity.
It can be handled by an iterative process: first, remove the variable with the highest VIF, then recompute the VIF of the remaining variables. If any remaining variable still has VIF > 2.5, repeat the first step until every VIF <= 2.5.
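The iterative removal can be sketched in plain Python. This is a minimal illustration, not a production routine: it fits each auxiliary regression (with an intercept) through the normal equations, assumes no variable is a perfect linear combination of the others, and uses made-up data in which `x2` is nearly a copy of `x1`. All function and variable names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def r_squared(target, predictors):
    """R-squared from regressing target on predictors plus an intercept."""
    n = len(target)
    rows = [[1.0] + [p[i] for p in predictors] for i in range(n)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * v for bj, v in zip(beta, r)) for r in rows]
    mean_y = sum(target) / n
    ss_res = sum((y - f) ** 2 for y, f in zip(target, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in target)
    return 1.0 - ss_res / ss_tot

def vif(data):
    """VIF per variable; data maps variable name -> list of values."""
    scores = {}
    for name, column in data.items():
        others = [v for key, v in data.items() if key != name]
        # Raises ZeroDivisionError if a variable is perfectly collinear.
        scores[name] = 1.0 / (1.0 - r_squared(column, others)) if others else 1.0
    return scores

def drop_high_vif(data, threshold=2.5):
    """Repeatedly drop the variable with the highest VIF above threshold."""
    data = dict(data)
    while len(data) > 1:
        scores = vif(data)
        worst = max(scores, key=scores.get)
        if scores[worst] <= threshold:
            break
        del data[worst]
    return data

data = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [1.1, 2.0, 2.9, 4.2, 5.1, 5.8, 7.1, 8.0],  # nearly identical to x1
    "x3": [5, 3, 8, 1, 7, 2, 6, 4],
}
print(vif(data))
selected = drop_high_vif(data)
print(sorted(selected))
```

Only one of the two near-duplicate variables is removed; once it is gone, the VIFs of the remaining variables fall back close to 1, which is why VIF is recomputed after every removal instead of dropping all high-VIF variables at once.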
Question 9. How Is VIF Calculated And How Is It Interpreted?
Answer :
VIF measures how much the variance (the square of the estimate’s standard deviation) of an estimated regression coefficient is increased because of collinearity. If the VIF of a predictor variable were 9 (√9 = 3), the standard error for the coefficient of that predictor variable would be 3 times as large as it would be if that predictor variable were uncorrelated with the other predictor variables.
Steps of calculating VIF:
- Run a linear regression in which one of the independent variables is treated as the target variable and all the other independent variables are treated as predictors
- Calculate the VIF of the variable from the R-squared of that regression: VIF = 1/(1 - RSquared)
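The arithmetic linking R-squared, VIF, and the standard error can be checked with a short sketch. The R-squared values below are chosen for illustration so that they map to VIF values of 1, 2.5, 5, and 9; the function name is hypothetical.

```python
import math

def vif_from_r_squared(r_squared):
    """VIF = 1 / (1 - R-squared) of the auxiliary regression."""
    return 1.0 / (1.0 - r_squared)

for r_squared in (0.0, 0.6, 0.8, 8 / 9):
    value = vif_from_r_squared(r_squared)
    inflation = math.sqrt(value)  # factor by which the standard error grows
    print(f"R-squared={r_squared:.3f}  VIF={value:.2f}  SE inflation={inflation:.2f}x")
```

An R-squared of 8/9 in the auxiliary regression gives a VIF of 9, which corresponds to a standard error 3 times as large as in the uncorrelated case.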
Question 10. Do We Remove Intercepts While Calculating VIF?
Answer :
No. The regression used to determine VIF includes an intercept, so VIF depends on it. If the intercept is removed, R-square is not meaningful because it may be negative, in which case one can get VIF < 1, implying that the standard error of a variable would go up if that independent variable were uncorrelated with the other predictors.
Question 11. What Is P-value And How Is It Used For Variable Selection?
Answer :
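The p-value is the lowest level of significance at which you can reject the null hypothesis. For an independent variable, the null hypothesis is that its coefficient is zero, so the p-value indicates whether the coefficient is significantly different from zero. In variable selection, variables whose p-value exceeds a chosen significance level (commonly 0.05) are candidates for removal.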
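As a rough illustration, a two-sided p-value for a coefficient can be computed from its estimate and standard error using a Wald z-test. The sketch assumes the large-sample normal approximation; the variable names, coefficient estimates, and the 0.05 cut-off are hypothetical values chosen for the example.

```python
from statistics import NormalDist

def wald_p_value(coefficient, standard_error):
    """Two-sided p-value for H0: coefficient = 0 (normal approximation)."""
    z = coefficient / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Hypothetical fitted model: name -> (coefficient estimate, standard error)
estimates = {"income": (0.80, 0.20), "age": (0.05, 0.10)}
alpha = 0.05  # assumed significance level

p_values = {name: wald_p_value(b, se) for name, (b, se) in estimates.items()}
kept = [name for name, p in p_values.items() if p <= alpha]
print(p_values)
print("kept:", kept)
```

Here `income` has a small p-value and is kept, while `age` has a coefficient that is not distinguishable from zero at the chosen level and would be a candidate for removal.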
Question 12. Explain Important Model Performance Statistics?
Answer :
- AUC > 0.7, with no significant difference between the AUC of training and validation.
- The maximum KS should occur within the top 3 deciles and should be more than 30.
- Rank ordering: no break in rank ordering across deciles.
- Parameter estimates should have the same signs in both training and validation.
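AUC and KS can both be computed directly from predicted scores and actual outcomes. The sketch below is a minimal illustration on made-up data: AUC is computed by comparing every event/non-event pair, and KS is computed row by row on a 0 to 100 scale instead of by decile, assuming there are no tied scores.

```python
def auc(y_true, scores):
    """Share of (event, non-event) pairs in which the event scores higher."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ks_statistic(y_true, scores):
    """Max gap between cumulative % of events and % of non-events (0-100)."""
    ranked = sorted(zip(scores, y_true), reverse=True)  # highest score first
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    cum_pos = cum_neg = 0
    best = 0.0
    for _, y in ranked:
        if y == 1:
            cum_pos += 1
        else:
            cum_neg += 1
        best = max(best, abs(cum_pos / n_pos - cum_neg / n_neg))
    return 100.0 * best

# Toy scored sample (assumed): 1 = event, 0 = non-event, no tied scores.
y_true = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]

print("AUC:", auc(y_true, scores))
print("KS:", ks_statistic(y_true, scores))
```

Running the same two functions on the training and validation samples gives the comparison described in the first bullet.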
Question 13. Explain Collinearity Between Continuous And Categorical Variables. Is VIF A Correct Method To Compute Collinearity In This Case?
Answer :
Collinearity between categorical and continuous variables is very common. The choice of reference category for dummy variables affects multicollinearity, so changing the reference category can avoid collinearity. Pick the reference category with the highest proportion of cases.
VIF is not a correct method in this case. VIFs should only be computed for continuous variables. A t-test can be used to check collinearity between a continuous variable and a dummy variable.