Data Science I, Workshop I: Predicting interest rates at the Lending Club

Publish date: Oct 12, 2021

Q1 Exploratory data analysis and visualisation
Q2 Analysis on model 1 (questions about model 1)
Q3 Analysis on model 2-5 (questions about model 2-5)
Q4 Predictive accuracy of Model 2
Q5 Comparison between 10-fold cross validation and hold-out method
Q6 Analysis on learning curves
Q7 Comparison between OLS regression and LASSO
Q8 Prediction improvement by adding time information
Q9 Explanatory power of bond yields
Q10 Model Comparison
Q11 Further improvements
Analysis on additional data

Load and prepare the data

We start by loading the data to R in a dataframe.

lc_raw <- read_csv(here::here("data","LendingClub Data.csv"),  skip=1) %>%  #since the first row is a title we want to skip it. 
  clean_names() # use janitor::clean_names()

ICE the data: Inspect, Clean, Explore

lc_clean<- lc_raw %>%
  dplyr::select(-x20:-x80) %>% #delete empty columns
  filter(!is.na(int_rate)) %>%   #delete empty rows
  mutate(
    issue_d = mdy(issue_d),  # lubridate::mdy() to fix date format
    term = factor(term_months),     # turn 'term' into a categorical variable
    delinq_2yrs = factor(delinq_2yrs) # turn 'delinq_2yrs' into a categorical variable
  ) %>% 
  dplyr::select(-emp_title,-installment, -term_months, everything()) #move some not-so-important variables to the end.

The data is now in a clean format stored in the dataframe “lc_clean.”

Q1 Exploratory data analysis and visualisation

# Build a histogram of interest rates.
ggplot(lc_clean, aes(x=int_rate))+  
  geom_histogram(binwidth=0.01)+
  scale_x_continuous(labels = scales::percent) +
  labs(x="Interest Rate", title = "Distribution of Interest Rates")

# Build a histogram of interest rates but use different color for loans of different grades 
ggplot(lc_clean, aes(x=int_rate, fill=grade))+  
  geom_histogram(binwidth=0.01)+scale_x_continuous(labels = scales::percent)+ 
  labs(x="Interest Rate", title = "Distribution of Interest Rates by Grade")

# Produce a scatter plot of loan amount against interest rate and add visually the line of best fit
ggplot(lc_clean[seq(1, nrow(lc_clean), 10), ] , aes(y=int_rate, x=loan_amnt)) + 
  geom_point(size=0.1, alpha=0.5)+ 
  geom_smooth(method="lm", se=0) + 
  labs(y="Interest Rate", x="Loan Amount ($)", title = "Comparing Interest Rates and Loan Amounts")

# Produce a scatter plot of annual income against interest rate and add visually the line of best fit 
ggplot(lc_clean[seq(1, nrow(lc_clean), 10), ] , aes(y=int_rate, x=annual_inc)) + 
  geom_point(size=0.1)+ 
  geom_smooth(method="lm", se=0) +
  labs(y="Interest Rate", x="Annual Income ($)", title="Comparing Interest Rates and Annual Income")

# In the same axes, produce box plots of the interest rate for every value of delinquencies
ggplot(lc_clean , aes(y=int_rate, x=delinq_2yrs, colour= delinq_2yrs)) + 
  geom_boxplot()+
  # geom_jitter()+
  theme_bw()+
   scale_y_continuous(labels=scales::percent)+
  theme(legend.position = "none")+
  labs(
    title = "Do Delinquencies in the last Two Years Impact Interest Rates Charged?",
    x= "Number of Delinquecies in the last Two Years", y="Interest Rate"
  )

ggplot(lc_clean, aes (y=int_rate, x=issue_d, color = grade))+
  geom_point()+
  theme_bw()+
  labs(title="Interest Rate Evolution Over Time by Grade", x="Issue Date", y="Interest Rate")

lc_clean %>% 
  group_by (home_ownership) %>% 
  summarize (mean_int = mean(int_rate)) %>% 
  ggplot (aes(x = home_ownership, y= mean_int))+geom_col()+
  labs (title = "Relationship between Home Ownership and Interest Rates", x="Home Ownership Status", y="Interest Rate")

Estimate simple linear regression models

We start with a simple but quite powerful model.

# Use the lm command to estimate a regression model with the following variables "loan_amnt",  "term", "dti", "annual_inc", and "grade"

model1<-lm(int_rate ~ loan_amnt + term + dti + annual_inc + grade, data = lc_clean
  )
summary(model1)

## 
## Call:
## lm(formula = int_rate ~ loan_amnt + term + dti + annual_inc + 
##     grade, data = lc_clean)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.118827 -0.007035 -0.000342  0.006828  0.035081 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.169e-02  1.689e-04 424.363  < 2e-16 ***
## loan_amnt    1.475e-07  8.284e-09  17.809  < 2e-16 ***
## term60       3.608e-03  1.419e-04  25.431  < 2e-16 ***
## dti          4.328e-05  8.269e-06   5.234 1.66e-07 ***
## annual_inc  -9.734e-10  9.283e-10  -1.049    0.294    
## gradeB       3.554e-02  1.492e-04 238.248  < 2e-16 ***
## gradeC       6.016e-02  1.658e-04 362.783  < 2e-16 ***
## gradeD       8.172e-02  1.906e-04 428.746  < 2e-16 ***
## gradeE       9.999e-02  2.483e-04 402.660  < 2e-16 ***
## gradeF       1.195e-01  3.673e-04 325.408  < 2e-16 ***
## gradeG       1.355e-01  6.208e-04 218.245  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01056 on 37858 degrees of freedom
## Multiple R-squared:  0.9198, Adjusted R-squared:  0.9197 
## F-statistic: 4.34e+04 on 10 and 37858 DF,  p-value: < 2.2e-16

Q2 Analysis on model 1 (questions about model 1)

Are all variables statistically significant?

All the values are statistically significant except for annual income. The P values for all variables are less than 0.05, but the P value for annual income is larder than 0.05.

Interpret all the coefficients in the regression.

For all the coefficients in this equation, its interpretation is that if the loan amount increases by 100,000 the interest rate will increase by 1.5%.
Additionally, if the loan term is 60 months, the interest rate of the loan will be higher than that of a 36 month loan by on average 0.36%. Futhermore, a one unit increase to the debt-to-income ratio will increase on average the loan interest rate by 0.0043%.
Annual income is insignificant. Finally, comparing to Grade A, lowering the grade by each additional letter mean that the on average the interest rate will increase by 3.55% for Grabe B, 6.02% for Grade C, 8.17% for Grade D, 10% for Grade E, 12% for Grade F, and 14% for Grade G.

How much explanatory power does the model have?

The model has an adjusted R squared of .9197. This means that the model can explain about 92% of the varibility of the interest rates with this data.

How wide would the 95% confidence interval of any prediction based on this model be?

The confidence interval width is 4%, which is a high number when trying to predit interest rates.

Feature Engineering

Then we built progressively more complex models, with more features.

#Add to model 1 an interaction between loan amount and grade. Use the "var1*var2" notation to define an interaction term in the linear regression model. This will add the interaction and the individual variables to the model. 

model2 <- lm(int_rate ~ loan_amnt*grade + term +dti + annual_inc, data = lc_clean)
summary(model2)

## 
## Call:
## lm(formula = int_rate ~ loan_amnt * grade + term + dti + annual_inc, 
##     data = lc_clean)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.119807 -0.007230 -0.000057  0.006588  0.037460 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       7.171e-02  2.345e-04 305.762  < 2e-16 ***
## loan_amnt         1.528e-07  2.028e-08   7.537 4.91e-14 ***
## gradeB            3.623e-02  2.732e-04 132.619  < 2e-16 ***
## gradeC            6.198e-02  2.988e-04 207.452  < 2e-16 ***
## gradeD            8.082e-02  3.483e-04 232.034  < 2e-16 ***
## gradeE            9.633e-02  4.625e-04 208.293  < 2e-16 ***
## gradeF            1.143e-01  7.735e-04 147.699  < 2e-16 ***
## gradeG            1.327e-01  1.551e-03  85.580  < 2e-16 ***
## term60            3.793e-03  1.424e-04  26.636  < 2e-16 ***
## dti               3.836e-05  8.250e-06   4.649 3.34e-06 ***
## annual_inc       -1.224e-09  9.249e-10  -1.324  0.18564    
## loan_amnt:gradeB -6.617e-08  2.441e-08  -2.710  0.00673 ** 
## loan_amnt:gradeC -1.704e-07  2.621e-08  -6.500 8.14e-11 ***
## loan_amnt:gradeD  6.703e-08  2.798e-08   2.395  0.01662 *  
## loan_amnt:gradeE  2.209e-07  3.016e-08   7.323 2.47e-13 ***
## loan_amnt:gradeF  2.779e-07  4.136e-08   6.720 1.85e-11 ***
## loan_amnt:gradeG  1.265e-07  7.260e-08   1.743  0.08140 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01052 on 37852 degrees of freedom
## Multiple R-squared:  0.9204, Adjusted R-squared:  0.9204 
## F-statistic: 2.735e+04 on 16 and 37852 DF,  p-value: < 2.2e-16

#Add to the model we just created above the square and the cube of annual income. Use the poly(var_name,3) command as a variable in the linear regression model. 

model3 <-  lm(int_rate ~ loan_amnt*grade + term +dti + poly(annual_inc,3), data = lc_clean)
summary(model3)

## 
## Call:
## lm(formula = int_rate ~ loan_amnt * grade + term + dti + poly(annual_inc, 
##     3), data = lc_clean)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.119814 -0.007238 -0.000065  0.006594  0.037471 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           7.161e-02  2.279e-04 314.200  < 2e-16 ***
## loan_amnt             1.556e-07  2.047e-08   7.600 3.03e-14 ***
## gradeB                3.623e-02  2.732e-04 132.611  < 2e-16 ***
## gradeC                6.198e-02  2.988e-04 207.398  < 2e-16 ***
## gradeD                8.081e-02  3.483e-04 231.999  < 2e-16 ***
## gradeE                9.633e-02  4.625e-04 208.283  < 2e-16 ***
## gradeF                1.143e-01  7.736e-04 147.695  < 2e-16 ***
## gradeG                1.327e-01  1.551e-03  85.573  < 2e-16 ***
## term60                3.786e-03  1.426e-04  26.546  < 2e-16 ***
## dti                   3.761e-05  8.294e-06   4.535 5.79e-06 ***
## poly(annual_inc, 3)1 -1.589e-02  1.117e-02  -1.423   0.1549    
## poly(annual_inc, 3)2  6.296e-03  1.086e-02   0.580   0.5621    
## poly(annual_inc, 3)3 -8.981e-03  1.084e-02  -0.828   0.4076    
## loan_amnt:gradeB     -6.633e-08  2.442e-08  -2.717   0.0066 ** 
## loan_amnt:gradeC     -1.703e-07  2.621e-08  -6.498 8.23e-11 ***
## loan_amnt:gradeD      6.709e-08  2.798e-08   2.398   0.0165 *  
## loan_amnt:gradeE      2.209e-07  3.016e-08   7.324 2.45e-13 ***
## loan_amnt:gradeF      2.780e-07  4.136e-08   6.722 1.82e-11 ***
## loan_amnt:gradeG      1.270e-07  7.260e-08   1.750   0.0801 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01052 on 37850 degrees of freedom
## Multiple R-squared:  0.9204, Adjusted R-squared:  0.9204 
## F-statistic: 2.431e+04 on 18 and 37850 DF,  p-value: < 2.2e-16

#Continuing with the previous model, instead of annual income as a continuous variable break it down into quartiles and use quartile dummy variables. 
  
lc_clean <- lc_clean %>% 
  mutate(quartiles_annual_inc = as.factor(ntile(annual_inc, 4)))

model4 <-lm(int_rate ~ loan_amnt*grade + term +dti + quartiles_annual_inc, data = lc_clean)
summary(model4)

## 
## Call:
## lm(formula = int_rate ~ loan_amnt * grade + term + dti + quartiles_annual_inc, 
##     data = lc_clean)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.119767 -0.007203 -0.000064  0.006596  0.037581 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            7.187e-02  2.414e-04 297.700  < 2e-16 ***
## loan_amnt              1.615e-07  2.048e-08   7.889 3.13e-15 ***
## gradeB                 3.622e-02  2.733e-04 132.524  < 2e-16 ***
## gradeC                 6.196e-02  2.989e-04 207.272  < 2e-16 ***
## gradeD                 8.079e-02  3.484e-04 231.877  < 2e-16 ***
## gradeE                 9.632e-02  4.625e-04 208.273  < 2e-16 ***
## gradeF                 1.143e-01  7.735e-04 147.723  < 2e-16 ***
## gradeG                 1.327e-01  1.551e-03  85.570  < 2e-16 ***
## term60                 3.781e-03  1.425e-04  26.532  < 2e-16 ***
## dti                    3.672e-05  8.259e-06   4.446 8.78e-06 ***
## quartiles_annual_inc2 -2.619e-04  1.548e-04  -1.692  0.09068 .  
## quartiles_annual_inc3 -3.992e-04  1.590e-04  -2.510  0.01208 *  
## quartiles_annual_inc4 -5.382e-04  1.692e-04  -3.181  0.00147 ** 
## loan_amnt:gradeB      -6.638e-08  2.441e-08  -2.719  0.00655 ** 
## loan_amnt:gradeC      -1.699e-07  2.621e-08  -6.483 9.10e-11 ***
## loan_amnt:gradeD       6.770e-08  2.798e-08   2.419  0.01555 *  
## loan_amnt:gradeE       2.203e-07  3.016e-08   7.304 2.85e-13 ***
## loan_amnt:gradeF       2.766e-07  4.136e-08   6.687 2.30e-11 ***
## loan_amnt:gradeG       1.267e-07  7.259e-08   1.746  0.08089 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01052 on 37850 degrees of freedom
## Multiple R-squared:  0.9204, Adjusted R-squared:  0.9204 
## F-statistic: 2.432e+04 on 18 and 37850 DF,  p-value: < 2.2e-16

#Compare the performance of these four models using the anova command
anova(model1, model2, model3, model4)

Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
3.79e+04	4.22
3.79e+04	4.19	6	0.0332	50	1.64e-61
3.78e+04	4.19	2	0.000107	0.484	0.616
3.78e+04	4.19	0	0.000915

Q3 Analysis on model 2-5 (questions about model 2-5)

Which of the four models has the most explanatory power in sample?

When looking at the anova output, you can see that the R squared is increasing from each additional model, ending with an R-Squared of 92.04% for model 4, which we believe to have the most explanatory power.
Also, when looking at the variables, model 4 has the most number of significant variables, while model 1,2 and 3 have a higher number of insignificant variables. This is partly due to the fact that the quartiles for annual income are more significant than annual income alone and the polynomial of annual income.

In model 2, how can we interpret the estimated coefficient of the interaction term between grade B and loan amount?

The effect of increasing the loan amount is different for each grade. Using this interaction, we can see how the different Grades effect the interest rates taking into account the loan amount.

The problem of multicollinearity describes situations in which one feature is correlated with other features (or with a linear combination of other features). If the goal is to use the model to make predictions, should we be concerned about multicollinearity? Why, or why not?

When making predictions, multicollinearity is not an issue we need to worry about. All we care about is how good our prediction is, not how each variable effects the output.

Out of sample testing

We want to check the predictive accuracy of model2 by holding out a subset of the data to use as a testing data set. This method is sometimes referred to as the hold-out method for out-of-sample testing.

# split the data in dataframe called "testing" and another one called  "training". The "training" dataframe should have 80% of the data and the "testing" dataframe 20%.

set.seed(1235)

train_test_split <- initial_split(lc_clean, prop = 0.80) # split the dataset into training (80% of data points) and testing (20% of data points) datasets
testing <- testing(train_test_split)
training <- training(train_test_split)

# Fit model2 on the training set 
model2_training<-lm(int_rate ~ loan_amnt + term+ dti + annual_inc + grade +grade:loan_amnt, training)

# Calculate the RMSE of the model in the training set (in sample)
rmse_training<-sqrt(mean((residuals(model2_training))^2))

print(rmse_training)

## [1] 0.01051954

# Use the model to make predictions out of sample in the testing set
pred<-predict(model2_training,testing)
# Calculate the RMSE of the model in the testing set (out of sample)
rmse_testing<- RMSE(pred,testing$int_rate)

print(rmse_testing)

## [1] 0.01049407

print(rmse_training - rmse_testing) # comparing the training and testing result of RMSE

## [1] 2.546534e-05

Q4 Predictive accuracy of Model 2

The predictive accuracy changes by .0001, which is basically zero. This is sensitive to the random seed chosen because RMSE changes when changeing the random seed. Finally, our RMSE is very similar in out training and testing set, giving us confidence that there is no evidence of overfitting.

k-fold cross validation

We can also do out of sample testing using the method of k-fold cross validation. Using the caret package this is easy.

#the method "cv" stands for cross validation. We re going to create 10 folds.  

control <- trainControl (
    method="cv",
    number=10,
    verboseIter=TRUE) #by setting this to true the model will report its progress after each estimation

#we are going to train the model and report the results using k-fold cross validation

plsFit<-train(
    int_rate ~ loan_amnt + term+ dti + annual_inc + grade +grade:loan_amnt ,
    lc_clean,
   method = "lm",
    trControl = control
   )

## + Fold01: intercept=TRUE 
## - Fold01: intercept=TRUE 
## + Fold02: intercept=TRUE 
## - Fold02: intercept=TRUE 
## + Fold03: intercept=TRUE 
## - Fold03: intercept=TRUE 
## + Fold04: intercept=TRUE 
## - Fold04: intercept=TRUE 
## + Fold05: intercept=TRUE 
## - Fold05: intercept=TRUE 
## + Fold06: intercept=TRUE 
## - Fold06: intercept=TRUE 
## + Fold07: intercept=TRUE 
## - Fold07: intercept=TRUE 
## + Fold08: intercept=TRUE 
## - Fold08: intercept=TRUE 
## + Fold09: intercept=TRUE 
## - Fold09: intercept=TRUE 
## + Fold10: intercept=TRUE 
## - Fold10: intercept=TRUE 
## Aggregating results
## Fitting final model on full training set

summary(plsFit)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.119807 -0.007230 -0.000057  0.006588  0.037460 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         7.171e-02  2.345e-04 305.762  < 2e-16 ***
## loan_amnt           1.528e-07  2.028e-08   7.537 4.91e-14 ***
## term60              3.793e-03  1.424e-04  26.636  < 2e-16 ***
## dti                 3.836e-05  8.250e-06   4.649 3.34e-06 ***
## annual_inc         -1.224e-09  9.249e-10  -1.324  0.18564    
## gradeB              3.623e-02  2.732e-04 132.619  < 2e-16 ***
## gradeC              6.198e-02  2.988e-04 207.452  < 2e-16 ***
## gradeD              8.082e-02  3.483e-04 232.034  < 2e-16 ***
## gradeE              9.633e-02  4.625e-04 208.293  < 2e-16 ***
## gradeF              1.143e-01  7.735e-04 147.699  < 2e-16 ***
## gradeG              1.327e-01  1.551e-03  85.580  < 2e-16 ***
## `loan_amnt:gradeB` -6.617e-08  2.441e-08  -2.710  0.00673 ** 
## `loan_amnt:gradeC` -1.704e-07  2.621e-08  -6.500 8.14e-11 ***
## `loan_amnt:gradeD`  6.703e-08  2.798e-08   2.395  0.01662 *  
## `loan_amnt:gradeE`  2.209e-07  3.016e-08   7.323 2.47e-13 ***
## `loan_amnt:gradeF`  2.779e-07  4.136e-08   6.720 1.85e-11 ***
## `loan_amnt:gradeG`  1.265e-07  7.260e-08   1.743  0.08140 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01052 on 37852 degrees of freedom
## Multiple R-squared:  0.9204, Adjusted R-squared:  0.9204 
## F-statistic: 2.735e+04 on 16 and 37852 DF,  p-value: < 2.2e-16

#the method "cv" stands for cross validation. We re going to create 10 folds.  

control <- trainControl (
    method="cv",
    number=5,
    verboseIter=TRUE) #by setting this to true the model will report its progress after each estimation

#we are going to train the model and report the results using k-fold cross validation

plsFit<-train(
    int_rate ~ loan_amnt + term+ dti + annual_inc + grade +grade:loan_amnt ,
    lc_clean,
   method = "lm",
    trControl = control
   )
  

summary(plsFit)

#the method "cv" stands for cross validation. We re going to create 10 folds.  

control <- trainControl (
    method="cv",
    number=15,
    verboseIter=TRUE) #by setting this to true the model will report its progress after each estimation

#we are going to train the model and report the results using k-fold cross validation

plsFit<-train(
    int_rate ~ loan_amnt + term+ dti + annual_inc + grade +grade:loan_amnt ,
    lc_clean,
   method = "lm",
    trControl = control
   )
  

summary(plsFit)

Q5 Comparison between 10-fold cross validation and hold-out method

The RMSE of the hold out method for our training set was 0.01052 while the RMSE for the 10-fold cross validation is 0.01
10-fold cross is more reliable of a method as it uses all the data set as training and testing. The hold-out method reduces the data we use for training as it sets aside some data for the testing method. So, if we have a small data set, the data set we hold out can wrongly represent the whole data set. This gives us the option of using all the data points. A drawback for 10-fold is the time it takes to run it K amount of times.
When running the 5-fold and 15-fold models, we get the same RMSE and R squared as the 10-fold. This means that the model is very robust.

Sample size estimation and learning curves

We can use the hold out method for out-of-sample testing to check if we have a sufficiently large sample to estimate the model reliably.
The idea is to set aside some of the data as a testing set. From the remaining data draw progressively larger training sets and check how the performance of the model on the testing set changes. If the performance no longer improves with larger training sets we know we have a large enough sample. The code below does this. Examine it and run it with different random seeds.

#select a testing dataset (25% of all data)
set.seed(12)

train_test_split <- initial_split(lc_clean, prop = 0.75)
remaining <- training(train_test_split)
testing <- testing(train_test_split)

#We are now going to run 30 models starting from a tiny training set drawn from the training data and progressively increasing its size. The testing set remains the same in all iterations.

#initiating the model by setting some parameters to zero
rmse_sample <- 0
sample_size<-0
Rsq_sample<-0

for(i in 1:30) {
#from the remaining dataset select a smaller subset to training the data
set.seed(100)
sample

  learning_split <- initial_split(remaining, prop = i/200)
  training <- training(learning_split)
  sample_size[i]=nrow(training)
  
  #traing the model on the small dataset
  model3<-lm(int_rate ~ loan_amnt + term+ dti + annual_inc + grade + grade:loan_amnt, training)
  #test the performance of the model on the large testing dataset. This stays fixed for all iterations.
  pred<-predict(model3,testing)
  rmse_sample[i]<-RMSE(pred,testing$int_rate)
  Rsq_sample[i]<-R2(pred,testing$int_rate)
}
plot(sample_size,rmse_sample)

plot(sample_size,Rsq_sample)

Q6 Analysis on learning curves

When looking at the RMSE and R squared graphs, we can see that the RMSE decreases and the R squared increases a lot more before reaching the sample size of 2,000.
This means that we would need approximately 2,000 in our sample size in order to estimate model 3 reliably. Once we reach this sample size, we can reduce the prediction error by changing the variables in our model (feature engineering), but not increasing sample size.

Regularization using LASSO regression

If we are in the region of the learning curve where we do not have enough data, one option is to use a regularization method such as LASSO.

Let’s try to estimate a large and complicated model (many interactions and polynomials) on a small training dataset using OLS regression and hold-out validation method.

#split the data in testing and training. The training test is really small.
set.seed(1234)
train_test_split <- initial_split(lc_clean, prop = 0.01)
training <- training(train_test_split)
testing <- testing(train_test_split)

model_lm<-lm(int_rate ~ poly(loan_amnt,3) + term+ dti + annual_inc + grade +grade:poly(loan_amnt,3):term +poly(loan_amnt,3):term +grade:term, training)
predictions <- predict(model_lm,testing)

# Model prediction performance
data.frame(
  RMSE = RMSE(predictions, testing$int_rate),
  Rsquare = R2(predictions, testing$int_rate)
)

RMSE	Rsquare
0.0123	0.891

Not surprisingly this model does not perform well – as we knew form the learning curves we constructed for a simpler model we need a lot more data to estimate this model reliably. Try running it again with different seeds. The model’s performance tends to be sensitive to the choice of the training set.

LASSO regression offers one solution – it extends the OLS regression by penalizing the model for setting any coefficient estimate to a value that is different from zero. The penalty is proportional to a parameter lambda. This parameter cannot be estimated directly (and for this reason sometimes it is referred to as hyperparameter). lambda will be selected through k-fold cross validation so as to provide the best out-of-sample performance. As a result of the LASSO procedure, only those features that are more strongly associated with the outcome will have non-zero coefficient estimates and the estimated model will be less sensitive to the training set. Sometimes LASSO regression is referred to as regularization.

# we will look for the optimal lambda in this sequence (we will try 1000 different lambdas)
lambda_seq <- seq(0, 0.01, length = 1000)

# lasso regression using k-fold cross validation to select the best lambda

lasso <- train(
 int_rate ~ poly(loan_amnt,3) + term+ dti + annual_inc + grade +grade:poly(loan_amnt,3):term +poly(loan_amnt,3):term +grade:term,
 data = training,
 method = "glmnet",
  preProc = c("center", "scale"), #This option standardizes the data before running the LASSO regression
  trControl = control,
  tuneGrid = expand.grid(alpha = 1, lambda = lambda_seq) #alpha=1 specifies to run a LASSO regression. If alpha=0 the model would run ridge regression.
  )

## + Fold01: alpha=1, lambda=0.01 
## - Fold01: alpha=1, lambda=0.01 
## + Fold02: alpha=1, lambda=0.01 
## - Fold02: alpha=1, lambda=0.01 
## + Fold03: alpha=1, lambda=0.01 
## - Fold03: alpha=1, lambda=0.01 
## + Fold04: alpha=1, lambda=0.01 
## - Fold04: alpha=1, lambda=0.01 
## + Fold05: alpha=1, lambda=0.01 
## - Fold05: alpha=1, lambda=0.01 
## + Fold06: alpha=1, lambda=0.01 
## - Fold06: alpha=1, lambda=0.01 
## + Fold07: alpha=1, lambda=0.01 
## - Fold07: alpha=1, lambda=0.01 
## + Fold08: alpha=1, lambda=0.01 
## - Fold08: alpha=1, lambda=0.01 
## + Fold09: alpha=1, lambda=0.01 
## - Fold09: alpha=1, lambda=0.01 
## + Fold10: alpha=1, lambda=0.01 
## - Fold10: alpha=1, lambda=0.01 
## Aggregating results
## Selecting tuning parameters
## Fitting alpha = 1, lambda = 0.00032 on full training set

# Model coefficients
coef(lasso$finalModel, lasso$bestTune$lambda)

## 58 x 1 sparse Matrix of class "dgCMatrix"
##                                              s1
## (Intercept)                        1.211648e-01
## poly(loan_amnt, 3)1                8.147846e-04
## poly(loan_amnt, 3)2                .           
## poly(loan_amnt, 3)3                2.318037e-04
## term60                             1.315642e-03
## dti                                4.468570e-04
## annual_inc                         .           
## gradeB                             1.689641e-02
## gradeC                             2.336358e-02
## gradeD                             2.495146e-02
## gradeE                             2.491406e-02
## gradeF                             1.898605e-02
## gradeG                             1.746858e-02
## poly(loan_amnt, 3)1:term60         .           
## poly(loan_amnt, 3)2:term60         .           
## poly(loan_amnt, 3)3:term60         .           
## term60:gradeB                     -8.625465e-04
## term60:gradeC                      .           
## term60:gradeD                      5.599650e-04
## term60:gradeE                      1.733082e-03
## term60:gradeF                      6.810212e-04
## term60:gradeG                      1.610131e-03
## poly(loan_amnt, 3)1:term36:gradeB  .           
## poly(loan_amnt, 3)2:term36:gradeB  .           
## poly(loan_amnt, 3)3:term36:gradeB  4.427114e-04
## poly(loan_amnt, 3)1:term60:gradeB  .           
## poly(loan_amnt, 3)2:term60:gradeB  4.701917e-04
## poly(loan_amnt, 3)3:term60:gradeB  .           
## poly(loan_amnt, 3)1:term36:gradeC  .           
## poly(loan_amnt, 3)2:term36:gradeC  3.455465e-04
## poly(loan_amnt, 3)3:term36:gradeC -2.604887e-04
## poly(loan_amnt, 3)1:term60:gradeC  .           
## poly(loan_amnt, 3)2:term60:gradeC  .           
## poly(loan_amnt, 3)3:term60:gradeC -3.559988e-04
## poly(loan_amnt, 3)1:term36:gradeD  .           
## poly(loan_amnt, 3)2:term36:gradeD  5.728226e-04
## poly(loan_amnt, 3)3:term36:gradeD  2.996661e-04
## poly(loan_amnt, 3)1:term60:gradeD  .           
## poly(loan_amnt, 3)2:term60:gradeD  .           
## poly(loan_amnt, 3)3:term60:gradeD -5.402934e-05
## poly(loan_amnt, 3)1:term36:gradeE  .           
## poly(loan_amnt, 3)2:term36:gradeE  .           
## poly(loan_amnt, 3)3:term36:gradeE  .           
## poly(loan_amnt, 3)1:term60:gradeE  .           
## poly(loan_amnt, 3)2:term60:gradeE  .           
## poly(loan_amnt, 3)3:term60:gradeE -4.702642e-04
## poly(loan_amnt, 3)1:term36:gradeF  .           
## poly(loan_amnt, 3)2:term36:gradeF  .           
## poly(loan_amnt, 3)3:term36:gradeF  .           
## poly(loan_amnt, 3)1:term60:gradeF  .           
## poly(loan_amnt, 3)2:term60:gradeF  .           
## poly(loan_amnt, 3)3:term60:gradeF  1.398907e-03
## poly(loan_amnt, 3)1:term36:gradeG  .           
## poly(loan_amnt, 3)2:term36:gradeG  .           
## poly(loan_amnt, 3)3:term36:gradeG  .           
## poly(loan_amnt, 3)1:term60:gradeG  .           
## poly(loan_amnt, 3)2:term60:gradeG  4.631181e-04
## poly(loan_amnt, 3)3:term60:gradeG  .

# Best lambda
lasso$bestTune$lambda

## [1] 0.0003203203

# Count of how many coefficients are greater than zero and how many are equal to zero

sum(coef(lasso$finalModel, lasso$bestTune$lambda)!=0)

## [1] 27

sum(coef(lasso$finalModel, lasso$bestTune$lambda)==0)

## [1] 31

# Make predictions
predictions <- predict(lasso,testing)

# Model prediction performance

data.frame(
  RMSE = RMSE(predictions, testing$int_rate),
  Rsquare = R2(predictions, testing$int_rate)
)

RMSE	Rsquare
0.0108	0.917

Q7 Comparison between OLS regression and LASSO

Which model performs best out of sample, OLS regression or LASSO? Why?

By comparing RMSE and LASSO, we can see LASSO performs better than OLS regression. The R square value increases by almost 0.03.

What value of lambda offers best performance? Is this sensitive to the random seed? Why?

The model performs best with lambda equals to 0.00038. This number is sensitive to the random seed because the lamda changes with different random seeds.

How many coefficients are zero and how many are non-zero in the LASSO model of best fit? Is number of zero (or non-zero) coefficients sensitive on the random seed? Why?

26 coefficients are zero and 32 are non-zero. It is also sensitive to the random seed. We got different numbers with random seed of 4 and 5.

Why is it important to standardize continuous variables before running LASSO?

Because we have a penalty on the absolute values of coefficients. With different scales of continuous variables, the levels of coefficients can be very different, which introduces imbanlance of importance of variables.

Using Time Information

Let’s try to further improve the model’s predictive performance. So far we have not used any time series information. Effectively, all things being equal, our prediction for the interest rate of a loan given in 2009 would be the same as that of a loan given in 2011. We do not think it a good assumption and we are now taking into consideration the time series information to fix this in our previous models.

First, we investigate graphically whether there are any time trends in the interest rates. (Note that the variable “issue_d” only has information on the month the loan was awarded but not the exact date.) We want to use this to improve the model by tring controlling for time in a linear fashion (i.e., a linear time trend) and controlling for time as quarter-year dummies (this is a method to capture non-linear effects of time – we assume that the impact of time doesn’t change within a quarter but it can chance from quarter to quarter). Finally, we checked if time affect loans of different grades differently.

#linear time trend (add code below)
ggplot(lc_clean, aes(x=issue_d, y=int_rate))+
  geom_point(size=0.1)+
  geom_smooth(method = "lm", se=FALSE)

#linear time trend by grade (add code below)
ggplot(lc_clean, aes(x=issue_d, y=int_rate, color=grade))+
  geom_point(size=0.1)+
  geom_smooth(method = "lm", se=FALSE)

#Train models using OLS regression and k-fold cross-validation
#The first model has some explanatory variables and a linear time trend

time1<-train(
  int_rate ~ loan_amnt*grade + term +dti + quartiles_annual_inc + issue_d , #fill your variables here "+ issue_d"
  lc_clean,
  method = "lm",
  trControl = control)

## + Fold01: intercept=TRUE 
## - Fold01: intercept=TRUE 
## + Fold02: intercept=TRUE 
## - Fold02: intercept=TRUE 
## + Fold03: intercept=TRUE 
## - Fold03: intercept=TRUE 
## + Fold04: intercept=TRUE 
## - Fold04: intercept=TRUE 
## + Fold05: intercept=TRUE 
## - Fold05: intercept=TRUE 
## + Fold06: intercept=TRUE 
## - Fold06: intercept=TRUE 
## + Fold07: intercept=TRUE 
## - Fold07: intercept=TRUE 
## + Fold08: intercept=TRUE 
## - Fold08: intercept=TRUE 
## + Fold09: intercept=TRUE 
## - Fold09: intercept=TRUE 
## + Fold10: intercept=TRUE 
## - Fold10: intercept=TRUE 
## Aggregating results
## Fitting final model on full training set

summary(time1)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.118983 -0.006602 -0.000730  0.006935  0.036319 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -2.492e-02  2.512e-03  -9.921  < 2e-16 ***
## loan_amnt              1.008e-07  2.014e-08   5.001 5.71e-07 ***
## gradeB                 3.607e-02  2.681e-04 134.564  < 2e-16 ***
## gradeC                 6.244e-02  2.934e-04 212.772  < 2e-16 ***
## gradeD                 8.127e-02  3.419e-04 237.674  < 2e-16 ***
## gradeE                 9.727e-02  4.542e-04 214.134  < 2e-16 ***
## gradeF                 1.152e-01  7.591e-04 151.828  < 2e-16 ***
## gradeG                 1.336e-01  1.521e-03  87.854  < 2e-16 ***
## term60                 2.238e-03  1.453e-04  15.400  < 2e-16 ***
## dti                    8.506e-06  8.133e-06   1.046 0.295631    
## quartiles_annual_inc2 -2.541e-04  1.518e-04  -1.674 0.094150 .  
## quartiles_annual_inc3 -4.275e-04  1.560e-04  -2.740 0.006139 ** 
## quartiles_annual_inc4 -5.694e-04  1.660e-04  -3.430 0.000603 ***
## issue_d                6.542e-06  1.690e-07  38.700  < 2e-16 ***
## `loan_amnt:gradeB`    -8.054e-11  2.401e-08  -0.003 0.997323    
## `loan_amnt:gradeC`    -1.200e-07  2.574e-08  -4.661 3.15e-06 ***
## `loan_amnt:gradeD`     1.151e-07  2.747e-08   4.189 2.80e-05 ***
## `loan_amnt:gradeE`     2.459e-07  2.959e-08   8.311  < 2e-16 ***
## `loan_amnt:gradeF`     3.005e-07  4.057e-08   7.407 1.31e-13 ***
## `loan_amnt:gradeG`     1.599e-07  7.120e-08   2.246 0.024721 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01031 on 37849 degrees of freedom
## Multiple R-squared:  0.9234, Adjusted R-squared:  0.9234 
## F-statistic: 2.403e+04 on 19 and 37849 DF,  p-value: < 2.2e-16

#The second model has a different linear time trend for each grade class
time2<-train(
    int_rate ~ loan_amnt*grade + term +dti + quartiles_annual_inc + issue_d * grade, #fill your variables here 
    lc_clean,
   method = "lm",
    trControl = control
   )

## + Fold01: intercept=TRUE 
## - Fold01: intercept=TRUE 
## + Fold02: intercept=TRUE 
## - Fold02: intercept=TRUE 
## + Fold03: intercept=TRUE 
## - Fold03: intercept=TRUE 
## + Fold04: intercept=TRUE 
## - Fold04: intercept=TRUE 
## + Fold05: intercept=TRUE 
## - Fold05: intercept=TRUE 
## + Fold06: intercept=TRUE 
## - Fold06: intercept=TRUE 
## + Fold07: intercept=TRUE 
## - Fold07: intercept=TRUE 
## + Fold08: intercept=TRUE 
## - Fold08: intercept=TRUE 
## + Fold09: intercept=TRUE 
## - Fold09: intercept=TRUE 
## + Fold10: intercept=TRUE 
## - Fold10: intercept=TRUE 
## Aggregating results
## Fitting final model on full training set

summary(time2)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.120720 -0.006613 -0.000134  0.006753  0.030865 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.862e-01  4.357e-03  65.681  < 2e-16 ***
## loan_amnt              3.156e-07  1.786e-08  17.673  < 2e-16 ***
## gradeB                -2.298e-01  5.808e-03 -39.573  < 2e-16 ***
## gradeC                -3.541e-01  6.178e-03 -57.315  < 2e-16 ***
## gradeD                -5.147e-01  7.197e-03 -71.510  < 2e-16 ***
## gradeE                -6.134e-01  9.904e-03 -61.935  < 2e-16 ***
## gradeF                -7.098e-01  1.612e-02 -44.042  < 2e-16 ***
## gradeG                -7.102e-01  3.314e-02 -21.432  < 2e-16 ***
## term60                -6.331e-04  1.304e-04  -4.856 1.20e-06 ***
## dti                    5.407e-05  7.148e-06   7.564 4.00e-14 ***
## quartiles_annual_inc2 -2.976e-04  1.332e-04  -2.234  0.02547 *  
## quartiles_annual_inc3 -3.926e-04  1.369e-04  -2.869  0.00412 ** 
## quartiles_annual_inc4 -5.857e-04  1.456e-04  -4.022 5.79e-05 ***
## issue_d               -1.443e-05  2.935e-07 -49.179  < 2e-16 ***
## `loan_amnt:gradeB`    -1.560e-07  2.122e-08  -7.352 1.99e-13 ***
## `loan_amnt:gradeC`    -3.231e-07  2.283e-08 -14.154  < 2e-16 ***
## `loan_amnt:gradeD`    -1.771e-07  2.435e-08  -7.276 3.50e-13 ***
## `loan_amnt:gradeE`    -1.725e-07  2.658e-08  -6.489 8.76e-11 ***
## `loan_amnt:gradeF`    -1.865e-07  3.677e-08  -5.072 3.95e-07 ***
## `loan_amnt:gradeG`    -2.944e-07  6.439e-08  -4.572 4.85e-06 ***
## `gradeB:issue_d`       1.791e-05  3.903e-07  45.877  < 2e-16 ***
## `gradeC:issue_d`       2.807e-05  4.165e-07  67.402  < 2e-16 ***
## `gradeD:issue_d`       4.019e-05  4.849e-07  82.884  < 2e-16 ***
## `gradeE:issue_d`       4.799e-05  6.679e-07  71.852  < 2e-16 ***
## `gradeF:issue_d`       5.568e-05  1.085e-06  51.309  < 2e-16 ***
## `gradeG:issue_d`       5.695e-05  2.230e-06  25.541  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.009048 on 37843 degrees of freedom
## Multiple R-squared:  0.9411, Adjusted R-squared:  0.9411 
## F-statistic: 2.418e+04 on 25 and 37843 DF,  p-value: < 2.2e-16

#Change the time trend to a quarter dummy variables.
#zoo::as.yearqrt() creates quarter dummies 
lc_clean_quarter<-lc_clean %>%
  mutate(yq = as.factor(as.yearqtr(lc_clean$issue_d, format = "%Y-%m-%d")))



time3<-train(
    int_rate ~ loan_amnt*grade + term +dti + quartiles_annual_inc + yq,#fill your variables here 
    lc_clean_quarter,
     method = "lm",
    trControl = control
   )

## + Fold01: intercept=TRUE 
## - Fold01: intercept=TRUE 
## + Fold02: intercept=TRUE 
## - Fold02: intercept=TRUE 
## + Fold03: intercept=TRUE 
## - Fold03: intercept=TRUE 
## + Fold04: intercept=TRUE 
## - Fold04: intercept=TRUE 
## + Fold05: intercept=TRUE 
## - Fold05: intercept=TRUE 
## + Fold06: intercept=TRUE 
## - Fold06: intercept=TRUE 
## + Fold07: intercept=TRUE 
## - Fold07: intercept=TRUE 
## + Fold08: intercept=TRUE 
## - Fold08: intercept=TRUE 
## + Fold09: intercept=TRUE 
## - Fold09: intercept=TRUE 
## + Fold10: intercept=TRUE 
## - Fold10: intercept=TRUE 
## Aggregating results
## Fitting final model on full training set

summary(time3)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.116112 -0.005421 -0.000205  0.005814  0.037669 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            3.826e-02  9.075e-03   4.215 2.50e-05 ***
## loan_amnt              1.089e-07  1.777e-08   6.127 9.02e-10 ***
## gradeB                 3.657e-02  2.359e-04 155.008  < 2e-16 ***
## gradeC                 6.248e-02  2.584e-04 241.762  < 2e-16 ***
## gradeD                 8.095e-02  3.011e-04 268.891  < 2e-16 ***
## gradeE                 9.699e-02  4.000e-04 242.508  < 2e-16 ***
## gradeF                 1.154e-01  6.679e-04 172.748  < 2e-16 ***
## gradeG                 1.338e-01  1.338e-03  99.989  < 2e-16 ***
## term60                 4.576e-03  1.311e-04  34.916  < 2e-16 ***
## dti                    5.287e-06  7.157e-06   0.739 0.460068    
## quartiles_annual_inc2 -2.258e-04  1.336e-04  -1.690 0.090965 .  
## quartiles_annual_inc3 -3.008e-04  1.373e-04  -2.191 0.028469 *  
## quartiles_annual_inc4 -2.185e-04  1.462e-04  -1.495 0.134997    
## `yq2007 Q3`            1.931e-02  9.136e-03   2.113 0.034581 *  
## `yq2007 Q4`            1.486e-02  9.103e-03   1.632 0.102672    
## `yq2008 Q1`            2.129e-02  9.082e-03   2.345 0.019051 *  
## `yq2008 Q2`            2.292e-02  9.090e-03   2.522 0.011677 *  
## `yq2008 Q3`            2.407e-02  9.100e-03   2.645 0.008161 ** 
## `yq2008 Q4`            3.201e-02  9.083e-03   3.525 0.000425 ***
## `yq2009 Q1`            3.619e-02  9.079e-03   3.986 6.74e-05 ***
## `yq2009 Q2`            3.803e-02  9.078e-03   4.189 2.80e-05 ***
## `yq2009 Q3`            3.957e-02  9.077e-03   4.359 1.31e-05 ***
## `yq2009 Q4`            3.981e-02  9.076e-03   4.386 1.16e-05 ***
## `yq2010 Q1`            3.480e-02  9.076e-03   3.834 0.000126 ***
## `yq2010 Q2`            3.355e-02  9.075e-03   3.697 0.000219 ***
## `yq2010 Q3`            3.464e-02  9.075e-03   3.817 0.000135 ***
## `yq2010 Q4`            2.592e-02  9.075e-03   2.856 0.004290 ** 
## `yq2011 Q1`            2.805e-02  9.075e-03   3.091 0.001996 ** 
## `yq2011 Q2`            3.321e-02  9.075e-03   3.660 0.000253 ***
## `yq2011 Q3`            3.561e-02  9.074e-03   3.925 8.70e-05 ***
## `yq2011 Q4`            4.184e-02  9.074e-03   4.611 4.02e-06 ***
## `loan_amnt:gradeB`    -1.076e-07  2.117e-08  -5.083 3.73e-07 ***
## `loan_amnt:gradeC`    -2.140e-07  2.271e-08  -9.421  < 2e-16 ***
## `loan_amnt:gradeD`     5.910e-08  2.422e-08   2.440 0.014700 *  
## `loan_amnt:gradeE`     1.812e-07  2.609e-08   6.943 3.89e-12 ***
## `loan_amnt:gradeF`     2.255e-07  3.573e-08   6.311 2.81e-10 ***
## `loan_amnt:gradeG`     1.121e-07  6.265e-08   1.789 0.073581 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.00907 on 37832 degrees of freedom
## Multiple R-squared:  0.9408, Adjusted R-squared:  0.9408 
## F-statistic: 1.671e+04 on 36 and 37832 DF,  p-value: < 2.2e-16

#We specify one quarter dummy variable for each grade. This is going to be a large model as there are 19 quarters x 7 grades = 133 quarter-grade dummies.
time4<-train(
    int_rate ~ loan_amnt*grade + term +dti + quartiles_annual_inc + yq * grade, #fill your variables here 
    lc_clean_quarter,
     method = "lm",
    trControl = control
   )

## + Fold01: intercept=TRUE 
## - Fold01: intercept=TRUE 
## + Fold02: intercept=TRUE 
## - Fold02: intercept=TRUE 
## + Fold03: intercept=TRUE 
## - Fold03: intercept=TRUE 
## + Fold04: intercept=TRUE 
## - Fold04: intercept=TRUE 
## + Fold05: intercept=TRUE 
## - Fold05: intercept=TRUE 
## + Fold06: intercept=TRUE 
## - Fold06: intercept=TRUE 
## + Fold07: intercept=TRUE 
## - Fold07: intercept=TRUE 
## + Fold08: intercept=TRUE 
## - Fold08: intercept=TRUE 
## + Fold09: intercept=TRUE 
## - Fold09: intercept=TRUE 
## + Fold10: intercept=TRUE 
## - Fold10: intercept=TRUE 
## Aggregating results
## Fitting final model on full training set

summary(time4)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.120445 -0.004941  0.000306  0.004671  0.034212 
## 
## Coefficients: (11 not defined because of singularities)
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.858e-02  7.578e-03   2.451 0.014248 *  
## loan_amnt              3.008e-07  1.510e-08  19.922  < 2e-16 ***
## gradeB                 4.305e-02  3.193e-04 134.841  < 2e-16 ***
## gradeC                 6.974e-02  3.622e-04 192.538  < 2e-16 ***
## gradeD                 9.702e-02  4.189e-04 231.621  < 2e-16 ***
## gradeE                 1.177e-01  5.380e-04 218.715  < 2e-16 ***
## gradeF                 1.390e-01  8.637e-04 160.926  < 2e-16 ***
## gradeG                 1.603e-01  1.800e-03  89.015  < 2e-16 ***
## term60                 1.727e-03  1.131e-04  15.276  < 2e-16 ***
## dti                    4.804e-05  5.991e-06   8.017 1.11e-15 ***
## quartiles_annual_inc2 -2.415e-04  1.116e-04  -2.165 0.030375 *  
## quartiles_annual_inc3 -2.668e-04  1.146e-04  -2.327 0.019978 *  
## quartiles_annual_inc4 -1.833e-04  1.221e-04  -1.501 0.133291    
## `yq2007 Q3`            5.495e-02  7.722e-03   7.116 1.13e-12 ***
## `yq2007 Q4`            5.596e-02  7.711e-03   7.257 4.03e-13 ***
## `yq2008 Q1`            5.958e-02  7.616e-03   7.823 5.27e-15 ***
## `yq2008 Q2`            6.104e-02  7.651e-03   7.978 1.53e-15 ***
## `yq2008 Q3`            6.042e-02  7.687e-03   7.860 3.96e-15 ***
## `yq2008 Q4`            6.682e-02  7.616e-03   8.773  < 2e-16 ***
## `yq2009 Q1`            6.852e-02  7.598e-03   9.017  < 2e-16 ***
## `yq2009 Q2`            6.798e-02  7.593e-03   8.953  < 2e-16 ***
## `yq2009 Q3`            6.614e-02  7.589e-03   8.715  < 2e-16 ***
## `yq2009 Q4`            6.414e-02  7.587e-03   8.454  < 2e-16 ***
## `yq2010 Q1`            5.714e-02  7.586e-03   7.532 5.10e-14 ***
## `yq2010 Q2`            5.506e-02  7.583e-03   7.260 3.93e-13 ***
## `yq2010 Q3`            5.502e-02  7.583e-03   7.256 4.08e-13 ***
## `yq2010 Q4`            4.396e-02  7.581e-03   5.799 6.73e-09 ***
## `yq2011 Q1`            4.584e-02  7.580e-03   6.047 1.49e-09 ***
## `yq2011 Q2`            4.672e-02  7.580e-03   6.164 7.16e-10 ***
## `yq2011 Q3`            4.691e-02  7.579e-03   6.189 6.10e-10 ***
## `yq2011 Q4`            5.391e-02  7.575e-03   7.117 1.13e-12 ***
## `loan_amnt:gradeB`    -2.450e-07  1.794e-08 -13.659  < 2e-16 ***
## `loan_amnt:gradeC`    -3.703e-07  1.928e-08 -19.214  < 2e-16 ***
## `loan_amnt:gradeD`    -2.118e-07  2.056e-08 -10.300  < 2e-16 ***
## `loan_amnt:gradeE`    -2.256e-07  2.255e-08 -10.003  < 2e-16 ***
## `loan_amnt:gradeF`    -2.504e-07  3.154e-08  -7.938 2.12e-15 ***
## `loan_amnt:gradeG`    -3.434e-07  5.653e-08  -6.075 1.25e-09 ***
## `gradeB:yq2007 Q3`    -2.304e-02  2.299e-03 -10.020  < 2e-16 ***
## `gradeC:yq2007 Q3`    -3.562e-02  2.423e-03 -14.697  < 2e-16 ***
## `gradeD:yq2007 Q3`    -5.146e-02  3.711e-03 -13.867  < 2e-16 ***
## `gradeE:yq2007 Q3`    -5.699e-02  5.566e-03 -10.240  < 2e-16 ***
## `gradeF:yq2007 Q3`    -6.132e-02  3.759e-03 -16.311  < 2e-16 ***
## `gradeG:yq2007 Q3`            NA         NA      NA       NA    
## `gradeB:yq2007 Q4`    -2.403e-02  1.948e-03 -12.339  < 2e-16 ***
## `gradeC:yq2007 Q4`    -3.566e-02  1.799e-03 -19.823  < 2e-16 ***
## `gradeD:yq2007 Q4`    -5.008e-02  2.070e-03 -24.198  < 2e-16 ***
## `gradeE:yq2007 Q4`    -5.084e-02  2.722e-03 -18.675  < 2e-16 ***
## `gradeF:yq2007 Q4`    -5.443e-02  7.719e-03  -7.051 1.81e-12 ***
## `gradeG:yq2007 Q4`            NA         NA      NA       NA    
## `gradeB:yq2008 Q1`    -2.290e-02  9.838e-04 -23.277  < 2e-16 ***
## `gradeC:yq2008 Q1`    -3.337e-02  1.049e-03 -31.813  < 2e-16 ***
## `gradeD:yq2008 Q1`    -4.551e-02  1.178e-03 -38.642  < 2e-16 ***
## `gradeE:yq2008 Q1`    -5.324e-02  1.655e-03 -32.171  < 2e-16 ***
## `gradeF:yq2008 Q1`    -5.634e-02  2.844e-03 -19.807  < 2e-16 ***
## `gradeG:yq2008 Q1`    -6.318e-02  5.523e-03 -11.441  < 2e-16 ***
## `gradeB:yq2008 Q2`    -2.276e-02  1.363e-03 -16.697  < 2e-16 ***
## `gradeC:yq2008 Q2`    -3.374e-02  1.393e-03 -24.226  < 2e-16 ***
## `gradeD:yq2008 Q2`    -4.729e-02  1.647e-03 -28.715  < 2e-16 ***
## `gradeE:yq2008 Q2`    -5.146e-02  2.910e-03 -17.687  < 2e-16 ***
## `gradeF:yq2008 Q2`    -5.826e-02  3.592e-03 -16.218  < 2e-16 ***
## `gradeG:yq2008 Q2`            NA         NA      NA       NA    
## `gradeB:yq2008 Q3`    -2.023e-02  1.740e-03 -11.630  < 2e-16 ***
## `gradeC:yq2008 Q3`    -3.115e-02  1.668e-03 -18.670  < 2e-16 ***
## `gradeD:yq2008 Q3`    -4.497e-02  2.101e-03 -21.409  < 2e-16 ***
## `gradeE:yq2008 Q3`    -4.673e-02  3.008e-03 -15.535  < 2e-16 ***
## `gradeF:yq2008 Q3`    -5.664e-02  4.071e-03 -13.911  < 2e-16 ***
## `gradeG:yq2008 Q3`            NA         NA      NA       NA    
## `gradeB:yq2008 Q4`    -1.845e-02  1.018e-03 -18.121  < 2e-16 ***
## `gradeC:yq2008 Q4`    -3.049e-02  1.053e-03 -28.962  < 2e-16 ***
## `gradeD:yq2008 Q4`    -4.384e-02  1.289e-03 -34.014  < 2e-16 ***
## `gradeE:yq2008 Q4`    -4.913e-02  1.676e-03 -29.310  < 2e-16 ***
## `gradeF:yq2008 Q4`    -5.307e-02  3.907e-03 -13.582  < 2e-16 ***
## `gradeG:yq2008 Q4`    -5.526e-02  5.535e-03  -9.984  < 2e-16 ***
## `gradeB:yq2009 Q1`    -1.310e-02  9.322e-04 -14.050  < 2e-16 ***
## `gradeC:yq2009 Q1`    -2.475e-02  8.185e-04 -30.233  < 2e-16 ***
## `gradeD:yq2009 Q1`    -4.049e-02  9.186e-04 -44.076  < 2e-16 ***
## `gradeE:yq2009 Q1`    -4.419e-02  1.249e-03 -35.372  < 2e-16 ***
## `gradeF:yq2009 Q1`    -4.824e-02  2.811e-03 -17.163  < 2e-16 ***
## `gradeG:yq2009 Q1`            NA         NA      NA       NA    
## `gradeB:yq2009 Q2`    -1.349e-02  7.243e-04 -18.624  < 2e-16 ***
## `gradeC:yq2009 Q2`    -2.424e-02  7.533e-04 -32.185  < 2e-16 ***
## `gradeD:yq2009 Q2`    -4.049e-02  9.255e-04 -43.747  < 2e-16 ***
## `gradeE:yq2009 Q2`    -4.479e-02  1.311e-03 -34.162  < 2e-16 ***
## `gradeF:yq2009 Q2`    -5.316e-02  2.409e-03 -22.065  < 2e-16 ***
## `gradeG:yq2009 Q2`    -5.700e-02  4.104e-03 -13.889  < 2e-16 ***
## `gradeB:yq2009 Q3`    -1.140e-02  6.336e-04 -18.000  < 2e-16 ***
## `gradeC:yq2009 Q3`    -2.039e-02  6.928e-04 -29.427  < 2e-16 ***
## `gradeD:yq2009 Q3`    -3.290e-02  8.752e-04 -37.590  < 2e-16 ***
## `gradeE:yq2009 Q3`    -3.719e-02  1.131e-03 -32.871  < 2e-16 ***
## `gradeF:yq2009 Q3`    -4.274e-02  1.862e-03 -22.957  < 2e-16 ***
## `gradeG:yq2009 Q3`    -4.753e-02  3.162e-03 -15.030  < 2e-16 ***
## `gradeB:yq2009 Q4`    -8.481e-03  5.660e-04 -14.985  < 2e-16 ***
## `gradeC:yq2009 Q4`    -1.674e-02  6.161e-04 -27.177  < 2e-16 ***
## `gradeD:yq2009 Q4`    -2.822e-02  7.048e-04 -40.033  < 2e-16 ***
## `gradeE:yq2009 Q4`    -3.367e-02  1.108e-03 -30.401  < 2e-16 ***
## `gradeF:yq2009 Q4`    -3.639e-02  1.717e-03 -21.187  < 2e-16 ***
## `gradeG:yq2009 Q4`    -3.763e-02  2.556e-03 -14.720  < 2e-16 ***
## `gradeB:yq2010 Q1`    -1.070e-02  5.403e-04 -19.806  < 2e-16 ***
## `gradeC:yq2010 Q1`    -1.153e-02  6.006e-04 -19.199  < 2e-16 ***
## `gradeD:yq2010 Q1`    -2.164e-02  6.843e-04 -31.628  < 2e-16 ***
## `gradeE:yq2010 Q1`    -2.497e-02  1.015e-03 -24.593  < 2e-16 ***
## `gradeF:yq2010 Q1`    -3.071e-02  1.794e-03 -17.116  < 2e-16 ***
## `gradeG:yq2010 Q1`    -2.975e-02  3.106e-03  -9.580  < 2e-16 ***
## `gradeB:yq2010 Q2`    -9.913e-03  4.801e-04 -20.649  < 2e-16 ***
## `gradeC:yq2010 Q2`    -8.584e-03  5.269e-04 -16.293  < 2e-16 ***
## `gradeD:yq2010 Q2`    -1.870e-02  5.940e-04 -31.487  < 2e-16 ***
## `gradeE:yq2010 Q2`    -2.299e-02  7.790e-04 -29.507  < 2e-16 ***
## `gradeF:yq2010 Q2`    -2.840e-02  1.314e-03 -21.617  < 2e-16 ***
## `gradeG:yq2010 Q2`    -2.824e-02  2.662e-03 -10.611  < 2e-16 ***
## `gradeB:yq2010 Q3`    -7.443e-03  4.609e-04 -16.149  < 2e-16 ***
## `gradeC:yq2010 Q3`    -5.014e-03  5.055e-04  -9.918  < 2e-16 ***
## `gradeD:yq2010 Q3`    -1.705e-02  5.669e-04 -30.074  < 2e-16 ***
## `gradeE:yq2010 Q3`    -2.373e-02  6.796e-04 -34.922  < 2e-16 ***
## `gradeF:yq2010 Q3`    -2.740e-02  1.094e-03 -25.037  < 2e-16 ***
## `gradeG:yq2010 Q3`    -2.536e-02  1.880e-03 -13.488  < 2e-16 ***
## `gradeB:yq2010 Q4`    -7.629e-03  4.269e-04 -17.872  < 2e-16 ***
## `gradeC:yq2010 Q4`    -9.741e-04  4.811e-04  -2.025 0.042900 *  
## `gradeD:yq2010 Q4`    -1.265e-02  5.558e-04 -22.754  < 2e-16 ***
## `gradeE:yq2010 Q4`    -1.655e-02  6.725e-04 -24.607  < 2e-16 ***
## `gradeF:yq2010 Q4`    -1.942e-02  1.055e-03 -18.412  < 2e-16 ***
## `gradeG:yq2010 Q4`    -1.928e-02  1.690e-03 -11.409  < 2e-16 ***
## `gradeB:yq2011 Q1`    -5.823e-03  4.113e-04 -14.159  < 2e-16 ***
## `gradeC:yq2011 Q1`    -2.384e-03  4.731e-04  -5.041 4.66e-07 ***
## `gradeD:yq2011 Q1`    -1.072e-02  5.270e-04 -20.332  < 2e-16 ***
## `gradeE:yq2011 Q1`    -1.575e-02  6.063e-04 -25.972  < 2e-16 ***
## `gradeF:yq2011 Q1`    -1.818e-02  8.535e-04 -21.295  < 2e-16 ***
## `gradeG:yq2011 Q1`    -2.123e-02  1.555e-03 -13.651  < 2e-16 ***
## `gradeB:yq2011 Q2`    -1.318e-03  3.914e-04  -3.368 0.000759 ***
## `gradeC:yq2011 Q2`     3.869e-04  4.420e-04   0.875 0.381430    
## `gradeD:yq2011 Q2`    -4.212e-03  4.944e-04  -8.519  < 2e-16 ***
## `gradeE:yq2011 Q2`    -5.559e-03  5.849e-04  -9.504  < 2e-16 ***
## `gradeF:yq2011 Q2`    -5.157e-03  8.426e-04  -6.120 9.47e-10 ***
## `gradeG:yq2011 Q2`    -1.097e-02  1.592e-03  -6.890 5.66e-12 ***
## `gradeB:yq2011 Q3`     1.318e-03  3.652e-04   3.609 0.000308 ***
## `gradeC:yq2011 Q3`     2.390e-03  4.221e-04   5.662 1.50e-08 ***
## `gradeD:yq2011 Q3`    -3.556e-04  4.749e-04  -0.749 0.453964    
## `gradeE:yq2011 Q3`     1.596e-03  5.763e-04   2.770 0.005611 ** 
## `gradeF:yq2011 Q3`     9.203e-04  8.587e-04   1.072 0.283847    
## `gradeG:yq2011 Q3`    -1.687e-03  1.643e-03  -1.027 0.304369    
## `gradeB:yq2011 Q4`            NA         NA      NA       NA    
## `gradeC:yq2011 Q4`            NA         NA      NA       NA    
## `gradeD:yq2011 Q4`            NA         NA      NA       NA    
## `gradeE:yq2011 Q4`            NA         NA      NA       NA    
## `gradeF:yq2011 Q4`            NA         NA      NA       NA    
## `gradeG:yq2011 Q4`            NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.007563 on 37735 degrees of freedom
## Multiple R-squared:  0.959,  Adjusted R-squared:  0.9588 
## F-statistic:  6628 on 133 and 37735 DF,  p-value: < 2.2e-16

data.frame(
  time1$results$RMSE,
  time2$results$RMSE,
  time3$results$RMSE,
  time4$results$RMSE)

time1.results.RMSE	time2.results.RMSE	time3.results.RMSE	time4.results.RMSE
0.0103	0.00905	0.00908	0.00758

Q8 Prediction improvement by adding time information

From the results of linear regressions, we find almost all the time related variables significant. Therefore, the interest rates do change over time.
The predictions can be improved by adding quarter-year dummies. The RMSE has been improved from 0.0103 to 0.0076.

Using Bond Yields

One concern with using time trends for forecasting is that in order to make predictions for future loans we will need to project trends to the future. This is an extrapolation that may not be reasonable, especially if macroeconomic conditions in the future change. Furthermore, if we are using quarter-year dummies, it is not even possible to estimate the coefficient of these dummy variables for future quarters.

Instead, perhaps it’s better to find the reasons as to why different periods are different from one another. The csv file “MonthBondYields.csv” contains information on the yield of US Treasuries on the first day of each month. Can you use it to see if you can improve your predictions without using time dummies?

#load the data to memory as a dataframe
bond_prices<-readr::read_csv(here::here("data","MonthBondYields.csv"))

#make the date of the bond file comparable to the lending club dataset
#for some regional date/number (locale) settings this may not work. If it does try running the following line of code in the Console
#Sys.setlocale("LC_TIME","English")
bond_prices <- bond_prices %>%
  mutate(Date2=as.Date(paste("01",Date,sep="-"),"%d-%b-%y")) %>%
  select(-starts_with("X"))

#let's see what happened to bond yields over time. Lower bond yields mean the cost of borrowing has gone down.

bond_prices %>%
  ggplot(aes(x=Date2, y=Price))+
  geom_point(size=0.1, alpha=0.5)

#join the data using a left join
lc_with_bonds <- lc_clean %>%
  mutate(yq = as.factor(as.yearqtr(lc_clean$issue_d, format = "%Y-%m-%d"))) %>% 
  left_join(bond_prices, by = c("issue_d" = "Date2")) %>%
  arrange(issue_d) %>%
  filter(!is.na(Price)) #drop any observations where there re no bond prices available

# investigate graphically if there is a relationship 
lc_with_bonds%>%
  ggplot(aes(x=int_rate, y=Price))+
  geom_point(size=0.1, alpha=0.5)+geom_smooth(method="lm")

lc_with_bonds%>%
  ggplot(aes(x=int_rate, y=Price, color=grade))+
  geom_point(size=0.1, alpha=0.5)+
  geom_smooth(method="lm")

#let's train a model using the bond information
control <- trainControl (
    method="cv",
    number=10,
    verboseIter=TRUE)

plsFit<-train(
    int_rate ~  loan_amnt*grade + term +dti + Price * grade, 
    lc_with_bonds,
    method = "lm",
    trControl = control
   )

## + Fold01: intercept=TRUE 
## - Fold01: intercept=TRUE 
## + Fold02: intercept=TRUE 
## - Fold02: intercept=TRUE 
## + Fold03: intercept=TRUE 
## - Fold03: intercept=TRUE 
## + Fold04: intercept=TRUE 
## - Fold04: intercept=TRUE 
## + Fold05: intercept=TRUE 
## - Fold05: intercept=TRUE 
## + Fold06: intercept=TRUE 
## - Fold06: intercept=TRUE 
## + Fold07: intercept=TRUE 
## - Fold07: intercept=TRUE 
## + Fold08: intercept=TRUE 
## - Fold08: intercept=TRUE 
## + Fold09: intercept=TRUE 
## - Fold09: intercept=TRUE 
## + Fold10: intercept=TRUE 
## - Fold10: intercept=TRUE 
## Aggregating results
## Fitting final model on full training set

summary(plsFit)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.117450 -0.005810 -0.000302  0.006494  0.040734 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.791e-02  5.361e-04 126.679  < 2e-16 ***
## loan_amnt           1.795e-07  1.863e-08   9.638  < 2e-16 ***
## gradeB              5.350e-02  6.990e-04  76.538  < 2e-16 ***
## gradeC              8.631e-02  7.843e-04 110.051  < 2e-16 ***
## gradeD              1.192e-01  8.941e-04 133.269  < 2e-16 ***
## gradeE              1.433e-01  1.167e-03 122.786  < 2e-16 ***
## gradeF              1.667e-01  1.791e-03  93.096  < 2e-16 ***
## gradeG              1.903e-01  3.456e-03  55.057  < 2e-16 ***
## term60              1.506e-03  1.323e-04  11.378  < 2e-16 ***
## dti                 2.802e-05  7.453e-06   3.760 0.000171 ***
## Price               1.301e-03  1.626e-04   8.005 1.22e-15 ***
## `loan_amnt:gradeB` -9.477e-08  2.249e-08  -4.214 2.51e-05 ***
## `loan_amnt:gradeC` -2.195e-07  2.416e-08  -9.088  < 2e-16 ***
## `loan_amnt:gradeD` -1.987e-09  2.572e-08  -0.077 0.938417    
## `loan_amnt:gradeE`  6.441e-08  2.787e-08   2.311 0.020827 *  
## `loan_amnt:gradeF`  6.990e-08  3.840e-08   1.820 0.068693 .  
## `loan_amnt:gradeG` -1.206e-07  6.770e-08  -1.781 0.074911 .  
## `gradeB:Price`     -5.774e-03  2.181e-04 -26.470  < 2e-16 ***
## `gradeC:Price`     -8.000e-03  2.429e-04 -32.934  < 2e-16 ***
## `gradeD:Price`     -1.274e-02  2.794e-04 -45.585  < 2e-16 ***
## `gradeE:Price`     -1.527e-02  3.611e-04 -42.287  < 2e-16 ***
## `gradeF:Price`     -1.667e-02  5.359e-04 -31.103  < 2e-16 ***
## `gradeG:Price`     -1.780e-02  9.982e-04 -17.829  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.009584 on 37845 degrees of freedom
## Multiple R-squared:  0.9339, Adjusted R-squared:  0.9339 
## F-statistic: 2.43e+04 on 22 and 37845 DF,  p-value: < 2.2e-16

Q9 Explanatory power of bond yields

We are using loan amount times grade,term, dti and Price times grade to predict the interest rates
Yes, by adding bond yields to the model, the adjusted R square increased by 0.01, and the bond yields are significant on every grade.
With increase in bond yield (price) by 1%, the interest rates on the loan increase by 0.13% on Grade A loans. As the grade decreases (A to G) the effect of bond yields (price) on the interest rate starts to become negative. This means that as treasury yields (price) increases, the effect on interest rates start to decrease.

Q10 Model Comparison

TestModel1<-train(
    int_rate ~  loan_amnt*grade + term +dti + Price * grade, 
    lc_with_bonds,
    method = "lm",
    trControl = control
   )

## + Fold01: intercept=TRUE 
## - Fold01: intercept=TRUE 
## + Fold02: intercept=TRUE 
## - Fold02: intercept=TRUE 
## + Fold03: intercept=TRUE 
## - Fold03: intercept=TRUE 
## + Fold04: intercept=TRUE 
## - Fold04: intercept=TRUE 
## + Fold05: intercept=TRUE 
## - Fold05: intercept=TRUE 
## + Fold06: intercept=TRUE 
## - Fold06: intercept=TRUE 
## + Fold07: intercept=TRUE 
## - Fold07: intercept=TRUE 
## + Fold08: intercept=TRUE 
## - Fold08: intercept=TRUE 
## + Fold09: intercept=TRUE 
## - Fold09: intercept=TRUE 
## + Fold10: intercept=TRUE 
## - Fold10: intercept=TRUE 
## Aggregating results
## Fitting final model on full training set

summary(TestModel1)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.117450 -0.005810 -0.000302  0.006494  0.040734 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.791e-02  5.361e-04 126.679  < 2e-16 ***
## loan_amnt           1.795e-07  1.863e-08   9.638  < 2e-16 ***
## gradeB              5.350e-02  6.990e-04  76.538  < 2e-16 ***
## gradeC              8.631e-02  7.843e-04 110.051  < 2e-16 ***
## gradeD              1.192e-01  8.941e-04 133.269  < 2e-16 ***
## gradeE              1.433e-01  1.167e-03 122.786  < 2e-16 ***
## gradeF              1.667e-01  1.791e-03  93.096  < 2e-16 ***
## gradeG              1.903e-01  3.456e-03  55.057  < 2e-16 ***
## term60              1.506e-03  1.323e-04  11.378  < 2e-16 ***
## dti                 2.802e-05  7.453e-06   3.760 0.000171 ***
## Price               1.301e-03  1.626e-04   8.005 1.22e-15 ***
## `loan_amnt:gradeB` -9.477e-08  2.249e-08  -4.214 2.51e-05 ***
## `loan_amnt:gradeC` -2.195e-07  2.416e-08  -9.088  < 2e-16 ***
## `loan_amnt:gradeD` -1.987e-09  2.572e-08  -0.077 0.938417    
## `loan_amnt:gradeE`  6.441e-08  2.787e-08   2.311 0.020827 *  
## `loan_amnt:gradeF`  6.990e-08  3.840e-08   1.820 0.068693 .  
## `loan_amnt:gradeG` -1.206e-07  6.770e-08  -1.781 0.074911 .  
## `gradeB:Price`     -5.774e-03  2.181e-04 -26.470  < 2e-16 ***
## `gradeC:Price`     -8.000e-03  2.429e-04 -32.934  < 2e-16 ***
## `gradeD:Price`     -1.274e-02  2.794e-04 -45.585  < 2e-16 ***
## `gradeE:Price`     -1.527e-02  3.611e-04 -42.287  < 2e-16 ***
## `gradeF:Price`     -1.667e-02  5.359e-04 -31.103  < 2e-16 ***
## `gradeG:Price`     -1.780e-02  9.982e-04 -17.829  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.009584 on 37845 degrees of freedom
## Multiple R-squared:  0.9339, Adjusted R-squared:  0.9339 
## F-statistic: 2.43e+04 on 22 and 37845 DF,  p-value: < 2.2e-16

# TestModel2<-train(
#     int_rate ~  loan_amnt*grade + term +dti + Price * grade + quartiles_annual_inc + home_ownership + addr_state , 
#     lc_with_bonds,
#     method = "lm",
#     trControl = control
#    )
# summary(TestModel2)
# 
# TestModel3<-train(
#     int_rate ~  loan_amnt*grade + term +dti + Price * grade + quartiles_annual_inc + term*loan_amnt, 
#     lc_with_bonds,
#     method = "lm",
#     trControl = control
#    )
# summary(TestModel3)

We have tried different models by adding home ownership, loan purpose, delinq_2 years,quartiles_annual_income and term*loan_amt, and they have either decreased our R-squared or increased it very insignificantly. By adding more interaction terms and time information, the R^2 can be improved to 96% but it takes hours to run the model.
After several attempts to increase the performance of the model (RMSE, R^2) by adding additional variables, we consider TestModel1 is the best model. This is because the performance doesn’t improve significantly with more variables and TestModel1 is the least complex while with enough accuracy.
TestModel1 uses loam_amnt\(\times\)grade, term, dti, Price\(\times\)grade, quartiles_annual_inc to predict the interest rate. It achieves an R^2 of 0.934 and an RMSE of 0.009584. The model can be used for prediction and extrapolated into the future as we didn’t include any time dummy variables.
TestModel1 has a 95% confidence interval with a length of \(1.96\times0.009584\times 2=0.0375\).

Q11 Further improvements

Adding quarterly data on US CPI:

cpi <- readr::read_csv(here::here("data","ConsumerPriceIndex.csv")) %>% 
  mutate(yq = as.factor(as.yearqtr(DATE, format = "%Y-%m-%d")))

lc_bonds_cpi <- lc_with_bonds %>% 
  left_join(cpi, by = c("yq")) %>% 
  arrange(issue_d) %>%
  filter(!is.na(CPALTT01USQ657N_NBD19600401)) #drop any observations where there re no bond prices available

# investigate graphically if there is a relationship 
lc_bonds_cpi%>%
  ggplot(aes(x=CPALTT01USQ657N_NBD19600401))+
  geom_histogram()

# investigate graphically if there is a relationship 
lc_bonds_cpi%>%
  ggplot(aes(x=int_rate, y=CPALTT01USQ657N_NBD19600401))+
  geom_point(size=0.1, alpha=0.5)+
  geom_smooth(method="lm")

lc_bonds_cpi%>%
  ggplot(aes(x=int_rate, y=CPALTT01USQ657N_NBD19600401, color=grade))+
  geom_point(size=0.1, alpha=0.5)+
  geom_smooth(method="lm")

plsFit<-train(
    int_rate ~  loan_amnt*grade + term +dti +  Price*grade+ CPALTT01USQ657N_NBD19600401 * grade, #fill your variables here 
    lc_bonds_cpi,
    method = "lm",
    trControl = control
   )

## + Fold01: intercept=TRUE 
## - Fold01: intercept=TRUE 
## + Fold02: intercept=TRUE 
## - Fold02: intercept=TRUE 
## + Fold03: intercept=TRUE 
## - Fold03: intercept=TRUE 
## + Fold04: intercept=TRUE 
## - Fold04: intercept=TRUE 
## + Fold05: intercept=TRUE 
## - Fold05: intercept=TRUE 
## + Fold06: intercept=TRUE 
## - Fold06: intercept=TRUE 
## + Fold07: intercept=TRUE 
## - Fold07: intercept=TRUE 
## + Fold08: intercept=TRUE 
## - Fold08: intercept=TRUE 
## + Fold09: intercept=TRUE 
## - Fold09: intercept=TRUE 
## + Fold10: intercept=TRUE 
## - Fold10: intercept=TRUE 
## Aggregating results
## Fitting final model on full training set

summary(plsFit)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.118658 -0.006044  0.000158  0.006133  0.036126 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           6.126e-02  5.479e-04 111.807  < 2e-16 ***
## loan_amnt                             1.788e-07  1.802e-08   9.922  < 2e-16 ***
## gradeB                                5.823e-02  7.151e-04  81.438  < 2e-16 ***
## gradeC                                9.425e-02  8.014e-04 117.599  < 2e-16 ***
## gradeD                                1.307e-01  9.211e-04 141.854  < 2e-16 ***
## gradeE                                1.605e-01  1.253e-03 128.052  < 2e-16 ***
## gradeF                                1.881e-01  1.998e-03  94.156  < 2e-16 ***
## gradeG                                2.073e-01  3.733e-03  55.517  < 2e-16 ***
## term60                                8.458e-04  1.316e-04   6.428 1.31e-10 ***
## dti                                   3.628e-05  7.218e-06   5.026 5.03e-07 ***
## Price                                 4.686e-03  1.810e-04  25.897  < 2e-16 ***
## CPALTT01USQ657N_NBD19600401          -3.294e-05  8.684e-07 -37.935  < 2e-16 ***
## `loan_amnt:gradeB`                   -8.482e-08  2.176e-08  -3.897 9.74e-05 ***
## `loan_amnt:gradeC`                   -2.087e-07  2.338e-08  -8.926  < 2e-16 ***
## `loan_amnt:gradeD`                   -9.708e-09  2.490e-08  -0.390   0.6966    
## `loan_amnt:gradeE`                    1.818e-08  2.708e-08   0.671   0.5021    
## `loan_amnt:gradeF`                    6.319e-09  3.744e-08   0.169   0.8660    
## `loan_amnt:gradeG`                   -1.540e-07  6.578e-08  -2.342   0.0192 *  
## `gradeB:Price`                       -8.189e-03  2.423e-04 -33.794  < 2e-16 ***
## `gradeC:Price`                       -1.200e-02  2.668e-04 -44.980  < 2e-16 ***
## `gradeD:Price`                       -1.837e-02  3.103e-04 -59.204  < 2e-16 ***
## `gradeE:Price`                       -2.315e-02  4.237e-04 -54.637  < 2e-16 ***
## `gradeF:Price`                       -2.646e-02  6.832e-04 -38.735  < 2e-16 ***
## `gradeG:Price`                       -2.564e-02  1.223e-03 -20.958  < 2e-16 ***
## `gradeB:CPALTT01USQ657N_NBD19600401`  2.277e-05  1.158e-06  19.662  < 2e-16 ***
## `gradeC:CPALTT01USQ657N_NBD19600401`  3.827e-05  1.232e-06  31.071  < 2e-16 ***
## `gradeD:CPALTT01USQ657N_NBD19600401`  5.315e-05  1.437e-06  36.997  < 2e-16 ***
## `gradeE:CPALTT01USQ657N_NBD19600401`  6.583e-05  1.870e-06  35.193  < 2e-16 ***
## `gradeF:CPALTT01USQ657N_NBD19600401`  7.705e-05  3.175e-06  24.266  < 2e-16 ***
## `gradeG:CPALTT01USQ657N_NBD19600401`  6.397e-05  5.445e-06  11.750  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.009273 on 37838 degrees of freedom
## Multiple R-squared:  0.9381, Adjusted R-squared:  0.9381 
## F-statistic: 1.978e+04 on 29 and 37838 DF,  p-value: < 2.2e-16

Analysis on additional data

We would think that the additional data on inflation will make a difference because by adding both the bond yield and the inflation rate as variables to the model, it will give us a more accurate view on the real interest rate at the time.
However, when running the model, we can see that the adjusted r-squared only increases by more than 0.4% to 93.81%. The model also has a lower RMSE, making the predictions more accurate.

Team members

Alex Kubbinga
Clara Moreno Sanchez
Jean Huang
Raghav Mehta
Raina Doshi
Yuan Gao