Introduction

Financial markets are notoriously hard to navigate. Investors such as Warren Buffett, Ray Dalio, and Charlie Munger, the very best, have made billions of dollars. The others, 95% of the population, lose money instead.

So how do the best of the best pick their stocks? Do they rely on fundamental, technical, or sentiment analysis? In this paper, we offer another perspective: using machine learning models to predict the profitability of a stock.

Paper Outline

First, we list and define each variable in our dataset. There are 25 variables in total, as listed below, and the variable we predict is PROFIT. Next, we process the data by merging it with the macroeconomic data and removing unwanted variables. A data visualization section then shows the distributions of, and relationships between, some of the variables. To find the best model, we explore three machine learning models: LASSO, Random Forest, and a stacking model, which we build by combining three base models (Lasso, Random Forest, and KNN). Lastly, we make a stock return prediction for 2021 using a trained model chosen by comparing R-squared (higher is better) and RMSE (lower is better).

List of Variables

Our dataset includes financial information on companies in the S&P 500 stock index from 1999 to 2021. The information was scraped from Yahoo Finance in November 2021 and collected in CSV format for analysis. It includes metrics such as sales, earnings, COGS, stock price, and market sector, as well as macroeconomic data such as GDP and money supply. The goal is to analyze and model this data to improve projections of a company’s future profitability. The variables in the dataset are described below:

Variable Meaning
YEAR The financial year of the company
COMPANY The company’s stock abbreviation symbol
MARKET.CAP The total market capitalization of the company (Volume * Price)
EARNINGS The earnings in dollars for the previous year for the given company
SALES How much the company sold in dollars last year
CASH How much cash the company has in dollars at the end of the previous year
Name The full name of the company
Sector The name of the sector that the company is a part of
PRICE The price of the stock when it is bought
Sell The price of the stock when it is sold
VOLUME The total number of shares that the company has at that moment
COGS The total amount the company paid as a cost directly related to the sale of products
INVESTMENT The total assets acquired with the goal of generating income or appreciation
RECIEVABLE The debts owed to a company by its customers for goods that have been delivered or used but not yet paid for
INVENTORY The raw materials used in production as well as the finished goods available for sale
DEBTS How much money the company has borrowed from other parties
CPALTT01USM657N_PC1 The percentage change in CPI (measure of inflation)
GDP The monetary value of all finished goods and services made within a country
GDP_PC1 The percentage change in GDP
T10Y2Y The 10-year Treasury yield minus the 2-year Treasury yield
M1SL The total currency and other liquid instruments in a country’s economy
M1SL_PC1 The percentage change in money supply
PROFIT How much money was made or lost on the investment

Loading data

Data Visualization

We first explored the distributions of our predictors and outcome. As the following graph suggests, all the numeric variables are severely right-skewed, with some outliers. Although the number of firms per sector is unbalanced, we have a decent amount of data for each sector.

Then we explored the relationship between the regressors and the outcome. We did not observe a strong correlation between stock return and fundamental factors such as investments and debts. This may indicate that a linear model is not ideal in this case.

With this animation, we can see that stock returns follow a cycle, which may be influenced by macroeconomic conditions. Returns in 1999 and 2019 seem to be the highest across all industries. We also noticed that returns in some industries, such as IT and health care, vary a lot from year to year.

We further explored the relationship between industry and potential macroeconomic influences using the graphs below. Although the relationships do not seem to be linear, we do see that macroeconomic conditions affect each industry differently, and we account for that in our model.

Stock Return Prediction Using Lasso

A Brief Overview on LASSO

LASSO (Least Absolute Shrinkage and Selection Operator) is an algorithm that builds on least squares regression by penalizing less informative predictors. For OLS, the goal is to minimize the residual sum of squares (footnote 1), whereas LASSO minimizes the RSS plus a penalty proportional to the sum of the absolute values of the coefficients (footnote 2). As a result, LASSO shrinks the coefficients of less important variables toward zero, and often exactly to zero.
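To make the shrinkage concrete, here is a minimal Python sketch of LASSO fitted by cyclic coordinate descent with soft-thresholding. It is not the code used for this report (which was produced in R), the toy data are invented for illustration, and the objective is scaled as (1/(2n)) * RSS + lambda * sum(|beta_j|), a common convention that only rescales lambda relative to the penalized RSS in the footnote.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; anything within [-lam, lam] becomes exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/(2n)) * RSS + lam * sum(|beta_j|) by cyclic coordinate descent.

    Assumes the columns of X and the vector y are centred, so no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # covariance of predictor j with the partial residual
            z = 0.0    # sum of squares of predictor j
            for i in range(n):
                fit_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (z / n)
    return beta


# Hypothetical toy data: the first predictor drives y, the second is almost unrelated.
X = [[-2.0, 1.0], [-1.0, -2.0], [0.0, 2.0], [1.0, -2.0], [2.0, 1.0]]
y = [-6.1, -2.8, 0.2, 2.9, 5.8]

unpenalised = lasso_coordinate_descent(X, y, lam=0.0)  # ordinary least squares fit
penalised = lasso_coordinate_descent(X, y, lam=0.5)    # LASSO fit
print(unpenalised, penalised)
```

With the penalty switched on, the coefficient of the informative predictor shrinks (from about 2.95 to about 2.7 on this toy data) and the coefficient of the uninformative predictor is set exactly to zero, which is the variable selection behaviour described above.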

Results from Lasso

We used a Lasso model to predict the return. To account for the fact that macroeconomic conditions affect each industry differently, we also included interaction terms between GDP and the sector dummies. The table below shows the estimate for each predictor in the Lasso model. Macroeconomic indicators are relatively important for predicting stock return: the changes in inflation and GDP both keep sizable coefficients after shrinkage. The interaction terms between GDP and the sectors also matter. For example, GDP growth seems to affect stock return less in the Communication Services, Energy, and Consumer Discretionary sectors, given their negative coefficient estimates. However, the Lasso model does not perform well. It explains only 20% of the variance in stock return, and its predictions are typically off by 46.
Coefficient Estimates (LASSO)
term estimate penalty
M1SL 52.2489802 0.0059948
CPALTT01USM657N_PC1 22.8362276 0.0059948
GDP_PC1 18.5139718 0.0059948
(Intercept) 17.4206737 0.0059948
GDP_PC1_x_Sector_Utilities 14.5169194 0.0059948
GDP_PC1_x_Sector_Consumer Staples 10.9623359 0.0059948
GDP_PC1_x_Sector_Health Care 10.0994574 0.0059948
Sector_Information Technology 9.1062940 0.0059948
T10Y2Y 8.3785234 0.0059948
GDP_PC1_x_Sector_Real Estate 7.9131832 0.0059948
Sector_Communication Services 7.4629557 0.0059948
Sector_Health Care 5.9942110 0.0059948
Sector_Consumer Discretionary 4.8412271 0.0059948
GDP_PC1_x_Sector_Information Technology 4.0219608 0.0059948
Sector_Energy 2.7587880 0.0059948
COGS 2.2012752 0.0059948
GDP_PC1_x_Sector_Industrials 2.0052201 0.0059948
VOLUME 0.9051955 0.0059948
INVESTMENTS 0.4845378 0.0059948
CASH 0.4031170 0.0059948
EARNINGS 0.3671604 0.0059948
RECEIVABLE 0.1688682 0.0059948
INVENTORY 0.1677270 0.0059948
GDP_PC1_x_Sector_Financials 0.0413482 0.0059948
Sector_Materials 0.0000000 0.0059948
PE -0.2304146 0.0059948
DEBTS -0.2444685 0.0059948
Sector_Industrials -0.4250661 0.0059948
GDP_PC1_x_Sector_Materials -1.0239217 0.0059948
GDP_PC1_x_Sector_Communication Services -1.4192763 0.0059948
Sector_Financials -1.5534082 0.0059948
SALES -3.2886701 0.0059948
Sector_Real Estate -3.3504621 0.0059948
MARKET CAP -3.8678288 0.0059948
Sector_Consumer Staples -5.4154183 0.0059948
GDP_PC1_x_Sector_Energy -5.8821361 0.0059948
GDP_PC1_x_Sector_Consumer Discretionary -8.6297519 0.0059948
Sector_Utilities -8.7405789 0.0059948
M1SL_PC1 -11.4973966 0.0059948
GDP -16.0930330 0.0059948
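
The sector dummies and GDP-by-sector interaction terms in the table can be constructed along the following lines. This is a Python sketch rather than the report's own preprocessing; the function name and the example row are hypothetical, and only the column naming pattern (`Sector_...`, `GDP_PC1_x_Sector_...`) is taken from the table.

```python
def add_gdp_sector_interactions(row, sectors):
    """Build sector dummies and GDP_PC1 x sector interaction columns for one row."""
    features = {"GDP_PC1": row["GDP_PC1"]}
    for sector in sectors:
        dummy = 1.0 if row["Sector"] == sector else 0.0
        features[f"Sector_{sector}"] = dummy
        # The interaction equals GDP_PC1 for the row's own sector and 0 otherwise,
        # so each sector gets its own slope on GDP growth.
        features[f"GDP_PC1_x_Sector_{sector}"] = row["GDP_PC1"] * dummy
    return features


# Hypothetical example row; the GDP_PC1 value is made up for illustration.
sectors = ["Utilities", "Energy", "Health Care"]
row = {"Sector": "Utilities", "GDP_PC1": 2.5}
features = add_gdp_sector_interactions(row, sectors)
print(features)
```

Because each interaction column is nonzero only for firms in that sector, its coefficient acts as a sector-specific adjustment to the overall GDP_PC1 coefficient.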

The graph below shows the most important predictors for our model. The macroeconomic indicators all rank at the top.

Here is the performance of the model on the testing data, that is, the data we did not use to train the model. If the points and the blue line were perfectly aligned with the red line, the prediction would be perfect. We can see that our LASSO model does not perform well and tends to underestimate stocks with high returns, which are mostly from 2020 or earlier years.

Stock Return Prediction Using Random Forest

A Brief Overview on Random Forest

A random forest is a supervised machine learning algorithm built from decision trees. It can be used for both regression and classification problems. It relies on ensemble learning, a technique that combines many models to solve complex problems: the forest consists of many decision trees, and for regression it predicts by averaging the outputs of those trees.

We train the random forest model using the ranger function with 200 trees and 6 variables randomly chosen at each split to predict stock return.

Results from Random Forest

Below are the OOB prediction error (MSE), the OOB root mean squared error (RMSE), and the R-squared score of the random forest model trained above.

The OOB error (MSE) is the mean squared difference between each observation's actual value and its out-of-bag prediction, that is, the average prediction from only those trees whose bootstrap sample did not contain that observation.
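
The out-of-bag idea can be sketched in a few lines of Python. This is an illustration only, not the ranger implementation: one-split regression stumps on a single feature stand in for full trees, there is no random choice of variables at each split, and the step-shaped toy data are invented.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression stump on one feature by minimising squared error."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2  # candidate split halfway between neighbours
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # only one distinct x in the sample: predict the mean everywhere
        mean = sum(ys) / len(ys)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def oob_mse(xs, ys, n_trees=200, seed=0):
    """Average each observation's predictions over the trees that never saw it."""
    rng = random.Random(seed)
    n = len(xs)
    oob_sum = [0.0] * n
    oob_count = [0] * n
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of row indices
        stump = fit_stump([xs[i] for i in sample], [ys[i] for i in sample])
        in_bag = set(sample)
        for i in range(n):
            if i not in in_bag:  # row i is out-of-bag for this tree
                oob_sum[i] += predict_stump(stump, xs[i])
                oob_count[i] += 1
    errors = [(ys[i] - oob_sum[i] / oob_count[i]) ** 2
              for i in range(n) if oob_count[i] > 0]
    return sum(errors) / len(errors)


# Hypothetical toy data: a step function in one feature.
xs = [float(i) for i in range(20)]
ys = [0.0] * 10 + [10.0] * 10
mse = oob_mse(xs, ys)
print(mse)
```

Because every prediction used in the error comes from trees that did not train on that row, the OOB error behaves like a built-in validation estimate and needs no separate holdout set.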

## [1] 1545.602

The OOB RMSE is the square root of the OOB prediction error above. The RMSE is still fairly high, but it is noticeably lower than that of the LASSO model, so the Random Forest model performs better than the Lasso model.

## [1] 39.31415

The R-squared score is a statistical measure of fit that indicates how much of the variation in a dependent variable is explained by the independent variables in a regression model. For our model, the R-squared score is around 0.41, meaning that about 41% of the variance in stock return is explained by our independent variables.

## [1] 0.4084518
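
As a quick check of how these three numbers relate, the Python sketch below recomputes the RMSE from the MSE printed above and shows the usual R-squared formula. The `actual` and `predicted` lists are hypothetical values used only to demonstrate the formula; they are not taken from the dataset.

```python
import math


def r_squared(actual, predicted):
    """R-squared = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# RMSE is the square root of the MSE printed in the report.
reported_oob_mse = 1545.602
rmse = math.sqrt(reported_oob_mse)
print(round(rmse, 3))

# Hypothetical returns, for illustrating the R-squared formula only.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
r2 = r_squared(actual, predicted)
print(round(r2, 3))
```

The recomputed RMSE agrees with the printed 39.31415 up to rounding of the displayed MSE.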

Model Evaluation

Training Data

We use k-fold cross-validation to evaluate our model. Below is the root mean squared error (RMSE) averaged over the five folds of the training dataset.
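
For readers unfamiliar with the procedure, here is a minimal Python sketch of five-fold cross-validation that averages the per-fold RMSE. It is not the report's R code: the helper names are hypothetical, the data are invented, and a mean-only predictor stands in for the random forest to keep the example short.

```python
import math
import random


def k_fold_indices(n, k=5, seed=0):
    """Shuffle the row indices 0..n-1 and deal them into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_rmse(xs, ys, fit, predict, k=5):
    """Hold out each fold in turn, train on the rest, and average the fold RMSEs."""
    fold_rmses = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        squared_errors = [(ys[i] - predict(model, xs[i])) ** 2 for i in held_out]
        fold_rmses.append(math.sqrt(sum(squared_errors) / len(squared_errors)))
    return sum(fold_rmses) / k


def fit_mean(xs, ys):
    """Stand-in model: remember the mean of the training outcomes."""
    return sum(ys) / len(ys)


def predict_mean(model, x):
    return model


# Hypothetical toy data.
xs = [float(i) for i in range(10)]
ys = [2.0 * x for x in xs]
cv_rmse = cross_validated_rmse(xs, ys, fit_mean, predict_mean, k=5)
print(cv_rmse)
```

Each observation is held out exactly once, so every prediction that enters the averaged RMSE comes from a model that did not train on that observation.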

This is the graph showing the actual return vs. predicted return.

Testing Data

The table below shows the RMSE on the testing data. It is slightly higher than, but fairly similar to, the error on the training data, which suggests that our model did not overfit.

Below is the graph showing the actual return vs. predicted return on testing data.

Interpretable Machine Learning

Based on the box plot and the histogram below, the residuals mostly lie between -50 and 50, but there are also some outliers that reach up to 400.

According to the feature importance bar chart below, the four most important features are market cap, earnings, M1SL_PC1, and GDP_PC1.

Stacking Model

A Brief Overview on Stacking Model

What is stacking? Model stacking is an ensembling method that takes the outputs of many models and combines them to generate a new model (referred to as an ensemble in the stacks package) that generates predictions informed by each of its members.

Process of Stacking Model

Building a stacking model involves three steps:

  1. Set up the ensemble:
  • Specify a list of base learners (with a specific set of model parameters).
  • Specify a meta learning algorithm.

  2. Train the ensemble:
  • Train each of the base learners on the training set.
  • Perform k-fold CV on each of the base learners and collect the cross-validated predictions from each (the same k folds must be used for each base learner).
  • Train the meta learner on these cross-validated predictions. The ensemble then consists of the base learning models and the meta learning model, which can be used to generate predictions on new data.

  3. Predict on new data:
  • To generate ensemble predictions, first generate predictions from the base learners.
  • Feed those predictions into the meta learner to generate the ensemble prediction.
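
The three steps can be sketched end to end in Python. This is a simplified illustration, not the report's R pipeline: the base learners are a mean predictor and a one-feature nearest-neighbour average, plain least squares without an intercept stands in for the meta learner (the report does not state which meta learner was used), and the data and fold layout are invented.

```python
def fit_mean(xs, ys):
    return sum(ys) / len(ys)


def predict_mean(model, x):
    return model


def fit_knn(xs, ys):
    return list(zip(xs, ys))  # nearest neighbours just store the training data


def predict_knn(model, x, k=3):
    nearest = sorted(model, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Step 1: set up the ensemble (base learners; the meta learner is defined below).
BASE_LEARNERS = [(fit_mean, predict_mean), (fit_knn, predict_knn)]


def out_of_fold_predictions(xs, ys, folds):
    """Step 2: collect each base learner's cross-validated predictions.

    The same folds are used for every base learner, and each row is predicted
    by models that never saw it.
    """
    n = len(xs)
    level_one = [[0.0] * len(BASE_LEARNERS) for _ in range(n)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        for j, (fit, predict) in enumerate(BASE_LEARNERS):
            model = fit([xs[i] for i in train], [ys[i] for i in train])
            for i in held_out:
                level_one[i][j] = predict(model, xs[i])
    return level_one


def fit_meta_least_squares(level_one, ys):
    """Step 2 (continued): solve the 2x2 normal equations for y ~ w0*p0 + w1*p1."""
    a = sum(r[0] * r[0] for r in level_one)
    b = sum(r[0] * r[1] for r in level_one)
    d = sum(r[1] * r[1] for r in level_one)
    e = sum(r[0] * y for r, y in zip(level_one, ys))
    f = sum(r[1] * y for r, y in zip(level_one, ys))
    det = a * d - b * b
    return [(e * d - b * f) / det, (a * f - b * e) / det]


def stack_predict(x, fitted_models, weights):
    """Step 3: feed the base learners' predictions into the meta learner."""
    preds = [predict(model, x)
             for (_, predict), model in zip(BASE_LEARNERS, fitted_models)]
    return sum(w * p for w, p in zip(weights, preds))


# Hypothetical toy data and a fixed 4-fold layout.
xs = [float(i) for i in range(12)]
ys = [x * x for x in xs]
folds = [list(range(start, 12, 4)) for start in range(4)]

level_one = out_of_fold_predictions(xs, ys, folds)
weights = fit_meta_least_squares(level_one, ys)
fitted_models = [fit(xs, ys) for fit, _ in BASE_LEARNERS]  # refit on all training data
prediction = stack_predict(5.5, fitted_models, weights)
print(weights, prediction)
```

Training the meta learner on out-of-fold predictions rather than in-sample ones keeps it from simply rewarding whichever base learner memorises the training data best.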

## # A tibble: 2 × 3
##   member        type             weight
##   <chr>         <chr>             <dbl>
## 1 ranger_cv_1_1 rand_forest       0.836
## 2 knn_tune_1_4  nearest_neighbor  0.253

Here, we can see that the RMSE is 39.8, meaning that a predicted profit typically differs from the actual profit by about 39.8.

Testing Result from Stacking Model

Here, we can see that the model performs very well when it predicts a high rate of return. Based on the table, every company among the highest predictions yields a positive and extremely high rate of return.

After finishing our model, we move on to compare the three models:

Comparison of the Three Models

With the three models in hand, we check which one performs best:

Lasso model

Random Forest Model

Stacking Model

Comparing the three models, we see that Random Forest performs better than the stacking model. However, the stacking model draws on the Random Forest along with additional information from KNN and Lasso. For that reason, we choose the stacking model as our final model.

Stock Return Prediction for 2021

After picking our model, we use it to predict the potential profit for 2021:

Name .pred actual_ytd
APA Corporation 66.15775 80.30
Under Armour (Class C) 62.22877 39.77
Norwegian Cruise Line Holdings 62.12468 -8.26
DuPont 62.07898 11.29
News Corp (Class B) 62.00012 22.08
Ford 61.51675 132.51
Marathon Oil 60.92993 137.04
Carnival Corporation 60.67833 -4.22
Under Armour (Class A) 60.57563 39.77
DXC Technology 60.07332 20.76
Raytheon Technologies 58.76800 26.41
Halliburton 58.24086 26.34
Schlumberger 56.55640 41.50
Devon Energy 56.13831 178.43
Baker Hughes 55.92235 18.70
ConocoPhillips 55.85569 87.29
Bristol Myers Squibb 55.73066 -6.16
Hess Corporation 55.56343 52.58
United Airlines 54.59626 8.70
The Walt Disney Company 54.55654 -13.70

As the stock return predictions show, our final model performs very well. Of the top 20 companies, only four show a loss year-to-date. Moreover, if we invested the same amount of money in each of the top 20 companies, we would have a 47.35% rate of return year-to-date. This is much higher than the S&P 500’s YTD return of 29.19%.
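
The equal-weight calculation can be reproduced from the actual_ytd column of the table with a short Python sketch. Investing the same amount in each stock means the portfolio return is the simple average of the individual returns. Note that the rounded values shown in the table average to roughly 44.6% rather than 47.35%; the gap presumably comes from a different data snapshot or from rounding, which is an assumption on our part. Either figure is well above the S&P 500's 29.19%.

```python
# actual_ytd values (in percent) copied from the table, in the same order.
actual_ytd = [
    80.30, 39.77, -8.26, 11.29, 22.08, 132.51, 137.04, -4.22, 39.77, 20.76,
    26.41, 26.34, 41.50, 178.43, 18.70, 87.29, -6.16, 52.58, 8.70, -13.70,
]
sp500_ytd = 29.19  # S&P 500 year-to-date return quoted in the text

# Number of top-20 picks that lost money year-to-date.
n_losses = sum(1 for r in actual_ytd if r < 0)

# Equal dollar amounts in each stock: portfolio return is the simple average.
equal_weight_return = sum(actual_ytd) / len(actual_ytd)
excess_over_index = equal_weight_return - sp500_ytd

print(n_losses, round(equal_weight_return, 2), round(excess_over_index, 2))
```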

Conclusion

The weakness of our model is that its average prediction error is still very high, so it is not ideal for predicting the exact return. Its strength is that for stocks with extremely high returns, even though the prediction might not be precise, the model is highly likely to predict a positive return. In practice, this means that if we choose the top stocks to invest in based on our predictions, we are less likely to lose money. Another strength of using a model to guide investment is that it excludes our subjective feelings.

To improve the model, we could do more research and add more regressors. For example, some financial indicators that are important for value investing are not reflected in our model due to a lack of data, including Price to Sales, Price to Cash Flow, and Price to Book. Also, since we are concerned with long-term return and fundamental factors usually take longer to affect firms, it would probably help to predict returns over 2 or 3 years, or to include lags of some variables in the model.


  1. \(RSS_{OLS} = \sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}\)

  2. Penalized RSS = \(\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2} + \lambda \sum_{j=1}^{p}|\hat{\beta}_{j}|\), where \(\lambda\) is the penalty parameter and \(\sum_{j=1}^{p}|\hat{\beta}_{j}|\) is the sum of the absolute values of the non-intercept coefficients