The financial market is a strange place that is hard to navigate. A handful of investors, such as Warren Buffett, Ray Dalio, and Charlie Munger, have made billions of dollars from it. The rest, about 95% of the population, lose money instead.
So how do the best of the best pick their stocks? Do they rely on fundamental analysis, technical analysis, or market sentiment? In this paper, we hope to bring another perspective by using machine learning models to predict the profitability of a stock.
First, we list and define each variable in our dataset. There are 25 variables in total, as listed below, and the variable we are going to predict is PROFIT. Next, we process the data by merging it with the macroeconomic data and removing some unwanted variables. After processing, a data visualization section shows the distributions of some variables and the relationships between them. To find the best model, we explore three machine learning models: LASSO, Random Forest, and a stacking model. The stacking model is built by combining three models (LASSO, Random Forest, and KNN). Lastly, we make a stock return prediction for 2021 using the model we select after comparing R-squared and RMSE.
For the dataset, we include financial information on companies in the S&P 500 stock index from 1999 to 2021. The information was scraped from Yahoo Finance in November 2021 and collected in CSV format for analysis. It includes metrics such as sales, earnings, COGS, stock price, and market sector, as well as macroeconomic data such as GDP and money supply. The goal is to analyze and model this data to improve projections of a company’s future profitability. The variables in the data set are described below:
| Variable | Meaning |
|---|---|
| YEAR | The financial year of the company |
| COMPANY | The company’s stock abbreviation symbol |
| MARKET.CAP | The total market capitalization of the company (Volume * Price) |
| EARNINGS | The earnings in dollars for the previous year for the given company |
| SALES | How much the company sold in dollars last year |
| CASH | How much cash the company has in dollars at the end of the previous year |
| Name | The full name of the company |
| Sector | The name of the sector that the company is a part of |
| PRICE | The price of the stock when it is bought |
| Sell | The price of the stock when it is sold |
| VOLUME | The total number of shares that the company has at that moment |
| COGS | The total amount the company paid as a cost directly related to the sale of products |
| INVESTMENT | The total assets acquired with the goal of generating income or appreciation |
| RECIEVABLE | The debts owed to a company by its customers for goods that have been delivered or used but not yet paid for |
| INVENTORY | The value of raw materials used in production plus the finished goods available for sale |
| DEBTS | How much money the company has borrowed from other parties |
| CPALTT01USM657N_PC1 | The percentage change in CPI (measure of inflation) |
| GDP | The monetary value of all finished goods and services made within a country |
| GDP_PC1 | The percentage change in GDP |
| T10Y2Y | The 10-year Treasury yield minus the 2-year Treasury yield |
| M1SL | The total currency and other liquid instruments in a country’s economy |
| M1SL_PC1 | The percentage change in money supply |
| PROFIT | How much money was made or lost on the investment |
We first explored the distributions of our predictors and outcome. As the following graph shows, all the numeric variables are severely right-skewed, with some outliers. Although the number of firms per sector is unbalanced, we have a decent amount of data for each sector.
Then, we explored the relationship between the regressors and the outcome. We do not observe a strong correlation between fundamental factors, such as investments and debts, and stock return. This may indicate that a linear model is not ideal in this case.
With this animation, we can see that stock returns follow a cycle, which may be influenced by macroeconomic conditions. Returns in 1999 and 2019 appear to be the highest across all industries. We also noticed that some industries, such as IT and health care, vary a lot from year to year.
We further explored the relationship between each industry and the potential macroeconomic influences using the graphs below. Though the relationship does not seem to be linear, we do see that macroeconomic conditions affect each industry differently, and we account for that in our model.
LASSO (Least Absolute Shrinkage and Selection Operator) is an algorithm that builds on least squares regression by penalizing less informative predictors. OLS minimizes the residual sum of squares (RSS), whereas LASSO minimizes the RSS plus a penalty proportional to the sum of the absolute values of the coefficients. As a result, LASSO shrinks the coefficients of less important variables toward zero, and some become exactly zero.
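For reference, the two objectives can be written out as follows. This is the standard textbook form, not a formula taken from our code; the symbols $y_i$ (return of observation $i$), $x_{ij}$ (predictor $j$), $\beta_j$ (coefficients), and $\lambda$ (penalty strength) are notation introduced here, and software implementations may scale the RSS term differently.

$$
\hat{\beta}^{\text{OLS}} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2
$$

$$
\hat{\beta}^{\text{LASSO}} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
$$

With $\lambda = 0$ the two coincide; as $\lambda$ grows, more coefficients are driven to zero.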
| term | estimate | penalty |
|---|---|---|
| M1SL | 52.2489802 | 0.0059948 |
| CPALTT01USM657N_PC1 | 22.8362276 | 0.0059948 |
| GDP_PC1 | 18.5139718 | 0.0059948 |
| (Intercept) | 17.4206737 | 0.0059948 |
| GDP_PC1_x_Sector_Utilities | 14.5169194 | 0.0059948 |
| GDP_PC1_x_Sector_Consumer Staples | 10.9623359 | 0.0059948 |
| GDP_PC1_x_Sector_Health Care | 10.0994574 | 0.0059948 |
| Sector_Information Technology | 9.1062940 | 0.0059948 |
| T10Y2Y | 8.3785234 | 0.0059948 |
| GDP_PC1_x_Sector_Real Estate | 7.9131832 | 0.0059948 |
| Sector_Communication Services | 7.4629557 | 0.0059948 |
| Sector_Health Care | 5.9942110 | 0.0059948 |
| Sector_Consumer Discretionary | 4.8412271 | 0.0059948 |
| GDP_PC1_x_Sector_Information Technology | 4.0219608 | 0.0059948 |
| Sector_Energy | 2.7587880 | 0.0059948 |
| COGS | 2.2012752 | 0.0059948 |
| GDP_PC1_x_Sector_Industrials | 2.0052201 | 0.0059948 |
| VOLUME | 0.9051955 | 0.0059948 |
| INVESTMENTS | 0.4845378 | 0.0059948 |
| CASH | 0.4031170 | 0.0059948 |
| EARNINGS | 0.3671604 | 0.0059948 |
| RECEIVABLE | 0.1688682 | 0.0059948 |
| INVENTORY | 0.1677270 | 0.0059948 |
| GDP_PC1_x_Sector_Financials | 0.0413482 | 0.0059948 |
| Sector_Materials | 0.0000000 | 0.0059948 |
| PE | -0.2304146 | 0.0059948 |
| DEBTS | -0.2444685 | 0.0059948 |
| Sector_Industrials | -0.4250661 | 0.0059948 |
| GDP_PC1_x_Sector_Materials | -1.0239217 | 0.0059948 |
| GDP_PC1_x_Sector_Communication Services | -1.4192763 | 0.0059948 |
| Sector_Financials | -1.5534082 | 0.0059948 |
| SALES | -3.2886701 | 0.0059948 |
| Sector_Real Estate | -3.3504621 | 0.0059948 |
| MARKET CAP | -3.8678288 | 0.0059948 |
| Sector_Consumer Staples | -5.4154183 | 0.0059948 |
| GDP_PC1_x_Sector_Energy | -5.8821361 | 0.0059948 |
| GDP_PC1_x_Sector_Consumer Discretionary | -8.6297519 | 0.0059948 |
| Sector_Utilities | -8.7405789 | 0.0059948 |
| M1SL_PC1 | -11.4973966 | 0.0059948 |
| GDP | -16.0930330 | 0.0059948 |
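The exact zero for Sector_Materials in the table above is typical of how an L1 penalty behaves. A minimal sketch of the mechanism is the soft-thresholding rule, which is the LASSO solution for a single standardized predictor and the update step used in coordinate-descent solvers. The coefficient names and values below are hypothetical, and the sketch is in Python for illustration only; it is not the code used to fit our model.

```python
def soft_threshold(z: float, lam: float) -> float:
    """Shrink z toward zero by lam; anything within [-lam, lam] becomes exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical unpenalized coefficients, chosen only for illustration.
unpenalized = {"strong": 5.0, "moderate": -1.2, "weak": 0.3}
lam = 0.5  # hypothetical penalty strength

shrunk = {name: soft_threshold(value, lam) for name, value in unpenalized.items()}
print(shrunk)  # the "weak" coefficient is set to exactly 0.0
```

Large coefficients are only reduced by the penalty, while small ones are removed entirely, which is why LASSO also acts as a variable selection method.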
The graph below shows the most important predictors in our model. The macroeconomic indicators all rank at the top.
Here is the performance of the model on the testing data, which we did not use to train the model. If the points and the blue line aligned perfectly with the red line, the prediction would be perfect. We can see that our LASSO model does not perform well and tends to underestimate stocks with high returns, which are mostly from 2020 or earlier years.
A random forest is a supervised machine learning algorithm built from decision trees. It can be used for both regression and classification problems. It relies on ensemble learning, a technique that combines many models to solve complex problems. The algorithm trains many decision trees and, for regression, averages their outputs to make a prediction.
We train the random forest model with the ranger function, using 200 trees and 6 variables randomly chosen at each split, to predict stock return.
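To make the "many trees, averaged" idea concrete, here is a toy sketch in pure Python. It is a simplified stand-in, not the ranger algorithm: each "tree" is a one-split stump, each stump sees one randomly chosen feature (rather than 6 candidates per split), and the data are made up. All function names are hypothetical.

```python
import random
from statistics import mean


def fit_stump(rows, targets, feature):
    """Fit a one-split regression tree on a single feature by minimising squared error."""
    best = None
    values = sorted(set(row[feature] for row in rows))
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [t for row, t in zip(rows, targets) if row[feature] <= threshold]
        right = [t for row, t in zip(rows, targets) if row[feature] > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((t - left_mean) ** 2 for t in left) + sum((t - right_mean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # the feature is constant in this sample: always predict the mean
        overall = mean(targets)
        return (feature, float("inf"), overall, overall)
    return (feature, best[1], best[2], best[3])


def fit_forest(rows, targets, n_trees=200, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and one random feature."""
    rng = random.Random(seed)
    n_rows, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(n_rows) for _ in range(n_rows)]  # draw rows with replacement
        feature = rng.randrange(n_features)
        forest.append(fit_stump([rows[i] for i in sample], [targets[i] for i in sample], feature))
    return forest


def predict(forest, row):
    """Average the predictions of all stumps."""
    return mean(left if row[f] <= thr else right for f, thr, left, right in forest)


# Made-up data: the target jumps from 10 to 50 when the first feature reaches 10.
rows = [[x, x % 3] for x in range(20)]
targets = [10.0 if x < 10 else 50.0 for x in range(20)]
forest = fit_forest(rows, targets)
print(predict(forest, [2, 2]), predict(forest, [17, 2]))
```

Each stump alone is a weak predictor, but averaging many of them trained on different bootstrap samples gives a more stable prediction.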
Below are the OOB prediction error (MSE), the OOB root mean squared error (RMSE), and the R-squared score of the random forest model trained above.
The OOB error (MSE) is the mean squared difference between the actual and predicted values, where each observation is predicted using only the trees whose bootstrap sample did not contain it.
## [1] 1545.602
The OOB RMSE is the square root of the OOB prediction error above. The RMSE is still fairly high, but the Random Forest model performs noticeably better than the LASSO model.
## [1] 39.31415
The R-squared score is a statistical measure of fit that indicates how much of the variation in the dependent variable is explained by the independent variables in a regression model. For our model, the R-squared score is around 0.41, meaning that about 41% of the variation in stock return is explained by our independent variables.
## [1] 0.4084518
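As a quick check on how these three numbers relate, the sketch below recomputes the RMSE from the reported MSE and shows the usual R-squared formula on a few made-up values. The function names are hypothetical, and the sketch is in Python for illustration; it is not the code that produced the output above.

```python
import math


def rmse_from_mse(mse: float) -> float:
    """RMSE is simply the square root of the MSE."""
    return math.sqrt(mse)


def r_squared(actual, predicted) -> float:
    """1 minus the ratio of residual variation to total variation."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


print(round(rmse_from_mse(1545.602), 3))  # 39.314, matching the reported OOB RMSE

# Made-up returns and predictions, only to show the R-squared calculation.
actual = [10.0, -5.0, 30.0, 12.0]
predicted = [8.0, 0.0, 25.0, 15.0]
print(r_squared(actual, predicted))
```

A model that always predicts the mean return scores 0 on this measure, and a perfect model scores 1.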
We use K-fold cross-validation to evaluate our model. Below is the root mean squared error (RMSE) averaged over the five folds of the training dataset.
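The procedure can be sketched as follows, assuming five folds and, to keep the example self-contained, a mean-only baseline in place of the real model. The function names and the return values are hypothetical, and the sketch is in Python for illustration only.

```python
import math
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Shuffle the row indices and deal them into k folds of near-equal size."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_rmse(targets, k=5, seed=0):
    """Average held-out RMSE of a mean-only baseline across k folds."""
    fold_rmses = []
    for held_out in k_fold_indices(len(targets), k, seed):
        held = set(held_out)
        train = [t for i, t in enumerate(targets) if i not in held]
        baseline = sum(train) / len(train)  # "fit" on the other k-1 folds only
        squared_errors = [(targets[i] - baseline) ** 2 for i in held_out]
        fold_rmses.append(math.sqrt(sum(squared_errors) / len(squared_errors)))
    return sum(fold_rmses) / k


# Made-up returns, only to exercise the functions.
returns = [12.0, -8.0, 35.0, 4.0, 60.0, -20.0, 15.0, 9.0, 41.0, -3.0]
print(cross_validated_rmse(returns))
```

Because every row is held out exactly once, the averaged RMSE estimates out-of-sample error without touching the testing data.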
This is the graph showing the actual return vs. predicted return.
The table below shows the RMSE on the testing data. It is slightly higher than, but fairly similar to, the error on the training data, so the model does not appear to overfit.
Below is the graph showing the actual return vs. predicted return on testing data.
Based on the box plot and the histogram below, the residuals mostly lie between -50 and 50, but there are also some outliers as large as 400.
According to the feature importance bar chart below, the four most important features are market cap, earnings, M1SL_PC1, and GDP_PC1.
What is stacking? Model stacking is an ensembling method that takes the outputs of many models and combines them to generate a new model (referred to as an ensemble in the stacks package) that generates predictions informed by each of its members.
Stacking involves four steps:
1. Specify a list of base learners (each with a specific set of model parameters).
2. Specify a meta-learning algorithm.
3. Generate predictions from the base learners.
4. Feed those predictions into the meta learner to generate the ensemble prediction.
## # A tibble: 2 × 3
## member type weight
## <chr> <chr> <dbl>
## 1 ranger_cv_1_1 rand_forest 0.836
## 2 knn_tune_1_4 nearest_neighbor 0.253
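The last step can be sketched as a weighted sum of the member predictions, using the two members that received nonzero weight in the output above. This sketch assumes a linear meta learner; the intercept is not shown in the output, so the default of 0.0 below is a placeholder assumption, and the example member predictions are hypothetical. It is written in Python for illustration only.

```python
# Member weights copied from the output above.
WEIGHTS = {"ranger_cv_1_1": 0.836, "knn_tune_1_4": 0.253}


def stack_predict(member_predictions, weights=WEIGHTS, intercept=0.0):
    """Blend base-learner predictions with the meta learner's weights.

    intercept=0.0 is a placeholder assumption; the fitted value is not reported above.
    """
    missing = set(weights) - set(member_predictions)
    if missing:
        raise KeyError(f"missing predictions for: {sorted(missing)}")
    return intercept + sum(w * member_predictions[name] for name, w in weights.items())


# Hypothetical predictions from the two members for one stock.
example = {"ranger_cv_1_1": 40.0, "knn_tune_1_4": 30.0}
print(stack_predict(example))
```

The random forest member dominates the blend, which is consistent with the stacking model performing similarly to the Random Forest model on its own.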
Here, we can see that the RMSE is 39.8, which means that a predicted profit typically differs from the actual profit by about 39.8.
Here, we can see that the model performs well when it predicts a high rate of return. Based on the table, every company among the highest predictions yields a positive and very high rate of return.
After fitting the three models, we compare them to see which performs best:
Here, we can see that Random Forest performs better than the stacking model. However, the stacking model combines the Random Forest with additional information from KNN and LASSO. For that reason, we choose the stacking model as our final model.
After picking our model, we use it to predict the potential profit for 2021:
| Name | .pred | actual_ytd |
|---|---|---|
| APA Corporation | 66.15775 | 80.30 |
| Under Armour (Class C) | 62.22877 | 39.77 |
| Norwegian Cruise Line Holdings | 62.12468 | -8.26 |
| DuPont | 62.07898 | 11.29 |
| News Corp (Class B) | 62.00012 | 22.08 |
| Ford | 61.51675 | 132.51 |
| Marathon Oil | 60.92993 | 137.04 |
| Carnival Corporation | 60.67833 | -4.22 |
| Under Armour (Class A) | 60.57563 | 39.77 |
| DXC Technology | 60.07332 | 20.76 |
| Raytheon Technologies | 58.76800 | 26.41 |
| Halliburton | 58.24086 | 26.34 |
| Schlumberger | 56.55640 | 41.50 |
| Devon Energy | 56.13831 | 178.43 |
| Baker Hughes | 55.92235 | 18.70 |
| ConocoPhillips | 55.85569 | 87.29 |
| Bristol Myers Squibb | 55.73066 | -6.16 |
| Hess Corporation | 55.56343 | 52.58 |
| United Airlines | 54.59626 | 8.70 |
| The Walt Disney Company | 54.55654 | -13.70 |
As the stock return predictions show, our final model performs well. Of the top 20 companies, only four show a loss year-to-date. Moreover, if we invested the same amount of money in each of the top 20 companies, we would have a 47.35% rate of return year-to-date, much higher than the S&P 500’s YTD return of 29.19%.
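The equal-investment calculation is just the simple average of the individual returns. The sketch below shows the arithmetic on only the first five rows of the table to keep it short, so its result is not the 20-stock figure; the function names are hypothetical, and the sketch is in Python for illustration only.

```python
# First five rows of the prediction table: (company, actual year-to-date return in %).
top_picks = [
    ("APA Corporation", 80.30),
    ("Under Armour (Class C)", 39.77),
    ("Norwegian Cruise Line Holdings", -8.26),
    ("DuPont", 11.29),
    ("News Corp (Class B)", 22.08),
]


def equal_weight_return(returns):
    """Portfolio return when the same amount is invested in every stock."""
    return sum(returns) / len(returns)


def count_losses(returns):
    """Number of holdings with a negative return."""
    return sum(1 for r in returns if r < 0)


ytd = [ret for _, ret in top_picks]
print(equal_weight_return(ytd), count_losses(ytd))
```

Equal weighting means one large winner can offset several small losers, so the portfolio return should be read alongside the count of losing positions.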
The weakness of our model is that its average prediction error is still very high, so it is not ideal for predicting exact returns. Its strength is that for stocks with extremely high returns, even though the prediction may not be precise, the model is highly likely to predict a positive return. In practice, this means that if we invest in the top stocks by predicted return, we are less likely to lose money. Another strength of using a model to guide investment is that it removes our subjective feelings from the decision.
To improve the model, we could do more research and add more regressors. For example, some financial indicators that are important for value investing, such as Price to Sales, Price to Cash Flow, and Price to Book, are not in our model due to lack of data. Also, since we care about long-term return and fundamental factors usually take longer to affect firms, it would probably help to predict returns over 2 or 3 years or to include lags of some variables in the model.