The idea
Even though it is claimed that we have entered the post-COVID era and the world has returned to normal, the emergence of new variants means the virus threat is a problem we will keep confronting in the next decade. Hence, when it comes to equity market volatility, the virus impact is a factor, or signal, that investors should take into consideration.
In this light, a program is designed to investigate the virus impact on the equity market, with a focus on the correlation between equity daily returns and the virus data. In addition, statistical models are utilized to investigate whether the virus data can explain some of the movement in the daily returns of certain equities, and how much it can explain if an impact does exist. The program then filters out, in real time, the top 5 stocks with significant correlations with the virus data and relatively high R-squared values in multiple linear regressions.
The process
- Data Source
To ensure the data source is updated on a timely basis, the yfinance library from Yahoo! Finance is used as the source for stock prices. Approximately 10,142 US stock symbols from the S&P 500, NASDAQ, Dow Jones and other exchanges are selected into the stock list; the symbols come from the yahoo_fin library.
As for the virus data, real-time new cases, cumulative cases, new deaths and cumulative deaths are extracted from the WHO Coronavirus (COVID-19) Dashboard.
- Steps
  - Statistics Analysis with Correlation
    - New_cases
    - Cumulative_cases
    - New_deaths
    - Cumulative_deaths
    - New_cases_shift
    - New_deaths_shift
    - Cumulative_cases_shift
    - Cumulative_deaths_shift
  - OLS Fit
  - Summarize the top 5 stocks highly correlated with virus data
Besides the data import and cleaning, the program includes the parts outlined above to investigate the virus impact on the equity market.
The correlation between the stock daily return and the virus data is investigated for each selected stock. The virus data includes the four original variables and another four variables generated by shifting the data forward five days, based on the assumption that there is a short delay before the virus data has an impact on the equity market. The eight variables are those listed in the outline above.
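The shift-and-correlate step can be sketched minimally as follows; the numbers and the `ret_XYZ` return column are made-up placeholders standing in for the real merged virus/return data:

```python
import pandas as pd

# Toy frame standing in for the merged virus/return data (hypothetical numbers)
df = pd.DataFrame({
    "New_cases": [100, 120, 90, 150, 130, 110, 160, 140, 120, 170],
    "ret_XYZ":   [0.01, -0.02, 0.005, -0.01, 0.02, 0.0, -0.015, 0.01, 0.005, -0.02],
})

# Shift the virus series forward 5 days, mirroring the assumed reaction lag
df["New_cases_shift"] = df["New_cases"].shift(5)

# Correlate the stock's daily return with both the raw and the shifted series;
# corrwith() drops the NaN rows introduced by the shift pairwise
cor = df[["New_cases", "New_cases_shift"]].corrwith(df["ret_XYZ"])
print(cor)
```

The same idea scales to all eight variables and all stocks at once via the full correlation matrix, as the full program does later.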
Since the correlation only captures the relationship between movements in the virus data and movements in the equity return, the next question is how much of the return movement can be explained by the virus data. Hence, multiple linear regression is used here: with the R-squared from the OLS fit, the degree of determination becomes more intuitive.
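For reference, the eight-variable model fitted for each equity can be written out as follows (a restatement of the regression described in the text, with the shifted variables dated five days earlier):

```latex
r_{i,t} = \beta_0
        + \beta_1\,\mathrm{NewCases}_t
        + \beta_2\,\mathrm{CumCases}_t
        + \beta_3\,\mathrm{NewDeaths}_t
        + \beta_4\,\mathrm{CumDeaths}_t
        + \beta_5\,\mathrm{NewCases}_{t-5}
        + \beta_6\,\mathrm{NewDeaths}_{t-5}
        + \beta_7\,\mathrm{CumCases}_{t-5}
        + \beta_8\,\mathrm{CumDeaths}_{t-5}
        + \varepsilon_{i,t}
```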
Additionally, to obtain the degree of determination for each equity, a loop rotates each equity through the regression.
After obtaining each US stock's correlation with the virus data and its OLS R-squared, the extreme correlation values are summarized and the R-squared values are sorted; the program then counts how often each stock either has an extreme correlation value or ranks in the top 10 by R-squared.
The 5 stocks with the highest frequency are selected into a table of the top 5 significant stocks.
Afterwards, the top 5 stock table is merged with the correlation table and the R-squared table. Users can therefore see the top 5 stocks on which the virus data has a significant impact, along with the statistics showing whether the impact is negative or positive and, via R-squared, how large it is.
Moreover, a line graph is drawn to visualize the virus impact on the top 5 stocks.
- Coding
- WHO Coronavirus (COVID-19) Dashboard: https://covid19.who.int/data
- US stock symbols from the S&P 500, NASDAQ, Dow Jones and others
- Historical prices from Yahoo! Finance
A. Data Import and Cleaning
```python
# a. Install and import libraries
!pip install yfinance
!pip install wget
!pip install html-table-parser-python3
!pip install yahoo_fin

import wget
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime, timedelta, date
from pandas_datareader import data as pdr
import statsmodels.formula.api as sm
from yahoo_fin import stock_info as si
import yfinance as yf
from statsmodels.tsa.statespace.sarimax import SARIMAX
import plotly.express as px

# Library for opening URLs and creating requests
import urllib.request

# Pretty-print Python data structures
import pprint

# Parse all the tables present on a website
from html_table_parser.parser import HTMLTableParser

yf.pdr_override()  # override previous packages

# b. COVID-19 data import
CVD19_URL = "https://covid19.who.int/WHO-COVID-19-global-data.csv"
CVD19 = wget.download(CVD19_URL, "CVD19.csv")
CVD19_df = pd.read_csv("/content/CVD19.csv")
CVD19_df.head(10)

# Group by date and sum up the data
CVD19_group = CVD19_df.groupby(['Date_reported'], as_index=False).sum()

# Rename the date column
CVD19_group = CVD19_group.rename(columns={'Date_reported': 'Date'})

# Change the date column from object to datetime
CVD19_group['Date'] = pd.to_datetime(CVD19_group['Date'], format='%Y-%m-%d')

# Shift the data forward 5 days, since the virus impact
# on the stock market may arrive with a delay
shiftnum = 5
CVD19_group['New_cases_shift'] = CVD19_group['New_cases'].shift(shiftnum)
CVD19_group['New_deaths_shift'] = CVD19_group['New_deaths'].shift(shiftnum)
CVD19_group['Cumulative_cases_shift'] = CVD19_group['Cumulative_cases'].shift(shiftnum)
CVD19_group['Cumulative_deaths_shift'] = CVD19_group['Cumulative_deaths'].shift(shiftnum)

# Fill the NaNs introduced by the shift and create a new dataframe
CVD19_data = CVD19_group.fillna(0)
CVD19_data
```

```python
# c. US stock price data import
# Gather stock symbols from major US exchanges
df1 = pd.DataFrame(si.tickers_sp500())
df2 = pd.DataFrame(si.tickers_nasdaq())
df3 = pd.DataFrame(si.tickers_dow())
df4 = pd.DataFrame(si.tickers_other())

# Convert each DataFrame to a list, then to a set
sym1 = set(symbol for symbol in df1[0].values.tolist())
sym2 = set(symbol for symbol in df2[0].values.tolist())
sym3 = set(symbol for symbol in df3[0].values.tolist())
sym4 = set(symbol for symbol in df4[0].values.tolist())

# Join the 4 sets into one; as a set, it holds no duplicate symbols
symbols = set.union(sym1, sym2, sym3, sym4)

# Some symbols have 5 characters; a fifth character is often used
# to differentiate share classes under certain exchange requirements
my_list = ['W', 'R', 'P', 'Q']
del_set = set()
sav_set = set()

for symbol in symbols:
    if len(symbol) > 4 and symbol[-1] in my_list:
        del_set.add(symbol)
    else:
        sav_set.add(symbol)

# Convert the cleaned symbol set to a list
list_stock = list(sav_set)
list_stock = list_stock[1:]
len(list_stock)

# Import US equity data from Yahoo! Finance,
# starting from the WHO virus data start date
equity_data = yf.download(' '.join(list_stock), start="2020-01-03", end=datetime.today())
equity_data

# Keep the adjusted closing price only
equity_price = equity_data['Adj Close']

# Calculate the percentage change (simple return) for each equity
equity_change = equity_price.pct_change()

# Drop columns that are all NaN, then fill the remaining NaNs with 0
equity_change = equity_change.loc[:, equity_change.notna().any(axis=0)]
equity_change = equity_change.fillna(0)

# Convert the index into a date column
equity_change.reset_index(inplace=True)
equity_change
```

```python
# d. Merge virus data with stock price data
# Align the date types
CVD19_data['Date'] = CVD19_data['Date'].dt.date
equity_change['Date'] = equity_change['Date'].dt.date
equity_change

# Merge the equity return table with the COVID-19 data table
df = pd.merge(CVD19_data, equity_change, how='left', on=['Date'])

# Drop NaNs: they appear on days when the stock market was closed
df_clean = df.dropna()
df_clean
```
B. Statistics analysis
```python
# Compute the correlation between virus data and stock returns with corr()
daily_cor_matrix = df_clean.corr()
daily_cor = daily_cor_matrix.iloc[0:8, 8:]
daily_cor = daily_cor.T
daily_cor = daily_cor.dropna()
daily_cor = daily_cor.sort_values(
    by=['New_cases', 'Cumulative_cases', 'New_deaths', 'Cumulative_deaths',
        'New_cases_shift', 'New_deaths_shift',
        'Cumulative_cases_shift', 'Cumulative_deaths_shift'],
    ascending=False)
daily_cor

# daily_cor = daily_cor.sort_values(by=['New_cases', 'New_cases_shift'], ascending=False)
daily_cor = daily_cor.sort_values(by=['New_deaths', 'New_deaths_shift'], ascending=False)
daily_cor.describe()

# Replace NaNs with 0 in the correlation table
daily_cor = daily_cor.fillna(0)

# Find the index and value of the extreme values in each column
cor_max = daily_cor.astype(float).idxmax()
cor_min = daily_cor.astype(float).idxmin()
max_value = daily_cor.astype(float).max()
min_value = daily_cor.astype(float).min()

# Convert the Series to DataFrames
cor_max = cor_max.to_frame()
cor_min = cor_min.to_frame()
max_value = max_value.to_frame()
min_value = min_value.to_frame()

# Convert the index to a column
cor_max['Virus_Data'] = cor_max.index
cor_min['Virus_Data'] = cor_min.index
max_value['Virus_Data'] = max_value.index
min_value['Virus_Data'] = min_value.index

# Rename the value columns
cor_max = cor_max.rename(columns={0: 'Stock_Max'})
cor_min = cor_min.rename(columns={0: 'Stock_Min'})
max_value = max_value.rename(columns={0: 'Max_cor'})
min_value = min_value.rename(columns={0: 'Min_cor'})

# Merge the four tables into one extreme-correlation table
extre_cor = pd.merge(cor_max, cor_min, on=['Virus_Data'])
extre_cor = pd.merge(extre_cor, max_value, on=['Virus_Data'])
extre_cor = pd.merge(extre_cor, min_value, on=['Virus_Data'])

# Reorder the columns
extre_cor = extre_cor.iloc[:, [1, 2, 4, 0, 3]]
extre_cor
```
C. OLS Fit
```python
# Create the list of stock symbols available in the cleaned dataframe
df_stock = list(df_clean.columns.values[9:])
len(df_stock)

# Fit each stock's daily return against the virus data
rsquare = []
for i in range(len(df_stock)):
    result = sm.ols(
        formula="Q(df_stock[i]) ~ New_cases + Cumulative_cases + New_deaths"
                " + Cumulative_deaths + New_cases_shift + New_deaths_shift"
                " + Cumulative_cases_shift + Cumulative_deaths_shift",
        data=df_clean).fit()
    rsquare.append(result.rsquared)

# Create a dataframe of stock symbols and the R-squared of each fit
dic_ols = {'Stock': df_stock, 'R_Square': rsquare}
df_olsfit = pd.DataFrame(dic_ols)

# Rank by R-squared
df_olsfit = df_olsfit.sort_values(by='R_Square', ascending=False)
df_olsfit = df_olsfit.dropna()
df_olsfit.head(10)
```
D. Summarize the top 5 stocks highly correlated with virus data
* Select the stocks with extreme correlation values and the stocks with the top 10 R-squared in the OLS models
* Calculate each stock's frequency of appearance and create a dataframe including the relevant data, i.e. correlation values and R-squared
```python
# Stocks appearing in the extreme-correlation table
# and stocks in the top 10 by R-squared
highcor_stock = extre_cor['Stock_Min'].to_list() + extre_cor['Stock_Max'].to_list()
highRsquare_stock = df_olsfit['Stock'].iloc[0:10].to_list()

# Combine into one list
sig_stock = highcor_stock + highRsquare_stock
sig_stock

# Count how often each stock appears
fre_stock = pd.Series(sig_stock).value_counts()

# Select the top 5 stocks by frequency
topnum = 5
higsig_stock = fre_stock[0:topnum]
higsig_stock = higsig_stock.to_frame()
higsig_stock = higsig_stock.reset_index()
higsig_stock = higsig_stock.rename(columns={'index': 'Stock', 0: 'Frequency'})
higsig_stock

# Move the index into a Stock column
daily_cor_1 = daily_cor.reset_index()
daily_cor_1 = daily_cor_1.rename(columns={'index': 'Stock'})
daily_cor_1

# Merge the top 5 stocks with the correlation table and the OLS R-squared table
higsig_stock_info = pd.merge(higsig_stock, df_olsfit, how='left', on=['Stock'])
higsig_stock_info = pd.merge(higsig_stock_info, daily_cor_1, how='left', on=['Stock'])
higsig_stock_info

# Visualization
# Drop the Frequency column to shorten the y-axis range
higsig_stock_info.drop('Frequency', axis=1, inplace=True)

pd.options.plotting.backend = "plotly"
higsig_stock_info.plot(x='Stock', y=higsig_stock_info.columns.to_list())

info_melt = higsig_stock_info.melt(id_vars='Stock',
                                   value_vars=higsig_stock_info.columns.to_list())
px.line(info_melt, x='Stock', y='value', color='variable')
```
Conclusion
- Output Illustration
- Top 5 stocks on which the virus data has a significant impact, and relevant statistics
- Line Graph
Since both the stock price data and the virus data are updated on a timely basis, the output based on data through 13 Jan 2023 is shown here as an illustration.
From the filtered result, it can be seen that besides pharmaceutical stocks like MNK (Mallinckrodt Ord Shs), which are affected by the virus data as expected, rising-rate and income ETFs like CGMU (Capital Group Municipal Income ETF), which are closely tied to macroeconomics, are also significantly correlated with the virus data. Additionally, it is worth noting that metal-industry names, such as the steel stock SPLP (Steel Partners Holdings L.P.), also turn out to be related to the virus data.
| Stock | Frequency | R_Square | New_cases | Cumulative_cases | New_deaths | Cumulative_deaths |
|-------|-----------|----------|-----------|------------------|------------|-------------------|
| CGMU | 6 | 0.05732247 | -0.0366245 | 0.17086446 | -0.1231886 | 0.12318316 |
| MNK | 5 | 0.07330773 | 0.02170666 | -0.2057192 | 0.12518742 | -0.1480804 |
| FMNY | 3 | 0.05359012 | -0.1787646 | -0.0570802 | -0.0020982 | -0.0737248 |
| RISR | 2 | 0.02755915 | 0.14530292 | 0.0167018 | 0.02088781 | 0.03529421 |
| SPLP | 2 | 0.03315919 | 0.02301829 | -0.0129021 | 0.15777535 | 0.01464001 |
| Stock | Frequency | New_cases_shift | New_deaths_shift | Cumulative_cases_shift | Cumulative_deaths_shift |
|-------|-----------|-----------------|------------------|------------------------|-------------------------|
| CGMU | 6 | -0.0229384 | -0.0984047 | 0.1721595 | 0.1239793 |
| MNK | 5 | 0.02105184 | 0.12182084 | -0.2070832 | -0.148975039 |
| FMNY | 3 | -0.1959668 | -0.0023733 | -0.0548701 | -0.073474835 |
| RISR | 2 | 0.14023064 | 0.02134866 | 0.01506644 | 0.035110124 |
| SPLP | 2 | 0.0188916 | 0.14614681 | -0.0132161 | 0.013443623 |
The line graph provides an intuitive view for readers, since the filtered stocks may respond differently to virus changes. For instance, as the graph below shows, MNK is negatively correlated with the shifted cumulative virus data, while it is positively correlated with the new virus data and the current cumulative virus data. For other stocks, the graph also makes it easy to see that stocks such as RISR are positively correlated with all the virus variables, meaning that the higher the number of cases, the higher the daily return.
In short, the graph gives a clear view by providing a cut-off that splits the positive and negative correlations.
- To improve
As a next step, one direction worth working on is estimating the daily returns of the top 5 stocks. Since the 5 stocks on which the virus data has a significant impact have already been selected, their future daily returns may be estimated with machine learning models. Investors would then know not only which stocks deserve more attention due to the virus threat, but could also form a solid expectation of future return changes backed by statistical evidence.
Moreover, loading the stock price history from Yahoo! Finance takes time, and the import could be more efficient if there were a way to speed it up. The reason this program still sticks with the Yahoo! Finance library is that it guarantees real-time stock data and offers a wide range of stocks. If a new Python library or updated features could help improve efficiency, that would be a possible future improvement.
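One simple option, assuming repeated runs within a day can tolerate slightly stale prices, is to cache the downloaded table on disk and reuse it on later runs. The cache filename below is hypothetical:

```python
import os
import pandas as pd

CACHE = "equity_cache.pkl"  # hypothetical cache file name

def load_prices(tickers, start, end):
    """Download prices once, then reuse the on-disk copy on later runs."""
    if os.path.exists(CACHE):
        return pd.read_pickle(CACHE)
    import yfinance as yf  # imported lazily; only needed on a cache miss
    data = yf.download(" ".join(tickers), start=start, end=end)["Adj Close"]
    data.to_pickle(CACHE)
    return data
```

Deleting the cache file (or keying the filename by date) would force a fresh download when up-to-date prices are needed.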