The idea
Even though it is claimed that we have entered the post-COVID era and the world has returned to normal, the emergence of new variants means the virus threat is a problem we will keep confronting in the next decade. Hence, when it comes to equity market volatility, the virus impact is a factor, or signal, that investors should take into consideration.
In this light, a program is designed to investigate the virus impact on the equity market, with a focus on the correlation between equity daily returns and the virus data. In addition, statistical models are utilized to investigate whether the virus data can explain some of the movement in the daily returns of certain equities, and how much it can explain if an impact does exist. The program then filters out, in real time, the top 5 stocks with significant correlations with the virus data and relatively high R-squared values in multiple linear regressions.
The process
- Data Source
To ensure the data source is updated on a timely basis, the yfinance library from Yahoo! Finance is used as the source for stock prices. Approximately 10,142 US stock symbols from the S&P 500, NASDAQ, Dow Jones and other exchanges are selected into the stock list; the symbols come from the yahoo_fin library.
As for the virus data, real-time new cases, cumulative cases, new deaths and cumulative deaths are extracted from the WHO Coronavirus (COVID-19) Dashboard.
- Steps
  - Statistics Analysis with Correlation
    - New_cases
    - Cumulative_cases
    - New_deaths
    - Cumulative_deaths
    - New_cases_shift
    - New_deaths_shift
    - Cumulative_cases_shift
    - Cumulative_deaths_shift
  - OLS Fit
  - Summarize the top 5 stocks highly correlated with virus data
Besides the data import and cleaning, the program includes the parts outlined above to investigate the virus impact on the equity market.
The correlation between the stock daily return and the virus data is investigated for each selected stock. The virus data includes the four original variables and another four variables generated by shifting the data forward five days, based on the assumption that there is a short delay before the virus data has an impact on the equity market. The eight variables are those listed in the outline above.
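The shift-and-correlate step can be sketched minimally as follows; the numbers and the `ret_XYZ` return column are made-up placeholders standing in for the real merged virus/return data:

```python
import pandas as pd

# Toy frame standing in for the merged virus/return data (hypothetical numbers)
df = pd.DataFrame({
    "New_cases": [100, 120, 90, 150, 130, 110, 160, 140, 120, 170],
    "ret_XYZ":   [0.01, -0.02, 0.005, -0.01, 0.02, 0.0, -0.015, 0.01, 0.005, -0.02],
})

# Shift the virus series forward 5 days, mirroring the assumed reaction lag
df["New_cases_shift"] = df["New_cases"].shift(5)

# Correlate the stock's daily return with both the raw and the shifted series;
# corrwith() drops the NaN rows introduced by the shift pairwise
cor = df[["New_cases", "New_cases_shift"]].corrwith(df["ret_XYZ"])
print(cor)
```

The same idea scales to all eight variables and all stocks at once via the full correlation matrix, as the full program does later.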
Since the correlation only captures the relationship between movements in the virus data and movements in the equity return, the next question is how much of the return movement can be explained by the virus data. Hence, multiple linear regression is used here: with the R-squared from the OLS fit, the degree of determination becomes more intuitive.
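For reference, the eight-variable model fitted for each equity can be written out as follows (a restatement of the regression described in the text, with the shifted variables dated five days earlier):

```latex
r_{i,t} = \beta_0
        + \beta_1\,\mathrm{NewCases}_t
        + \beta_2\,\mathrm{CumCases}_t
        + \beta_3\,\mathrm{NewDeaths}_t
        + \beta_4\,\mathrm{CumDeaths}_t
        + \beta_5\,\mathrm{NewCases}_{t-5}
        + \beta_6\,\mathrm{NewDeaths}_{t-5}
        + \beta_7\,\mathrm{CumCases}_{t-5}
        + \beta_8\,\mathrm{CumDeaths}_{t-5}
        + \varepsilon_{i,t}
```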
Additionally, to obtain the degree of determination for each equity, a loop rotates each equity through the regression.
After obtaining each US stock's correlation with the virus data and its OLS R-squared, the extreme correlation values are summarized and the R-squared values are sorted; the program then counts how often each stock either has an extreme correlation value or ranks in the top 10 by R-squared.
The 5 stocks with the highest frequency are selected into a table of the top 5 significant stocks.
Afterwards, the top 5 stock table is merged with the correlation table and the R-squared table. Users can therefore see the top 5 stocks on which the virus data has a significant impact, along with the statistics showing whether the impact is negative or positive and, via R-squared, how large it is.
Moreover, a line graph is drawn to visualize the virus impact on the top 5 stocks.
- Coding
- WHO Coronavirus (COVID-19) Dashboard: https://covid19.who.int/data
- US stock symbols from the S&P 500, NASDAQ, Dow Jones and others
- Historical prices from Yahoo! Finance
A. Data Import and Cleaning
```python
# a. Install and import libraries
!pip install yfinance
!pip install wget
!pip install html-table-parser-python3
!pip install yahoo_fin

import wget
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime, timedelta, date
from pandas_datareader import data as pdr
import statsmodels.formula.api as sm
from yahoo_fin import stock_info as si
import yfinance as yf
from statsmodels.tsa.statespace.sarimax import SARIMAX
import plotly.express as px

# Library for opening URLs and creating requests
import urllib.request

# Pretty-print Python data structures
import pprint

# Parse all the tables present on a website
from html_table_parser.parser import HTMLTableParser

yf.pdr_override()  # override previous packages

# b. COVID-19 data import
CVD19_URL = "https://covid19.who.int/WHO-COVID-19-global-data.csv"
CVD19 = wget.download(CVD19_URL, "CVD19.csv")
CVD19_df = pd.read_csv("/content/CVD19.csv")
CVD19_df.head(10)

# Group by date and sum up the data
CVD19_group = CVD19_df.groupby(['Date_reported'], as_index=False).sum()

# Rename the date column
CVD19_group = CVD19_group.rename(columns={'Date_reported': 'Date'})

# Change the date column from object to datetime
CVD19_group['Date'] = pd.to_datetime(CVD19_group['Date'], format='%Y-%m-%d')

# Shift the data forward 5 days, since the virus impact
# on the stock market may arrive with a delay
shiftnum = 5
CVD19_group['New_cases_shift'] = CVD19_group['New_cases'].shift(shiftnum)
CVD19_group['New_deaths_shift'] = CVD19_group['New_deaths'].shift(shiftnum)
CVD19_group['Cumulative_cases_shift'] = CVD19_group['Cumulative_cases'].shift(shiftnum)
CVD19_group['Cumulative_deaths_shift'] = CVD19_group['Cumulative_deaths'].shift(shiftnum)

# Fill the NaNs introduced by the shift and create a new dataframe
CVD19_data = CVD19_group.fillna(0)
CVD19_data
```

```python
# c. US stock price data import
# Gather stock symbols from major US exchanges
df1 = pd.DataFrame(si.tickers_sp500())
df2 = pd.DataFrame(si.tickers_nasdaq())
df3 = pd.DataFrame(si.tickers_dow())
df4 = pd.DataFrame(si.tickers_other())

# Convert each DataFrame to a list, then to a set
sym1 = set(symbol for symbol in df1[0].values.tolist())
sym2 = set(symbol for symbol in df2[0].values.tolist())
sym3 = set(symbol for symbol in df3[0].values.tolist())
sym4 = set(symbol for symbol in df4[0].values.tolist())

# Join the 4 sets into one; as a set, it holds no duplicate symbols
symbols = set.union(sym1, sym2, sym3, sym4)

# Some symbols have 5 characters; a fifth character is often used
# to differentiate share classes under certain exchange requirements
my_list = ['W', 'R', 'P', 'Q']
del_set = set()
sav_set = set()

for symbol in symbols:
    if len(symbol) > 4 and symbol[-1] in my_list:
        del_set.add(symbol)
    else:
        sav_set.add(symbol)

# Convert the cleaned symbol set to a list
list_stock = list(sav_set)
list_stock = list_stock[1:]
len(list_stock)

# Import US equity data from Yahoo! Finance,
# starting from the WHO virus data start date
equity_data = yf.download(' '.join(list_stock), start="2020-01-03", end=datetime.today())
equity_data

# Keep the adjusted closing price only
equity_price = equity_data['Adj Close']

# Calculate the percentage change (simple return) for each equity
equity_change = equity_price.pct_change()

# Drop columns that are all NaN, then fill the remaining NaNs with 0
equity_change = equity_change.loc[:, equity_change.notna().any(axis=0)]
equity_change = equity_change.fillna(0)

# Convert the index into a date column
equity_change.reset_index(inplace=True)
equity_change
```

```python
# d. Merge virus data with stock price data
# Align the date types
CVD19_data['Date'] = CVD19_data['Date'].dt.date
equity_change['Date'] = equity_change['Date'].dt.date
equity_change

# Merge the equity return table with the COVID-19 data table
df = pd.merge(CVD19_data, equity_change, how='left', on=['Date'])

# Drop NaNs: they appear on days when the stock market was closed
df_clean = df.dropna()
df_clean
```
B. Statistics analysis
```python
# Compute the correlation between virus data and stock returns with corr()
daily_cor_matrix = df_clean.corr()
daily_cor = daily_cor_matrix.iloc[0:8, 8:]
daily_cor = daily_cor.T
daily_cor = daily_cor.dropna()
daily_cor = daily_cor.sort_values(
    by=['New_cases', 'Cumulative_cases', 'New_deaths', 'Cumulative_deaths',
        'New_cases_shift', 'New_deaths_shift',
        'Cumulative_cases_shift', 'Cumulative_deaths_shift'],
    ascending=False)
daily_cor

# daily_cor = daily_cor.sort_values(by=['New_cases', 'New_cases_shift'], ascending=False)
daily_cor = daily_cor.sort_values(by=['New_deaths', 'New_deaths_shift'], ascending=False)
daily_cor.describe()

# Replace NaNs with 0 in the correlation table
daily_cor = daily_cor.fillna(0)

# Find the index and value of the extreme values in each column
cor_max = daily_cor.astype(float).idxmax()
cor_min = daily_cor.astype(float).idxmin()
max_value = daily_cor.astype(float).max()
min_value = daily_cor.astype(float).min()

# Convert the Series to DataFrames
cor_max = cor_max.to_frame()
cor_min = cor_min.to_frame()
max_value = max_value.to_frame()
min_value = min_value.to_frame()

# Convert the index to a column
cor_max['Virus_Data'] = cor_max.index
cor_min['Virus_Data'] = cor_min.index
max_value['Virus_Data'] = max_value.index
min_value['Virus_Data'] = min_value.index

# Rename the value columns
cor_max = cor_max.rename(columns={0: 'Stock_Max'})
cor_min = cor_min.rename(columns={0: 'Stock_Min'})
max_value = max_value.rename(columns={0: 'Max_cor'})
min_value = min_value.rename(columns={0: 'Min_cor'})

# Merge the four tables into one extreme-correlation table
extre_cor = pd.merge(cor_max, cor_min, on=['Virus_Data'])
extre_cor = pd.merge(extre_cor, max_value, on=['Virus_Data'])
extre_cor = pd.merge(extre_cor, min_value, on=['Virus_Data'])

# Reorder the columns
extre_cor = extre_cor.iloc[:, [1, 2, 4, 0, 3]]
extre_cor
```
C. OLS Fit
```python
# Create the list of stock symbols available in the cleaned dataframe
df_stock = list(df_clean.columns.values[9:])
len(df_stock)

# Fit each stock's daily return against the virus data
rsquare = []
for i in range(len(df_stock)):
    result = sm.ols(
        formula="Q(df_stock[i]) ~ New_cases + Cumulative_cases + New_deaths"
                " + Cumulative_deaths + New_cases_shift + New_deaths_shift"
                " + Cumulative_cases_shift + Cumulative_deaths_shift",
        data=df_clean).fit()
    rsquare.append(result.rsquared)

# Create a dataframe of stock symbols and the R-squared of each fit
dic_ols = {'Stock': df_stock, 'R_Square': rsquare}
df_olsfit = pd.DataFrame(dic_ols)

# Rank by R-squared
df_olsfit = df_olsfit.sort_values(by='R_Square', ascending=False)
df_olsfit = df_olsfit.dropna()
df_olsfit.head(10)
```
D. Summarize the top 5 stocks highly correlated with virus data
* Select the stocks with extreme correlation values and the stocks with the top 10 R-squared in the OLS models
* Calculate each stock's frequency of appearance and create a dataframe including the relevant data, i.e. correlation values and R-squared
```python
# Stocks appearing in the extreme-correlation table
# and stocks in the top 10 by R-squared
highcor_stock = extre_cor['Stock_Min'].to_list() + extre_cor['Stock_Max'].to_list()
highRsquare_stock = df_olsfit['Stock'].iloc[0:10].to_list()

# Combine into one list
sig_stock = highcor_stock + highRsquare_stock
sig_stock

# Count how often each stock appears
fre_stock = pd.Series(sig_stock).value_counts()

# Select the top 5 stocks by frequency
topnum = 5
higsig_stock = fre_stock[0:topnum]
higsig_stock = higsig_stock.to_frame()
higsig_stock = higsig_stock.reset_index()
higsig_stock = higsig_stock.rename(columns={'index': 'Stock', 0: 'Frequency'})
higsig_stock

# Move the index into a Stock column
daily_cor_1 = daily_cor.reset_index()
daily_cor_1 = daily_cor_1.rename(columns={'index': 'Stock'})
daily_cor_1

# Merge the top 5 stocks with the correlation table and the OLS R-squared table
higsig_stock_info = pd.merge(higsig_stock, df_olsfit, how='left', on=['Stock'])
higsig_stock_info = pd.merge(higsig_stock_info, daily_cor_1, how='left', on=['Stock'])
higsig_stock_info

# Visualization
# Drop the Frequency column to shorten the y-axis range
higsig_stock_info.drop('Frequency', axis=1, inplace=True)

pd.options.plotting.backend = "plotly"
higsig_stock_info.plot(x='Stock', y=higsig_stock_info.columns.to_list())

info_melt = higsig_stock_info.melt(id_vars='Stock',
                                   value_vars=higsig_stock_info.columns.to_list())
px.line(info_melt, x='Stock', y='value', color='variable')
```
Conclusion
- Output Illustration
- Top 5 stocks on which the virus data has a significant impact, and relevant statistics
- Line Graph
Since both the stock price data and the virus data are updated on a timely basis, the output based on data through 13 Jan 2023 is shown here as an illustration.
From the filtered result, it can be seen that besides pharmaceutical stocks like MNK (Mallinckrodt Ord Shs), which are affected by the virus data as expected, rising-rate and income ETFs like CGMU (Capital Group Municipal Income ETF), which are closely tied to macroeconomics, are also significantly correlated with the virus data. Additionally, it is worth noting that metal-industry names, such as the steel stock SPLP (Steel Partners Holdings L.P.), also turn out to be related to the virus data.
| Stock | Frequency | R_Square | New_cases | Cumulative_cases | New_deaths | Cumulative_deaths |
|-------|-----------|----------|-----------|------------------|------------|-------------------|
| CGMU | 6 | 0.05732247 | -0.0366245 | 0.17086446 | -0.1231886 | 0.12318316 |
| MNK | 5 | 0.07330773 | 0.02170666 | -0.2057192 | 0.12518742 | -0.1480804 |
| FMNY | 3 | 0.05359012 | -0.1787646 | -0.0570802 | -0.0020982 | -0.0737248 |
| RISR | 2 | 0.02755915 | 0.14530292 | 0.0167018 | 0.02088781 | 0.03529421 |
| SPLP | 2 | 0.03315919 | 0.02301829 | -0.0129021 | 0.15777535 | 0.01464001 |
| Stock | Frequency | New_cases_shift | New_deaths_shift | Cumulative_cases_shift | Cumulative_deaths_shift |
|-------|-----------|-----------------|------------------|------------------------|-------------------------|
| CGMU | 6 | -0.0229384 | -0.0984047 | 0.1721595 | 0.1239793 |
| MNK | 5 | 0.02105184 | 0.12182084 | -0.2070832 | -0.148975039 |
| FMNY | 3 | -0.1959668 | -0.0023733 | -0.0548701 | -0.073474835 |
| RISR | 2 | 0.14023064 | 0.02134866 | 0.01506644 | 0.035110124 |
| SPLP | 2 | 0.0188916 | 0.14614681 | -0.0132161 | 0.013443623 |
The line graph provides an intuitive view for readers, since the filtered stocks may respond differently to virus changes. For instance, as the graph below shows, MNK is negatively correlated with the shifted cumulative virus data, while it is positively correlated with the new virus data and the current cumulative virus data. For other stocks, the graph also makes it easy to see that stocks such as RISR are positively correlated with all the virus variables, meaning that the higher the number of cases, the higher the daily return.
In short, the graph gives a clear view by providing a cut-off that splits the positive and negative correlations.
- To improve
As a next step, one direction worth working on is estimating the daily returns of the top 5 stocks. Since the 5 stocks on which the virus data has a significant impact have already been selected, their future daily returns may be estimated with machine learning models. Investors would then know not only which stocks deserve more attention due to the virus threat, but could also form a solid expectation of future return changes backed by statistical evidence.
Moreover, loading the stock price history from Yahoo! Finance takes time, and the import could be more efficient if there were a way to speed it up. The reason this program still sticks with the Yahoo! Finance library is that it guarantees real-time stock data and offers a wide range of stocks. If a new Python library or updated features could help improve efficiency, that would be a possible future improvement.
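One simple option, assuming repeated runs within a day can tolerate slightly stale prices, is to cache the downloaded table on disk and reuse it on later runs. The cache filename below is hypothetical:

```python
import os
import pandas as pd

CACHE = "equity_cache.pkl"  # hypothetical cache file name

def load_prices(tickers, start, end):
    """Download prices once, then reuse the on-disk copy on later runs."""
    if os.path.exists(CACHE):
        return pd.read_pickle(CACHE)
    import yfinance as yf  # imported lazily; only needed on a cache miss
    data = yf.download(" ".join(tickers), start=start, end=end)["Adj Close"]
    data.to_pickle(CACHE)
    return data
```

Deleting the cache file (or keying the filename by date) would force a fresh download when up-to-date prices are needed.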