COMEX Gold Future Prediction

The idea

The idea of COMEX gold future prediction derives from a pitch of longing the COMEX gold future to hedge the inflation risk with the rate rise. It happens that gold futures are taken into the portfolios as one component due to its rather stable value and negative correlation with other financial instruments such as equity and fixed income.

Hence, the python coding program is designed to provide real-time updated predictions of COMEX gold future prices with the inflation data as indicators.

Additionally, after figuring out the predictive power of the independent variables (inflation data and gold ETF price), since the indicator data in the future may not be available, to improve the accuracy of the predictions, the time series predictions of the indicator data is used for the predictions.

Colab Coding Link

The process

Data Source

To provide real-time updates on the predictions, the data source for both independent variables and the dependent variable, i.e. inflation data and the future prices, is scrapped from websites to ensure the most recent data is input for the predictions.

The details of the data source are below:

ㅤ	Variable Name	Data Source
Independent variable	Consumer price index (CPI)	Economic Research - Federal Reserve Bank of St. Louis
Independent variable	Personal Consumption Expenditures (PCE)	Economic Research - Federal Reserve Bank of St. Louis
Independent variable	Real Gross Domestic Product (GDP)	Economic Research - Federal Reserve Bank of St. Louis
Independent variable	10-Year Real Interest Rate	Economic Research - Federal Reserve Bank of St. Louis
Dependent variable	Gold ETF price (SPDR Gold Shares)	Yahoo! Finance
Dependent variable	COMEX gold future price	Yahoo! Finance

Regarding the historical data available, the indicator data was shifted back for one day for the prediction purpose. This means that for instance, to predict today’s gold future price, the CPI used will be yesterday’s CPI instead of today’s CPI, since it’s highly possible that today’s CPI will not be availble at the moment when predicting today’s CPI.

Steps

Besides the data input and cleaning included, the program follows the steps below to find out the predictive power of independent variables used and improve the accuracy of models.

COMEX Gold Future Prediction through Time Series

COMEX Gold Future Prediction with OLS Model using Inflation Data

COMEX Gold Future Prediction with OLS Model using Inflation Data and Gold ETF Price

Coding

A. Loading Data from the Yahoo! Finance and perform the time series prediction


# Install yfinance
!pip install yfinance

# import pandas
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime, timedelta, date

from pandas_datareader import data as pdr
import yfinance as yf

from statsmodels.tsa.statespace.sarimax import SARIMAX
 
yf.pdr_override() # override previous packages

# Input Gold Future Data from Yahoo! Finance and plot

# Select the 20-year period till the current date
year = 20
start = datetime.today() - timedelta(days=365*year)
end  = datetime.today()
 
# The first argument is the series we want, second and third are the starting & end dates
comexgld = pdr.get_data_yahoo('GC=F', start, end) 

 
# Select the adjusted closing price data 
gld = pd.DataFrame({'gld': comexgld['Adj Close']})
            
# Print the DataFrame
print(gld)

# Plot the respective graphs
gld.plot(title='Plot of COMEX Gold Future Price', figsize=(14,10), grid=True);

# Split the 10 years by taking the recent 3-year data as the testing data, and the rest as the training data
splityear = 5
splitdate = datetime.today() - timedelta(days=365*splityear)
print(splitdate)


# Create the testing dataset
testnum = 150

predictperiod = datetime.today() - timedelta(days=testnum)
print(predictperiod)

# Split the testing data and training data and visualize

train = gld[gld.index < pd.to_datetime(predictperiod - timedelta(days=1), format='%Y-%m-%d')]
test = gld[gld.index > pd.to_datetime(predictperiod - timedelta(days=1), format='%Y-%m-%d')]


plt.plot(train, color = "black")
plt.plot(test, color = "red")
plt.ylabel('COMEX Gold Price')
plt.xlabel('Date')
plt.xticks(rotation=45)
plt.title("Training and Testing split for COMEX Gold Future Data")
plt.show()

# Holt-Winters Method

from statsmodels.tsa.api import ExponentialSmoothing, SimpleExpSmoothing, Holt 
import statsmodels.api as sm

test_copy = test.copy() 
fit_ts = ExponentialSmoothing(np.asarray(train['gld']) ,seasonal_periods=365 ,trend='add', seasonal='add',).fit() 

test_copy['Prediction'] = fit_ts.forecast(len(test))

# Prediction Plot

plt.figure(figsize=(16,8))  
plt.plot(train['gld'], label='Train') 
plt.plot(test['gld'], label='Actual') 
plt.plot(test_copy['Prediction'], label='Prediction') 
plt.legend(loc='best') 
plt.show()

B. Prediction with Inflation Drivers in Linear Regression

From the prediction through time series, it can be found out that the predictions do not fit the actual prices much, though it does show a increasing trend of the pricing. The chances are that the fluctuation of the inflation, have effects on the gold future price. Hence, in the prediction through the linear regression below, variables below are selected to predict the gold future price using the linear regression:

* Consumer price index (CPI) * Personal Consumption Expenditures (PCE) * Real Gross Domestic Product * 10-Year Real Interest Rate Kindly note that all data is real-time updated, from the Economic Research - Federal Reserve Bank of St. Louis: https://research.stlouisfed.org/


# a. Import indicator data and data cleaning

# install wget 
!pip install wget
import wget

# import CPI data

CPI_URL = "https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1168&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=CPIAUCSL&scale=left&cosd=1947-01-01&coed=2022-11-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2022-12-23&revision_date=2022-12-23&nd=1947-01-01"

response = wget.download(CPI_URL, "CPI.csv")

CPI_df = pd.read_csv("/content/CPI.csv")


CPI_df.head(5)

# Convert monthly CPI data to daily: CPI on each date in the month maintains the same, at the value of the montly CPI.

#convert to datetimeindex
CPI_df = CPI_df.set_index('DATE')
CPI_df.index = pd.to_datetime(CPI_df.index, dayfirst=True)

#add new next month for correct resample
idx = CPI_df.index[-1] + pd.offsets.MonthBegin(1)

#add new next month for correct resample
CPI_df = CPI_df.append(CPI_df.iloc[[-1]].rename({CPI_df.index[-1]: idx}))

#resample with forward filling values, remove last helper row
CPI_df = CPI_df.resample('D').ffill().iloc[:-1]

print (CPI_df)

#create a new dataset with till to date data
latestcpi = CPI_df.CPIAUCSL[-1]
CPI_data = {'DATE': pd.date_range(idx,end,freq='d'), 'CPIAUCSL': latestcpi}
CPI_df1 = pd.DataFrame(data=CPI_data)
CPI_df1 = CPI_df1.set_index('DATE')


# Append the new table to the CPI historical data table 
CPI_df = CPI_df.append(CPI_df1)
CPI_df.tail(10)

# import PCE data
import wget

PCE_URL = "https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=748&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=PCE&scale=left&cosd=1959-01-01&coed=2022-10-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2022-12-23&revision_date=2022-12-23&nd=1959-01-01"

response = wget.download(PCE_URL, "PCE.csv")

PCE_df = pd.read_csv("/content/PCE.csv")

PCE_df.tail(5)

# Convert monthly PCE data to daily: PCE on each date in the month maintains the same, at the value of the montly PCE.

#convert to datetimeindex
PCE_df = PCE_df.set_index('DATE')
PCE_df.index = pd.to_datetime(PCE_df.index, dayfirst=True)

#add new next month for correct resample
idx = PCE_df.index[-1] + pd.offsets.MonthBegin(1)

#add new next month for correct resample
PCE_df = PCE_df.append(PCE_df.iloc[[-1]].rename({PCE_df.index[-1]: idx}))

#resample with forward filling values, remove last helper row
PCE_df = PCE_df.resample('D').ffill().iloc[:-1]

print (PCE_df)

#create a new dataset with till to date data
latestpce = PCE_df.PCE[-1]
PCE_data = {'DATE': pd.date_range(idx,end,freq='d'), 'PCE': latestpce}
PCE_df1 = pd.DataFrame(data=PCE_data)
PCE_df1 = PCE_df1.set_index('DATE')


# Append the new table to the PCE historical data table 
PCE_df = PCE_df.append(PCE_df1)
PCE_df.tail(10)

# import GDP data
GDP_URL = "https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1168&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=GDPC1&scale=left&cosd=1947-01-01&coed=2022-07-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Quarterly&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2022-12-23&revision_date=2022-12-23&nd=1947-01-01"

response = wget.download(GDP_URL, "RealGDP.csv")

GDP_df = pd.read_csv("/content/RealGDP.csv")

GDP_df.tail(5)

# Convert quarterly GDP data to daily: GDP on each date in the month maintains the same, at the value of the quarterly GDP.

#convert to datetimeindex
GDP_df = GDP_df.set_index('DATE')
GDP_df.index = pd.to_datetime(GDP_df.index, dayfirst=True)

#add new next month for correct resample
idx = GDP_df.index[-1] + pd.offsets.MonthBegin(1)

#add new next month for correct resample
GDP_df = GDP_df.append(GDP_df.iloc[[-1]].rename({GDP_df.index[-1]: idx}))

#resample with forward filling values, remove last helper row
GDP_df = GDP_df.resample('D').ffill().iloc[:-1]

print (GDP_df.tail(40))

#create a new dataset with till to date data
latestGDP = GDP_df.GDPC1[-1]
GDP_data = {'DATE': pd.date_range(idx,end,freq='d'), 'GDPC1': latestGDP}
GDP_df1 = pd.DataFrame(data=GDP_data)
GDP_df1 = GDP_df1.set_index('DATE')


# Append the new table to the GDP historical data table 
GDP_df = GDP_df.append(GDP_df1)
GDP_df.tail(10)

# import 10-Year Real Interest Rate data
Rate_URL = "https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1168&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=REAINTRATREARAT10Y&scale=left&cosd=1982-01-01&coed=2022-12-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2022-12-23&revision_date=2022-12-23&nd=1982-01-01"

response = wget.download(Rate_URL, "Rate.csv")

Rate_df = pd.read_csv("/content/Rate.csv")

Rate_df.tail(5)

# Convert monthly Rate data to daily: Rate on each date in the month maintains the same, at the value of the montly rate.

#convert to datetimeindex
Rate_df = Rate_df.set_index('DATE')
Rate_df.index = pd.to_datetime(Rate_df.index, dayfirst=True)

#add new next month for correct resample
idx = Rate_df.index[-1] + pd.offsets.MonthBegin(1)

#add new next month for correct resample
Rate_df = Rate_df.append(Rate_df.iloc[[-1]].rename({Rate_df.index[-1]: idx}))

#resample with forward filling values, remove last helper row
Rate_df = Rate_df.resample('D').ffill().iloc[:-1]

print(Rate_df)

#create a new dataset with till to date data
latestRate = Rate_df.REAINTRATREARAT10Y[-1]
Rate_data = {'DATE': pd.date_range(idx,end-timedelta(days=1),freq='d'), 'REAINTRATREARAT10Y': latestRate}
Rate_df1 = pd.DataFrame(data=Rate_data)
Rate_df1 = Rate_df1.set_index('DATE')


# Append the new table to the Rate historical data table 
Rate_df = Rate_df.append(Rate_df1)
Rate_df.tail(10)

# b. Data Processing (Fill in NA, Merging and Shifting data)

# Fill in the gold future price till today with the latest price
if gld.index[-1] != end:
  latestgld = gld.gld[-1]
  gld_data = {'DATE': pd.date_range(gld.index[-1]+timedelta(days=1), end,freq='d'), 'gld': latestgld}
  gld_df1 = pd.DataFrame(data=gld_data)
  gld_df1 = gld_df1.set_index('DATE')
  # Append the new table to the Rate historical data table 
  gld = gld.append(gld_df1)

gld.tail(10)

# Rename the gold price index to prepare for the merge
gld.index.name = 'DATE'

# Merge each variable data into one dataframe

df = CPI_df

df = pd.merge(CPI_df, PCE_df, how='left', on = ['DATE'])
df = pd.merge(df, GDP_df, how='left', on = ['DATE'])
df = pd.merge(df, Rate_df, how='left', on = ['DATE'])
df = pd.merge(df, gld, how='left', on = ['DATE'])


# Rename the columns
df = df.rename(columns={'CPIAUCSL': 'CPI', 'GDPC1': 'GDP', 'REAINTRATREARAT10Y': 'Rate', 'gld': 'Gold'})

# Got the 20-year Dataset
df.tail(20)

# Since in the real life, when the prediction is being made, it's possible that the inflation indicator data on that day will not be available. Hence, by setting the shifting number as 1, we use yesterday's inflation data to predict today's gold future price here.

# Set Lag Values for indicators to prepare for the prediction

# Bring the data forward for 1 day, meaning that yesterday's data is used to predict today's price
shiftnum = 1
df['CPI_shift'] = df['CPI'].shift(shiftnum)
df['PCE_shift'] = df['PCE'].shift(shiftnum)
df['GDP_shift'] = df['GDP'].shift(shiftnum)
df['Rate_shift'] = df['Rate'].shift(shiftnum)


# Remove rows with duplicated index
df = df[~df.index.duplicated()]
df

# c. OLS modelling with historical data

import statsmodels.formula.api as sm

result = sm.ols(formula="Gold ~ CPI_shift + PCE_shift + GDP_shift + Rate_shift", data = df).fit()
print(result.summary())

# d. Prediction with OLS Model

# The period for predictions will be extended from 150 days to a longer period (210 days) to evaluate the performance of the model, and provide more valuable insights regaring a rather long-term gold future price movement. 

# In this case, during some prediction period, the indicator data will not be available since it's the data from the future. To improve the accuracy of the predictions, the time series data of those indicators is used to predict the gold future price.  For instance, if the current date is 27 December 2022, the CPI data afterwards will not be available. Thus the CPI data after 27 December 2022 for predictions is the CPI predictions derived from the time series analysis of the historical CPI data.

# Create the testing dataset (Total predicting days: 210 days, including 90-day day in the past to compare with the actual prices and evaluate the model performace)

predictdays = 210
actualdays = 90 # 120-day data to check the predictions with the actual prices and 30-day data left to obtain the predictions  
predictperiod = datetime.today() - timedelta(days=predictdays - actualdays)
print(predictperiod)

train_df = df[df.index < pd.to_datetime(predictperiod - timedelta(days=1), format='%Y-%m-%d')]
test_df = df[df.index > pd.to_datetime(predictperiod - timedelta(days=1), format='%Y-%m-%d')]

test_df_histo = test_df[['CPI_shift', 'PCE_shift', 'GDP_shift', 'Rate_shift']]

#test_df.index = pd.to_datetime(test_df.index, format = '%Y-%m-%d').strftime('%Y-%m-%d')

test_df_histo # testing data with 120-day historical data

# Create the dataset in 30 days with the dates
test_predict = {'DATE': pd.date_range(end, end+timedelta(days=actualdays-1),freq='d')}
test_predict

# Clear the NAN in the shifted data
CPI_clean = np.asarray(df['CPI_shift'])
CPI_clean = CPI_clean[~np.isnan(CPI_clean)]

PCE_clean = np.asarray(df['PCE_shift'])
PCE_clean = PCE_clean[~np.isnan(PCE_clean)]

GDP_clean = np.asarray(df['GDP_shift'])
GDP_clean = GDP_clean[~np.isnan(GDP_clean)]

Rate_clean = np.asarray(df['Rate_shift'])
Rate_clean = Rate_clean[~np.isnan(Rate_clean)]

# Fit using the times series (Exponential Smoothing)

fit_ts_CPI = ExponentialSmoothing(CPI_clean ,seasonal_periods=365 ,trend='add', seasonal='add',).fit() 
fit_ts_PCE = ExponentialSmoothing(PCE_clean ,seasonal_periods=365 ,trend='add', seasonal='add',).fit() 
fit_ts_GDP = ExponentialSmoothing(GDP_clean ,seasonal_periods=365 ,trend='add', seasonal='add',).fit() 
fit_ts_Rate = ExponentialSmoothing(Rate_clean ,seasonal_periods=365 ,trend='add', seasonal='add',).fit() 

# Make the predictions and add into the 30-day testing dataset

test_predict['CPI_shift'] = fit_ts_CPI.forecast(actualdays)
test_predict['PCE_shift'] = fit_ts_PCE.forecast(actualdays)
test_predict['GDP_shift'] = fit_ts_GDP.forecast(actualdays)
test_predict['Rate_shift'] = fit_ts_Rate.forecast(actualdays)
test_predict

# Append the 30-day data to the testing dataframe
test_df1 = pd.DataFrame(data=test_predict)

# Format the date column and set it as the index
test_df1 = test_df1.set_index('DATE')

test_df = test_df_histo.append(test_df1)
test_df

# Prediction

# Create the test date frame to predict 

prediction = result.predict(test_df)
pre_df = prediction
pre_df.index = test_df.index
pre_df

# Plot the actual prices and the prediction prices

fig, ax = plt.subplots()
ax.plot(gld.index[len(gld)-predictdays:], gld['gld'].values[len(gld)-predictdays:], label='Price')
ax.plot(pre_df.index, pre_df)

plt.show()

C. Prediction in Linear Regression with Inflation Indicators and Gold ETF Price

As from the statistical summary and the prediction from the model, what can be found out that when only using the inflation data as the indicators of the gold future price, the historic movement of the inflation data can fit around 91.8% of the future price data, since the adjusted R-square derived from the model is 0.918. This indicates that the inflation data only can already explain the gold future price much. However, from the line chart of the prediction and actual prices, there're also some points that can be improved in the model. * Prediction Delay: We can found out that there's some delay when it comes to the predictions. For instance, according to the predictions, the lowest point during the fall is estimated to happen between late November 2022 and December 2022, while the lowestest point of the actual price during the period is the price in early November 2022. * Conciseness: The more recent the prediction is, the closer the prediction is to the actual price. * Level: The line generated from the predictions is a line made up of multiple straight lines, which means that predictions for certain periods are the same value. This is because the inflation indicators we used here are either quarterly or monthly. Therefore, in next step, the gold ETF data (SPDR Gold Shares) will be added into the model to provide the real-time movements for the predictions. Additionally, for the commodity market, the consumer estimation influence the future prices much. Hence, the gold ETF data is also used as an variable which is related to the consumer confidence.


# a. Import Gold ETF data from Yahoo! Finance

# Input the gold ETF data (SPDR Gold Shares) from Yahoo! Finance and plot
 
# The first argument is the series we want, second and third are the starting & end dates
gldetf = pdr.get_data_yahoo('GLD', start, end) 

 
# Select the adjusted closing price data 
gldetf = pd.DataFrame({'etf': gldetf['Adj Close']})
            
# Print the DataFrame
print(gldetf)

# Plot the respective graphs
gldetf.plot(title='Plot of SPDR Gold Shares', figsize=(14,10), grid=True);

# Merge the ETF data with the indicator data

gldetf.index.name = 'DATE'

df_etf = pd.merge(df, gldetf, how='left', on = ['DATE'])


df_etf.tail(5)

# Shift the ETF data

shiftnum2 = 1

df_etf['etf_shift'] = df_etf['etf'].shift(shiftnum2)
gldetf['etf_shift'] = gldetf['etf'].shift(shiftnum2)

# Remove rows with duplicated index
df_etf = df_etf[~df_etf.index.duplicated()]
df_etf

# b. OLS Modelling with Gold ETF added as one indicator

# OLS and statistical summary

result1 = sm.ols(formula="Gold ~ CPI_shift + PCE_shift + GDP_shift + Rate_shift + etf_shift", data = df_etf).fit()
print(result1.summary())

#Model R-squared: 0.999

# c. Creating the testing dataset with time series predictions of indicator data

# To further evaluate the performance of using the time series data to predict, in the following steps, the whole testing dataset will be filled in with the predictions of indicators using time series analysis.

# Create the dataset in 30 days with the dates
test_predict_2 = {'DATE': pd.date_range(end-timedelta(days=predictdays-actualdays), end+timedelta(days=actualdays-1),freq='d')}
test_predict_2

# Clear the NAN in the shifted data
CPI_clean_2 = np.asarray(df['CPI_shift'][:actualdays-predictdays-1])
CPI_clean_2 = CPI_clean_2[~np.isnan(CPI_clean_2)]

PCE_clean_2 = np.asarray(df['PCE_shift'][:actualdays-predictdays-1])
PCE_clean_2 = PCE_clean_2[~np.isnan(PCE_clean_2)]

GDP_clean_2 = np.asarray(df['GDP_shift'][:actualdays-predictdays-1])
GDP_clean_2 = GDP_clean_2[~np.isnan(GDP_clean_2)]

Rate_clean_2 = np.asarray(df['Rate_shift'][:actualdays-predictdays-1])
Rate_clean_2 = Rate_clean_2[~np.isnan(Rate_clean_2)]

# Fit using the times series (Exponential Smoothing)

fit_ts_CPI_2 = ExponentialSmoothing(CPI_clean_2 ,seasonal_periods=365 ,trend='add', seasonal='add',).fit() 
fit_ts_PCE_2 = ExponentialSmoothing(PCE_clean_2 ,seasonal_periods=365 ,trend='add', seasonal='add',).fit() 
fit_ts_GDP_2 = ExponentialSmoothing(GDP_clean_2 ,seasonal_periods=365 ,trend='add', seasonal='add',).fit() 
fit_ts_Rate_2 = ExponentialSmoothing(Rate_clean_2 ,seasonal_periods=365 ,trend='add', seasonal='add',).fit() 

# Make the predictions and add into the 30-day testing dataset

test_predict_2['CPI_shift'] = fit_ts_CPI_2.forecast(predictdays)
test_predict_2['PCE_shift'] = fit_ts_PCE_2.forecast(predictdays)
test_predict_2['GDP_shift'] = fit_ts_GDP_2.forecast(predictdays)
test_predict_2['Rate_shift'] = fit_ts_Rate_2.forecast(predictdays)
test_predict_2

# Transfer the data to the dataframe
test_df2 = pd.DataFrame(data=test_predict_2)

# Format the date column and set it as the index
test_df2 = test_df2.set_index('DATE')
test_df2.index = test_df2.index.normalize()

#test_df = test_df_histo.append(test_df1)
test_df2

# Clear the NAN in the gold ETF data
etf_clean = np.asarray(df_etf['etf'][:actualdays-predictdays-1])
etf_clean = etf_clean[~np.isnan(etf_clean)]

etf_shift_clean = np.asarray(df_etf['etf_shift'][:actualdays-predictdays-1])
etf_shift_clean = etf_shift_clean[~np.isnan(etf_shift_clean)]

# Fit using the times series (Exponential Smoothing)

fit_ts_etf = ExponentialSmoothing(etf_clean ,seasonal_periods=365 ,trend='add', seasonal='add',).fit() 
fit_ts_etf_shift = ExponentialSmoothing(etf_shift_clean ,seasonal_periods=365 ,trend='add', seasonal='add',).fit() 

# Make the predictions and add into the 30-day testing dataset

test_df2['etf'] = fit_ts_etf.forecast(predictdays)
test_df2['etf_shift'] = fit_ts_etf_shift.forecast(predictdays)

test_df2

# select the portion of the testing dataset that contains actual historical gold future prices and merge with the gold future prices
test_histo = pd.merge(test_df2[:-actualdays+predictdays], gld, how='left', on = ['DATE'])

#fill in NaN with the nearest data
test_histo.fillna(method='ffill', inplace=True)

test_histo

# d. Select the testing dataset with actual historical prices to build the OLS model and evaluate the predictive power of the model

# Evaluate the model that used the time series data to predict

result2 = sm.ols(formula="gld ~ CPI_shift + PCE_shift + GDP_shift + Rate_shift + etf_shift", data = test_histo).fit()
print(result2.summary())

# Model R-square: 0.823

# e. Make the predictions using the OSL Model

# Prediction

# Create the test date frame to predict 

prediction_addetf = result2.predict(test_df2)
pre_df_addetf = prediction_addetf
pre_df_addetf.index = test_df2.index
pre_df_addetf

# Plot the actual prices and the prediction prices

fig, ax = plt.subplots()
ax.plot(gld.index[len(gld)-predictdays:], gld['gld'].values[len(gld)-predictdays:], label='Price')
ax.plot(pre_df_addetf.index, pre_df_addetf)

plt.show()

Conclusion

Python Visualization Output

With the orange line presenting the prediction and the blue line on behalf of the actual historical data, below are line graphs from the two OLS models.

Visualization Output from the OLS model (Using the Inflation Data as the Indicators)

Visualization Output from the OLS model (Using the Inflation Data and Gold ETF as the Indicators)

Key Findings

Through the models trained to predict, it can be found out that:

Inflation data is highly correlated with COMEX gold future prices. A model with inflation data as the indicators can explain 91.8% of the COMEX gold future prices.

Gold ETF also have effects on COMEX gold future prices. With gold ETF added into the model, 99.9% percent of the COMEX gold future price movements can be explaint by the inflation data and gold ETF data.

The real-time updated COMEX gold future predictions are obtained, with the predictive power (R-square) 0.823.

To Improve

From the Python Coding program, the real-time gold future price predictions with high predictive power are able to obtained whenever the program is run.

However, the visualization is not interactive and it may be challenging for people who have limited knowledge in Python to read codes and interperate outputs. Therefore, it would be more efficient if there’s one way that the users can have a intuitive view of the predictions without going through the coding part.

To improve on this, one dashboard is linked to the Python program in order to provide a easy-to-read and interactive view. Kindly find the other post on the visualization with dashboard.

Further Improvements

Updated on 13 Feb, 2023

Localization of timezone

Since the datatype of the time in yahoo! Finance data is changed from datetime to datetime with a timezone, the codes below are added to localize to prepare for the data merge according to the date in the same data type.


# Localize the timezone
gld.index = pd.to_datetime(gld.index).tz_localize(None)

Remove NAN from the historical COMEX gold data due to the closing market


# Remove the NAN in historical data
gld_df = gld_df[~gld_df.isin([np.nan, np.inf, -np.inf]).any(1)]
gld_df

New signals added in to improve the prediction power

Federal Funds Effective Rate
All Employees, Total Nonfarm
Nominal Broad U.S. Dollar Index
Total Reserves excluding Gold for Republic of Korea
Producer Price Index by Commodity: Miscellaneous Products: Jewelry, Gold and Platinum

R square for the real-time predictions rose from 0.881 to 0.939 owing to adding the signals above, according to data on 13 Feb, 2023.