COMEX Gold Future Prediction

COMEX Gold Future Prediction

Edited
Last updated July 14, 2023

The idea


notion image
 
The idea of COMEX gold future prediction derives from a pitch of longing the COMEX gold future to hedge the inflation risk with the rate rise. It happens that gold futures are taken into the portfolios as one component due to its rather stable value and negative correlation with other financial instruments such as equity and fixed income.
 
Hence, the python coding program is designed to provide real-time updated predictions of COMEX gold future prices with the inflation data as indicators.
 
Additionally, after figuring out the predictive power of the independent variables (inflation data and gold ETF price), since the indicator data in the future may not be available, to improve the accuracy of the predictions, the time series predictions of the indicator data is used for the predictions.
 
 

The process


  1. Data Source
    1. To provide real-time updates on the predictions, the data source for both independent variables and the dependent variable, i.e. inflation data and the future prices, is scrapped from websites to ensure the most recent data is input for the predictions.
       
      The details of the data source are below:
       
      Variable Name
      Data Source
      Independent variable
      Consumer price index (CPI)
      Independent variable
      Personal Consumption Expenditures (PCE)
      Independent variable
      Real Gross Domestic Product (GDP)
      Independent variable
      10-Year Real Interest Rate
      Dependent variable
      Gold ETF price (SPDR Gold Shares)
      Dependent variable
      COMEX gold future price
       
      Regarding the historical data available, the indicator data was shifted back for one day for the prediction purpose. This means that for instance, to predict today’s gold future price, the CPI used will be yesterday’s CPI instead of today’s CPI, since it’s highly possible that today’s CPI will not be availble at the moment when predicting today’s CPI.
       
  1. Steps
    1.  
      Besides the data input and cleaning included, the program follows the steps below to find out the predictive power of independent variables used and improve the accuracy of models.
       
      • COMEX Gold Future Prediction through Time Series
      • COMEX Gold Future Prediction with OLS Model using Inflation Data
      • COMEX Gold Future Prediction with OLS Model using Inflation Data and Gold ETF Price
       
  1. Coding
A. Loading Data from the Yahoo! Finance and perform the time series prediction
# Install yfinance !pip install yfinance # import pandas import pandas as pd import numpy as np import matplotlib.pyplot as plt from datetime import datetime, timedelta, date from pandas_datareader import data as pdr import yfinance as yf from statsmodels.tsa.statespace.sarimax import SARIMAX yf.pdr_override() # override previous packages # Input Gold Future Data from Yahoo! Finance and plot # Select the 20-year period till the current date year = 20 start = datetime.today() - timedelta(days=365*year) end = datetime.today() # The first argument is the series we want, second and third are the starting & end dates comexgld = pdr.get_data_yahoo('GC=F', start, end) # Select the adjusted closing price data gld = pd.DataFrame({'gld': comexgld['Adj Close']}) # Print the DataFrame print(gld) # Plot the respective graphs gld.plot(title='Plot of COMEX Gold Future Price', figsize=(14,10), grid=True); # Split the 10 years by taking the recent 3-year data as the testing data, and the rest as the training data splityear = 5 splitdate = datetime.today() - timedelta(days=365*splityear) print(splitdate) # Create the testing dataset testnum = 150 predictperiod = datetime.today() - timedelta(days=testnum) print(predictperiod) # Split the testing data and training data and visualize train = gld[gld.index < pd.to_datetime(predictperiod - timedelta(days=1), format='%Y-%m-%d')] test = gld[gld.index > pd.to_datetime(predictperiod - timedelta(days=1), format='%Y-%m-%d')] plt.plot(train, color = "black") plt.plot(test, color = "red") plt.ylabel('COMEX Gold Price') plt.xlabel('Date') plt.xticks(rotation=45) plt.title("Training and Testing split for COMEX Gold Future Data") plt.show() # Holt-Winters Method from statsmodels.tsa.api import ExponentialSmoothing, SimpleExpSmoothing, Holt import statsmodels.api as sm test_copy = test.copy() fit_ts = ExponentialSmoothing(np.asarray(train['gld']) ,seasonal_periods=365 ,trend='add', seasonal='add',).fit() test_copy['Prediction'] = fit_ts.forecast(len(test)) # Prediction Plot plt.figure(figsize=(16,8)) plt.plot(train['gld'], label='Train') plt.plot(test['gld'], label='Actual') plt.plot(test_copy['Prediction'], label='Prediction') plt.legend(loc='best') plt.show()
 
B. Prediction with Inflation Drivers in Linear Regression
From the prediction through time series, it can be found out that the predictions do not fit the actual prices much, though it does show a increasing trend of the pricing. The chances are that the fluctuation of the inflation, have effects on the gold future price. Hence, in the prediction through the linear regression below, variables below are selected to predict the gold future price using the linear regression:
* Consumer price index (CPI) * Personal Consumption Expenditures (PCE) * Real Gross Domestic Product * 10-Year Real Interest Rate Kindly note that all data is real-time updated, from the Economic Research - Federal Reserve Bank of St. Louis: https://research.stlouisfed.org/
 
# a. Import indicator data and data cleaning # install wget !pip install wget import wget # import CPI data CPI_URL = "https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1168&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=CPIAUCSL&scale=left&cosd=1947-01-01&coed=2022-11-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2022-12-23&revision_date=2022-12-23&nd=1947-01-01" response = wget.download(CPI_URL, "CPI.csv") CPI_df = pd.read_csv("/content/CPI.csv") CPI_df.head(5) # Convert monthly CPI data to daily: CPI on each date in the month maintains the same, at the value of the montly CPI. #convert to datetimeindex CPI_df = CPI_df.set_index('DATE') CPI_df.index = pd.to_datetime(CPI_df.index, dayfirst=True) #add new next month for correct resample idx = CPI_df.index[-1] + pd.offsets.MonthBegin(1) #add new next month for correct resample CPI_df = CPI_df.append(CPI_df.iloc[[-1]].rename({CPI_df.index[-1]: idx})) #resample with forward filling values, remove last helper row CPI_df = CPI_df.resample('D').ffill().iloc[:-1] print (CPI_df) #create a new dataset with till to date data latestcpi = CPI_df.CPIAUCSL[-1] CPI_data = {'DATE': pd.date_range(idx,end,freq='d'), 'CPIAUCSL': latestcpi} CPI_df1 = pd.DataFrame(data=CPI_data) CPI_df1 = CPI_df1.set_index('DATE') # Append the new table to the CPI historical data table CPI_df = CPI_df.append(CPI_df1) CPI_df.tail(10) # import PCE data import wget PCE_URL = "https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=748&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=PCE&scale=left&cosd=1959-01-01&coed=2022-10-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2022-12-23&revision_date=2022-12-23&nd=1959-01-01" response = wget.download(PCE_URL, "PCE.csv") PCE_df = pd.read_csv("/content/PCE.csv") PCE_df.tail(5) # Convert monthly PCE data to daily: PCE on each date in the month maintains the same, at the value of the montly PCE. #convert to datetimeindex PCE_df = PCE_df.set_index('DATE') PCE_df.index = pd.to_datetime(PCE_df.index, dayfirst=True) #add new next month for correct resample idx = PCE_df.index[-1] + pd.offsets.MonthBegin(1) #add new next month for correct resample PCE_df = PCE_df.append(PCE_df.iloc[[-1]].rename({PCE_df.index[-1]: idx})) #resample with forward filling values, remove last helper row PCE_df = PCE_df.resample('D').ffill().iloc[:-1] print (PCE_df) #create a new dataset with till to date data latestpce = PCE_df.PCE[-1] PCE_data = {'DATE': pd.date_range(idx,end,freq='d'), 'PCE': latestpce} PCE_df1 = pd.DataFrame(data=PCE_data) PCE_df1 = PCE_df1.set_index('DATE') # Append the new table to the PCE historical data table PCE_df = PCE_df.append(PCE_df1) PCE_df.tail(10) # import GDP data GDP_URL = "https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1168&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=GDPC1&scale=left&cosd=1947-01-01&coed=2022-07-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Quarterly&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2022-12-23&revision_date=2022-12-23&nd=1947-01-01" response = wget.download(GDP_URL, "RealGDP.csv") GDP_df = pd.read_csv("/content/RealGDP.csv") GDP_df.tail(5) # Convert quarterly GDP data to daily: GDP on each date in the month maintains the same, at the value of the quarterly GDP. #convert to datetimeindex GDP_df = GDP_df.set_index('DATE') GDP_df.index = pd.to_datetime(GDP_df.index, dayfirst=True) #add new next month for correct resample idx = GDP_df.index[-1] + pd.offsets.MonthBegin(1) #add new next month for correct resample GDP_df = GDP_df.append(GDP_df.iloc[[-1]].rename({GDP_df.index[-1]: idx})) #resample with forward filling values, remove last helper row GDP_df = GDP_df.resample('D').ffill().iloc[:-1] print (GDP_df.tail(40)) #create a new dataset with till to date data latestGDP = GDP_df.GDPC1[-1] GDP_data = {'DATE': pd.date_range(idx,end,freq='d'), 'GDPC1': latestGDP} GDP_df1 = pd.DataFrame(data=GDP_data) GDP_df1 = GDP_df1.set_index('DATE') # Append the new table to the GDP historical data table GDP_df = GDP_df.append(GDP_df1) GDP_df.tail(10) # import 10-Year Real Interest Rate data Rate_URL = "https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1168&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=REAINTRATREARAT10Y&scale=left&cosd=1982-01-01&coed=2022-12-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2022-12-23&revision_date=2022-12-23&nd=1982-01-01" response = wget.download(Rate_URL, "Rate.csv") Rate_df = pd.read_csv("/content/Rate.csv") Rate_df.tail(5) # Convert monthly Rate data to daily: Rate on each date in the month maintains the same, at the value of the montly rate. #convert to datetimeindex Rate_df = Rate_df.set_index('DATE') Rate_df.index = pd.to_datetime(Rate_df.index, dayfirst=True) #add new next month for correct resample idx = Rate_df.index[-1] + pd.offsets.MonthBegin(1) #add new next month for correct resample Rate_df = Rate_df.append(Rate_df.iloc[[-1]].rename({Rate_df.index[-1]: idx})) #resample with forward filling values, remove last helper row Rate_df = Rate_df.resample('D').ffill().iloc[:-1] print(Rate_df) #create a new dataset with till to date data latestRate = Rate_df.REAINTRATREARAT10Y[-1] Rate_data = {'DATE': pd.date_range(idx,end-timedelta(days=1),freq='d'), 'REAINTRATREARAT10Y': latestRate} Rate_df1 = pd.DataFrame(data=Rate_data) Rate_df1 = Rate_df1.set_index('DATE') # Append the new table to the Rate historical data table Rate_df = Rate_df.append(Rate_df1) Rate_df.tail(10) # b. Data Processing (Fill in NA, Merging and Shifting data) # Fill in the gold future price till today with the latest price if gld.index[-1] != end: latestgld = gld.gld[-1] gld_data = {'DATE': pd.date_range(gld.index[-1]+timedelta(days=1), end,freq='d'), 'gld': latestgld} gld_df1 = pd.DataFrame(data=gld_data) gld_df1 = gld_df1.set_index('DATE') # Append the new table to the Rate historical data table gld = gld.append(gld_df1) gld.tail(10) # Rename the gold price index to prepare for the merge gld.index.name = 'DATE' # Merge each variable data into one dataframe df = CPI_df df = pd.merge(CPI_df, PCE_df, how='left', on = ['DATE']) df = pd.merge(df, GDP_df, how='left', on = ['DATE']) df = pd.merge(df, Rate_df, how='left', on = ['DATE']) df = pd.merge(df, gld, how='left', on = ['DATE']) # Rename the columns df = df.rename(columns={'CPIAUCSL': 'CPI', 'GDPC1': 'GDP', 'REAINTRATREARAT10Y': 'Rate', 'gld': 'Gold'}) # Got the 20-year Dataset df.tail(20) # Since in the real life, when the prediction is being made, it's possible that the inflation indicator data on that day will not be available. Hence, by setting the shifting number as 1, we use yesterday's inflation data to predict today's gold future price here. # Set Lag Values for indicators to prepare for the prediction # Bring the data forward for 1 day, meaning that yesterday's data is used to predict today's price shiftnum = 1 df['CPI_shift'] = df['CPI'].shift(shiftnum) df['PCE_shift'] = df['PCE'].shift(shiftnum) df['GDP_shift'] = df['GDP'].shift(shiftnum) df['Rate_shift'] = df['Rate'].shift(shiftnum) # Remove rows with duplicated index df = df[~df.index.duplicated()] df # c. OLS modelling with historical data import statsmodels.formula.api as sm result = sm.ols(formula="Gold ~ CPI_shift + PCE_shift + GDP_shift + Rate_shift", data = df).fit() print(result.summary()) # d. Prediction with OLS Model # The period for predictions will be extended from 150 days to a longer period (210 days) to evaluate the performance of the model, and provide more valuable insights regaring a rather long-term gold future price movement. # In this case, during some prediction period, the indicator data will not be available since it's the data from the future. To improve the accuracy of the predictions, the time series data of those indicators is used to predict the gold future price. For instance, if the current date is 27 December 2022, the CPI data afterwards will not be available. Thus the CPI data after 27 December 2022 for predictions is the CPI predictions derived from the time series analysis of the historical CPI data. # Create the testing dataset (Total predicting days: 210 days, including 90-day day in the past to compare with the actual prices and evaluate the model performace) predictdays = 210 actualdays = 90 # 120-day data to check the predictions with the actual prices and 30-day data left to obtain the predictions predictperiod = datetime.today() - timedelta(days=predictdays - actualdays) print(predictperiod) train_df = df[df.index < pd.to_datetime(predictperiod - timedelta(days=1), format='%Y-%m-%d')] test_df = df[df.index > pd.to_datetime(predictperiod - timedelta(days=1), format='%Y-%m-%d')] test_df_histo = test_df[['CPI_shift', 'PCE_shift', 'GDP_shift', 'Rate_shift']] #test_df.index = pd.to_datetime(test_df.index, format = '%Y-%m-%d').strftime('%Y-%m-%d') test_df_histo # testing data with 120-day historical data # Create the dataset in 30 days with the dates test_predict = {'DATE': pd.date_range(end, end+timedelta(days=actualdays-1),freq='d')} test_predict # Clear the NAN in the shifted data CPI_clean = np.asarray(df['CPI_shift']) CPI_clean = CPI_clean[~np.isnan(CPI_clean)] PCE_clean = np.asarray(df['PCE_shift']) PCE_clean = PCE_clean[~np.isnan(PCE_clean)] GDP_clean = np.asarray(df['GDP_shift']) GDP_clean = GDP_clean[~np.isnan(GDP_clean)] Rate_clean = np.asarray(df['Rate_shift']) Rate_clean = Rate_clean[~np.isnan(Rate_clean)] # Fit using the times series (Exponential Smoothing) fit_ts_CPI = ExponentialSmoothing(CPI_clean ,seasonal_periods=365 ,trend='add', seasonal='add',).fit() fit_ts_PCE = ExponentialSmoothing(PCE_clean ,seasonal_periods=365 ,trend='add', seasonal='add',).fit() fit_ts_GDP = ExponentialSmoothing(GDP_clean ,seasonal_periods=365 ,trend='add', seasonal='add',).fit() fit_ts_Rate = ExponentialSmoothing(Rate_clean ,seasonal_periods=365 ,trend='add', seasonal='add',).fit() # Make the predictions and add into the 30-day testing dataset test_predict['CPI_shift'] = fit_ts_CPI.forecast(actualdays) test_predict['PCE_shift'] = fit_ts_PCE.forecast(actualdays) test_predict['GDP_shift'] = fit_ts_GDP.forecast(actualdays) test_predict['Rate_shift'] = fit_ts_Rate.forecast(actualdays) test_predict # Append the 30-day data to the testing dataframe test_df1 = pd.DataFrame(data=test_predict) # Format the date column and set it as the index test_df1 = test_df1.set_index('DATE') test_df = test_df_histo.append(test_df1) test_df # Prediction # Create the test date frame to predict prediction = result.predict(test_df) pre_df = prediction pre_df.index = test_df.index pre_df # Plot the actual prices and the prediction prices fig, ax = plt.subplots() ax.plot(gld.index[len(gld)-predictdays:], gld['gld'].values[len(gld)-predictdays:], label='Price') ax.plot(pre_df.index, pre_df) plt.show()
 
 
C. Prediction in Linear Regression with Inflation Indicators and Gold ETF Price
As from the statistical summary and the prediction from the model, what can be found out that when only using the inflation data as the indicators of the gold future price, the historic movement of the inflation data can fit around 91.8% of the future price data, since the adjusted R-square derived from the model is 0.918. This indicates that the inflation data only can already explain the gold future price much. However, from the line chart of the prediction and actual prices, there're also some points that can be improved in the model. * Prediction Delay: We can found out that there's some delay when it comes to the predictions. For instance, according to the predictions, the lowest point during the fall is estimated to happen between late November 2022 and December 2022, while the lowestest point of the actual price during the period is the price in early November 2022. * Conciseness: The more recent the prediction is, the closer the prediction is to the actual price. * Level: The line generated from the predictions is a line made up of multiple straight lines, which means that predictions for certain periods are the same value. This is because the inflation indicators we used here are either quarterly or monthly. Therefore, in next step, the gold ETF data (SPDR Gold Shares) will be added into the model to provide the real-time movements for the predictions. Additionally, for the commodity market, the consumer estimation influence the future prices much. Hence, the gold ETF data is also used as an variable which is related to the consumer confidence.
 
# a. Import Gold ETF data from Yahoo! Finance # Input the gold ETF data (SPDR Gold Shares) from Yahoo! Finance and plot # The first argument is the series we want, second and third are the starting & end dates gldetf = pdr.get_data_yahoo('GLD', start, end) # Select the adjusted closing price data gldetf = pd.DataFrame({'etf': gldetf['Adj Close']}) # Print the DataFrame print(gldetf) # Plot the respective graphs gldetf.plot(title='Plot of SPDR Gold Shares', figsize=(14,10), grid=True); # Merge the ETF data with the indicator data gldetf.index.name = 'DATE' df_etf = pd.merge(df, gldetf, how='left', on = ['DATE']) df_etf.tail(5) # Shift the ETF data shiftnum2 = 1 df_etf['etf_shift'] = df_etf['etf'].shift(shiftnum2) gldetf['etf_shift'] = gldetf['etf'].shift(shiftnum2) # Remove rows with duplicated index df_etf = df_etf[~df_etf.index.duplicated()] df_etf # b. OLS Modelling with Gold ETF added as one indicator # OLS and statistical summary result1 = sm.ols(formula="Gold ~ CPI_shift + PCE_shift + GDP_shift + Rate_shift + etf_shift", data = df_etf).fit() print(result1.summary()) #Model R-squared: 0.999 # c. Creating the testing dataset with time series predictions of indicator data # To further evaluate the performance of using the time series data to predict, in the following steps, the whole testing dataset will be filled in with the predictions of indicators using time series analysis. # Create the dataset in 30 days with the dates test_predict_2 = {'DATE': pd.date_range(end-timedelta(days=predictdays-actualdays), end+timedelta(days=actualdays-1),freq='d')} test_predict_2 # Clear the NAN in the shifted data CPI_clean_2 = np.asarray(df['CPI_shift'][:actualdays-predictdays-1]) CPI_clean_2 = CPI_clean_2[~np.isnan(CPI_clean_2)] PCE_clean_2 = np.asarray(df['PCE_shift'][:actualdays-predictdays-1]) PCE_clean_2 = PCE_clean_2[~np.isnan(PCE_clean_2)] GDP_clean_2 = np.asarray(df['GDP_shift'][:actualdays-predictdays-1]) GDP_clean_2 = GDP_clean_2[~np.isnan(GDP_clean_2)] Rate_clean_2 = np.asarray(df['Rate_shift'][:actualdays-predictdays-1]) Rate_clean_2 = Rate_clean_2[~np.isnan(Rate_clean_2)] # Fit using the times series (Exponential Smoothing) fit_ts_CPI_2 = ExponentialSmoothing(CPI_clean_2 ,seasonal_periods=365 ,trend='add', seasonal='add',).fit() fit_ts_PCE_2 = ExponentialSmoothing(PCE_clean_2 ,seasonal_periods=365 ,trend='add', seasonal='add',).fit() fit_ts_GDP_2 = ExponentialSmoothing(GDP_clean_2 ,seasonal_periods=365 ,trend='add', seasonal='add',).fit() fit_ts_Rate_2 = ExponentialSmoothing(Rate_clean_2 ,seasonal_periods=365 ,trend='add', seasonal='add',).fit() # Make the predictions and add into the 30-day testing dataset test_predict_2['CPI_shift'] = fit_ts_CPI_2.forecast(predictdays) test_predict_2['PCE_shift'] = fit_ts_PCE_2.forecast(predictdays) test_predict_2['GDP_shift'] = fit_ts_GDP_2.forecast(predictdays) test_predict_2['Rate_shift'] = fit_ts_Rate_2.forecast(predictdays) test_predict_2 # Transfer the data to the dataframe test_df2 = pd.DataFrame(data=test_predict_2) # Format the date column and set it as the index test_df2 = test_df2.set_index('DATE') test_df2.index = test_df2.index.normalize() #test_df = test_df_histo.append(test_df1) test_df2 # Clear the NAN in the gold ETF data etf_clean = np.asarray(df_etf['etf'][:actualdays-predictdays-1]) etf_clean = etf_clean[~np.isnan(etf_clean)] etf_shift_clean = np.asarray(df_etf['etf_shift'][:actualdays-predictdays-1]) etf_shift_clean = etf_shift_clean[~np.isnan(etf_shift_clean)] # Fit using the times series (Exponential Smoothing) fit_ts_etf = ExponentialSmoothing(etf_clean ,seasonal_periods=365 ,trend='add', seasonal='add',).fit() fit_ts_etf_shift = ExponentialSmoothing(etf_shift_clean ,seasonal_periods=365 ,trend='add', seasonal='add',).fit() # Make the predictions and add into the 30-day testing dataset test_df2['etf'] = fit_ts_etf.forecast(predictdays) test_df2['etf_shift'] = fit_ts_etf_shift.forecast(predictdays) test_df2 # select the portion of the testing dataset that contains actual historical gold future prices and merge with the gold future prices test_histo = pd.merge(test_df2[:-actualdays+predictdays], gld, how='left', on = ['DATE']) #fill in NaN with the nearest data test_histo.fillna(method='ffill', inplace=True) test_histo # d. Select the testing dataset with actual historical prices to build the OLS model and evaluate the predictive power of the model # Evaluate the model that used the time series data to predict result2 = sm.ols(formula="gld ~ CPI_shift + PCE_shift + GDP_shift + Rate_shift + etf_shift", data = test_histo).fit() print(result2.summary()) # Model R-square: 0.823 # e. Make the predictions using the OSL Model # Prediction # Create the test date frame to predict prediction_addetf = result2.predict(test_df2) pre_df_addetf = prediction_addetf pre_df_addetf.index = test_df2.index pre_df_addetf # Plot the actual prices and the prediction prices fig, ax = plt.subplots() ax.plot(gld.index[len(gld)-predictdays:], gld['gld'].values[len(gld)-predictdays:], label='Price') ax.plot(pre_df_addetf.index, pre_df_addetf) plt.show()
 
 
 

Conclusion


  1. Python Visualization Output
    1.  
      With the orange line presenting the prediction and the blue line on behalf of the actual historical data, below are line graphs from the two OLS models.
       
      Visualization Output from the OLS model (Using the Inflation Data as the Indicators)
       
      notion image
       
      Visualization Output from the OLS model (Using the Inflation Data and Gold ETF as the Indicators)
       
      notion image
       
  1. Key Findings
    1. Through the models trained to predict, it can be found out that:
      • Inflation data is highly correlated with COMEX gold future prices. A model with inflation data as the indicators can explain 91.8% of the COMEX gold future prices.
      • Gold ETF also have effects on COMEX gold future prices. With gold ETF added into the model, 99.9% percent of the COMEX gold future price movements can be explaint by the inflation data and gold ETF data.
      • The real-time updated COMEX gold future predictions are obtained, with the predictive power (R-square) 0.823.
       
  1. To Improve
    1. From the Python Coding program, the real-time gold future price predictions with high predictive power are able to obtained whenever the program is run.
       
      However, the visualization is not interactive and it may be challenging for people who have limited knowledge in Python to read codes and interperate outputs. Therefore, it would be more efficient if there’s one way that the users can have a intuitive view of the predictions without going through the coding part.
       
      To improve on this, one dashboard is linked to the Python program in order to provide a easy-to-read and interactive view. Kindly find the other post on the visualization with dashboard.
 

Further Improvements


Updated on 13 Feb, 2023
 
  • Localization of timezone
    • Since the datatype of the time in yahoo! Finance data is changed from datetime to datetime with a timezone, the codes below are added to localize to prepare for the data merge according to the date in the same data type.
       
      # Localize the timezone gld.index = pd.to_datetime(gld.index).tz_localize(None)
 
 
  • Remove NAN from the historical COMEX gold data due to the closing market
    •  
      # Remove the NAN in historical data gld_df = gld_df[~gld_df.isin([np.nan, np.inf, -np.inf]).any(1)] gld_df
       
 
  • New signals added in to improve the prediction power
    • Federal Funds Effective Rate
    • All Employees, Total Nonfarm
    • Nominal Broad U.S. Dollar Index
    • Total Reserves excluding Gold for Republic of Korea
    • Producer Price Index by Commodity: Miscellaneous Products: Jewelry, Gold and Platinum
    •  
      R square for the real-time predictions rose from 0.881 to 0.939 owing to adding the signals above, according to data on 13 Feb, 2023.