Realtime

GDELT + Stock Data: Correlation Analysis

Alright, maybe we need to step back a bit here. I've been jumping into these theories so quickly that I'm probably rushing things and missing key insights or leads to new strategy ideas. Let's start with our initial data. I'm going to keep it general, covering broad economic news for just the last year. I'm eager (perhaps naively) for quick returns. I would send an empty query to capture everything, but queries are limited to a minimum of four letters.

from collections import Counter
from backtesting import Strategy
from backtesting.lib import crossover
from gdeltdoc import GdeltDoc, Filters
import pandas as pd
import matplotlib.pyplot as plt
import yfinance as yf
import numpy as np
from scipy.stats import pearsonr, zscore
import seaborn as sns

START_DATE = "2023-12-30"
END_DATE = "2024-12-30"
TICKER = "SPY"
SEARCH_TERM = "economy"

We can get our data in the same way we did before. This time, though, I'm also pulling in the article count for the search results and the total article count, in the hope there's something there.

# Set up GDELT filters
f = Filters(
    keyword=SEARCH_TERM,
    start_date=START_DATE,
    end_date=END_DATE,
    #theme="WB_625_HEALTH_ECONOMICS_AND_FINANCE"
)

gd = GdeltDoc()
articles = gd.article_search(f)
timeline_tone = gd.timeline_search("timelinetone", f)
timeline_raw = gd.timeline_search("timelinevolraw", f)
timeline = timeline_tone.merge(timeline_raw, how="left", on="datetime")
timeline
datetime Average Tone Article Count All Articles
0 2023-12-30 00:00:00+00:00 0.2448 6307 111724
1 2023-12-31 00:00:00+00:00 0.2038 8796 129642
2 2024-01-01 00:00:00+00:00 0.4657 6215 89190
3 2024-01-02 00:00:00+00:00 0.1338 7402 113699
4 2024-01-03 00:00:00+00:00 -0.0392 8467 140100
... ... ... ... ...
362 2024-12-26 00:00:00+00:00 0.4670 9350 142705
363 2024-12-27 00:00:00+00:00 0.1646 10576 158638
364 2024-12-28 00:00:00+00:00 0.5111 6214 107098
365 2024-12-29 00:00:00+00:00 0.2791 4873 83258
366 2024-12-30 00:00:00+00:00 0.4826 9122 124712

367 rows × 4 columns

Like a true data scientist, let's plot these variables to see if anything is going on.

# Plotting the data with separate y-axes for each line
fig, ax1 = plt.subplots(figsize=(12, 6))

# Plot Average Tone on the first y-axis
ax1.plot(timeline['datetime'], timeline['Average Tone'], label='Average Tone', color='blue', linewidth=1.5)
ax1.set_ylabel('Average Tone', color='blue')
ax1.tick_params(axis='y', labelcolor='blue')

# Create a second y-axis for Article Count
ax2 = ax1.twinx()
ax2.plot(timeline['datetime'], timeline['Article Count'], label='Article Count', color='green', linewidth=1.5)
ax2.set_ylabel('Article Count', color='green')
ax2.tick_params(axis='y', labelcolor='green')

# Create a third y-axis for All Articles
ax3 = ax1.twinx()
ax3.spines['right'].set_position(('outward', 60))  # Offset the third y-axis
ax3.plot(timeline['datetime'], timeline['All Articles'], label='All Articles', color='red', linewidth=1.5)
ax3.set_ylabel('All Articles', color='red')
ax3.tick_params(axis='y', labelcolor='red')

# Adding labels and title
plt.title('Trends Over Time for Average Tone, Article Count, and All Articles')
plt.grid(alpha=0.3)

# Adding legends
fig.tight_layout()  # Adjust layout to accommodate multiple y-axes
plt.show()

[Figure: Trends Over Time for Average Tone, Article Count, and All Articles, each plotted on its own y-axis]

So insightful. The problem is that there's just too much data and too much day-to-day variation to spot any patterns by sight alone. Let's prepare our financial data and collect everything together before we do some more analysis.
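(Quick aside before the financial data: smoothing makes the daily series a lot easier to eyeball. A minimal sketch using a 7-day rolling mean of the tone series from the timeline dataframe above; nothing later depends on it.)

# A 7-day rolling mean cuts through the day-to-day noise in the tone series
smoothed_tone = timeline["Average Tone"].rolling(window=7, min_periods=1).mean()

plt.figure(figsize=(12, 4))
plt.plot(timeline["datetime"], timeline["Average Tone"], color="lightblue", label="Daily Average Tone")
plt.plot(timeline["datetime"], smoothed_tone, color="blue", label="7-day rolling mean")
plt.title("Average Tone: daily vs 7-day rolling mean")
plt.legend()
plt.show()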

def get_tone(df):
  # Look up the GDELT "Average Tone" for each trading day in the price index
  tones = [0 for i in range(df.shape[0])]
  timeline["datetime_clean"] = timeline.datetime.dt.date
  for index, date in enumerate(df.index):
    if date.date() in list(timeline.datetime_clean):
      tones[index] = float(timeline.loc[timeline.datetime_clean == date.date(), "Average Tone"].iloc[0])
    else:
      tones[index] = 0
  return pd.Series(tones)

def get_article_count(df):
  # Look up the matching-article count ("Article Count") for each trading day
  count = [0 for i in range(df.shape[0])]
  timeline["datetime_clean"] = timeline.datetime.dt.date
  for index, date in enumerate(df.index):
    if date.date() in list(timeline.datetime_clean):
      count[index] = float(timeline.loc[timeline.datetime_clean == date.date(), "Article Count"].iloc[0])
    else:
      count[index] = 0
  return pd.Series(count)

def get_all_count(df):
  # Look up the total monitored-article count ("All Articles") for each trading day
  count = [0 for i in range(df.shape[0])]
  timeline["datetime_clean"] = timeline.datetime.dt.date
  for index, date in enumerate(df.index):
    if date.date() in list(timeline.datetime_clean):
      count[index] = float(timeline.loc[timeline.datetime_clean == date.date(), "All Articles"].iloc[0])
    else:
      count[index] = 0
  return pd.Series(count)

def get_volume(df):
  return df.Volume

data = yf.download(TICKER, start=START_DATE, end=END_DATE, multi_level_index=False)
data = data[["Close", "Volume"]]  # keep only the price and volume columns

data["Average Tone"] = list(get_tone(data))
data["Article Count"] = list(get_article_count(data))
data["All Articles"] = list(get_all_count(data))

data
Close Volume Average Tone Article Count All Articles
Date
2024-01-02 466.663940 123623700 0.1338 7402.0 113699.0
2024-01-03 462.852844 103585900 -0.0392 8467.0 140100.0
2024-01-04 461.361969 84232200 -0.0177 11418.0 171729.0
2024-01-05 461.993866 86060800 0.0540 9509.0 159932.0
2024-01-08 468.589294 74879100 0.1491 8584.0 145645.0
... ... ... ... ... ...
2024-12-20 591.150024 125716700 0.4267 11918.0 170707.0
2024-12-23 594.690002 57635800 0.3075 10892.0 169788.0
2024-12-24 601.299988 33160100 0.4874 8868.0 132293.0
2024-12-26 601.340027 41219100 0.4670 9350.0 142705.0
2024-12-27 595.010010 64847900 0.1646 10576.0 158638.0

250 rows × 5 columns
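(A side note on the design: the per-date lookup functions above get the job done, but the same join can be done with a single merge on the calendar date. A minimal sketch of that alternative, assuming the timeline dataframe and the same yfinance download; I'm sticking with the loop version since it's what produced the table above.)

# Alternative: attach the GDELT columns with one merge instead of per-date lookups
prices = yf.download(TICKER, start=START_DATE, end=END_DATE, multi_level_index=False)

gdelt_daily = timeline.copy()
gdelt_daily["Date"] = pd.to_datetime(gdelt_daily["datetime"].dt.date)

merged = (
    prices.reset_index()
          .merge(
              gdelt_daily[["Date", "Average Tone", "Article Count", "All Articles"]],
              on="Date",
              how="left",
          )
          .set_index("Date")
)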

I'm thinking we can start with some correlation analysis. It's probably better to let the data lead the way than to guess at what the best approach might be.

# Calculate the correlation matrix
correlation_matrix = data[['Volume', 'Average Tone', 'Article Count', 'All Articles']].corr()

# Create a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(
    correlation_matrix,
    annot=True, 
    cmap='coolwarm',  
    fmt='.2f', 
    linewidths=0.5, 
    cbar_kws={'label': 'Correlation Coefficient'} 
)

# Add labels and title
plt.title('Correlation Matrix Heatmap')
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.show()

[Figure: correlation matrix heatmap for Volume, Average Tone, Article Count, and All Articles]

Now that is unexpected. Most of the correlations here are about what you'd expect, but the one between "Volume" and "Average Tone" is interesting. Let's check whether it's statistically significant.

pearsonr(data["Volume"], data["Average Tone"])
PearsonRResult(statistic=np.float64(-0.34058738224457286), pvalue=np.float64(3.309642087743323e-08))
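Before reading too much into one number, it's worth seeing what a correlation of about -0.34 actually looks like. A quick scatter sketch of the two series (nothing below depends on it):

# Visualise the Volume vs Average Tone relationship behind the correlation
plt.figure(figsize=(8, 6))
plt.scatter(data["Volume"], data["Average Tone"], alpha=0.5)
plt.xlabel("Volume")
plt.ylabel("Average Tone")
plt.title(f"{TICKER} Volume vs GDELT Average Tone")
plt.show()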

With a p-value on the order of 3e-8, a correlation of roughly -0.34 is very unlikely to be chance. I think we've found a trend. I'm not really sure what to do with it yet, but it's probably the first step towards a proper strategy.