
GDELT + Stock Data: Sentiment Score Correlation Analysis

So the goal is to use the GDELT Project to predict changes in the stock market. A cool idea? Definitely. An overly optimistic waste of time? Probably.

My initial thought was that, for each day in a given period, we need to get all published articles related to a specific company, run some sentiment analysis, take an average sentiment score, and use that to predict the movement of the company's stock price.

The first step was to understand how to get raw article content. In retrospect, there are probably copyright issues involved, which is likely why I couldn't figure out how to get this data. Some more digging led me to the GDELT Full Text Search API (which, ironically, doesn't provide full text). More importantly, it comes with a "sentiment timeline" where the sentiment analysis has already been done for us, exposed as a preprocessed "Average Tone". So we know the data exists; now the task is to write some code to get it.

Easier said than done. Some of the sample code in the GDELT articles no longer works. Like any true developer, I went hunting for a suitable Python package in the hope that someone had done the work for me. That's when I found gdeltdoc. Compared to the other GDELT packages, this one had a release within the last two years, so I had high hopes. The docs also show that you can request "timelinetone" and that it's rather simple to do.

from gdeltdoc import GdeltDoc, Filters

# Search filter: every article matching the keyword between these dates
f = Filters(
    keyword="Unilever",
    start_date="2020-05-10",
    end_date="2024-12-25"
)

gd = GdeltDoc()

# Matching articles (metadata only, no full text) and the daily
# "Average Tone" timeline that GDELT has already computed
articles = gd.article_search(f)
timeline = gd.timeline_search("timelinetone", f)
timeline
      datetime                   Average Tone
0     2020-05-10 00:00:00+00:00        0.2913
1     2020-05-11 00:00:00+00:00        0.3398
2     2020-05-12 00:00:00+00:00        0.9365
3     2020-05-13 00:00:00+00:00        1.0953
4     2020-05-14 00:00:00+00:00       -0.0733
...                         ...           ...
1684  2024-12-21 00:00:00+00:00       -0.2121
1685  2024-12-22 00:00:00+00:00        0.6575
1686  2024-12-23 00:00:00+00:00        1.7305
1687  2024-12-24 00:00:00+00:00        0.7071
1688  2024-12-25 00:00:00+00:00        1.1904

1689 rows × 2 columns

Happy days. We can retrieve data going back to 2017-01-01, which should be sufficient to test our strategy.
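Worth noting that, as far as I can tell, gdeltdoc is a thin wrapper around the GDELT DOC 2.0 API, so the same timeline can be pulled with a plain HTTP request if you'd rather skip the dependency. A minimal sketch, with the parameter names taken from the public API docs; I haven't pinned down the exact response layout, so treat the JSON parsing as an assumption:

import requests
import pandas as pd

# Rough equivalent of the gdeltdoc call above, hitting the DOC 2.0 API directly
params = {
    "query": "Unilever",
    "mode": "timelinetone",
    "format": "json",
    "startdatetime": "20200510000000",
    "enddatetime": "20241225000000",
}
resp = requests.get("https://api.gdeltproject.org/api/v2/doc/doc", params=params, timeout=30)
resp.raise_for_status()
payload = resp.json()

# Assumed response shape: a "timeline" list of series, each with per-day
# points under "data" -- inspect the raw JSON before relying on this
points = payload["timeline"][0]["data"]
tone = pd.DataFrame(points)
print(tone.head())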

The next step is gathering financial data. We chose Yahoo Finance and its associated Python package, yfinance. It's free, simple to use, and well supported. For the multi-year history we need, it only goes down to an interval of one day; ideally we'd want data every 15 minutes, but that starts to get pretty expensive (more on that after the output below).

import yfinance as yf

# Daily OHLCV for Unilever's NYSE listing (ticker UL)
stock_data = yf.download("UL", start="2020-05-11", end="2024-12-25", multi_level_index=False)

# Open-to-close move for each trading day
stock_data['Price Change'] = stock_data['Close'] - stock_data['Open']
stock_data.head()
                Close       High        Low       Open   Volume  Price Change
Date
2020-05-11  44.141987  44.344937  43.490850  43.609238  1627300      0.532749
2020-05-12  43.871391  44.539438  43.871391  44.361856  1021200     -0.490465
2020-05-13  43.786819  44.311114  43.566957  44.116615  1065000     -0.329796
2020-05-14  43.492542  43.586365  43.006351  43.356068  1302100      0.136474
2020-05-15  43.680206  43.680206  43.211078  43.245197  1532800      0.435009
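On the 15-minute point: yfinance does expose intraday intervals, but as far as I know only for a short recent window (roughly the last couple of months for 15-minute bars), which is why multi-year intraday history means paying a data provider. The call itself would look something like this:

import yfinance as yf

# Intraday bars are free from Yahoo but only for a limited recent window
# (roughly 60 days for 15-minute data), so this can't cover 2020-2024
intraday = yf.download("UL", period="60d", interval="15m")
print(intraday.head())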

Then we just merge everything together on the date and we have our data. An initial correlation between "Average Tone" and the daily change in stock price suggests this probably isn't a good strategy (r = 0.03, p = 0.83). However, I suspect this isn't the right way to test the theory, so I probably need to do some backtesting.
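For reference, here's a minimal sketch of that merge-and-correlate step, assuming the timeline and stock_data frames from above (the column names match the outputs shown, but the timezone handling is an assumption and may need tweaking):

import pandas as pd
from scipy.stats import pearsonr

# The GDELT timeline has a tz-aware "datetime" column while yfinance
# indexes by naive dates, so strip the timezone and merge on calendar date
tone_daily = timeline.copy()
tone_daily["Date"] = tone_daily["datetime"].dt.tz_localize(None).dt.normalize()

merged = stock_data.reset_index().merge(
    tone_daily[["Date", "Average Tone"]], on="Date", how="inner"
)

# Pearson correlation between same-day tone and the open-to-close move
r, p = pearsonr(merged["Average Tone"], merged["Price Change"])
print(f"r = {r:.2f}, p = {p:.2f}")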