GDELT + Stock Data: Sentiment Score Correlation Analysis
So the goal is to use the GDELT Project to predict changes in the stock market. A cool idea? Definitely. An overly optimistic waste of time? Probably.
My initial thought was to gather every article related to a specific company for each day in a given period, run some sentiment analysis, take an average sentiment score per day, and use that to predict the movement of the company's stock price.
The first step was to understand how to get raw article content. In retrospect, there are probably copyright issues, which is why I couldn't figure out how to get this data. Some more digging led me to the GDELT Full Text Search API (which, ironically, doesn't provide full text). More importantly, it comes with a "sentiment timeline" where the sentiment analysis has already been done for us, in the form of a preprocessed "Average Tone" series. So we know the data exists; now the task is to write some code to get it.
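Before reaching for any tooling, it's worth seeing what the raw request looks like. Here's a minimal sketch that queries the DOC 2.0 endpoint directly with `requests`; the endpoint and parameter names come from GDELT's documentation, but the response layout noted in the comment is my assumption and worth verifying:

```python
import requests

# Query the GDELT DOC 2.0 API directly for the daily "Average Tone" timeline.
resp = requests.get(
    "https://api.gdeltproject.org/api/v2/doc/doc",
    params={
        "query": "Unilever",
        "mode": "timelinetone",             # preprocessed sentiment (Average Tone) per day
        "format": "json",
        "startdatetime": "20200510000000",  # YYYYMMDDHHMMSS
        "enddatetime": "20241225000000",
    },
    timeout=30,
)
resp.raise_for_status()

# Assumed shape: {"timeline": [{"series": "Average Tone", "data": [{"date": ..., "value": ...}, ...]}]}
tone_points = resp.json()["timeline"][0]["data"]
print(tone_points[:3])
```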
Easier said than done. Some of the sample code in GDELT's own articles no longer works. Like any true developer, I went hunting for a suitable Python package in the hope that someone had done the work for me. That's when I found gdeltdoc. Compared to the other GDELT packages, this one had a release within the last two years, so I had high hopes. The docs also show that you can request "timelinetone" and that it's rather simple to get this data.
```python
from gdeltdoc import GdeltDoc, Filters

# All coverage mentioning "Unilever" over the period of interest
f = Filters(
    keyword="Unilever",
    start_date="2020-05-10",
    end_date="2024-12-25"
)

gd = GdeltDoc()

# Matching articles, plus the daily "Average Tone" timeline
articles = gd.article_search(f)
timeline = gd.timeline_search("timelinetone", f)
timeline
```
|      | datetime                  | Average Tone |
|------|---------------------------|--------------|
| 0    | 2020-05-10 00:00:00+00:00 | 0.2913       |
| 1    | 2020-05-11 00:00:00+00:00 | 0.3398       |
| 2    | 2020-05-12 00:00:00+00:00 | 0.9365       |
| 3    | 2020-05-13 00:00:00+00:00 | 1.0953       |
| 4    | 2020-05-14 00:00:00+00:00 | -0.0733      |
| ...  | ...                       | ...          |
| 1684 | 2024-12-21 00:00:00+00:00 | -0.2121      |
| 1685 | 2024-12-22 00:00:00+00:00 | 0.6575       |
| 1686 | 2024-12-23 00:00:00+00:00 | 1.7305       |
| 1687 | 2024-12-24 00:00:00+00:00 | 0.7071       |
| 1688 | 2024-12-25 00:00:00+00:00 | 1.1904       |

1689 rows × 2 columns
Happy days. We can retrieve data going back to 2017-01-01, which should be sufficient to test our strategy.
The next step is gathering financial data. We chose Yahoo Finance and its associated Python package, yfinance: it's free, simple to use, and well supported. The catch is granularity: over a multi-year window it only gives us daily bars. Ideally we'd like data every 15 minutes, but that starts to get pretty expensive.
```python
import yfinance as yf

# Daily OHLCV for Unilever's NYSE listing (UL), plus the close-minus-open daily move
stock_data = yf.download("UL", start="2020-05-11", end="2024-12-25", multi_level_index=False)
stock_data['Price Change'] = stock_data['Close'] - stock_data['Open']
stock_data.head()
```
| Date       | Close     | High      | Low       | Open      | Volume  | Price Change |
|------------|-----------|-----------|-----------|-----------|---------|--------------|
| 2020-05-11 | 44.141987 | 44.344937 | 43.490850 | 43.609238 | 1627300 | 0.532749     |
| 2020-05-12 | 43.871391 | 44.539438 | 43.871391 | 44.361856 | 1021200 | -0.490465    |
| 2020-05-13 | 43.786819 | 44.311114 | 43.566957 | 44.116615 | 1065000 | -0.329796    |
| 2020-05-14 | 43.492542 | 43.586365 | 43.006351 | 43.356068 | 1302100 | 0.136474     |
| 2020-05-15 | 43.680206 | 43.680206 | 43.211078 | 43.245197 | 1532800 | 0.435009     |
Then we just merge everything together and we have our data set. An initial correlation between "Average Tone" and the daily change in stock price suggests this probably isn't a good strategy (r = 0.03, p = 0.83). However, I suspect a simple correlation isn't the right way to test the theory, so the next step is probably some backtesting.
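For completeness, the merge and the correlation check look roughly like this. It's a sketch that reuses the `timeline` and `stock_data` frames from above, strips the timezone from the GDELT dates so they line up with yfinance's daily index, and leans on `scipy.stats.pearsonr`; the column names are the ones shown in the outputs above.

```python
import pandas as pd
from scipy.stats import pearsonr

# Normalise the GDELT timestamps to plain calendar dates so they match yfinance's index
tone = timeline.copy()
tone['Date'] = pd.to_datetime(tone['datetime']).dt.tz_localize(None).dt.normalize()

# Inner join keeps only days with both a tone score and a trading session
merged = stock_data.reset_index().merge(tone[['Date', 'Average Tone']], on='Date', how='inner')

# Pearson correlation between daily average tone and the same day's close-minus-open move
r, p = pearsonr(merged['Average Tone'], merged['Price Change'])
print(f"r = {r:.2f}, p = {p:.2f}")
```

Note that the inner join silently drops weekend and holiday tone scores, which is one of several reasons a single daily correlation is a blunt test.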