Stock data is from https://finance.yahoo.com/quote/DANSKE.CO
The Top1 to Top8 columns hold news headlines from Berlingske (https://www.berlingske.dk) and work as a ranking of the news; the ranking is Berlingske's own.
"1" when stock price value rise or stays the same
"0" when stock price value decrease.
Data from '2019-11-05 16:45:06' to '2020-01-17 17:05:06' is used for modelling. The train/test split is 80/20.
The evaluation metric is AUC.
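As a minimal sketch of this setup (the toy price series and the model scores below are placeholders, not data from the notebook), the labelling rule, the chronological 80/20 split and the AUC computation could look as follows:
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
# Toy price series as a stand-in for the scraped Danske Bank prices (placeholder values).
prices = pd.Series([208.0, 209.0, 208.0, 210.0, 209.0, 211.0, 210.0, 212.0, 211.0, 213.0])
# Label: 1 when the next observed price rises or stays the same, 0 when it decreases.
labels = (prices.shift(-1) >= prices).astype(int).iloc[:-1]
# Chronological 80/20 train/test split.
split = int(len(labels) * 0.8)
train_labels, test_labels = labels.iloc[:split], labels.iloc[split:]
# AUC on hypothetical model scores for the held-out rows (placeholder scores).
test_scores = np.array([0.3, 0.7])
print(roc_auc_score(test_labels, test_scores))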
# nlp_notebook.ipynb
# Assumes python vers. 3.6
# __author__ = 'mizio'
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import csv as csv
import numpy as np
import pandas as pd
import pylab as plt
from wordcloud import WordCloud,STOPWORDS
from datetime import date
import datetime
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sys import path
import os
from os.path import dirname, abspath
path_of_doc = dirname(dirname(abspath(os.path.split(os.getcwd())[0])))
path_to_ml_lib = path_of_doc + '/machine_learning_modules/'
if path_to_ml_lib not in path:
    path.append(path_to_ml_lib)
from datamunging.explore import *
from machine_learning_estimators.tensorflow_NN import *
# Load data
#file_name = 'berlingske_scraped_data2020-01-18_13:00:21.197992.csv'
#file_name = 'berlingske_scraped_data2020-01-21_21:08:45.439396.csv'
file_name = 'berlingske_scraped_data2020-01-21_22:27:41.064382.csv'
df = pd.read_csv('../CorrelationDataDanskeBankBerlingske/' + file_name, header=0)
count = df.count()[0]
df.info()
# clean for rows with null
df = df.dropna()
df.info()
df.head(-1)
#file_name_danske = 'danske_bank_scraped_data2020-01-18_13:00:21.360295.csv'
#file_name_danske = 'danske_bank_scraped_data2020-01-21_21:08:45.914529.csv'
file_name_danske = 'danske_bank_scraped_data2020-01-21_22:27:41.902021.csv'
df_djia = pd.read_csv('../CorrelationDataDanskeBankBerlingske/' + file_name_danske, header=0)
df_djia.info()
df_djia.head()
Compare the timestamps of the stock prices with those of the Berlingske news.
df.TimeStamp.head()
df_djia.TimeStamp.head()
Align timestamps of Berlingske news media with open market hours for stock prices.
def is_during_market_hours(timestamp):
    # Keep only timestamps on weekdays between 08:50 and 17:10 (open market hours with a small margin).
    current_danish_time = timestamp
    open_hour = datetime.datetime(current_danish_time.year, current_danish_time.month, current_danish_time.day, hour=8, minute=50)
    close_hour = datetime.datetime(current_danish_time.year, current_danish_time.month, current_danish_time.day, hour=17, minute=10)
    return (timestamp > open_hour) & (timestamp < close_hour) & (timestamp.weekday() < 5)
def timestamp_format_to_datetime(df):
    df.loc[:, "TimeStamp"] = df['TimeStamp'].apply(lambda x: datetime.datetime.strptime(x, "%Y-%m-%d %H:%M:%S"))
    return df
def prepare_timestamp_open_market_hours(df, start_time=None, end_time=None):
    if start_time is not None and end_time is not None:
        filtering = df["TimeStamp"].apply(lambda x: is_during_market_hours(x) & (x >= start_time) & (x <= end_time))
    else:
        filtering = df["TimeStamp"].apply(lambda x: is_during_market_hours(x))
    df.sort_values("TimeStamp", inplace=True)
    df.where(filtering, inplace=True)
    df = df.dropna()
    return df
df = timestamp_format_to_datetime(df)
df = prepare_timestamp_open_market_hours(df)
count = df.shape[0]
start_time = df.TimeStamp[df.index[0]]
end_time = df.TimeStamp[df.index[-1]]
#df = prepare_timestamp_open_market_hours(df, start_time, end_time)
training_test_ratio = 0.8
df_train = df.iloc[:int(count*training_test_ratio), :].copy()
df_test = df.iloc[int(count*training_test_ratio):, :].copy()
df_test.shape, df_train.shape
start_time, end_time
df_djia = timestamp_format_to_datetime(df_djia)
df_djia = prepare_timestamp_open_market_hours(df_djia, start_time, end_time)
df_djia.TimeStamp.head()
df.TimeStamp.head()
A start timestamp should be provided here, since the stock price data was scraped from an earlier date than the news data.
In the following cells a Label column with values (0, 1) defines a two-class classification problem. The news headlines appear in columns Top1 through Top8, which act as a ranking of the news. The task is to identify which rows hold information that can be connected to a change in the stock price.
df_djia['TimeStamp'].iloc[:10]
chronological_days = df['TimeStamp'].apply(lambda x: x.minute)
chronological_days_djia = df_djia['TimeStamp'].apply(lambda x: x.minute)
plt.figure()
plt.plot(chronological_days.values, "o")
plt.plot(chronological_days_djia.values, "+")
plt.show()
plt.close()
chronological_days = df['TimeStamp']
chronological_days_djia = df_djia['TimeStamp']
# Index the observations of each dataframe for plotting
index_df = np.arange(chronological_days.shape[0])
index_djia = np.arange(chronological_days_djia.shape[0])
plt.figure()
plt.plot(index_df, chronological_days.values, "o")
plt.plot(index_djia, chronological_days_djia.values, "+")
plt.show()
plt.close()
This shows that the dates of the financial and news data are correctly aligned, and that the data covers only open market hours, as expected.
Prepare the data such that all words in a row end up in a list of single-word elements.
df_train.head(1)
# Print Top news headline
ranking_news = 1
news_index = 1 + ranking_news
example_top_news = df_train.iloc[0, news_index]
print(example_top_news)
example_top_news.lower()
print(example_top_news.lower())
Clean the phrase for abbreviations, punctuation and other non-word parts. This could probably be optimized for the Danish language (see the sketch after the next cell).
headline_words_as_vector = CountVectorizer().build_tokenizer()(example_top_news.lower())
print(CountVectorizer().build_tokenizer()(example_top_news.lower()))
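As a hedged illustration of tailoring this step to Danish, CountVectorizer accepts a custom preprocessor and token_pattern; the small abbreviation map and the pattern below are illustrative choices, not something defined elsewhere in the notebook.
# Illustrative Danish abbreviation map; the entries are examples only.
abbreviations = {'bl.a.': 'blandt andet', 'f.eks.': 'for eksempel'}
def danish_preprocessor(text):
    # Lowercase and expand a few common Danish abbreviations before tokenizing.
    text = text.lower()
    for short, full in abbreviations.items():
        text = text.replace(short, full)
    return text
# The token pattern also keeps one-letter Danish words such as 'i' and 'å'.
danish_vectorizer = CountVectorizer(preprocessor=danish_preprocessor, token_pattern=r"(?u)\b\w+\b")
print(danish_vectorizer.build_analyzer()(example_top_news))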
Build a new dataframe with the words and their corresponding counts.
pd.DataFrame([[x, headline_words_as_vector.count(x)] for x in set(headline_words_as_vector)], columns=["word", "word_count"])
Instead of taking only one news headline, append all eight news headlines into one string of words and make a count.
headline_count = 8
all_headline_words_as_vector = ''
for ranking_news in range(1, headline_count + 1):
    news_index = 1 + ranking_news
    top_news = df_train.iloc[0, news_index]
    if ranking_news == 1:
        all_headline_words_as_vector = top_news
    else:
        all_headline_words_as_vector = ' '.join([all_headline_words_as_vector, top_news])
print(all_headline_words_as_vector)
all_headline_words_as_vector = CountVectorizer().build_tokenizer()(all_headline_words_as_vector.lower())
pd.DataFrame([[x, all_headline_words_as_vector.count(x)] for x in set(all_headline_words_as_vector)], columns=["word", "word_count"])
Notice that some words do not look Danish, which may be due to the web-scraping output format; some words may have been merged with others. One could identify this group of words and clean the data by mapping them back to their proper form, as sketched below.
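A hedged way to flag such artifacts, for example headlines merged into one overly long token; the 25-character cutoff is an arbitrary illustration, not a value used elsewhere in the notebook.
# Flag tokens that look like scraping artifacts, e.g. several words merged together.
# The length cutoff of 25 characters is an illustrative choice.
suspicious_tokens = sorted({token for token in all_headline_words_as_vector if len(token) > 25})
print(suspicious_tokens[:20])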
def prepare_data(df):
    # Concatenate the Top1-Top8 headlines of each row into one string of words.
    training_data_rows = []
    for row in range(0, df.shape[0]):
        all_headline_words_as_vector = ''
        for ranking_news in range(1, headline_count + 1):
            news_index = 1 + ranking_news
            top_news = df.iloc[row, news_index]
            if ranking_news == 1:
                all_headline_words_as_vector = str(top_news)
            else:
                all_headline_words_as_vector = ' '.join([all_headline_words_as_vector, str(top_news)])
        training_data_rows.append(all_headline_words_as_vector)
    return training_data_rows
training_data_rows = prepare_data(df_train)
test_data = prepare_data(df_test.iloc[:-1,:])
all_data_rows = []
all_data_rows.extend(test_data)
all_data_rows.extend(training_data_rows)
len(training_data_rows), len(test_data), len(all_data_rows)
Create a count column for each of the words appearing
# Define corpus for text input
corpus = all_data_rows
count_vectorizer = CountVectorizer()
# Inserts the corpus in vectorizer.
fit_matrix = count_vectorizer.fit(all_data_rows)
training_data_transformed = count_vectorizer.transform(training_data_rows)
print(training_data_transformed.shape)
#training_data_transformed[100:101]
#Todo: check that sparse matrix is correct since wordcloud shows all 2-word combination when it should only be 1-word.
dense_matrix = training_data_transformed.todense()
np.where(dense_matrix)
A quick comparison of sizes: if every row contributed 90 new words, the vocabulary would be about 90 * 4080 ≈ 367,000 words, but we expect far fewer since not all words are new, and the shape above indeed shows a vocabulary of only 8927 words.
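This can also be checked directly on the sparse matrix: nnz is the number of stored non-zero entries and vocabulary_ is the fitted word-to-column mapping (a quick sketch using the variables defined above).
# Compare the number of non-zero entries against the full dense size of the matrix.
n_rows, n_cols = training_data_transformed.shape
print('vocabulary size:', len(count_vectorizer.vocabulary_))
print('non-zero entries:', training_data_transformed.nnz)
print('fraction filled:', training_data_transformed.nnz / (n_rows * n_cols))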
training_data_rows[:3]
#type(training_data_transformed)
print(training_data_transformed)
print(count_vectorizer.get_feature_names()[1000:1400])
Put the words into a word cloud and show the most occurring words by size.
text = " ".join(all_heads for all_heads in training_data_rows)
#text = " ".join(all_heads for all_heads in training_data_rows[:4])
print(len(text))
#print(text)
wordcloud = WordCloud(stopwords=STOPWORDS,
                      background_color='white',
                      width=2500,
                      max_words=100,
                      height=2000
                      ).generate(text)
plt.figure(1,figsize=(13, 13))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
plt.close()
Two-word combinations may appear due to recurrences of exact phrases in the data, which indicates that some words always have the same neighbours. This Wordcloud feature is independent of how we set the n-gram parameter in the following cells.
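If the two-word pairs are unwanted in the cloud itself, the wordcloud library exposes a collocations flag (assuming a wordcloud version that supports it); turning it off keeps single words only, independent of the CountVectorizer settings. A sketch reusing the text variable from above:
# Disable WordCloud's built-in bigram ("collocations") detection so the cloud
# shows single words only; everything else mirrors the cell above.
wordcloud_single = WordCloud(stopwords=STOPWORDS,
                             background_color='white',
                             width=2500,
                             max_words=100,
                             height=2000,
                             collocations=False).generate(text)
plt.figure(figsize=(13, 13))
plt.imshow(wordcloud_single)
plt.axis('off')
plt.show()
plt.close()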
# The following mapping will be used for correlation modelling between Berlingske and Danske Bank stock price
# If price increase or const => 1
# If price decrease => 0
following_day_prices = pd.Series(df_djia.StockPrice[1:].values)
current_day_prices = pd.Series(df_djia.StockPrice[:-1].values)
current_day_prices.shape,following_day_prices.shape
# 1 when the next price is greater than or equal to the current price, matching the mapping above.
logical = (following_day_prices >= current_day_prices).astype(int)
Simple cheat model where the Label column is set manually, independent of the stock price movement.
if False:
    df_train["Label"] = 1
    df_train.loc[:10, "Label"] = 0
    df_test["Label"] = 1
    df_test.loc[:df_test.index[0] + 100, "Label"] = 0
logical[:10]
# correct model
#del df_train["Label"]
#del df_test["Label"]
#df_test.loc[:, :] = df_test.iloc[:-1, :]
logical[:df_train.shape[0]].values.shape, df_train.shape, logical[df_train.shape[0]:].values.shape, df_test.shape
# Difference between the number of remaining labels and the number of test rows.
diff_logical = logical[df_train.shape[0]:].values.shape[0] - df_test.shape[0]
count_df = df_train.shape[0]
df_train = df_train.iloc[:, :]
df_train["Label"] = logical[:count_df].values
# Trim df_test so its length matches the remaining labels (assumes diff_logical is negative).
df_test = df_test.iloc[:diff_logical, :]
df_test.loc[:, "Label"] = logical[df_train.shape[0]:].values
logical[:df_train.shape[0]].values.shape, df_train.shape, logical[df_train.shape[0]:].values.shape, df_test.shape
Use a simple Logistic Regression model for the two-class classification problem
log_reg_model = LogisticRegression(solver='liblinear')
#log_reg_model = LogisticRegression(solver='lbfgs')
log_reg_model = log_reg_model.fit(training_data_transformed, df_train.Label)
# test_data = prepare_data(df_test)
The number of feature columns is the same for both training and test data, as shown below by the shape of the transformed test data. Only a transform is applied to the test data, not a fit.
test_data_transformed = count_vectorizer.transform(test_data[:-1])
print(test_data_transformed.shape)
print(test_data_transformed)
predict = log_reg_model.predict(test_data_transformed)
test_data_transformed.shape
# Check and correct if shape mismatch.
#print(predict)
df_test.Label.shape, predict.shape
Confusion matrix that explores true and false positives
confusion_matrix_mod(df_test.Label, predict)
save_path = '../plots/confusion_matrix_logreg'
multipage(''.join([save_path, '.pdf']))
plt.show()
plt.close()
Notice that the true positive rate, 'True Pos'/'Num Pos' = 0.75, may at first sight seem good but actually is not. Ideally the confusion matrix should not have a majority of false negatives and false positives, as is the case in the above result with its high values in the off-diagonal fields.
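AUC was stated as the evaluation metric at the top of the notebook but is not reported by confusion_matrix_mod; a minimal sketch of adding it here with scikit-learn's roc_auc_score, using the predicted probability of the positive class (this assumes df_test.Label contains both classes, otherwise roc_auc_score raises an error):
from sklearn.metrics import roc_auc_score
# Probability of class 1 for each test row; the column order follows log_reg_model.classes_.
predicted_probabilities = log_reg_model.predict_proba(test_data_transformed)[:, 1]
print('Logistic regression AUC:', roc_auc_score(df_test.Label, predicted_probabilities))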
Check that the top 10 coefficients used in the model match some of the words from the word cloud.
def feature_coefficients(count_vectorizer, log_reg_model):
    feature_names = count_vectorizer.get_feature_names()
    model_coefficients = log_reg_model.coef_.tolist()[0]
    coefficients_in_dataframe = pd.DataFrame({'feature': feature_names, 'Coefficient': model_coefficients})
    coefficients_in_dataframe = coefficients_in_dataframe.sort_values(['Coefficient', 'feature'], ascending=[0, 1])
    return coefficients_in_dataframe
coefficients_in_dataframe = feature_coefficients(count_vectorizer, log_reg_model)
coefficients_in_dataframe.head(10)
coefficients_in_dataframe.tail(10)
# prepare data for xgboost
#print(type(training_data_transformed))
#print(training_data_transformed.shape)
#print(training_data_transformed)
#print(test_data_transformed.shape)
#xgb_train_x, xgb_train_y, xgb_test = select_feats_of_testdata(df, df_test, 'Label')
training_data_transformed.shape, test_data_transformed.shape, df_train.Label.shape
output = xgboost(training_data_transformed.todense(), df_train.Label, test_data_transformed.todense())
prediction_binary = (output > 0.5).astype(int)
prediction_binary
confusion_matrix_mod(df_test.Label, prediction_binary)
save_path = '../plots/confusion_matrix_xgboost'
multipage(''.join([save_path, '.pdf']))
plt.show()
plt.close()
Again the off-diagonal counts are not good, with many false positives.
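The local xgboost helper apparently returns a raw score per test row (it is thresholded at 0.5 above), so the same AUC check can be applied to it; a sketch assuming output holds one score per row of the test set:
from sklearn.metrics import roc_auc_score
# 'output' holds the raw xgboost scores before the 0.5 threshold is applied.
print('XGBoost AUC:', roc_auc_score(df_test.Label, np.ravel(output)))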
print(training_data_transformed[:,:-2].shape), print(test_data_transformed[:,:-2].shape)
# n=2 vectorizer
count_vectorizer_n2 = CountVectorizer(ngram_range=(2,2))
fit_matrix_n2 = count_vectorizer_n2.fit(all_data_rows)
train_data_transformed_n2 = count_vectorizer_n2.transform(training_data_rows)
train_data_transformed_n2.shape
The number of feature columns has increased sharply from 8927 to 28548, roughly a factor of 3.
log_reg_model_n2 = LogisticRegression()
log_reg_model_n2 = log_reg_model_n2.fit(train_data_transformed_n2, df_train.Label)
feature_coefficients_df = feature_coefficients(count_vectorizer_n2, log_reg_model_n2)
feature_coefficients_df.head(10)
feature_coefficients_df.tail(10)
test_data_transformed_n2 = count_vectorizer_n2.transform(test_data[:-1])
test_data_transformed_n2.shape
predict_n2 = log_reg_model_n2.predict(test_data_transformed_n2)
Confusion matrix that explores true and false positives
confusion_matrix_mod(df_test.Label, predict_n2)
save_path = '../plots/confusion_matrix_logreg_n2'
multipage(''.join([save_path, '.pdf']))
plt.show()
plt.close()
The off-diagonal counts are still not good, with many false positives.
output = xgboost(train_data_transformed_n2.todense(), df_train.Label, test_data_transformed_n2.todense())
prediction_binary = (output > 0.5).astype(int)
confusion_matrix_mod(df_test.Label, prediction_binary)
save_path = '../plots/confusion_matrix_xgboost_n2'
multipage(''.join([save_path, '.pdf']))
plt.show()
plt.close()
prediction_binary
The models obtained are not good enough. The four models are logistic regression and XGBoost applied to 1-gram and 2-gram prepared data. Only logistic regression applied to the 1-gram transformed data may have a microscopic chance. Common to all the models is that they set all Label values to 1, which may result in good accuracy, but the number of false positives is very high, meaning the models are oversimplified. A better model could probably be trained on data where simple non-relevant words ('er', 'en', 'jeg', 'har') have been removed, even though they occur frequently in the data set. Indeed, the WordCloud shows a high occurrence of words that may appear in any phrase and are therefore not particularly connected to movements of the Danske Bank stock price. A fast approach would be to remove the top X most frequent words, assuming they are non-relevant; a sketch of this follows below.
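As a hedged sketch of that last suggestion: CountVectorizer's stop_words and max_df parameters can drop the hand-picked filler words and any word appearing in more than half of the rows. The 0.5 cutoff is an illustrative choice; the rest reuses the variables and the confusion_matrix_mod helper defined above.
# Drop the hand-picked filler words and any word occurring in more than 50% of the rows.
danish_filler_words = ['er', 'en', 'jeg', 'har']
count_vectorizer_filtered = CountVectorizer(stop_words=danish_filler_words, max_df=0.5)
count_vectorizer_filtered.fit(all_data_rows)
training_data_filtered = count_vectorizer_filtered.transform(training_data_rows)
test_data_filtered = count_vectorizer_filtered.transform(test_data[:-1])
log_reg_model_filtered = LogisticRegression(solver='liblinear')
log_reg_model_filtered = log_reg_model_filtered.fit(training_data_filtered, df_train.Label)
predict_filtered = log_reg_model_filtered.predict(test_data_filtered)
confusion_matrix_mod(df_test.Label, predict_filtered)
plt.show()
plt.close()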