
Strucai's Data Exploration: NLP with Dow Jones Industrial Average (DJIA)

What is the data about?

nlp_notebook

Measure the correlation between the Dow Jones Industrial Average (DJIA) and the Top25 news headlines.

The stock data is the Dow Jones Industrial Average (DJIA): https://finance.yahoo.com/quote/%5EDJI/history?p=%5EDJI

The Top25 news columns work as a ranking of global news from the Reddit WorldNews Channel (/r/worldnews), ranked by reddit users' votes on a single date: https://www.reddit.com/r/worldnews?hl

"1" when DJIA Adj Close value rise or stays the same

"0" when DJIA Adj Close value decrease.

Use data from 2008-08-08 to 2014-12-31 as training data; for test data, use 2015-01-02 to 2016-07-01. The split is roughly 80/20.

The evaluation metric is AUC. A rough sketch of the label definition and the metric is given below.
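As a hypothetical sketch (using the DJIA_table.csv that is loaded later in this notebook, and assuming the label compares each day's Adj Close with the previous trading day), the label and the metric could be computed like this:

# Hypothetical sketch: derive the (0, 1) label from 'Adj Close' and score with AUC.
# Not part of the original pipeline; the Combined_News_DJIA.csv used below already contains a Label column.
import pandas as pd
from sklearn.metrics import roc_auc_score

djia = pd.read_csv('../stocknews/DJIA_table.csv').sort_values('Date')
# 1 when 'Adj Close' rises or stays the same relative to the previous trading day, else 0.
label = (djia['Adj Close'].diff() >= 0).astype(int).iloc[1:]  # drop the first day, which has no predecessor

# AUC compares predicted probabilities against the true labels, e.g.:
# auc = roc_auc_score(true_labels, predicted_probabilities)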

In [1]:
# nlp_notebook.ipynb
#  Assumes python vers. 3.6
# __author__ = 'mizio'

import csv as csv
import numpy as np
import pandas as pd
import pylab as plt
from wordcloud import WordCloud, STOPWORDS
from datetime import date
import datetime
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sys import path
import os
from os.path import dirname, abspath
# Add the project's machine_learning_modules directory to the import path.
path_of_doc = dirname(dirname(abspath(os.path.split(os.getcwd())[0])))
path_to_ml_lib = path_of_doc + '/machine_learning_modules/'
if path_to_ml_lib not in path:
    path.append(path_to_ml_lib)

# Project helper modules; confusion_matrix_mod, multipage and xgboost used later come from here.
from datamunging.explore import *
from machine_learning_estimators.tensorflow_NN import *
In [2]:
# Load data
df = pd.read_csv('../stocknews/Combined_News_DJIA.csv', header=0)
df_train = df[df.Date < '2015-01-01']
df_test = df[df.Date > '2015-01-02']
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1989 entries, 0 to 1988
Data columns (total 27 columns):
Date     1989 non-null object
Label    1989 non-null int64
Top1     1989 non-null object
Top2     1989 non-null object
Top3     1989 non-null object
Top4     1989 non-null object
Top5     1989 non-null object
Top6     1989 non-null object
Top7     1989 non-null object
Top8     1989 non-null object
Top9     1989 non-null object
Top10    1989 non-null object
Top11    1989 non-null object
Top12    1989 non-null object
Top13    1989 non-null object
Top14    1989 non-null object
Top15    1989 non-null object
Top16    1989 non-null object
Top17    1989 non-null object
Top18    1989 non-null object
Top19    1989 non-null object
Top20    1989 non-null object
Top21    1989 non-null object
Top22    1989 non-null object
Top23    1988 non-null object
Top24    1986 non-null object
Top25    1986 non-null object
dtypes: int64(1), object(26)
memory usage: 419.6+ KB
In [4]:
# Drop rows containing null values
df = df.dropna()
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1986 entries, 0 to 1988
Data columns (total 27 columns):
Date     1986 non-null object
Label    1986 non-null int64
Top1     1986 non-null object
Top2     1986 non-null object
Top3     1986 non-null object
Top4     1986 non-null object
Top5     1986 non-null object
Top6     1986 non-null object
Top7     1986 non-null object
Top8     1986 non-null object
Top9     1986 non-null object
Top10    1986 non-null object
Top11    1986 non-null object
Top12    1986 non-null object
Top13    1986 non-null object
Top14    1986 non-null object
Top15    1986 non-null object
Top16    1986 non-null object
Top17    1986 non-null object
Top18    1986 non-null object
Top19    1986 non-null object
Top20    1986 non-null object
Top21    1986 non-null object
Top22    1986 non-null object
Top23    1986 non-null object
Top24    1986 non-null object
Top25    1986 non-null object
dtypes: int64(1), object(26)
memory usage: 434.4+ KB
In [6]:
df.head()
Out[6]:
Date Label Top1 Top2 Top3 Top4 Top5 Top6 Top7 Top8 ... Top16 Top17 Top18 Top19 Top20 Top21 Top22 Top23 Top24 Top25
0 2008-08-08 0 b"Georgia 'downs two Russian warplanes' as cou... b'BREAKING: Musharraf to be impeached.' b'Russia Today: Columns of troops roll into So... b'Russian tanks are moving towards the capital... b"Afghan children raped with 'impunity,' U.N. ... b'150 Russian tanks have entered South Ossetia... b"Breaking: Georgia invades South Ossetia, Rus... b"The 'enemy combatent' trials are nothing but... ... b'Georgia Invades South Ossetia - if Russia ge... b'Al-Qaeda Faces Islamist Backlash' b'Condoleezza Rice: "The US would not act to p... b'This is a busy day: The European Union has ... b"Georgia will withdraw 1,000 soldiers from Ir... b'Why the Pentagon Thinks Attacking Iran is a ... b'Caucasus in crisis: Georgia invades South Os... b'Indian shoe manufactory - And again in a se... b'Visitors Suffering from Mental Illnesses Ban... b"No Help for Mexico's Kidnapping Surge"
1 2008-08-11 1 b'Why wont America and Nato help us? If they w... b'Bush puts foot down on Georgian conflict' b"Jewish Georgian minister: Thanks to Israeli ... b'Georgian army flees in disarray as Russians ... b"Olympic opening ceremony fireworks 'faked'" b'What were the Mossad with fraudulent New Zea... b'Russia angered by Israeli military sale to G... b'An American citizen living in S.Ossetia blam... ... b'Israel and the US behind the Georgian aggres... b'"Do not believe TV, neither Russian nor Geor... b'Riots are still going on in Montreal (Canada... b'China to overtake US as largest manufacturer' b'War in South Ossetia [PICS]' b'Israeli Physicians Group Condemns State Tort... b' Russia has just beaten the United States ov... b'Perhaps *the* question about the Georgia - R... b'Russia is so much better at war' b"So this is what it's come to: trading sex fo...
2 2008-08-12 0 b'Remember that adorable 9-year-old who sang a... b"Russia 'ends Georgia operation'" b'"If we had no sexual harassment we would hav... b"Al-Qa'eda is losing support in Iraq because ... b'Ceasefire in Georgia: Putin Outmaneuvers the... b'Why Microsoft and Intel tried to kill the XO... b'Stratfor: The Russo-Georgian War and the Bal... b"I'm Trying to Get a Sense of This Whole Geor... ... b'U.S. troops still in Georgia (did you know t... b'Why Russias response to Georgia was right' b'Gorbachev accuses U.S. of making a "serious ... b'Russia, Georgia, and NATO: Cold War Two' b'Remember that adorable 62-year-old who led y... b'War in Georgia: The Israeli connection' b'All signs point to the US encouraging Georgi... b'Christopher King argues that the US and NATO... b'America: The New Mexico?' b"BBC NEWS | Asia-Pacific | Extinction 'by man...
3 2008-08-13 0 b' U.S. refuses Israel weapons to attack Iran:... b"When the president ordered to attack Tskhinv... b' Israel clears troops who killed Reuters cam... b'Britain\'s policy of being tough on drugs is... b'Body of 14 year old found in trunk; Latest (... b'China has moved 10 *million* quake survivors... b"Bush announces Operation Get All Up In Russi... b'Russian forces sink Georgian ships ' ... b'Elephants extinct by 2020?' b'US humanitarian missions soon in Georgia - i... b"Georgia's DDOS came from US sources" b'Russian convoy heads into Georgia, violating... b'Israeli defence minister: US against strike ... b'Gorbachev: We Had No Choice' b'Witness: Russian forces head towards Tbilisi... b' Quarter of Russians blame U.S. for conflict... b'Georgian president says US military will ta... b'2006: Nobel laureate Aleksander Solzhenitsyn...
4 2008-08-14 1 b'All the experts admit that we should legalis... b'War in South Osetia - 89 pictures made by a ... b'Swedish wrestler Ara Abrahamian throws away ... b'Russia exaggerated the death toll in South O... b'Missile That Killed 9 Inside Pakistan May Ha... b"Rushdie Condemns Random House's Refusal to P... b'Poland and US agree to missle defense deal. ... b'Will the Russians conquer Tblisi? Bet on it,... ... b'Bank analyst forecast Georgian crisis 2 days... b"Georgia confict could set back Russia's US r... b'War in the Caucasus is as much the product o... b'"Non-media" photos of South Ossetia/Georgia ... b'Georgian TV reporter shot by Russian sniper ... b'Saudi Arabia: Mother moves to block child ma... b'Taliban wages war on humanitarian aid workers' b'Russia: World "can forget about" Georgia\'s... b'Darfur rebels accuse Sudan of mounting major... b'Philippines : Peace Advocate say Muslims nee...

5 rows × 27 columns

Observe that the Label column with values (0, 1) defines a two-class classification problem. The news headlines are displayed in columns Top1 through Top25 and work as a ranking of global news. The problem is to identify which rows hold information that can be connected to the stock price.

In [7]:
# Visualize whether the news stream matches only trading days.
# Set the start point to 2008-08-08 and convert all dates to day counts.

start_date = datetime.datetime.strptime("2008-08-08", "%Y-%m-%d")
chronological_days = df['Date'].apply(lambda x: (datetime.datetime.strptime(x, "%Y-%m-%d") - start_date).days)
# print(chronological_days)

plt.figure()
plt.plot(chronological_days, ".")
plt.plot(chronological_days.index)
plt.show()
plt.close()

This shows that the row index counts more slowly than the elapsed days, which is to be expected. For example, the first 5 trading days get enumerated 1-5, but the weekend causes the following Monday (day 8) to get the index value 6.
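A quick check of the gaps between consecutive entries makes this explicit (a small illustrative snippet, not in the original notebook):

# Gaps between consecutive day numbers: 1 for consecutive weekdays,
# 3 when a weekend lies in between, larger values around market holidays.
print(chronological_days.diff().value_counts())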

In [8]:
df_djia = pd.read_csv('../stocknews/DJIA_table.csv', header=0)
In [9]:
df_djia.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1989 entries, 0 to 1988
Data columns (total 7 columns):
Date         1989 non-null object
Open         1989 non-null float64
High         1989 non-null float64
Low          1989 non-null float64
Close        1989 non-null float64
Volume       1989 non-null int64
Adj Close    1989 non-null float64
dtypes: float64(5), int64(1), object(1)
memory usage: 108.9+ KB
In [10]:
df_djia.head(2)
Out[10]:
Date Open High Low Close Volume Adj Close
0 2016-07-01 17924.240234 18002.380859 17916.910156 17949.369141 82160000 17949.369141
1 2016-06-30 17712.759766 17930.609375 17711.800781 17929.990234 133030000 17929.990234

Prepare the data such that all words in a row end up in a list of single-word elements

In [11]:
df_train.head(1)
Out[11]:
Date Label Top1 Top2 Top3 Top4 Top5 Top6 Top7 Top8 ... Top16 Top17 Top18 Top19 Top20 Top21 Top22 Top23 Top24 Top25
0 2008-08-08 0 b"Georgia 'downs two Russian warplanes' as cou... b'BREAKING: Musharraf to be impeached.' b'Russia Today: Columns of troops roll into So... b'Russian tanks are moving towards the capital... b"Afghan children raped with 'impunity,' U.N. ... b'150 Russian tanks have entered South Ossetia... b"Breaking: Georgia invades South Ossetia, Rus... b"The 'enemy combatent' trials are nothing but... ... b'Georgia Invades South Ossetia - if Russia ge... b'Al-Qaeda Faces Islamist Backlash' b'Condoleezza Rice: "The US would not act to p... b'This is a busy day: The European Union has ... b"Georgia will withdraw 1,000 soldiers from Ir... b'Why the Pentagon Thinks Attacking Iran is a ... b'Caucasus in crisis: Georgia invades South Os... b'Indian shoe manufactory - And again in a se... b'Visitors Suffering from Mental Illnesses Ban... b"No Help for Mexico's Kidnapping Surge"

1 rows × 27 columns

In [12]:
# Print Top news headline
ranking_news = 4
news_index = 1 + ranking_news 
example_top_news = df_train.iloc[0, news_index]
print(example_top_news)
print(example_top_news[2:-1])
b'Russian tanks are moving towards the capital of South Ossetia, which has reportedly been completely destroyed by Georgian artillery fire'
Russian tanks are moving towards the capital of South Ossetia, which has reportedly been completely destroyed by Georgian artillery fire
In [13]:
example_top_news.lower()
print(example_top_news.lower())
b'russian tanks are moving towards the capital of south ossetia, which has reportedly been completely destroyed by georgian artillery fire'

Clean the phrase of abbreviations, punctuation, and other non-relevant parts

In [14]:
headline_words_as_vector = CountVectorizer().build_tokenizer()(example_top_news.lower())
print(CountVectorizer().build_tokenizer()(example_top_news.lower()))
['russian', 'tanks', 'are', 'moving', 'towards', 'the', 'capital', 'of', 'south', 'ossetia', 'which', 'has', 'reportedly', 'been', 'completely', 'destroyed', 'by', 'georgian', 'artillery', 'fire']

Build a new dataframe with the words and their corresponding counts

In [15]:
pd.DataFrame([[x, headline_words_as_vector.count(x)] for x in set(headline_words_as_vector)], columns=["word", "word_count"])
Out[15]:
word word_count
0 tanks 1
1 russian 1
2 which 1
3 destroyed 1
4 ossetia 1
5 has 1
6 moving 1
7 capital 1
8 south 1
9 georgian 1
10 towards 1
11 the 1
12 reportedly 1
13 been 1
14 completely 1
15 are 1
16 of 1
17 by 1
18 artillery 1
19 fire 1

Instead of taking only one news headline, append all 25 news headlines into one list of words and count them

In [16]:
all_headline_words_as_vector = ''
for ranking_news in range(1,26):
    news_index = 1 + ranking_news
    top_news = df_train.iloc[0, news_index]
    all_headline_words_as_vector = ' '.join([all_headline_words_as_vector,top_news[2:-1]])

print(all_headline_words_as_vector)
all_headline_words_as_vector = CountVectorizer().build_tokenizer()(all_headline_words_as_vector.lower())
 Georgia 'downs two Russian warplanes' as countries move to brink of war BREAKING: Musharraf to be impeached. Russia Today: Columns of troops roll into South Ossetia; footage from fighting (YouTube) Russian tanks are moving towards the capital of South Ossetia, which has reportedly been completely destroyed by Georgian artillery fire Afghan children raped with 'impunity,' U.N. official says - this is sick, a three year old was raped and they do nothing 150 Russian tanks have entered South Ossetia whilst Georgia shoots down two Russian jets. Breaking: Georgia invades South Ossetia, Russia warned it would intervene on SO's side The 'enemy combatent' trials are nothing but a sham: Salim Haman has been sentenced to 5 1/2 years, but will be kept longer anyway just because they feel like it. Georgian troops retreat from S. Osettain capital, presumably leaving several hundred people killed. [VIDEO] Did the U.S. Prep Georgia for War with Russia? Rice Gives Green Light for Israel to Attack Iran: Says U.S. has no veto over Israeli military ops Announcing:Class Action Lawsuit on Behalf of American Public Against the FBI So---Russia and Georgia are at war and the NYT's top story is opening ceremonies of the Olympics?  What a fucking disgrace and yet further proof of the decline of journalism. China tells Bush to stay out of other countries' affairs Did World War III start today? Georgia Invades South Ossetia - if Russia gets involved, will NATO absorb Georgia and unleash a full scale war? Al-Qaeda Faces Islamist Backlash Condoleezza Rice: "The US would not act to prevent an Israeli strike on Iran." Israeli Defense Minister Ehud Barak: "Israel is prepared for uncompromising victory in the case of military hostilities." This is a busy day:  The European Union has approved new sanctions against Iran in protest at its nuclear programme. Georgia will withdraw 1,000 soldiers from Iraq to help fight off Russian forces in Georgia's breakaway region of South Ossetia Why the Pentagon Thinks Attacking Iran is a Bad Idea - US News &amp; World Report Caucasus in crisis: Georgia invades South Ossetia Indian shoe manufactory  - And again in a series of "you do not like your work?" Visitors Suffering from Mental Illnesses Banned from Olympics No Help for Mexico's Kidnapping Surge
In [17]:
pd.DataFrame([[x, all_headline_words_as_vector.count(x)] for x in set(all_headline_words_as_vector)], columns=["word", "word_count"])
Out[17]:
word word_count
0 region 1
1 hundred 1
2 crisis 1
3 raped 2
4 work 1
5 fire 1
6 strike 1
7 over 1
8 what 1
9 idea 1
10 nuclear 1
11 invades 3
12 start 1
13 trials 1
14 its 1
15 against 2
16 two 2
17 towards 1
18 the 11
19 as 1
20 from 5
21 osettain 1
22 was 1
23 iran 4
24 story 1
25 why 1
26 mental 1
27 youtube 1
28 have 1
29 for 4
... ... ...
201 protest 1
202 out 1
203 al 1
204 european 1
205 official 1
206 further 1
207 longer 1
208 green 1
209 impunity 1
210 completely 1
211 opening 1
212 other 1
213 sick 1
214 american 1
215 killed 1
216 old 1
217 yet 1
218 union 1
219 year 1
220 israel 2
221 your 1
222 troops 2
223 nato 1
224 military 2
225 top 1
226 veto 1
227 feel 1
228 manufactory 1
229 downs 1
230 roll 1

231 rows × 2 columns

In [18]:
def prepare_data(df):
    # For each row, join the 25 news headlines into one string,
    # stripping the leading b'...' byte-literal markers from every headline.
    training_data_rows = []
    for row in range(0, df.shape[0]):
        all_headline_words_as_vector = ''

        for ranking_news in range(1, 26):
            news_index = 1 + ranking_news
            top_news = df.iloc[row, news_index]
            all_headline_words_as_vector = ' '.join([all_headline_words_as_vector, str(top_news)[2:-1]])
        training_data_rows.append(all_headline_words_as_vector)
    return training_data_rows
In [19]:
training_data_rows = prepare_data(df_train)
In [20]:
print(len(training_data_rows))
1611

Create a count column for each word that appears

In [21]:
count_vectorizer = CountVectorizer()
training_data_transformed = count_vectorizer.fit_transform(training_data_rows)
print(training_data_transformed.shape)
(1611, 37014)

A quick comparison of sizes: if each row contributed 230 new words, the total vocabulary would be 230 * 1611 ≈ 370,000, but we expect fewer since not all words will be new.
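This can be checked directly against the fitted vectorizer (a small verification snippet, not in the original notebook):

# Naive upper bound vs. the actual vocabulary size of the fitted CountVectorizer.
print(230 * 1611)                                 # 370530 if every row contributed only new words
print(len(count_vectorizer.get_feature_names()))  # 37014 distinct words actually found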

In [22]:
type(training_data_transformed)
Out[22]:
scipy.sparse.csr.csr_matrix
In [23]:
print(training_data_transformed)
print(count_vectorizer.get_feature_names()[1000:1400])
  (0, 32206)	1
  (0, 18116)	1
  (0, 20660)	1
  (0, 3818)	1
  (0, 16129)	1
  (0, 20520)	1
  (0, 31995)	1
  (0, 35521)	1
  (0, 36406)	1
  (0, 36794)	1
  (0, 36783)	1
  (0, 29614)	1
  (0, 1787)	1
  (0, 19999)	1
  (0, 29982)	1
  (0, 16465)	1
  (0, 8228)	1
  (0, 6061)	1
  (0, 27619)	1
  (0, 2299)	1
  (0, 22252)	1
  (0, 15974)	1
  (0, 3649)	1
  (0, 3288)	1
  (0, 33105)	1
  :	:
  (1610, 23476)	1
  (1610, 3224)	2
  (1610, 22590)	2
  (1610, 19088)	2
  (1610, 13000)	3
  (1610, 18133)	2
  (1610, 19420)	1
  (1610, 23389)	3
  (1610, 17335)	1
  (1610, 14960)	1
  (1610, 2383)	1
  (1610, 35833)	1
  (1610, 36688)	1
  (1610, 17240)	1
  (1610, 29069)	2
  (1610, 36303)	1
  (1610, 6503)	1
  (1610, 5571)	3
  (1610, 14911)	1
  (1610, 33021)	16
  (1610, 2858)	1
  (1610, 17023)	1
  (1610, 23210)	11
  (1610, 33381)	10
  (1610, 3030)	6
['763', '767th', '76m', '77', '770', '772', '777', '77th', '78', '780', '785', '787', '789', '79', '793', '796', '799', '7b', '7bn', '7m', '7million', '7p', '7pm', '7th', '7x', '7yrs', '80', '800', '8000', '800km', '800m', '800s', '800th', '802', '806', '80beats', '80bn', '80m', '80s', '80th', '81', '813', '82', '825', '827', '829', '82k', '82m', '83', '831', '832', '835', '837', '838', '83t', '84', '840m', '85', '850', '850m', '855', '857', '85mill', '85th', '85tn', '86', '862', '87', '870', '871', '872', '88', '882m', '885', '886k', '887', '888', '88bn', '88t', '88th', '89', '890', '8900', '893', '894', '895', '89m', '8bn', '8f', '8gb', '8k', '8kg', '8m', '8th', '8yo', '90', '900', '900k', '900m', '902', '908', '90k', '90kg', '90m', '90mph', '90s', '90th', '90x', '91', '910', '911', '91m', '92', '920', '922', '93', '930', '93k', '94', '940', '94th', '95', '950', '9500', '95th', '96', '960', '97', '979', '98', '980', '99', '999', '99x', '9b', '9bn', '9f', '9m', '9th', '__', 'a30', 'a320', 'a330', 'a340s', 'a350', 'a380', 'a380s', 'a447', 'aa', 'aaa', 'aaaw', 'aabo', 'aadmi', 'aafia', 'aali', 'aamer', 'aamir', 'aan', 'aap', 'aaron', 'ab', 'ababa', 'aback', 'abada', 'aban', 'abandon', 'abandoned', 'abandoning', 'abandonment', 'abandons', 'abattoir', 'abay', 'abaya', 'abba', 'abbas', 'abbasi', 'abbe', 'abbey', 'abbot', 'abbott', 'abbottaba', 'abbottabad', 'abbotts', 'abbreviation', 'abby', 'abc', 'abdel', 'abdelbaset', 'abdicate', 'abdicated', 'abdicates', 'abdicating', 'abdicatio', 'abdishakur', 'abdolfattah', 'abdolmalek', 'abdomen', 'abdomens', 'abdominal', 'abduct', 'abducted', 'abducting', 'abductio', 'abduction', 'abductions', 'abducto', 'abductors', 'abdul', 'abdulateef', 'abdulaziz', 'abdullah', 'abdullahi', 'abe', 'abed', 'abel', 'abercrombie', 'aberdee', 'aberdeen', 'aberdeenshire', 'abergil', 'aberrant', 'aberration', 'abetic', 'abhinav', 'abhor', 'abhorrent', 'abia', 'abian', 'abic', 'abide', 'abidine', 'abiding', 'abilit', 'ability', 'abim', 'abir', 'abirds', 'abitable', 'abject', 'abkhazi', 'abkhazia', 'abkhazian', 'ablaze', 'able', 'abnormal', 'abnormalities', 'aboar', 'aboard', 'abolish', 'abolished', 'abolishes', 'abolishing', 'abolition', 'aboriginal', 'aboriginals', 'aborigine', 'aborigines', 'abort', 'aborted', 'aborting', 'abortio', 'abortion', 'abortionists', 'abortions', 'abosolve', 'aboul', 'abound', 'abounds', 'about', 'above', 'abraha', 'abraham', 'abrahamian', 'abramoff', 'abramovay', 'abrams', 'abroa', 'abroad', 'abrogating', 'abrupt', 'abruptly', 'abscond', 'absence', 'absent', 'absenteeism', 'absentees', 'absentia', 'absolute', 'absolutely', 'absorb', 'absorbed', 'absorbing', 'abstain', 'abstaine', 'abstained', 'abstainin', 'abstains', 'abstentions', 'abstinence', 'absurd', 'absurdit', 'abu', 'abubakar', 'abuja', 'abundance', 'abundant', 'abus', 'abuse', 'abused', 'abuser', 'abusers', 'abuses', 'abusing', 'abusive', 'aby', 'abydos', 'abys', 'abysmal', 'abyss', 'ac', 'academ', 'academic', 'academically', 'academics', 'academie', 'academies', 'academy', 'acadian', 'acapulc', 'acapulco', 'accelerate', 'accelerated', 'accelerates', 'accelerating', 'accelerator', 'accent', 'accept', 'acceptabl', 'acceptable', 'acceptably', 'acceptance', 'accepted', 'accepting', 'accepts', 'acces', 'access', 'accessed', 'accesses', 'accessible', 'accessing', 'accession', 'accessories', 'accessory', 'acciden', 'accident', 'accidental', 'accidentally', 'accidently', 'accidents', 'acclaim', 'acclaimed', 'acclamation', 'accolade', 'accommodating', 'accommodation', 'accommodations', 
'accompanied', 'accompanying', 'accomplice', 'accomplices', 'accomplish', 'accomplishe', 'accomplished', 'accomplishments', 'accor', 'accord', 'accordance', 'according', 'accordingly', 'accords', 'accosted', 'accoun', 'account', 'accountability', 'accountable', 'accountant', 'accounted', 'accounting', 'accounts', 'accreditation', 'accredited', 'accumulated', 'accumulating', 'accumulation', 'accuracy', 'accurate', 'accurately', 'accusation', 'accusations', 'accuse']

Put the words into a word cloud and show the most occurring words by size

In [24]:
text = " ".join(all_heads for all_heads in training_data_rows)
print(len(text))
4346827
In [25]:
wordcloud = WordCloud(stopwords=STOPWORDS,
                      background_color='white',
                      width=2500,
                      max_words=100,
                      height=2000
                     ).generate(text)
plt.figure(1,figsize=(13, 13))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
plt.close()

This is an n-gram model with n=1, since every word from each headline is treated equally and without any context from its neighboring words.

Use a simple Logistic Regression model for the two-class classification problem

In [26]:
log_reg_model = LogisticRegression()
log_reg_model = log_reg_model.fit(training_data_transformed, df_train.Label)

Prepare the test data. One minor concern is new words that appear in the test data but not in the training data. How will the trained model account for new words? It will not account for any new words; instead it will only count words corresponding to the feature columns of the training data.
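A small toy example (not part of the original notebook) illustrates this behaviour: words never seen during fit are simply ignored by transform.

# Toy example: transform drops words that are not in the fitted vocabulary.
toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(['russia invades georgia', 'georgia shoots down jets'])
print(toy_vectorizer.get_feature_names())
# ['down', 'georgia', 'invades', 'jets', 'russia', 'shoots']
print(toy_vectorizer.transform(['aliens invade georgia']).toarray())
# [[0 1 0 0 0 0]] -- only 'georgia' is counted; 'aliens' and 'invade' are unseen and dropped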

In [27]:
test_data = prepare_data(df_test)

The number of feature columns is the same for both training and test data, as shown below by the shape of the transformed test data. Here only a transform is applied to the test data, not a fit.

In [28]:
test_data_transformed = count_vectorizer.transform(test_data)
print(test_data_transformed.shape)
print(test_data_transformed)
(377, 37014)
  (0, 168)	1
  (0, 191)	1
  (0, 373)	1
  (0, 383)	1
  (0, 389)	1
  (0, 402)	1
  (0, 470)	1
  (0, 487)	1
  (0, 772)	1
  (0, 792)	1
  (0, 908)	1
  (0, 967)	1
  (0, 1212)	1
  (0, 1382)	1
  (0, 1481)	1
  (0, 1487)	1
  (0, 1496)	1
  (0, 1533)	1
  (0, 1773)	1
  (0, 1805)	1
  (0, 1912)	1
  (0, 1924)	1
  (0, 1960)	1
  (0, 2100)	1
  (0, 2137)	1
  :	:
  (376, 35401)	1
  (376, 35469)	1
  (376, 35485)	1
  (376, 35684)	1
  (376, 35757)	1
  (376, 35759)	1
  (376, 35825)	1
  (376, 35910)	1
  (376, 35960)	1
  (376, 35977)	2
  (376, 36086)	1
  (376, 36133)	1
  (376, 36156)	1
  (376, 36207)	1
  (376, 36245)	1
  (376, 36261)	1
  (376, 36303)	4
  (376, 36364)	1
  (376, 36371)	2
  (376, 36412)	1
  (376, 36426)	2
  (376, 36460)	1
  (376, 36688)	3
  (376, 36693)	1
  (376, 36783)	1
In [29]:
predict = log_reg_model.predict(test_data_transformed)

Confusion matrix showing true and false positives and negatives

In [30]:
confusion_matrix_mod(df_test.Label, predict)
save_path = '../plots/confusion_matrix_logreg'
multipage(''.join([save_path, '.pdf']))
plt.show()
plt.close()
Confusion Matrix
 [[ 56 130]
 [101  90]]
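The confusion_matrix_mod and multipage helpers come from the project modules imported at the top of the notebook. For readers without those modules, a minimal hypothetical stand-in for the confusion-matrix part could look roughly like this (not the original implementation):

# Hypothetical stand-in for confusion_matrix_mod; the real helper is imported
# from the project's own modules and may differ.
from sklearn.metrics import confusion_matrix

def confusion_matrix_sketch(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    print('Confusion Matrix\n', cm)
    plt.figure()
    plt.imshow(cm, interpolation='nearest', cmap='Blues')
    plt.title('Confusion Matrix')
    plt.colorbar()
    plt.xlabel('Predicted label')
    plt.ylabel('True label')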

Notice that the true-positive rate, 'True Pos'/'Num Pos' = 90/191 ≈ 0.47, is a pretty poor result, worse than flipping a coin. Ideally the confusion matrix should not have a majority of false negatives and false positives, as is the case in the result above with high values in the off-diagonal fields.
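The 0.47 figure can be reproduced from the printed matrix, assuming rows are actual labels and columns are predictions (the usual sklearn convention):

# Reproduce the true-positive rate from the printed confusion matrix.
cm = np.array([[56, 130],
               [101, 90]])
print(cm[1, 1] / cm[1, :].sum())  # 90 / 191 = 0.47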

Check that the top 10 coefficients used in the model match some of the words from the word cloud

In [31]:
def feature_coefficients(count_vectorizer, log_reg_model):
    feature_names = count_vectorizer.get_feature_names()
    model_coefficients = log_reg_model.coef_.tolist()[0]
    coefficients_in_dataframe = pd.DataFrame({'feature': feature_names, 'Coefficient': model_coefficients})
    coefficients_in_dataframe = coefficients_in_dataframe.sort_values(['Coefficient', 'feature'], ascending=[0, 1])
    return coefficients_in_dataframe
In [32]:
coefficients_in_dataframe = feature_coefficients(count_vectorizer, log_reg_model)
coefficients_in_dataframe.head(10)
Out[32]:
Coefficient feature
11140 0.541139 ench
18139 0.458096 kills
10838 0.437502 egypt
29490 0.430494 self
18316 0.418497 korea
23353 0.414318 olympic
10074 0.401514 doctors
19409 0.387896 london
34157 0.386206 tv
5744 0.381123 canadian
In [33]:
coefficients_in_dataframe.tail(10)
Out[33]:
Coefficient feature
9543 -0.396815 did
30702 -0.413128 society
31006 -0.415445 speech
26954 -0.433668 real
4130 -0.446348 begin
19514 -0.448687 low
8026 -0.478654 country
29686 -0.502724 sex
28929 -0.505898 sanctions
28678 -0.553007 run
Make new predictions using an xgboost model (a sketch of an equivalent setup with the public xgboost package follows the output below)
In [34]:
# prepare data for xgboost
print(type(training_data_transformed))
print(training_data_transformed.shape)
print(training_data_transformed)
print(test_data_transformed.shape)
#xgb_train_x, xgb_train_y, xgb_test = select_feats_of_testdata(df, df_test, 'Label')
<class 'scipy.sparse.csr.csr_matrix'>
(1611, 37014)
  (0, 1)	1
  (0, 169)	1
  (0, 1289)	1
  (0, 1474)	1
  (0, 1479)	1
  (0, 1708)	1
  (0, 1738)	1
  (0, 1787)	1
  (0, 1789)	2
  (0, 1995)	1
  (0, 2257)	1
  (0, 2299)	1
  (0, 2333)	1
  (0, 2383)	6
  (0, 2499)	1
  (0, 2632)	1
  (0, 2768)	1
  (0, 2858)	3
  (0, 3012)	1
  (0, 3030)	1
  (0, 3224)	2
  (0, 3283)	1
  (0, 3288)	1
  (0, 3630)	1
  (0, 3649)	1
  :	:
  (1610, 33657)	1
  (1610, 34127)	1
  (1610, 34329)	1
  (1610, 34349)	2
  (1610, 34874)	1
  (1610, 34996)	2
  (1610, 35011)	1
  (1610, 35053)	1
  (1610, 35257)	1
  (1610, 35833)	1
  (1610, 35902)	1
  (1610, 35936)	1
  (1610, 35957)	1
  (1610, 35958)	1
  (1610, 36029)	2
  (1610, 36033)	1
  (1610, 36160)	1
  (1610, 36245)	1
  (1610, 36303)	1
  (1610, 36338)	1
  (1610, 36425)	1
  (1610, 36688)	1
  (1610, 36708)	1
  (1610, 36772)	1
  (1610, 36988)	1
(377, 37014)
In [35]:
output = xgboost(training_data_transformed[:,:-2], df_train.Label, test_data_transformed[:,:-2])
[0]	train-auc:0.589665+0.0155091	test-auc:0.506561+0.0134287
[10]	train-auc:0.803336+0.0154418	test-auc:0.506833+0.0400626
[20]	train-auc:0.861032+0.0185895	test-auc:0.507453+0.026457
[30]	train-auc:0.898932+0.0119324	test-auc:0.498739+0.021967
[40]	train-auc:0.918226+0.0138356	test-auc:0.499164+0.0210628
[50]	train-auc:0.935097+0.0125667	test-auc:0.499171+0.0216895
[60]	train-auc:0.945407+0.0114606	test-auc:0.495736+0.0225118
[70]	train-auc:0.954159+0.0109332	test-auc:0.495605+0.0157706
[80]	train-auc:0.960006+0.0108295	test-auc:0.498233+0.0217966
[90]	train-auc:0.963939+0.0098956	test-auc:0.496752+0.0262607
[100]	train-auc:0.969625+0.00784811	test-auc:0.49491+0.0249992
[110]	train-auc:0.973394+0.00676391	test-auc:0.491687+0.024949
[120]	train-auc:0.976771+0.00610072	test-auc:0.496023+0.0297286
[130]	train-auc:0.979259+0.00597841	test-auc:0.496365+0.0287816
[140]	train-auc:0.981549+0.00577538	test-auc:0.495076+0.0215728
[150]	train-auc:0.98358+0.00523358	test-auc:0.495096+0.0215246
[160]	train-auc:0.9851+0.00456025	test-auc:0.498839+0.019565
[170]	train-auc:0.98672+0.00344387	test-auc:0.496095+0.020938
[180]	train-auc:0.987894+0.00363869	test-auc:0.497461+0.0222345
[190]	train-auc:0.988759+0.00333247	test-auc:0.499138+0.0217647
[200]	train-auc:0.989516+0.00293948	test-auc:0.498668+0.0245854
[210]	train-auc:0.990391+0.0022488	test-auc:0.496315+0.0228158
[220]	train-auc:0.991283+0.00219955	test-auc:0.496924+0.0202054
[230]	train-auc:0.99181+0.00219085	test-auc:0.496669+0.0202395
[240]	train-auc:0.992571+0.00222704	test-auc:0.49738+0.0187175
[250]	train-auc:0.993319+0.00207986	test-auc:0.496134+0.0167822
Ensemble-CV: 0.5266716+0.04012367064763641
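The xgboost helper used above is imported from the project's helper modules. A rough sketch of an equivalent cross-validated setup with the public xgboost package could look like this (a hypothetical substitute with illustrative parameter values, not the original helper):

# Hypothetical sketch of a cross-validated xgboost run; the original helper may differ.
import xgboost as xgb

def xgboost_sketch(train_x, train_y, test_x, num_rounds=250):
    params = {'objective': 'binary:logistic', 'eval_metric': 'auc',
              'eta': 0.1, 'max_depth': 6}
    dtrain = xgb.DMatrix(train_x, label=train_y)
    dtest = xgb.DMatrix(test_x)
    # Report train/test AUC every 10 boosting rounds, as in the output above.
    xgb.cv(params, dtrain, num_boost_round=num_rounds, nfold=5, verbose_eval=10)
    booster = xgb.train(params, dtrain, num_boost_round=num_rounds)
    return booster.predict(dtest)  # predicted probabilities for the positive class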
In [36]:
prediction_binary = (output > 0.5).astype(int)
confusion_matrix_mod(df_test.Label, prediction_binary)
save_path = '../plots/confusion_matrix_xgboost'
multipage(''.join([save_path, '.pdf']))
plt.show()
plt.close()
Confusion Matrix
 [[ 48 138]
 [ 47 144]]
In [37]:
print(training_data_transformed[:,:-2].shape), print(test_data_transformed[:,:-2].shape)
(1611, 37012)
(377, 37012)
Out[37]:
(None, None)

An n-gram model with n=2 corresponds to a model where words account for their neighboring words; this model holds all two-word combinations (bigrams).
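Applied to the example headline from earlier, the bigram tokenization looks like this (a small illustration, not in the original notebook):

# Illustration: the first few bigrams of the example headline.
bigram_analyzer = CountVectorizer(ngram_range=(2, 2)).build_analyzer()
print(bigram_analyzer(example_top_news)[:5])
# ['russian tanks', 'tanks are', 'are moving', 'moving towards', 'towards the']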

In [38]:
# n=2 vectorizer
count_vectorizer_n2 = CountVectorizer(ngram_range=(2,2))
train_data_transformed_n2 = count_vectorizer_n2.fit_transform(training_data_rows)
train_data_transformed_n2.shape
Out[38]:
(1611, 377601)

The number of feature columns has increased sharply, from 37014 to 377601, roughly a factor of 10.

In [39]:
log_reg_model_n2 = LogisticRegression()
log_reg_model_n2 = log_reg_model_n2.fit(train_data_transformed_n2, df_train.Label)
In [40]:
feature_coefficients_df = feature_coefficients(count_vectorizer_n2, log_reg_model_n2)
feature_coefficients_df.head(10)
Out[40]:
Coefficient feature
279501 0.296023 right to
25329 0.279101 and other
294026 0.267622 set to
160539 0.238123 in egypt
325868 0.229538 the first
126697 0.223554 forced to
160144 0.220719 in china
367268 0.215048 will be
128113 0.211792 found in
332357 0.209793 this is
In [41]:
feature_coefficients_df.tail(10)
Out[41]:
Coefficient feature
336985 -0.202350 to help
120940 -0.203686 fire on
225019 -0.204450 nuclear weapons
157449 -0.208993 if he
49949 -0.212255 bin laden
350876 -0.216789 up in
337273 -0.220129 to kill
32617 -0.236822 around the
331101 -0.239817 there is
325144 -0.339305 the country
In [42]:
test_data_transformed_n2 = count_vectorizer_n2.transform(test_data)
In [43]:
predict_n2 = log_reg_model_n2.predict(test_data_transformed_n2)

Confusion matrix showing true and false positives and negatives

In [44]:
confusion_matrix_mod(df_test.Label, predict_n2)
save_path = '../plots/confusion_matrix_logreg_n2'
multipage(''.join([save_path, '.pdf']))
plt.show()
plt.close()
Confusion Matrix
 [[ 65 121]
 [ 49 142]]