
Strucai's Data Exploration: NLP with Dow Jones Industrial Average (DJIA)

What is the data about?

nlp_notebook

Measure the correlation between the Dow Jones Industrial Average (DJIA) and the Top25 news headlines.

The stock data is the Dow Jones Industrial Average (DJIA): https://finance.yahoo.com/quote/%5EDJI/history?p=%5EDJI

The Top25 news columns work as a ranking of global news from the Reddit WorldNews Channel (/r/worldnews), ranked by reddit users' votes on a single date: https://www.reddit.com/r/worldnews?hl

"1" when DJIA Adj Close value rise or stays the same

"0" when DJIA Adj Close value decrease.

Use data from 2008-08-08 to 2014-12-31 as training data; for test data, use 2015-01-02 to 2016-07-01. The split is roughly 80/20.

The evaluation metric is AUC. A rough sketch of the label definition and the metric is given below.
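As a hypothetical sketch (using the DJIA_table.csv that is loaded later in this notebook, and assuming the label compares each day's Adj Close with the previous trading day), the label and the metric could be computed like this:

# Hypothetical sketch: derive the (0, 1) label from 'Adj Close' and score with AUC.
# Not part of the original pipeline; the Combined_News_DJIA.csv used below already contains a Label column.
import pandas as pd
from sklearn.metrics import roc_auc_score

djia = pd.read_csv('../stocknews/DJIA_table.csv').sort_values('Date')
# 1 when 'Adj Close' rises or stays the same relative to the previous trading day, else 0.
label = (djia['Adj Close'].diff() >= 0).astype(int).iloc[1:]  # drop the first day, which has no predecessor

# AUC compares predicted probabilities against the true labels, e.g.:
# auc = roc_auc_score(true_labels, predicted_probabilities)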

In [1]:
# nlp_notebook.ipynb
#  Assumes python vers. 3.6
# __author__ = 'mizio'

import csv as csv
import numpy as np
import pandas as pd
import pylab as plt
from wordcloud import WordCloud, STOPWORDS
from datetime import date
import datetime
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sys import path
import os
from os.path import dirname, abspath
# Add the project's machine_learning_modules directory to the import path.
path_of_doc = dirname(dirname(abspath(os.path.split(os.getcwd())[0])))
path_to_ml_lib = path_of_doc + '/machine_learning_modules/'
if path_to_ml_lib not in path:
    path.append(path_to_ml_lib)

# Project helper modules; confusion_matrix_mod, multipage and xgboost used later come from here.
from datamunging.explore import *
from machine_learning_estimators.tensorflow_NN import *
In [2]:
# Load data
df = pd.read_csv('../stocknews/Combined_News_DJIA.csv', header=0)
df_train = df[df.Date < '2015-01-01']
df_test = df[df.Date > '2015-01-02']
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1989 entries, 0 to 1988
Data columns (total 27 columns):
Date     1989 non-null object
Label    1989 non-null int64
Top1     1989 non-null object
Top2     1989 non-null object
Top3     1989 non-null object
Top4     1989 non-null object
Top5     1989 non-null object
Top6     1989 non-null object
Top7     1989 non-null object
Top8     1989 non-null object
Top9     1989 non-null object
Top10    1989 non-null object
Top11    1989 non-null object
Top12    1989 non-null object
Top13    1989 non-null object
Top14    1989 non-null object
Top15    1989 non-null object
Top16    1989 non-null object
Top17    1989 non-null object
Top18    1989 non-null object
Top19    1989 non-null object
Top20    1989 non-null object
Top21    1989 non-null object
Top22    1989 non-null object
Top23    1988 non-null object
Top24    1986 non-null object
Top25    1986 non-null object
dtypes: int64(1), object(26)
memory usage: 419.6+ KB
In [4]:
# Drop rows containing null values
df = df.dropna()
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1986 entries, 0 to 1988
Data columns (total 27 columns):
Date     1986 non-null object
Label    1986 non-null int64
Top1     1986 non-null object
Top2     1986 non-null object
Top3     1986 non-null object
Top4     1986 non-null object
Top5     1986 non-null object
Top6     1986 non-null object
Top7     1986 non-null object
Top8     1986 non-null object
Top9     1986 non-null object
Top10    1986 non-null object
Top11    1986 non-null object
Top12    1986 non-null object
Top13    1986 non-null object
Top14    1986 non-null object
Top15    1986 non-null object
Top16    1986 non-null object
Top17    1986 non-null object
Top18    1986 non-null object
Top19    1986 non-null object
Top20    1986 non-null object
Top21    1986 non-null object
Top22    1986 non-null object
Top23    1986 non-null object
Top24    1986 non-null object
Top25    1986 non-null object
dtypes: int64(1), object(26)
memory usage: 434.4+ KB
In [6]:
df.head()
Out[6]:
Date Label Top1 Top2 Top3 Top4 Top5 Top6 Top7 Top8 ... Top16 Top17 Top18 Top19 Top20 Top21 Top22 Top23 Top24 Top25
0 2008-08-08 0 b"Georgia 'downs two Russian warplanes' as cou... b'BREAKING: Musharraf to be impeached.' b'Russia Today: Columns of troops roll into So... b'Russian tanks are moving towards the capital... b"Afghan children raped with 'impunity,' U.N. ... b'150 Russian tanks have entered South Ossetia... b"Breaking: Georgia invades South Ossetia, Rus... b"The 'enemy combatent' trials are nothing but... ... b'Georgia Invades South Ossetia - if Russia ge... b'Al-Qaeda Faces Islamist Backlash' b'Condoleezza Rice: "The US would not act to p... b'This is a busy day: The European Union has ... b"Georgia will withdraw 1,000 soldiers from Ir... b'Why the Pentagon Thinks Attacking Iran is a ... b'Caucasus in crisis: Georgia invades South Os... b'Indian shoe manufactory - And again in a se... b'Visitors Suffering from Mental Illnesses Ban... b"No Help for Mexico's Kidnapping Surge"
1 2008-08-11 1 b'Why wont America and Nato help us? If they w... b'Bush puts foot down on Georgian conflict' b"Jewish Georgian minister: Thanks to Israeli ... b'Georgian army flees in disarray as Russians ... b"Olympic opening ceremony fireworks 'faked'" b'What were the Mossad with fraudulent New Zea... b'Russia angered by Israeli military sale to G... b'An American citizen living in S.Ossetia blam... ... b'Israel and the US behind the Georgian aggres... b'"Do not believe TV, neither Russian nor Geor... b'Riots are still going on in Montreal (Canada... b'China to overtake US as largest manufacturer' b'War in South Ossetia [PICS]' b'Israeli Physicians Group Condemns State Tort... b' Russia has just beaten the United States ov... b'Perhaps *the* question about the Georgia - R... b'Russia is so much better at war' b"So this is what it's come to: trading sex fo...
2 2008-08-12 0 b'Remember that adorable 9-year-old who sang a... b"Russia 'ends Georgia operation'" b'"If we had no sexual harassment we would hav... b"Al-Qa'eda is losing support in Iraq because ... b'Ceasefire in Georgia: Putin Outmaneuvers the... b'Why Microsoft and Intel tried to kill the XO... b'Stratfor: The Russo-Georgian War and the Bal... b"I'm Trying to Get a Sense of This Whole Geor... ... b'U.S. troops still in Georgia (did you know t... b'Why Russias response to Georgia was right' b'Gorbachev accuses U.S. of making a "serious ... b'Russia, Georgia, and NATO: Cold War Two' b'Remember that adorable 62-year-old who led y... b'War in Georgia: The Israeli connection' b'All signs point to the US encouraging Georgi... b'Christopher King argues that the US and NATO... b'America: The New Mexico?' b"BBC NEWS | Asia-Pacific | Extinction 'by man...
3 2008-08-13 0 b' U.S. refuses Israel weapons to attack Iran:... b"When the president ordered to attack Tskhinv... b' Israel clears troops who killed Reuters cam... b'Britain\'s policy of being tough on drugs is... b'Body of 14 year old found in trunk; Latest (... b'China has moved 10 *million* quake survivors... b"Bush announces Operation Get All Up In Russi... b'Russian forces sink Georgian ships ' ... b'Elephants extinct by 2020?' b'US humanitarian missions soon in Georgia - i... b"Georgia's DDOS came from US sources" b'Russian convoy heads into Georgia, violating... b'Israeli defence minister: US against strike ... b'Gorbachev: We Had No Choice' b'Witness: Russian forces head towards Tbilisi... b' Quarter of Russians blame U.S. for conflict... b'Georgian president says US military will ta... b'2006: Nobel laureate Aleksander Solzhenitsyn...
4 2008-08-14 1 b'All the experts admit that we should legalis... b'War in South Osetia - 89 pictures made by a ... b'Swedish wrestler Ara Abrahamian throws away ... b'Russia exaggerated the death toll in South O... b'Missile That Killed 9 Inside Pakistan May Ha... b"Rushdie Condemns Random House's Refusal to P... b'Poland and US agree to missle defense deal. ... b'Will the Russians conquer Tblisi? Bet on it,... ... b'Bank analyst forecast Georgian crisis 2 days... b"Georgia confict could set back Russia's US r... b'War in the Caucasus is as much the product o... b'"Non-media" photos of South Ossetia/Georgia ... b'Georgian TV reporter shot by Russian sniper ... b'Saudi Arabia: Mother moves to block child ma... b'Taliban wages war on humanitarian aid workers' b'Russia: World "can forget about" Georgia\'s... b'Darfur rebels accuse Sudan of mounting major... b'Philippines : Peace Advocate say Muslims nee...

5 rows × 27 columns

Observe that the Label column with values (0, 1) defines a two-class classification problem. The news headlines are displayed in columns Top1 through Top25 and work as a ranking of global news. The problem is to identify which rows hold information that can be connected to the stock price.

In [7]:
# Visualize whether the news stream matches only trading days.
# Set the start point to 2008-08-08 and convert all dates to day counts.

start_date = datetime.datetime.strptime("2008-08-08", "%Y-%m-%d")
chronological_days = df['Date'].apply(lambda x: (datetime.datetime.strptime(x, "%Y-%m-%d") - start_date).days)
# print(chronological_days)

plt.figure()
plt.plot(chronological_days, ".")
plt.plot(chronological_days.index)
plt.show()
plt.close()

This shows that the row index counts more slowly than the elapsed days, which is to be expected. For example, the first 5 trading days get enumerated 1-5, but the weekend causes the following Monday (day 8) to get the index value 6.
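A quick check of the gaps between consecutive entries makes this explicit (a small illustrative snippet, not in the original notebook):

# Gaps between consecutive day numbers: 1 for consecutive weekdays,
# 3 when a weekend lies in between, larger values around market holidays.
print(chronological_days.diff().value_counts())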

In [8]:
df_djia = pd.read_csv('../stocknews/DJIA_table.csv', header=0)
In [9]:
df_djia.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1989 entries, 0 to 1988
Data columns (total 7 columns):
Date         1989 non-null object
Open         1989 non-null float64
High         1989 non-null float64
Low          1989 non-null float64
Close        1989 non-null float64
Volume       1989 non-null int64
Adj Close    1989 non-null float64
dtypes: float64(5), int64(1), object(1)
memory usage: 108.9+ KB
In [10]:
df_djia.head(2)
Out[10]:
Date Open High Low Close Volume Adj Close
0 2016-07-01 17924.240234 18002.380859 17916.910156 17949.369141 82160000 17949.369141
1 2016-06-30 17712.759766 17930.609375 17711.800781 17929.990234 133030000 17929.990234

Prepare the data such that all words in a row end up in a list of single-word elements

In [11]:
df_train.head(1)
Out[11]:
Date Label Top1 Top2 Top3 Top4 Top5 Top6 Top7 Top8 ... Top16 Top17 Top18 Top19 Top20 Top21 Top22 Top23 Top24 Top25
0 2008-08-08 0 b"Georgia 'downs two Russian warplanes' as cou... b'BREAKING: Musharraf to be impeached.' b'Russia Today: Columns of troops roll into So... b'Russian tanks are moving towards the capital... b"Afghan children raped with 'impunity,' U.N. ... b'150 Russian tanks have entered South Ossetia... b"Breaking: Georgia invades South Ossetia, Rus... b"The 'enemy combatent' trials are nothing but... ... b'Georgia Invades South Ossetia - if Russia ge... b'Al-Qaeda Faces Islamist Backlash' b'Condoleezza Rice: "The US would not act to p... b'This is a busy day: The European Union has ... b"Georgia will withdraw 1,000 soldiers from Ir... b'Why the Pentagon Thinks Attacking Iran is a ... b'Caucasus in crisis: Georgia invades South Os... b'Indian shoe manufactory - And again in a se... b'Visitors Suffering from Mental Illnesses Ban... b"No Help for Mexico's Kidnapping Surge"

1 rows × 27 columns

In [12]:
# Print Top news headline
ranking_news = 4
news_index = 1 + ranking_news 
example_top_news = df_train.iloc[0, news_index]
print(example_top_news)
print(example_top_news[2:-1])
b'Russian tanks are moving towards the capital of South Ossetia, which has reportedly been completely destroyed by Georgian artillery fire'
Russian tanks are moving towards the capital of South Ossetia, which has reportedly been completely destroyed by Georgian artillery fire
In [13]:
example_top_news.lower()
print(example_top_news.lower())
b'russian tanks are moving towards the capital of south ossetia, which has reportedly been completely destroyed by georgian artillery fire'

Clean the phrase of abbreviations, punctuation, and other non-relevant parts

In [14]:
headline_words_as_vector = CountVectorizer().build_tokenizer()(example_top_news.lower())
print(CountVectorizer().build_tokenizer()(example_top_news.lower()))
['russian', 'tanks', 'are', 'moving', 'towards', 'the', 'capital', 'of', 'south', 'ossetia', 'which', 'has', 'reportedly', 'been', 'completely', 'destroyed', 'by', 'georgian', 'artillery', 'fire']

Build a new dataframe with the words and their corresponding counts

In [15]:
pd.DataFrame([[x, headline_words_as_vector.count(x)] for x in set(headline_words_as_vector)], columns=["word", "word_count"])
Out[15]:
word word_count
0 tanks 1
1 russian 1
2 which 1
3 destroyed 1
4 ossetia 1
5 has 1
6 moving 1
7 capital 1
8 south 1
9 georgian 1
10 towards 1
11 the 1
12 reportedly 1
13 been 1
14 completely 1
15 are 1
16 of 1
17 by 1
18 artillery 1
19 fire 1

Instead of taking only one news headline, append all 25 news headlines into one list of words and count them

In [16]:
all_headline_words_as_vector = ''
for ranking_news in range(1,26):
    news_index = 1 + ranking_news
    top_news = df_train.iloc[0, news_index]
    all_headline_words_as_vector = ' '.join([all_headline_words_as_vector,top_news[2:-1]])

print(all_headline_words_as_vector)
all_headline_words_as_vector = CountVectorizer().build_tokenizer()(all_headline_words_as_vector.lower())
 Georgia 'downs two Russian warplanes' as countries move to brink of war BREAKING: Musharraf to be impeached. Russia Today: Columns of troops roll into South Ossetia; footage from fighting (YouTube) Russian tanks are moving towards the capital of South Ossetia, which has reportedly been completely destroyed by Georgian artillery fire Afghan children raped with 'impunity,' U.N. official says - this is sick, a three year old was raped and they do nothing 150 Russian tanks have entered South Ossetia whilst Georgia shoots down two Russian jets. Breaking: Georgia invades South Ossetia, Russia warned it would intervene on SO's side The 'enemy combatent' trials are nothing but a sham: Salim Haman has been sentenced to 5 1/2 years, but will be kept longer anyway just because they feel like it. Georgian troops retreat from S. Osettain capital, presumably leaving several hundred people killed. [VIDEO] Did the U.S. Prep Georgia for War with Russia? Rice Gives Green Light for Israel to Attack Iran: Says U.S. has no veto over Israeli military ops Announcing:Class Action Lawsuit on Behalf of American Public Against the FBI So---Russia and Georgia are at war and the NYT's top story is opening ceremonies of the Olympics?  What a fucking disgrace and yet further proof of the decline of journalism. China tells Bush to stay out of other countries' affairs Did World War III start today? Georgia Invades South Ossetia - if Russia gets involved, will NATO absorb Georgia and unleash a full scale war? Al-Qaeda Faces Islamist Backlash Condoleezza Rice: "The US would not act to prevent an Israeli strike on Iran." Israeli Defense Minister Ehud Barak: "Israel is prepared for uncompromising victory in the case of military hostilities." This is a busy day:  The European Union has approved new sanctions against Iran in protest at its nuclear programme. Georgia will withdraw 1,000 soldiers from Iraq to help fight off Russian forces in Georgia's breakaway region of South Ossetia Why the Pentagon Thinks Attacking Iran is a Bad Idea - US News &amp; World Report Caucasus in crisis: Georgia invades South Ossetia Indian shoe manufactory  - And again in a series of "you do not like your work?" Visitors Suffering from Mental Illnesses Banned from Olympics No Help for Mexico's Kidnapping Surge
In [17]:
pd.DataFrame([[x, all_headline_words_as_vector.count(x)] for x in set(all_headline_words_as_vector)], columns=["word", "word_count"])
Out[17]:
word word_count
0 region 1
1 hundred 1
2 crisis 1
3 raped 2
4 work 1
5 fire 1
6 strike 1
7 over 1
8 what 1
9 idea 1
10 nuclear 1
11 invades 3
12 start 1
13 trials 1
14 its 1
15 against 2
16 two 2
17 towards 1
18 the 11
19 as 1
20 from 5
21 osettain 1
22 was 1
23 iran 4
24 story 1
25 why 1
26 mental 1
27 youtube 1
28 have 1
29 for 4
... ... ...
201 protest 1
202 out 1
203 al 1
204 european 1
205 official 1
206 further 1
207 longer 1
208 green 1
209 impunity 1
210 completely 1
211 opening 1
212 other 1
213 sick 1
214 american 1
215 killed 1
216 old 1
217 yet 1
218 union 1
219 year 1
220 israel 2
221 your 1
222 troops 2
223 nato 1
224 military 2
225 top 1
226 veto 1
227 feel 1
228 manufactory 1
229 downs 1
230 roll 1

231 rows × 2 columns

In [18]:
def prepare_data(df):
    # For each row, join the 25 news headlines into one string,
    # stripping the leading b'...' byte-literal markers from every headline.
    training_data_rows = []
    for row in range(0, df.shape[0]):
        all_headline_words_as_vector = ''

        for ranking_news in range(1, 26):
            news_index = 1 + ranking_news
            top_news = df.iloc[row, news_index]
            all_headline_words_as_vector = ' '.join([all_headline_words_as_vector, str(top_news)[2:-1]])
        training_data_rows.append(all_headline_words_as_vector)
    return training_data_rows
In [19]:
training_data_rows = prepare_data(df_train)
In [20]:
print(len(training_data_rows))
1611

Create a count column for each word that appears

In [21]:
count_vectorizer = CountVectorizer()
training_data_transformed = count_vectorizer.fit_transform(training_data_rows)
print(training_data_transformed.shape)
(1611, 37014)

A quick comparison of sizes: if each row contributed 230 new words, the total vocabulary would be 230 * 1611 ≈ 370,000, but we expect fewer since not all words will be new.
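This can be checked directly against the fitted vectorizer (a small verification snippet, not in the original notebook):

# Naive upper bound vs. the actual vocabulary size of the fitted CountVectorizer.
print(230 * 1611)                                 # 370530 if every row contributed only new words
print(len(count_vectorizer.get_feature_names()))  # 37014 distinct words actually found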

In [22]:
type(training_data_transformed)
Out[22]:
scipy.sparse.csr.csr_matrix
In [23]:
print(training_data_transformed)
print(count_vectorizer.get_feature_names()[1000:1400])
  (0, 32206)	1
  (0, 18116)	1
  (0, 20660)	1
  (0, 3818)	1
  (0, 16129)	1
  (0, 20520)	1
  (0, 31995)	1
  (0, 35521)	1
  (0, 36406)	1
  (0, 36794)	1
  (0, 36783)	1
  (0, 29614)	1
  (0, 1787)	1
  (0, 19999)	1
  (0, 29982)	1
  (0, 16465)	1
  (0, 8228)	1
  (0, 6061)	1
  (0, 27619)	1
  (0, 2299)	1
  (0, 22252)	1
  (0, 15974)	1
  (0, 3649)	1
  (0, 3288)	1
  (0, 33105)	1
  :	:
  (1610, 23476)	1
  (1610, 3224)	2
  (1610, 22590)	2
  (1610, 19088)	2
  (1610, 13000)	3
  (1610, 18133)	2
  (1610, 19420)	1
  (1610, 23389)	3
  (1610, 17335)	1
  (1610, 14960)	1
  (1610, 2383)	1
  (1610, 35833)	1
  (1610, 36688)	1
  (1610, 17240)	1
  (1610, 29069)	2
  (1610, 36303)	1
  (1610, 6503)	1
  (1610, 5571)	3
  (1610, 14911)	1
  (1610, 33021)	16
  (1610, 2858)	1
  (1610, 17023)	1
  (1610, 23210)	11
  (1610, 33381)	10
  (1610, 3030)	6
['763', '767th', '76m', '77', '770', '772', '777', '77th', '78', '780', '785', '787', '789', '79', '793', '796', '799', '7b', '7bn', '7m', '7million', '7p', '7pm', '7th', '7x', '7yrs', '80', '800', '8000', '800km', '800m', '800s', '800th', '802', '806', '80beats', '80bn', '80m', '80s', '80th', '81', '813', '82', '825', '827', '829', '82k', '82m', '83', '831', '832', '835', '837', '838', '83t', '84', '840m', '85', '850', '850m', '855', '857', '85mill', '85th', '85tn', '86', '862', '87', '870', '871', '872', '88', '882m', '885', '886k', '887', '888', '88bn', '88t', '88th', '89', '890', '8900', '893', '894', '895', '89m', '8bn', '8f', '8gb', '8k', '8kg', '8m', '8th', '8yo', '90', '900', '900k', '900m', '902', '908', '90k', '90kg', '90m', '90mph', '90s', '90th', '90x', '91', '910', '911', '91m', '92', '920', '922', '93', '930', '93k', '94', '940', '94th', '95', '950', '9500', '95th', '96', '960', '97', '979', '98', '980', '99', '999', '99x', '9b', '9bn', '9f', '9m', '9th', '__', 'a30', 'a320', 'a330', 'a340s', 'a350', 'a380', 'a380s', 'a447', 'aa', 'aaa', 'aaaw', 'aabo', 'aadmi', 'aafia', 'aali', 'aamer', 'aamir', 'aan', 'aap', 'aaron', 'ab', 'ababa', 'aback', 'abada', 'aban', 'abandon', 'abandoned', 'abandoning', 'abandonment', 'abandons', 'abattoir', 'abay', 'abaya', 'abba', 'abbas', 'abbasi', 'abbe', 'abbey', 'abbot', 'abbott', 'abbottaba', 'abbottabad', 'abbotts', 'abbreviation', 'abby', 'abc', 'abdel', 'abdelbaset', 'abdicate', 'abdicated', 'abdicates', 'abdicating', 'abdicatio', 'abdishakur', 'abdolfattah', 'abdolmalek', 'abdomen', 'abdomens', 'abdominal', 'abduct', 'abducted', 'abducting', 'abductio', 'abduction', 'abductions', 'abducto', 'abductors', 'abdul', 'abdulateef', 'abdulaziz', 'abdullah', 'abdullahi', 'abe', 'abed', 'abel', 'abercrombie', 'aberdee', 'aberdeen', 'aberdeenshire', 'abergil', 'aberrant', 'aberration', 'abetic', 'abhinav', 'abhor', 'abhorrent', 'abia', 'abian', 'abic', 'abide', 'abidine', 'abiding', 'abilit', 'ability', 'abim', 'abir', 'abirds', 'abitable', 'abject', 'abkhazi', 'abkhazia', 'abkhazian', 'ablaze', 'able', 'abnormal', 'abnormalities', 'aboar', 'aboard', 'abolish', 'abolished', 'abolishes', 'abolishing', 'abolition', 'aboriginal', 'aboriginals', 'aborigine', 'aborigines', 'abort', 'aborted', 'aborting', 'abortio', 'abortion', 'abortionists', 'abortions', 'abosolve', 'aboul', 'abound', 'abounds', 'about', 'above', 'abraha', 'abraham', 'abrahamian', 'abramoff', 'abramovay', 'abrams', 'abroa', 'abroad', 'abrogating', 'abrupt', 'abruptly', 'abscond', 'absence', 'absent', 'absenteeism', 'absentees', 'absentia', 'absolute', 'absolutely', 'absorb', 'absorbed', 'absorbing', 'abstain', 'abstaine', 'abstained', 'abstainin', 'abstains', 'abstentions', 'abstinence', 'absurd', 'absurdit', 'abu', 'abubakar', 'abuja', 'abundance', 'abundant', 'abus', 'abuse', 'abused', 'abuser', 'abusers', 'abuses', 'abusing', 'abusive', 'aby', 'abydos', 'abys', 'abysmal', 'abyss', 'ac', 'academ', 'academic', 'academically', 'academics', 'academie', 'academies', 'academy', 'acadian', 'acapulc', 'acapulco', 'accelerate', 'accelerated', 'accelerates', 'accelerating', 'accelerator', 'accent', 'accept', 'acceptabl', 'acceptable', 'acceptably', 'acceptance', 'accepted', 'accepting', 'accepts', 'acces', 'access', 'accessed', 'accesses', 'accessible', 'accessing', 'accession', 'accessories', 'accessory', 'acciden', 'accident', 'accidental', 'accidentally', 'accidently', 'accidents', 'acclaim', 'acclaimed', 'acclamation', 'accolade', 'accommodating', 'accommodation', 'accommodations', 
'accompanied', 'accompanying', 'accomplice', 'accomplices', 'accomplish', 'accomplishe', 'accomplished', 'accomplishments', 'accor', 'accord', 'accordance', 'according', 'accordingly', 'accords', 'accosted', 'accoun', 'account', 'accountability', 'accountable', 'accountant', 'accounted', 'accounting', 'accounts', 'accreditation', 'accredited', 'accumulated', 'accumulating', 'accumulation', 'accuracy', 'accurate', 'accurately', 'accusation', 'accusations', 'accuse']

Put the words into a word cloud and show the most occurring words by size

In [24]:
text = " ".join(all_heads for all_heads in training_data_rows)
print(len(text))
4346827
In [25]:
wordcloud = WordCloud(stopwords=STOPWORDS,
                      background_color='white',
                      width=2500,
                      max_words=100,
                      height=2000
                     ).generate(text)
plt.figure(1,figsize=(13, 13))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
plt.close()

This is an n-gram model with n=1, since every word from each headline is treated equally and without any context from its neighboring words.

Use a simple Logistic Regression model for the two-class classification problem

In [26]:
log_reg_model = LogisticRegression()
log_reg_model = log_reg_model.fit(training_data_transformed, df_train.Label)

Prepare the test data. One minor concern is new words that appear in the test data but not in the training data. How will the trained model account for new words? It will not account for any new words; instead it will only count words corresponding to the feature columns of the training data.
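A small toy example (not part of the original notebook) illustrates this behaviour: words never seen during fit are simply ignored by transform.

# Toy example: transform drops words that are not in the fitted vocabulary.
toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(['russia invades georgia', 'georgia shoots down jets'])
print(toy_vectorizer.get_feature_names())
# ['down', 'georgia', 'invades', 'jets', 'russia', 'shoots']
print(toy_vectorizer.transform(['aliens invade georgia']).toarray())
# [[0 1 0 0 0 0]] -- only 'georgia' is counted; 'aliens' and 'invade' are unseen and dropped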

In [27]:
test_data = prepare_data(df_test)

The number of feature columns is the same for both training and test data, as shown below by the shape of the transformed test data. Here only a transform is applied to the test data, not a fit.

In [28]:
test_data_transformed = count_vectorizer.transform(test_data)
print(test_data_transformed.shape)
print(test_data_transformed)
(377, 37014)
  (0, 168)	1
  (0, 191)	1
  (0, 373)	1
  (0, 383)	1
  (0, 389)	1
  (0, 402)	1
  (0, 470)	1
  (0, 487)	1
  (0, 772)	1
  (0, 792)	1
  (0, 908)	1
  (0, 967)	1
  (0, 1212)	1
  (0, 1382)	1
  (0, 1481)	1
  (0, 1487)	1
  (0, 1496)	1
  (0, 1533)	1
  (0, 1773)	1
  (0, 1805)	1
  (0, 1912)	1
  (0, 1924)	1
  (0, 1960)	1
  (0, 2100)	1
  (0, 2137)	1
  :	:
  (376, 35401)	1
  (376, 35469)	1
  (376, 35485)	1
  (376, 35684)	1
  (376, 35757)	1
  (376, 35759)	1
  (376, 35825)	1
  (376, 35910)	1
  (376, 35960)	1
  (376, 35977)	2
  (376, 36086)	1
  (376, 36133)	1
  (376, 36156)	1
  (376, 36207)	1
  (376, 36245)	1
  (376, 36261)	1
  (376, 36303)	4
  (376, 36364)	1
  (376, 36371)	2
  (376, 36412)	1
  (376, 36426)	2
  (376, 36460)	1
  (376, 36688)	3
  (376, 36693)	1
  (376, 36783)	1
In [29]:
predict = log_reg_model.predict(test_data_transformed)

Confusion matrix showing true and false positives and negatives

In [30]:
confusion_matrix_mod(df_test.Label, predict)
save_path = '../plots/confusion_matrix_logreg'
multipage(''.join([save_path, '.pdf']))
plt.show()
plt.close()
Confusion Matrix
 [[ 56 130]
 [101  90]]
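The confusion_matrix_mod and multipage helpers come from the project modules imported at the top of the notebook. For readers without those modules, a minimal hypothetical stand-in for the confusion-matrix part could look roughly like this (not the original implementation):

# Hypothetical stand-in for confusion_matrix_mod; the real helper is imported
# from the project's own modules and may differ.
from sklearn.metrics import confusion_matrix

def confusion_matrix_sketch(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred)
    print('Confusion Matrix\n', cm)
    plt.figure()
    plt.imshow(cm, interpolation='nearest', cmap='Blues')
    plt.title('Confusion Matrix')
    plt.colorbar()
    plt.xlabel('Predicted label')
    plt.ylabel('True label')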

Notice that the true-positive rate, 'True Pos'/'Num Pos' = 90/191 ≈ 0.47, is a pretty poor result, worse than flipping a coin. Ideally the confusion matrix should not have a majority of false negatives and false positives, as is the case in the result above with high values in the off-diagonal fields.
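The 0.47 figure can be reproduced from the printed matrix, assuming rows are actual labels and columns are predictions (the usual sklearn convention):

# Reproduce the true-positive rate from the printed confusion matrix.
cm = np.array([[56, 130],
               [101, 90]])
print(cm[1, 1] / cm[1, :].sum())  # 90 / 191 = 0.47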

Check that the top 10 coefficients used in the model match some of the words from the word cloud

In [31]:
def feature_coefficients(count_vectorizer, log_reg_model):
    feature_names = count_vectorizer.get_feature_names()
    model_coefficients = log_reg_model.coef_.tolist()[0]
    coefficients_in_dataframe = pd.DataFrame({'feature': feature_names, 'Coefficient': model_coefficients})
    coefficients_in_dataframe = coefficients_in_dataframe.sort_values(['Coefficient', 'feature'], ascending=[0, 1])
    return coefficients_in_dataframe
In [32]:
coefficients_in_dataframe = feature_coefficients(count_vectorizer, log_reg_model)
coefficients_in_dataframe.head(10)
Out[32]:
Coefficient feature
11140 0.541139 ench
18139 0.458096 kills
10838 0.437502 egypt
29490 0.430494 self
18316 0.418497 korea
23353 0.414318 olympic
10074 0.401514 doctors
19409 0.387896 london
34157 0.386206 tv
5744 0.381123 canadian
In [33]:
coefficients_in_dataframe.tail(10)
Out[33]:
Coefficient feature
9543 -0.396815 did
30702 -0.413128 society
31006 -0.415445 speech
26954 -0.433668 real
4130 -0.446348 begin
19514 -0.448687 low
8026 -0.478654 country
29686 -0.502724 sex
28929 -0.505898 sanctions
28678 -0.553007 run
Make new predictions using an xgboost model (a sketch of an equivalent setup with the public xgboost package follows the output below)
In [34]:
# prepare data for xgboost
print(type(training_data_transformed))
print(training_data_transformed.shape)
print(training_data_transformed)
print(test_data_transformed.shape)
#xgb_train_x, xgb_train_y, xgb_test = select_feats_of_testdata(df, df_test, 'Label')
<class 'scipy.sparse.csr.csr_matrix'>
(1611, 37014)
  (0, 1)	1
  (0, 169)	1
  (0, 1289)	1
  (0, 1474)	1
  (0, 1479)	1
  (0, 1708)	1
  (0, 1738)	1
  (0, 1787)	1
  (0, 1789)	2
  (0, 1995)	1
  (0, 2257)	1
  (0, 2299)	1
  (0, 2333)	1
  (0, 2383)	6
  (0, 2499)	1
  (0, 2632)	1
  (0, 2768)	1
  (0, 2858)	3
  (0, 3012)	1
  (0, 3030)	1
  (0, 3224)	2
  (0, 3283)	1
  (0, 3288)	1
  (0, 3630)	1
  (0, 3649)	1
  :	:
  (1610, 33657)	1
  (1610, 34127)	1
  (1610, 34329)	1
  (1610, 34349)	2
  (1610, 34874)	1
  (1610, 34996)	2
  (1610, 35011)	1
  (1610, 35053)	1
  (1610, 35257)	1
  (1610, 35833)	1
  (1610, 35902)	1
  (1610, 35936)	1
  (1610, 35957)	1
  (1610, 35958)	1
  (1610, 36029)	2
  (1610, 36033)	1
  (1610, 36160)	1
  (1610, 36245)	1
  (1610, 36303)	1
  (1610, 36338)	1
  (1610, 36425)	1
  (1610, 36688)	1
  (1610, 36708)	1
  (1610, 36772)	1
  (1610, 36988)	1
(377, 37014)
In [35]:
output = xgboost(training_data_transformed[:,:-2], df_train.Label, test_data_transformed[:,:-2])
[0]	train-auc:0.589665+0.0155091	test-auc:0.506561+0.0134287
[10]	train-auc:0.803336+0.0154418	test-auc:0.506833+0.0400626
[20]	train-auc:0.861032+0.0185895	test-auc:0.507453+0.026457
[30]	train-auc:0.898932+0.0119324	test-auc:0.498739+0.021967
[40]	train-auc:0.918226+0.0138356	test-auc:0.499164+0.0210628
[50]	train-auc:0.935097+0.0125667	test-auc:0.499171+0.0216895
[60]	train-auc:0.945407+0.0114606	test-auc:0.495736+0.0225118
[70]	train-auc:0.954159+0.0109332	test-auc:0.495605+0.0157706
[80]	train-auc:0.960006+0.0108295	test-auc:0.498233+0.0217966
[90]	train-auc:0.963939+0.0098956	test-auc:0.496752+0.0262607
[100]	train-auc:0.969625+0.00784811	test-auc:0.49491+0.0249992
[110]	train-auc:0.973394+0.00676391	test-auc:0.491687+0.024949
[120]	train-auc:0.976771+0.00610072	test-auc:0.496023+0.0297286
[130]	train-auc:0.979259+0.00597841	test-auc:0.496365+0.0287816
[140]	train-auc:0.981549+0.00577538	test-auc:0.495076+0.0215728
[150]	train-auc:0.98358+0.00523358	test-auc:0.495096+0.0215246
[160]	train-auc:0.9851+0.00456025	test-auc:0.498839+0.019565
[170]	train-auc:0.98672+0.00344387	test-auc:0.496095+0.020938
[180]	train-auc:0.987894+0.00363869	test-auc:0.497461+0.0222345
[190]	train-auc:0.988759+0.00333247	test-auc:0.499138+0.0217647
[200]	train-auc:0.989516+0.00293948	test-auc:0.498668+0.0245854
[210]	train-auc:0.990391+0.0022488	test-auc:0.496315+0.0228158
[220]	train-auc:0.991283+0.00219955	test-auc:0.496924+0.0202054
[230]	train-auc:0.99181+0.00219085	test-auc:0.496669+0.0202395
[240]	train-auc:0.992571+0.00222704	test-auc:0.49738+0.0187175
[250]	train-auc:0.993319+0.00207986	test-auc:0.496134+0.0167822
Ensemble-CV: 0.5266716+0.04012367064763641
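The xgboost helper used above is imported from the project's helper modules. A rough sketch of an equivalent cross-validated setup with the public xgboost package could look like this (a hypothetical substitute with illustrative parameter values, not the original helper):

# Hypothetical sketch of a cross-validated xgboost run; the original helper may differ.
import xgboost as xgb

def xgboost_sketch(train_x, train_y, test_x, num_rounds=250):
    params = {'objective': 'binary:logistic', 'eval_metric': 'auc',
              'eta': 0.1, 'max_depth': 6}
    dtrain = xgb.DMatrix(train_x, label=train_y)
    dtest = xgb.DMatrix(test_x)
    # Report train/test AUC every 10 boosting rounds, as in the output above.
    xgb.cv(params, dtrain, num_boost_round=num_rounds, nfold=5, verbose_eval=10)
    booster = xgb.train(params, dtrain, num_boost_round=num_rounds)
    return booster.predict(dtest)  # predicted probabilities for the positive class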
In [36]:
prediction_binary = (output > 0.5).astype(int)
confusion_matrix_mod(df_test.Label, prediction_binary)
save_path = '../plots/confusion_matrix_xgboost'
multipage(''.join([save_path, '.pdf']))
plt.show()
plt.close()
Confusion Matrix
 [[ 48 138]
 [ 47 144]]
In [37]:
print(training_data_transformed[:,:-2].shape), print(test_data_transformed[:,:-2].shape)
(1611, 37012)
(377, 37012)
Out[37]:
(None, None)

An n-gram model with n=2 corresponds to a model where words account for their neighboring words; this model holds all two-word combinations (bigrams).
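Applied to the example headline from earlier, the bigram tokenization looks like this (a small illustration, not in the original notebook):

# Illustration: the first few bigrams of the example headline.
bigram_analyzer = CountVectorizer(ngram_range=(2, 2)).build_analyzer()
print(bigram_analyzer(example_top_news)[:5])
# ['russian tanks', 'tanks are', 'are moving', 'moving towards', 'towards the']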

In [38]:
# n=2 vectorizer
count_vectorizer_n2 = CountVectorizer(ngram_range=(2,2))
train_data_transformed_n2 = count_vectorizer_n2.fit_transform(training_data_rows)
train_data_transformed_n2.shape
Out[38]:
(1611, 377601)

The number of feature columns has increased sharply, from 37014 to 377601, roughly a factor of 10.

In [39]:
log_reg_model_n2 = LogisticRegression()
log_reg_model_n2 = log_reg_model_n2.fit(train_data_transformed_n2, df_train.Label)
In [40]:
feature_coefficients_df = feature_coefficients(count_vectorizer_n2, log_reg_model_n2)
feature_coefficients_df.head(10)
Out[40]:
Coefficient feature
279501 0.296023 right to
25329 0.279101 and other
294026 0.267622 set to
160539 0.238123 in egypt
325868 0.229538 the first
126697 0.223554 forced to
160144 0.220719 in china
367268 0.215048 will be
128113 0.211792 found in
332357 0.209793 this is
In [41]:
feature_coefficients_df.tail(10)
Out[41]:
Coefficient feature
336985 -0.202350 to help
120940 -0.203686 fire on
225019 -0.204450 nuclear weapons
157449 -0.208993 if he
49949 -0.212255 bin laden
350876 -0.216789 up in
337273 -0.220129 to kill
32617 -0.236822 around the
331101 -0.239817 there is
325144 -0.339305 the country
In [42]:
test_data_transformed_n2 = count_vectorizer_n2.transform(test_data)
In [43]:
predict_n2 = log_reg_model_n2.predict(test_data_transformed_n2)

Confusion matrix showing true and false positives and negatives

In [44]:
confusion_matrix_mod(df_test.Label, predict_n2)
save_path = '../plots/confusion_matrix_logreg_n2'
multipage(''.join([save_path, '.pdf']))
plt.show()
plt.close()
Confusion Matrix
 [[ 65 121]
 [ 49 142]]