
Strucai's Data Mining: Reddit WordCloud

nlp_scraped_data_notebook

Explore the data-mined Top 8 news headlines.

The Top 8 columns work as a ranking of global news. The data covers one week, from 2019-07-04 to 2019-07-11, scraped from the Reddit WorldNews channel (/r/worldnews) and ranked by Reddit users' votes at the time of each scrape. https://www.reddit.com/r/worldnews?hl

In [1]:
# nlp_notebook.ipynb
#  Assumes python vers. 3.6
# __author__ = 'mizio'

import csv
import numpy as np
import pandas as pd
import pylab as plt
from wordcloud import WordCloud, STOPWORDS
from datetime import date
import datetime
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sys import path
import os
from os.path import dirname, abspath

# Make the local machine_learning_modules package importable from this notebook's location
path_of_doc = dirname(dirname(abspath(os.path.split(os.getcwd())[0])))
path_to_ml_lib = path_of_doc + '/machine_learning_modules/'
if path_to_ml_lib not in path:
    path.append(path_to_ml_lib)
    
from datamunging.explore import *
# from machine_learning_estimators.tensorflow_NN import *
In [2]:
# Load data
df = pd.read_csv('../scraped_data/reddit_scraped_data2019-07-11_18:21:22.788215.csv', header=0)
df_train = df
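A minimal sketch, assuming the TimeStamp column parses with pandas' default datetime parsing, of restricting the loaded frame to the one-week window mentioned above (2019-07-04 to 2019-07-11) without modifying df:

# Optional sketch: limit to the stated one-week window (assumes TimeStamp parses as a datetime)
timestamps = pd.to_datetime(df['TimeStamp'])
df_week = df[(timestamps >= '2019-07-04') & (timestamps < '2019-07-12')]
print(df_week.shape)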
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2016 entries, 0 to 2015
Data columns (total 10 columns):
TimeStamp    2016 non-null object
Url          2016 non-null object
top1         2002 non-null object
top2         2002 non-null object
top3         2002 non-null object
top4         2002 non-null object
top5         2002 non-null object
top6         2002 non-null object
top7         2002 non-null object
top8         2002 non-null object
dtypes: object(10)
memory usage: 157.6+ KB
In [4]:
# clean for rows with null
# df = df.dropna()
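If the incomplete rows were dropped, only the rows carrying all eight headlines would remain, roughly 2002 of the 2016 judging by the non-null counts above. A minimal sketch of that optional step, kept on a separate variable so the original frame stays untouched:

df_clean = df.dropna()   # drop rows missing any of top1..top8
print(df_clean.shape)    # expected to be close to (2002, 10)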
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2016 entries, 0 to 2015
Data columns (total 10 columns):
TimeStamp    2016 non-null object
Url          2016 non-null object
top1         2002 non-null object
top2         2002 non-null object
top3         2002 non-null object
top4         2002 non-null object
top5         2002 non-null object
top6         2002 non-null object
top7         2002 non-null object
top8         2002 non-null object
dtypes: object(10)
memory usage: 157.6+ KB
In [6]:
df.head()
Out[6]:
TimeStamp Url top1 top2 top3 top4 top5 top6 top7 top8
0 2019-07-11 18:20:06 https://www.reddit.com/r/worldnews?hl Endangered rhino numbers ‘soar by 1,000%’ in T... Toronto hospital fired 150 employees after mul... Foreign ambassadors working in Washington have... Conspiracy theory about slain DNC staffer was ... Former ambassador to US says Boris Johnson is ... Trump cut a deal with China to mute US support... Kenya`s first coal plant construction paused i... Brazil deforestation 88% higher in June than l...
1 2019-07-11 18:15:05 https://www.reddit.com/r/worldnews?hl Endangered rhino numbers ‘soar by 1,000%’ in T... Toronto hospital fired 150 employees after mul... Conspiracy theory about slain DNC staffer was ... Foreign ambassadors working in Washington have... Trump cut a deal with China to mute US support... Former ambassador to US says Boris Johnson is ... Kenya`s first coal plant construction paused i... Brazil deforestation 88% higher in June than l...
2 2019-07-11 18:10:06 https://www.reddit.com/r/worldnews?hl Endangered rhino numbers ‘soar by 1,000%’ in T... Toronto hospital fired 150 employees after mul... Conspiracy theory about slain DNC staffer was ... Foreign ambassadors working in Washington have... Trump cut a deal with China to mute US support... Former ambassador to US says Boris Johnson is ... Brazil deforestation 88% higher in June than l... Kenya`s first coal plant construction paused i...
3 2019-07-11 18:05:06 https://www.reddit.com/r/worldnews?hl Endangered rhino numbers ‘soar by 1,000%’ in T... Toronto hospital fired 150 employees after mul... Conspiracy theory about slain DNC staffer was ... Foreign ambassadors working in Washington have... Trump cut a deal with China to mute US support... Brazil deforestation 88% higher in June than l... Former ambassador to US says Boris Johnson is ... In first year in power in Ontario, conservativ...
4 2019-07-11 18:00:06 https://www.reddit.com/r/worldnews?hl Endangered rhino numbers ‘soar by 1,000%’ in T... Toronto hospital fired 150 employees after mul... Conspiracy theory about slain DNC staffer was ... Foreign ambassadors working in Washington have... Trump cut a deal with China to mute US support... Brazil deforestation 88% higher in June than l... In first year in power in Ontario, conservativ... Kenya`s first coal plant construction paused i...

The news headlines are displayed in columns top1 up to top8 and work as a ranking of global news. The underlying problem is a two-class classification: identify which rows hold information that can be connected to a stock price (a (0,1) label is not part of this scraped data and would have to be joined in from price data; see the sketch at the end of the notebook).

Prepare the data so that all the words in a row end up in a list of single-word elements.

In [7]:
df_train.head(1)
Out[7]:
TimeStamp Url top1 top2 top3 top4 top5 top6 top7 top8
0 2019-07-11 18:20:06 https://www.reddit.com/r/worldnews?hl Endangered rhino numbers ‘soar by 1,000%’ in T... Toronto hospital fired 150 employees after mul... Foreign ambassadors working in Washington have... Conspiracy theory about slain DNC staffer was ... Former ambassador to US says Boris Johnson is ... Trump cut a deal with China to mute US support... Kenya`s first coal plant construction paused i... Brazil deforestation 88% higher in June than l...
In [8]:
# Print Top news headline
ranking_news = 3
news_index = 2 + ranking_news 
example_top_news = df_train.iloc[0, news_index]
print(example_top_news)
#print(example_top_news[2:-1])
Conspiracy theory about slain DNC staffer was planted by Russian intelligence, report finds
In [9]:
print(example_top_news.lower())
conspiracy theory about slain dnc staffer was planted by russian intelligence, report finds

Clean the phrase of abbreviations, punctuation, and other non-relevant parts.

In [10]:
headline_words_as_vector = CountVectorizer().build_tokenizer()(example_top_news.lower())
print(headline_words_as_vector)
['conspiracy', 'theory', 'about', 'slain', 'dnc', 'staffer', 'was', 'planted', 'by', 'russian', 'intelligence', 'report', 'finds']
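build_tokenizer() returns a function based on CountVectorizer's default token_pattern, r"(?u)\b\w\w+\b", so only runs of two or more word characters survive: punctuation and single-character tokens are dropped, and the tokenizer itself does not lowercase (hence the explicit .lower() above). A small illustration:

tokenizer = CountVectorizer().build_tokenizer()
print(tokenizer("U.S.-China trade war: a 5% tariff?"))
# expected: ['China', 'trade', 'war', 'tariff']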

Build a new dataframe with the words and their corresponding counts.

In [11]:
pd.DataFrame([[x, headline_words_as_vector.count(x)] for x in set(headline_words_as_vector)], columns=["word", "word_count"])
Out[11]:
word word_count
0 dnc 1
1 intelligence 1
2 about 1
3 staffer 1
4 theory 1
5 conspiracy 1
6 was 1
7 planted 1
8 russian 1
9 report 1
10 by 1
11 finds 1
12 slain 1

Instead of taking only one news headline, append all eight news headlines into one list of words and count them.

In [12]:
number_of_news = 8
In [13]:
# Concatenate the eight headlines of the first row into one string, skipping missing (NaN) entries
all_headline_words_as_vector = ''
for ranking_news in range(1, number_of_news + 1):
    news_index = 1 + ranking_news
    top_news = str(df_train.iloc[0, news_index])
    #print(top_news)
    if top_news != 'nan':
        all_headline_words_as_vector = ' '.join([all_headline_words_as_vector, top_news])

print(all_headline_words_as_vector)
all_headline_words_as_vector = CountVectorizer().build_tokenizer()(all_headline_words_as_vector.lower())
 Endangered rhino numbers ‘soar by 1,000%’ in Tanzania after crackdown on poaching gangs - And elephant populations have risen by nearly half in five years, thanks to a blitz on illegal ivory hunters, the president’s office said. Toronto hospital fired 150 employees after multimillion-dollar insurance fraud Foreign ambassadors working in Washington have revealed they share similar views to British envoy Sir Kim Darroch, who described Trump’s administration as “inept” and “dysfunctional” in leaked diplomatic cables. Conspiracy theory about slain DNC staffer was planted by Russian intelligence, report finds Former ambassador to US says Boris Johnson is a `paid-up member of the Trump fan club` who risks doing `great harm` to UK Trump cut a deal with China to mute US support for Hong Kong protests in exchange for progress in the trade war Kenya`s first coal plant construction paused in climate victory Brazil deforestation 88% higher in June than last year, data shows
In [14]:
pd.DataFrame([[x, all_headline_words_as_vector.count(x)] for x in set(all_headline_words_as_vector)], columns=["word", "word_count"])
Out[14]:
word word_count
0 harm 1
1 than 1
2 working 1
3 gangs 1
4 ambassador 1
5 victory 1
6 to 5
7 sir 1
8 by 3
9 us 2
10 employees 1
11 president 1
12 administration 1
13 cut 1
14 uk 1
15 last 1
16 trade 1
17 and 2
18 higher 1
19 poaching 1
20 ambassadors 1
21 fired 1
22 years 1
23 plant 1
24 risks 1
25 china 1
26 described 1
27 deforestation 1
28 finds 1
29 support 1
... ... ...
97 member 1
98 report 1
99 kong 1
100 dnc 1
101 war 1
102 multimillion 1
103 dollar 1
104 paused 1
105 planted 1
106 nearly 1
107 have 2
108 half 1
109 similar 1
110 shows 1
111 five 1
112 hunters 1
113 thanks 1
114 numbers 1
115 kenya 1
116 deal 1
117 88 1
118 blitz 1
119 for 2
120 revealed 1
121 crackdown 1
122 russian 1
123 office 1
124 construction 1
125 share 1
126 staffer 1

127 rows × 2 columns

In [15]:
def prepare_data(df):
    # For every row, concatenate the available (non-NaN) top1..top8 headlines into one string
    training_data_rows = []
    for row in range(0, df.shape[0]):
        all_headline_words_as_vector = ''

        for ranking_news in range(1, number_of_news + 1):
            news_index = 1 + ranking_news
            top_news = str(df.iloc[row, news_index])
            if top_news != 'nan':
                all_headline_words_as_vector = ' '.join([all_headline_words_as_vector, top_news])
        training_data_rows.append(all_headline_words_as_vector)
    return training_data_rows
In [16]:
training_data_rows = prepare_data(df_train)
In [17]:
print(len(training_data_rows))
2016

Create a count column for each word appearing across the headlines.

In [18]:
count_vectorizer = CountVectorizer()
training_data_transformed = count_vectorizer.fit_transform(training_data_rows)
print(training_data_transformed.shape)
(2016, 1485)
In [19]:
type(training_data_transformed)
Out[19]:
scipy.sparse.csr.csr_matrix
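The sparse matrix holds a word count per row and vocabulary entry. A minimal sketch of collapsing it into overall word frequencies via the fitted vectorizer's vocabulary, which is essentially what the word cloud below visualises (the most frequent entries will largely be stopwords, which WordCloud filters out):

total_counts = np.asarray(training_data_transformed.sum(axis=0)).ravel()
word_frequencies = pd.DataFrame({'word': count_vectorizer.get_feature_names(),
                                 'word_count': total_counts})
print(word_frequencies.sort_values('word_count', ascending=False).head(10))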
In [20]:
#print(training_data_transformed)
#print(count_vectorizer.get_feature_names()[1000:1400])

Put the words into a word cloud and show the most frequently occurring words by size.

In [21]:
text = " ".join(all_heads for all_heads in training_data_rows)
text = transformed_str = text.replace("`","'")
#print(text)
#print(len(text))
In [22]:
wordcloud = WordCloud(stopwords=STOPWORDS,
                      background_color='white',
                      width=2500,
                      max_words=100,
                      height=2000
                     ).generate(text)
plt.figure(1,figsize=(13, 13))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
plt.close()
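The problem statement earlier mentions connecting the headlines to a stock price. Such a label is not part of this scraped data, so the following is a purely hypothetical sketch with a placeholder label, only to show how the bag-of-words features could feed the imported LogisticRegression once a real (0,1) price-movement label were joined on:

# Hypothetical sketch: y_placeholder stands in for a real price-up/price-down label,
# which does not exist in this dataset.
y_placeholder = np.random.randint(0, 2, size=training_data_transformed.shape[0])
clf = LogisticRegression(solver='liblinear')
clf.fit(training_data_transformed, y_placeholder)
print('in-sample accuracy on placeholder labels:', clf.score(training_data_transformed, y_placeholder))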