
Strucai's Data Mining: Reddit WordCloud

nlp_scraped_data_notebook

Explore the data-mined Top 8 news headlines.

The Top 8 columns work as a ranking of global news. The data covers one week, from 2019-07-04 to 2019-07-11, scraped from the Reddit WorldNews channel (/r/worldnews) and ranked by Reddit users' votes at the time of each scrape. https://www.reddit.com/r/worldnews?hl

In [1]:
# nlp_notebook.ipynb
#  Assumes python vers. 3.6
# __author__ = 'mizio'

import csv
import numpy as np
import pandas as pd
import pylab as plt
from wordcloud import WordCloud, STOPWORDS
from datetime import date
import datetime
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sys import path
import os
from os.path import dirname, abspath

# Make the local machine_learning_modules package importable from this notebook's location
path_of_doc = dirname(dirname(abspath(os.path.split(os.getcwd())[0])))
path_to_ml_lib = path_of_doc + '/machine_learning_modules/'
if path_to_ml_lib not in path:
    path.append(path_to_ml_lib)
    
from datamunging.explore import *
# from machine_learning_estimators.tensorflow_NN import *
In [2]:
# Load data
df = pd.read_csv('../scraped_data/reddit_scraped_data2019-07-11_18:21:22.788215.csv', header=0)
df_train = df
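A minimal sketch, assuming the TimeStamp column parses with pandas' default datetime parsing, of restricting the loaded frame to the one-week window mentioned above (2019-07-04 to 2019-07-11) without modifying df:

# Optional sketch: limit to the stated one-week window (assumes TimeStamp parses as a datetime)
timestamps = pd.to_datetime(df['TimeStamp'])
df_week = df[(timestamps >= '2019-07-04') & (timestamps < '2019-07-12')]
print(df_week.shape)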
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2016 entries, 0 to 2015
Data columns (total 10 columns):
TimeStamp    2016 non-null object
Url          2016 non-null object
top1         2002 non-null object
top2         2002 non-null object
top3         2002 non-null object
top4         2002 non-null object
top5         2002 non-null object
top6         2002 non-null object
top7         2002 non-null object
top8         2002 non-null object
dtypes: object(10)
memory usage: 157.6+ KB
In [4]:
# clean for rows with null
# df = df.dropna()
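If the incomplete rows were dropped, only the rows carrying all eight headlines would remain, roughly 2002 of the 2016 judging by the non-null counts above. A minimal sketch of that optional step, kept on a separate variable so the original frame stays untouched:

df_clean = df.dropna()   # drop rows missing any of top1..top8
print(df_clean.shape)    # expected to be close to (2002, 10)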
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2016 entries, 0 to 2015
Data columns (total 10 columns):
TimeStamp    2016 non-null object
Url          2016 non-null object
top1         2002 non-null object
top2         2002 non-null object
top3         2002 non-null object
top4         2002 non-null object
top5         2002 non-null object
top6         2002 non-null object
top7         2002 non-null object
top8         2002 non-null object
dtypes: object(10)
memory usage: 157.6+ KB
In [6]:
df.head()
Out[6]:
TimeStamp Url top1 top2 top3 top4 top5 top6 top7 top8
0 2019-07-11 18:20:06 https://www.reddit.com/r/worldnews?hl Endangered rhino numbers ‘soar by 1,000%’ in T... Toronto hospital fired 150 employees after mul... Foreign ambassadors working in Washington have... Conspiracy theory about slain DNC staffer was ... Former ambassador to US says Boris Johnson is ... Trump cut a deal with China to mute US support... Kenya`s first coal plant construction paused i... Brazil deforestation 88% higher in June than l...
1 2019-07-11 18:15:05 https://www.reddit.com/r/worldnews?hl Endangered rhino numbers ‘soar by 1,000%’ in T... Toronto hospital fired 150 employees after mul... Conspiracy theory about slain DNC staffer was ... Foreign ambassadors working in Washington have... Trump cut a deal with China to mute US support... Former ambassador to US says Boris Johnson is ... Kenya`s first coal plant construction paused i... Brazil deforestation 88% higher in June than l...
2 2019-07-11 18:10:06 https://www.reddit.com/r/worldnews?hl Endangered rhino numbers ‘soar by 1,000%’ in T... Toronto hospital fired 150 employees after mul... Conspiracy theory about slain DNC staffer was ... Foreign ambassadors working in Washington have... Trump cut a deal with China to mute US support... Former ambassador to US says Boris Johnson is ... Brazil deforestation 88% higher in June than l... Kenya`s first coal plant construction paused i...
3 2019-07-11 18:05:06 https://www.reddit.com/r/worldnews?hl Endangered rhino numbers ‘soar by 1,000%’ in T... Toronto hospital fired 150 employees after mul... Conspiracy theory about slain DNC staffer was ... Foreign ambassadors working in Washington have... Trump cut a deal with China to mute US support... Brazil deforestation 88% higher in June than l... Former ambassador to US says Boris Johnson is ... In first year in power in Ontario, conservativ...
4 2019-07-11 18:00:06 https://www.reddit.com/r/worldnews?hl Endangered rhino numbers ‘soar by 1,000%’ in T... Toronto hospital fired 150 employees after mul... Conspiracy theory about slain DNC staffer was ... Foreign ambassadors working in Washington have... Trump cut a deal with China to mute US support... Brazil deforestation 88% higher in June than l... In first year in power in Ontario, conservativ... Kenya`s first coal plant construction paused i...

The news headlines are displayed in columns top1 up to top8 and work as a ranking of global news. The underlying problem is a two-class classification: identify which rows hold information that can be connected to a stock price (a (0,1) label is not part of this scraped data and would have to be joined in from price data; see the sketch at the end of the notebook).

Prepare the data so that all the words in a row end up in a list of single-word elements.

In [7]:
df_train.head(1)
Out[7]:
TimeStamp Url top1 top2 top3 top4 top5 top6 top7 top8
0 2019-07-11 18:20:06 https://www.reddit.com/r/worldnews?hl Endangered rhino numbers ‘soar by 1,000%’ in T... Toronto hospital fired 150 employees after mul... Foreign ambassadors working in Washington have... Conspiracy theory about slain DNC staffer was ... Former ambassador to US says Boris Johnson is ... Trump cut a deal with China to mute US support... Kenya`s first coal plant construction paused i... Brazil deforestation 88% higher in June than l...
In [8]:
# Print Top news headline
ranking_news = 3
news_index = 2 + ranking_news 
example_top_news = df_train.iloc[0, news_index]
print(example_top_news)
#print(example_top_news[2:-1])
Conspiracy theory about slain DNC staffer was planted by Russian intelligence, report finds
In [9]:
print(example_top_news.lower())
conspiracy theory about slain dnc staffer was planted by russian intelligence, report finds

Clean the phrase of abbreviations, punctuation, and other non-relevant parts.

In [10]:
headline_words_as_vector = CountVectorizer().build_tokenizer()(example_top_news.lower())
print(headline_words_as_vector)
['conspiracy', 'theory', 'about', 'slain', 'dnc', 'staffer', 'was', 'planted', 'by', 'russian', 'intelligence', 'report', 'finds']
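build_tokenizer() returns a function based on CountVectorizer's default token_pattern, r"(?u)\b\w\w+\b", so only runs of two or more word characters survive: punctuation and single-character tokens are dropped, and the tokenizer itself does not lowercase (hence the explicit .lower() above). A small illustration:

tokenizer = CountVectorizer().build_tokenizer()
print(tokenizer("U.S.-China trade war: a 5% tariff?"))
# expected: ['China', 'trade', 'war', 'tariff']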

Build a new dataframe with the words and their corresponding counts.

In [11]:
pd.DataFrame([[x, headline_words_as_vector.count(x)] for x in set(headline_words_as_vector)], columns=["word", "word_count"])
Out[11]:
word word_count
0 dnc 1
1 intelligence 1
2 about 1
3 staffer 1
4 theory 1
5 conspiracy 1
6 was 1
7 planted 1
8 russian 1
9 report 1
10 by 1
11 finds 1
12 slain 1

Instead of taking only one news headline, append all eight news headlines into one list of words and count them.

In [12]:
number_of_news = 8
In [13]:
# Concatenate the eight headlines of the first row into one string, skipping missing (NaN) entries
all_headline_words_as_vector = ''
for ranking_news in range(1, number_of_news + 1):
    news_index = 1 + ranking_news
    top_news = str(df_train.iloc[0, news_index])
    #print(top_news)
    if top_news != 'nan':
        all_headline_words_as_vector = ' '.join([all_headline_words_as_vector, top_news])

print(all_headline_words_as_vector)
all_headline_words_as_vector = CountVectorizer().build_tokenizer()(all_headline_words_as_vector.lower())
 Endangered rhino numbers ‘soar by 1,000%’ in Tanzania after crackdown on poaching gangs - And elephant populations have risen by nearly half in five years, thanks to a blitz on illegal ivory hunters, the president’s office said. Toronto hospital fired 150 employees after multimillion-dollar insurance fraud Foreign ambassadors working in Washington have revealed they share similar views to British envoy Sir Kim Darroch, who described Trump’s administration as “inept” and “dysfunctional” in leaked diplomatic cables. Conspiracy theory about slain DNC staffer was planted by Russian intelligence, report finds Former ambassador to US says Boris Johnson is a `paid-up member of the Trump fan club` who risks doing `great harm` to UK Trump cut a deal with China to mute US support for Hong Kong protests in exchange for progress in the trade war Kenya`s first coal plant construction paused in climate victory Brazil deforestation 88% higher in June than last year, data shows
In [14]:
pd.DataFrame([[x, all_headline_words_as_vector.count(x)] for x in set(all_headline_words_as_vector)], columns=["word", "word_count"])
Out[14]:
word word_count
0 harm 1
1 than 1
2 working 1
3 gangs 1
4 ambassador 1
5 victory 1
6 to 5
7 sir 1
8 by 3
9 us 2
10 employees 1
11 president 1
12 administration 1
13 cut 1
14 uk 1
15 last 1
16 trade 1
17 and 2
18 higher 1
19 poaching 1
20 ambassadors 1
21 fired 1
22 years 1
23 plant 1
24 risks 1
25 china 1
26 described 1
27 deforestation 1
28 finds 1
29 support 1
... ... ...
97 member 1
98 report 1
99 kong 1
100 dnc 1
101 war 1
102 multimillion 1
103 dollar 1
104 paused 1
105 planted 1
106 nearly 1
107 have 2
108 half 1
109 similar 1
110 shows 1
111 five 1
112 hunters 1
113 thanks 1
114 numbers 1
115 kenya 1
116 deal 1
117 88 1
118 blitz 1
119 for 2
120 revealed 1
121 crackdown 1
122 russian 1
123 office 1
124 construction 1
125 share 1
126 staffer 1

127 rows × 2 columns

In [15]:
def prepare_data(df):
    # For every row, concatenate the available (non-NaN) top1..top8 headlines into one string
    training_data_rows = []
    for row in range(0, df.shape[0]):
        all_headline_words_as_vector = ''

        for ranking_news in range(1, number_of_news + 1):
            news_index = 1 + ranking_news
            top_news = str(df.iloc[row, news_index])
            if top_news != 'nan':
                all_headline_words_as_vector = ' '.join([all_headline_words_as_vector, top_news])
        training_data_rows.append(all_headline_words_as_vector)
    return training_data_rows
In [16]:
training_data_rows = prepare_data(df_train)
In [17]:
print(len(training_data_rows))
2016

Create a count column for each word appearing across the headlines.

In [18]:
count_vectorizer = CountVectorizer()
training_data_transformed = count_vectorizer.fit_transform(training_data_rows)
print(training_data_transformed.shape)
(2016, 1485)
In [19]:
type(training_data_transformed)
Out[19]:
scipy.sparse.csr.csr_matrix
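The sparse matrix holds a word count per row and vocabulary entry. A minimal sketch of collapsing it into overall word frequencies via the fitted vectorizer's vocabulary, which is essentially what the word cloud below visualises (the most frequent entries will largely be stopwords, which WordCloud filters out):

total_counts = np.asarray(training_data_transformed.sum(axis=0)).ravel()
word_frequencies = pd.DataFrame({'word': count_vectorizer.get_feature_names(),
                                 'word_count': total_counts})
print(word_frequencies.sort_values('word_count', ascending=False).head(10))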
In [20]:
#print(training_data_transformed)
#print(count_vectorizer.get_feature_names()[1000:1400])

Put the words into a word cloud and show the most frequently occurring words by size.

In [21]:
text = " ".join(all_heads for all_heads in training_data_rows)
text = transformed_str = text.replace("`","'")
#print(text)
#print(len(text))
In [22]:
wordcloud = WordCloud(stopwords=STOPWORDS,
                      background_color='white',
                      width=2500,
                      max_words=100,
                      height=2000
                     ).generate(text)
plt.figure(1,figsize=(13, 13))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
plt.close()
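The problem statement earlier mentions connecting the headlines to a stock price. Such a label is not part of this scraped data, so the following is a purely hypothetical sketch with a placeholder label, only to show how the bag-of-words features could feed the imported LogisticRegression once a real (0,1) price-movement label were joined on:

# Hypothetical sketch: y_placeholder stands in for a real price-up/price-down label,
# which does not exist in this dataset.
y_placeholder = np.random.randint(0, 2, size=training_data_transformed.shape[0])
clf = LogisticRegression(solver='liblinear')
clf.fit(training_data_transformed, y_placeholder)
print('in-sample accuracy on placeholder labels:', clf.score(training_data_transformed, y_placeholder))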