
Final Project Progress Report

Title: Sentiment Analysis of Social Media and News Data for Predicting Stock Price Movements


Abstract:

Twitter and The New York Times are two widely read sources of opinion and news on the internet. This project uses data from both to analyze sentiment about a stock and to test whether opinions and news published online have a predictive relationship with stock price movements. Historical stock data obtained from Yahoo Finance will be correlated with the market sentiment collected from Twitter and the NY Times. The working hypothesis is that positive market sentiment is followed by rising stock prices. A combination of natural language processing and machine learning algorithms will be used for the sentiment analysis of tweets and NY Times news headlines. The results obtained from the two sources will then be compared.

Related domain of Study:

The project addresses the domain of predictive analysis in the stock market.

Algorithms

The following algorithms will be used for the project.

Prediction Algorithms:

In the project, the historical price data and the sentiment values obtained from sentiment analysis (using the NLTK package) will be fed into three different pattern recognition algorithms: Random Forest, Linear Regression, and MLP Classifier. Their output predictions will be presented as graphs.
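As a rough illustration of that step, the sketch below fits three models on the same feature table and collects their predictions. It rests on several assumptions that are not part of the report: scikit-learn as the library, synthetic data in place of real prices and sentiment, a feature row of [previous close, compound sentiment], and MLPRegressor standing in for the MLP Classifier, because the target here is a continuous closing price rather than a class label.

```python
import random

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

random.seed(0)

# Synthetic data (assumption): each row is [previous close, daily compound sentiment],
# and the target is that day's close.
features, targets = [], []
close = 100.0
for _ in range(300):
    sentiment = random.uniform(-1.0, 1.0)
    next_close = close + 0.5 * sentiment + random.gauss(0.0, 0.3)
    features.append([close, sentiment])
    targets.append(next_close)
    close = next_close

# Chronological split: the most recent 20% of rows is held out for testing.
cut = int(len(features) * 0.8)
X_train, X_test = features[:cut], features[cut:]
y_train, y_test = targets[:cut], targets[cut:]

models = {
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "linear_regression": LinearRegression(),
    # Regression counterpart of the MLP Classifier (assumption); inputs are
    # standardized first because neural networks are sensitive to feature scale.
    "mlp": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    ),
}

# Fit every model on the training rows and keep its test-period predictions.
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = list(model.predict(X_test))
```

Each entry in `predictions` is a series that can be plotted against `y_test` for the testing period.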

Data Pre-processing:

The data from Twitter, the NY Times, and the stock indices come in different formats and contain a lot of content that is not useful or relevant to the project, so the data must be filtered and preprocessed before it can be analyzed. Records with null values will be removed first. Pre-processing steps such as tokenization, stop-word removal, and regex matching will then be applied. Since the stock market is closed on weekends, articles published on weekends need to be handled so that the corresponding prices for those days do not show up as zero or null.
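A minimal sketch of the text-cleaning steps, using only the Python standard library. The stop-word list, the regular expressions, and the function names `clean_text` and `preprocess` are illustrative assumptions; the actual project may rely on NLTK's tokenizer and stop-word corpus instead.

```python
import re

# Small illustrative stop-word list (assumption); NLTK ships a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "on", "for", "and", "at"}

URL_RE = re.compile(r"https?://\S+")          # links
MENTION_RE = re.compile(r"@\w+")              # Twitter @mentions
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")  # words, keeping apostrophes


def clean_text(text):
    """Lower-case, strip URLs and @mentions, tokenize, and drop stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    tokens = TOKEN_RE.findall(text)
    return [t for t in tokens if t not in STOP_WORDS]


def preprocess(records):
    """Drop records with a missing date or text, then clean the remaining text."""
    cleaned = []
    for date, text in records:
        if not date or not text or not text.strip():
            continue  # null or empty record
        cleaned.append((date, clean_text(text)))
    return cleaned


# Made-up records: one usable, one with null text, one with a null date.
sample = [
    ("2018-03-01", "Microsoft shares are UP today! https://example.com @trader"),
    ("2018-03-02", None),
    (None, "Orphan headline with no date"),
]
print(preprocess(sample))
```

Only the first record survives, reduced to the tokens that carry meaning for the sentiment step.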

How will these algorithms be used?

Historical stock data, news articles, and tweets from the past 10 years will be used in the project. From the historical stock data, the closing price of each day will be taken. The first 8 years of data will be used for training and the remaining 2 years for testing. For each day, the sentiment of the news articles and tweets will be analyzed and expressed as positive, negative, neutral, and compound values. This data will then be fed into the prediction algorithms discussed above to generate a prediction of stock closing prices. The generated graphs will be used to interpret the results: each will compare the prediction for the 2 test years with the historical data of those 2 years. Of the three algorithms, the one whose predictions are closest to the actual prices will be selected.
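The split and the selection step can be sketched as follows. The report does not say how "closest" is measured, so root-mean-square error (RMSE) is used here as an assumption; the function names and all numbers are made up for illustration.

```python
import math


def chronological_split(rows, train_fraction=0.8):
    """Split date-ordered rows without shuffling, so the test set is the most recent part."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def closest_model(actual, predictions_by_model):
    """Return (name, error) of the model whose predictions have the lowest RMSE."""
    errors = {name: rmse(actual, preds) for name, preds in predictions_by_model.items()}
    best = min(errors, key=errors.get)
    return best, errors[best]


# Ten placeholder rows standing in for ten years of data: 8 for training, 2 for testing.
rows = list(range(10))
train, test = chronological_split(rows)

# Made-up closing prices and predictions, only to show the selection step.
actual_close = [100.0, 102.0, 101.0]
predictions = {
    "random_forest": [99.0, 103.0, 101.0],
    "linear_regression": [100.0, 100.0, 100.0],
    "mlp": [105.0, 105.0, 105.0],
}
best_name, best_error = closest_model(actual_close, predictions)
```

Splitting by position rather than at random keeps future prices out of the training set, which matters for time-ordered data.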

Data Source:

Three types of data are gathered, as follows:
Stock data: The initial idea was to make a predictive analysis of the top US companies, so the DJIA index looked like the best option. The project now focuses on the stock of a single company, so that company's historical data will be used instead. The data contains daily stock prices (open, high, low, close, and adjusted close) and volume. The project will use only the close price for each day from this data source.
Source: https://finance.yahoo.com/quote/MSFT.MX/history?p=MSFT.MX
News data: The news articles will be obtained from the NY Times Archive API. https://developer.nytimes.com/
Here’s a sample of the kind of data that will be used. The following sample has been obtained from https://www.kaggle.com/therohk/million-headlines
The news data will contain the publication date and the headline of articles relating to the company. The sentiment for each day will be analyzed and expressed as neutral, compound, positive, and negative values.
Twitter data: Tweets will be collected from the Twitter API using the Tweepy library (https://github.com/tweepy/tweepy). The Twitter data will provide the tweets about the company and their dates. As with the news data, the sentiment will be analyzed and expressed as neutral, positive, negative, and compound values. The kind of data obtained from Twitter is as follows.
(Sample obtained from: https://www.kaggle.com/kazanova/sentiment140 )
The main idea here is to get two sets of predictions – one from NY Times articles and another from Twitter data – and compare them.
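Both text sources are date-stamped, so their sentiment values have to be joined to the daily close prices, and this is where the weekend issue noted under pre-processing shows up. The sketch below shows one possible handling, assumed here rather than taken from the report: items published on a day with no price (a weekend or holiday) are rolled forward to the next trading day, and the scores for each trading day are averaged. The function name and all values are made up.

```python
from bisect import bisect_left
from collections import defaultdict
from datetime import date


def align_to_trading_days(scored_items, close_by_day):
    """Average sentiment per trading day; items from closed days roll to the next open day."""
    trading_days = sorted(close_by_day)
    buckets = defaultdict(list)
    for day, score in scored_items:
        # Index of the first trading day on or after the publication date.
        i = bisect_left(trading_days, day)
        if i == len(trading_days):
            continue  # no later trading day in the price data
        buckets[trading_days[i]].append(score)
    return [
        (day, sum(buckets[day]) / len(buckets[day]), close_by_day[day])
        for day in trading_days
        if day in buckets
    ]


# Made-up close prices for a Friday and the following Monday.
closes = {date(2018, 3, 2): 100.0, date(2018, 3, 5): 101.5}

# Made-up compound scores, including two weekend items.
items = [
    (date(2018, 3, 2), 0.5),
    (date(2018, 3, 3), -0.2),  # Saturday
    (date(2018, 3, 4), 0.4),   # Sunday
    (date(2018, 3, 5), 0.1),
]
rows = align_to_trading_days(items, closes)
```

Each resulting row is (trading day, mean sentiment, close price), so no row carries a zero or null price. Running the same function once on headlines and once on tweets yields the two feature sets to compare.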

Graphics

Since the project is about predicting stock price trends, a line graph is the most suitable way to present the results. Each graph will show the predicted trend and the actual trend for the testing period.
Here’s an example:

Current Challenges

Data Collection:
The Twitter API does not provide access to historical data. Historical data can, however, be obtained programmatically. This is the most important challenge. The contingency plan is to use data that others have collected and published, but finding enough data relevant to the project would be more difficult.
Merging results:
The project aims to produce two sets of predictions: one from NY Times articles and the other from Twitter data. Instead of presenting them only as separate results, it would be better to merge them into a single, improved result. Merging the results from the two sets is another challenge.

References to be Cited in the Project

Abdulaziz Sulaiman Almohaimeed, “Using Tweets Sentiment Analysis to Predict Stock Market Movement.” URL: https://etd.auburn.edu/bitstream/handle/10415/5759/Using%20Tweets%20Sentiment%20Analysis%20to%20Predict%20Stock%20Market%20Movement.pdf?sequence=2
Goel, Mittal, “Stock Prediction Using Twitter Sentiment Analysis.” URL: http://cs229.stanford.edu/proj2011/GoelMittal-StockMarketPredictionUsingTwitterSentimentAnalysis.pdf
Intel Corporation, “Stock Predictions through News Sentiment Analysis.” URL: https://www.codeproject.com/Articles/1201444/Stock-Predictions-through-News-Sentiment-Analysis
Pagolu, Challa, Panda, “Sentiment Analysis of Twitter Data for Predicting Stock Market Movements.” URL: https://arxiv.org/pdf/1610.09225.pdf
Quantinsti, “Machine Learning For Stock Price Prediction Using Regression.” URL: https://www.quantinsti.com/blog/machine-learning-trading-predict-stock-prices-regression
Chen, Lazer, “Sentiment Analysis of Twitter Feeds for the Prediction of Stock Market Movement.” URL: http://cs229.stanford.edu/proj2011/ChenLazer-SentimentAnalysisOfTwitterFeedsForThePredictionOfStockMarketMovement.pdf
