Title: Sentiment Analysis of Social Media and News Data for Predicting Stock Price Movements
Abstract:
Twitter and The New York Times are two very popular websites. This project uses data from both sources to analyze sentiment about a stock and to look for a predictive relationship between online opinions and news and stock price movements. Historical stock data obtained from Yahoo Finance will be correlated with the market sentiment collected from Twitter and the NY Times. The working hypothesis is that positive market sentiment should mean that stock prices go up. A combination of Natural Language Processing and machine learning algorithms will be used for the sentiment analysis of tweets and NY Times news headlines. The sentiment results from Twitter and from the NY Times will then be compared.
Related domain of Study:
The project addresses the domain of predictive analysis in the stock market.
Algorithms
The following algorithms will be used for the project.
Prediction Algorithms:
In the project, the historical price data and the sentiment values obtained from sentiment analysis (using the NLTK package) will be fed into 3 different pattern recognition algorithms, namely Random Forest, Linear Regression and MLP Classifier, which will be used to generate the output predictions as graphs.
Data Pre-processing:
The data from Twitter, the NY Times and the stock indices come in different formats and contain a lot of content that is not useful or relevant to the project, so the data needs to be filtered and pre-processed before it can be analyzed. Pre-processing algorithms such as tokenization, stop-word removal and regex matching will be used. Before these steps, records with null values should be removed. Since the stock market is closed on weekends, articles published on weekends need to be handled so that the missing prices for those days do not show up as zero or null.
How will these algorithms be used?
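The pre-processing steps described above (null removal, regex matching, tokenization and stop-word removal) could look roughly like the sketch below. It uses only the Python standard library. The names `preprocess`, `clean_records` and `STOP_WORDS`, the tiny stop-word list, and the choice to strip links and @mentions are all assumptions made for illustration, not the project's settled design.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
# A real run would more likely use a full list such as NLTK's English stop words.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "on", "and", "for", "at"}

URL_RE = re.compile(r"https?://\S+")      # links
MENTION_RE = re.compile(r"@\w+")          # @user mentions
TOKEN_RE = re.compile(r"[a-z][a-z']*")    # alphabetic words, apostrophes allowed


def preprocess(text):
    """Return the cleaned tokens of one tweet or headline."""
    text = text.lower()
    text = URL_RE.sub(" ", text)          # regex matching: drop links
    text = MENTION_RE.sub(" ", text)      # regex matching: drop mentions
    tokens = TOKEN_RE.findall(text)       # tokenization: keep alphabetic words
    return [t for t in tokens if t not in STOP_WORDS]  # stop-word removal


def clean_records(records):
    """Drop (date, text) records with null or empty fields, then preprocess the rest."""
    cleaned = []
    for day, text in records:
        if day is None or text is None or not text.strip():
            continue
        cleaned.append((day, preprocess(text)))
    return cleaned
```

Negation words such as "not" are left out of the stop-word list on purpose, since removing them could flip the sentiment of a sentence.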
Stock data, news articles and tweets from the past 10 years will be used in the project. From the historical stock data, the closing price of each day will be taken. The data from 8 years will be used for training and the remaining 2 years for testing. For each day, the sentiment of the news articles and tweets will be analyzed and scored as positive, negative, neutral and compound. These values will then be fed into the prediction algorithms discussed above to generate a prediction of stock closing prices. The results will be interpreted from graphs that compare the predictions for the 2 test years with the historical data for those 2 years. Of the 3 algorithms, the one whose predictions come closest to the historical data will be selected.
Data Source:
Two types of data will be gathered, as follows:
Stock indices: The initial idea was to run a predictive analysis on the top US companies, so the DJIA index looked like the best option. The project now aims to work on the stock of a single company, so that company’s historical data will be used instead. The data contains daily stock prices: open price, high price, low price, close price, adj close price, and volume. The project will use only the close price for each day from this data source.
Source: https://finance.yahoo.com/quote/MSFT.MX/history?p=MSFT.MX
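
As a rough sketch of how the close prices could be read and divided into training and testing periods, the example below parses a CSV with the columns listed above and splits it by date. It assumes the download has a header row of `Date,Open,High,Low,Close,Adj Close,Volume`, that missing values appear as `null`, and that the split is chronological. The function names and the sample rows are made up for illustration.

```python
import csv
import io
from datetime import date


def load_close_prices(csv_file):
    """Read (date, close) pairs, skipping rows whose close price is missing."""
    prices = []
    for row in csv.DictReader(csv_file):
        close = (row.get("Close") or "").strip()
        if close in ("", "null"):
            continue  # record with a null value: drop it
        prices.append((date.fromisoformat(row["Date"]), float(close)))
    prices.sort(key=lambda pair: pair[0])  # oldest first
    return prices


def chronological_split(prices, first_test_day):
    """Everything before first_test_day is training data, the rest is test data."""
    train = [p for p in prices if p[0] < first_test_day]
    test = [p for p in prices if p[0] >= first_test_day]
    return train, test


# Made-up rows, only to show the expected file layout.
sample = io.StringIO(
    "Date,Open,High,Low,Close,Adj Close,Volume\n"
    "2016-12-30,10.0,10.5,9.8,10.2,10.2,1000\n"
    "2017-01-02,null,null,null,null,null,null\n"
    "2017-01-03,10.2,10.9,10.1,10.8,10.8,1200\n"
)
prices = load_close_prices(sample)
train, test = chronological_split(prices, date(2017, 1, 1))
```

For the real data, `first_test_day` would be the first day of the final 2 years, so that the model is never trained on prices that come after the period it is tested on.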

Here’s a sample of the kind of data that will be used. The following sample has been obtained from https://www.kaggle.com/therohk/million-headlines

Twitter Data: Tweets will be collected through the Twitter API using the tweepy library (URL - https://github.com/tweepy/tweepy). The Twitter data will provide the tweets about the company and their dates. As with the news data, the sentiment of each tweet will be scored as positive, negative, neutral and compound. The kind of data obtained from Twitter is as follows.
(Sample obtained from: https://www.kaggle.com/kazanova/sentiment140 )
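
Because many tweets and headlines appear each day, including weekends when the market is closed, the per-item sentiment has to be reduced to one value per trading day before it can be paired with a close price. The sketch below shows one possible way to do that. It assumes each item already has a compound score between -1 and 1 (for example from NLTK), rolls Saturday and Sunday items forward to Monday, and averages the scores per day. The roll-forward rule, the 0.05 cut-off and all function names are assumptions, and market holidays are not handled.

```python
from collections import defaultdict
from datetime import date, timedelta


def to_trading_day(day):
    """Roll Saturday and Sunday forward to the following Monday."""
    weekday = day.weekday()  # Monday is 0, Sunday is 6
    if weekday == 5:
        return day + timedelta(days=2)
    if weekday == 6:
        return day + timedelta(days=1)
    return day


def daily_sentiment(scored_items):
    """Average the compound scores of (date, compound) items per trading day."""
    buckets = defaultdict(list)
    for day, compound in scored_items:
        buckets[to_trading_day(day)].append(compound)
    return {day: sum(scores) / len(scores) for day, scores in sorted(buckets.items())}


def label(compound, threshold=0.05):
    """Turn a compound score into a category; the threshold is an assumed convention."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"
```

With this rule, a tweet posted on Saturday 2018-01-06 would count toward Monday 2018-01-08, so no weekend day is left with a sentiment value but no price.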

Graphics
Since the project is about predicting stock price trends, a line graph is the most suitable way to present the results. Each graph should show the predicted trend and the actual trend for the testing period.
Here’s an example:

Current Challenges
Data Collection:
The Twitter API does not provide access to historical data. Historical data can, however, be obtained programmatically. This is the most important challenge. The contingency plan is to use data that others have already collected and published, but finding enough data relevant to the project that way would be more difficult.
Merging results:
The project aims to produce two sets of predictions: one from NY Times articles and the other from Twitter data. Instead of presenting them as separate results, it would be better to merge them into a single, improved result. Merging the results from each set will be another challenge.
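One simple option for the merge, offered only as a sketch and not as the approach the project has chosen, would be a weighted average of the two predicted price series. Whether the merged series is actually better can then be checked with an error measure such as root mean squared error (RMSE) against the actual closing prices. The function names, the default weight of 0.5 and the numbers in the usage lines are all made up for illustration.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equally long price series."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def merge_predictions(news_pred, tweet_pred, news_weight=0.5):
    """Weighted average of the news-based and tweet-based predictions, day by day."""
    if len(news_pred) != len(tweet_pred):
        raise ValueError("both prediction series must cover the same days")
    return [news_weight * n + (1 - news_weight) * t
            for n, t in zip(news_pred, tweet_pred)]


# Made-up numbers, only to show the calls.
news = [10.0, 12.0]
tweets = [14.0, 16.0]
actual = [12.0, 14.0]
merged = merge_predictions(news, tweets)
errors = {"news": rmse(news, actual), "tweets": rmse(tweets, actual), "merged": rmse(merged, actual)}
```

The same `rmse` helper could also serve as the closeness measure when choosing among the 3 prediction algorithms, and `news_weight` could be tuned on the training period rather than fixed in advance.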
References to be Cited in the Project
Abdulaziz Sulaiman Almohaimeed, “Using Tweets Sentiment Analysis to Predict Stock Market Movement.” URL - https://etd.auburn.edu/bitstream/handle/10415/5759/Using%20Tweets%20Sentiment%20Analysis%20to%20Predict%20Stock%20Market%20Movement.pdf?sequence=2
Goel, Mittal, “Stock Prediction Using Twitter Sentiment Analysis.” URL - http://cs229.stanford.edu/proj2011/GoelMittal-StockMarketPredictionUsingTwitterSentimentAnalysis.pdf
Intel Corporation, “Stock Predictions through News Sentiment Analysis.” URL - https://www.codeproject.com/Articles/1201444/Stock-Predictions-through-News-Sentiment-Analysis
Pagolu, Challa, Panda, “Sentiment Analysis of Twitter Data for Predicting Stock Market Movements.” URL - https://arxiv.org/pdf/1610.09225.pdf
Quantinsti, “Machine Learning For Stock Price Prediction Using Regression.” URL - https://www.quantinsti.com/blog/machine-learning-trading-predict-stock-prices-regression
Chen, Lazer, “Sentiment Analysis of Twitter Feeds for the Prediction of Stock Market Movement.” URL - http://cs229.stanford.edu/proj2011/ChenLazer-SentimentAnalysisOfTwitterFeedsForThePredictionOfStockMarketMovement.pdf