A Link Prediction Strategy for Personalized Tweet Recommendation through Doc2Vec Approach


In general, recommender systems are programs that attempt to predict items a user may be interested in.
Recommender systems work in much the same way across domains: using a user's historical interests or ratings, they predict the items that the user might like. In some specific domains, however, such as news, the story is different.
News is highly time-dependent, and after a short period its freshness is gone. A news recommender system should therefore be able to recommend news that is both fresh and related to the user's interests.
People used to follow news from traditional sources, but nowadays the most common way people read news is through social network channels that provide them with the most recent and freshest items. One of the most famous social networks focused on news is Twitter. Every user has a Twitter profile and follows various channels, celebrities, or friends. Since every user follows many pages, his or her Twitter timeline mixes many relevant and irrelevant subjects, forcing the user to dig for the news of interest.
As a consequence, the role of user modeling and personalized information access becomes crucial: users need personalized support in sifting through large amounts of available information according to their interests and tastes.
For this reason, in this paper we study users' tweets to model them, and we introduce a framework for a tweet recommender system that semantically enriches each user's tweets, detects his or her interests, and recommends fresh, relevant news.
In the proposed approach, the first step is to find a user's explicit interests. The user's profile is built by applying the Doc2Vec method to the user's tweets. After each user's explicit profile is built from the Doc2Vec model, similar users are found via the similarity of their vectors, and semantically similar tweets can be found in the same way.

Literature Review
With the growing impact of the Social Web, or Web 2.0, on our everyday life, people use more and more web-based services such as Facebook, Twitter, Flickr, or blogs. They use these services to express their opinions, communicate with others, and share pictures with friends. In doing so, they generate and distribute personal and social information such as interests, social contacts, preferences, and personal goals.
Twitter is an online social networking service that enables users to send and read short 140-character messages called "tweets". Registered users can read and post tweets, but unregistered users can only read them. Users access Twitter through the website interface, SMS, or a mobile device app.
As mentioned before, Twitter is a well-known content-centric social network. Due to its extensive usage, a large volume of text is generated by users' daily activity, and such a huge volume of user-generated data must be processed to be utilized effectively.
www.scholink.org/ojs/index.php/rem Research in Economics and Management Vol. 2, No. 4, 2017 65 Published by SCHOLINK INC.
These data could be used in a variety of applications to enhance human life. Processing such a huge amount of textual data requires more advanced algorithms to learn the hidden patterns in the data.
Text analytics is the method of processing this huge corpus of unstructured text to obtain high-quality data.
Text Analytics is defined in Wikipedia as follows: "Text Analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation".
A text analytics framework consists of three stages: Text preprocessing, Text representation and Knowledge discovery.
In text preprocessing, the raw textual data produced by social media sites cannot be analyzed directly and must first be cleaned. For text representation, one well-known option is the Paragraph Vector, an unsupervised framework that learns continuous distributed vector representations for pieces of text of variable length, anything from a phrase or sentence to a large document.
In the bag-of-words approach, the text is divided into words, a process called tokenization. The structure of the text is not maintained in this approach. Each word is represented as a single variable with a numeric weight, and TF-IDF (term frequency-inverse document frequency) is commonly used as the weighting mechanism. In the string-of-words approach, the sequence of the words is maintained. In most applications, bag-of-words is used due to its simplicity.
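The TF-IDF weighting described above can be sketched in pure Python (a toy three-document corpus of our own; production systems would typically use a library implementation):

```python
# Minimal bag-of-words TF-IDF sketch; the corpus is made up for illustration.
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {word: weight} dict per document in `corpus`."""
    tokenized = [doc.lower().split() for doc in corpus]   # naive tokenization
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for doc in tokenized for word in set(doc))
    weights = []
    for doc in tokenized:
        tf = Counter(doc)
        weights.append({
            w: (tf[w] / len(doc)) * math.log(n_docs / df[w])
            for w in tf
        })
    return weights

docs = ["the match was great", "the new phone looks great", "stock prices fell"]
w = tf_idf(docs)
# "great" appears in two of the three documents, so its weight is lower
# than that of "match", which appears in only one.
```

Because the structure of the text is discarded, each document reduces to exactly this kind of weighted word dictionary.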
Once the textual data are transformed into numeric vectors, machine learning or data mining algorithms can be used to identify hidden patterns in the text. The most common approaches are classification and clustering. Clustering falls under unsupervised learning and classification under supervised learning. In unsupervised learning, no training data are required: the documents containing the textual data are segmented into partitions such that each partition belongs to a single topic; this process is termed clustering. In supervised learning, training data are required to learn a classifier that labels unseen data. Classification is used in applications such as news filtering, document organization and retrieval, opinion mining, email classification, and spam filtering.

Related Work
One of the most viewed social networks is Twitter. Twitter poses a question to its users, "What is happening?", and a user can answer this question in 140 characters.
On Twitter, users have different ways to express their mind: posting a tweet, replying to posts from the accounts they follow, or retweeting them. Users can also use different tags to show their feelings.
Although tweets may contain valuable information, many of them have no relevance to a given user. This can make it hard for users to find their own interests within a large amount of information. To this end, different works have tried to respond to this challenge.
One proposed approach classifies web pages by calculating the respective weights of terms. The user interest and preference models are generated by analyzing the user's navigational history. The similarity between the web content and the user's model is used to determine whether the content will be provided to the user. The user's navigational data are monitored and analyzed to conduct user modeling.
An automatic classification method is utilized to categorize the web contents browsed by a user.
In the proposed web page classification method, the terms are determined using the WordNet ontology (Miller, 2009), and the weights of terms are calculated by the TF-IDF (term frequency-inverse document frequency) method.
Other researchers work on hybrid approaches to recommendation. One such approach proposes a new methodology for recommending interesting news to users by exploiting the information in their Twitter persona, modeling the relevance between users and news articles with a mix of signals drawn from the news stream and from Twitter: the profile of the user's social neighborhood, the content of the user's own tweet stream, and topic popularity in the news and across Twitter.
Another line of work focuses on dynamic recommendation, noting that a successful recommender for active users should introduce "somewhat novel" articles. By combining the user's long-term interest with short-term interest, that work recommends novel news to users.
An inspiring piece of research presents a content-based approach to modeling user interests based on Twitter. Personalization techniques are often classified into one of two categories: explicit and implicit.
Explicit personalization requires active and conscious data entry from the user, such as through a series of checkboxes or rating devices. In contrast, implicit personalization aims to automatically learn user preferences. Content-based approaches typically monitor the behavior of a user in the scope of an individual site or system and make recommendations based on their historical behavior.
In general, implicit personalization is considered more desirable from a user-experience perspective because it does not burden users with data-input tasks. On the other hand, other researchers work on the semantics of the tweets created by users.
Semantic relatedness, which computes the degree of association between two objects such as words, entities, and texts, is fundamental for many applications. It has long been thought that when humans measure the relatedness between a pair of words, deeper reasoning is triggered to compare the concepts behind the words.
One study investigates this question and introduces a framework for user modeling on Twitter which enriches the semantics of Twitter messages (tweets) and identifies topics and entities (e.g., persons, events, products) mentioned in tweets.
Other work investigates semantic user modeling based on Twitter posts, introducing and analyzing methods for linking Twitter posts with related news articles in order to contextualize Twitter activities.
Many past studies of semantic relatedness utilized lexical databases such as WordNet.

Methods
Figure 1 shows the architecture of our proposed approach; below we describe how the parts of this system work together, and then describe each part in turn.
The Architectural design for the proposed approach consists of the following stages:

Figure 1. Architecture of Proposed Approach
In the content-gathering step, we use Twitter's REST APIs to acquire data by screen name. Using the "user_timeline" API, the tweets of a particular user are acquired. This API takes many parameters as input; screen name and number of tweets are the two important parameters required to acquire a certain number of tweets from a particular user. Using this API, tweets for all the users in all the categories are acquired and stored in a file. The tweets collected in this step cannot be used to train a model directly, because the textual data contain unwanted text. In preprocessing, these unwanted symbols and meaningless words are removed from the original tweets.
Most tweets contain URLs, links, special symbols, abbreviations, hashtags, mentions, and incorrect spellings. The following rules are applied to preprocess the tweets. Rule 1: remove all special characters except "#" and "@". Tweets express emotions, so people use special characters; all of these are replaced with nothing. "Hashtags" are keywords in tweets prefixed by the "#" symbol (e.g., #SuperBowl). Many users use the same hashtag for a particular event, so hashtags are retained in the tweets. The "@" symbol is used to specify a username and is handled by Rule 2.
Rule 2: remove all URLs and @mentions. Shortened URLs are common in tweets, but they do not provide much information. For example, the shortened URL "bit.ly/12Jkw6U" contains little text with which to predict a category, so such URLs are removed during preprocessing. The "@" symbol specifies a user's screen name in a tweet (e.g., @BarackObama), and words prefixed with "@" are called "mentions". These words cannot be used to predict the category of a tweet because they usually contain only a username.
Rule 3: convert all words to lower case. Tweets are written in inconsistent formats: characters can be capital, small, or mixed. To make the training data more consistent, all words in the tweets are converted to lower case. In the representation step, the vector representations of words learned by Word2Vec and Doc2Vec models have been shown to carry semantic meaning and are useful in various NLP tasks. As mentioned before, we use the Doc2Vec model to represent each tweet.
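The three rules above can be sketched as a small preprocessing function (a minimal illustration; the exact regular expressions are our own assumption, not the paper's original code):

```python
import re

def preprocess(tweet: str) -> str:
    """Apply the three preprocessing rules to one raw tweet."""
    t = re.sub(r"https?://\S+|\bbit\.ly/\S+", " ", tweet)  # Rule 2: drop URLs
    t = re.sub(r"@\w+", " ", t)                            # Rule 2: drop @mentions
    t = re.sub(r"[^\w\s#]", " ", t)                        # Rule 1: drop specials, keep '#'
    return " ".join(t.lower().split())                     # Rule 3: lower-case, squeeze spaces

print(preprocess("Great game!!! #SuperBowl @BarackObama http://bit.ly/12Jkw6U"))
# → great game #superbowl
```

Note that the hashtag survives while the mention, URL, and exclamation marks are stripped, exactly as the rules prescribe.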
Paragraph Vector is an unsupervised framework that learns continuous distributed vector representations for pieces of texts. The texts can be of variable-length, ranging from sentences to documents. The name Paragraph Vector is to emphasize the fact that the method can be applied to variable-length pieces of texts, anything from a phrase or sentence to a large document.
Paragraph2Vec, also known as Doc2Vec, paragraph vector, or sentence embedding, is an algorithm modified from Word2Vec. The main purpose of Doc2Vec is to associate arbitrary documents with labels, so labels are required. Doc2Vec is an extension of Word2Vec that learns to correlate labels with words, rather than words with other words. Figure 2 shows an abstract view of Doc2Vec.

Figure 2. Doc2Vec
After preprocessing and representing each tweet as a vector, we can find the explicit interests of a user through the tweets he or she posted, retweeted, or liked. Each user has his or her own tweets, and the similarity of users can be found through Doc2Vec's similarity function, which is cosine similarity by default. Cosine similarity compares each user's vector with every other user's vector to find the users most similar to each one.

Experiment
We use graph-based link prediction to build the users' explicit graph and infer users' explicit interests, in order to recommend similar tweets. Link prediction is the problem of predicting the presence or absence of edges between nodes of a graph. There are two types of link prediction: (i) structural, where the input is a partially observed graph and we wish to predict the status of edges for unobserved pairs of nodes, and (ii) temporal, where we have a sequence of fully observed graphs at various time steps as input and our goal is to predict the graph state at the next time step.
The underlying graph of our proposed approach uses three types of information: users' relationships with each other, users' relationships with tweets, and tweets' relationships with each other.
Accordingly, our underlying user-model representation can be formalized as follows: the representation model G = (G_U, G_T, G_UT) is a heterogeneous graph composed of three subgraphs, G_U (user-user), G_T (tweet-tweet), and G_UT (user-tweet). The proposed approach is implemented in Python and runs on the Windows platform. We perform our experiment to answer this question: does the hybrid approach have better performance than implementing the explicit approach and the implicit approach independently?

Dataset
In this proposed approach we collected our tweets dataset from Twitter using Twitter's API. The first step in using Twitter's API is to be authenticated by Twitter. After registering the application with Twitter, the following parameters are provided to access Twitter and collect tweets: consumer token, consumer secret key, access token, and access secret key.
We used these parameters to fetch tweets through the Python Twitter library tweepy. We collected our data set from the timelines of followers of the most visited pages on Twitter (such as bbctech, bloomberg, espn, f1, microsoft, newyork times, washington post, cnni, euronews, etc.). For each channel we collected the followers' user ids and used each id to extract that user's timeline tweets.
We crawled over 900 users and 160,000 tweets. Twitter constrains us so that for each user we can only crawl his or her last 3,200 tweets; however, this is sufficient for our experiment.
To store the dataset, we use SQLite as our data warehouse. All extracted tweets go through the preprocessing step, which removes all URLs, non-English words, and @mentions, and transforms all words to lower case. In the next step, we illustrate how Doc2Vec is implemented on tweets and users' profiles.
After the preprocessing step, to represent each tweet as a vector, we trained a Doc2Vec model on the tweets dataset. We built a JSON file that contains all tweets, where each line looks like ("Tweet-Id", "Tweet-text"). The Doc2Vec model was trained on this data set with the tweet id as the label.

Tweet-Tweet Graph
In our proposed approach the Doc2Vec settings are as follows: a vector size of 35 and a window of 5. The result is a set of semantically similar tweets with their tweet ids and similarity degrees. This tweet Doc2Vec model is used to build the tweet-tweet graph used in the recommendation task. For each tweet, we collect the 30 most similar tweets to build the links between tweets in the graph.
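Linking each tweet to its most similar tweets can be sketched in pure Python (toy two-dimensional vectors and k = 2 here, instead of the paper's 35-dimensional Doc2Vec vectors and k = 30):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def knn_graph(vectors, k):
    """Adjacency list linking each node to its k most similar nodes."""
    graph = {}
    for node, vec in vectors.items():
        scores = [(other, cosine(vec, ovec))
                  for other, ovec in vectors.items() if other != node]
        scores.sort(key=lambda s: s[1], reverse=True)
        graph[node] = [other for other, _ in scores[:k]]
    return graph

vecs = {"t1": [1.0, 0.1], "t2": [0.9, 0.2], "t3": [0.0, 1.0]}
g = knn_graph(vecs, k=2)
# t1's nearest neighbour is t2: both point mostly along the first axis.
```

The same construction is applied to the user-user and user-tweet subgraphs described below.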

User-User Graph
After training on all tweets with the Doc2Vec model, to find the similarity between users we must also train a model over all users. To evaluate our proposed approach, we collected 100 active users, each of whom had posted at least 400 tweets.
To estimate our proposed model's accuracy, we use cross-validation. Cross-validation is mainly used in settings where the goal is prediction and one wants to estimate how accurately a predictive model will perform in practice. In a prediction problem, a model is usually given a dataset of known data on which training is run (the training dataset) and a dataset of unknown, previously unseen data against which the model is tested (the testing dataset). Cross-validation has different variants; in our approach we implement leave-p-out cross-validation. Using this scheme, we randomly remove 30% of the tweets that each user liked or retweeted to test our approach, and the remaining tweets are used to train each user's explicit model.
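The random 30% holdout of each user's liked or retweeted tweets can be sketched as follows (toy tweet ids; the fixed random seed is our assumption, added for reproducibility):

```python
import random

def holdout_split(tweet_ids, test_fraction=0.3, seed=42):
    """Randomly hold out `test_fraction` of a user's tweets for testing."""
    rng = random.Random(seed)
    ids = list(tweet_ids)
    rng.shuffle(ids)
    n_test = int(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]   # (train, test)

train, test = holdout_split([f"t{i}" for i in range(10)])
# 7 tweets remain for training, 3 are held out for evaluation.
```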
The users' Doc2Vec model is built on these 100 users. Its input is a JSON file whose format looks like ("User-Id", "all of the user's tweets"). The Doc2Vec model was trained on this data set with the user id as the label.
The result is a set of semantically similar users with their user ids and similarity degrees. This user Doc2Vec model is used to build the user-user graph used in the recommendation task. For each user, we collect the 30 most similar users to build the links between users in the graph.

User-Tweet Graph
After building the user-user and tweet-tweet graphs, it is time to build the user-tweet graph. To this end, we model all tweets and all collected users together through Doc2Vec. The result is the set of semantically most similar tweets for each user. For each user, we collect the 30 most similar tweets to build the links between users and tweets in the graph.

Explicit Recommendation
To build the explicit profile of a user we build G = (G_U, G_T, G_UT), which consists of the user-user graph, the tweet-tweet graph, and the user-tweet graph. After building the graph G, the recommendation is made using a link-prediction strategy. Our problem is to infer whether a user u is explicitly interested in a tweet t. In other words, we are going to find missing links by adopting an unsupervised link-prediction strategy over the links in G. Most unsupervised link-prediction strategies generate scores based either on vertex neighborhoods or on path information. Vertex-neighborhood methods are based on the idea that two vertices are more likely to have a link if they have many neighbors in common. Path-based approaches consider all paths between two vertices. All these approaches are based on a predictive score function for ranking links that are likely to occur. There is no single superior approach, and the structure of the specific graph determines their quality. In our approach we use the Jaccard coefficient strategy for inferring the explicit interests of a user.
The Jaccard coefficient of two vertices u and t with neighbor sets Γ(u) and Γ(t) is defined as J(u, t) = |Γ(u) ∩ Γ(t)| / |Γ(u) ∪ Γ(t)|. The explicit profile of a user is the set of links between tweets t and the user u, where each link is scored by the link-prediction approach L: if U = {u_1, u_2, …, u_n}, u_i = {tweet1_ui, tweet2_ui, …}, and T = {t_1, t_2, …, t_n}, then the explicit profile of a user is E(u) = {(u, L(u, t)) | t ∈ T, u ∈ U}.
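The Jaccard coefficient scoring can be sketched as follows (a toy adjacency structure; in the paper the neighbor sets come from the user-user, tweet-tweet, and user-tweet subgraphs of G):

```python
def jaccard(neighbors, a, b):
    """Jaccard coefficient |Γ(a) ∩ Γ(b)| / |Γ(a) ∪ Γ(b)|."""
    na, nb = neighbors[a], neighbors[b]
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0

# Toy adjacency: users u1, u2 and tweets t1..t3 share some neighbors.
neighbors = {
    "u1": {"t1", "t2", "u2"},
    "u2": {"t2", "t3", "u1"},
    "t3": {"t2", "u2"},
}
# Score the unobserved link (u1, t3): the higher the score, the more
# likely u1 is explicitly interested in tweet t3.
score = jaccard(neighbors, "u1", "t3")
```

Ranking candidate tweets by this score for each user yields the explicit recommendations.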

Evaluation and Metrics
To handle information overload and help users find items based on their interests, personalization techniques are used in personalized recommender systems. To figure out how well the recommended items suit users and how relevant they are, we must test and evaluate our proposed recommender system.
The main questions are whether a recommender system is efficient with respect to specific criteria such as accuracy, user satisfaction, response time, or serendipity, and, in some domains, whether customers like or buy the recommended items.
Three typical measures used for evaluating the performance of recommender systems are precision, recall, and F-measure. In information retrieval contexts, precision and recall are defined in terms of a set of retrieved documents and a set of relevant documents.
For classification and recommendation tasks, the terms true positives, true negatives, false positives, and false negatives describe the results of the classifier or recommender under test. These four outcomes can be defined in a contingency matrix, as shown in Figure 3. In our evaluation we tested our recommendations at different values of K and found that at K = 30 the recommendation results are best in terms of precision and recall @K. The results of recommendation @K are shown in Table 1.
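Precision and recall at a cutoff K can be sketched as follows (toy recommendation and relevance sets; this shows the standard @K computation, not the paper's exact evaluation script):

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@K and Recall@K for one user.

    recommended: ranked list of tweet ids.
    relevant: set of held-out tweets the user actually liked or retweeted.
    """
    top_k = recommended[:k]
    hits = sum(1 for t in top_k if t in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

recs = ["t1", "t5", "t3", "t9", "t2"]
liked = {"t1", "t2", "t4"}
p, r = precision_recall_at_k(recs, liked, k=5)
# 2 of the top 5 recommendations are relevant → precision 0.4, recall 2/3.
```

Averaging these per-user values over the 100 test users gives the @K figures reported in Table 1.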

Conclusion and Future Work
Recommender systems have appeared in response to information overload on the internet, to help users find the items they are interested in among this crowd of relevant and irrelevant information. In this paper, we introduced the history of recommender systems and their different kinds, such as content-based, collaborative, and hybrid recommender systems. In the social media domain, we focused on Twitter, which in this research is the main source of information for our proposed personalized recommender system. The next domain is text analysis: in the text processing and representation steps we used the Doc2Vec model, and all tweets are represented as word vectors.
To represent tweets as vectors, the preprocessing step removes all links, non-useful symbols, and non-English tweets, and converts all words to lower case. We recommend similar tweets to each user based on his or her interests through a link-prediction strategy, and the results show that at K = 30 the proposed approach has better performance. For future work, considering that users' interests change over time, adding users' short-term interests to build a dynamic personalized recommender system would yield better personalized recommendations for each user.