Reputation Analysis of E-commerce Products Based on Online Reviews—Take Amazon as an Example

Based on Amazon product review data from 2006 to 2016, this paper measures the correlation between star ratings and text reviews. First, a supervised SVM model is established to classify text emotions. Second, a word2vec model is built to perform unsupervised emotion analysis of the comment text. Third, the relationship between special text and the score is obtained using a grey relational model. In addition, principal component analysis is applied to commodity reputation to identify representative good and inferior commodities. Finally, this paper puts forward some suggestions for merchants, e-commerce platforms and governments.


Published by SCHOLINK INC.
For enterprises, only by timely, accurate and reasonable analysis of online reviews, and understanding of consumers' shopping experience and emotional tendency, can they better promote attractive products.

Review
In research on online comments on Chinese e-commerce platforms, Ye (2016) used a neural network to map online comment text from the traditional document space to a high-dimensional semantic vector space, and conducted clustering analysis on the mined comment feature words. Cui (2018) calculated the emotional intensity of online comments based on semantic dictionaries. Guo (2019) analyzed the relationship between online reviews and product sales.
In research on online comments on e-commerce platforms outside China, Zhao (2018) analyzed the relationship between the distribution of evaluation resources on Amazon and the usefulness of online customer reviews. Sun (2018) used the R language and word cloud analysis to conduct text mining on Amazon's household products.
It can be seen that there are few text mining studies of online reviews on Amazon, although text mining technology can play a great role in tracking consumers' demands and preferences in real time.
This paper first establishes a supervised support vector machine (SVM) model, then an unsupervised text emotion analysis model, and analyzes the impact of scoring on subsequent comments. It then establishes a commodity reputation evaluation model based on principal component analysis (PCA) and distinguishes successful products from failed ones.

Descriptive Statistics
The data used in this paper includes 30,000 online reviews of three categories of products on Amazon from 2006 to 2016. The basic description is shown in Table 1.

Handling of Outliers and Missing Values
To ensure the reliability and accuracy of the research data, this paper replaced or supplemented abnormal, missing and erroneous data where possible, and deleted erroneous comments that could not be counted or analyzed.
1. Comments with missing values were deleted, as shown in Table 2. This paper also found 60 misplaced comments in the pacifier data set, as shown in Table 3. Since such data accounted for only 0.32% of the total number of comments and had little impact on subsequent analysis, they were excluded as well.

Build text vectors
Each text is represented as a vector: for text k, the i-th component is the number of times the i-th word in the thesaurus appears in that text.
All the collected texts are represented as vectors in this way. If a word or phrase appears with a high term frequency (tf) in one text and with a low frequency in other texts, it is considered to have good categorization ability and is therefore selected.
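The weighting idea described above, where a term that is frequent in one text but rare in the others is a good discriminator, is commonly implemented as TF-IDF. The following is a minimal sketch under stated assumptions: whitespace tokenization, the unsmoothed form tf × log(N / df), and a hypothetical helper name `tfidf_vectors`. The toy reviews are illustrative and are not taken from the data set.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Return (vocabulary, vectors), with one TF-IDF vector per text."""
    tokenized = [t.lower().split() for t in texts]
    vocab = sorted({w for doc in tokenized for w in doc})
    n_docs = len(tokenized)
    # Document frequency: the number of texts that contain each word.
    df = Counter(w for doc in tokenized for w in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc)
        vec = []
        for w in vocab:
            tf = counts[w] / total if total else 0.0
            idf = math.log(n_docs / df[w])  # assumed unsmoothed IDF
            vec.append(tf * idf)
        vectors.append(vec)
    return vocab, vectors

# Illustrative reviews (made up for this sketch).
reviews = ["great dryer great price", "dryer broke fast", "great pacifier"]
vocab, vecs = tfidf_vectors(reviews)
```

In the first toy review, "price" receives a higher weight than "dryer" because it occurs in no other review, which is the selection behaviour the text describes.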

Establishment of Support Vector Machine Model
Based on corpus, this paper establishes the model of linear support vector machine.
First, we divided the text review and star rating into training set and test set.
Secondly, based on the training set data, the linear hyperplane equation of the classifier, $w^{T}x + b = 0$, is calculated using a Python program, as shown in Figure 1. The overall performance of the model is evaluated by K-fold cross-validation. The basic idea is to divide the total sample set into K parts, of which K-1 parts are used as the training set and the remaining part as the test set, to obtain a test accuracy.
To better reflect the robustness of the model, this operation is carried out K times in turn, each time with a different training set and test set, and the average of all accuracies is taken as the comprehensive score of the model. Here, 20% fold cross-validation is used.
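The cross-validation procedure can be sketched as follows. This is an illustrative outline rather than the program used in this paper: K = 5 is chosen only for the example, samples are not shuffled, K is assumed not to exceed the number of samples, and a trivial majority-class classifier (`train_majority`, `predict_majority`, both hypothetical names) stands in for the SVM.

```python
from collections import Counter

def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(samples, labels, k, train_fn, predict_fn):
    """Average the test accuracy over k train/test splits."""
    accuracies = []
    for test_idx in k_fold_indices(len(samples), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(samples)) if i not in held_out]
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, samples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k

# Placeholder classifier: always predicts the most common training label.
def train_majority(xs, ys):
    return Counter(ys).most_common(1)[0][0]

def predict_majority(model, x):
    return model

texts = ["r%d" % i for i in range(10)]       # stand-ins for review vectors
labels = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1]      # 1 = high rating, 0 = low rating
score = cross_validate(texts, labels, 5, train_majority, predict_majority)
```

In practice the samples would be shuffled before splitting, and `train_fn` and `predict_fn` would wrap the actual classifier.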
2. The SVM is solved with the Python sklearn library, trying different SVM kernel functions, including the linear kernel, the polynomial kernel and the Gaussian radial basis kernel.
After parameter tuning, the linear kernel was found to be fast and to perform well.
3. The first classification is defined as follows: the five levels of star rating are divided into two categories, with ratings greater than or equal to 3 in one category and ratings less than 3 in the other, representing high rating and low rating respectively.
4. The second classification is defined as follows: the five levels of star rating are divided into three categories, with ratings greater than 3, equal to 3, and less than 3, representing high score, medium score, and low score respectively.
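The two labelling schemes defined above can be written as a pair of small functions. The function names and label strings are illustrative choices, not identifiers from the original program.

```python
def two_class_label(stars):
    """First classification: >= 3 is a high rating, < 3 is a low rating."""
    return "high" if stars >= 3 else "low"

def three_class_label(stars):
    """Second classification: > 3 high, == 3 medium, < 3 low."""
    if stars > 3:
        return "high"
    if stars == 3:
        return "medium"
    return "low"

ratings = [1, 2, 3, 4, 5]
first_scheme = [two_class_label(s) for s in ratings]
second_scheme = [three_class_label(s) for s in ratings]
```

Note that a 3-star review is labelled "high" under the first scheme but "medium" under the second, which is why the two schemes yield different accuracy scores.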
The score of the first classification was 0.921, and that of the second classification was 0.862.
Considering the significant difference between low and high scores, the medium score appears in fact to lean positive.
Therefore, the second classification scheme is used to predict the star rating of some samples from their text comments, and the samples predicted as 3 are analyzed by comparing how many have an actual star rating greater than 3 with how many have an actual star rating less than 3. The results are shown in Table 4.

Table 4

| Item | Count |
| --- | --- |
| Sample size | 8767 |
| Number predicted to be 3 | 947 |
| Star rating > 3 | 37 |
| Star rating < 3 | 7 |

Conclusion: The number of samples with a star rating greater than 3 is significantly larger than the number with a star rating less than 3. Comparatively speaking, under the above training conditions, the SVM model is more inclined to positive emotion when the predicted star rating is 3.

Modeling Ideas
SVM is a supervised machine learning model that can track the differences between text comments and star ratings in real time. To carry out emotion analysis on the text comments alone, an unsupervised model is needed to classify the intensity of text emotion. Therefore, this paper establishes an emotion analysis model based on word2vec.
Word2vec is a family of related models used to generate word vectors. It is based on a two-layer neural network trained to reconstruct the linguistic context of words. After training, the word2vec model maps each word to a vector taken from the hidden layer of the network, and these vectors can be used to represent the relationship between word pairs.
Analysis shows that emotional words are usually adjectives and verbs, so only the adjectives, verbs, and negation words in the text comments are extracted for analysis. The word2vec skip-gram model is then trained to obtain the degree of similarity between words. To determine the direction of sentiment, "great" and "disappointed" are selected as root nodes, and a breadth-first search (BFS) is performed from each of the two words to establish a thesaurus of positive emotion and a thesaurus of negative emotion.
Based on these two word bases, a comprehensive evaluation model is established to score the emotional intensity of each comment. The frequency of the words in the text is counted and the most frequent words are selected to form a vocabulary.
A one-hot vector is generated for each word, with a single 1 marking the word's position in the vocabulary.

Establish a skip-gram neural network structure
Let the current word be $w_t$; its context is then $w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}$. A neural network consisting of an input layer, a hidden layer and an output layer is established to predict the context given the current word.
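To make the prediction task concrete, the sketch below generates the (current word, context word) training pairs for a window of two words on each side, together with the one-hot encoding of a word. It covers only the data preparation, not the network or its training, and the function names and example sentence are illustrative.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every position in the token list."""
    pairs = []
    for t, center in enumerate(tokens):
        lo = max(0, t - window)
        hi = min(len(tokens), t + window + 1)
        for j in range(lo, hi):
            if j != t:  # the current word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

def one_hot(word, vocab):
    """Vector of zeros with a single 1 at the word's index in the vocabulary."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

tokens = ["this", "dryer", "works", "great"]
pairs = skipgram_pairs(tokens)
vocab = sorted(set(tokens))
encoded = one_hot("great", vocab)
```

Each pair becomes one training example: the one-hot vector of the current word is the input, and the context word is the target.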
3. Use gradient descent method to solve the updated formula.
4. Divide the positive and negative word bases. With "great" and "disappointed" as the respective root nodes, a breadth-first search is performed over the word similarity matrix. The search depth is set to 100 layers. At each step, the ten words most similar to the parent node are extracted and appended to the thesaurus list, while ensuring that no word is added twice. The parameters were adjusted several times so that the two word libraries contain a similar number of words; the resulting libraries contain 223 and 220 words respectively. Partial results are shown in Table 5. Let F(x, y) be the degree of similarity between word x and word y. For each comment, let len0 be the total number of words in the comment, len1 the total number of words in the positive thesaurus, and len2 the total number of words in the negative thesaurus, with len1 ≈ len2.
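The thesaurus expansion in step 4 can be sketched as a breadth-first search over a word similarity table. This is a minimal illustration and not the original program: the similarity table is a hand-written toy dictionary, whereas in the paper it would come from the trained word2vec model, and the function name `build_thesaurus` and its parameters are assumptions.

```python
from collections import deque

def build_thesaurus(root, similarity, top_n=10, max_depth=100):
    """Expand a word list from `root` by breadth-first search.

    `similarity` maps each word to a dict of {neighbor: similarity score}.
    At each node the `top_n` most similar neighbors are considered, and a
    word is never added twice.
    """
    thesaurus = [root]
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        word, depth = queue.popleft()
        if depth >= max_depth:
            continue
        ranked = sorted(similarity.get(word, {}).items(),
                        key=lambda item: item[1], reverse=True)[:top_n]
        for neighbor, _score in ranked:
            if neighbor not in seen:
                seen.add(neighbor)
                thesaurus.append(neighbor)
                queue.append((neighbor, depth + 1))
    return thesaurus

# Toy similarity table (made up for this sketch).
toy_similarity = {
    "great": {"good": 0.9, "nice": 0.8, "bad": 0.1},
    "good": {"great": 0.95, "excellent": 0.85},
    "nice": {"good": 0.7},
}
positive_words = build_thesaurus("great", toy_similarity, top_n=2)
```

With `top_n=2`, the weakly related word "bad" is never reached from "great", which illustrates how restricting each node to its most similar neighbors keeps the two thesauri separated.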
Then the numbers of negative and positive words contained in the comment text are counted. If the number of positive words is greater than the number of negative words, the comment tends to be positive; if it is smaller, the comment tends to be negative. This paper uses Emotion to measure the degree of emotional tendency and defines a corresponding evaluation method. All comments are then standardized into five levels of emotional bias, {-1, -0.5, 0, 0.5, 1}, where positive values mean positive emotion and negative values mean negative emotion.
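The counting and standardization step can be illustrated as follows. Only the comparison of positive and negative word counts and the five output levels come from the description above. The normalized difference used as the raw score and the cut-off values in `discretize` are assumptions made for this sketch, as are the function names and the toy word sets.

```python
def discretize(raw):
    """Map a raw score in [-1, 1] to one of five levels (assumed cut-offs)."""
    if raw > 0.5:
        return 1.0
    if raw > 0:
        return 0.5
    if raw == 0:
        return 0.0
    if raw >= -0.5:
        return -0.5
    return -1.0

def emotion_score(tokens, positive_words, negative_words):
    """Score a tokenized comment from its positive and negative word counts."""
    pos = sum(1 for w in tokens if w in positive_words)
    neg = sum(1 for w in tokens if w in negative_words)
    if pos + neg == 0:
        return 0.0  # no emotional words found: treated as neutral
    raw = (pos - neg) / (pos + neg)  # assumed normalization, in [-1, 1]
    return discretize(raw)

# Toy thesauri (made up for this sketch).
positive = {"great", "good", "love"}
negative = {"disappointed", "bad", "broke"}

mixed = emotion_score(["great", "good", "but", "broke"], positive, negative)
bad = emotion_score(["disappointed", "bad"], positive, negative)
neutral = emotion_score(["arrived", "today"], positive, negative)
```

A comment with two positive words and one negative word has a raw score of 1/3 and falls in the 0.5 level under these assumed cut-offs.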

Influence of Cumulative Special Scores
First, using the emotion analysis model based on word2vec and breadth-first search, the degree of emotional inclination of every text comment was quantified on a scale from -1 to 1. To reflect the relationship between cumulative star ratings and reviews, product parents with a large amount of data were selected for analysis. Statistics over all product parents showed that high ratings far outnumber low ratings. After accumulation, the volume of low scores never exceeds that of high scores, so the influence of accumulated high scores on the emotional tendency of current text comments is analyzed here.
2. Aggregate the time by month, count the occurrences of each star rating value, and sum the daily emotion scores within each month.
3. Goods that have just been put on the shelves have no cumulative value for earlier months, so these data are not considered; for the other months, the cumulative value of five-star ratings over the preceding nine months is calculated.
4. Low and high ratings influence each other. For example, the cumulative value of low scores may sometimes be large, but if the cumulative value of high scores far exceeds it, the overall emotion of the text comments in that month still tends to be positive. Therefore, the analysis uses the difference between the cumulative number of star ratings greater than or equal to 4 and the cumulative number of star ratings less than or equal to 2, and examines its relationship with emotional tendency.
5. Scale the cumulative monthly star rating difference to the range -1 to 1, so that it matches the emotion range of -1 to 1.
6. Compare the relationship between cumulative score and text emotion, as shown in Figures 2 and 3.
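Steps 2 to 5 can be sketched as below. This is a simplified illustration under stated assumptions: reviews are given as (month, star rating) pairs with months formatted as "YYYY-MM", the running total includes the current month, and scaling divides by the largest absolute cumulative value. The function name and the sample reviews are hypothetical.

```python
from collections import defaultdict
from itertools import accumulate

def cumulative_rating_difference(reviews):
    """Return (months, scaled cumulative high-minus-low rating difference).

    High ratings are >= 4 stars and low ratings are <= 2 stars; 3-star
    reviews leave the difference unchanged.
    """
    monthly = defaultdict(int)
    for month, stars in reviews:
        if stars >= 4:
            monthly[month] += 1
        elif stars <= 2:
            monthly[month] -= 1
        else:
            monthly[month] += 0  # keep the month on the timeline
    months = sorted(monthly)
    cumulative = list(accumulate(monthly[m] for m in months))
    peak = max((abs(v) for v in cumulative), default=0)
    # Assumed scaling: divide by the largest absolute value to land in [-1, 1].
    scaled = [v / peak if peak else 0.0 for v in cumulative]
    return months, scaled

# Made-up reviews for illustration.
sample = [("2015-01", 5), ("2015-01", 1), ("2015-02", 4),
          ("2015-02", 5), ("2015-03", 3), ("2015-03", 5)]
months, scaled = cumulative_rating_difference(sample)
```

The scaled series can then be plotted against the monthly emotion score on the same -1 to 1 axis.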

Figure 3. Cumulative Scoring and Emotion Analysis
As can be seen from the figure, the text emotion of the two commodities is positive overall, and the emotional tendency of each month is greater than 0, indicating that consumers have consistently liked the two commodities. As the cumulative star rating increases, the emotion of each month remains on the positive side, indicating that the earlier accumulation of high ratings inclines consumers in the current month toward positive emotion about the product. By the same reasoning, this paper infers that an accumulation of low scores will incline consumers toward negative comments on the product in the current month.

The Relationship between Rating and Special Text
After analyzing the influence of the cumulative score on text comments, this paper further analyzes the relationship between the score and special text. To ensure the validity and representativeness of the data, the comments of verified buyers and green label reviewers are selected. Positive emotional correlation analysis was then conducted for comments with a star rating of 4 or 5, negative emotional correlation analysis for comments with a star rating of 1 or 2, and a correlation analysis between star ratings and emotions for the overall sample. The results are shown in Table 6. The Comprehensive Correlation degree shows that both positive emotion and negative emotion are highly correlated with star rating, and that the correlation between negative emotion and star rating is somewhat higher than that between positive emotion and star rating. The Gray Relative Correlation reflects how closely the trends of two sequences move together, and it shows that changes in positive emotion are more closely related to changes in star rating.
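As a rough illustration of how a grey correlation between a rating sequence and an emotion sequence can be computed, the sketch below implements Deng's grey relational grade. This is a related but simpler measure than the comprehensive and relative correlation degrees reported in Table 6, so it should not be read as the exact method used here. Min-max normalization and the distinguishing coefficient rho = 0.5 are conventional choices assumed for the example, and the sequences are made up.

```python
def grey_relational_grade(reference, comparison, rho=0.5):
    """Deng's grey relational grade between two equal-length sequences."""
    def normalize(seq):
        lo, hi = min(seq), max(seq)
        if hi == lo:
            return [0.0] * len(seq)
        return [(v - lo) / (hi - lo) for v in seq]

    ref = normalize(reference)
    cmp_seq = normalize(comparison)
    deltas = [abs(a - b) for a, b in zip(ref, cmp_seq)]
    d_min, d_max = min(deltas), max(deltas)
    if d_max == 0:
        return 1.0  # the normalized sequences coincide
    # Grey relational coefficient at each point, then the mean as the grade.
    coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in deltas]
    return sum(coeffs) / len(coeffs)

stars = [1, 2, 3, 4, 5]
emotion_same_trend = [-1.0, -0.5, 0.0, 0.5, 1.0]
emotion_opposite = [1.0, 0.5, 0.0, -0.5, -1.0]

grade_same = grey_relational_grade(stars, emotion_same_trend)
grade_opposite = grey_relational_grade(stars, emotion_opposite)
```

A grade close to 1 indicates that the two sequences follow nearly the same shape after normalization, while lower values indicate diverging trends.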

Principal Component Model
The principle of principal component analysis is to recombine the original variables into a new set of integrated variables that are uncorrelated with each other, and then, according to actual needs, to retain a small number of these integrated variables that reflect as much of the information in the original variables as possible. The specific steps are shown in Figure 4.

Figure 4. Flow Chart of Principal Component Analysis
Based on the comprehensive evaluation method of principal component analysis (PCA) described above, this paper establishes a principal component model to evaluate the reputation of commodities, using eight variables covering the five levels of star rating, the number of helpful comments and the emotion analysis results.
The first few eigenvalues of the correlation coefficient matrix and their contribution rates are obtained from the hair dryer index data, as shown in Tables 7 and 8. The extraction method is principal component analysis.
The cumulative contribution rate of the first four components reaches 87.50%, so the first four components are selected as the principal components to obtain the reputation expression. In the same way, reputation expressions are obtained for the microwave oven with ID 423421857 and the pacifier with ID 450475749. The analysis shows that although the reputation value of the microwave oven is low, it is on an upward trend, so the product has some room for appreciation.
For the hair dryer, although the average reputation value is higher, it shows a downward trend.
Therefore, it cannot be considered a success. The pacifier, however, has a high average reputation value and shows an upward trend, making it a typical successful commodity.
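The rule for choosing the number of principal components from the cumulative contribution rate can be sketched as follows. The eigenvalues below are hypothetical, chosen only so that eight standardized variables give a cumulative rate of 87.5% for the first four components; they are not the values in Tables 7 and 8. The 85% threshold is a common convention assumed for the example.

```python
def select_components(eigenvalues, threshold=0.85):
    """Return (k, cumulative rates) for the smallest k reaching the threshold."""
    total = sum(eigenvalues)
    ordered = sorted(eigenvalues, reverse=True)
    cumulative = 0.0
    rates = []
    for k, value in enumerate(ordered, start=1):
        cumulative += value / total  # contribution rate of this component
        rates.append(cumulative)
        if cumulative >= threshold:
            return k, rates
    return len(ordered), rates

# Hypothetical eigenvalues of an 8 x 8 correlation matrix (they sum to 8).
hypothetical_eigenvalues = [3.0, 2.0, 1.2, 0.8, 0.4, 0.3, 0.2, 0.1]
n_components, cumulative_rates = select_components(hypothetical_eigenvalues)
```

The retained components are then typically combined into a single reputation score by weighting each component score with its contribution rate.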

The Businessman
1. Businesses can track users' comments in real time, follow up on and deal with consumers' negative comments promptly, and minimize the impact of negative comments.
According to the analysis in this paper, the accumulation of high ratings can greatly affect consumers' impression of the product, while the accumulation of low ratings can lead to more negative reviews, similar to the Matthew effect.
2. Merchants can establish a commodity reputation evaluation system and a product sales forecast system to analyze consumers' shopping preferences and promote marketable products in a timely manner. Based on a product's historical sales and reputation information, merchants can prepare inventory in advance to avoid running out of stock.
3. In any case, producing high quality products at low prices is the only way to win over consumers.
Besides commodity quality itself, producers can also establish an after-sales service system to dispel consumers' doubts and win their favour. In the era of mobile Internet, merchants can take advantage of the traffic dividend, establish a social economy model, and enhance customer loyalty through interaction with consumers.

Platform
1. Establish a mechanism to identify malicious comments and false comments. This prevents merchants from posting false comments to boost the ranking of their stores and goods and thereby disturbing market order.
2. Establish a scientific three-level credit evaluation system. For each transaction on the platform, the platform can enrich the evaluation indicators, such as product quality, merchant attitude and logistics satisfaction. For the validity of a comment, the transaction situation, the transaction value and the degree of recognition by other users can be considered together. For the credit of a business, its sales and praise rate can be considered together.
3. Appropriately invite senior customers to comment on the product for evaluation, and provide reference evaluation for the product. Independent from merchants and platforms, commentators have strong objectivity and authenticity in their comments, which helps platforms and consumers to identify the quality of goods.
4. Establish the entry threshold of the mall, conduct qualification test for the settled merchants, regularly check the business license of the merchants, and timely clear the unqualified merchants.

The Government
The country should face up to the rapid development of electronic commerce and enact sound and effective laws and regulations. The government can strengthen the real-name system and the management of credit files to eliminate the hidden dangers caused by the virtual and anonymous nature of online transactions, so as to safeguard the legitimate rights and interests of consumers, producers and platforms.

Conclusion
To sum up, this paper mines consumers' text comments based on online comment data, and measures consumers' text emotions by combining supervised and unsupervised classification models. Based on the emotion analysis model and the scoring data of commodities, an evaluation model of commodity reputation is established. It provides a basis for consumers to distinguish superior products from inferior ones, a basis for producers to track consumers' preferences and develop marketing strategies, and constructive suggestions for online shopping platforms to build a sound trading environment.