Master Project Developer Notes
Data Visualization of User Engagements
I want to visualize user engagement in terms of the number of comments, suggestions, posts, votes and page views for a selected set of stories from each category.
Data Visualization of Author and Story-telling
I want to visualize the growth of content in each category, such as the number of stories and the number of words. The number of words published per day is an important measure for studying the platform and the authors' story-telling process. The number of words generated per day over a period of time can also reveal trends on the digital publishing platform.
Data Visualization of Content
Visualizing the number of chapters and stories published each month, together with their page views, over a period of several months can give readers an overview of how fast content is generated and how much it contributes to overall page views. Changes in ranking, particularly in the red list and the black list, are another factor for evaluating the quality of content and its engagement.
Profitability of the Platform
To show the advantage of this new publishing platform, it is important to study its profitability or the growth of its market share. I want to visualize the number of ads and the profits generated by the platform. I am also interested in studying the profitability of a selected set of authors. I want to visualize the words they produce each month and the profits they generate each month, and then map these authors' figures onto the same monthly word counts and profits for the entire platform.
Since most stories on the platform are free of charge, there will be limited paid content and data available for this analysis.
I can use the Data-Driven Documents (D3) JavaScript library or Google Chart Tools to implement the above four visualizations. I might need to clean the data and write it into an appropriate format, such as JSON or CSV. The current data is in comma-separated values (CSV) format, as shown in figure 1.
Figure 1: Story statistic data format.
I can use Python or Java to read these text files and reorganize them into formats that the visualization libraries recognize. To read and reorganize these data files using Python, I need to import the following libraries:
If any data or content is missing, I might need to import the BeautifulSoup library to extract additional data from the website's pages. Most visualization libraries require developers to provide data in JSON or CSV format.
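A minimal sketch of the reorganization step described above, assuming only Python's standard `csv` and `json` modules. The column names (`story_id`, `category`, `page_views`, `comments`) and the sample rows are hypothetical placeholders, not the real story statistics format from figure 1.

```python
import csv
import io
import json


def csv_to_records(csv_text):
    """Parse CSV text with a header row into a list of dictionaries."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        # Convert fields made only of digits to int; keep the rest as strings.
        records.append({key: int(value) if value.isdigit() else value
                        for key, value in row.items()})
    return records


# Hypothetical sample; the real file may use different columns.
SAMPLE = """story_id,category,page_views,comments
101,fantasy,1200,35
102,romance,860,12
"""

records = csv_to_records(SAMPLE)
# ensure_ascii=False keeps Chinese text readable in the JSON output.
json_text = json.dumps(records, ensure_ascii=False, indent=2)
print(json_text)
```

For a real file, the same function could be fed the result of `open(path, encoding="utf-8").read()`; the right encoding depends on how the data was exported.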
I recently found a number of visualizations in the D3 example gallery that are suitable for comparing the page views or the number of comments of each category. Some examples are the D3 Show Reel, Multi-Series Line Chart, Difference Chart and Crossfilter. I decided to test the D3 Show Reel visualization and implemented a Python app, as shown in figure 2, that reads the page view data and writes it out in a format supported by the D3 Show Reel visualization, as shown in figure 3.
Figure 2: Python app for reading page view data and writing it to CSV format.
Figure 3: Output file in the format read by the D3 Show Reel visualization.
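The app in figure 2 is not reproduced here. As a rough sketch of the same idea, the code below flattens per-category monthly page views into a long-format CSV. It assumes the visualization reads `symbol`, `date` and `price` columns with dates such as `Jan 2013`; that layout is an assumption about the example's sample data, so the header and date format should be adjusted to whatever the chart code actually parses. The category names and numbers are made up.

```python
import csv
import io

MONTH_ABBR = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def to_long_rows(page_views):
    """Flatten {category: {"YYYY-MM": views}} into (symbol, date, price) rows."""
    rows = []
    for category in sorted(page_views):
        for month in sorted(page_views[category]):
            year, month_number = month.split("-")
            # Reformat "2013-01" as "Jan 2013".
            label = f"{MONTH_ABBR[int(month_number) - 1]} {year}"
            rows.append((category, label, page_views[category][month]))
    return rows


def write_long_csv(rows):
    """Serialize the rows as CSV text with an assumed header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(("symbol", "date", "price"))
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical page view counts per category and month.
PAGE_VIEWS = {
    "romance": {"2013-01": 860, "2013-02": 910},
    "fantasy": {"2013-01": 1200, "2013-02": 1500},
}

rows = to_long_rows(PAGE_VIEWS)
output = write_long_csv(rows)
print(output)
```

One row per category and month keeps the file easy to regroup on the JavaScript side, which is why a long layout is used here instead of one column per category.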
The D3 Show Reel visualization for page views is shown in the following figures. The visualization animates transitions between a number of chart types.
Key Word Evolution and Frequency Analysis
I plan to study the evolution of frequently used words in the user-enhanced stories and user-generated comments using the TF-IDF algorithm. These bags of words can explain the trends in each category and the interests of the audience.
I can use the Gensim topic-modeling tool to do the TF-IDF analyses. Gensim is a free Python framework designed to automatically extract semantic topics from documents. Gensim aims at processing raw, unstructured digital texts in English. The algorithms in Gensim, such as Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA) or Term Frequency–Inverse Document Frequency (TF-IDF), discover the semantic structure of documents by examining statistical word co-occurrence patterns within a corpus of training documents.
Since all content is in Chinese, I might need a tool that can segment Chinese words and phrases, such as jieba. Jieba is a Python Chinese word segmentation module. To test the tool, I downloaded the opinion articles published by the People’s Daily Online since 2010 and organized them into monthly collections. I ran the jieba text segmentation tool using Python on the grouped article collections and performed a TF-IDF analysis on the result. Here is the link to all opinion articles from People’s Daily since 2010: http://opinion.people.com.cn/GB/8213/49160/49179/index.html
After downloading all articles from the site, I read all txt files into Python:
Figure 4: Reading People’s Daily opinion text file into Python.
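The code in figure 4 is not reproduced here. A minimal sketch of reading the text files and grouping them by month could look like the following. It assumes a hypothetical naming scheme in which every file name starts with the publication date (`YYYY-MM-DD_...txt`) and that the files are UTF-8 encoded; neither detail comes from the original app. The demo writes three tiny files into a temporary folder so that it runs on its own.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def read_monthly_collections(folder):
    """Return {"YYYY-MM": concatenated text} for all .txt files in folder."""
    monthly = defaultdict(list)
    for path in sorted(Path(folder).glob("*.txt")):
        # Assumed naming scheme: "2010-01-15_title.txt" -> month "2010-01".
        month = path.name[:7]
        monthly[month].append(path.read_text(encoding="utf-8"))
    return {month: "\n".join(texts) for month, texts in monthly.items()}


# Demo with made-up files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "2010-01-05_a.txt").write_text("第一篇", encoding="utf-8")
    Path(tmp, "2010-01-20_b.txt").write_text("第二篇", encoding="utf-8")
    Path(tmp, "2010-02-03_c.txt").write_text("第三篇", encoding="utf-8")
    collections_by_month = read_monthly_collections(tmp)

print(sorted(collections_by_month))
```

Sorting the paths before reading keeps the articles within a month in date order, which makes the monthly collections reproducible between runs.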
I implemented a TF-IDF analysis application using the jieba library:
Figure 5: TF-IDF analysis app using jieba.
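The app in figure 5 is not reproduced here. To make the scoring step concrete, the sketch below computes a textbook, unsmoothed TF-IDF (term frequency times the natural log of the inverse document frequency) in plain Python. It assumes each monthly collection has already been segmented into a list of words, for example by jieba; it does not call jieba, and it is not necessarily the weighting that the original app or jieba's own keyword extraction uses. The sample tokens are made up.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Score terms per document. documents maps a name to a list of tokens."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))
    scores = {}
    for name, tokens in documents.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


def top_terms(term_scores, k=3):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    ranked = sorted(term_scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical pre-segmented monthly collections.
DOCS = {
    "2010-01": ["经济", "发展", "经济", "改革"],
    "2010-02": ["改革", "教育", "教育", "发展"],
}

scores = tf_idf(DOCS)
for month in sorted(scores):
    print(month, top_terms(scores[month], k=1))
```

Terms that occur in every monthly collection get a score of zero under this variant, which is the property that lets month-specific vocabulary stand out.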
The TF-IDF results are shown in figure 6. The results show how the prominent terms in the official press change over time.
Figure 6: Result of the TF-IDF analysis using jieba.
I can apply the same algorithm to the user-enhanced stories and predict the trend of user-enhanced self-publishing stories in different categories. I can also visualize the TF-IDF results using a heat map.
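As a sketch of how the TF-IDF results could be prepared for a heat map, the code below arranges per-month scores into a terms-by-months matrix, which is the shape most heat map charts expect. The input structure (`{month: {term: score}}`), the `top_n` cutoff and the sample scores are assumptions for illustration; drawing the chart itself is left to the visualization library.

```python
def heatmap_matrix(scores_by_month, top_n=3):
    """Build (terms, months, matrix) with one row per term, one column per month."""
    months = sorted(scores_by_month)
    # Keep the top_n terms of every month, in first-seen order.
    terms = []
    for month in months:
        ranked = sorted(scores_by_month[month].items(),
                        key=lambda item: (-item[1], item[0]))
        for term, _ in ranked[:top_n]:
            if term not in terms:
                terms.append(term)
    # A term missing from a month gets a score of 0.0 in that cell.
    matrix = [[scores_by_month[month].get(term, 0.0) for month in months]
              for term in terms]
    return terms, months, matrix


# Hypothetical TF-IDF scores per month.
SCORES = {
    "2010-01": {"经济": 0.35, "改革": 0.10},
    "2010-02": {"教育": 0.35, "改革": 0.05},
}

terms, months, matrix = heatmap_matrix(SCORES, top_n=1)
for term, row in zip(terms, matrix):
    print(term, row)
```

Restricting the rows to each month's top terms keeps the heat map small enough to read while still showing when a term rises or fades.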