Text Mining With Octoparse
Thursday, August 22, 2019
We live in an age of information explosion. By one widely quoted estimate, the digital universe would hold 44 zettabytes of data by 2020.
According to Domo’s Data Never Sleeps 7.0, a staggering amount of data is created every single minute:
- Twitter users send 511,200 tweets
- 188,000,000 emails are sent
- Google conducts 4,497,420 searches
- 18,100,000 texts are sent
Many people are plagued by information overload. Going through all the news, emails, and tweets each day could take several hours, even though perhaps 80% of it is not information they need. Yet if they simply ignored everything, they would miss the 20% that matters. Finding a way to extract only the useful information is therefore important, and that is where text mining comes in.
What is Text Mining?
Text mining is a technique for extracting high-quality information from large volumes of text. It builds on Natural Language Processing (NLP) and combines it with typical data mining methods such as classification, clustering, and neural networks. Typical text mining applications include sentiment analysis, information extraction, and topic modeling.
How to do a text mining project?
Before starting a text mining project, we first need to obtain raw data. Text acquisition is the first and most important step.
People who want to conduct a text mining project can find many open datasets on platforms such as Kaggle. However, these datasets have been widely used, so it is hard to build a unique project on them. Increasingly, people prefer to build a web spider and scrape first-hand, up-to-date data from the internet.
Many people write their own spiders in Python or other languages to scrape data from websites. Libraries such as BeautifulSoup4, Requests, and Tweepy are widely used. But for those without strong programming skills or a good understanding of web page structure, programming can be the biggest obstacle to their projects. In this case, another option is to use a no-code web scraping tool such as Octoparse.
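For readers curious what the hand-written route involves, here is a minimal sketch of the parsing half of a spider. It uses only Python's standard-library `html.parser` rather than BeautifulSoup4, and it reads an inline HTML string instead of a downloaded page; `ParagraphExtractor`, `extract_paragraphs`, and `SAMPLE_HTML` are names invented for this illustration.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the visible text of every <p> element in an HTML document."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False  # True while we are inside a <p> element
        self._buffer = []           # text pieces of the current paragraph
        self.paragraphs = []        # finished paragraphs, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._buffer.append(data)


def extract_paragraphs(html):
    """Return the text of all <p> elements found in the given HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Made-up page content, standing in for HTML that a spider would download.
SAMPLE_HTML = """
<html><body>
  <h1>Reviews</h1>
  <p>Love this <b>dress</b>!</p>
  <div class="ad">Buy now</div>
  <p>The shirt   fits well.</p>
</body></html>
"""

print(extract_paragraphs(SAMPLE_HTML))
```

A real spider would also need to fetch pages, follow links, and cope with content loaded by JavaScript, which is the part a tool like Octoparse takes care of.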
Humans usually process text by reading it line by line, understanding it, and drawing conclusions. In text mining, the computer instead automatically discards useless information and quantifies the useful text by transforming it into numbers.
1) Linguistic Processing of Texts
In a text mining project, the computer cannot understand the semantics of words; it can only recognize them by their form. The text is therefore first split into units such as sentences or, more often, words, a step called tokenization. Lemmatization or stemming then reduces each word to a base form so that variants such as “dress” and “dresses” are counted together.
After splitting the text into words, we can categorize them by part of speech. Texts also contain tokens that carry little meaning, such as “a”, “the”, or punctuation marks. Such words are called stopwords. The last step in processing the text is to remove all stopwords and keep only the meaningful data.
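The linguistic steps above can be sketched in a few lines of standard-library Python. This is only an illustration: the stopword list is a tiny hand-picked sample, and `naive_stem` is a deliberately crude suffix stripper invented here, not a real stemming algorithm such as the ones shipped with NLP libraries.

```python
import re

# A tiny illustrative stopword list; real lists contain a few hundred words.
STOPWORDS = {"a", "an", "the", "is", "are", "and", "or", "of", "to", "in", "it", "this"}


def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopwords]


def naive_stem(token):
    """Strip a few common English suffixes (hypothetical, very rough helper)."""
    if token.endswith("ss"):
        return token  # leave words like "dress" alone
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three letters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


text = "The dresses are trending, and shoppers loved the new styles!"
tokens = tokenize(text)
kept = remove_stopwords(tokens)
stems = [naive_stem(token) for token in kept]
print(stems)
```

Stems such as "lov" or "styl" are not dictionary words. That is normal for stemming, which only needs to map related word forms to the same string; lemmatization is the alternative when readable base forms matter.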
2) Mathematical Processing of Texts
After splitting the text and removing all stopwords, we can start the mathematical processing, which quantifies the text by transforming it into numbers according to different measures. The most common measure is word frequency, which simply counts how many times each word appears (scikit-learn’s CountVectorizer does this). Other representations include TF-IDF and Word2Vec.
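To make the two simplest measures concrete, here is a small sketch that computes raw word counts and a plain TF-IDF weight using only the standard library. It assumes the textbook formula tf × ln(N / df); libraries such as scikit-learn use a smoothed and normalized variant, so their numbers will differ. The function names and the three toy documents are made up for this example.

```python
import math
from collections import Counter


def term_counts(tokens):
    """Word frequency: how many times each token appears in one document."""
    return Counter(tokens)


def tf_idf(docs):
    """Compute plain TF-IDF weights.

    docs is a list of token lists; the result is one {term: weight} dict
    per document, using weight = (count / doc length) * ln(N / doc frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Three tiny, already tokenized documents (illustrative data only).
docs = [
    ["new", "dress", "new", "style"],
    ["new", "shirt"],
    ["love", "style"],
]

print(term_counts(docs[0]))
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

In the first document, "new" has the highest raw count, but "dress" receives the highest TF-IDF weight because it appears in no other document. That is the point of TF-IDF: it rewards words that are distinctive for a document rather than merely frequent.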
Once the data is processed, we can start our text mining project. Here are some of the most common examples of text mining:
Word cloud: Build a word cloud based on word frequency. All the words appear within a cloud, with high-frequency words drawn larger than low-frequency ones. We will conduct a word cloud analysis later. (e.g. text visualization)
Sentiment analysis: Sentiment analysis identifies the sentiment expressed in opinions based on the words used. A Python library called TextBlob can analyze text and generate a polarity score that indicates how positive or negative it is. (e.g. product or brand monitoring)
Topic modeling: Topic modeling helps us identify the topics of a piece of text. Latent Dirichlet Allocation (LDA) is one example: it models each document as a mixture of topics and each topic as a distribution over words, with Dirichlet priors on both. (e.g. tagging for reviews/news/articles)
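To give a feel for what a sentiment score is, here is a toy lexicon-based scorer written with the standard library only. It is a stand-in for illustration and is not how TextBlob works internally; the word lists and the `polarity` function are assumptions made up for this sketch.

```python
import re

# Tiny hand-made lexicons; real sentiment lexicons hold thousands of entries.
POSITIVE = {"love", "great", "beautiful", "amazing", "good"}
NEGATIVE = {"hate", "ugly", "awful", "bad", "boring"}


def polarity(text):
    """Return a score in [-1, 1]: (positive hits - negative hits) / all hits.

    Texts without any lexicon word get a neutral score of 0.0.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    positive_hits = sum(token in POSITIVE for token in tokens)
    negative_hits = sum(token in NEGATIVE for token in tokens)
    hits = positive_hits + negative_hits
    if hits == 0:
        return 0.0
    return (positive_hits - negative_hits) / hits


for sample in [
    "I love this beautiful dress",
    "Ugly and boring shirt",
    "New shirt in the shop",
]:
    print(sample, "->", polarity(sample))
```

A scorer this simple ignores negation ("not good") and context, which is why real projects rely on a trained model or a library instead.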
Here we will show you how to do a simple word cloud analysis. We will scrape tweets about “#fashion” and build a word cloud to see which words occur most often. Let’s find out what people are saying about fashion on Twitter!
We will use the data extraction tool Octoparse to scrape the tweets. This is the link we want to scrape: https://twitter.com/search?q=fashion&src=typed_query.
First, we enter the link into Octoparse.
On Twitter, we need to keep scrolling down the page to load more content. In Octoparse, we usually handle this infinite scroll by setting a number of scrolls (here, 50) to load tweets about #fashion.
Next, we loop through all the tweets and extract them.
After a few simple steps, we have the tweets we need, and we are ready for a word cloud.
Go to https://wordart.com and paste in the text we just extracted. In Word Art, we can filter out numbers and stopwords and apply stemming to the text. We also remove “fashion”, our search keyword, since it would undoubtedly be the most frequent word.
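For readers who prefer to prepare the text themselves, the same filtering can be approximated offline. The sketch below is an assumption about what such a clean-up might look like, not a description of what Word Art does internally: it strips links, @mentions, numbers, stopwords, and the search keyword, then maps each word's frequency to a font size. The sample tweets, the stopword list, and the size range are all invented.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not Word Art's list).
STOPWORDS = {"a", "the", "is", "my", "this", "in", "for", "and", "of", "to"}


def clean_tweet(text):
    """Lowercase a tweet and return its word tokens without links or mentions."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop links
    text = re.sub(r"@\w+", " ", text)          # drop @mentions
    # Keeping letters only also discards numbers, '#' signs and punctuation.
    return re.findall(r"[a-z]+", text)


def cloud_weights(tweets, keyword, stopwords=STOPWORDS, min_size=12, max_size=72):
    """Map each remaining word to a font size that grows with its frequency."""
    counts = Counter(
        token
        for tweet in tweets
        for token in clean_tweet(tweet)
        if token not in stopwords and token != keyword
    )
    if not counts:
        return {}
    top = max(counts.values())
    return {
        word: round(min_size + (max_size - min_size) * count / top)
        for word, count in counts.items()
    }


# Invented sample tweets, standing in for the scraped data.
tweets = [
    "New #fashion drop: summer dress 2019 https://example.com/a",
    "Love this style! #fashion @somebrand",
    "Street style is my style #fashion",
]

sizes = cloud_weights(tweets, keyword="fashion")
for word, size in sorted(sizes.items(), key=lambda item: -item[1]):
    print(word, size)
```

The most frequent word gets `max_size`, and every other word is scaled linearly against it, which mirrors how a word cloud draws frequent words larger.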
Finally, here is the word cloud for #fashion.
As we can see, “Style” is the most frequently mentioned word. Other high-frequency words include “Dress”, “Love”, “New”, “Shirt”, “Shop”, and “Seventeen”. The words in this cloud give us a simple, direct visualization of the large number of tweets we just scraped.
So let’s make some wild guesses based on the cloud. Fashion and style are often seen as opposites: fashion is what a large group of people adores or follows, while style is unique to relatively few people. Since “Style” is so widely discussed on Twitter, we could reasonably guess that more people are interested in sharing their own styles.
Whom do we follow on Twitter? Celebrities. Their fashion styles are unique to them, but their words and tweets become popular. As more people pay attention to a celebrity’s style, it can become a new fashion trend. The influence of celebrities on fashion is huge.
To get an idea of what people are saying about #fashion on Twitter, we don't need to spend a lot of time reading each tweet. Instead, the word cloud gives an intuitive sense of the topics in the text. By linking several high-frequency words, we can even make some wild guesses about the text and try to infer possible conclusions.
We could come up with many other interesting guesses with the help of a word cloud. Usually, though, we would combine a word cloud with other analyses such as sentiment analysis or topic modeling to dig out more clues hidden in the text. The world of text is amazing. Why not start your data project now with Octoparse!
Author: Eric W.