NLP on Twitter Corpus
April 2022
1871 Words, 11 Minutes
Problem Statement
- According to research, roughly 80% of data is unstructured text, so text analytics is essential for extracting value from it.
- Until recently, categorising text by hand and eyeballing it was the only way to undertake text analytics, which is time-consuming and inefficient.
- Additionally, this is not a practical approach for managing millions of documents, such as the data from Twitter.
- A Twitter corpus can be used to find trends related to a specific keyword, measure brand sentiment or gather feedback about new products and services.
Importing Libraries
- We start the project by importing the required libraries:
- pandas: Used for data manipulation and analysis.
- numpy: Adds support for multi-dimensional arrays and matrices.
- nltk: A suite of libraries for statistical NLP, primarily for the English language.
- tweepy: Python library for accessing the Twitter API.
- re: Provides regular-expression matching and substitution.
import pandas as pd
import numpy as np
import nltk
import re
import string
import tweepy
import os
from tweepy import OAuthHandler
import requests
from tqdm import tqdm
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin, urlparse
from requests_oauthlib import OAuth1
Twitter Authentication Keys
- We will pass the consumer key, consumer secret key, access token and access secret token linked with our account for authentication.
- Then, using the tweepy library, the Twitter corpus will be extracted.
# Replace these placeholders with the keys from your own Twitter developer account
consumer_key = 'YOUR_CONSUMER_KEY'
consumer_secret = 'YOUR_CONSUMER_SECRET'
access_token = 'YOUR_ACCESS_TOKEN'
access_secret = 'YOUR_ACCESS_SECRET'
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)
userID = "@CopywriterMaven"
tweets = api.user_timeline(screen_name=userID, count=300, include_rts=False)
print("Number of Tweets Extracted: {}.\n".format(len(tweets)))
Format the data in the form of a DataFrame
data = pd.DataFrame(
    data=[[len(tweet.text), tweet.text, tweet.created_at, tweet.user.screen_name] for tweet in tweets],
    columns=['Tweet_Length', 'Tweet_Text', 'Tweet_Date', 'UserName']
)
data.head(100)
The output of this is:
Tweet_Length Tweet_Text Tweet_Date UserName
0 59 My answer to @MuckRack Daily's trivia question... 2022-02-14 18:23:10+00:00 CopywriterMaven
1 70 My answer to @MuckRack Daily's trivia question... 2022-02-11 19:47:15+00:00 CopywriterMaven
2 90 My answer to @MuckRack Daily's trivia question... 2022-02-07 18:10:35+00:00 CopywriterMaven
3 62 My answer to @MuckRack Daily's trivia question... 2022-02-03 18:16:52+00:00 CopywriterMaven
4 63 My answer to @MuckRack Daily's trivia question... 2022-01-24 18:17:27+00:00 CopywriterMaven
... ... ... ... ...
95 80 My answer to @MuckRack Daily's trivia question... 2020-12-29 18:04:33+00:00 CopywriterMaven
96 124 Turning Times Square Rocking Eve into the ulti... 2020-12-27 13:39:32+00:00 CopywriterMaven
97 84 My answer to @MuckRack Daily's trivia question... 2020-12-25 18:50:13+00:00 CopywriterMaven
98 34 @alfranken https://t.co/wauy9xnHaP 2020-12-25 17:16:32+00:00 CopywriterMaven
99 73 @CleverTitleTK I had to look up 'waxelene' ...... 2020-12-25 16:47:04+00:00 CopywriterMaven
100 rows × 4 columns
Preprocessing of Tweets
Since the tweets are a form of unstructured data, we need to preprocess them so that they are free of irrelevant content such as extra spaces, punctuation, URLs, etc. We will perform the following steps (a consolidated helper is sketched after this list):
- Removal of URLs
- Conversion to lower case
- Removal of special characters
- Removal of extra spaces
- Tokenization
- Stop word removal
- Stemming
- Lemmatization
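Before walking through each step, here is a minimal sketch of a single helper that chains the same cleaning operations applied in the sections below (the name clean_tweet is only illustrative):
import re
import string

def clean_tweet(text):
    # Apply the cleaning steps used below: drop URLs, lower-case, drop punctuation, collapse spaces
    text = re.sub(r'\b(?:http.*)', '', text)   # remove everything from 'http' onwards
    text = text.lower()
    text = re.sub("’", "", text)                # remove the curly apostrophe
    text = ''.join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(" +", " ", text).strip()      # collapse and trim extra spaces
    return text

# Usage: data.Tweet_Text = data.Tweet_Text.apply(clean_tweet)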
Removal of URLs
# Collect the URLs that appear in each tweet
urls = []
for i in data.Tweet_Text:
    urls.append(re.findall(r'\b(?:http.*)', i))
urls

# Strip the URLs out of each tweet and write the cleaned text back to the DataFrame,
# so the later preprocessing steps operate on URL-free text
remove_urls = []
for i in data.Tweet_Text:
    remove_urls.append(re.sub(r'\b(?:http.*)', '', i))
data['Tweet_Text'] = remove_urls
remove_urls
Converting characters to lower case
data.Tweet_Text = data.Tweet_Text.apply(lambda x: x.lower())
The output is
Tweet_Length Tweet_Text Tweet_Date UserName
0 59 my answer to @muckrack daily's trivia question... 2022-02-14 18:23:10+00:00 CopywriterMaven
1 70 my answer to @muckrack daily's trivia question... 2022-02-11 19:47:15+00:00 CopywriterMaven
2 90 my answer to @muckrack daily's trivia question... 2022-02-07 18:10:35+00:00 CopywriterMaven
3 62 my answer to @muckrack daily's trivia question... 2022-02-03 18:16:52+00:00 CopywriterMaven
4 63 my answer to @muckrack daily's trivia question... 2022-01-24 18:17:27+00:00 CopywriterMaven
Removal of special characters
# Remove the curly apostrophe, which is not part of string.punctuation
data.Tweet_Text = data.Tweet_Text.apply(lambda x: re.sub("’", "", x))
print(string.punctuation)
special = set(string.punctuation)
# Drop every remaining punctuation character
data.Tweet_Text = data.Tweet_Text.apply(lambda x: ''.join(ch for ch in x if ch not in special))
Removal of extra spaces
data.Tweet_Text=data.Tweet_Text.apply(lambda x: x.strip())
data.Tweet_Text = data.Tweet_Text.apply(lambda x: re.sub(" +", " ", x))
Tokenization
- Tokenization is a way of breaking a sentence down into smaller units called tokens, typically split on whitespace or punctuation.
# Build the set of unique tokens across all tweets using a simple whitespace split
all_tweets_text = set()
for tweet in data.Tweet_Text:
    for word in tweet.split():
        all_tweets_text.add(word)  # a set keeps each token only once
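Splitting on whitespace is the simplest form of tokenization; NLTK also provides word_tokenize, which handles punctuation and contractions more carefully. A quick sketch (it needs the punkt tokenizer models):
from nltk.tokenize import word_tokenize
nltk.download("punkt")  # tokenizer models used by word_tokenize

# Tokenize every tweet and flatten the result into a single list of tokens
nltk_tokens = [token for tweet in data.Tweet_Text for token in word_tokenize(tweet)]
nltk_tokens[:20]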
Filtering stop words
words = list(all_tweets_text)
sorted_list = sorted(words)
nltk.download("stopwords")
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
stop_words
filtered_list = []
for word in sorted_list:
    if word not in stop_words:
        filtered_list.append(word)
Stemming
- In linguistic morphology, stemming is the process of reducing inflected words to their word stem, base or root form; the stem need not itself be a valid dictionary word.
from nltk.stem import PorterStemmer
stem_words = []
ps = PorterStemmer()
for i in filtered_list:
    stem_words.append(ps.stem(i))
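To get a feel for what the Porter stemmer does, a quick side-by-side of a few filtered words and their stems (purely illustrative):
# Print a handful of words next to their Porter stems
for word in filtered_list[:10]:
    print(word, "->", ps.stem(word))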
Lemmatization
- Lemmatization is a text normalization technique used in Natural Language Processing (NLP) that reduces a word to its base or dictionary form, known as the lemma.
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet")   # the lemmatizer looks words up in the WordNet corpus
nltk.download("omw-1.4")   # required by newer NLTK versions
lem_words = []
lem = WordNetLemmatizer()
for i in filtered_list:
    lem_words.append(lem.lemmatize(i))
lem_words
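By default WordNetLemmatizer treats every word as a noun; passing a part-of-speech hint can change the result. A small illustration:
# Without a POS hint the lemmatizer assumes a noun and leaves the verb form untouched
print(lem.lemmatize("running"))             # 'running'
# With pos='v' the word is reduced to its verb lemma
print(lem.lemmatize("running", pos="v"))    # 'run'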
POS Tagging
- POS tagging is the process of converting a list of words into a list of tuples, where each tuple has the form (word, tag). The tag is a part-of-speech tag and signifies whether the word is a noun, adjective, verb, and so on.
nltk.download("averaged_perceptron_tagger")  # tagger model used by nltk.pos_tag
stem_pos = nltk.pos_tag(stem_words)
lem_pos = nltk.pos_tag(lem_words)
stem_pos
lem_pos
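The tags can then be used to keep only certain word classes, for example nouns (tags starting with 'NN'); this filter is an illustration, not part of the original pipeline:
# Keep only the lemmatized words tagged as nouns (NN, NNS, NNP, NNPS)
nouns = [word for word, tag in lem_pos if tag.startswith("NN")]
nouns[:10]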
Finding top 10 most frequent words
freq = nltk.FreqDist(lem_words)
top10 = freq.most_common(10)
top10
This gives us the output:
[('family', 2),
('joke', 2),
('movie', 2),
('reason', 2),
('right', 2),
('thing', 2),
('time', 2),
('today', 2),
('word', 2),
('year', 2)]
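FreqDist can also plot the distribution directly (matplotlib is used under the hood; the resulting chart is not reproduced here):
import matplotlib.pyplot as plt

# Line plot of the 10 most frequent lemmatized words
freq.plot(10, cumulative=False)
plt.show()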
Word Cloud Generation
- Word Cloud is a data visualization technique used for representing text data in which the size of each word indicates its frequency or importance.
- Significant textual data points can be highlighted using a word cloud.
- Word clouds are widely used for analyzing data from social network websites.
! pip install wordcloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Generate the cloud from the string form of the top-10 list;
# generating from ' '.join(lem_words) instead would weight words by their corpus frequency
wordcloud = WordCloud(background_color='black').generate(str(top10))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
The output is the generated word cloud.
N-Gram Model
- An N-gram language model predicts the probability of a given N-gram within any sequence of words in the language.
- A good N-gram model can predict the next word in a sentence, i.e. the probability p(w|h) of a word w given its history h.
- Examples of N-grams include unigrams (“This”, “article”, “is”, “on”, “NLP”) and bigrams (“This article”, “article is”, “is on”, “on NLP”).
Uni-Gram
#Uni-gram
from nltk.util import ngrams
n = 1
unigrams = ngrams(lem_words, n)
for item in unigrams:
    print(item)
Bi-Gram
#Bi-gram
from nltk.util import ngrams
n = 2
bigrams = ngrams(lem_words, n)
for item in bigrams:
    print(item)
Tri-Gram
#Tri-gram
from nltk.util import ngrams
n = 3
trigrams = ngrams(lem_words, n)
for item in trigrams:
    print(item)
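To connect these n-grams back to p(w|h), here is a minimal sketch that estimates the conditional probability of a word given the previous word from simple counts (a maximum-likelihood estimate; the helper name and example words are only illustrative):
from collections import Counter
from nltk.util import ngrams

# Count unigrams and bigrams over the lemmatized words
unigram_counts = Counter(lem_words)
bigram_counts = Counter(ngrams(lem_words, 2))

def bigram_probability(prev_word, word):
    # MLE estimate: p(word | prev_word) = count(prev_word, word) / count(prev_word)
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

bigram_probability("family", "time")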