Machine Learning Part 8: Natural Language Processing (NLP)

Natural Language Processing (NLP)

Natural Language Processing (NLP) is a field of study that focuses on enabling computers to understand and process human language. It plays a vital role in various applications such as language translation, sentiment analysis, chatbots, and text classification. In this article, we will explore the fundamentals of NLP and delve into essential topics like text preprocessing techniques, bag-of-words, word embeddings, sentiment analysis, and text classification. Additionally, we will provide Python code examples and examine the corresponding outputs to illustrate the concepts effectively.

Text Preprocessing Techniques

Before applying any NLP algorithms, it is crucial to preprocess the raw text data. Text preprocessing involves several steps, including tokenization, stopword removal, stemming or lemmatization, and handling special characters or symbols. Python provides powerful libraries such as NLTK (Natural Language Toolkit) and SpaCy for implementing these techniques. The code snippet below demonstrates tokenization, stopword removal, and stemming; a short sketch of lemmatization and special-character handling follows the code explanation.

First, install the NLTK library by running the command below:

pip install nltk

import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Tokenization
text = "This is an example sentence for tokenization."
tokens = word_tokenize(text)
print("Tokenization:")
print(tokens)

# Stopword removal
stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
print("\nStopword Removal:")
print(filtered_tokens)

# Stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
print("\nStemming:")
print(stemmed_tokens)

Output

Tokenization:
['This', 'is', 'an', 'example', 'sentence', 'for', 'tokenization', '.']

Stopword Removal:
['example', 'sentence', 'tokenization', '.']

Stemming:
['exampl', 'sentenc', 'token', '.']

The code tokenizes the text, removes stopwords, and applies stemming to the remaining tokens.

These techniques are often employed as initial steps in text analysis and natural language processing tasks to clean and transform textual data into a more suitable format for further analysis or modeling.

The specific techniques used in the code snippet are as follows:

Tokenization: The text is tokenized using the word_tokenize() function from the NLTK library, which splits the input sentence into individual words or tokens. The resulting tokens are stored in the tokens variable.

Tokenization: [‘This’, ‘is’, ‘an’, ‘example’, ‘sentence’, ‘for’, ‘tokenization’, ‘.’]

In tokenization, the sentence is divided into individual tokens or words. Each word is considered a separate token, including punctuation marks.

Stopword Removal: A set of stopwords specific to the English language is obtained using the stopwords.words(‘english’) function from NLTK. The list comprehension [token for token in tokens if token.lower() not in stop_words] is used to filter out stopwords from the tokens, resulting in a list of filtered tokens stored in the filtered_tokens variable.

Stopword Removal: [‘example’, ‘sentence’, ‘tokenization’, ‘.’]

In stopword removal, common words known as stopwords are eliminated from the tokenized list. In this case, the stopwords ‘This’, ‘is’, ‘an’, and ‘for’ are removed, and only meaningful words like ‘example’, ‘sentence’, ‘tokenization’, and the punctuation mark ‘.’ remain.

Stemming: The Porter stemming algorithm is applied to each token in the filtered_tokens list using the PorterStemmer().stem(token) function from NLTK. This step reduces words to their base or root form, aiming to unify similar terms. The stemmed tokens are stored in the stemmed_tokens variable.

Stemming: [‘exampl’, ‘sentenc’, ‘token’, ‘.’]

In stemming, the words are reduced to their base or root form. The words ‘example’ and ‘sentence’ are stemmed to ‘exampl’ and ‘sentenc’, respectively. ‘Tokenization’ is stemmed to ‘token’, and the punctuation mark ‘.’ remains unchanged.

Code Explanation

The code utilizes the NLTK (Natural Language Toolkit) library to perform tokenization, stopword removal, and stemming. The following steps are executed:

Step 1: Importing the necessary libraries

  • The `nltk` library is imported for natural language processing tasks.
  • The ‘punkt’ resource from `nltk` is downloaded using the `nltk.download()` function, which provides data for tokenization.
  • The ‘stopwords’ resource from `nltk` is downloaded using the `nltk.download()` function, which provides a list of common stopwords.
  • The `stopwords` module is imported from the `nltk.corpus` module, allowing access to the stopwords.
  • The `PorterStemmer` class is imported from the `nltk.stem` module, which provides stemming capabilities.
  • The `word_tokenize` function is imported from the `nltk.tokenize` module for tokenization.

Step 2: Tokenization

  • A sample text is defined as a string variable named ‘text’.
  • The `word_tokenize()` function is applied to the ‘text’ variable, which tokenizes the text into individual words.
  • The tokens are stored in a list variable named ‘tokens’.
  • The tokenized words are printed using the `print()` function.

Step 3: Stopword Removal

  • A set of stopwords in English is created using the `set()` function and the `stopwords.words('english')` call from the `stopwords` module.
  • A list comprehension keeps only the tokens that are not present in the set of stopwords (i.e., the stopwords themselves are filtered out), and the remaining tokens are stored in the ‘filtered_tokens’ variable.
  • The filtered tokens are printed using the `print()` function.

Step 4: Stemming

  • A `PorterStemmer` object named ‘stemmer’ is created.
  • A list comprehension is used to apply stemming to each filtered token using the `stem()` method of ‘stemmer’, and the stemmed tokens are stored in the ‘stemmed_tokens’ variable.
  • The stemmed tokens are printed using the `print()` function.
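The code above shows stemming only. The introduction also mentions lemmatization and the handling of special characters, so here is a minimal sketch of both, reusing the `filtered_tokens` list from the example above. It uses NLTK's `WordNetLemmatizer` (which requires the ‘wordnet’ resource) and a simple regular expression for stripping non-alphanumeric characters. For these particular tokens the lemmatizer returns the words unchanged, because they are already dictionary base forms; unlike stemming, lemmatization always yields real words.

import re
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

# Reuse the filtered tokens produced by the preprocessing example above
filtered_tokens = ['example', 'sentence', 'tokenization', '.']

# Lemmatization: reduce each word to its dictionary (base) form
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
print("Lemmatization:")
print(lemmatized_tokens)

# Handling special characters: remove anything that is not a letter, digit, or whitespace
raw_text = "Hello!!! This costs $5 (approx.)"
clean_text = re.sub(r'[^A-Za-z0-9\s]', '', raw_text)
print("\nSpecial characters removed:")
print(clean_text)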

Bag-of-Words and Word Embeddings

Bag-of-Words (BoW) is a simple and effective technique to represent text data numerically. It creates a vocabulary of unique words from the corpus and constructs feature vectors based on word frequencies. On the other hand, word embeddings, such as Word2Vec and GloVe, capture the semantic meaning of words by representing them as dense vectors in a continuous space. Let’s take a look at an example of both approaches using Python:

First, install the Gensim library (scikit-learn is also required for CountVectorizer) by running the command below:

pip install gensim scikit-learn

import nltk
nltk.download('punkt')
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

# Bag-of-Words
corpus = ["This is the first document.",
          "This document is the second document.",
          "And this is the third one.",
          "Is this the first document?"]
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(corpus)

print("Bag-of-Words:")
print(vectorizer.get_feature_names_out())
print(bow_matrix.toarray())

# Word Embeddings (Word2Vec)
sentences = [word_tokenize(sentence) for sentence in corpus]
model = Word2Vec(sentences, min_count=1)

print("\nWord Embeddings (Word2Vec):")
print(model.wv['document'])

Output

Bag-of-Words:
['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]

Word Embeddings (Word2Vec):
[-5.3719373e-04  2.3564025e-04  5.1032072e-03  9.0087978e-03
 -9.3030687e-03 -7.1161659e-03  6.4590201e-03  8.9723496e-03
 -5.0149704e-03 -3.7629646e-03  7.3800068e-03 -1.5337272e-03
 -4.5364527e-03  6.5543428e-03 -4.8607541e-03 -1.8154723e-03
  2.8768894e-03  9.9251140e-04 -8.2848091e-03 -9.4479648e-03
  7.3122070e-03  5.0710961e-03  6.7575406e-03  7.6371187e-04
  6.3518700e-03 -3.4045773e-03 -9.4574125e-04  5.7679312e-03
 -7.5220214e-03 -3.9361469e-03 -7.5112856e-03 -9.2932110e-04
  9.5380470e-03 -7.3192324e-03 -2.3338017e-03 -1.9378344e-03
  8.0773551e-03 -5.9308852e-03  4.5955741e-05 -4.7531622e-03
 -9.6027311e-03  5.0074221e-03 -8.7597482e-03 -4.3925634e-03
 -3.4894314e-05 -2.9593913e-04 -7.6621571e-03  9.6147386e-03
  4.9823090e-03  9.2323041e-03 -8.1581669e-03  4.4963211e-03
 -4.1377009e-03  8.2409283e-04  8.4992992e-03 -4.4621448e-03
  4.5174314e-03 -6.7875157e-03 -3.5483374e-03  9.3987724e-03
 -1.5778032e-03  3.2129668e-04 -4.1407510e-03 -7.6824366e-03
 -1.5074047e-03  2.4690721e-03 -8.8874612e-04  5.5328705e-03
 -2.7421480e-03  2.2598163e-03  5.4561328e-03  8.3449977e-03
 -1.4540153e-03 -9.2086149e-03  4.3703164e-03  5.7112006e-04
  7.4424543e-03 -8.1403070e-04 -2.6381719e-03 -8.7525798e-03
 -8.5636874e-04  2.8255966e-03  5.4020807e-03  7.0528518e-03
 -5.7037808e-03  1.8588157e-03  6.0894871e-03 -4.7974368e-03
 -3.1081438e-03  6.7969938e-03  1.6309945e-03  1.9023121e-04
  3.4745417e-03  2.1739620e-04  9.6181417e-03  5.0611519e-03
 -8.9166462e-03 -7.0412592e-03  9.0192736e-04  6.3927770e-03]

In the code:

Bag-of-Words

  • The CountVectorizer from sklearn.feature_extraction.text is used to convert the text corpus into a matrix of token counts.
  • The fit_transform method is called on the vectorizer object to transform the corpus into a bag-of-words representation.
  • The feature names are obtained using the get_feature_names_out method of the vectorizer object.
  • The bag-of-words matrix is printed using bow_matrix.toarray().

Word Embeddings (Word2Vec)

  • The Word2Vec model from gensim.models is used to train word embeddings on the tokenized sentences.
  • The word_tokenize function from nltk.tokenize is used to tokenize the sentences.
  • The Word2Vec model is trained on the tokenized sentences using Word2Vec(sentences, min_count=1).
  • The word embedding vector for the word ‘document’ is obtained using model.wv[‘document’].
  • The output showcases the bag-of-words representation and the word embedding vector for the word ‘document’.

Please note that the output shows the Bag-of-Words matrix and the 100-dimensional Word2Vec vector for the word ‘document’. Because Word2Vec weights are randomly initialized and the corpus here is tiny, the exact vector values will differ between runs.
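To make the Bag-of-Words matrix easier to interpret, the feature names can be paired with the counts for each document, for example with pandas. The following is a small sketch and assumes pandas is installed:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This is the first document.",
          "This document is the second document.",
          "And this is the third one.",
          "Is this the first document?"]
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(corpus)

# One row per document, one column per vocabulary word
df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(df)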

Code Explanation

Step 1: Importing the necessary libraries

  • The `nltk` library is imported for natural language processing tasks.
  • The ‘punkt’ resource from `nltk` is downloaded using the `nltk.download()` function to access tokenization data.
  • The `CountVectorizer` class from the `sklearn.feature_extraction.text` module is imported for converting text documents into a matrix of token counts.
  • The `Word2Vec` class from the `gensim.models` module is imported for generating word embeddings.
  • The `word_tokenize` function from `nltk.tokenize` is imported for tokenizing sentences.

Step 2: Bag-of-Words

  • The `corpus` variable is defined as a list of strings representing the text documents.
  • A `CountVectorizer` object named ‘vectorizer’ is created.
  • The `fit_transform()` method of ‘vectorizer’ is applied to the `corpus`, transforming it into a matrix of token counts called ‘bow_matrix’.
  • The unique feature names (words) in the corpus are obtained using the `get_feature_names_out()` method of ‘vectorizer’.
  • The Bag-of-Words representation is printed using the `print()` function, displaying the feature names and their corresponding counts.

Step 3: Word Embeddings (Word2Vec)

  • The `sentences` variable is created as a list comprehension, tokenizing each sentence in the `corpus` using the `word_tokenize()` function from `nltk.tokenize`.
  • A `Word2Vec` model named ‘model’ is created using the `sentences` and a `min_count` value of 1.
  • The word embedding vector for the word ‘document’ is retrieved using the `wv` attribute of ‘model’.
  • The Word2Vec representation for the word ‘document’ is printed using the `print()` function.
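Once trained, the embeddings can also be queried for similarity. The sketch below reuses the same tiny corpus and calls Gensim's `most_similar()` and `similarity()` methods; on a four-sentence corpus the neighbours are essentially arbitrary, but on a large corpus they reflect semantic relatedness.

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec

corpus = ["This is the first document.",
          "This document is the second document.",
          "And this is the third one.",
          "Is this the first document?"]
sentences = [word_tokenize(sentence) for sentence in corpus]
model = Word2Vec(sentences, min_count=1)

# Nearest neighbours of 'document' in the embedding space
print(model.wv.most_similar('document', topn=3))
# Cosine similarity between two words
print(model.wv.similarity('first', 'second'))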

Sentiment Analysis and Text Classification

Sentiment analysis aims to determine the sentiment or opinion expressed in a piece of text. Text classification, on the other hand, involves categorizing text documents into predefined classes or categories. These tasks are essential in various domains, including social media analysis and customer feedback analysis. Python provides libraries like NLTK and scikit-learn that offer pre-trained models and algorithms for sentiment analysis and text classification. Below is an example showcasing sentiment analysis with NLTK's VADER SentimentIntensityAnalyzer and a simple rule-based text classification built on top of it; a short Naive Bayes sketch with scikit-learn appears at the end of this section.

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('punkt')
nltk.download('vader_lexicon')

# Sentiment Analysis
sentiment_analyzer = SentimentIntensityAnalyzer()
text = "I'm really happy with this product. It exceeded my expectations!"
sentiment_scores = sentiment_analyzer.polarity_scores(text)

print("Sentiment Scores:")
print(sentiment_scores)

# Text Classification
def classify_sentiment(text):
    sentiment_scores = sentiment_analyzer.polarity_scores(text)
    compound_score = sentiment_scores['compound']
    if compound_score >= 0:
        return 'positive'
    else:
        return 'negative'

text = "The product was great!"
classification = classify_sentiment(text)

print("\nText Classification:")
print(classification)

Output

Sentiment Scores:
{'neg': 0.0, 'neu': 0.677, 'pos': 0.323, 'compound': 0.6468}

Text Classification:
positive

The code demonstrates both sentiment analysis and text classification.

Sentiment Analysis:

In the code, the SentimentIntensityAnalyzer from NLTK is used to perform sentiment analysis on a given text. The polarity_scores method of the analyzer provides a dictionary of sentiment scores, including negative (neg), neutral (neu), positive (pos), and compound (compound) scores. These scores indicate the sentiment intensity of the text, where a higher compound score indicates a more positive sentiment.

Text Classification:

For text classification, a custom function classify_sentiment is defined to classify the sentiment of a given text. It utilizes the SentimentIntensityAnalyzer to calculate the compound score for the text. If the compound score is greater than or equal to 0, the text is classified as “positive”; otherwise, it is classified as “negative”.

The example demonstrates how to apply sentiment analysis to evaluate the sentiment of a text and how to perform text classification by defining a custom classification function.
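The threshold of 0 used above is a simplification; the VADER documentation commonly suggests cutoffs of +0.05 and -0.05 with a neutral band in between. A minimal three-way variant, reusing the `sentiment_analyzer` object from the snippet above, might look like this:

def classify_sentiment_three_way(text):
    # Reuses the sentiment_analyzer defined in the example above
    scores = sentiment_analyzer.polarity_scores(text)
    compound = scores['compound']
    # Commonly used VADER cutoffs: >= 0.05 positive, <= -0.05 negative, otherwise neutral
    if compound >= 0.05:
        return 'positive'
    elif compound <= -0.05:
        return 'negative'
    return 'neutral'

print(classify_sentiment_three_way("The product is okay, nothing special."))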

Code Explanation

Step 1: Importing the necessary libraries

  • The `nltk` library is imported for natural language processing tasks.
  • The `SentimentIntensityAnalyzer` class is imported from `nltk.sentiment` for sentiment analysis.

Step 2: Downloading required resources

  • The ‘punkt’ resource from `nltk` is downloaded using the `nltk.download()` function, which provides data for tokenization.
  • The ‘vader_lexicon’ resource from `nltk` is downloaded using the `nltk.download()` function, which provides the VADER (Valence Aware Dictionary and sEntiment Reasoner) lexicon for sentiment analysis.

Step 3: Sentiment Analysis

  • An instance of the `SentimentIntensityAnalyzer` class named ‘sentiment_analyzer’ is created.
  • A sample text is defined as a string variable named ‘text’.
  • The `polarity_scores()` method of ‘sentiment_analyzer’ is applied to the ‘text’, which calculates the sentiment scores.
  • The sentiment scores are printed using the `print()` function.

Step 4: Text Classification

  • A function named ‘classify_sentiment’ is defined that takes a ‘text’ parameter.
  • Inside the function, sentiment scores are calculated using the `polarity_scores()` method of ‘sentiment_analyzer’.
  • The compound score, representing the overall sentiment, is extracted from the sentiment scores.
  • If the compound score is greater than or equal to 0, the function returns ‘positive’; otherwise, it returns ‘negative’.
  • A sample text is defined as a string variable named ‘text’.
  • The ‘classify_sentiment()’ function is called with the ‘text’ as an argument, and the result is stored in the ‘classification’ variable.
  • The classification result is printed using the `print()` function.
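Since this section mentions the Naive Bayes classifier, here is a minimal sketch of supervised text classification using scikit-learn's `MultinomialNB`. The training texts and labels below are hypothetical and only illustrate the workflow; a real application would need a substantially larger labelled dataset.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labelled training data (tiny, for illustration only)
train_texts = ["I love this product", "Absolutely fantastic experience",
               "This is terrible", "I hate it, very disappointing"]
train_labels = ["positive", "positive", "negative", "negative"]

# Bag-of-words features fed into a Multinomial Naive Bayes classifier
classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(train_texts, train_labels)

# Predict labels for unseen texts
print(classifier.predict(["The product was great!", "What a disappointing purchase"]))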

Conclusion

Natural Language Processing encompasses a wide range of techniques and tools that enable machines to understand and analyze human language. In this article, we covered fundamental concepts such as text preprocessing techniques, bag-of-words, word embeddings, sentiment analysis, and text classification. By leveraging Python and its powerful libraries, you can apply these techniques effectively in real-world applications. With the provided code examples and corresponding outputs, you have a solid starting point for exploring NLP and building your own text analysis solutions.
