What Are Sentence and Document Embeddings?
Sentence and document embeddings are at the core of Natural Language Processing (NLP). These embeddings are dense, real-valued vectors created so that they capture the meaning of whole sentences, paragraphs, or documents. Word embeddings can only express the meaning of a single word, whereas sentence and document embeddings capture the meaning of larger text structures such as a complete sentence or an entire document. These embeddings form the foundation for applications where contextual understanding matters, for example semantic search, text clustering, sentiment analysis, and document classification.
Why Use Embeddings?
- Semantic understanding is one of the key use cases for embeddings. Traditional approaches such as one-hot encoding or bag-of-words cannot capture the contextual meaning of a sentence, whereas embeddings are trained to represent context and meaning.
- Dimensionality reduction is another benefit of embeddings. Embeddings map high-dimensional text data into lower-dimensional dense real-valued vectors, which take less memory to store and are more efficient to process.
- Improved similarity matching: Because these embeddings capture the contextual meaning of a whole sentence or document, they can be used to compare different sentences or documents, which helps with document categorization and text clustering (see the short cosine-similarity sketch after this list).
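For example, similarity matching can be sketched with a simple cosine-similarity check in NumPy. This is only a minimal illustration: the 4-dimensional vectors below are made-up placeholder embeddings, not the output of a real model (real embeddings typically have hundreds of dimensions).
# pip install numpy
import numpy as np
# Hypothetical placeholder embeddings (illustrative values only)
sentence_a = np.array([0.12, 0.85, -0.33, 0.40])
sentence_b = np.array([0.10, 0.80, -0.30, 0.45])   # similar meaning -> nearby vector
sentence_c = np.array([-0.70, 0.05, 0.60, -0.20])  # unrelated meaning -> distant vector
def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: values near 1 mean very similar
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(cosine_similarity(sentence_a, sentence_b))  # close to 1
print(cosine_similarity(sentence_a, sentence_c))  # much lower (negative here)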
Types of Embeddings and How They Work
There are various types of embeddings used in modern NLP. Each embedding technique has its own characteristics and use cases.
- Word embeddings represent individual words as real-valued vectors. Popular word embedding techniques are:
- Word2Vec was developed by researchers at Google in 2013. It represents each word as a vector such that syntactically and semantically similar words are mapped to nearby points in the vector space; in other words, it captures semantic relationships between words. A minimal gensim sketch after this list shows the Word2Vec API.
- GloVe was developed by researchers at Stanford University. It uses a word co-occurrence matrix, which counts how frequently pairs of words appear together across the whole corpus.
- FastText was developed by Facebook’s AI Research (FAIR) team. It is an extension of the Word2Vec model that uses subword (character n-gram) information to improve over traditional word embeddings.
- Sentence embeddings capture the meaning of all words used in a sentence and summarize them into a single vector that represents the entire sentence. Popular sentence embedding models are:
- Sentence-BERT (SBERT) is an improvement over the BERT model and is particularly useful for semantic similarity tasks.
- Document embeddings are similar to sentence embeddings, but they aggregate the meaning of all sentences in an entire document.
- Doc2Vec embedding is an extension of Word2Vec. Doc2Vec has two variants – Distributed Memory (DM) and Distributed Bag of Words (DBOW).
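As a quick illustration of the word-embedding idea from the Word2Vec bullet above, here is a minimal gensim sketch. It trains Word2Vec on a tiny toy corpus, so the resulting similarities are not meaningful; it only demonstrates the API, and the corpus and parameter values are illustrative assumptions.
# pip install gensim
from gensim.models import Word2Vec
# Toy corpus: each document is a list of tokens (a real corpus would be far larger)
corpus = [
    ["sentence", "embeddings", "capture", "meaning"],
    ["document", "embeddings", "capture", "context"],
    ["word", "embeddings", "represent", "single", "words"],
]
# Train a small Word2Vec model (vector_size, window and min_count are illustrative values)
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, workers=1)
print(model.wv["embeddings"][:5])                    # first 5 dimensions of one word vector
print(model.wv.most_similar("embeddings", topn=2))   # nearest words in this toy vector space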
Generating Sentence and Document Embeddings
Let’s explore how we can generate sentence and document embeddings using different models and popular Python libraries.
Popular Models for Generating Embeddings
- Bag-of-Words (BoW): This model represents text by the frequency of each word in the given text; it gives no weight to grammar or word order.
- TF-IDF (Term Frequency-Inverse Document Frequency): This is a simple model that is effective for document-level features, particularly for information retrieval.
- Doc2Vec: This model is an extension of Word2Vec to capture document-level information.
- Transformers: Models like BERT, RoBERTa, and GPT are popular for fine-tuning on sentence and document embedding tasks. Hugging Face provides easy access to these models.
- Universal Sentence Encoder (USE): This model was developed by Google and is designed to provide general-purpose sentence embeddings.
Bag-of-Words (BoW) Model
This model represents text by the frequency of each word in the given text, giving no weight to grammar or word order. It creates long, vocabulary-sized count vectors in which each position corresponds to a word in the vocabulary. The bag-of-words approach can be summarized in 3 steps:
- Tokenize the Text: Split the given sentences or documents into individual words, where each word is taken as a token.
- Build Vocabulary: Remove duplicate words from the given sentences or documents to create a vocabulary of unique words.
- Create Vectors: Count the number of times each word is used in each sentence or document and store these counts in a vector.
Example Code for Bag-of-Words (BoW) Model
# pip install scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
# Sample sentences
sentences = ["Sentence and document embeddings are fundamental to Natural Language Processing",
"Sentence and document embeddings capture the meaning of text"]
# Initialize CountVectorizer
vectorizer = CountVectorizer()
# Fit and transform sentences to create BoW vectors
bow_vectors = vectorizer.fit_transform(sentences)
# Display results
print("Vocabulary:", vectorizer.vocabulary_)
print("BoW Vectors:\n", bow_vectors.toarray())
# Output
# Vocabulary: {'sentence': 11, 'and': 0, 'document': 3, 'embeddings': 4, 'are': 1, 'fundamental': 5, 'to': 14, 'natural': 8, 'language': 6, 'processing': 10, 'capture': 2, 'the': 13, 'meaning': 7, 'of': 9, 'text': 12}
# BoW Vectors:
# [[1 1 0 1 1 1 1 0 1 0 1 1 0 0 1]
# [1 0 1 1 1 0 0 1 0 1 0 1 1 1 0]]
TF-IDF (Term Frequency-Inverse Document Frequency)
The TF-IDF model is an improvement over BoW. It adjusts word frequencies based on how important each word is within the given text: words that appear frequently in a particular document but rarely in other documents are given more weight. This helps to distinguish documents by their distinctive terms. The TF-IDF approach can be summarized in 3 steps:
- Term Frequency (TF): calculate the frequency of each word (term) in a document. This shows how often a word is used in that document; more frequent words receive a higher score.
- Inverse Document Frequency (IDF): determine the importance of a term across the entire corpus. IDF reduces the weight of terms that are common across many documents; for example, “the” or “is” receive a low IDF score, while rarer terms receive a higher score.
- Calculate the TF-IDF score: multiply TF by IDF for each word to create weighted vectors (a minimal sketch of this formula follows this list).
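Before the scikit-learn example, here is a minimal sketch of the classic TF-IDF formula itself, tf × log(N / df). This sketch assumes the textbook formula; scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each vector by default, so its numbers will not match this sketch.
import math
# Toy tokenised corpus (illustrative only)
docs = [["sentence", "embeddings", "capture", "meaning"],
        ["document", "embeddings", "capture", "context"]]
def tfidf(term, doc, corpus):
    tf = doc.count(term) / len(doc)             # term frequency within this document
    df = sum(1 for d in corpus if term in d)    # number of documents containing the term
    idf = math.log(len(corpus) / df)            # inverse document frequency
    return tf * idf
print(tfidf("embeddings", docs[0], docs))  # 0.0 -- appears in every document
print(tfidf("meaning", docs[0], docs))     # about 0.17 -- unique to the first document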
Example Code for TF-IDF
# pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
# Sample sentences
sentences = ["Sentence and document embeddings are fundamental to Natural Language Processing",
"Sentence and document embeddings capture the meaning of text"]
# Initialize TfidfVectorizer
vectorizer = TfidfVectorizer()
# Fit and transform sentences to create TF-IDF vectors
tfidf_vectors = vectorizer.fit_transform(sentences)
# Display results
print("Vocabulary:", vectorizer.vocabulary_)
print("TF-IDF Vectors:\n", tfidf_vectors.toarray())
# Output
# Vocabulary: {'sentence': 11, 'and': 0, 'document': 3, 'embeddings': 4, 'are': 1, 'fundamental': 5, 'to': 14, 'natural': 8, 'language': 6, 'processing': 10, 'capture': 2, 'the': 13, 'meaning': 7, 'of': 9, 'text': 12}
# TF-IDF Vectors:
# [[0.25116439 0.35300279 0. 0.25116439 0.25116439 0.35300279
# 0.35300279 0. 0.35300279 0. 0.35300279 0.25116439
# 0. 0. 0.35300279]
# [0.26844636 0. 0.37729199 0.26844636 0.26844636 0.
# 0. 0.37729199 0. 0.37729199 0. 0.26844636
# 0.37729199 0.37729199 0. ]]
Doc2Vec
Doc2Vec is an extension of the Word2Vec embedding model, designed to generate embeddings for whole sentences, paragraphs, and documents. Unlike Word2Vec, Doc2Vec captures semantic information during training by considering the context of the whole document, which makes it effective for text similarity and classification tasks. There are two variants of the Doc2Vec embedding model:
- Distributed Memory (DM): This variant uses both document and word vectors to predict the next word within a given context. The idea is to predict words in a context window (a sequence of words) based on the surrounding words and a unique document vector (paragraph ID).
- Distributed Bag of Words (DBOW): This variant ignores word order and treats each document as a unique entity with its own vector (also known as a “document vector” or “paragraph vector”). A short gensim sketch after this list shows how to select each variant.
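As a rough sketch (assuming gensim and a toy corpus of two tagged documents), the variant is chosen with the dm parameter of gensim's Doc2Vec: dm=1 (the default) trains the Distributed Memory model and dm=0 trains DBOW. The documents and parameter values below are illustrative only.
# pip install gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
# Toy tagged documents (a real corpus would be much larger)
docs = [TaggedDocument(words=["sentence", "embeddings", "capture", "meaning"], tags=["0"]),
        TaggedDocument(words=["document", "embeddings", "capture", "context"], tags=["1"])]
# dm=1 -> Distributed Memory (PV-DM), the default variant
dm_model = Doc2Vec(docs, dm=1, vector_size=50, min_count=1, epochs=20)
# dm=0 -> Distributed Bag of Words (PV-DBOW)
dbow_model = Doc2Vec(docs, dm=0, vector_size=50, min_count=1, epochs=20)
print(dm_model.infer_vector(["embeddings", "capture", "meaning"])[:5])
print(dbow_model.infer_vector(["embeddings", "capture", "meaning"])[:5])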
Example Code for Doc2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
# Sample documents
documents = ["Sentence and document embeddings are fundamental to Natural Language Processing",
"Sentence and document embeddings capture the meaning of text"]
# Prepare tagged documents
tagged_docs = [TaggedDocument(words=doc.split(), tags=[str(i)]) for i, doc in enumerate(documents)]
# Train Doc2Vec model
model = Doc2Vec(tagged_docs,
                vector_size=50,  # Dimensions of the document/word vectors
                window=2,        # Maximum distance between the current and predicted word within a sentence
                min_count=1,     # Ignores all words with a total frequency lower than this
                workers=4)       # Use this many worker threads to train the model
# Obtain embeddings for a document
vector = model.infer_vector("Natural Language Processing is fascinating".split())
print("Doc2Vec Embedding:", vector)
# Output
# Doc2Vec Embedding: [ 0.00902946 -0.00518695 -0.00370304 -0.00914641 0.00105161 0.00624333
# -0.00553964 -0.00713476 0.00331546 -0.00070394 -0.00828523 -0.00077945
# -0.00827082 -0.00695372 0.00644662 -0.00731054 0.00989284 -0.00881199
# 0.00061673 0.00813323 0.00850085 0.00954109 0.00507595 0.00174437
# 0.0034364 0.003292 0.00108508 -0.0080248 -0.00848993 0.0077961
# 0.00923685 0.00371492 0.0002801 -0.00664406 -0.00984772 -0.00373417
# 0.00963134 -0.00376201 -0.00385672 0.00354381 -0.00094189 -0.0012201
# 0.00931827 -0.00975919 -0.00565285 0.00674632 0.00764528 0.005375
# 0.0045256 -0.00128563]
Transformer-based Sentence Embeddings
Transformer-based models are the most advanced models in NLP. They are highly effective at capturing long-range dependencies, semantics, and contextual relationships in text, and they generate sentence and document embeddings that support accurate predictions and text generation.
Universal Sentence Encoder (USE)
This model was developed by Google and is designed to provide general-purpose sentence embeddings that are useful for a wide range of NLP tasks. The Universal Sentence Encoder (USE) is trained on diverse types of input, including conversational data, which makes it suitable for many different NLP tasks.
Example Code – Sentence Embeddings with sentence_transformers
Note: the sentence_transformers library does not ship the original USE model; the example below uses the all-MiniLM-L6-v2 model as a lightweight alternative. For the original USE model, see the TensorFlow Hub example that follows.
# pip install sentence_transformers
# pip install tf-keras==2.16.0 --no-dependencies
# pip install tensorflow-cpu==2.16.1
from sentence_transformers import SentenceTransformer
# Load a pre-trained Sentence Transformers model (all-MiniLM-L6-v2)
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# Sample sentences
sentences = ["Sentence and document embeddings are fundamental to Natural Language Processing",
"Sentence and document embeddings capture the meaning of text"]
# Generate sentence embeddings
embeddings = model.encode(sentences)
print("USE Embeddings:\n", embeddings)
# Output
# Sentence Embeddings:
# [[ 2.64762118e-02 -1.29387714e-02 8.34194273e-02 1.39148422e-02
# 4.49235030e-02 5.88709787e-02 8.17405060e-03 1.82299055e-02
# 5.51757738e-02 -5.31634875e-02 1.33789070e-02 2.63189133e-02
# 6.96781501e-02 2.17816774e-02 -1.43869044e-02 4.22834419e-02
# ......
# ......
# 4.95735593e-02 -5.67018799e-03 -2.04745363e-02 -2.92272698e-02
# 1.37664704e-02 5.47013171e-02 3.32472324e-02 -1.30022531e-02
# 2.47585531e-02 1.99648552e-02 -2.18799971e-02 3.97324940e-04
# 1.80490538e-02 6.09547049e-02 9.78075042e-02 -2.97105499e-02]]
Example Code – Universal Sentence Encoder (USE) with TensorFlow
# pip install tensorflow_hub
# pip install tensorflow
import tensorflow as tf
import tensorflow_hub as hub
# Load Universal Sentence Encoder model
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
# Example sentence
sentences = ["Sentence and document embeddings are fundamental to Natural Language Processing",
"Sentence and document embeddings capture the meaning of text"]
# Generate sentence embedding
embedding = embed(sentences)
print("Embedding Shape:", embedding.shape)
print("Embedding:", embedding)
# Output
# Embedding Shape: (2, 512)
# Embedding: tf.Tensor(
# [[-0.03685547 -0.0234395 0.01266829 ... 0.05053782 -0.03162806
# -0.0085871 ]
# [ 0.01263339 -0.04275355 0.06820477 ... 0.04539199 -0.01522196
# 0.04380015]], shape=(2, 512), dtype=float32)
Sentence-BERT (SBERT)
Sentence-BERT (SBERT) is a modified version of BERT that was specifically designed to create sentence embeddings. It uses a Siamese network structure to fine-tune BERT for sentence similarity tasks, which makes it effective for semantic search, clustering, and text similarity.
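For example, clustering with SBERT embeddings can be sketched roughly as follows (assumptions: the paraphrase-MiniLM-L6-v2 model also used in the example below, scikit-learn's KMeans, and a handful of illustrative sentences).
# pip install sentence_transformers scikit-learn
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
# Illustrative sentences covering two rough topics
sentences = ["Sentence embeddings capture the meaning of text",
             "Document embeddings summarize whole documents",
             "The weather today is sunny and warm",
             "Heavy rain is expected tomorrow"]
# Encode the sentences with SBERT
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
embeddings = model.encode(sentences)
# Group the sentences into two clusters based on their embeddings
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
cluster_ids = kmeans.fit_predict(embeddings)
for sentence, cluster_id in zip(sentences, cluster_ids):
    print(cluster_id, sentence)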
Example Code – SBERT with sentence_transformers
# pip install sentence_transformers
# pip install tf-keras==2.16.0 --no-dependencies
# pip install tensorflow-cpu==2.16.1
from sentence_transformers import SentenceTransformer
# Load Sentence-BERT model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
# Sample sentences
sentences = ["Sentence and document embeddings are fundamental to Natural Language Processing",
"Sentence and document embeddings capture the meaning of text"]
# Generate sentence embeddings
sentence_embeddings = model.encode(sentences)
print("SBERT Embeddings:\n", sentence_embeddings)
# Output
# SBERT Embeddings:
# [[-2.72467405e-01 6.55197501e-02 -9.68914405e-02 -2.89553314e-01
# 3.44820827e-01 -2.67881304e-01 -1.62187889e-01 -3.61999929e-01
# 1.37213007e-01 3.67059916e-01 -8.67417082e-02 3.54015231e-02
# 5.43699563e-01 -4.56121042e-02 1.12542279e-01 1.11673616e-01
# 3.48674476e-01 4.24488872e-01 -5.75616896e-01 -5.10968983e-01
# ......
# ......
# 4.44245994e-01 -3.22503656e-01 2.43170336e-01 -3.25154573e-01
# 2.40026981e-01 2.33099937e-01 1.39374211e-01 -2.95164883e-01
# 1.08170450e-01 9.97239575e-02 4.27583307e-01 2.24687885e-02
# -2.58610658e-02 2.58651972e-01 6.08575881e-01 -2.47781843e-01]]
Example Code – SBERT with Hugging Face Transformers
from transformers import AutoTokenizer, AutoModel
import torch
# Load pre-trained Sentence-BERT model and tokenizer
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
# Example sentence
sentence = "Sentence and document embeddings are fundamental to Natural Language Processing"
# Tokenize and encode the sentence
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Extract embeddings from the last hidden state
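# Note: this is simple mean pooling over all token vectors; for padded batches,
# a mask-aware mean (weighting tokens by the attention mask) is usually preferred.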
embedding = outputs.last_hidden_state.mean(dim=1).squeeze()
print("Embedding Shape:", embedding.shape)
print("Embedding:", embedding)
# Output
# Embedding Shape: torch.Size([384])
# Embedding: tensor([ 1.4393e-01, -7.0336e-02, 4.5347e-01, 7.5642e-02, 2.4421e-01,
# 3.2003e-01, 4.4435e-02, 9.9099e-02, 2.9994e-01, -2.8900e-01,
# 7.2729e-02, 1.4307e-01, 3.7878e-01, 1.1841e-01, -7.8208e-02,
# 2.2986e-01, 2.0121e-01, 1.7003e-01, -5.3624e-01, -2.0893e-01,
# ......
# ......
# -4.0043e-02, -4.0156e-01, 4.8177e-02, 2.2681e-01, -1.1275e-01,
# 1.1733e-01, -6.3902e-02, -1.8911e-01, -1.1112e-01, -1.1332e-02,
# -2.4947e-02, 5.0459e-01, 4.4430e-01, 1.8361e-02])
Applications of Sentence and Document Embeddings
- Semantic Search – With the help of sentence embeddings, we can build a semantic search portal. Such a portal can find documents based on the context of the query, rather than just matching query keywords.
from sentence_transformers import SentenceTransformer, util
import numpy as np
# Load a pre-trained Sentence-BERT model
model = SentenceTransformer("all-MiniLM-L6-v2")
# List of documents to search
documents = [
"Machine learning models are revolutionizing healthcare.",
"Natural language processing enables machines to understand text.",
"Data science involves extracting insights from data.",
"Sentence and document embeddings are fundamental to Natural Language Processing",
"Sentence and document embeddings capture the meaning of text"
]
# Example query
query = "How sentence embedding is more effective?"
# Generate embeddings for documents
document_embeddings = model.encode(documents, convert_to_tensor=True)
# Generate embedding for the query
query_embedding = model.encode(query, convert_to_tensor=True)
# Compute cosine similarities
cosine_scores = util.cos_sim(query_embedding, document_embeddings)
# Find the top N most similar documents
top_k = 3 # Number of top documents to retrieve
top_results = np.argsort(-cosine_scores[0])[:top_k]
# Display the top results
print("Query:", query)
print("\nTop Documents:")
for idx in top_results:
    print(f"Document: {documents[idx]}")
    print(f"Score: {cosine_scores[0][idx].item():.4f}")
    print("-" * 50)
# Output
# Query: How sentence embedding is more effective?
#
# Top Documents:
# Document: Sentence and document embeddings are fundamental to Natural Language Processing
# Score: 0.6793
# --------------------------------------------------
# Document: Sentence and document embeddings capture the meaning of text
# Score: 0.6661
# --------------------------------------------------
# Document: Natural language processing enables machines to understand text.
# Score: 0.3207
# --------------------------------------------------
- Document Classification – Sentence and document embeddings can be used for document classification: they help identify which topic a document belongs to and can group similar text entries together (a minimal classification sketch follows this list).
- Sentiment Analysis – Sentence embeddings can be used for sentiment analysis.
- Document Summarization – Sentence embeddings can assist in document or text summarization by helping to identify the most important sentences or phrases in a document.
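As a rough illustration of classification and sentiment analysis on top of sentence embeddings, here is a minimal sketch. The labelled sentences, the labels, and the choice of all-MiniLM-L6-v2 with scikit-learn's LogisticRegression are illustrative assumptions, not a recommended pipeline; a real task needs far more data.
# pip install sentence_transformers scikit-learn
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
# Toy labelled data (hypothetical examples)
texts = ["I love this product, it works great",
         "Absolutely fantastic experience",
         "This is terrible, it broke after one day",
         "Very disappointing purchase"]
labels = ["positive", "positive", "negative", "negative"]
# Encode the texts into sentence embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
X = model.encode(texts)
# Train a simple classifier on top of the embeddings
clf = LogisticRegression(max_iter=1000).fit(X, labels)
# Classify a new sentence
print(clf.predict(model.encode(["The quality is awful"])))  # likely ['negative'] with these toy examples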
Conclusion
Sentence and document embeddings are numerical representations of text that capture its meaning and thereby make it easier for machines to process and understand. These embeddings play an important role in many natural language processing tasks, transforming words, sentences, or documents into dense vectors and enabling fast comparison of semantic similarity.
There are numerous ways to generate embeddings, ranging from traditional models such as Bag-of-Words and TF-IDF, which depend on word frequency, to more advanced ones such as Doc2Vec, which uses document context. More recently, transformer-based models keep improving embedding quality; models such as the Universal Sentence Encoder and Sentence-BERT capture complex relationships between words. These models enable effective sentence and document comparisons that power applications in search, recommendation systems, clustering, and sentiment analysis, where understanding subtle nuances in language is crucial.
LLM code snippets and programs related to Sentence and Document Embeddings in NLP can be accessed from the GitHub Repository. This GitHub repository also contains programs related to other topics in the LLM tutorial.
Related Topics
- Understanding Static Word Embeddings in NLP – A Complete Guide
- Natural Language Processing (NLP) – Comprehensive Guide
- Recurrent Neural Networks (RNNs) – Language Models
- What are N-Gram Models – Language Models