Table of contents
1. Introduction
2. Next Sentence Prediction Using BERT
3. NSP In Code
   3.1. Tokenization
   3.2. Create class label
   3.3. Calculate loss
   3.4. Prediction
4. FAQs
5. Key Takeaways

Sentence Prediction with BERT


Introduction

This article goes into detail on sentence prediction with BERT. BERT stands for Bidirectional Encoder Representations from Transformers and was proposed by Google Research in 2018. Its main goal was to improve the understanding of the meaning of Google Search queries. About 15% of the queries Google receives every day are new, so the search engine needs a far deeper comprehension of language to interpret them.

In addition, BERT is trained on various tasks to increase its language understanding. This article goes over BERT's next sentence prediction (NSP).

Next Sentence Prediction Using BERT

For the next sentence prediction task, BERT is fine-tuned on three types of tasks. In the first type, we have a pair of sentences as input and a single class label as output, as in the following tasks (a short code sketch of this setup appears after the list):

  • MNLI (Multi-Genre Natural Language Inference) is a large-scale classification task. Given a pair of sentences, the goal is to determine whether the second sentence is an entailment, a contradiction, or neutral with respect to the first.
  • QQP (Quora Question Pairs): The aim of this dataset is to determine whether two questions are semantically equivalent.
  • QNLI (Question Natural Language Inference): In this task, the model must decide whether the second sentence is the answer to the question asked in the first sentence.
  • SWAG (Situations With Adversarial Generations): This dataset contains 113k sentence-pair classification examples. The goal is to determine whether the second sentence is a plausible continuation of the first.
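To make the sentence-pair setup concrete, here is a minimal sketch using HuggingFace's BertForSequenceClassification with three labels for an MNLI-style task. The example sentences and variable names are hypothetical, and the classification head is untrained, so the predicted class is arbitrary until the model is fine-tuned on the actual dataset.

from transformers import BertTokenizer, BertForSequenceClassification
import torch

cls_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Three output classes for an MNLI-style task: entailment, neutral, contradiction
cls_model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)

premise = "A man is playing a guitar on stage."   # hypothetical example pair
hypothesis = "A person is performing music."

# The tokenizer packs both sentences into one input: [CLS] premise [SEP] hypothesis [SEP]
cls_inputs = cls_tokenizer(premise, hypothesis, return_tensors='pt')
logits = cls_model(**cls_inputs).logits           # shape (1, 3): one score per class
print(torch.argmax(logits, dim=1))                # predicted class index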

In the second type, we have only one sentence as input, but the output is the same kind of class label. The following tasks/datasets are used:

  • SST-2 (The Stanford Sentiment Treebank): A binary sentence classification task consisting of sentences extracted from movie reviews and annotated with their sentiment. BERT produced state-of-the-art results on SST-2.
  • CoLA (Corpus of Linguistic Acceptability): A binary classification task whose goal is to determine whether a given English sentence is linguistically acceptable.
In the third type, we are given a question and a paragraph, and the model outputs the span of the paragraph that answers the question. The SQuAD (Stanford Question Answering Dataset) v1.1 and v2.0 datasets are used here.

In BERT's input format, the [CLS] token is the first token and marks the start of an input sequence, while the [SEP] token separates the different inputs. The input sentences are tokenized using the BERT vocabulary, and the output is tokenized as well, as the short sketch below shows.
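As a quick illustration of these special tokens, here is a sketch that tokenizes a hypothetical sentence pair with the same bert-base-uncased tokenizer used later in this article and prints the resulting tokens:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Encode a hypothetical sentence pair and convert the ids back into tokens
encoded = tokenizer("It is sunny today.", "We will go outside.")
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
# Expected output (roughly): ['[CLS]', 'it', 'is', 'sunny', 'today', '.', '[SEP]',
#                             'we', 'will', 'go', 'outside', '.', '[SEP]']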

NSP In Code

Let's look at how we can demonstrate NSP in code.

Along with the bert-base-uncased model, we'll be using HuggingFace's transformers and PyTorch. Let's start by importing and initializing everything:

CODE-

from transformers import BertTokenizer, BertForNextSentencePrediction
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')
text = ("After Abraham Lincoln won the November 1860 presidential election on an "
        "anti-slavery platform, an initial seven slave states declared their "
        "secession from the country to form the Confederacy.")
text2 = ("War broke out in April 1861 when secessionist forces attacked Fort "
         "Sumter in South Carolina, just over a month after Lincoln's "
         "inauguration.")

It's worth noting that we have two strings: text for sentence A and text2 for sentence B. Keeping them distinct allows our tokenizer to accurately process both of them, as we'll see in a moment.

We must now take the following three steps:

  • Tokenization
  • Create a class label
  • Calculate the loss

To begin, we'll look at tokenization.

Tokenization 

We use our initialized tokenizer to do tokenization, passing both text and text2.

inputs = tokenizer(text, text2, return_tensors='pt')
inputs.keys()

output

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

There are a few things to keep in mind regarding NSP. First, our two sentences are merged into the same set of tensors, yet BERT can still tell that they are two separate sentences in a couple of ways:

  • A [SEP] token is added between the two sentences. In our input_ids tensor, this separator token is represented by the id 102.
  • The token_type_ids tensor holds segment ids that indicate which token belongs to which segment. Tokens from sentence A have the value 0, and tokens from sentence B have the value 1, as the sketch below shows.
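To confirm this, we can inspect the tensors we just created. This is a short sketch; the exact token ids depend on the two example sentences above.

# Inspect how the two sentences are packed into one input
print(inputs['input_ids'])       # token ids; 101 is [CLS] and 102 marks each [SEP]
print(inputs['token_type_ids'])  # 0s for sentence A tokens, 1s for sentence B tokens
print(inputs['attention_mask'])  # 1 for every real token (no padding here)

# Count the [SEP] tokens (id 102) -- there should be exactly two
print((inputs['input_ids'] == 102).sum().item())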

Create class label 

The next step is simple: we need to create a labels tensor that states whether sentence B follows sentence A.

labels = torch.LongTensor([0])
labels

output

tensor([0])

IsNextSentence has the value 0 and NotNextSentence has the value 1. We also need to use torch.LongTensor, which is the integer tensor format the model expects for labels. A sketch of the opposite case, where sentence B does not follow sentence A, is shown below.
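For comparison, here is a minimal sketch of the negative case, reusing the tokenizer and text defined above with a hypothetical unrelated sentence as sentence B; the label 1 tells the model that this sentence does not follow sentence A.

# Hypothetical unrelated follow-up sentence (does NOT continue the first text)
random_text = "Photosynthesis converts sunlight into chemical energy in plants."
neg_inputs = tokenizer(text, random_text, return_tensors='pt')
neg_labels = torch.LongTensor([1])   # 1 = NotNextSentence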

Calculate loss 

Finally, we calculate the loss. We do this by running our model on our inputs and labels.

outputs = model(**inputs, labels=labels)
outputs.keys()
outputs.loss
outputs.loss.item()

output

odict_keys(['loss', 'logits'])
tensor(3.2186e-06, grad_fn=<NllLossBackward>)
3.2186455882765586e-06

Our model returns a loss tensor, which is the value we would minimize during training; a sketch of a single training step follows.
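As a hedged illustration, a single fine-tuning step with this loss could look like the sketch below, assuming a standard PyTorch optimizer such as AdamW and the model, inputs, and labels defined above.

from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)  # typical fine-tuning learning rate

model.train()
optimizer.zero_grad()
outputs = model(**inputs, labels=labels)  # forward pass returns the NSP loss
outputs.loss.backward()                   # backpropagate the loss
optimizer.step()                          # update the model weights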

Prediction

Often we don't want to train the model and instead just want to use it for inference. In that scenario we wouldn't have a labels tensor, so we'd change the last part of our code to extract the logits tensor as follows:

outputs = model(**inputs)
outputs.keys()
And take the argmax to get our prediction:
torch.argmax(outputs.logits)

output

odict_keys(['logits'])
tensor(0)

Our model returns a logits tensor containing the activation for the IsNextSentence class at index 0 and the activation for the NotNextSentence class at index 1.

To get the model's prediction, we take the argmax of the output logits. Here it returns 0, indicating that BERT believes sentence B follows sentence A, which is correct. If we want probabilities instead of raw activations, we can apply a softmax, as sketched below.
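Here is a minimal sketch of that optional softmax step, applied to the logits produced above:

probs = torch.softmax(outputs.logits, dim=-1)  # shape (1, 2): [P(IsNextSentence), P(NotNextSentence)]
print(probs)
print(probs[0, 0].item())  # probability that sentence B follows sentence A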

FAQs

1. Can BERT predict the next sentence?

Yes. For next sentence prediction, BERT is fine-tuned on three types of tasks. In the first type, a pair of sentences is given as input and a single class label is produced as output, as in MNLI (Multi-Genre Natural Language Inference), a large-scale sentence-pair classification task.

2. How do you predict the BERT model?

Because BERT is a bidirectional model, it looks in both the left-to-right and right-to-left directions to deduce the meaning of a masked word. For prediction, BERT considers both the preceding and following tokens of the masked word; a short masked-word example is sketched below.
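As a brief illustration, here is a sketch of masked-word prediction using HuggingFace's fill-mask pipeline with the same bert-base-uncased model; the example sentence is hypothetical.

from transformers import pipeline

unmasker = pipeline('fill-mask', model='bert-base-uncased')
# BERT uses the tokens on both sides of [MASK] to rank candidate words
for pred in unmasker("The capital of France is [MASK]."):
    print(pred['token_str'], round(pred['score'], 3))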

3. Can BERT be used for text generation?

Once the student model has been trained, the teacher BERT model is no longer required, and only the student model is used to generate the text. This means that DistilBERT does not need any additional resources during generation.

4. What is BERT ML?

BERT is an open-source machine learning framework for natural language processing (NLP). It uses the surrounding text to help computers grasp the meaning of ambiguous words in text.

Key Takeaways

That brings us to the end of the article.

In this article, we have extensively discussed Sentence Prediction with BERT.

Isn't Machine Learning exciting! We hope that this blog has helped you enhance your knowledge of Sentence Prediction with BERT. If you would like to learn more, check out our articles on the Machine Learning Course. Do upvote our blog to help other ninjas grow. Happy Coding!
