Introduction
Natural language processing (NLP) is a major research area within artificial intelligence, and text classification is one of its fundamental tasks, with applications ranging from sentiment analysis to spam filtering. BERT (Bidirectional Encoder Representations from Transformers) is a pretrained language model with strong semantic-understanding capabilities. This article shows how to implement a text classification task with PyTorch and BERT in concise, efficient code of fewer than 80 lines.
Step 1: Prepare the Dataset
First, we need a dataset for text classification. Any suitable dataset will do, for example the IMDB movie-review dataset. Here we assume a dataset with texts and corresponding labels is already prepared. Python's pandas library makes it easy to read and inspect the data:
import pandas as pd
# Read the dataset (expects "text" and "label" columns)
data = pd.read_csv("dataset.csv")
# Inspect the first few rows
print(data.head())
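If no dataset is at hand, a throwaway stand-in can be written first so the rest of the code runs end to end. The file name dataset.csv, the texts, and the 0/1 labels below are all hypothetical placeholders, not a real corpus; run this before the read above:
# Hypothetical stand-in for dataset.csv; replace with real data
pd.DataFrame({
    "text": ["This movie was fantastic!", "Great acting and a moving story.",
             "Terrible plot and worse pacing.", "I want those two hours back."],
    "label": [1, 1, 0, 0],  # assumed convention: 1 = positive, 0 = negative
}).to_csv("dataset.csv", index=False)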
Step 2: Preprocess the Data
Next, the data must be preprocessed for training. The text has to be converted into the input format BERT expects, namely token IDs, and the labels have to be numeric. Hugging Face's transformers library handles these conversions conveniently. Add the following code:
from transformers import BertTokenizer
# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)
# Tokenize and encode the texts, padding/truncating to a fixed length
tokens = tokenizer.batch_encode_plus(
    data["text"].tolist(),
    max_length=256,
    padding="max_length",  # replaces the deprecated pad_to_max_length=True
    truncation=True
)
# Labels as a numeric array
labels = data["label"].values
# Print the encoded output
print(tokens)
print(labels)
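To see what the encoding produced, it can help to decode one example back into tokens. This inspection snippet is just a sketch built on the tokens dict from above:
# Inspect the first encoded example
first_ids = tokens["input_ids"][0]
print(len(first_ids))                                    # 256, the padded length
print(tokens["attention_mask"][0][:10])                  # 1 = real token, 0 = padding
print(tokenizer.convert_ids_to_tokens(first_ids[:10]))   # starts with '[CLS]'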
Step 3: Build the Model
Next, we build the classification model: pretrained BERT as the encoder, with a fully connected layer on top for classification. PyTorch's torch.nn module makes the definition straightforward. Add the following code:
import torch
import torch.nn as nn
from transformers import BertModel

class TextClassifier(nn.Module):
    def __init__(self):
        super(TextClassifier, self).__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(0.2)
        self.fc = nn.Linear(768, 2)  # 2 classes

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output  # pooled [CLS] representation
        dropped_output = self.dropout(pooled_output)
        logits = self.fc(dropped_output)
        return logits

# Create a model instance
model = TextClassifier()
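Before training, a quick shape check with dummy inputs confirms the pieces wire together; the batch size of 2 and sequence length of 8 here are arbitrary:
# Sanity check: random token IDs through the untrained model
dummy_ids = torch.randint(1, 1000, (2, 8))
dummy_mask = torch.ones_like(dummy_ids)
with torch.no_grad():
    print(model(dummy_ids, dummy_mask).shape)  # expected: torch.Size([2, 2])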
Step 4: Train the Model
Before training, the data must be converted to PyTorch tensors and split into training and validation sets. We can then train with PyTorch's built-in optimizer and loss function. Note that the model and each batch must be moved to the same device (GPU if available). Add the following code:
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
# Split into training and validation sets
train_inputs, val_inputs, train_labels, val_labels = train_test_split(
    tokens["input_ids"],
    labels,
    random_state=42,
    test_size=0.2
)
# Convert to PyTorch tensors
train_inputs = torch.tensor(train_inputs)
val_inputs = torch.tensor(val_inputs)
train_labels = torch.tensor(train_labels)
val_labels = torch.tensor(val_labels)
# Create data loaders
batch_size = 16
train_data = TensorDataset(train_inputs, train_labels)
train_loader = DataLoader(train_data, sampler=RandomSampler(train_data), batch_size=batch_size)
val_data = TensorDataset(val_inputs, val_labels)
val_loader = DataLoader(val_data, sampler=SequentialSampler(val_data), batch_size=batch_size)
# Move the model to GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# Define the optimizer and loss function
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()
# Train the model
num_epochs = 5
for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for batch in train_loader:
        optimizer.zero_grad()
        input_ids = batch[0].to(device)
        attention_mask = (input_ids > 0).long()  # BERT's pad token ID is 0
        labels = batch[1].to(device)
        outputs = model(input_ids, attention_mask=attention_mask)
        loss = loss_fn(outputs, labels)
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
    avg_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch+1}/{num_epochs}, Train Loss: {avg_loss:.4f}")
    model.eval()
    val_loss = 0
    val_accuracy = 0
    for batch in val_loader:
        input_ids = batch[0].to(device)
        attention_mask = (input_ids > 0).long()
        labels = batch[1].to(device)
        with torch.no_grad():
            outputs = model(input_ids, attention_mask=attention_mask)
            loss = loss_fn(outputs, labels)
            val_loss += loss.item()
            preds = torch.argmax(outputs, dim=1)
            val_accuracy += (preds == labels).cpu().numpy().mean()
    avg_val_loss = val_loss / len(val_loader)
    avg_val_accuracy = val_accuracy / len(val_loader)
    print(f"Epoch {epoch+1}/{num_epochs}, Val Loss: {avg_val_loss:.4f}, Val Accuracy: {avg_val_accuracy:.4f}")
Step 5: Evaluate the Model and Make Predictions
After training, we can evaluate the model's performance on held-out data and then use it to predict new text. Since the code above only created a train/validation split, the evaluation loop below reuses val_loader; with a separate test set, you would build a test loader the same way. Add the following code:
# Evaluate performance on the held-out set (val_loader doubles as our test set here)
model.eval()
test_loss = 0
test_accuracy = 0
for batch in val_loader:
    input_ids = batch[0].to(device)
    attention_mask = (input_ids > 0).long()
    labels = batch[1].to(device)
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
        loss = loss_fn(outputs, labels)
        test_loss += loss.item()
        preds = torch.argmax(outputs, dim=1)
        test_accuracy += (preds == labels).cpu().numpy().mean()
avg_test_loss = test_loss / len(val_loader)
avg_test_accuracy = test_accuracy / len(val_loader)
print(f"Test Loss: {avg_test_loss:.4f}, Test Accuracy: {avg_test_accuracy:.4f}")
# Predict on new text
text = "This is a great movie!"
encoding = tokenizer.encode_plus(text, add_special_tokens=True, return_tensors="pt")
input_ids = encoding["input_ids"].to(device)
attention_mask = encoding["attention_mask"].to(device)
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
    pred = torch.argmax(outputs, dim=1)
print(f"Predicted label: {pred.item()}")
Conclusion
This article showed how to implement a text classification task with PyTorch and BERT in a small amount of concise code. By preprocessing the data, building the model, and then training and evaluating it, we can complete a text classification task with little effort and obtain solid results.