Preparing Data to Train an NLP Transformer
malong, published 2025-01-15

Original: Prepare data to train NLP Transformer

 

For this task, I picked a dataset from the OPUS collection (a set of translated texts gathered from the web).

I chose a dataset containing one million English and Spanish sentences. You can download the data from here. This is a large dataset, and you should keep in mind that training a Transformer on it may take several days. For testing purposes, you should therefore use a smaller subset, for example 100,000 sentences.
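
A minimal sketch of what that could look like: once the training lists are loaded (see read_files later in this post), a quick test run could simply cap them at the first 100,000 sentence pairs. The subset_size name is purely illustrative.

# Illustrative only: keep a small subset of the parallel data for quick experiments.
subset_size = 100_000
es_training_data = es_training_data[:subset_size]
en_training_data = en_training_data[:subset_size]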

To avoid downloading it manually, I wrote a script that downloads the data and saves it to the Datasets folder:

import os
import requests
from tqdm import tqdm
from bs4 import BeautifulSoup

# URL to the directory containing the files to be downloaded
language = "en-es"
url = f"https://data.statmt.org/opus-100-corpus/v1.0/supervised/{language}/"
save_directory = f"./Datasets/{language}"

# Create the save directory if it doesn't exist
os.makedirs(save_directory, exist_ok=True)

# Send a GET request to the URL
response = requests.get(url)

# Parse the HTML response
soup = BeautifulSoup(response.content, 'html.parser')

# Find all the anchor tags in the HTML
links = soup.find_all('a')

# Extract the href attribute from each anchor tag
file_links = [link['href'] for link in links if '.' in link['href']]

# Download each file
for file_link in tqdm(file_links):
    file_url = url + file_link
    save_path = os.path.join(save_directory, file_link)
    
    print(f"Downloading {file_url}")
    
    # Send a GET request for the file
    file_response = requests.get(file_url)
    if file_response.status_code == 404:
        print(f"Could not download {file_url}")
        continue
    
    # Save the file to the specified directory
    with open(save_path, 'wb') as file:
        file.write(file_response.content)
    
    print(f"Saved {file_link}")

print("All files have been downloaded.")

If you want to download a different dataset, change the language variable to the language pair you want to translate, as shown below. Make sure the pair exists in the OPUS-100 dataset list.
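
For example, switching to the English-French pair (assuming it appears in the OPUS-100 listing under this exact name) only requires changing the top of the script:

# Hypothetical example: fetch the English-French portion of OPUS-100 instead.
language = "en-fr"
url = f"https://data.statmt.org/opus-100-corpus/v1.0/supervised/{language}/"
save_directory = f"./Datasets/{language}"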

Once the dataset is downloaded, we need to read it into memory; we do that with the following code:

en_training_data_path = "Datasets/en-es/opus.en-es-train.en"
en_validation_data_path = "Datasets/en-es/opus.en-es-dev.en"
es_training_data_path = "Datasets/en-es/opus.en-es-train.es"
es_validation_data_path = "Datasets/en-es/opus.en-es-dev.es"

def read_files(path):
    with open(path, "r", encoding="utf-8") as f:
        dataset = f.read().split("\n")[:-1]
    return dataset

en_training_data = read_files(en_training_data_path)
en_validation_data = read_files(en_validation_data_path)
es_training_data = read_files(es_training_data_path)
es_validation_data = read_files(es_validation_data_path)

max_length = 500
train_dataset = [
    [es_sentence, en_sentence]
    for es_sentence, en_sentence in zip(es_training_data, en_training_data)
    if len(es_sentence) <= max_length and len(en_sentence) <= max_length
]
val_dataset = [
    [es_sentence, en_sentence]
    for es_sentence, en_sentence in zip(es_validation_data, en_validation_data)
    if len(es_sentence) <= max_length and len(en_sentence) <= max_length
]
es_training_data, en_training_data = zip(*train_dataset)
es_validation_data, en_validation_data = zip(*val_dataset)

print(len(es_training_data))
print(len(es_validation_data))
print(es_training_data[:3])
print(en_training_data[:3])

The code above performs the following steps:

  1. File paths: four file paths are defined, pointing to the locations of the data files. The files are named after their language pair, where "en" stands for English and "es" for Spanish, and they hold the training and validation datasets;
  2. The read_files function: this function reads the contents of the file at a given path. It uses the open function in "r" (read) mode with "utf-8" encoding to handle text data, reads the file, and splits it into lines using the newline character ("\n") as the separator. The last element of the resulting list is dropped with [:-1] to exclude the trailing empty line. The function returns the list of lines as the file contents;
  3. Reading the data files: the read_files function is used to read the four data files, English training, English validation, Spanish training, and Spanish validation. The data from each file is stored in a separate variable: en_training_data, en_validation_data, es_training_data, and es_validation_data;
  4. Filtering the datasets: the code sets a maximum sentence length (max_length). It then creates two new datasets, train_dataset and val_dataset, by zipping the Spanish and English sentences of the training and validation data. Only sentence pairs in which both the Spanish and the English sentence are no longer than max_length are kept;
  5. Unzipping the datasets: after filtering, the code uses the zip function with the * operator to "unzip" train_dataset and val_dataset back into separate lists of Spanish and English sentences. As a result, es_training_data, en_training_data, es_validation_data, and en_validation_data contain the filtered Spanish and English sentences for training and validation.

The overall purpose of this code is to read text data from files, filter out sentences that exceed the specified maximum length, and organize the filtered data into separate lists for training and validation. This filtered and organized data is meant to be used as input for training a language model or for other natural language processing tasks.

When we run the code above, we should see the following output:

995249
1990
('Fueron los asbestos aquí. ¡Eso es lo que ocurrió!', 'Me voy de aquí.', 'Una vez, juro que cagué una barra de tiza.')
("It was the asbestos in here, that's what did it!", "I'm out of here.", 'One time, I swear I pooped out a stick of chalk.')

Setting up the Tokenizer

To process the sentences, I created a custom Tokenizer. It is similar to the Tokenizer in the tensorflow.keras.preprocessing.text module, with one difference: when I am ready to use the trained Transformer model, I do not need to install the huge TensorFlow library just to use the Tokenizer class.

Here is the code for the CustomTokenizer object:

import os
import json
import typing
from tqdm import tqdm

class CustomTokenizer:
    """ Custom Tokenizer class to tokenize and detokenize text data into sequences of integers

    Args:
        split (str, optional): Split token to use when tokenizing text. Defaults to " ".
        char_level (bool, optional): Whether to tokenize at character level. Defaults to False.
        lower (bool, optional): Whether to convert text to lowercase. Defaults to True.
        start_token (str, optional): Start token to use when tokenizing text. Defaults to "<start>".
        end_token (str, optional): End token to use when tokenizing text. Defaults to "<eos>".
        filters (list, optional): List of characters to filter out. Defaults to 
            ['!', "'", '"', '#', '$', '%', '&', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', 
            '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', '\t', '\n'].
        filter_nums (bool, optional): Whether to filter out numbers. Defaults to True.
        start (int, optional): Index to start tokenizing from. Defaults to 1.
    """
    def __init__(
            self, 
            split: str=" ", 
            char_level: bool=False,
            lower: bool=True, 
            start_token: str="<start>", 
            end_token: str="<eos>",
            filters: list = ['!', "'", '"', '#', '$', '%', '&', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', '\t', '\n'],
            filter_nums: bool = True,
            start: int=1,
        ) -> None:
        self.split = split
        self.char_level = char_level
        self.lower = lower
        self.index_word = {}
        self.word_index = {}
        self.max_length = 0
        self.start_token = start_token
        self.end_token = end_token
        self.filters = filters
        self.filter_nums = filter_nums
        self.start = start

    @property
    def start_token_index(self):
        return self.word_index[self.start_token]
    
    @property
    def end_token_index(self):
        return self.word_index[self.end_token]

    def sort(self):
        """ Sorts the word_index and index_word dictionaries"""
        self.index_word = dict(enumerate(dict(sorted(self.word_index.items())), start=self.start))
        self.word_index = {v: k for k, v in self.index_word.items()}

    def split_line(self, line: str):
        """ Splits a line of text into tokens

        Args:
            line (str): Line of text to split

        Returns:
            list: List of string tokens
        """
        line = line.lower() if self.lower else line

        if self.char_level:
            return [char for char in line]

        # split line with split token and check for filters
        line_tokens = line.split(self.split)

        new_tokens = []
        for index, token in enumerate(line_tokens):
            filtered_tokens = ['']
            for c_index, char in enumerate(token):
                if char in self.filters or (self.filter_nums and char.isdigit()):
                    filtered_tokens += [char, ''] if c_index != len(token) -1 else [char]
                else:
                    filtered_tokens[-1] += char

            new_tokens += filtered_tokens
            if index != len(line_tokens) -1:
                new_tokens += [self.split]

        new_tokens = [token for token in new_tokens if token != '']

        return new_tokens

    def fit_on_texts(self, lines: typing.List[str]):
        """ Fits the tokenizer on a list of lines of text
        This function will update the word_index and index_word dictionaries and set the max_length attribute

        Args:
            lines (typing.List[str]): List of lines of text to fit the tokenizer on
        """
        self.word_index = {key: value for value, key in enumerate([self.start_token, self.end_token, self.split] + self.filters)}
        
        for line in tqdm(lines, desc="Fitting tokenizer"):
            line_tokens = self.split_line(line)
            self.max_length = max(self.max_length, len(line_tokens) +2) # +2 for start and end tokens

            for token in line_tokens:
                if token not in self.word_index:
                    self.word_index[token] = len(self.word_index)

        self.sort()

    def update(self, lines: typing.List[str]):
        """ Updates the tokenizer with new lines of text
        This function will update the word_index and index_word dictionaries and set the max_length attribute

        Args:
            lines (typing.List[str]): List of lines of text to update the tokenizer with
        """
        new_tokens = 0
        for line in tqdm(lines, desc="Updating tokenizer"):
            line_tokens = self.split_line(line)
            self.max_length = max(self.max_length, len(line_tokens) +2) # +2 for start and end tokens
            for token in line_tokens:
                if token not in self.word_index:
                    self.word_index[token] = len(self.word_index)
                    new_tokens += 1

        self.sort()
        print(f"Added {new_tokens} new tokens")

    def detokenize(self, sequences: typing.List[int], remove_start_end: bool=True):
        """ Converts a list of sequences of tokens back into text

        Args:
            sequences (typing.list[int]): List of sequences of tokens to convert back into text
            remove_start_end (bool, optional): Whether to remove the start and end tokens. Defaults to True.
        
        Returns:
            typing.List[str]: List of strings of the converted sequences
        """
        lines = []
        for sequence in sequences:
            line = ""
            for token in sequence:
                if token == 0:
                    break
                if remove_start_end and (token == self.start_token_index or token == self.end_token_index):
                    continue

                line += self.index_word[token]

            lines.append(line)

        return lines

    def texts_to_sequences(self, lines: typing.List[str], include_start_end: bool=True):
        """ Converts a list of lines of text into a list of sequences of tokens
        
        Args:
            lines (typing.list[str]): List of lines of text to convert into tokenized sequences
            include_start_end (bool, optional): Whether to include the start and end tokens. Defaults to True.

        Returns:
            typing.List[typing.List[int]]: List of sequences of tokens
        """
        sequences = []
        for line in lines:
            line_tokens = self.split_line(line)
            sequence = [self.word_index[word] for word in line_tokens if word in self.word_index]
            if include_start_end:
                sequence = [self.word_index[self.start_token]] + sequence + [self.word_index[self.end_token]]

            sequences.append(sequence)

        return sequences
    
    def save(self, path: str, type: str="json"):
        """ Saves the tokenizer to a file
        
        Args:
            path (str): Path to save the tokenizer to
            type (str, optional): Type of file to save the tokenizer to. Defaults to "json".
        """
        serialised_dict = self.dict()
        if type == "json":
            if os.path.dirname(path):
                os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "w") as f:
                json.dump(serialised_dict, f)

    def dict(self):
        """ Returns a dictionary of the tokenizer

        Returns:
            dict: Dictionary of the tokenizer
        """
        return {
            "split": self.split,
            "lower": self.lower,
            "char_level": self.char_level,
            "index_word": self.index_word,
            "max_length": self.max_length,
            "start_token": self.start_token,
            "end_token": self.end_token,
            "filters": self.filters,
            "filter_nums": self.filter_nums,
            "start": self.start
        }

    @staticmethod
    def load(path: typing.Union[str, dict], type: str="json"):
        """ Loads a tokenizer from a file

        Args:
            path (typing.Union[str, dict]): Path to load the tokenizer from or a dictionary of the tokenizer
            type (str, optional): Type of file to load the tokenizer from. Defaults to "json".

        Returns:
            CustomTokenizer: Loaded tokenizer
        """
        if isinstance(path, str):
            if type == "json":
                with open(path, "r") as f:
                    load_dict = json.load(f)

        elif isinstance(path, dict):
            load_dict = path

        tokenizer = CustomTokenizer()
        tokenizer.split = load_dict["split"]
        tokenizer.lower = load_dict["lower"]
        tokenizer.char_level = load_dict["char_level"]
        tokenizer.index_word = {int(k): v for k, v in load_dict["index_word"].items()}
        tokenizer.max_length = load_dict["max_length"]
        tokenizer.start_token = load_dict["start_token"]
        tokenizer.end_token = load_dict["end_token"]
        tokenizer.filters = load_dict["filters"]
        tokenizer.filter_nums = bool(load_dict["filter_nums"])
        tokenizer.start = load_dict["start"]
        tokenizer.word_index = {v: int(k) for k, v in tokenizer.index_word.items()}

        return tokenizer
    
    @property
    def lenght(self):
        return len(self.index_word)

    def __len__(self):
        return len(self.index_word)

I will include this object in the MLTU package, which you can install from PyPI, so you do not need to copy and paste it. We will get to that later.
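
As a rough sketch of how that might look (the exact import path depends on the mltu version you install, so treat this as an assumption rather than the definitive API):

# pip install mltu
# Assumed import path; verify it against the installed mltu version.
from mltu.tokenizers import CustomTokenizer

tokenizer = CustomTokenizer(char_level=True)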

It is a custom implementation of a text tokenizer that takes raw text data and converts it into sequences of integers (tokens). The class provides several methods for tokenization and detokenization.

The tokenizer can be initialized with various parameters, such as the split token (used to split the input text), whether tokenization happens at the character or word level, and whether to convert the text to lowercase. It also lets you specify start and end tokens, which are added to the tokenized sequences. In addition, you can filter out specific characters and digits during tokenization.
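
For instance, the same class can be configured as a word-level tokenizer that keeps digits; this is purely an illustration of the constructor options, and word_tokenizer is just a throwaway name:

# Illustrative configuration: word-level tokens, digits kept, text lowercased.
word_tokenizer = CustomTokenizer(
    char_level=False,   # split on the " " token instead of individual characters
    filter_nums=False,  # keep digits inside tokens
    lower=True,         # lowercase the text before tokenizing
)
word_tokenizer.fit_on_texts(["Room 101 is down the hall."])
print(word_tokenizer.texts_to_sequences(["Room 101"]))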

The class includes methods such as split_line, which splits a line of text into tokens; fit_on_texts, which fits the tokenizer on a list of text lines and updates its internal dictionaries; and detokenize, which converts sequences of tokens back into text. Other methods include texts_to_sequences, which converts lines of text into token sequences, and save/load, which save the tokenizer to a file or load it back from one.

Overall, the CustomTokenizer class provides a flexible and customizable way to preprocess text data for machine learning models that expect integer sequences as input. It lets you tokenize and detokenize text while handling a variety of preprocessing options according to the requirements of your application.

Since we have two languages, we need to create a tokenizer for each of them. Let's do that:

# prepare Spanish tokenizer, this is the input language
tokenizer = CustomTokenizer(char_level=True)
tokenizer.fit_on_texts(es_training_data)
tokenizer.save("tokenizer.json")

# prepare English tokenizer, this is the output language
detokenizer = CustomTokenizer(char_level=True)
detokenizer.fit_on_texts(en_training_data)
detokenizer.save("detokenizer.json")

This code prepares tokenizers for the two languages, Spanish and English. Tokenizers play a crucial role in natural language processing tasks: they are responsible for converting raw text data into token sequences, which is essential for training machine learning models.

The first part of the code prepares the Spanish tokenizer, which serves as the input-language tokenizer. It is initialized with the CustomTokenizer class, configured to tokenize text at the character level, meaning that every character of the input text is treated as a separate token. The fit_on_texts method is then applied to the Spanish training data (es_training_data), fitting the tokenizer on the list of Spanish sentences and updating its internal dictionaries and settings. By doing this, the tokenizer learns the mapping between characters and their corresponding integer representations. Finally, the tokenizer is saved to a file named "tokenizer.json".

The second part of the code prepares the English tokenizer, which serves as the output-language tokenizer. It works exactly like the first part, except that it is fitted on the English sentences.

When we run the code above, we should see output similar to this:

Fitting tokenizer: 100%|██████████| 995249/995249 [00:10<00:00, 95719.57it/s] 
Fitting tokenizer: 100%|██████████| 995249/995249 [00:07<00:00, 134446.71it/s]

The output above shows the progress of fitting each tokenizer.

These tokenizers can later be used to convert text data into character sequences, enabling further natural language processing tasks such as sequence-to-sequence translation, text generation, or any other task that requires tokenized inputs and outputs. The saved tokenizer files can be loaded at later stages of the NLP pipeline to keep the vocabulary consistent and to make inference and evaluation on new data easier.
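
For example, at inference time the saved files can be restored with the load method defined above, so the exact same vocabulary is reused:

# Restore the tokenizers saved earlier; the paths match the save() calls above.
tokenizer = CustomTokenizer.load("tokenizer.json")
detokenizer = CustomTokenizer.load("detokenizer.json")
print(len(tokenizer), len(detokenizer))  # vocabulary sizes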

Let's use the detokenizer above to convert a sentence into tokens and then back into a sentence:

tokenized_sentence = detokenizer.texts_to_sequences(["Hello world, how are you?"])[0]
print(tokenized_sentence)

detokenized_sentence = detokenizer.detokenize([tokenized_sentence], remove_start_end=False)
print(detokenized_sentence)

detokenized_sentence = detokenizer.detokenize([tokenized_sentence])
print(detokenized_sentence)

Running the code above should give us the following output:

[33, 51, 48, 55, 55, 58, 3, 66, 58, 61, 55, 47, 15, 3, 51, 58, 66, 3, 44, 61, 48, 3, 68, 58, 64, 36, 32]
['<start>hello world, how are you?<eos>']
['hello world, how are you?']

So we tokenized the sentence "Hello world, how are you?" and then detokenized it back. I also demonstrated the difference when toggling remove_start_end.

Setting up the Data Pipeline

Now that we have the tokenizers, we can create a data pipeline. The pipeline is responsible for reading the data, tokenizing it, and batching it. Let's import the DataProvider class from the mltu package:

from mltu.tensorflow.dataProvider import DataProvider

We will have two data providers, one for the training data and one for the validation data. When we iterate over them, we should receive batches that are ready for the model. Let's create them:

from mltu.tensorflow.dataProvider import DataProvider
import numpy as np

def preprocess_inputs(data_batch, label_batch):
    encoder_input = np.zeros((len(data_batch), tokenizer.max_length)).astype(np.int64)
    decoder_input = np.zeros((len(label_batch), detokenizer.max_length)).astype(np.int64)
    decoder_output = np.zeros((len(label_batch), detokenizer.max_length)).astype(np.int64)

    data_batch_tokens = tokenizer.texts_to_sequences(data_batch)
    label_batch_tokens = detokenizer.texts_to_sequences(label_batch)

    for index, (data, label) in enumerate(zip(data_batch_tokens, label_batch_tokens)):
        encoder_input[index][:len(data)] = data
        decoder_input[index][:len(label)-1] = label[:-1] # Drop the [END] tokens
        decoder_output[index][:len(label)-1] = label[1:] # Drop the [START] tokens

    return (encoder_input, decoder_input), decoder_output

train_dataProvider = DataProvider(
    train_dataset, 
    batch_size=4, 
    batch_postprocessors=[preprocess_inputs],
    use_cache=True
    )

val_dataProvider = DataProvider(
    val_dataset, 
    batch_size=4, 
    batch_postprocessors=[preprocess_inputs],
    use_cache=True
    )

We created a Python function called preprocess_inputs and two instances of the DataProvider class, train_dataProvider and val_dataProvider. The preprocess_inputs function serves as a preprocessing step for the batches of input data and labels fed to the machine learning model, specifically for sequence-to-sequence tasks.

The preprocess_inputs function takes two arguments, data_batch and label_batch, representing a batch of input data and the corresponding labels. Inside the function, three arrays, encoder_input, decoder_input, and decoder_output, are initialized as zero-filled arrays to store the processed data.

The function first tokenizes the input and label batches using the previously prepared tokenizer and detokenizer, converting the text data into the integer sequences required to train the model.

Next, it iterates over the data and label batches. For each data-label pair, it fills the encoder_input array with the integer sequence of the input data. Similarly, it fills the decoder_input array with the integer sequence of the label data, dropping the last token (the [END] token). The decoder_output array is filled with the integer sequence of the label data with the [START] token dropped. These arrays are essential for training a sequence-to-sequence model, as they form the input and target sequences during training.
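
A tiny worked example of this shifting, using made-up integer ids: if a tokenized label is [<start>, it, was, here, <eos>], the decoder input drops the final token and the decoder output drops the first one.

# Illustrative only: a single tokenized label with hypothetical integer ids.
label = [1, 45, 12, 87, 2]   # [<start>, "it", "was", "here", <eos>]
decoder_input = label[:-1]   # [1, 45, 12, 87] -> starts with <start>, no <eos>
decoder_output = label[1:]   # [45, 12, 87, 2] -> shifted left, ends with <eos>
print(decoder_input, decoder_output)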

The DataProvider objects handle data batching during model training. They take the training and validation datasets (train_dataset and val_dataset) and process each batch with the preprocess_inputs function. The batch_size parameter determines how many samples go into each batch. In addition, the use_cache parameter is set to True, which means the DataProvider will cache the preprocessed batches so that data can be loaded efficiently during training.

In short, the code above uses the DataProvider class to handle batching and caching during training, making it easier to feed data into the model in batches, a common practice in machine learning for improving training efficiency.

Let's check what a single output of the data provider looks like:

for data_batch in train_dataProvider:
    (encoder_inputs, decoder_inputs), decoder_outputs = data_batch

    encoder_inputs_str = tokenizer.detokenize(encoder_inputs)
    decoder_inputs_str = detokenizer.detokenize(decoder_inputs, remove_start_end=False)
    decoder_outputs_str = detokenizer.detokenize(decoder_outputs, remove_start_end=False)
    print(encoder_inputs_str)
    print(decoder_inputs_str)
    print(decoder_outputs_str)
    
    break

In the terminal, we should see output similar to this:

['fueron los asbestos aquí. ¡eso es lo que ocurrió!', 'me voy de aquí.', 'una vez, juro que cagué una barra de tiza.', 'y prefiero mudarme, ¿entiendes?']
["<start>it was the asbestos in here, that's what did it!", "<start>i'm out of here.", '<start>one time, i swear i pooped out a stick of chalk.', '<start>and i will move, do you understand me?']
["it was the asbestos in here, that's what did it!<eos>", "i'm out of here.<eos>", 'one time, i swear i pooped out a stick of chalk.<eos>', 'and i will move, do you understand me?<eos>']

This is an example of the input our Transformer model will receive during training, except that here we detokenized it back into text to make it easier to understand.
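
As a quick sanity check, the shapes of a single preprocessed batch can also be inspected; with batch_size=4 they should come out as (4, tokenizer.max_length) for the encoder inputs and (4, detokenizer.max_length) for the decoder arrays.

# Inspect the array shapes of one batch produced by the data provider.
for (encoder_inputs, decoder_inputs), decoder_outputs in train_dataProvider:
    print(encoder_inputs.shape, decoder_inputs.shape, decoder_outputs.shape)
    break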

 
