In the previous article we saw how to implement data augmentation for Computer Vision. In this article we are going to see how to augment textual data. Up we go!
There's an excellent reference explaining how data augmentation for NLP works. However, instead of using a ready-to-use library, it is more interesting to develop everything from scratch. Moreover, implementing the algorithm yourself lets you adjust it to your specific context.
In short, the data augmentation techniques include the following (a minimal from-scratch sketch of random deletion is shown right after the list):
Back translation
Synonym Replacement
Random Insertion
Random Deletion
Shuffle Sentences Transform etc.
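To give an idea of how simple some of these are, here is a minimal from-scratch sketch of random deletion; the probability p and the helper name are illustrative choices, not taken from any particular library.
import random

def random_deletion(text, p=0.1):
    # Randomly drop each word with probability p, keeping at least one word
    words = text.split()
    if len(words) <= 1:
        return text
    kept = [w for w in words if random.random() > p]
    if not kept:
        kept = [random.choice(words)]
    return " ".join(kept)

# Example: random_deletion("le chat dort sur le canapé", p=0.2)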
There's also a Python package, called NLPAug, that allows you to do some basic and advanced augmentation (a short usage sketch follows the list). NLPAug offers three types of augmentation:
Character level augmentation
Word level augmentation
Sentence level augmentation
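As an illustration of the word-level option, a typical NLPAug call looks roughly like the snippet below; it uses the library's WordNet-based SynonymAug augmenter, but check the NLPAug documentation for the exact API of the version you install.
import nlpaug.augmenter.word as naw

# WordNet-based synonym replacement at the word level
aug = naw.SynonymAug(aug_src='wordnet')
augmented_text = aug.augment("The quick brown fox jumps over the lazy dog")
print(augmented_text)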
According to Jakub Czakon, the author of the discussed reference:
From my experience, the most commonly used and effective technique is synonym replacement via word embeddings.
Sounds reasonable. But as mentioned in the introduction, we will try implementing all these steps ourselves.
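For context, synonym replacement via word embeddings usually boils down to swapping a word for one of its nearest neighbours in the embedding space. A minimal sketch with gensim might look like this; the embedding file path and the parameters are placeholders, not something used later in this article.
from gensim.models import KeyedVectors
import random

# Load any pre-trained word vectors (the path is a placeholder)
vectors = KeyedVectors.load_word2vec_format("embeddings.vec")

def replace_with_neighbour(text, n_candidates=5):
    # Swap one random in-vocabulary word for one of its nearest embedding neighbours
    words = text.split()
    candidates = [i for i, w in enumerate(words) if w in vectors]
    if not candidates:
        return text
    i = random.choice(candidates)
    neighbours = [w for w, _ in vectors.most_similar(words[i], topn=n_candidates)]
    words[i] = random.choice(neighbours)
    return " ".join(words)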
Step 0: Imports
import torch
import random
import re
import pandas as pd
from collections import Counter
from math import floor, ceil
Step 1: Contextual Word Embeddings
For this purpose we are going to use CamemBERT, a French extension of BERT. A short quote from the official website:
CamemBERT is a state-of-the-art language model for French based on the RoBERTa architecture pretrained on the French subcorpus of the newly available multilingual corpus OSCAR. We evaluate CamemBERT in four different downstream tasks for French: part-of-speech (POS) tagging, dependency parsing, named entity recognition (NER) and natural language inference (NLI); improving the state of the art for most tasks over previous monolingual and multilingual approaches, which confirms the effectiveness of large pretrained language models for French. CamemBERT was trained and evaluated by Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
So, the first thing is to download the model and initialize the camembert object.
# Download the pre-trained CamemBERT model from the fairseq hub
camembert = torch.hub.load('pytorch/fairseq', 'camembert')
# Switch to evaluation mode (we only run inference, no training)
camembert.eval()
N.B. there are some additional dependencies to install, such as hydra-core and omegaconf. Important: do not confuse hydra-core with the hydra package, otherwise you will have to uninstall everything.
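Before using it inside the augmentation loop, it can be useful to see what fill_mask returns. In my understanding of the fairseq hub interface, each result is a (sentence, score, token) tuple ordered by probability; the example sentence is only an illustration:
masked_line = "Le camembert est <mask> :)"
# topk=3 returns the three most probable completions, best first
for filled_sentence, score, token in camembert.fill_mask(masked_line, topk=3):
    print(round(score, 3), filled_sentence)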
Step 2: Mask Random Word and Generate Synthetic Texts
This step is quite straightforward: simply mask a random word and send the masked line to camembert so that the language model can guess the masked word. The guessed words are ordered by their probabilities in descending order.
Mask a random word in a line
def mask_random_word(text_line):
    # Split the line into words and replace one randomly chosen word with the <mask> token
    text_array = text_line.split()
    masked_token_index = random.randint(0, len(text_array) - 1)
    text_array[masked_token_index] = "<mask>"
    output_text = " ".join(text_array)
    return output_text
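A quick illustrative call (the input sentence is made up for the example):
masked = mask_random_word("je voudrais réserver une table pour deux")
print(masked)  # e.g. "je voudrais <mask> une table pour deux"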
Generate synthetic text for a text line
def generate_synthetic_texts(text, number_of_examples):
    output = []
    for i in range(number_of_examples):
        # Mask a (new) random word each time and let CamemBERT fill it in
        masked_line = mask_random_word(text)
        # topk=1 keeps only the most probable completion; [0][0] is the filled sentence
        top_unmasked = camembert.fill_mask(masked_line, topk=1)
        output.append(top_unmasked[0][0])
    return output
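For instance (again with a made-up sentence), generating two variations of a single line would look like this:
variations = generate_synthetic_texts("je voudrais réserver une table pour deux", 2)
print(variations)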
Step 3: Get Majority Class and Execute
We are almost done. The only thing left is to detect the majority class and create N items per minority class according to the size of the majority class.
Read the text dataset
# The dataset is expected to have a 'Body' column (the text) and a 'Result' column (the intent label)
myModel = pd.read_csv('model.csv', sep=';')
Get the majority class
counter = Counter(myModel['Result'])
max_intent = ''
max_intent_count = 0
for intent in set(myModel['Result']):
    if max_intent_count < counter[intent]:
        max_intent = intent
        max_intent_count = counter[intent]
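As a side note, the same result can be obtained more compactly with Counter's built-in helper:
max_intent, max_intent_count = counter.most_common(1)[0]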
Calculate the number of items to generate per minority class
threshold = floor(max_intent_count/2)
minority_intents = set(myModel['Result']) - set([max_intent])
for intent in minority_intents:
    print(intent, "started")
    intent_examples = myModel[myModel['Result'] == intent]
    intent_count = intent_examples.shape[0]
    examples_to_generate = max_intent_count - intent_count
    if examples_to_generate > threshold:
        # Large gap: generate several synthetic texts for every example of the minority class
        body_examples = intent_examples['Body']
        number_of_synthetic_per_example = ceil(examples_to_generate/intent_count)
    else:
        # Small gap: one synthetic text for each of the first examples_to_generate examples is enough
        body_examples = intent_examples['Body'][:examples_to_generate]
        number_of_synthetic_per_example = 1
    print(intent_count, examples_to_generate, number_of_synthetic_per_example)
    syntetic_bodies = []
    for body in body_examples:
        syntetic_bodies.append(generate_synthetic_texts(body, number_of_synthetic_per_example))
    # flatten arrays
    syntetic_bodies = [item for sublist in syntetic_bodies for item in sublist]
    augmented_df = pd.DataFrame()
    augmented_df['Body'] = syntetic_bodies
    augmented_df['Result'] = intent
    # Append the augmented rows to the original dataset
    myModel = myModel.append(augmented_df)
    print(intent, "processed")
N.B. the threshold is needed to decide whether we generate synthetic items for every item of the minority class or only for a selected sub-array. For instance, if our majority class is composed of 500 items and our minority class has 350, then the most logical way is to select a sub-array of 150 items and generate 1 synthetic example for each.
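Plugging the numbers from this example into the branching logic above makes the rule explicit:
max_intent_count = 500
intent_count = 350
threshold = max_intent_count // 2                       # 250
examples_to_generate = max_intent_count - intent_count  # 150
# 150 <= 250, so we fall into the else branch: take the first 150 bodies
# and generate exactly 1 synthetic text per body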
Discussion
The execution takes some time (40 minutes for a dataset of a couple of thousand lines), so it may be interesting to try parallelizing the execution (maybe using Spark on Azure Databricks). I will try to publish an article on this in one of the next tutorials.
Hope this was useful
Hello, thank you again for this comment. The problem with my code is that it doesn't verify whether the generated text is a duplicate or not. Give me some time, I will update it. Tonight or this weekend.
For now, if you already know the number of instances to generate, I would suggest setting the number of synthetic values manually, like this:
Instead of
number_of_synthetic_per_example = ceil(examples_to_generate/intent_count)
use
number_of_synthetic_per_example = 2372
or something like this.
I will keep you updated
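In the meantime, a minimal sketch of such a duplicate check (not part of the original code, just an illustration) could filter the generated texts against the existing dataset and against each other:
def deduplicate(generated_texts, existing_texts):
    # Keep only generated texts that are not already present, dropping repeats as we go
    seen = set(existing_texts)
    unique = []
    for text in generated_texts:
        if text not in seen:
            unique.append(text)
            seen.add(text)
    return unique

# e.g. syntetic_bodies = deduplicate(syntetic_bodies, myModel['Body'])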
And I also get this error: augmented_df['tweet'] = syntetic_subjects. Do I need to put syntetic_bodies instead of syntetic_subjects?
Hi, I tried it and I get the following error with my dataset: 0 started 821 2372 1. It looks like it won't generate artificial text...