Data Augmentation for NLP with CamemBERT
Updated: Dec 11, 2022
In the previous article we saw how to implement data augmentation for Computer Vision. In this article we are going to see how to augment textual data. Up we go!
There's an excellent reference explaining how data augmentation for NLP works. However, instead of using a ready-to-use library, it is more interesting to develop everything from scratch. Moreover, implementing the algorithm yourself lets you adjust it to your specific context.
In short, the data augmentation techniques include:
Shuffle Sentences Transform etc.
There's also a Python package, called NLPAug, that allows you to do some basic and advanced augmentation. NLPAug offers three types of augmentation:
Character level augmentation
Word level augmentation
Sentence level augmentation
According to Jakub Czakon, the author of the discussed reference:
From my experience, the most commonly used and effective technique is synonym replacement via word embeddings.
Sounds reasonable. But as mentioned in the introduction, we will try implementing all these steps ourselves.
Step 0: Imports
import torch
import random
import re
import pandas as pd
from collections import Counter
from math import floor, ceil
Step 1: Contextual Word Embeddings
For this purpose we are going to use CamemBERT, a French counterpart of BERT. A short quote from the official website:
CamemBERT is a state-of-the-art language model for French based on the RoBERTa architecture pretrained on the French subcorpus of the newly available multilingual corpus OSCAR. We evaluate CamemBERT in four different downstream tasks for French: part-of-speech (POS) tagging, dependency parsing, named entity recognition (NER) and natural language inference (NLI); improving the state of the art for most tasks over previous monolingual and multilingual approaches, which confirms the effectiveness of large pretrained language models for French. CamemBERT was trained and evaluated by Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
So, the first thing to do is download the model and initialize the camembert object.
# Download the pretrained model from the fairseq hub
camembert = torch.hub.load('pytorch/fairseq', 'camembert')
# Switch to evaluation mode (disables dropout)
camembert.eval()
N.B. there are some additional dependencies to install, such as hydra-core and omegaconf. Important: do not confuse hydra-core with the hydra package; if you install the wrong one, you will have to uninstall everything.
Step 2: Mask Random Word and Generate Synthetic Texts
This step is quite straightforward: mask a random word, then send the masked line to camembert and let the language model guess the masked word. The guessed words are returned in descending order of probability.
Mask random word in a line
def mask_random_word(text_line):
    # Split on whitespace and replace one randomly chosen word with the mask token
    text_array = text_line.split()
    masked_token_index = random.randint(0, len(text_array)-1)
    text_array[masked_token_index] = "<mask>"
    output_text = " ".join(text_array)
    return output_text
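A quick sanity check of the masking step can look like the sketch below. The function is repeated so that the snippet runs on its own; the French sample sentence and the fixed seed are assumptions made only for the demo.

```python
import random

def mask_random_word(text_line):
    # Same logic as above: replace one random whitespace-separated word
    words = text_line.split()
    masked_index = random.randint(0, len(words) - 1)
    words[masked_index] = "<mask>"
    return " ".join(words)

random.seed(0)  # fixed seed only to make the demo repeatable
sample = "Je voudrais réserver une table pour deux personnes"  # made-up sentence
masked = mask_random_word(sample)
print(masked)
```

Whatever word gets picked, the result keeps the same number of words and contains exactly one mask token. Note that the function raises an error on an empty line, so empty rows should probably be filtered out beforehand.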
Generate synthetic text for a text line
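def generate_synthetic_texts(text, number_of_examples):
    output = []
    for i in range(number_of_examples):
        masked_line = mask_random_word(text)
        # fill_mask returns a list of (filled sentence, score, token) tuples;
        # keep only the text of the most probable candidate
        top_unmasked = camembert.fill_mask(masked_line, topk=1)[0][0]
        output.append(top_unmasked)
    return output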
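To try the generation loop without downloading the model, one option is to pass the fill-mask function in as a parameter. The sketch below is only an illustration: fake_fill_mask is a hypothetical stand-in, and it assumes that the real fill_mask returns a list of (filled sentence, score, token) tuples.

```python
import random

def mask_random_word(text_line):
    words = text_line.split()
    words[random.randint(0, len(words) - 1)] = "<mask>"
    return " ".join(words)

def fake_fill_mask(masked_line, topk=1):
    # Hypothetical stand-in for camembert.fill_mask: always proposes "chose".
    # Assumed return shape: list of (filled_sentence, score, token) tuples.
    return [(masked_line.replace("<mask>", "chose"), 1.0, "chose")][:topk]

def generate_synthetic_texts(text, number_of_examples, fill_mask=fake_fill_mask):
    output = []
    for _ in range(number_of_examples):
        masked_line = mask_random_word(text)
        # Keep only the sentence of the best candidate
        best_sentence, _score, _token = fill_mask(masked_line, topk=1)[0]
        output.append(best_sentence)
    return output

synthetic = generate_synthetic_texts("Le chat dort sur le canapé", 3)
print(synthetic)
```

With the real model, camembert.fill_mask would be passed as the fill_mask argument instead of the stub.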
Step 3: Get Majority Class and Execute
We are almost done. The only thing left is to detect the majority class and create N items per minority class according to the size of the majority class.
Read the text dataset
myModel = pd.read_csv('model.csv', sep=';')
Get the majority class
counter = Counter(myModel['Result'])
max_intent = ''
max_intent_count = 0
# Keep the label with the highest number of rows
for intent in set(myModel['Result']):
    if max_intent_count < counter[intent]:
        max_intent = intent
        max_intent_count = counter[intent]
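The same lookup can be written more compactly with Counter.most_common. This is a sketch on a toy list of labels (the label names are made up); on the real dataset the list would be replaced by the 'Result' column.

```python
from collections import Counter

# Toy labels standing in for the 'Result' column
labels = ["greeting", "order", "order", "complaint", "order", "greeting"]

counter = Counter(labels)
# most_common(1) returns a one-element list of (label, count) pairs
max_intent, max_intent_count = counter.most_common(1)[0]
print(max_intent, max_intent_count)
```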
Calculate the number of items to generate per minority class
threshold = floor(max_intent_count/2)
minority_intents = set(myModel['Result'])-set([max_intent])
for intent in minority_intents:
    print(intent, "started")
    intent_examples = myModel[myModel['Result']==intent]
    intent_count = intent_examples.shape[0]  # number of rows in this class
    examples_to_generate = max_intent_count - intent_count
    if examples_to_generate > threshold:
        # Large gap: use every example, several synthetic texts per example
        body_examples = intent_examples['Body']
        number_of_synthetic_per_example = ceil(examples_to_generate/intent_count)
    else:
        # Small gap: one synthetic text for each example of a subarray
        body_examples = intent_examples['Body'][:examples_to_generate]
        number_of_synthetic_per_example = 1
    print(intent_count, examples_to_generate, number_of_synthetic_per_example)
    synthetic_bodies = []
    for body in body_examples:
        synthetic_bodies.append(generate_synthetic_texts(body, number_of_synthetic_per_example))
    # flatten arrays
    synthetic_bodies = [item for sublist in synthetic_bodies for item in sublist]
    augmented_df = pd.DataFrame()
    augmented_df['Body'] = synthetic_bodies
    augmented_df['Result'] = intent
    # DataFrame.append was removed in pandas 2.0, so concatenate instead
    myModel = pd.concat([myModel, augmented_df], ignore_index=True)
    print(intent, "processed")
N.B. the threshold decides whether we generate synthetic items for every item in the minority class or only for a selected subarray. For instance, if our majority class is composed of 500 items and our minority class has 350, then 150 items are missing, which is below the threshold of 250. The most logical way is therefore to select a subarray of 150 items and generate 1 synthetic example for each.
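The arithmetic behind the threshold can be checked in isolation. The sketch below pulls the decision into a small helper; plan_augmentation is a hypothetical name introduced only for this illustration, and the class sizes are example values.

```python
from math import floor, ceil

def plan_augmentation(majority_count, minority_count):
    """Return (number of source texts to use, synthetic texts per source)."""
    threshold = floor(majority_count / 2)
    examples_to_generate = majority_count - minority_count
    if examples_to_generate > threshold:
        # Large gap: use every minority example, several times each
        return minority_count, ceil(examples_to_generate / minority_count)
    # Small gap: one synthetic text for a subarray of the minority class
    return examples_to_generate, 1

print(plan_augmentation(500, 350))  # (150, 1)
print(plan_augmentation(500, 100))  # (100, 4)
```

Because of the ceil, the first branch can slightly overshoot the majority size: with 500 and 120 items, each of the 120 examples gets 4 synthetic texts, so 480 are generated while only 380 are missing.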
The execution takes some time (40 minutes for a dataset of a couple of thousand lines), so it may be worth trying to parallelize it (maybe using Spark on Azure Databricks). I will try to publish an article on this in one of the next tutorials.
Hope this was useful