In the previous article we saw how to implement data augmentation for Computer Vision. In this article we are going to see how to augment textual data. Up we go!
There's an excellent reference explaining how data augmentation for NLP works. However, instead of using a ready-to-use library, it is more interesting to develop everything from scratch. Moreover, implementing the algorithm yourself lets you adjust it to your specific context.
In short, the data augmentation techniques include the following (a minimal from-scratch sketch of random deletion is shown right after the list):
Back translation
Synonym Replacement
Random Insertion
Random Deletion
Shuffle Sentences Transform etc.
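To give an idea of how simple some of these are, here is a minimal from-scratch sketch of random deletion; the probability p and the helper name are illustrative choices, not taken from any particular library.
import random

def random_deletion(text, p=0.1):
    # Randomly drop each word with probability p, keeping at least one word
    words = text.split()
    if len(words) <= 1:
        return text
    kept = [w for w in words if random.random() > p]
    if not kept:
        kept = [random.choice(words)]
    return " ".join(kept)

# Example: random_deletion("le chat dort sur le canapé", p=0.2)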
There's also a Python package, called NLPAug, that allows you to do some basic and advanced augmentation (a short usage sketch follows the list). NLPAug offers three types of augmentation:
Character level augmentation
Word level augmentation
Sentence level augmentation
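As an illustration of the word-level option, a typical NLPAug call looks roughly like the snippet below; it uses the library's WordNet-based SynonymAug augmenter, but check the NLPAug documentation for the exact API of the version you install.
import nlpaug.augmenter.word as naw

# WordNet-based synonym replacement at the word level
aug = naw.SynonymAug(aug_src='wordnet')
augmented_text = aug.augment("The quick brown fox jumps over the lazy dog")
print(augmented_text)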
According to Jakub Czakon, the author of the discussed reference:
From my experience, the most commonly used and effective technique is synonym replacement via word embeddings.
Sounds reasonable. But as mentioned in the introduction, we will try implementing all these steps ourselves.
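For context, synonym replacement via word embeddings usually boils down to swapping a word for one of its nearest neighbours in the embedding space. A minimal sketch with gensim might look like this; the embedding file path and the parameters are placeholders, not something used later in this article.
from gensim.models import KeyedVectors
import random

# Load any pre-trained word vectors (the path is a placeholder)
vectors = KeyedVectors.load_word2vec_format("embeddings.vec")

def replace_with_neighbour(text, n_candidates=5):
    # Swap one random in-vocabulary word for one of its nearest embedding neighbours
    words = text.split()
    candidates = [i for i, w in enumerate(words) if w in vectors]
    if not candidates:
        return text
    i = random.choice(candidates)
    neighbours = [w for w, _ in vectors.most_similar(words[i], topn=n_candidates)]
    words[i] = random.choice(neighbours)
    return " ".join(words)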
Step 0: Imports
import torch
import random
import re
import pandas as pd
from collections import Counter
from math import floor, ceil
Step 1: Contextual Word Embeddings
For this purpose we are going to use CamemBERT, a French extension of BERT. A short quote from the official website:
CamemBERT is a state-of-the-art language model for French based on the RoBERTa architecture pretrained on the French subcorpus of the newly available multilingual corpus OSCAR. We evaluate CamemBERT in four different downstream tasks for French: part-of-speech (POS) tagging, dependency parsing, named entity recognition (NER) and natural language inference (NLI); improving the state of the art for most tasks over previous monolingual and multilingual approaches, which confirms the effectiveness of large pretrained language models for French. CamemBERT was trained and evaluated by Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.
So, the first thing is to download the model and initialize the camembert object.
# Download the pre-trained CamemBERT model from the fairseq hub
camembert = torch.hub.load('pytorch/fairseq', 'camembert')
# Switch to evaluation mode (we only run inference, no training)
camembert.eval()
N.B. there are some additional dependencies to install, such as hydra-core and omegaconf. Important: do not confuse hydra-core with the hydra package, otherwise you will have to uninstall everything.
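Before using it inside the augmentation loop, it can be useful to see what fill_mask returns. In my understanding of the fairseq hub interface, each result is a (sentence, score, token) tuple ordered by probability; the example sentence is only an illustration:
masked_line = "Le camembert est <mask> :)"
# topk=3 returns the three most probable completions, best first
for filled_sentence, score, token in camembert.fill_mask(masked_line, topk=3):
    print(round(score, 3), filled_sentence)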
Step 2: Mask Random Word and Generate Synthetic Texts
This step is quite straightforward: simply mask a random word and send the masked line to camembert so that the language model can guess the masked word. The guessed words are ordered by their probabilities in descending order.
Mask a random word in a line
def mask_random_word(text_line):
    # Split the line into words and replace one randomly chosen word with the <mask> token
    text_array = text_line.split()
    masked_token_index = random.randint(0, len(text_array) - 1)
    text_array[masked_token_index] = "<mask>"
    output_text = " ".join(text_array)
    return output_text
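A quick illustrative call (the input sentence is made up for the example):
masked = mask_random_word("je voudrais réserver une table pour deux")
print(masked)  # e.g. "je voudrais <mask> une table pour deux"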
Generate synthetic text for a text line
def generate_synthetic_texts(text, number_of_examples):
    output = []
    for i in range(number_of_examples):
        # Mask a (new) random word each time and let CamemBERT fill it in
        masked_line = mask_random_word(text)
        # topk=1 keeps only the most probable completion; [0][0] is the filled sentence
        top_unmasked = camembert.fill_mask(masked_line, topk=1)
        output.append(top_unmasked[0][0])
    return output
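For instance (again with a made-up sentence), generating two variations of a single line would look like this:
variations = generate_synthetic_texts("je voudrais réserver une table pour deux", 2)
print(variations)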
Step 3: Get Majority Class and Execute
We are almost done. The only thing left is to detect the majority class and create N items per minority class according to the size of the majority class.
Read the text dataset
# The dataset is expected to have a 'Body' column (the text) and a 'Result' column (the intent label)
myModel = pd.read_csv('model.csv', sep=';')
Get the majority class
counter = Counter(myModel['Result'])
max_intent = ''
max_intent_count = 0
for intent in set(myModel['Result']):
    if max_intent_count < counter[intent]:
        max_intent = intent
        max_intent_count = counter[intent]
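As a side note, the same result can be obtained more compactly with Counter's built-in helper:
max_intent, max_intent_count = counter.most_common(1)[0]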
Calculate the number of items to generate per minority class
threshold = floor(max_intent_count/2)
minority_intents = set(myModel['Result']) - set([max_intent])
for intent in minority_intents:
    print(intent, "started")
    intent_examples = myModel[myModel['Result'] == intent]
    intent_count = intent_examples.shape[0]
    examples_to_generate = max_intent_count - intent_count
    if examples_to_generate > threshold:
        # Large gap: generate several synthetic texts for every example of the minority class
        body_examples = intent_examples['Body']
        number_of_synthetic_per_example = ceil(examples_to_generate/intent_count)
    else:
        # Small gap: one synthetic text for each of the first examples_to_generate examples is enough
        body_examples = intent_examples['Body'][:examples_to_generate]
        number_of_synthetic_per_example = 1
    print(intent_count, examples_to_generate, number_of_synthetic_per_example)
    syntetic_bodies = []
    for body in body_examples:
        syntetic_bodies.append(generate_synthetic_texts(body, number_of_synthetic_per_example))
    # flatten arrays
    syntetic_bodies = [item for sublist in syntetic_bodies for item in sublist]
    augmented_df = pd.DataFrame()
    augmented_df['Body'] = syntetic_bodies
    augmented_df['Result'] = intent
    # Append the augmented rows to the original dataset
    myModel = myModel.append(augmented_df)
    print(intent, "processed")
N.B. the threshold is needed to decide whether we generate synthetic items for every item of the minority class or only for a selected sub-array. For instance, if our majority class is composed of 500 items and our minority class has 350, then the most logical way is to select a sub-array of 150 items and generate 1 synthetic example for each.
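Plugging the numbers from this example into the branching logic above makes the rule explicit:
max_intent_count = 500
intent_count = 350
threshold = max_intent_count // 2                       # 250
examples_to_generate = max_intent_count - intent_count  # 150
# 150 <= 250, so we fall into the else branch: take the first 150 bodies
# and generate exactly 1 synthetic text per body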
Discussion
The execution takes some time (40 minutes for a dataset of a couple of thousand lines), so it may be interesting to try parallelizing the execution (maybe using Spark on Azure Databricks). I will try to publish an article on this in one of the next tutorials.
Hope this was useful
Hello, thank you again for this comment. The problem with my code is that it doesn't verify whether the generated text is a duplicate or not. Give me some time, I will update it. Tonight or this weekend.
For now, if you already know the number of instances to generate, I would suggest setting the number of synthetic values manually, like this:
Instead of
number_of_synthetic_per_example = ceil(examples_to_generate/intent_count)
use
number_of_synthetic_per_example = 2372
or something like this.
I will keep you updated
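In the meantime, a minimal sketch of such a duplicate check (not part of the original code, just an illustration) could filter the generated texts against the existing dataset and against each other:
def deduplicate(generated_texts, existing_texts):
    # Keep only generated texts that are not already present, dropping repeats as we go
    seen = set(existing_texts)
    unique = []
    for text in generated_texts:
        if text not in seen:
            unique.append(text)
            seen.add(text)
    return unique

# e.g. syntetic_bodies = deduplicate(syntetic_bodies, myModel['Body'])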
And I also get this error: augmented_df['tweet'] = syntetic_subjects. Do I need to put syntetic_bodies instead of syntetic_subjects?
Hi, I tried it and I get the following error with my dataset: 0 started 821 2372 1. It looks like it won't generate artificial text...