  • Alibek Jakupov

Data Augmentation for NLP with CamemBERT

Updated: Dec 11, 2022



In the previous article we saw how to implement data augmentation for Computer Vision. In this article we are going to see how to augment textual data. Up we go!


 

There's an excellent reference explaining how data augmentation for NLP works. However, instead of relying on a ready-made library, it is more interesting to develop everything from scratch. Moreover, implementing the algorithm yourself lets you adjust it to your specific context.


In short, data augmentation techniques include:

  • Back translation

  • Synonym Replacement

  • Random Insertion

  • Random Deletion

  • Shuffle Sentences Transform etc.
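To make two of the techniques listed above concrete, here is a minimal sketch of random deletion and sentence shuffling in plain Python. The function names, the deletion probability `p` and the naive sentence splitter are assumptions made for illustration; they do not come from the reference or from any library.

```python
import random
import re


def random_deletion(text, p=0.2, rng=None):
    """Drop each word with probability p, always keeping at least one word."""
    rng = rng or random.Random()
    words = text.split()
    if len(words) <= 1:
        return text
    kept = [word for word in words if rng.random() > p]
    # Fall back to a single random word if everything was dropped.
    return " ".join(kept) if kept else rng.choice(words)


def shuffle_sentences(text, rng=None):
    """Reorder the sentences of a text (splitting after ., ! and ? is a simplification)."""
    rng = rng or random.Random()
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    rng.shuffle(sentences)
    return " ".join(sentences)


rng = random.Random(0)
print(random_deletion("le chat dort sur le canapé", p=0.3, rng=rng))
print(shuffle_sentences("Il pleut. Je reste ici. Tu viens ?", rng=rng))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which helps when comparing models trained on the augmented data.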

There's also a Python package called NLPAug that supports both basic and advanced augmentation. NLPAug offers three types of augmentation:

  • Character level augmentation

  • Word level augmentation

  • Sentence level augmentation

According to Jakub Czakon, the author of the discussed reference:

From my experience, the most commonly used and effective technique is synonym replacement via word embeddings.

Sounds reasonable. But as mentioned in the introduction, we will try implementing all these steps ourselves.



Step 0: Imports


import torch
import random
import re
import pandas as pd

from collections import Counter
from math import floor, ceil

Step 1: Contextual Word Embeddings


For this purpose we are going to use CamemBERT, a French language model from the BERT family. A short quote from the official website:

CamemBERT is a state-of-the-art language model for French based on the RoBERTa architecture pretrained on the French subcorpus of the newly available multilingual corpus OSCAR. We evaluate CamemBERT in four different downstream tasks for French: part-of-speech (POS) tagging, dependency parsing, named entity recognition (NER) and natural language inference (NLI); improving the state of the art for most tasks over previous monolingual and multilingual approaches, which confirms the effectiveness of large pretrained language models for French. CamemBERT was trained and evaluated by Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah and Benoît Sagot.

So, the first thing to do is download the model and initialize the camembert object.

camembert = torch.hub.load('pytorch/fairseq', 'camembert')
camembert.eval()

N.B. There are some additional dependencies to install, such as hydra-core and omegaconf. Important: do not confuse hydra-core with the hydra package; otherwise you will have to uninstall everything.
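Before loading the model, it may help to check which of these packages are actually installed. The sketch below uses the standard library only and assumes that the names in the note are PyPI distribution names; `installed_version` is a hypothetical helper introduced here.

```python
from importlib import metadata


def installed_version(distribution_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


# "hydra-core" and "omegaconf" are the dependencies named in the note;
# "hydra" is the unrelated distribution that should not be installed.
for name in ("hydra-core", "omegaconf", "hydra"):
    print(name, "->", installed_version(name))
```

If the last line prints a version rather than None, the wrong hydra package is probably present.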




Step 2: Mask Random Word and Generate Synthetic Texts


This step is quite straightforward: mask a random word and send the masked line to camembert so that the language model guesses the masked word. The guesses are returned in descending order of probability.


Mask random word in a line


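def mask_random_word(text_line):
    # Split on whitespace so that each word can be masked individually
    text_array = text_line.split()

    # Pick one word at random and replace it with the mask token
    masked_token_index = random.randint(0, len(text_array)-1)
    text_array[masked_token_index] = "<mask>"
    output_text = " ".join(text_array)

    return output_text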
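Note that the function above raises a ValueError on an empty or whitespace-only line, because random.randint(0, -1) has no valid range. A slightly more defensive variant could look like the sketch below; the name `mask_random_word_safe` and the `mask_token` and `rng` parameters are assumptions added for illustration.

```python
import random


def mask_random_word_safe(text_line, mask_token="<mask>", rng=None):
    """Replace one randomly chosen whitespace-separated word with mask_token.

    Returns None when the line has no words, so callers can skip it.
    """
    rng = rng or random.Random()
    words = text_line.split()
    if not words:
        return None
    index = rng.randrange(len(words))
    words[index] = mask_token
    return " ".join(words)


print(mask_random_word_safe("le camembert est délicieux", rng=random.Random(42)))
print(mask_random_word_safe("   "))  # None: nothing to mask
```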

Generate synthetic text for a text line

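def generate_synthetic_texts(text, number_of_examples):
    output = []
    for i in range(number_of_examples):
        masked_line = mask_random_word(text)
        # Keep only the most probable completion for the masked line
        top_unmasked = camembert.fill_mask(masked_line, topk=1)
        # Each result is a tuple whose first element is the filled-in sentence
        output.append(top_unmasked[0][0])
    return output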
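One limitation of the function above is that it can return the same sentence several times: the same word may be masked twice, and only the top prediction is kept. The sketch below shows one possible way to collect distinct variants only. Everything in it is hypothetical: `generate_unique_texts`, the `fill_fn` callback, `max_attempts`, and the toy stand-in for the model.

```python
import random


def generate_unique_texts(text, number_of_examples, fill_fn, max_attempts=50, rng=None):
    """Collect up to number_of_examples distinct variants of text.

    fill_fn(masked_line) must return candidate sentences, best first.
    """
    rng = rng or random.Random()
    words = text.split()
    seen = {text}  # never return the unchanged input
    output = []
    for _ in range(max_attempts):
        if len(output) >= number_of_examples or not words:
            break
        masked = words.copy()
        masked[rng.randrange(len(words))] = "<mask>"
        # Take the best candidate that has not been produced yet.
        for candidate in fill_fn(" ".join(masked)):
            if candidate not in seen:
                seen.add(candidate)
                output.append(candidate)
                break
    return output


# Toy stand-in for the language model, used only to make the sketch runnable.
def toy_fill(masked_line):
    return [masked_line.replace("<mask>", word) for word in ("bon", "grand", "petit")]


print(generate_unique_texts("un bon fromage", 4, toy_fill, rng=random.Random(0)))
```

With the model from Step 1, `fill_fn` could presumably be `lambda line: [c[0] for c in camembert.fill_mask(line, topk=5)]`, assuming fill_mask returns tuples whose first element is the filled sentence, which is what the code above already relies on. The `max_attempts` cap keeps the loop from running forever on very short texts.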


Step 3: Get Majority Class and Execute


We are almost done. The only thing left is to detect the majority class and create N items per minority class according to the size of the majority class.


Read the text dataset

myModel = pd.read_csv('model.csv', sep=';')

Get the majority class

counter = Counter(myModel['Result'])
max_intent = ''
max_intent_count = 0

for intent in set(myModel['Result']):
    if max_intent_count < counter[intent]:
        max_intent = intent
        max_intent_count = counter[intent]
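The loop above can also be written with Counter.most_common, which returns (value, count) pairs sorted by count. A small sketch on made-up labels (the label values are invented for the example):

```python
from collections import Counter

labels = ["greeting", "order", "order", "refund", "order", "greeting"]  # made-up labels

counter = Counter(labels)
max_intent, max_intent_count = counter.most_common(1)[0]

print(max_intent, max_intent_count)  # order 3
```

On the dataset from this article, the equivalent call would be Counter(myModel['Result']).most_common(1)[0].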

Calculate the number of items to generate per minority class


threshold = floor(max_intent_count/2)
minority_intents = set(myModel['Result'])-set([max_intent])


for intent in minority_intents:
    print(intent, "started")
    intent_examples = myModel[myModel['Result']==intent]
    intent_count = intent_examples.shape[0]
    examples_to_generate = max_intent_count - intent_count
    if examples_to_generate > threshold:
        # Large gap: use every example and generate several variants of each
        body_examples = intent_examples['Body']
        number_of_synthetic_per_example = ceil(examples_to_generate/intent_count)
    else:
        # Small gap: generate one variant for a subset of the examples
        body_examples = intent_examples['Body'][:examples_to_generate]
        number_of_synthetic_per_example = 1

    print(intent_count, examples_to_generate, number_of_synthetic_per_example)

    syntetic_bodies = []
    for body in body_examples:
        syntetic_bodies.append(generate_synthetic_texts(body, number_of_synthetic_per_example))

    # flatten arrays
    syntetic_bodies = [item for sublist in syntetic_bodies for item in sublist]

    augmented_df = pd.DataFrame()
    augmented_df['Body']=syntetic_bodies
    augmented_df['Result']=intent

    # DataFrame.append was removed in pandas 2.0, so concatenate instead
    myModel = pd.concat([myModel, augmented_df], ignore_index=True)
    print(intent, "processed")

N.B. The threshold determines whether we generate synthetic items for every item in the minority class or only for a selected subarray. For instance, if our majority class is composed of 500 items and our minority class has 350, the most logical way is to select a subarray of 150 items and generate 1 synthetic example for each.
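The threshold rule can be isolated into a small function to check the arithmetic on a few class sizes. `plan_augmentation` is a hypothetical helper that mirrors the logic of the loop above; it is not part of the original code.

```python
from math import ceil, floor


def plan_augmentation(majority_count, minority_count):
    """Return (number of source examples, synthetic texts per source example)."""
    threshold = floor(majority_count / 2)
    examples_to_generate = majority_count - minority_count
    if examples_to_generate > threshold:
        # Large gap: use every minority example, several variants each.
        return minority_count, ceil(examples_to_generate / minority_count)
    # Small gap: one variant for a subset of the minority examples.
    return examples_to_generate, 1


print(plan_augmentation(500, 350))  # (150, 1)
print(plan_augmentation(500, 120))  # (120, 4)
```

In the second case the gap is 380 items, but 120 examples with 4 variants each give 480 synthetic texts, so rounding up can overshoot the majority class size.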


Discussion


The execution takes some time (40 minutes for a dataset of a couple of thousand lines), so it may be interesting to try parallelizing the execution (maybe using Spark on Azure Databricks). I will try to cover this in one of the next tutorials.
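As a local alternative to the Spark idea, the per-example generation could be spread over a thread pool. This is only a sketch under assumptions: `augment_in_parallel` and `fake_generate` are hypothetical names, and whether threads actually speed up a PyTorch model depends on the hardware and on how much time is spent inside the model, which has not been measured here.

```python
from concurrent.futures import ThreadPoolExecutor


def augment_in_parallel(texts, generate_fn, max_workers=4):
    """Apply generate_fn to every text concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(generate_fn, texts))


# Stand-in for the slow model call, used only to make the sketch runnable.
def fake_generate(text):
    return [text.upper()]


print(augment_in_parallel(["bonjour", "merci"], fake_generate))
# [['BONJOUR'], ['MERCI']]
```

In the article's loop, `generate_fn` would be something like `lambda body: generate_synthetic_texts(body, number_of_synthetic_per_example)`.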

 

Hope this was useful


8 comments


ajakupov
Feb 3, 2022

Hello, thank you again for this comment. The problem with my code is that it doesn't verify whether the generated text is a duplicate or not. Give me some time, I will update it. Tonight or this week-end.


For now, if you already know the number of instances to generate, I would suggest setting the number of synthetic values manually, like this:


Instead of

number_of_synthetic_per_example =ceil(examples_to_generate/intent_count)

use

number_of_synthetic_per_example = 2372


Or something like this.


I will keep you updated


mathieualexhache
Jan 28, 2022

And I also get this error: augmented_df['tweet']=syntetic_subjects. Do I need to put syntetic_bodies instead of syntetic_subjects ?

mathieualexhache
Feb 1, 2022
Reply

No problem !


mathieualexhache
Jan 28, 2022

Hi, I tried it and I get the following error with my dataset: 0 started 821 2372 1. It looks like it won't generate artificial text..

mathieualexhache
Feb 1, 2022
Reply

Thank you for your response, it is greatly appreciated. I want to clarify that I am new to the NLP world. I'm working on a hate speech detection project (a simple binary classification problem) for French tweets. The classes of our dataset are unbalanced, we find 821 neutral tweets and 3193 hateful tweets. So I'm looking to increase the minority class by 2372 tweets. I want, for tweets from the minority class (0 = neutral), to randomly mask some of their words and then send those masked tweets back to the Camembert model so that it fills in the missing words with others which are semantically similar. This way, we could have artificially increased the number of neutral tweets and…

