Feature Extraction from Text: Some Unusual Techniques

Alibek Jakupov
Nov 13, 2022
10 min read

It's common knowledge that feature extraction is more art or craftsmanship than pure science, especially when it comes to unstructured data such as text. In most cases, data scientists and ML engineers sort vectorisation techniques into three major categories. First, one-hot style encoders, such as Bag of Words, TF-IDF, label encoders and hashing algorithms. Second, unidirectional embeddings, such as word2vec (including its skip-gram variant) or GloVe. And finally, bidirectional models, mainly based on BERT and its extensions such as RoBERTa and CamemBERT. However, when the task goes beyond trivial classification, one may wonder whether there are feature extraction techniques that not only help train a reliable classifier, but also reveal the deep patterns hidden in a large corpus of text documents. Such cases may involve psychological analysis, lie detection or text coherence evaluation. In this article we are going to dive deeper into this subject. Up we go!

Introduction

In most cases, text classification systems consist of two parts: a feature extraction component and a classifier. The former generates features from a text sequence, and the latter assigns class labels to that sequence given the corresponding features.

Commonly, such features include lexical and syntactic components. Lexical features cover the total number of words, the number of characters per word, and the frequency of long and unique words, whereas syntactic features are mainly based on the frequency of function words or phrases, such as n-grams, bag-of-words (BOW), or Part-of-Speech (POS) tags.
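To make the lexical side more concrete, here is a minimal sketch in plain Python. The function name lexical_features, the regular expression used as a tokenizer and the threshold of 7 characters for a "long" word are all my own assumptions for illustration, not something prescribed by the works discussed here.

```python
import re

def lexical_features(text, long_word_len=7):
    """Return a few simple lexical features for one document.

    long_word_len is an arbitrary, hypothetical threshold for "long" words.
    """
    # Very rough tokenizer: lowercase alphabetic runs (assumption).
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_words = len(words)
    if n_words == 0:
        return {"n_words": 0, "avg_chars_per_word": 0.0,
                "long_word_ratio": 0.0, "unique_word_ratio": 0.0}
    return {
        "n_words": n_words,
        "avg_chars_per_word": sum(len(w) for w in words) / n_words,
        "long_word_ratio": sum(len(w) >= long_word_len for w in words) / n_words,
        "unique_word_ratio": len(set(words)) / n_words,
    }

features = lexical_features("The room was clean and the staff was friendly")
print(features)
```

The resulting dictionary can be turned into a fixed-order list and passed to any classifier. A real system would probably use a proper tokenizer instead of a regular expression.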

There also exist lexicon containment features, which express the presence of a lexicon term in the text as a binary value (positive = occurs, negative = doesn't occur). The lexicons for such features may be designed by a human expert or generated automatically.
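A lexicon containment vector is easy to sketch. The lexicon below is a made-up toy example, and the function name lexicon_containment is hypothetical; here 1 stands for "occurs" and 0 for "doesn't occur".

```python
import re

def lexicon_containment(text, lexicon):
    """Map every lexicon term to 1 if it occurs in the text, else 0."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    # Sort the terms so that the feature order is stable across documents.
    return {term: int(term in tokens) for term in sorted(lexicon)}

# Toy lexicon, invented for illustration only.
toy_lexicon = {"amazing", "luxurious", "husband", "vacation"}
vector = lexicon_containment("My husband and I had an amazing stay", toy_lexicon)
print(vector)
```

This sketch only matches single-word terms. Multi-word lexicon entries would need phrase matching instead of a token set.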

Some approaches suggest using morphological links and dependency-based linguistic units in the text as the input vector for the classifier. In addition to this, there are semantic vector space models, which represent each word with a real-valued vector so that the distance or angle between pairs of word vectors can be used in a variety of tasks, such as information retrieval, document classification, question answering, named entity recognition and parsing.
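The "angle between pairs of word vectors" is usually measured with cosine similarity. Below is a small sketch with three hand-made 3-dimensional vectors. The numbers are invented for illustration and do not come from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, not taken from a real model.
toy_vectors = {
    "hotel": [0.9, 0.1, 0.0],
    "motel": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

print(cosine_similarity(toy_vectors["hotel"], toy_vectors["motel"]))
print(cosine_similarity(toy_vectors["hotel"], toy_vectors["banana"]))
```

With these toy values, "hotel" ends up much closer to "motel" than to "banana", which is the kind of regularity a trained vector space model is expected to capture.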

Besides these common linguistic features, there are also so-called domain-specific features, for instance quoted words or external links. There also exist methods based on Knowledge Graphs (KG), which map the terms of a text to an external knowledge source and allow a more effective extraction of patterns from noisy data.

This technique may be robust, as it integrates an external knowledge source and adds common-sense knowledge to the analyzer. All the feature extraction algorithms mentioned above can be used to examine the weights of the features, which lets researchers shed light on the commonalities in the structure of deceptive opinion spam that are less present in truthful sentences.

Although this approach proves to be useful, it has some significant drawbacks: the quality of the training set is difficult to control, and building a reliable classifier requires a considerable number of high-quality labeled texts. Moreover, certain classification models based on embeddings may be strongly affected by social or personal attitudes present in the training data, which makes the algorithm draw wrong conclusions. In certain cases the inferences of an algorithm may be perfect on the training set and fail to generalize to new cases, which poses a serious challenge for Deceptive Opinion Spam detection. From this point of view, weakly-supervised or unsupervised models based on topic modeling may perform better thanks to their better generalization capacity and their independence from the training data.

In certain cases, to obtain a deeper understanding of how lies are expressed in texts, researchers investigate other approaches, such as using topic modeling to learn the patterns that constitute a certain class and then exploring the outputs of the model to identify those patterns. Topic models may be useful in this task due to their ability to group multiple documents into smaller sets of key topics. Unlike neural nets, which model documents as dense vectors, topic models represent each document as a sparse mixed-membership vector over topics, which means that most of its elements are close to zero.

However, as topic modelling algorithms based on LDA do not take advantage of dense word representations, which can capture semantically meaningful regularities between words, one can extend the research to other algorithms that use word-level representations to build document-level abstractions, such as lda2vec. lda2vec extends Skip-gram with Negative Sampling (SGNS) to jointly train word, document and topic vectors and embed them in a common representation space, which takes into account the semantic relations between the learned vector representations. At the same time, this representation space yields sparse document-to-topic proportions, which allows us to interpret the vectors and draw conclusions on the nature of deception.
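To give an intuition of how word, document and topic vectors can share one space, here is a rough sketch of the way lda2vec is usually described: a document vector is a mixture of topic vectors weighted by document-to-topic proportions, and it is added to a word vector to form the context used for prediction. The function names and toy numbers are my own, and the training loop, the negative sampling loss and the sparsity-inducing prior are deliberately left out.

```python
import math

def softmax(weights):
    """Turn unnormalized document weights into topic proportions."""
    m = max(weights)  # subtract the max for numerical stability
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def document_vector(doc_weights, topic_vectors):
    """Mix the topic vectors according to the document's topic proportions."""
    proportions = softmax(doc_weights)
    dim = len(topic_vectors[0])
    return [sum(p * t[i] for p, t in zip(proportions, topic_vectors))
            for i in range(dim)]

def context_vector(word_vector, doc_weights, topic_vectors):
    """Context = pivot word vector + document vector, in the same space."""
    doc_vec = document_vector(doc_weights, topic_vectors)
    return [w + d for w, d in zip(word_vector, doc_vec)]

# Toy example with 2 topics in a 2-dimensional space (invented values).
topics = [[1.0, 0.0], [0.0, 1.0]]
print(softmax([10.0, -10.0]))                       # nearly one-hot proportions
print(context_vector([1.0, 0.0], [0.0, 0.0], topics))
```

When the proportions are close to one-hot, as in the first print, a document can be read as being "about" a single topic, which is what makes the representation interpretable.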

As we will see further, some researchers not only combine the approaches listed above, but also invent their own, based on domain expertise. The cases below consider the very challenging task of detecting pre-paid reviews on the internet, which is commonly referred to as Deceptive Opinion Spam.

Linguistic Inquiry Features

Ott was the first to address deception detection with a machine learning approach. One of the important contributions of this work is the demonstration that both the context and the motivations underlying a deception need to be considered, instead of focusing purely on a pre-defined set of deception cues such as Linguistic Inquiry and Word Count (LIWC), which is widely used to explore personality traits and investigate tutoring dynamics. Accordingly, the authors combined features from the psycholinguistic approach with the standard text categorization approach, and reached a performance of 89.8% with the model based on LIWC and bigrams.

Nevertheless, these features are not robust to domain change, as they do well only if the training and testing datasets come from the same domain. For instance, simply shifting the polarity of the reviews between training and testing (i.e. training on positive reviews and testing on negative ones) significantly reduced the overall performance of the model. Topic modeling, in this context, is more flexible, as demonstrated by David Blei and colleagues, who applied their model to domains as diverse as computer vision, genetic markers, surveys and social network data.

Generative Bayesian Approach

Li obtained satisfying results by capturing the overall dissimilarities between truthful and deceptive texts. In this research the authors extended the Sparse Additive Generative Model (SAGE), a generative Bayesian approach that combines topic models with generalized additive models and creates multi-faceted latent variable models by adding together the component vectors.
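The additive idea behind SAGE can be illustrated in a few lines: each facet contributes a (mostly zero) deviation to a shared background vector of log-frequencies, and the sum is normalized into a word distribution. The sketch below only shows this combination step with invented numbers. The function name sage_word_distribution is hypothetical, and the actual inference procedure of the model is not shown.

```python
import math

def sage_word_distribution(background, *deviations):
    """Add sparse facet deviations to background log-frequencies and normalize.

    background: dict word -> background log-frequency
    deviations: dicts word -> deviation; missing words mean a deviation of 0
    """
    scores = {w: background[w] + sum(d.get(w, 0.0) for d in deviations)
              for w in background}
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}

# Toy values, invented for illustration only.
background = {"room": -1.0, "clean": -1.5, "luxurious": -3.0}
deceptive_facet = {"luxurious": 1.0}   # sparse: only one word deviates
hotel_domain_facet = {"room": 0.5}

print(sage_word_distribution(background))
print(sage_word_distribution(background, deceptive_facet, hotel_domain_facet))
```

Because the deviations are kept separate, one can inspect which words each facet pushes up or down, independently of the others.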

As most of the research in the domain focused on detecting deceptive patterns rather than on training a single reliable classifier, the main challenge was to identify the features contributing the most to each type of deceptive review and to evaluate the impact of such features on the final decision when combined with the other features. SAGE fits these needs due to its additive nature, whereas other classifiers may suffer from domain-specific properties in cross-domain scenarios. The authors found that the BOW approach is less robust than LIWC and POS features modeled using SAGE, and built a general rule of deceptive opinion spam from these domain-independent features.

Although the domain-independent features extracted during the research proved to be efficient and made it possible to detect fake reviews with above-chance accuracy, it has been demonstrated that the sparsity of these features makes it difficult to leverage non-local discourse structures, so the trained model is unable to capture the global semantic information of a document.

Three-stage System

Ren and Ji proposed a three-stage system for feature extraction. First, they constructed sentence representations from word representations with the help of a convolutional neural network, as the convolution operation is commonly applied to synthesize lexical n-gram information. For this step they applied three convolutional filters, as these are capable of capturing the local semantics of n-grams, such as unigrams, bigrams and trigrams, an approach that has already proven successful for tasks such as sentiment classification. After that, they modeled the semantic and discourse relations of these sentence vectors to construct a document representation using a bi-directional gated recurrent neural network. These document vectors were finally used as features to train a classifier. The authors achieved a relatively high accuracy and showed that neural networks may be applied to learn continuous document representations that better capture semantic characteristics.
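The first stage can be sketched without any deep learning framework. The snippet below slides three filters of widths 1, 2 and 3 over a sequence of word vectors and keeps the strongest activation of each filter. The random, untrained weights, the tanh activation, the max pooling and all the names are assumptions made for illustration, not the authors' exact configuration, and the recurrent document-level stage is not covered.

```python
import math
import random

def conv_feature(word_vectors, filter_weights, width):
    """Slide one filter of the given width over the sentence and max-pool."""
    activations = []
    for start in range(len(word_vectors) - width + 1):
        # Concatenate the vectors of the n-gram window into one flat list.
        window = [x for vec in word_vectors[start:start + width] for x in vec]
        pre_activation = sum(w * x for w, x in zip(filter_weights, window))
        activations.append(math.tanh(pre_activation))
    # A sentence shorter than the filter width contributes a neutral 0.0.
    return max(activations) if activations else 0.0

def sentence_vector(word_vectors, filters):
    """One feature per filter: unigram, bigram and trigram detectors."""
    return [conv_feature(word_vectors, weights, width)
            for width, weights in filters]

rng = random.Random(0)  # fixed seed, weights are random and untrained
dim = 4                 # hypothetical embedding size
filters = [(width, [rng.uniform(-1, 1) for _ in range(width * dim)])
           for width in (1, 2, 3)]
sentence = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]

print(sentence_vector(sentence, filters))
```

In a trained network there would be many filters per width, and their weights would be learned together with the rest of the model.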

The main goal of their study was to empirically demonstrate the better performance of neural features over traditional discrete features (like n-grams, POS, LIWC, etc.) due to their stronger generalization.

Nevertheless, further experiments conducted by the authors showed that the overall accuracy may be improved by integrating discrete and neural features, so discrete features remain a rich source of statistical and semantic information. This suggests that jointly trained word, topic and document vectors, represented in a common vector space, may improve the overall accuracy of deceptive spam classifiers.

Specific Details Features

Vogler and Pearl investigated the use of specific details for detecting deception both within and across domains. The linguistic features they covered in the research included n-grams, POS, syntactic structure, measures of syntactic complexity, stylometric features, semantically-related keyword lists, psychologically-motivated keyword lists, sentiments, discourse structure and named entities. The authors claim that these features are not robust enough, especially in cases where the domain may vary significantly, as most of them tend to rely on cues that are very dependent on specific lexical items, such as n-grams or specific keyword lists. Though there are some linguistically abstract features like POS, stylometric features or syntactic rules, the authors consider them less relevant, as they are not motivated by the psychology of verbal deception. In their research they considered deception as an act of imagination, and besides analyzing the linguistic approaches they also explored psychologically-motivated methods, such as information management theory, information manipulation theory, reality monitoring and criteria-based statement analysis. As more abstract psychologically-motivated linguistic cues may be more applicable across domains, the authors find it useful to ground these cues in the psychological theories of how humans generate deceptive texts.

They also relied on the results provided by Krüger, whose research focuses on subjectivity detection in newspaper articles and suggests that linguistically abstract features may be more robust when applied to cross-domain texts. For the experiments, Vogler and Pearl used three datasets for training and testing, with domain changes varying from fairly minimal to quite broad, like essays on emotionally charged topics and personal interview questions. The linguistically-defined specific detail features the authors constructed for this research proved to be useful when the training and testing domains vary significantly. The features are based on prepositional phrase modifiers, adjective phrases, exact number words, proper nouns and noun modifiers that appear as consecutive sequences. Each feature is represented as the total normalized number and the average normalized weight. They achieved a best F-score of 0.91 for the cases where the content doesn't change and a best F-score of 0.64 where the content domain changes most significantly, which demonstrates that the linguistically-defined specific detail features are more generalizable across domains.
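Two of these cues, exact number words and proper nouns, can be roughly approximated without a parser. The sketch below is a crude heuristic of my own: it treats digits and a small hand-made list of number words as "exact numbers" and any capitalised token that is not sentence-initial as a "proper noun", then normalizes the counts by the number of tokens. The original features rely on real POS tags and syntactic phrases, which this does not reproduce.

```python
import re

# Small hand-made list, for illustration only.
NUMBER_WORDS = {"one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"}

def specific_detail_features(text):
    """Normalized counts of two rough 'specific detail' cues."""
    n_tokens = n_numbers = n_proper = 0
    for sentence in re.split(r"[.!?]+", text):
        tokens = re.findall(r"[A-Za-z]+|\d+", sentence)
        n_tokens += len(tokens)
        for position, token in enumerate(tokens):
            if token.isdigit() or token.lower() in NUMBER_WORDS:
                n_numbers += 1
            # Heuristic: capitalised, not sentence-initial, not the pronoun "I".
            elif position > 0 and token[0].isupper() and token != "I":
                n_proper += 1
    if n_tokens == 0:
        return {"exact_numbers": 0.0, "proper_nouns": 0.0}
    return {"exact_numbers": n_numbers / n_tokens,
            "proper_nouns": n_proper / n_tokens}

review = "We stayed at the Hilton in Chicago for 3 nights. The room cost 120 dollars."
print(specific_detail_features(review))
```

A tagger from a package such as spaCy or NLTK would give far more reliable proper noun and modifier counts than this capitalisation trick.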

However, even if the classifier trained on these features produced fewer false negatives, it classified the truthful texts poorly. As can be seen from the experimental results, a mix of n-gram and linguistically-defined specific detail features tends to be more reliable only when a false positive is more costly than a false negative. It should also be mentioned that n-gram-based features may have more semantic generalization capacity when based on distributed meaning representations, such as GloVe and ELMo, whereas the n-gram features in their approach are based on individual words and do not capture semantic relatedness.
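The trade-off between false positives and false negatives is easier to see with numbers. The helper below computes precision, recall and the F-score from raw counts. The counts in the example are invented to mimic a classifier that rarely misses a deceptive text but often flags truthful ones. They are not results from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for the 'deceptive' class from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: few false negatives (10), many false positives (40).
precision, recall, f1 = precision_recall_f1(tp=90, fp=40, fn=10)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Here recall stays high because few deceptive texts are missed, while precision suffers from the truthful texts wrongly flagged as deceptive.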

BERT with Ablation Study

Barsever built a state-of-the-art classifier using BERT and then analyzed this classifier to identify the patterns BERT had learned in order to classify the deceptive reviews. BERT is a neural network architecture pretrained on millions of words with Masked Language Modeling (MLM), which conditions jointly on left and right context in all layers to learn deep bidirectional language representations. The main advantage of BERT is the fact that it learns rules and features unsupervised, which allows it to look for the best solution unrestricted by preconceived rules. With this model, Barsever achieved a high accuracy, which supports the existence of features that distinguish truth from deception.

As the main goal of the research was to find rules and patterns of deceptive language, the authors performed an ablation study, removing each POS in turn and monitoring the performance of the network. Moreover, they detected so-called 'swing' sentences, which are more important than the others for the classifier, and ran a POS analysis on them to shed light on the inner rules BERT had constructed.
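The logic of such an ablation study can be sketched independently of BERT. In the snippet below, the classifier is a deliberately trivial stand-in, the data is a two-review toy set, and replacing the ablated tokens with a [MASK] placeholder is my assumption about how "removing" could be implemented. All names are hypothetical.

```python
def mask_pos(tagged_tokens, pos_tag, mask="[MASK]"):
    """Replace every token carrying the given POS tag with a mask symbol."""
    return [(mask, tag) if tag == pos_tag else (word, tag)
            for word, tag in tagged_tokens]

def accuracy(classifier, dataset):
    """Share of examples for which the classifier returns the gold label."""
    correct = sum(classifier(tokens) == label for tokens, label in dataset)
    return correct / len(dataset)

def pos_ablation(classifier, dataset, pos_tags):
    """Accuracy drop observed when each POS tag is masked in turn."""
    baseline = accuracy(classifier, dataset)
    drops = {}
    for tag in pos_tags:
        ablated = [(mask_pos(tokens, tag), label) for tokens, label in dataset]
        drops[tag] = baseline - accuracy(classifier, ablated)
    return drops

def toy_classifier(tagged_tokens):
    """Stand-in for a trained model: keys on a single word."""
    words = {word for word, _ in tagged_tokens}
    return "deceptive" if "luxury" in words else "truthful"

toy_data = [
    ([("pure", "JJ"), ("luxury", "NN")], "deceptive"),
    ([("the", "DT"), ("shower", "NN"), ("leaked", "VBD")], "truthful"),
]

print(pos_ablation(toy_classifier, toy_data, ["NN", "JJ"]))
```

A large drop for one tag means that the classifier depends on tokens of that type, although, as with any ablation, it does not say why.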

Finally, the authors created a Generative Adversarial Network (GAN) based on their BERT model, whose goal is to fool the classifier, in order to find out which trends are reproduced in the synthetic data. Their findings indicate that certain POS (e.g. singular nouns) are more important for the classifier than others, and that truthful texts are richer in terms of POS variance, whereas deceptive reviews are more formulaic.

Nevertheless, the approach applied by Barsever faces some important challenges. In fact, the main disadvantage of BERT is the absence of independent sentence embeddings, which can play an important role as a higher level of abstraction. Not surprisingly, the authors had to manually remove sentences one by one from the initial dataset, replacing them with [MASK] tokens and excluding the one-sentence entries. In addition, the rules generated by BERT are still unclear to the authors, and the results of the ablation study may reveal other commonalities instead of identifying the patterns of deception. For instance, removing the singular nouns resulted in a significant drop in the model's performance, which is interpreted as a strong weight of this POS in the classifier. We can nevertheless infer from these results that, due to the prevalence of nouns in speech, replacing them may simply produce incomprehensible texts, which the classifier can hardly interpret.

Of course, in most cases ML developers do not have to concern themselves with choosing the best vectorisation technique, as most of the classical Python packages like scikit-learn, spaCy or NLTK do the job without making the developer dive deeply into the subject. There even exist many frameworks written in languages other than Python, like ML.NET, designed for C# developers. These packages mainly use one-hot style encoders like TF-IDF, which, surprisingly, may provide even better results than embeddings. Solutions like Azure Machine Learning may also allow you to apply ready-to-use blocks that will, based on your compute, implement BERT or TF-IDF to convert your text into structured feature data. But if you were interested in how you can create your own vectorisers, or wondered whether there exist any non-trivial approaches, you may find this short article helpful. At least I hope you will enjoy reading it as much as I enjoyed writing it.