Feature Extraction Techniques for Different NLP Applications
NLP (Natural Language Processing) is a discipline that focuses on the understanding, manipulation and generation of natural language by machines. Thus, NLP is really at the interface between computer science and linguistics. It is about the ability of the machine to interact directly with humans.
Over the past few years we have seen tremendous breakthroughs in the domain, especially with the appearance of huge models such as Turing, BERT, GPT-3 and, in the near future, GPT-4. These models can generate meaningful text, answer questions, convert code into text, or even act as one-shot or zero-shot classifiers, with no explanation or only a few examples provided.
One may think that there is nothing left to do for individual researchers and engineers, as all of those models were developed by huge companies investing billions of dollars into R&D, and only "big boys" like Facebook, Google or Microsoft can shine. But what if I told you that there are still some applications where even monstrous models like GPT-3 can barely outperform hand-crafted models? In this article we are going to see some of the applications where feature engineering is more crucial than hardware capacity, and craftsmanship is more important than technology.
Applications of text classification are becoming increasingly interesting with the rising number of natural language processing systems, such as topic and sentiment classifiers, systems detecting the domain or domains of interest to a user, fake news analysers or medical diagnostic systems, where text classifiers may significantly improve the robustness and scalability of existing solutions without compromising their accuracy. In most cases, these systems consist of two parts: a feature extraction component and a classifier. The former generates features from a text sequence, and the latter assigns class labels to the sequence.
Commonly, such features include lexical and syntactic components. Lexical features include the total number of words, the number of characters per word, and the frequency of long and unique words, whereas syntactic features are mainly based on the frequency of function words or phrases, like n-grams, bag-of-words, or Part-Of-Speech (POS) tagging. There also exist lexicon containment features, which express the presence of a lexicon term in the text as a binary value (positive = occurs, negative = doesn't occur). The lexicons for such features may be designed by a human expert or generated automatically. Some approaches suggest using morphological links and dependency linguistic units in the text as the input vector for the classifier. In addition, there are semantic vector space models, which represent each word with a real-valued vector and compare words through the distance or angle between pairs of word vectors. They are used for a variety of tasks such as information retrieval, document classification, question answering, named entity recognition and parsing. Besides these common linguistic features, there are also so-called domain-specific features, for instance, quoted words or external links. Other features can be designed specifically for a certain task, for example, to capture deceptive cues in writing styles in order to identify fake news.
However, finding an optimal set of features remains a challenge, as most models suffer significant drawbacks depending on the application context. While certain methods may do well on a specific problem, they may still perform poorly on others. For instance, in tasks like fake news detection, existing linguistic features are considered insufficient for revealing the in-depth underlying distribution patterns. Some syntactic features require running expensive parsing models during the evaluation phase. When the input texts are sparse, noisy and ambiguous, and there is a need for faster training and adaptation to new words, poor performance of existing techniques is unavoidable. Thus, there is a dire need to address this issue in order to produce vector spaces with meaningful substructure.
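As a rough illustration of the lexical and lexicon containment features described above, here is a minimal sketch in plain Python. The function names, the threshold of 7 characters for a "long" word, the reading of "unique words" as the share of distinct words, and the tiny example lexicon are all assumptions made for this example, not a standard definition.

```python
import re

def lexical_features(text, long_word_len=7):
    """Compute a few simple lexical features for one text.

    long_word_len is an assumed threshold for what counts as a long word.
    """
    words = re.findall(r"[A-Za-z']+", text.lower())
    total = len(words)
    if total == 0:
        return {"total_words": 0, "chars_per_word": 0.0,
                "long_word_freq": 0.0, "unique_word_freq": 0.0}
    return {
        "total_words": total,
        "chars_per_word": sum(len(w) for w in words) / total,
        "long_word_freq": sum(len(w) >= long_word_len for w in words) / total,
        # Share of distinct words among all words (one possible reading of "unique").
        "unique_word_freq": len(set(words)) / total,
    }

def lexicon_containment(text, lexicon):
    """Return a binary vector: 1 if the lexicon term occurs in the text, else 0."""
    words = set(re.findall(r"[A-Za-z']+", text.lower()))
    return [1 if term in words else 0 for term in lexicon]

sample = "The service was terrible, but the food was great."
features = lexical_features(sample)
# Hypothetical three-term lexicon; a real one would come from an expert or a corpus.
flags = lexicon_containment(sample, ["great", "awful", "terrible"])
print(features)
print(flags)
```

The resulting dictionary values and the binary flags can be concatenated into a single input vector for the classifier.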
Sentiment analysis refers to the identification, extraction and quantification of affective states and subjective information, in order to detect whether the opinion expressed in a document, a sentence or about an entity is positive, negative or neutral. There also exist more sophisticated classifiers that look for different emotional states such as "angry", "sad" or "happy".
This is one of the simplest applications, and it has proven successful in many real-world use cases. The list of feature extraction techniques is endless, but the most basic ones include hashing, one-hot encoding, tf-idf, etc. For instance, feature hashing has performed quite well on short texts, like tweets. It can be used to represent variable-length text documents as equal-length numeric feature vectors. An added benefit of feature hashing is that it reduces the dimensionality of the data and makes the lookup of feature weights faster by replacing string comparison with hash value comparison.
For instance, we can set the number of hashing bits to 17 and the maximum N-gram size to 2. With these settings, the hash table can hold 2^17 or 131,072 entries, in which each hashing feature will represent one or more unigram or bigram features. For many problems, this is plenty, but in some cases more space is needed to avoid collisions. You should experiment with different numbers of bits and evaluate the performance of your machine learning solution.
Moreover, you can use the results of sentiment analysis as features themselves. For instance, Azure Cognitive Services for Language provides sentiment labels (such as "negative", "neutral" and "positive") based on the highest confidence score found by the service at the sentence and document level. This service also returns confidence scores between 0 and 1 for each document and the sentences within it for positive, neutral and negative sentiment. These results may also be used as features for more sophisticated tasks, like Deceptive Opinion Spam detection.
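To make the 17-bit, bigram setting more concrete, here is a minimal sketch of feature hashing in plain Python. The function names are hypothetical, and the use of SHA-256 as the hash function is an assumption made for reproducibility; production libraries usually rely on faster non-cryptographic hashes.

```python
import hashlib

def ngrams(tokens, max_n=2):
    """Yield all n-grams from unigrams up to max_n-grams as space-joined strings."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def hash_features(text, bits=17, max_n=2):
    """Map a text to a sparse {bucket_index: count} vector with 2**bits buckets."""
    size = 1 << bits  # 2**17 == 131072 buckets by default
    vector = {}
    for gram in ngrams(text.lower().split(), max_n):
        digest = hashlib.sha256(gram.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % size
        # Distinct n-grams may collide in one bucket; their counts are then summed.
        vector[index] = vector.get(index, 0) + 1
    return vector

vec = hash_features("good movie really good")
print(len(vec), "non-empty buckets out of", 1 << 17)
```

Whatever the length of the input text, the vector always has the same fixed dimensionality, and lowering `bits` trades memory for a higher collision rate.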
Fake News Detection
As defined by Kai Shu,
Fake news is a news article that is intentionally and verifiably false.
In the same paper, the authors give a mathematical formulation of fake news detection. According to this work, the task of fake news detection is to predict whether a news article is a fake news piece or not, given the social news engagement among users. Social news engagement is a set of tuples representing how the news spreads over time among users, together with their corresponding posts on social media regarding the article.
Fake news detection is the most challenging task for classifiers based on syntactic features. As this kind of text is intentionally written to mislead the reader, the task becomes non-trivial. Firstly, the content is diverse in terms of topics and styles. Secondly, fake news mocks truth by citing factual evidence within a false context to support a subjective claim. Consequently, certain research papers claim that hand-crafted and data-specific textual features are not sufficient for this task. According to their results, additional information must also be taken into consideration, such as a knowledge base and users' social engagements. However, in research conducted at the University of Windsor, the author successfully trained a reliable classifier based solely on textual features. Moreover, research conducted at the Rensselaer Polytechnic Institute illustrated that it is possible to distinguish fake articles from honest ones using their titles: fake titles generally have fewer stop-words and nouns, while having more proper nouns and verb phrases.
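As an illustration of title-based features, the sketch below computes the share of stop-words in a title using plain Python. The stop-word list is a tiny hypothetical sample and the function name is an assumption; counting nouns or verbs would additionally require a POS tagger, which is left out here.

```python
# Tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "is", "are",
              "and", "for", "with", "that"}

def title_features(title):
    """Return the word count and the stop-word ratio of a news title."""
    words = [t.strip(".,!?:;\"'").lower() for t in title.split()]
    words = [w for w in words if w]  # drop tokens that were pure punctuation
    n = len(words)
    if n == 0:
        return {"n_words": 0, "stop_word_ratio": 0.0}
    stop = sum(w in STOP_WORDS for w in words)
    return {"n_words": n, "stop_word_ratio": stop / n}

stats = title_features("The Mayor of the City Is Lying to You!")
print(stats)
```

A classifier would consume such ratios alongside other title statistics rather than rely on any single one of them.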
Hate Speech Recognition
In the research conducted at the University of Porto, the following definition of hate speech is provided:
Hate speech is language that attacks or diminishes, that incites violence or hate against groups, based on specific characteristics such as physical appearance, religion, descent, national or ethnic origin, sexual orientation, gender identity or other, and it can occur with different linguistic styles, even in subtle forms or when humour is used.
We refer to this definition as it uses four different dimensions: “Hate speech has specific targets”, “Hate speech is to incite violence or hate”, “Hate speech is to attack or diminish” and “Humour has a specific status”.
The following are a few scenarios in which a software developer or team would require a hate speech recognition service:
Online marketplaces that moderate product catalogs and other user-generated content.
Gaming companies that moderate user-generated game artifacts and chat rooms.
Social messaging platforms that moderate images, text, and videos added by their users.
Enterprise media companies that implement centralized moderation for their content.
K-12 education solution providers filtering out content that is inappropriate for students and educators.
N.B. Even though feature extraction for this kind of problem remains a subject of research, there already exist services that address certain aspects of it. One example is Azure Content Moderator, which lets you handle content that is potentially offensive, risky, or otherwise undesirable. It includes an AI-powered content moderation service which scans text, images, and videos and applies content flags automatically.
The task of identifying actionable feedback is non-trivial, as it aims at detecting suggested improvements to a reviewed item, which is semantically different from expressing an opinion. There are two distinct forms of indicating a suggestion: either wishing for the presence of a component or regretting the absence of such a component. Although subjectivity or objectivity identification may also be useful for identifying suggestions, it has mostly been explored in the context of component-based opinion-mining systems, which only determine the sentiments or opinions expressed on different aspects of entities.
Medical Record Phenotyping
Electronic medical records are used in a variety of biomedical research areas, such as genetic association studies and studies of comparative effectiveness and risk stratification. One of the main challenges in this kind of application is accurately characterizing phenotypes from electronic medical records, as the existing approaches rely heavily on expert participation. To create such an algorithm, it is necessary to collect informative features that strongly characterize a given phenotype, and to develop a classification algorithm based on these textual features with a gold-standard training set.
Most semantic vector space methods rely on the distance or angle between pairs of word vectors. Two model families for learning word vectors are the most popular: global matrix factorization and local context window methods. An outstanding example of global matrix factorization is Latent Semantic Analysis (LSA), which efficiently leverages statistical information. Local context window methods such as the skip-gram model, on the other hand, do well on analogy tasks, which probe the finer structure of the word vector space by examining not the scalar distance between word vectors but rather their various dimensions of difference. Methods like LSA do relatively poorly on the word analogy task, while methods like skip-gram make poor use of the corpus statistics, since they train on separate local context windows. Lexicon containment features have demonstrated poor performance in comparison with long-distance context features. Lexical features, in turn, are sensitive to the frequency of long and unique words, which causes a certain disproportion.
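The difference between a scalar distance and "dimensions of difference" can be illustrated with a toy example. The three-dimensional vectors below are made up by hand purely for illustration (real models learn hundreds of dimensions from a corpus), and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity: the angle-based comparison of two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hand-made toy vectors (an assumption), not the output of any trained model.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
    "apple": [0.0, 0.1, 0.1],
}

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' using the vector offset b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(cosine(vectors["king"], vectors["queen"]))
print(analogy("man", "woman", "king"))
```

The cosine call uses only a single scalar per pair of words, whereas the analogy query relies on the direction of the offset between vectors, which is the finer structure that analogy tasks probe.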
In this article we've tried to cover the topic of feature extraction techniques for NLP and their variation depending on the application. Hopefully you found it useful.