LDA for suggestion generation: finding redundant terms
One type of anomaly in our case consists of terms that are less coherent with the other words in the same distribution. For instance, in the array [time; minute; second; hour; key] the word key is an evident outlier. However, such terms may carry very rich client information: finding the cause of such a relation may reveal anomalous information that would otherwise require significant manual effort to uncover. While this task is quite obvious to a human reader, it may still present certain difficulties for a computer algorithm, as a machine is not aware of enough facts to draw a reliable conclusion. Thus, using a large knowledge base seems the most reasonable solution. In our work, we have used the Sematch framework (https://github.com/gsi-upm/sematch), which allows evaluating semantic similarity over knowledge graphs. The semantic similarity metrics in this case are based on structural knowledge in a taxonomy and on statistical information content. In other words, the framework uses such knowledge-graph-related metrics as depth, least common subsumer, and graph Information Content (IC).
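As an illustration of such structural metrics, the Wu and Palmer measure (used in the experiment below) scores two concepts by the depth of their least common subsumer relative to their own depths in the taxonomy:

```latex
\mathrm{sim}_{wup}(c_1, c_2) = \frac{2 \cdot depth\big(lcs(c_1, c_2)\big)}{depth(c_1) + depth(c_2)}
```

The score lies in $(0, 1]$ and grows as the two concepts share a deeper common ancestor.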
One important point is that the semantic similarity metrics are provided not only for words but also for entities and concepts, making it possible to compare named entities with words, etc. We compare each term to every other term and compute the average similarity score for each word, as shown in Figure 1. We then compare each score against a manually defined threshold and filter out the redundant terms.
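The filtering step can be sketched as follows. The real implementation relies on Sematch's similarity scores; here `similarity` is a hypothetical stand-in that returns Wu-and-Palmer-like scores in [0, 1] for the example topic above.

```python
def similarity(w1, w2):
    # Hypothetical toy scores standing in for a knowledge-graph metric:
    # the time-related words are mutually similar, "key" is unrelated.
    time_words = {"time", "minute", "second", "hour"}
    if w1 == w2:
        return 1.0
    if w1 in time_words and w2 in time_words:
        return 0.8
    return 0.2

def average_scores(terms):
    # For every term, average its similarity to all other terms.
    return {
        t: sum(similarity(t, o) for o in terms if o != t) / (len(terms) - 1)
        for t in terms
    }

def filter_redundant(terms, threshold=0.6):
    # Keep only terms whose average similarity meets the threshold.
    scores = average_scores(terms)
    return [t for t in terms if scores[t] >= threshold]

print(filter_redundant(["time", "minute", "second", "hour", "key"]))
# With the toy scores above, only "key" falls below 0.6 and is removed.
```

With a genuine taxonomy-based metric the scores differ, but the structure — pairwise comparison, per-term averaging, threshold filtering — is the same.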
As the discussed tasks are rather subjective and no set of relevant labels or textual anomalies exists, it is difficult to apply standard evaluation metrics based on precision and recall. For the anomaly detection task we therefore manually created a dataset containing an evident intruder term, which allows us to evaluate the performance of the proposed solution.
The experiment for redundant term detection was inspired by a Stack Overflow post (https://stackoverflow.com/questions/20678865/algorithm-to-find-odd-word-in-a-list-of-english-words) in which the user asked whether there is an algorithm to find the odd word in a list. As an example he provided banana, apple, orange, tree and wanted the algorithm to find tree. It should be mentioned, however, that the task is not obvious even for a human: one may consider banana the odd term, since a banana is a herb while the others grow on trees, or orange, as it is also a color. We therefore decided to insert another intruder term that would be quite obvious to a human user and replaced tree with plane. We launched the experiment with the following parameters:
language = English
metric = Wu and Palmer
threshold = 0.6
The results may be observed in Table 1. We now apply the same approach to detect tree.
The algorithm did not detect the intruder, but we can see that $tree$ has the smallest similarity score, so simply sorting the list by similarity and taking the smallest element would solve the issue. Nevertheless, in a real-world setting we may find ourselves in a situation where the topic has no intruder at all or, more likely, has several intruders. Taking these factors into account, we can nevertheless claim that the proposed approach demonstrates relevant results, especially in the most evident cases.
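The fallback just described — ranking by average similarity instead of applying a fixed threshold — can be sketched as below. Here `similarity` is again a hypothetical stand-in for the Wu and Palmer scores, tuned to the banana/apple/orange/tree example.

```python
def similarity(w1, w2):
    # Hypothetical toy scores: the fruit terms are mutually similar,
    # while "tree" relates only weakly to each of them.
    fruits = {"banana", "apple", "orange"}
    if w1 == w2:
        return 1.0
    if w1 in fruits and w2 in fruits:
        return 0.9
    return 0.3

def candidate_intruder(terms):
    # Average pairwise similarity per term; the minimum is the best
    # intruder candidate even when no score crosses a fixed threshold.
    avg = {
        t: sum(similarity(t, o) for o in terms if o != t) / (len(terms) - 1)
        for t in terms
    }
    return min(avg, key=avg.get)

print(candidate_intruder(["banana", "apple", "orange", "tree"]))
```

Note that this variant always names exactly one candidate, so it inherits the limitation discussed above: it gives a spurious answer when the topic has no intruder and misses extra ones when there are several.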
D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, 2003.