One kind of anomaly in our case is a term that is less coherent with the other words in the same distribution. For instance, in the array [time; minute; second; hour; key] the word key is an evident outlier. However, such terms may carry very rich client information. Finding the reason why such a term appears among the others may reveal anomalous information that would otherwise require significant manual effort to uncover. The task, although quite obvious to a human reader, can still be difficult for a computer algorithm, since a machine is not aware of all the facts needed to draw a reliable conclusion. Thus, using a large knowledge base seems to be the most reasonable solution. In our work, we have used the Sematch framework https://github.com/gsi-upm/sematch, which makes it possible to evaluate semantic similarity over knowledge graphs. The semantic similarity metrics in this case are based on structural knowledge in the taxonomy and on statistical information content. In other words, the framework uses knowledge-graph metrics such as depth, least common subsumer, and graph Information Content (IC). Importantly, the similarity metrics are provided not only for words but also for entities and concepts, making it possible to compare, for example, named entities with words. We compare each term with every other term and compute the average similarity score for each word, as shown in the following figure. We then compare each score to a manually defined threshold and filter out the redundant terms, i.e. those whose average score falls below it.
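The averaging-and-threshold procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the Sematch implementation: the taxonomy PARENT is a small hand-made, hypothetical tree (not WordNet or any real knowledge graph), the function names are ours, and the Wu and Palmer score is computed directly from its usual definition, 2 * depth(LCS) / (depth(a) + depth(b)).

```python
from itertools import combinations

# Hypothetical toy taxonomy (child -> parent). Illustration only:
# this is NOT data taken from WordNet, DBpedia, or Sematch.
PARENT = {
    "entity": None,
    "abstraction": "entity",
    "measure": "abstraction",
    "time": "measure",
    "time_unit": "time",
    "minute": "time_unit",
    "second": "time_unit",
    "hour": "time_unit",
    "physical_entity": "entity",
    "artifact": "physical_entity",
    "key": "artifact",
}


def path_to_root(node):
    """Return the chain of nodes from `node` up to the root (node first)."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth in the taxonomy, counting the root as depth 1."""
    return len(path_to_root(node))


def wu_palmer(a, b):
    """Wu and Palmer similarity: 2 * depth(LCS) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    # The first ancestor of b shared with a is the least common subsumer.
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def average_similarity(terms, sim):
    """Average similarity of each term to all other terms in the list."""
    totals = {t: 0.0 for t in terms}
    for a, b in combinations(terms, 2):
        score = sim(a, b)
        totals[a] += score
        totals[b] += score
    return {t: totals[t] / (len(terms) - 1) for t in terms}


def find_intruders(terms, sim, threshold):
    """Terms whose average similarity falls below the threshold."""
    scores = average_similarity(terms, sim)
    return [t for t in terms if scores[t] < threshold]


terms = ["time", "minute", "second", "hour", "key"]
intruders = find_intruders(terms, wu_palmer, threshold=0.6)
```

On this toy taxonomy only key falls below the threshold. Passing the similarity function as a parameter means that a real knowledge-graph metric could be substituted for wu_palmer without changing the rest of the sketch.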
The experiment on redundant term detection was inspired by a Stack Overflow post https://stackoverflow.com/questions/20678865/algorithm-to-find-odd-word-in-a-list-of-english-words in which a user asked whether there is an algorithm for finding the odd word in a list. As an example, the user provided [banana,apple,orange,tree] and wanted the algorithm to find tree. It should be mentioned, however, that this task is not obvious even for a human: one may consider banana the odd term, since the banana plant is a herb while the others grow on trees, or orange, since it is also a color. We therefore decided to insert another intruder term that would be quite obvious to a human user and replaced tree with plane. We launched the experiment with the following parameters:
language = English
metric = Wu and Palmer
threshold = 0.6
The result may be observed in the Table.
We now try the same approach to detect tree.
The algorithm has not detected the intruder, but we can see that tree has the smallest similarity score, so simply sorting the list by similarity and taking the term with the smallest score would solve the issue. Nevertheless, in a real-world example we may find ourselves in a situation where the topic has no intruder at all or, more likely, has several intruders. Taking all these factors into account, we can still claim that the proposed approach demonstrates relevant results, especially in the most evident cases.
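The sorting fallback mentioned above can be sketched as follows. The similarity function toy_sim and its values (0.7 and 0.65) are made-up stand-ins chosen only to reproduce the situation described, where every average stays above the 0.6 threshold while one term still ranks lowest; they are not the scores obtained in the experiment.

```python
from itertools import combinations


def rank_by_average_similarity(terms, sim):
    """Return (term, average score) pairs sorted from least to most similar."""
    totals = {t: 0.0 for t in terms}
    for a, b in combinations(terms, 2):
        score = sim(a, b)
        totals[a] += score
        totals[b] += score
    others = len(terms) - 1
    return sorted(((t, totals[t] / others) for t in terms), key=lambda p: p[1])


# Hypothetical grouping and scores, for illustration only.
TOY_GROUP = {"banana": "fruit", "apple": "fruit", "orange": "fruit", "tree": "plant"}


def toy_sim(a, b):
    """Made-up similarity: slightly higher for terms in the same toy group."""
    return 0.7 if TOY_GROUP[a] == TOY_GROUP[b] else 0.65


THRESHOLD = 0.6
ranking = rank_by_average_similarity(["banana", "apple", "orange", "tree"], toy_sim)

# Threshold filter: finds nothing here, since every average exceeds 0.6.
below_threshold = [term for term, score in ranking if score < THRESHOLD]

# Ranking fallback: the first entry is the least coherent term.
weakest_term, weakest_score = ranking[0]
```

Note that the ranking always returns a candidate, so on its own it cannot tell a topic without intruders from one with several; it would need to be combined with a threshold or a similar criterion.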