©2018 by macnabbs.

  • Alibek Jakupov

Semantic Similarity-Based Keyword Extraction

This post is inspired by Burton DeWilde's article, Intro to Automatic Keyphrase Extraction. In it, he gives an excellent explanation of how to implement TextRank from scratch based on word co-occurrence.

However, if we take a look at the TextRank paper, we find the following:

Any relation that can be defined between two lexical units is a potentially useful connection (edge) that can be added between two such vertices. We are using a co-occurrence relation, controlled by the distance between word occurrences: two vertices are connected if their corresponding lexical units co-occur within a window of maximum N words, where N can be set anywhere from 2 to 10 words. Co-occurrence links express relations between syntactic elements, and similar to the semantic links found useful for the task of word sense disambiguation (Mihalcea et al., 2004), they represent cohesion indicators for a given text.

Thus, since any relation between two lexical units can serve as an edge, we can connect words by semantic similarity instead of co-occurrence.

So how do we calculate the semantic similarity between words? Fortunately, there is an excellent Python package called Sematch.

Sematch is an integrated framework for the development, evaluation, and application of semantic similarity for Knowledge Graphs (KGs). It is easy to use Sematch to compute semantic similarity scores of concepts, words and entities. Sematch focuses on specific knowledge-based semantic similarity metrics that rely on structural knowledge in taxonomy (e.g. depth, path length, least common subsumer), and statistical information contents (corpus-IC and graph-IC). Knowledge-based approaches differ from their counterpart corpus-based approaches relying on co-occurrence (e.g. Pointwise Mutual Information) or distributional similarity (Latent Semantic Analysis, Word2Vec, GLOVE and etc). Knowledge-based approaches are usually used for structural KGs, while corpus-based approaches are normally applied in textual corpora.

Here are the steps I followed to implement a semantic similarity-based version of the TextRank algorithm.

1. Extract Candidates:
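A minimal sketch of candidate extraction, using only the standard library. The tiny stopword list, the `min_len` cutoff and the function name `extract_candidates` are assumptions made for illustration; a fuller pipeline would more likely filter by part of speech with a tagger for the target language.

```python
import re

# Tiny illustrative stopword list (an assumption); a real pipeline would use
# a complete list for the target language.
STOPWORDS = {
    "a", "an", "and", "the", "of", "to", "in", "is", "are", "for",
    "on", "with", "that", "this", "it", "as", "by", "be",
}


def extract_candidates(text, stopwords=STOPWORDS, min_len=3):
    """Return unique candidate words in order of first appearance."""
    # Runs of Unicode letters, so accented characters are kept intact.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    seen = set()
    candidates = []
    for tok in tokens:
        # Skip short tokens, stopwords, and words we have already collected.
        if len(tok) < min_len or tok in stopwords or tok in seen:
            continue
        seen.add(tok)
        candidates.append(tok)
    return candidates


candidates = extract_candidates("The cat and the dog are friends of the cat.")
```

Each candidate becomes a vertex in the graph, so keeping this list small and free of duplicates keeps the pairwise similarity step affordable.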

2. Construct edges between semantically related words (here we use French, but as Sematch is multilingual, you can use your native language).
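The sketch below shows one way to build weighted edges from pairwise similarity scores. The similarity function is passed in as a parameter; `toy_similarity` and its hand-written scores are hypothetical stand-ins so the example runs without external packages. With Sematch, the scorer could wrap something like `WordNetSimilarity().monol_word_similarity(w1, w2, 'fra', 'li')`, but that call is recalled from the Sematch README and should be checked against the installed version. The `threshold` value is also an assumption.

```python
from itertools import combinations

# Hypothetical hand-written scores standing in for a real similarity model.
TOY_SCORES = {
    frozenset(("chien", "chat")): 0.8,
    frozenset(("chien", "animal")): 0.7,
    frozenset(("chat", "animal")): 0.7,
    frozenset(("voiture", "animal")): 0.1,
}


def toy_similarity(w1, w2):
    """Look up a symmetric toy score; unknown pairs score 0.0."""
    return TOY_SCORES.get(frozenset((w1, w2)), 0.0)


def build_edges(candidates, similarity, threshold=0.5):
    """Return (word_a, word_b, weight) for each pair scoring at least threshold."""
    edges = []
    for w1, w2 in combinations(candidates, 2):
        score = similarity(w1, w2)
        if score >= threshold:
            edges.append((w1, w2, score))
    return edges


edges = build_edges(["chien", "chat", "animal", "voiture"], toy_similarity)
```

Comparing every pair is quadratic in the number of candidates, which is usually acceptable for a single document but worth keeping in mind for longer texts.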

3. Now construct the graph and implement TextRank:
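As a rough illustration, the ranking step can be written as weighted PageRank by power iteration in plain Python. The function name `textrank` and the defaults for `damping`, `max_iter` and `tol` are assumptions; a graph library such as networkx provides a ready-made `pagerank` that could be used instead.

```python
from collections import defaultdict


def textrank(edges, damping=0.85, max_iter=100, tol=1e-6):
    """Weighted PageRank on an undirected graph given as (u, v, weight) triples."""
    neighbors = defaultdict(dict)
    for u, v, w in edges:
        if w <= 0:
            continue  # ignore non-positive weights to avoid division by zero
        neighbors[u][v] = w
        neighbors[v][u] = w
    nodes = list(neighbors)
    if not nodes:
        return {}
    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}
    # Total edge weight attached to each node, used to normalise its votes.
    strength = {node: sum(neighbors[node].values()) for node in nodes}
    for _ in range(max_iter):
        new_scores = {}
        for node in nodes:
            incoming = sum(
                scores[nb] * w / strength[nb]
                for nb, w in neighbors[node].items()
            )
            new_scores[node] = (1 - damping) / n + damping * incoming
        delta = sum(abs(new_scores[x] - scores[x]) for x in nodes)
        scores = new_scores
        if delta < tol:
            break
    return scores


# A star-shaped toy graph: "hub" is linked to every other word.
scores = textrank([("hub", "a", 1.0), ("hub", "b", 1.0), ("hub", "c", 1.0)])
```

Words that are strongly similar to many other well-connected words accumulate the highest scores, which is the intuition TextRank borrows from PageRank.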

4. Now you can use this TextRank algorithm implementation to extract keyphrases from your text:
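One possible final step, loosely following the idea of merging adjacent top-ranked words into multi-word keyphrases, is sketched below. The function name `extract_keyphrases`, the `top_n` parameter and the choice to score a phrase by the average of its word scores are assumptions, and the token list and scores are made-up values for illustration.

```python
def extract_keyphrases(tokens, scores, top_n=5):
    """Merge runs of adjacent top-ranked words into keyphrases, best first."""
    top_words = set(sorted(scores, key=scores.get, reverse=True)[:top_n])
    phrases = {}
    run = []
    for tok in list(tokens) + [None]:  # the None sentinel flushes the last run
        if tok in top_words:
            run.append(tok)
            continue
        if run:
            phrase = " ".join(run)
            avg = sum(scores[w] for w in run) / len(run)
            # Keep the best score if the same phrase appears more than once.
            phrases[phrase] = max(avg, phrases.get(phrase, 0.0))
            run = []
    return sorted(phrases.items(), key=lambda kv: kv[1], reverse=True)


# Made-up tokens and word scores, only to show the merging behaviour.
tokens = ["semantic", "similarity", "helps", "keyword", "extraction",
          "and", "semantic", "similarity"]
word_scores = {"semantic": 0.4, "similarity": 0.3, "keyword": 0.2, "extraction": 0.1}
keyphrases = extract_keyphrases(tokens, word_scores, top_n=3)
```

Here `tokens` should be the full token sequence of the text in its original order, so that adjacency in the text decides which ranked words are joined.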