• Alibek Jakupov

Rookie's research: how Scinan helps in learning

Recently I started my PhD research in NLP, with a focus on the analysis of deceptive language on the Internet. This usually takes the form of a malicious customer posting fake negative reviews to hurt a business, or a company shill posting fake positive reviews to inflate its image. Humans are ineffective at detecting deceptive text, faring little better than chance. So I became interested in the problem, but I didn't know where to start. As the reader knows, I still consider myself a rookie developer, and I am a complete newbie in the world of academic research, so the most obvious way for me was to start where everyone else starts: Google Scholar.

I do love using Google Scholar, but there are some important challenges you face if you're new to a field. This is not a problem with Google Scholar itself, but with the approach generally recommended to young researchers. So here they are:

Inability to get a visual overview of a new domain

As a first-year PhD student, you need to do long and complicated research into the domain. You need to understand which papers are the most important in a particular field, what the related fields are, and so on. It's simply difficult to keep track of all this data. Nevertheless, it's an absolute necessity to get a visual understanding of the trends and dynamics of the field you are interested in.

Easy to miss an important paper

There's no ready-made summary of all the papers you've read (it's actually your task to make one), and as you read a lot of papers, it's easy to get lost and confused. Moreover, so many papers are published each year, especially in our domain (faithful readers know that here we focus mainly on Machine Learning), that unsurprisingly we sometimes miss important ones. I am not ashamed to tell you that at the very beginning I missed some important papers that impacted the whole domain, and I wouldn't have found them if my professor hadn't shown them to me. It would be quite useful to have a tool that helps us be more attentive, right?

Creating the bibliography

Without a visual tool that represents papers as a graph, it's not a trivial task to find the references you will certainly want in your bibliography.

Ancestor and Derivative works

This one is very important. For instance, Deceptive Opinion Spam is a field involving a lot of domains (NLP, Deep Learning, Graph Theory, Linguistics, and Psychology, to name a few). Thus, it is crucial to identify the important ancestor works in a given field of study, as well as the related fields that may be of particular interest for your research.
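To make the idea of ancestor and derivative works a bit more concrete, here is a minimal sketch of how they could be read off a citation graph. I don't know how Scinan does this internally, so everything here is my own assumption: the toy graph, the placeholder paper identifiers, and the helper names `reachable`, `reverse`, `ancestors` and `derivatives` are all hypothetical.

```python
from collections import deque

# Hypothetical toy citation graph: each paper maps to the papers it cites.
# The identifiers are placeholders, not real publications.
CITES = {
    "paper_d": ["paper_b", "paper_c"],
    "paper_c": ["paper_a"],
    "paper_b": ["paper_a"],
    "paper_a": [],
}


def reachable(graph, start):
    """Return every node reachable from `start` by following edges (BFS)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reverse(graph):
    """Flip every edge, so each paper maps to the papers that cite it."""
    flipped = {node: [] for node in graph}
    for node, refs in graph.items():
        for ref in refs:
            flipped.setdefault(ref, []).append(node)
    return flipped


def ancestors(graph, paper):
    """Works the paper builds on, directly or through other papers."""
    return reachable(graph, paper)


def derivatives(graph, paper):
    """Works that build on the paper, directly or through other papers."""
    return reachable(reverse(graph), paper)


print(sorted(ancestors(CITES, "paper_d")))
print(sorted(derivatives(CITES, "paper_a")))
```

The only design choice worth noting is that derivatives are just ancestors computed on the reversed graph, so one traversal function covers both directions.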

So, a tool called Scinan allows me to cope with all the obstacles listed above.

Here's a screenshot of my working folder before I started using Scinan:

And here's what it looks like now:

Yes, it's still the same, because the goal of Scinan is not to make you read less, but to make your reading more efficient.

Here's the knowledge structure that's been constructed based on the Microsoft Academic Graph:

Using this graph I can see the most impactful papers and the related and derivative domains, and filter the results by keywords. There's also a tool for creating more sophisticated filters using logical operators.

Simple and elegant, right?
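For readers who wonder what a keyword filter with logical operators boils down to, here is a small sketch of the idea. This is not Scinan's filter syntax, which I am not reproducing here; the helpers `has`, `all_of`, `any_of` and `none_of` are names I made up, and the titles are invented placeholders rather than real papers.

```python
def has(keyword):
    """Predicate: True when the keyword occurs in the title (case-insensitive)."""
    return lambda title: keyword.lower() in title.lower()


def all_of(*preds):
    """Logical AND over predicates."""
    return lambda title: all(p(title) for p in preds)


def any_of(*preds):
    """Logical OR over predicates."""
    return lambda title: any(p(title) for p in preds)


def none_of(*preds):
    """Logical NOT: True when no predicate matches."""
    return lambda title: not any(p(title) for p in preds)


# Invented placeholder titles, used only to exercise the filter.
TITLES = [
    "Placeholder: opinion spam detection with LSTM",
    "Placeholder: a survey of opinion spam",
    "Placeholder: LSTM for speech recognition",
]

# "opinion spam" AND ("lstm" OR "survey") AND NOT "speech"
query = all_of(
    has("opinion spam"),
    any_of(has("lstm"), has("survey")),
    none_of(has("speech")),
)

matches = [title for title in TITLES if query(title)]
print(matches)
```

Because every operator returns a plain function, filters can be nested as deeply as needed without any parsing.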

Wait, but what does LSTM have to do with Deceptive Opinion Spam? If we go further and start reading the related articles, we can get some surprising insights. We can find out that there is a gold-standard dataset, the Ott Deceptive Opinion Spam corpus, that is actively used by researchers.

Here's a short summary of different studies performed on this dataset:

And if you go even further, you'll see that there was an interesting approach by Hu, who used a variety of models to identify concealed information in text and verbal speech. The best among them was a deep learning model based on bidirectional LSTMs. He also created a corpus of wine tasters evaluating wines and encoded it in various ways, such as n-grams, LIWC, and GloVe embeddings. The LSTM model using these features achieved an F-score of 71.51 in identifying the presence of concealed information, beating the human performance of 56.28.
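If the F-score is new to you, it combines precision and recall into a single number. I can't say which exact variant was reported in that work, so the sketch below just uses the standard F-beta definition (F1 when `beta` is 1). The function name `f_score` and the counts in the example are mine and purely illustrative; they have nothing to do with the results quoted above.

```python
def f_score(tp, fp, fn, beta=1.0):
    """F-beta score from true positives, false positives and false negatives.

    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    F_beta    = (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
    """
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative counts only: 8 correct detections, 2 false alarms, 2 misses.
score = f_score(tp=8, fp=2, fn=2)
print(round(score, 4))
```

Scores like the ones quoted in the text are presumably the same quantity expressed on a 0 to 100 scale.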

Woah, that's the kind of insight I want to gain when I start my research.

To sum up, Scinan is not a miraculous tool that will make you learn with no effort. But it is a brand new approach to studying that allows you to keep learning the same way you did, without losing sight of the global trends and dynamics of the domain, which is very tricky when you spend a lot of time doing your research.

Hope this was useful, dear reader! Keep learning every day, and may the force be with you.


Since 2018 by ©alirookie