Context
Opinions such as online reviews are a major source of information that helps e-commerce customers gain insight into the products they plan to buy.
Fake news and misleading articles are another form of opinion spam, one that has gained traction. Some of the biggest channels for spreading fake news and rumors are social media websites such as Google Plus, Facebook, Twitter, and other social media outlets.
For this experiment, I tried building a Fake News Detector based on the work cited below:
Training Set
Ahmed H, Traore I, Saad S. “Detecting opinion spams and fake news using text classification”
Ahmed H, Traore I, Saad S. “Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques”
The dataset contains articles labeled as "fake" or "true" news.
Columns
Title: The title of the article
Text: The text of the article
Subject: The subject of the article
Date: The date on which the article was posted
Fake: 17903 unique values
True: 21192 unique values
Validation Set
Kai Shu, Arizona State University
Fake news detection shared task for the Second International TrueFact Workshop: Making a Credible Web for Tomorrow in conjunction with SIGKDD 2020.
Columns:
text: text of the article
label: a label that marks the article as potentially unreliable
Dataset: 4084 unique values
True: 2972
Fake: 2014
Data Preparation
This step is quite straightforward: we strip HTML tags, remove special characters and punctuation, lowercase the text, and normalize whitespace.
Here's a sample you can reproduce in your own experiment:
import string
import re
import os
import numpy as np
import pandas as pd
from io import StringIO
from html.parser import HTMLParser
# Read input data
INPUT_FOLDER = "raw_datasets"
OUTPUT_FOLDER = "clean_datasets"
raw_fake = pd.read_csv(os.path.join(INPUT_FOLDER,"Fake.csv"))
raw_true = pd.read_csv(os.path.join(INPUT_FOLDER,"True.csv"))
"""
Rework input
leave only text field and create a label column as input is composed of 2 datasets
1 = Fake, 0 = True
"""
train_fake = pd.DataFrame()
train_true = pd.DataFrame()
train_fake["text"] = raw_fake["text"]
train_fake["label"] = 1
train_true["text"] = raw_true["text"]
train_true["label"] = 0
train_clean = pd.concat([train_true, train_fake])
# Text preprocessing
class MLStripper(HTMLParser):
    """Collects only the text content of an HTML document."""
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.text = StringIO()

    def handle_data(self, d):
        self.text.write(d)

    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(str(html))
    return s.get_data()

def remove_tabulations(text):
    text = str(text)
    return text.replace("\r", ' ').replace("\t", ' ').replace("\n", ' ')

def clean_text(text):
    # Remove HTML tags
    text = strip_tags(text)
    # Remove tabulations and line breaks
    text = remove_tabulations(text)
    # Convert to lower case
    text = text.lower()
    # Remove bracketed fragments such as "[...]"
    text = re.sub(r'\[.*?\]', ' ', text)
    # Remove punctuation
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)
    # Normalize whitespace
    text = ' '.join(text.split())
    return text

def clean_text_basic(text):
    # Remove leading/trailing whitespace
    text = str(text).strip()
    # Normalize whitespace
    text = ' '.join(text.split())
    return text

train_clean["text"] = train_clean["text"].apply(clean_text)
output = train_clean[["text", "label"]]
os.makedirs(OUTPUT_FOLDER, exist_ok=True)  # ensure the output folder exists
output.to_csv(os.path.join(OUTPUT_FOLDER, "train_clean.csv"), index=False)
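As a quick sanity check, here's what clean_text produces on a small snippet (the sample string is made up purely for illustration):

# Quick illustration of clean_text on a made-up snippet
sample = "<p>Breaking News:\tAliens [allegedly] landed!</p>"
print(clean_text(sample))
# prints: breaking news aliens landed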
1 Train your model
Upload the newly generated train_clean.csv and follow the steps listed in the official documentation. In short, here are the steps to reproduce (a scripted sketch follows the list):
Create a workspace
Get started in Azure Machine Learning studio
Create and load dataset
Configure experiment run
Explore models
Deploy the best model
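If you'd rather script these steps than click through the studio, a minimal sketch with the v1 azureml-sdk might look like the following; the compute name, experiment name, and datastore path are my assumptions, not values from my run:

# Minimal sketch, assuming azureml-sdk (v1) and an existing workspace + compute.
# "cpu-cluster", "fake-news-detector" and the "fakenews" path are hypothetical.
from azureml.core import Workspace, Dataset, Experiment
from azureml.train.automl import AutoMLConfig

ws = Workspace.from_config()  # reads the config.json downloaded from the portal
datastore = ws.get_default_datastore()
datastore.upload_files(["clean_datasets/train_clean.csv"], target_path="fakenews")
train_ds = Dataset.Tabular.from_delimited_files(datastore.path("fakenews/train_clean.csv"))

automl_config = AutoMLConfig(
    task="classification",
    primary_metric="accuracy",
    training_data=train_ds,
    label_column_name="label",
    n_cross_validations=5,         # matches the 5-fold CV mentioned below
    compute_target="cpu-cluster",  # hypothetical compute target
    featurization="auto",          # automatic text featurization
)

run = Experiment(ws, "fake-news-detector").submit(automl_config, show_output=True)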
Important: since we are using text classification, featurization is done automatically. There are many techniques available, such as TF-IDF, bag-of-words, n-grams, MurmurHash, etc. There are also more sophisticated ones, such as word2vec, GloVe, or BERT; to be able to use those, you need to create a compute instance that supports GPU.
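To build intuition for what such featurization does, here's a small local baseline, a sketch using scikit-learn's TF-IDF with word n-grams; this is not what Azure ML runs internally, just an illustration:

# Local TF-IDF + logistic regression baseline; an illustrative sketch,
# not a reproduction of Azure ML's internal featurization.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

df = pd.read_csv("clean_datasets/train_clean.csv").dropna(subset=["text"])
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=50000),  # unigrams + bigrams
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(pipeline, df["text"], df["label"], cv=5)
print("5-fold CV accuracy:", scores.mean())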
2 Deploy the model as a web service
After successfully training the model (in my case it took 24 hours or less), we simply choose the best model and deploy it using ACI (Azure Container Instances).
We deploy this model, but be advised: deployment takes about 20 minutes to complete. The deployment process involves several steps, including registering the model, generating resources, and configuring them for the web service.
Select your best model and open the model-specific page.
Select the Deploy button in the top-left.
Populate the Deploy a model pane.
Select Deploy. A green success message appears at the top of the Run screen, and in the Model summary pane, a status message appears under Deploy status. Select Refresh periodically to check the deployment status.
Now you have an operational web service to generate predictions.
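If you prefer scripting the deployment too, here's a sketch with the v1 SDK; the model name and the scoring/environment file names are assumptions based on what an AutoML run typically generates:

# Sketch of scripting the ACI deployment. "fake-news-model" is a hypothetical
# registered model name; the scoring and conda files are downloaded from the
# best run's outputs (names may differ per run).
from azureml.core import Workspace, Environment
from azureml.core.model import Model, InferenceConfig
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()
env = Environment.from_conda_specification("automl-env", "conda_env_v_1_0_0.yml")
inference_config = InferenceConfig(entry_script="scoring_file_v_1_0_0.py", environment=env)
aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=2)

service = Model.deploy(ws, "fake-news-service",
                       models=[Model(ws, name="fake-news-model")],
                       inference_config=inference_config,
                       deployment_config=aci_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)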
3 Consume
After the model is published, we can apply it to our validation set and see how it performs. A sketch of calling the endpoint is shown below, followed by what we got:
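# Sketch of calling the deployed endpoint; the URI and file name are hypothetical,
# and the request/response schema depends on the generated scoring script.
import json
import requests
import pandas as pd

scoring_uri = "http://<your-aci-endpoint>.azurecontainer.io/score"  # from service.scoring_uri
validation = pd.read_csv("clean_datasets/validation_clean.csv")     # hypothetical cleaned file

payload = json.dumps({"data": validation[["text"]].to_dict(orient="records")})
response = requests.post(scoring_uri, data=payload,
                         headers={"Content-Type": "application/json"})
print(response.json())  # predicted labels; exact schema may vary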
On the left you see the estimated result provided by Azure ML (the result of cross-validation with 5 folds), and on the right, the result of applying the trained model to another dataset.
Even if the result may seem relatively poor, it's very interesting, as it shows how important it is to have a separate dataset for validating a fake news classifier. My guess is that our training set contained some metadata that confused the model, for instance a newspaper name that appeared in all the true news, which may have caused a spurious association. However, Azure ML is an excellent tool for analyzing this kind of data and discovering insights, so I will keep investigating and will keep you updated, dear reader.
Hope it was interesting!