• Alibek Jakupov

Azure Machine Learning: Fake News detection

Updated: Feb 6



Context


Opinions such as online reviews are the main source of information for e-commerce customers, helping them gain insight into the products they are planning to buy.


Fake news and misleading articles are another form of opinion spam that has gained traction. Some of the biggest sources of fake news and rumors are social media websites such as Google Plus, Facebook, Twitter, and other social media outlets.


For this experiment, I've tried creating a Fake News Detector, based on the work of these guys:


Training Set

  • Ahmed H, Traore I, Saad S. “Detecting opinion spams and fake news using text classification”

  • Ahmed H, Traore I, Saad S. “Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques”

The dataset contains a list of articles labeled as "fake" or "true" news.


Columns

  • Title : The title of the article

  • Text : The text of the article

  • Subject : The subject of the article

  • Date : The date at which the article was posted

Fake

  • 17903 unique values

True

  • 21192 unique values


Validation set

  • Kai Shu, Arizona State University

  • Fake news detection shared task for the Second International TrueFact Workshop: Making a Credible Web for Tomorrow in conjunction with SIGKDD 2020.

Columns:

  • text: text of the article

  • label: a label that marks the article as potentially unreliable

Dataset

  • 4084 unique values

True

  • 2972

Fake

  • 2014



Data Preparation


This step is quite straightforward: we simply remove special characters, stop words, punctuation, HTML tags, etc.


Here's the sample you may reproduce in your experiment:


import string
import re
import os
import numpy as np
import pandas as pd

from io import StringIO
from html.parser import HTMLParser

# Read input data
INPUT_FOLDER = "raw_datasets"
OUTPUT_FOLDER = "clean_datasets"

raw_fake = pd.read_csv(os.path.join(INPUT_FOLDER,"Fake.csv")) 
raw_true = pd.read_csv(os.path.join(INPUT_FOLDER,"True.csv"))
"""
Rework input
leave only text field and create a label column as input is composed of 2 datasets
1 = Fake, 0 = True
"""
train_fake = pd.DataFrame()
train_true = pd.DataFrame()

train_fake["text"] = raw_fake["text"]
train_fake["label"] = 1

train_true["text"] = raw_true["text"]
train_true["label"] = 0

train_clean = pd.concat([train_true, train_fake])

# Text preprocessing
class MLStripper(HTMLParser):
    """HTML parser that keeps only the text content and drops the tags."""

    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.text = StringIO()

    def handle_data(self, d):
        # Called for every chunk of text found between tags
        self.text.write(d)

    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(str(html))
    return s.get_data()

def remove_tabulations(text):
    text = str(text)
    return text.replace("\r", ' ').replace("\t", ' ').replace("\n", ' ')

def clean_text(text):
    # Remove HTML tags
    text = strip_tags(text)
    # Remove tabulation
    text = remove_tabulations(text)
    # Convert to lower case
    text = text.lower()
    # Remove text enclosed in square brackets (raw string avoids invalid escape warnings)
    text = re.sub(r'\[.*?\]', ' ', text)
    # Remove punctuation
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)
    # Normalize whitespace
    text = ' '.join(text.split())
    return text

def clean_text_basic(text):
    # Remove whitespace before and after
    text = text.strip()
    # Normalize whitespace
    text = ' '.join(text.split())
    return text

train_clean["text"] = train_clean["text"].apply(clean_text)


output = train_clean[["text", "label"]]

# Make sure the output folder exists before writing
os.makedirs(OUTPUT_FOLDER, exist_ok=True)
output.to_csv(os.path.join(OUTPUT_FOLDER, "train_clean.csv"), index=False)
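
The script above handles HTML tags, bracketed text and punctuation, but not the stop words mentioned earlier. Here is a minimal sketch of that extra step, assuming a tiny hand-picked stop-word list. The `STOP_WORDS` set and the `remove_stop_words` helper are illustrative names, not part of the original experiment; a real run would more likely rely on a fuller list.

```python
# Illustrative stop-word list (an assumption); use a fuller list in practice.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "are", "was", "it"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop stop words from an already lower-cased, punctuation-free text."""
    tokens = text.split()
    return " ".join(token for token in tokens if token not in stop_words)

sample = "the senate is expected to vote on the bill"
print(remove_stop_words(sample))
```

It could be applied to the cleaned column in the same way as `clean_text`, right before the CSV is written.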



1 Train your model


Upload the newly generated train_clean.csv and follow the steps listed in the official documentation. In short, here are the steps to reproduce:

  1. Create a workspace

  2. Get started in Azure Machine Learning studio

  3. Create and load dataset

  4. Configure experiment run

  5. Explore models

  6. Deploy the best model



Important: as we are using text classification, the featurization is done automatically. There are a lot of techniques, such as Tf-Idf, Bag of Words, N-grams, MurmurHash, etc. There are also more sophisticated ones, such as word2vec, GloVe or BERT; to be able to use them, you need to create a compute instance that supports GPU.
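
To give an idea of what the simpler featurizers do, here is a minimal pure-Python sketch of Tf-Idf over word n-grams. It is a simplified illustration with an unsmoothed IDF, not the implementation Azure ML uses internally; the function names and the toy documents are made up for the example.

```python
import math
from collections import Counter

def word_ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max as space-joined strings."""
    tokens = text.split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def tfidf(documents, n_max=2):
    """Compute one sparse Tf-Idf vector (a dict) per document."""
    doc_grams = [word_ngrams(doc, n_max) for doc in documents]
    # Document frequency: number of documents containing each n-gram
    df = Counter()
    for grams in doc_grams:
        df.update(set(grams))
    n_docs = len(documents)
    vectors = []
    for grams in doc_grams:
        counts = Counter(grams)
        total = len(grams) or 1
        vectors.append({
            gram: (count / total) * math.log(n_docs / df[gram])
            for gram, count in counts.items()
        })
    return vectors

docs = ["breaking news about the senate", "breaking news about aliens"]
vectors = tfidf(docs)
# N-grams shared by every document get a weight of zero; distinctive ones do not.
print(sorted(vectors[0].items(), key=lambda kv: -kv[1])[:3])
```

N-grams that appear in every document carry no weight, while those specific to one article stand out, which is what makes this representation useful for a classifier.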



2 Deploy the model as a web service


After successfully training the model (in my case it took 24 hours or less), we simply choose the best model and deploy it using Azure Container Instances (ACI).


Be advised that deployment takes about 20 minutes to complete. The process entails several steps, including registering the model, generating resources, and configuring them for the web service.

  1. Select your best model and open the model-specific page.

  2. Select the Deploy button in the top-left.

  3. Populate the Deploy a model pane.

  4. Select Deploy. A green success message appears at the top of the Run screen, and in the Model summary pane, a status message appears under Deploy status. Select Refresh periodically to check the deployment status.

Now you have an operational web service to generate predictions.
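
Calling such a web service can be sketched with the standard library alone. Everything specific here is an assumption: the scoring URI is a placeholder, the `{"data": [{"text": ...}]}` payload shape depends on the scoring script generated for your model, and the bearer-key header only applies if key authentication is enabled. Check the endpoint's consumption details in the studio before relying on any of it.

```python
import json
import urllib.request

def build_scoring_request(scoring_uri, texts, api_key=None):
    """Build a POST request carrying the articles to classify as JSON."""
    # Assumed input schema: one record per article with a single "text" field
    payload = {"data": [{"text": text} for text in texts]}
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only needed when key-based authentication is enabled on the service
        headers["Authorization"] = "Bearer " + api_key
    return urllib.request.Request(scoring_uri, data=body, headers=headers, method="POST")

def score(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

# Placeholder URI (hypothetical); replace it with the one shown for your endpoint.
request = build_scoring_request("http://example.invalid/score", ["some article text"])
# predictions = score(request)  # uncomment once a real endpoint is available
```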


3 Consume


After our model is published, we can apply it to our validation set and see what it does. Here's what we've got:


On the left you see the estimated result provided by Azure ML (5-fold cross-validation), and on the right, the result of applying the trained model to another dataset.
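
Producing the right-hand figures amounts to recomputing the usual classification metrics on the second dataset. Here is a minimal sketch, assuming the predictions have already been collected into a list and that both datasets follow the training convention (1 = Fake, 0 = True); the validation labels may need remapping to match. The helper name and the toy labels are illustrative only.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels, not taken from the experiment
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```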


Even if the result may seem relatively poor, it's very interesting, as it shows how important it is to have a different dataset for validating the Fake News classifier. My guess is that our training set contained some metadata that confused the model, for instance a newspaper name that appeared in all the true news, which may have caused the wrong association. However, Azure ML is an excellent tool for analyzing this kind of data and discovering insights from it, so I will keep investigating and will keep you updated, dear reader.
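
One way to test this guess is to look for tokens that occur in almost every article of one class and almost never in the other. Here is a minimal sketch, assuming cleaned, whitespace-separated texts and the 1 = Fake, 0 = True labels used for training. The function name, the thresholds and the toy articles (including the made-up agency name) are illustrative, not taken from the datasets above.

```python
from collections import Counter

def class_specific_tokens(texts, labels, min_share=0.9, max_other_share=0.05):
    """Find tokens present in most documents of one class and nearly absent from the other."""
    doc_freq = {0: Counter(), 1: Counter()}
    n_docs = Counter(labels)
    for text, label in zip(texts, labels):
        # Count each token once per document
        doc_freq[label].update(set(text.split()))
    suspicious = []
    for label in (0, 1):
        other = 1 - label
        for token, count in doc_freq[label].items():
            share = count / n_docs[label]
            other_share = doc_freq[other][token] / n_docs[other] if n_docs[other] else 0.0
            if share >= min_share and other_share <= max_other_share:
                suspicious.append((token, label, share))
    return sorted(suspicious)

texts = [
    "washington examplewire senate passes budget bill",
    "london examplewire markets rally after vote",
    "you will not believe what the senate did",
    "shocking video shows markets in chaos",
]
labels = [0, 0, 1, 1]
leaks = class_specific_tokens(texts, labels)
print(leaks)
```

Tokens flagged this way are candidates for removal during data preparation, so that the model has to learn from the content rather than from the source marker.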


Hope it was interesting!
