Context
Opinions such as online reviews are a major source of information that helps e-commerce customers gain insight into the products they plan to buy.
Fake news and misleading articles are another form of opinion spam, one that has gained traction. Some of the biggest channels for spreading fake news and rumors are social media websites such as Google Plus, Facebook, Twitter, and other social media outlets.
For this experiment, I tried building a Fake News Detector based on the work cited below:
Training Set
Ahmed H, Traore I, Saad S. “Detecting opinion spams and fake news using text classification”
Ahmed H, Traore I, Saad S. “Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques”
The dataset contains articles labeled as "fake" or "true" news.
Columns
Title: The title of the article
Text: The text of the article
Subject: The subject of the article
Date: The date on which the article was posted
Fake: 17903 unique values
True: 21192 unique values
Validation Set
Kai Shu, Arizona State University
Fake news detection shared task for the Second International TrueFact Workshop: Making a Credible Web for Tomorrow in conjunction with SIGKDD 2020.
Columns:
text: text of the article
label: a label that marks the article as potentially unreliable
Dataset: 4084 unique values
True: 2972
Fake: 2014
Data Preparation
This step is quite straightforward: we strip HTML tags, remove special characters and punctuation, lowercase the text, and normalize whitespace.
Here's a sample you can reproduce in your own experiment:
import string
import re
import os
import numpy as np
import pandas as pd
from io import StringIO
from html.parser import HTMLParser
# Read input data
INPUT_FOLDER = "raw_datasets"
OUTPUT_FOLDER = "clean_datasets"
raw_fake = pd.read_csv(os.path.join(INPUT_FOLDER,"Fake.csv"))
raw_true = pd.read_csv(os.path.join(INPUT_FOLDER,"True.csv"))
"""
Rework input
leave only text field and create a label column as input is composed of 2 datasets
1 = Fake, 0 = True
"""
train_fake = pd.DataFrame()
train_true = pd.DataFrame()
train_fake["text"] = raw_fake["text"]
train_fake["label"] = 1
train_true["text"] = raw_true["text"]
train_true["label"] = 0
train_clean = pd.concat([train_true, train_fake])
# Text preprocessing
class MLStripper(HTMLParser):
    """Collects only the text content of an HTML document."""
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.text = StringIO()

    def handle_data(self, d):
        self.text.write(d)

    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(str(html))
    return s.get_data()

def remove_tabulations(text):
    text = str(text)
    return text.replace("\r", ' ').replace("\t", ' ').replace("\n", ' ')

def clean_text(text):
    # Remove HTML tags
    text = strip_tags(text)
    # Remove tabulations and line breaks
    text = remove_tabulations(text)
    # Convert to lower case
    text = text.lower()
    # Remove bracketed fragments such as "[...]"
    text = re.sub(r'\[.*?\]', ' ', text)
    # Remove punctuation
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)
    # Normalize whitespace
    text = ' '.join(text.split())
    return text

def clean_text_basic(text):
    # Remove leading/trailing whitespace
    text = str(text).strip()
    # Normalize whitespace
    text = ' '.join(text.split())
    return text

train_clean["text"] = train_clean["text"].apply(clean_text)
output = train_clean[["text", "label"]]
os.makedirs(OUTPUT_FOLDER, exist_ok=True)  # ensure the output folder exists
output.to_csv(os.path.join(OUTPUT_FOLDER, "train_clean.csv"), index=False)
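As a quick sanity check, here's what clean_text produces on a small snippet (the sample string is made up purely for illustration):

# Quick illustration of clean_text on a made-up snippet
sample = "<p>Breaking News:\tAliens [allegedly] landed!</p>"
print(clean_text(sample))
# prints: breaking news aliens landed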
1 Train your model
Upload the newly generated train_clean.csv and follow the steps listed in the official documentation. In short, here are the steps to reproduce (a scripted sketch follows the list):
Create a workspace
Get started in Azure Machine Learning studio
Create and load dataset
Configure experiment run
Explore models
Deploy the best model
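If you'd rather script these steps than click through the studio, a minimal sketch with the v1 azureml-sdk might look like the following; the compute name, experiment name, and datastore path are my assumptions, not values from my run:

# Minimal sketch, assuming azureml-sdk (v1) and an existing workspace + compute.
# "cpu-cluster", "fake-news-detector" and the "fakenews" path are hypothetical.
from azureml.core import Workspace, Dataset, Experiment
from azureml.train.automl import AutoMLConfig

ws = Workspace.from_config()  # reads the config.json downloaded from the portal
datastore = ws.get_default_datastore()
datastore.upload_files(["clean_datasets/train_clean.csv"], target_path="fakenews")
train_ds = Dataset.Tabular.from_delimited_files(datastore.path("fakenews/train_clean.csv"))

automl_config = AutoMLConfig(
    task="classification",
    primary_metric="accuracy",
    training_data=train_ds,
    label_column_name="label",
    n_cross_validations=5,         # matches the 5-fold CV mentioned below
    compute_target="cpu-cluster",  # hypothetical compute target
    featurization="auto",          # automatic text featurization
)

run = Experiment(ws, "fake-news-detector").submit(automl_config, show_output=True)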
Important: since we are using text classification, featurization is done automatically. There are many techniques available, such as TF-IDF, bag-of-words, n-grams, MurmurHash, etc. There are also more sophisticated ones, such as word2vec, GloVe, or BERT; to be able to use those, you need to create a compute instance that supports GPU.
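To build intuition for what such featurization does, here's a small local baseline, a sketch using scikit-learn's TF-IDF with word n-grams; this is not what Azure ML runs internally, just an illustration:

# Local TF-IDF + logistic regression baseline; an illustrative sketch,
# not a reproduction of Azure ML's internal featurization.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

df = pd.read_csv("clean_datasets/train_clean.csv").dropna(subset=["text"])
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=50000),  # unigrams + bigrams
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(pipeline, df["text"], df["label"], cv=5)
print("5-fold CV accuracy:", scores.mean())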
2 Deploy the model as a web service
After successfully training the model (in my case it took 24 hours or less), we simply choose the best model and deploy it using ACI (Azure Container Instances).
We deploy this model, but be advised: deployment takes about 20 minutes to complete. The deployment process involves several steps, including registering the model, generating resources, and configuring them for the web service.
Select your best model and open the model-specific page.
Select the Deploy button in the top-left.
Populate the Deploy a model pane.
Select Deploy. A green success message appears at the top of the Run screen, and in the Model summary pane, a status message appears under Deploy status. Select Refresh periodically to check the deployment status.
Now you have an operational web service to generate predictions.
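If you prefer scripting the deployment too, here's a sketch with the v1 SDK; the model name and the scoring/environment file names are assumptions based on what an AutoML run typically generates:

# Sketch of scripting the ACI deployment. "fake-news-model" is a hypothetical
# registered model name; the scoring and conda files are downloaded from the
# best run's outputs (names may differ per run).
from azureml.core import Workspace, Environment
from azureml.core.model import Model, InferenceConfig
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()
env = Environment.from_conda_specification("automl-env", "conda_env_v_1_0_0.yml")
inference_config = InferenceConfig(entry_script="scoring_file_v_1_0_0.py", environment=env)
aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=2)

service = Model.deploy(ws, "fake-news-service",
                       models=[Model(ws, name="fake-news-model")],
                       inference_config=inference_config,
                       deployment_config=aci_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)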
3 Consume
After the model is published, we can apply it to our validation set and see how it performs. A sketch of calling the endpoint is shown below, followed by what we got:
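# Sketch of calling the deployed endpoint; the URI and file name are hypothetical,
# and the request/response schema depends on the generated scoring script.
import json
import requests
import pandas as pd

scoring_uri = "http://<your-aci-endpoint>.azurecontainer.io/score"  # from service.scoring_uri
validation = pd.read_csv("clean_datasets/validation_clean.csv")     # hypothetical cleaned file

payload = json.dumps({"data": validation[["text"]].to_dict(orient="records")})
response = requests.post(scoring_uri, data=payload,
                         headers={"Content-Type": "application/json"})
print(response.json())  # predicted labels; exact schema may vary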
On the left you see the estimated result provided by Azure ML (the result of cross-validation with 5 folds), and on the right, the result of applying the trained model to another dataset.
Even if the result may seem relatively poor, it's very interesting, as it shows how important it is to have a separate dataset for validating a fake news classifier. My guess is that our training set contained some metadata that confused the model, for instance a newspaper name that appeared in all the true news, which may have caused a spurious association. However, Azure ML is an excellent tool for analyzing this kind of data and discovering insights, so I will keep investigating and will keep you updated, dear reader.
Hope it was interesting!