STEP BY STEP GUIDE FOR FAKE NEWS DETECTION USING DEEP LEARNING LSTM
This post follows on from my previous post, Fake news detection using Machine Learning and NLP. Here, I will discuss the application of a deep learning technique, LSTM, to detecting fake news from news headline text. At the end, I will compare the results of the machine learning techniques discussed in the previous post with LSTM and discuss which approach is better for this problem statement.
About Problem Statement
In recent years, due to the booming development of online social networks, fake news has been spreading at an alarming rate for various commercial and political purposes. This is a matter of concern because it has numerous psychological effects on offline society. According to Gartner Research:
By 2022, most people in mature economies will consume more false information than true information.
For enterprises, the accelerating spread of fake news on social media makes it necessary to monitor closely not only what is being said about their brands but also in what context, because fake news directly affects their brand value. So, in this post, we will discuss how we can detect fake news more accurately from headlines alone.
The data used in this case study is the ISOT Fake News Dataset. The dataset contains two types of articles: fake news and real news. It was collected from real-world sources; the truthful articles were obtained by crawling articles from Reuters.com (a news website). The fake news articles were collected from unreliable websites that were flagged by Politifact (a fact-checking organization in the USA) and Wikipedia. The dataset contains different types of articles on different topics; however, the majority of articles focus on political and world news topics.
The dataset consists of two CSV files. The first file, True.csv, contains more than 12,600 articles from Reuters.com. The second file, Fake.csv, contains more than 12,600 articles from different fake news outlets. Each article contains the following information:
- article title (News Headline),
- type (REAL or FAKE)
In total, the dataset consists of 44,898 records: 21,417 true news items and 23,481 fake news items.
So, next, we will discuss the data pre-processing and data preparation steps required for building a Long Short-Term Memory (LSTM) model using the Keras library in Python. The detailed architecture and mathematics behind LSTM can be found here. For building the LSTM model I have used a Kaggle notebook.
First, we read the dataset using the pandas library in Python and shuffle it. The main reason for shuffling is that our data is sorted by class/target, because we combined fake news and true news from two different sources. So, we have to shuffle the dataset to make sure that our training/test/validation sets are representative of the overall distribution of the data. The code is shared below.
For shuffling the dataset we have used random permutation as shown in the code below.
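A minimal sketch of the shuffling step, assuming the two files have already been read and concatenated into one DataFrame. The tiny in-memory frame, the column names `title` and `label`, and the fixed seed are my assumptions for illustration; the real notebook would start from the CSV files instead.

```python
import numpy as np
import pandas as pd

# Stand-in for the combined data, sorted by class the way a frame built
# from True.csv followed by Fake.csv would be (column names are assumed).
df = pd.DataFrame({
    "title": ["real headline 1", "real headline 2", "real headline 3",
              "fake headline 1", "fake headline 2", "fake headline 3"],
    "label": ["REAL", "REAL", "REAL", "FAKE", "FAKE", "FAKE"],
})

# Draw a random permutation of the row positions and reorder the rows.
rng = np.random.default_rng(seed=42)  # seed only makes the demo repeatable
shuffled_positions = rng.permutation(len(df))
df_shuffled = df.iloc[shuffled_positions].reset_index(drop=True)

print(df_shuffled)
```

Resetting the index afterwards keeps the row labels contiguous, which avoids surprises when the frame is sliced later.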
The rows in the dataset are now shuffled, as shown in the figure above. Next, we will remove one extra column, Unnamed:0, and convert the target variable label into a binary variable: 0 for true news and 1 for fake news.
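This step could look roughly like the sketch below. I am assuming the extra column is spelled `Unnamed: 0` (the name pandas usually gives to a saved index column) and that the label values are the strings REAL and FAKE; adjust both to match the actual file.

```python
import pandas as pd

# Toy frame standing in for the shuffled dataset (values are made up).
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "title": ["headline a", "headline b", "headline c"],
    "label": ["REAL", "FAKE", "FAKE"],
})

# Drop the leftover index column; errors="ignore" keeps this safe if it is absent.
df = df.drop(columns=["Unnamed: 0"], errors="ignore")

# Encode the target: 0 for true news, 1 for fake news.
df["label"] = df["label"].map({"REAL": 0, "FAKE": 1})

print(df)
```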
In the next step, we will remove the stopwords and special characters from the dataset. The code for this task is shared below:
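Here is a small sketch of this kind of cleaning using only the standard library. The stopword list is a short hand-written stand-in (an assumption on my part); in practice a fuller list, such as the English stopwords shipped with NLTK, would normally be used.

```python
import re

# Short illustrative stopword list (an assumption, not the full list).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "on", "for", "and"}

def clean_headline(text):
    """Replace special characters with spaces, then drop stopwords."""
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    words = [w for w in text.split() if w.lower() not in STOPWORDS]
    return " ".join(words)

print(clean_headline("The Senate is set to vote on the new budget!"))
# Senate set vote new budget
```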
The next step is to lowercase the data. Although it is commonly overlooked, it is one of the most effective techniques when the data is small. ‘Good’, ‘good’ and ‘GOOD’ are the same word, but the neural network treats them as different tokens and assigns them different weights, which leads to inconsistent output and affects the overall performance of the model. The code for lowercasing is shown below:
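As a quick illustration with made-up headlines, lowercasing collapses the three spellings into a single form. With pandas, `df["title"].str.lower()` does the same over a whole column (the column name `title` is my assumption).

```python
# Three variants of the same headline (made-up examples).
titles = ["Good news", "GOOD News", "good NEWS"]

# Lowercase every headline so that all variants map to the same tokens.
lowered = [t.lower() for t in titles]
unique_titles = set(lowered)

print(unique_titles)
# {'good news'}
```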
Building Word Embeddings (GLOVE)
In this case study, we will use Global Vectors for Word Representation (GloVe) word embeddings. GloVe is an unsupervised learning algorithm for obtaining vector representations of words. The main idea behind word embeddings is that they capture the contextual representation of a word in a document, as well as semantic and syntactic similarity and relationships among different words. Word embeddings have been shown to work well with LSTMs in text classification tasks. The visual representation of word embeddings is shown below:
So, now we will set the file path of the GloVe embeddings file and define the configuration settings, as shown in the code snippet below.
As we can see from the code above, we use a maximum sequence length of 100 and a maximum vocab size of 20000 (we can experiment with increasing it to 25000 or more for better accuracy, at the cost of higher training time). The embedding has 50 dimensions, i.e., each word is a 50-dimensional vector; again, we can experiment with 100 dimensions to check for an accuracy improvement. A validation split of 0.2 means that 20% of the training data is used for validating the model during the training phase. The batch size used here is 64, and we can also experiment with different batch sizes. For demo purposes, I have used only 5 epochs to train the LSTM model; we can increase the number of epochs for better results.
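For reference, the settings described above could be written as the constants below. Only the values come from this post; the variable names (apart from MAX_VOCAB_SIZE, which is used later) and the GloVe file name are my assumptions.

```python
# Configuration values taken from the text; names are illustrative.
MAX_SEQUENCE_LENGTH = 100   # every headline is padded/truncated to this length
MAX_VOCAB_SIZE = 20000      # keep at most this many distinct words
EMBEDDING_DIM = 50          # size of each GloVe word vector
VALIDATION_SPLIT = 0.2      # share of the training data held out for validation
BATCH_SIZE = 64
EPOCHS = 5

# Hypothetical path; point this at wherever the GloVe file actually lives.
GLOVE_PATH = f"glove.6B.{EMBEDDING_DIM}d.txt"

print(GLOVE_PATH)
```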
Next, we will load the pre-trained word vectors from embedding file.
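A minimal sketch of such a loader is shown below. GloVe text files store one word per line followed by its vector components, separated by spaces. To keep the example self-contained, it reads a tiny made-up file with 3-dimensional vectors rather than the real embeddings file; the function name `load_glove` is hypothetical.

```python
import os
import tempfile

import numpy as np

def load_glove(path):
    """Read a GloVe-style text file into a {word: vector} dictionary."""
    word2vec = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            values = line.rstrip().split(" ")
            word2vec[values[0]] = np.asarray(values[1:], dtype="float32")
    return word2vec

# Tiny stand-in file (made-up 3-dimensional vectors) instead of the real one.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_vectors.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("news 0.1 0.2 0.3\nfake -0.4 0.5 0.6\n")
    word2vec = load_glove(demo_path)

print(len(word2vec), word2vec["news"].shape)
```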
Next, we will convert sentences (texts) into integers. Machine learning and deep learning models cannot work with raw text, so we have to convert it into a numerical representation.
Now, let's understand the code snippet shown above. First, we tokenize the title text using a tokenizer with the number of words set to the max vocab size defined earlier, i.e., 20000. Now the question is: what do tokenizer.fit_on_texts and tokenizer.texts_to_sequences do?
tokenizer.fit_on_texts creates the vocabulary index based on word frequency, with more frequent words receiving lower indices. For example, given the sentences “My skill is different from other student” and “I am a good student”, word_index[“student”] = 1 because student appears 2 times, while skill, which appears only once, receives a higher index (index 0 is reserved for padding). tokenizer.texts_to_sequences then converts each text into a sequence of integers: it takes each word in the sentence and replaces it with its corresponding integer value from the word_index dictionary.
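To make the idea concrete, here is a pure-Python sketch of frequency-based indexing. It imitates what the Keras Tokenizer's `fit_on_texts` and `texts_to_sequences` do, but it is not the Keras code itself: it skips punctuation filtering and the vocabulary cap, and the tie-breaking rule (first-seen order) is my assumption.

```python
from collections import Counter

texts = ["My skill is different from other student", "I am a good student"]

# "Fit": count lowercased words, then rank them by frequency.
counts = Counter(word for text in texts for word in text.lower().split())
# sorted() is stable, so words with equal counts keep their first-seen order.
ranked = sorted(counts, key=lambda w: -counts[w])
# Indices start at 1; index 0 is left free for padding.
word_index = {word: i for i, word in enumerate(ranked, start=1)}

# "Transform": replace every word with its integer index.
sequences = [[word_index[w] for w in text.lower().split()] for text in texts]

print(word_index["student"], word_index["skill"])
print(sequences)
```

Because student is the only word that appears twice, it gets index 1, and every other word gets a higher index.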
Next, we pad the sequences to the maximum length of 100 that we declared earlier. This ensures that all sequences have the same length, which is required when building a neural network. Sequences shorter than 100 are padded with zeroes, while longer sequences are truncated to 100. We then use this padded sequence matrix as our feature matrix X, and our target variable y is the label, i.e., df[‘label’]. Printing the shape of the tensor gives a matrix of shape 44898 x 100.
Then we save the word-to-id (integer) mapping obtained from the tokenizer into a new variable named word2idx, as shown below.
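The sketch below shows the padding logic with NumPy on two toy sequences. It mirrors the default behaviour of Keras' `pad_sequences` (zeros added at the front, and long sequences trimmed from the front); whether the original notebook kept those defaults is an assumption, and the helper name `pad_to_length` is hypothetical.

```python
import numpy as np

def pad_to_length(sequences, maxlen):
    """Left-pad with zeros; for long sequences keep only the last `maxlen` ids."""
    padded = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # an empty sequence stays all zeros
        kept = seq[-maxlen:]
        padded[row, -len(kept):] = kept
    return padded

# One short and one long toy sequence, padded/truncated to length 5.
X = pad_to_length([[2, 3, 4], [8, 9, 10, 11, 1, 5, 6]], maxlen=5)

print(X)
# [[ 0  0  2  3  4]
#  [10 11  1  5  6]]
```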
Preparation of Embedding Matrix
After printing the length, we found a total of 29101 unique tokens. The next task is to create the embedding matrix. The code for preparing the embedding matrix is shared below.
As per the code, num_words is the minimum of MAX_VOCAB_SIZE and the length of word2idx + 1. We know MAX_VOCAB_SIZE = 20000 and the length of word2idx = 29101, so num_words = min(20000, 29102) = 20000. Next, the embedding matrix is created with 20000 rows (one per word) and 50 columns (the embedding dimension). Words that are not found in the pre-trained GloVe vectors keep all-zero rows.
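The same logic on a toy scale looks like the sketch below. The sizes (a cap of 5 words, 3-dimensional vectors) and the vocabulary are made up so the result is easy to inspect; the variable names word2vec and embedding_matrix are my assumptions.

```python
import numpy as np

MAX_VOCAB_SIZE = 5   # toy value; the post uses 20000
EMBEDDING_DIM = 3    # toy value; the post uses 50

# Toy tokenizer output and toy pre-trained vectors ("zzxq" has no vector).
word2idx = {"news": 1, "fake": 2, "zzxq": 3, "senate": 4, "vote": 5, "budget": 6}
word2vec = {
    "news": np.array([0.1, 0.2, 0.3]),
    "fake": np.array([0.4, 0.5, 0.6]),
    "senate": np.array([0.7, 0.8, 0.9]),
    "vote": np.array([1.0, 1.1, 1.2]),
}

# +1 because index 0 is reserved for padding.
num_words = min(MAX_VOCAB_SIZE, len(word2idx) + 1)
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))

for word, i in word2idx.items():
    if i < num_words:                  # skip words beyond the vocabulary cap
        vector = word2vec.get(word)
        if vector is not None:         # unknown words keep their all-zero row
            embedding_matrix[i] = vector

print(embedding_matrix)
```

Row 0 (padding) and row 3 (the unknown word) stay at zero, and the words with index 5 and 6 are dropped by the cap.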
Creation of Embedding Layer
Next, we will create the embedding layer, which will be used as the input to the LSTM model.
In this step we will build the deep learning model, Long Short-Term Memory (LSTM). For this case study we will be using a bidirectional LSTM.
As we can see from the code above, we have used one hidden Bidirectional LSTM layer with 15 neurons. To access the hidden state output for each input time step, we have to set return_sequences=True. This is also a hyper-parameter we can experiment with. Further, we can experiment with other variants such as a unidirectional LSTM or a GRU. The number of neurons can also be increased to check for performance improvements. The detailed model summary is shown below.
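A rough Keras sketch of an embedding layer plus model with this shape is given below. The frozen embedding, the 15-unit bidirectional LSTM and return_sequences=True come from the description above. The GlobalMaxPooling1D layer, the sigmoid output, the loss and the optimizer are my assumptions, chosen because they are consistent with the parameter counts reported in this post. The zero matrix is only a placeholder for the GloVe-filled embedding matrix.

```python
import numpy as np
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import (Bidirectional, Dense, Embedding,
                                     GlobalMaxPooling1D, Input, LSTM)
from tensorflow.keras.models import Model

MAX_SEQUENCE_LENGTH = 100
EMBEDDING_DIM = 50
num_words = 20000

# Placeholder; in the real pipeline this holds the GloVe vectors.
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM), dtype="float32")

input_ = Input(shape=(MAX_SEQUENCE_LENGTH,))
# Frozen embedding layer initialised from the pre-trained matrix.
x = Embedding(num_words, EMBEDDING_DIM,
              embeddings_initializer=Constant(embedding_matrix),
              trainable=False)(input_)
# One bidirectional LSTM layer that returns the hidden state at every time step.
x = Bidirectional(LSTM(15, return_sequences=True))(x)
# Assumption: collapse the time dimension before the classifier.
x = GlobalMaxPooling1D()(x)
output = Dense(1, activation="sigmoid")(x)

model = Model(input_, output)
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()
```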
Here, the main thing to notice is that the total number of model parameters is 1,007,951, whereas only 7,951 are trainable: the sum of the bidirectional LSTM's parameters, i.e., 7,920, and the dense layer's parameters, i.e., 31. The reason is that we set trainable=False for the embedding layer, so its 1,000,000 parameters (20000 words x 50 dimensions) are left non-trainable.
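These counts can be checked by hand. The short calculation below uses the standard LSTM parameter formula (four gates, each with input weights, recurrent weights and a bias). The dense layer count assumes it receives the 30 values produced by the two 15-unit directions.

```python
def lstm_params(input_dim, units):
    """Parameters of one LSTM: 4 gates x (input weights + recurrent weights + bias)."""
    return 4 * (units * (input_dim + units) + units)

embedding = 20000 * 50               # frozen embedding matrix
bilstm = 2 * lstm_params(50, 15)     # forward and backward LSTM
dense = 2 * 15 * 1 + 1               # 30 inputs -> 1 output, plus bias

trainable = bilstm + dense
total = embedding + trainable

print(bilstm, dense, trainable, total)
# 7920 31 7951 1007951
```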
Train test split
Now we split the features and target variable into training and test sets in a 75:25 ratio, where 75% of the data is used for training the model and 25% for testing.
In this step we fit the model on the training set with a batch size of 64 and a validation split of 20%.
As we can see from the result above, the model achieved a validation accuracy of 95% in just 5 epochs. We can of course experiment with a higher number of epochs, but for demo purposes I have selected only 5.
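A 75:25 split can be sketched with NumPy as below. The random arrays stand in for the real X and y, and the seed is arbitrary. In practice, scikit-learn's `train_test_split` with `test_size=0.25` is the usual shortcut for the same thing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the padded feature matrix and the binary labels.
X = rng.integers(0, 20000, size=(1000, 100))
y = rng.integers(0, 2, size=1000)

# Shuffle the row positions, then cut at the 75% mark.
order = rng.permutation(len(X))
n_train = int(0.75 * len(X))
train_idx, test_idx = order[:n_train], order[n_train:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(X_train.shape, X_test.shape)
# (750, 100) (250, 100)
```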
The training and validation loss and accuracy plots are shown below to show the progress of model training.
The training and test accuracy of the model is shown below.
The confusion matrix of the model is shown below.
As we can see from the confusion matrix above, the model has shown impressive performance. Now, let's look at the classification report to understand the overall performance from a statistical point of view.
As per the result, our model has higher recall and F1-score for fake news, whereas precision is higher for true news. The F1-score is often a better measure for judging a machine learning or deep learning model than accuracy alone, as it is the harmonic mean of precision and recall. An F1-score of 95% is a fairly good score, but we can improve the model's performance further by tuning its hyper-parameters.
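For readers who want to see how these three metrics relate, here is a small worked example. The counts are hypothetical and are not taken from the confusion matrix in this post.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts for the "fake" class.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)

print(round(precision, 3), round(recall, 3), round(f1, 3))
# 0.9 0.75 0.818
```

Note how the F1-score sits closer to the lower of the two values, which is why it penalises a model that trades recall for precision or the other way round.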
Apart from that, one more performance metric widely used in binary classification tasks is the Receiver Operating Characteristic curve, also known as the ROC curve. The ROC curve of the model is plotted below.
As per the ROC plot above, it is evident that the model performs fairly well, with a high Area Under the Curve (AUC) of 0.991. An AUC of 1 means an ideal model.
So, now that we have seen our model's performance in detail, let's compare it with the other machine learning models employed in our previous post. The detailed comparative analysis in terms of F1-score, precision, recall and accuracy is shown below.
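One way to read the AUC is as the probability that a randomly chosen fake headline receives a higher score than a randomly chosen true one. The sketch below computes it that way with NumPy on four made-up predictions. It is an illustration of the metric, not the code used to produce the plot above.

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Made-up labels and predicted probabilities.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

print(auc)
# 0.75
```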
From the above result, we can clearly observe that Passive Aggressive is the winner in this problem. LSTM with GLOVE embedding is second best.
The full code demonstrated in this post is available in this repo.
In this post, we have discussed the application of the deep learning model LSTM to detecting fake news from news headlines, and it shows good overall results. But as per the results comparison, we found that the Passive Aggressive classifier performs better than its counterparts. Apart from that, we can also try other deep learning based approaches like 1D CNN, RNN, unidirectional LSTM and GRUs for this problem statement. In the next post we will discuss the application of one more trending deep learning model, BERT (Bidirectional Encoder Representations from Transformers), to this problem statement and then make a detailed comparison to select the best model.
I hope you liked this post. If you did, please hit the like button to show your support.