This article presents a case study on detecting malicious URLs using lexical features and boosted machine learning models.
In recent years, we have witnessed a sharp increase in cybersecurity attacks such as ransomware, phishing, and malware injection on websites all over the world.
As a result, financial institutions, e-commerce companies, and individuals have incurred huge losses.
In this scenario, containing cyberattacks is a major challenge for cybersecurity professionals, because new types of attacks emerge every day.
Hackers create 300,000 new pieces of malware daily. (Source: McAfee)
On average, 30,000 new websites are hacked every day. (Source: Forbes)
These figures indicate how rapidly malicious websites spread all over the world on a daily basis.
Before discussing how machine learning can be applied to detect malicious URLs, let us first understand what a URL is and what makes it malicious.
What is a URL, and what makes it malicious?
The Uniform Resource Locator (URL) is a well-defined, structured format for locating resources on the World Wide Web (WWW).
Generally, three basic components make up a legitimate URL:
i) Protocol: An identifier that determines which protocol to use, e.g., http or https.
ii) Hostname: Also known as the resource name. It contains the IP address or the domain name where the actual resource is located.
iii) Path: It specifies the actual path where the resource is located.
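As a small illustration of these three components, the sketch below splits a URL with Python's standard urllib.parse module. The example URL is hypothetical and chosen only for demonstration.

```python
from urllib.parse import urlparse

# Hypothetical example URL, used only for illustration.
url = "https://www.example.com/docs/index.html?lang=en"

parts = urlparse(url)

print(parts.scheme)    # the protocol component
print(parts.hostname)  # the hostname component
print(parts.path)      # the path component
```

Note that urlparse only recognises the hostname when the URL starts with a scheme (or with "//"), which matters for raw URLs that omit the protocol.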
Modified or compromised URLs employed for cyberattacks are known as malicious URLs.
A malicious URL or website generally hosts trojans, malware, or unsolicited content in the form of phishing, drive-by downloads, or spam.
The main objective of a malicious website is to defraud unsuspecting users or steal their personal or financial details.
In this case study, we treat the detection of malicious URLs as a multi-class classification problem: we classify raw URLs into four class types, namely benign (safe), phishing, malware, and defacement URLs.
Most machine learning algorithms expect numeric inputs, so we will create numeric lexical features from the input URLs. The input to the machine learning algorithms will therefore be these numeric lexical features rather than the raw URLs. If you are not familiar with lexical features, you can refer to this discussion about lexical features on Stack Overflow.
We will use three well-known boosting classifiers: XGBoost, Light GBM, and Gradient Boosting Machines (GBDT).
Later, we will compare their performance and plot the average feature importance to understand which features matter most in predicting the different types of malicious URLs.
Objective of Case Study
The objective of this case study is not only to apply machine learning to predict malicious URLs or websites, but also to identify the features that are critical for classifying a URL into one of the four types.
For training and testing the machine learning algorithms, we used a large dataset of 651,191 URLs, of which 428,103 are benign (safe) URLs, 96,457 are defacement URLs, 94,111 are phishing URLs, and 32,520 are malware URLs. Figure 2 depicts their distribution as percentages.
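The percentage distribution follows directly from the counts quoted above. A minimal sketch of that arithmetic, using only those counts (the variable names are illustrative):

```python
# Class counts as reported in the text above.
class_counts = {
    "benign": 428_103,
    "defacement": 96_457,
    "phishing": 94_111,
    "malware": 32_520,
}

total = sum(class_counts.values())

# Share of each class as a percentage of the whole dataset.
class_share = {
    label: 100 * count / total for label, count in class_counts.items()
}

for label, share in class_share.items():
    print(f"{label:<11}{class_counts[label]:>8}  {share:5.2f}%")
```

The four counts add up to the stated total of 651,191, and benign URLs account for roughly two thirds of the data, which is why the dataset is treated as imbalanced later on.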
Curating the dataset is one of the most crucial tasks in a machine learning project. We curated this dataset from five different sources.
- For collecting benign, phishing, malware, and defacement URLs, we used the URL dataset (ISCX-URL-2016).
- To increase the number of phishing and malware URLs, we used the Malware domain black list dataset.
- We increased the number of benign URLs using the faizan git repo.
- Finally, we added more phishing URLs using the Phishtank dataset and the PhishStorm dataset.
Because the dataset comes from different sources, we first collected the URLs from each source into a separate dataframe and then merged them, retaining only the URLs and their class type. The final dataset is available as a Kaggle dataset.
Now, let's take a look at the final dataset in a pandas dataframe.
Next, let's discuss the different types of URLs in our dataset, i.e., benign, malware, phishing, and defacement URLs.
- Benign URLs: These URLs are safe to browse. Some examples of benign URLs are as follows:
- Malware URLs: These URLs inject malware into the victim's system once the victim visits them. Some examples of malware URLs are as follows:
- Defacement URLs: Defacement URLs are generally created by hackers who break into a web server and replace the hosted website with one of their own, using techniques such as code injection and cross-site scripting. Common targets of defacement are religious, government, bank, and corporate websites. Some examples of defacement URLs are as follows:
- Phishing URLs: By creating phishing URLs, hackers try to steal sensitive personal or financial information such as login credentials, credit card numbers, and internet banking details. Some examples of phishing URLs are shown below:
Next, we will plot word clouds for the different types of URLs.
The word cloud helps in understanding the pattern of words/tokens in particular target labels.
It is one of the most appealing techniques of natural language processing for understanding the pattern of word distribution.
As the figure below shows, the word cloud of benign URLs is unsurprising, with frequent tokens such as html, com, org, and wiki. Phishing URLs have frequent tokens such as tools, ietf, www, index, battle, and net, while html and org also appear with high frequency because these URLs try to mimic legitimate URLs to deceive users.
The word cloud of malware URLs is dominated by tokens such as exe, E7, BB, and MOZI. These tokens are expected, as malware URLs try to install trojans in the form of executable files on the user's system once the user visits them.
Defacement URLs aim to modify the original website's code, which is why the tokens in their word cloud are common web development terms such as index, php, itemid, https, and option.
The word clouds above give a clear picture of how tokens are distributed across the different types of URLs.
Next, we move on to feature engineering, in which we create lexical features from the raw URLs.
In this section, we extract the following lexical features from the raw URLs. These features are used as the input for training the machine learning models.
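A word cloud is essentially a picture of token frequencies. The sketch below shows one way to compute those frequencies with the standard library, assuming that URLs are tokenised by splitting on non-alphanumeric characters. The URLs are hypothetical, and the tokenisation rule is an assumption rather than the one used for the figures.

```python
import re
from collections import Counter

# Hypothetical URLs, invented purely to illustrate the tokenisation step.
urls = [
    "http://www.example.com/index.php?option=com_content&Itemid=2",
    "http://example.org/wiki/Main_Page.html",
    "http://www.example.net/index.php?option=com_user",
]


def tokenize(url):
    """Split a URL on non-alphanumeric characters and drop empty tokens."""
    return [tok.lower() for tok in re.split(r"[^A-Za-z0-9]+", url) if tok]


token_counts = Counter(tok for url in urls for tok in tokenize(url))

# The most frequent tokens are the ones a word cloud would draw largest.
print(token_counts.most_common(5))
```

Computing such counts separately for each class label gives the per-class token patterns that the word clouds visualise.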
- having_ip_address: Cyber attackers often use an IP address in place of the domain name to hide the identity of the website. This feature checks whether the URL contains an IP address.
- abnormal_url: This feature can be extracted from the WHOIS database. For a legitimate website, the identity is typically part of its URL.
- google_index: This feature checks whether the URL is indexed in Google Search Console.
- Count. : Phishing or malware websites generally use more than two subdomains in the URL. Each domain is separated by a dot (.). If a URL contains more than three dots, the probability that the site is malicious increases.
- Count-www: Most safe websites have exactly one www in their URL. This feature helps detect malicious websites when the URL has no www or more than one.
- Count@: The presence of the “@” symbol in a URL causes everything preceding it to be ignored.
- Count_dir: The presence of multiple directories in the URL generally indicates a suspicious website.
- Count_embed_domain: The number of embedded domains can be helpful in detecting malicious URLs. It can be obtained by counting the occurrences of “//” in the URL.
- Suspicious words in URL: Malicious URLs generally contain suspicious words such as PayPal, login, signin, bank, account, update, bonus, service, ebayisapi, and token. We record the presence of such frequently occurring suspicious words as a binary variable, i.e., whether or not any of these words is present in the URL.
- Short_url: This feature identifies whether the URL uses a URL shortening service such as bit.ly, goo.gl, or go2l.ink.
- Count_https: Malicious URLs generally do not use https. HTTPS requires the site to present a TLS certificate and encrypts the traffic between the user and the server, although it does not by itself guarantee that a site is safe. The presence or absence of https in the URL is therefore a useful feature.
- Count_http: Phishing or malicious websites often have more than one http in their URL, whereas safe sites have only one.
- Count%: URLs cannot contain spaces, so URL encoding replaces each space with a percent-encoded sequence (%20). Safe sites generally contain fewer encoded spaces, whereas malicious websites generally contain more of them and hence more % symbols.
- Count?: The presence of ? in a URL denotes a query string, which contains the data to be passed to the server. A larger number of ? symbols in a URL suggests a suspicious URL.
- Count- : Phishers or cybercriminals generally add dashes (-) as a prefix or suffix to a brand name so that the URL looks genuine, for example www.flipkart-india.com.
- Count=: The presence of the equals sign (=) in a URL indicates that variable values are passed from one form page to another. It is considered risky because anyone can change the values to modify the page.
- url_length: Attackers generally use long URLs to hide the domain name. We found that the average length of a safe URL is 74.
- hostname_length: The length of the hostname is also an important feature for detecting malicious URLs.
- First directory length: This feature is the length of the first directory in the URL. It is found by locating the first ‘/’ in the path and measuring the length of the directory that follows it. To access directory-level information we need to install the Python library tld. You can check this link for installing tld.
- Length of top-level domains: A top-level domain (TLD) is one of the domains at the highest level in the hierarchical Domain Name System of the Internet. For example, in the domain name www.example.com, the top-level domain is com. The length of the TLD is also useful in identifying malicious URLs. Most URLs have the .com extension, and TLD lengths of 2 to 3 characters generally indicate safe URLs.
- Count_digits: The presence of digits in a URL generally indicates a suspicious URL. Safe URLs generally do not have digits, so the number of digits in a URL is an important feature for detecting malicious URLs.
- Count_letters: The number of letters in the URL also plays a significant role in identifying malicious URLs. Attackers try to increase the length of the URL to hide the domain name, which is generally done by increasing the number of letters and digits in the URL.
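Most of the features above are simple string operations on the raw URL. The sketch below shows how a subset of them could be computed with the standard library. It is an illustration under stated assumptions, not the case study's actual code: the function name, the dictionary keys, the IPv4-only address check, and the way “//” is counted after the scheme are all assumptions, and the word and shortener lists are limited to the examples named above. The abnormal_url, google_index, and TLD length features are left out because they need external lookups or the tld library.

```python
import re
from urllib.parse import urlparse

# Only the examples named in the text; a real list would likely be longer.
SUSPICIOUS_WORDS = ("paypal", "login", "signin", "bank", "account", "update",
                    "bonus", "service", "ebayisapi", "token")
SHORTENERS = ("bit.ly", "goo.gl", "go2l.ink")
IPV4_HOST = re.compile(r"(?:\d{1,3}\.){3}\d{1,3}")  # IPv4 only (simplification)
HAS_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*://")


def lexical_features(url):
    """Return a dict of simple lexical features for one raw URL."""
    # urlparse needs a scheme or a leading "//" to recognise the hostname.
    parsed = urlparse(url if HAS_SCHEME.match(url) else "//" + url)
    hostname = parsed.hostname or ""
    directories = [part for part in parsed.path.split("/") if part]
    lowered = url.lower()
    # Assumption: the "//" that follows the scheme is not an embedded domain.
    without_scheme = url.split("://", 1)[-1]
    return {
        "having_ip_address": int(bool(IPV4_HOST.fullmatch(hostname))),
        "count_dot": url.count("."),
        "count_www": lowered.count("www"),
        "count_at": url.count("@"),
        "count_dir": parsed.path.count("/"),
        "count_embed_domain": without_scheme.count("//"),
        "sus_url": int(any(word in lowered for word in SUSPICIOUS_WORDS)),
        "short_url": int(any(hostname == s or hostname.endswith("." + s)
                             for s in SHORTENERS)),
        "count_https": lowered.count("https"),
        "count_http": lowered.count("http"),  # also counts each "https"
        "count_percent": url.count("%"),
        "count_question": url.count("?"),
        "count_hyphen": url.count("-"),
        "count_equal": url.count("="),
        "url_length": len(url),
        "hostname_length": len(hostname),
        "fd_length": len(directories[0]) if directories else 0,
        "count_digits": sum(ch.isdigit() for ch in url),
        "count_letters": sum(ch.isalpha() for ch in url),
    }


# Hypothetical URL, used only to show the shape of the output.
example = lexical_features("http://192.168.0.1/login/update.php?id=1&user=a")
for name, value in example.items():
    print(f"{name}: {value}")
```

Applying such a function to every row of the URL column yields one numeric column per feature, which is the form the classifiers need.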
After creating the above 20 features, the dataset looks as shown below.
You can create more features that may be relevant in identifying malicious URLs. You can also explore other machine learning algorithms to check whether overall performance improves.
In the next step, we drop the irrelevant columns, i.e., URL and tld.
The URL column is dropped because we have already extracted from it the relevant features that serve as input to the machine learning algorithms.
The tld column is dropped because it is an intermediate text column that was created only to compute the length of the top-level domain.
After that, the most important step is to label encode the target variable (type), converting it into the numerical categories 0, 1, 2, and 3, because the machine learning algorithms used here expect a numeric target variable.
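Label encoding can be sketched with scikit-learn's LabelEncoder. The snippet assumes the class names are stored as lowercase strings; the exact spelling in the dataset's type column is an assumption.

```python
from sklearn.preprocessing import LabelEncoder

# A few class labels, assumed to match the form used in the "type" column.
url_types = ["phishing", "benign", "defacement", "malware", "benign"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(url_types)

# LabelEncoder assigns integers in sorted order of the class names.
print(list(encoder.classes_))
print(list(encoded))
```

Because the integers follow the sorted class names, benign maps to 0, defacement to 1, malware to 2, and phishing to 3 under this assumption. The fitted encoder's inverse_transform method converts predictions back to class names.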
Segregating Feature and Target variables
In the next step, we create the predictor and target variables. The predictor variables are the independent variables, i.e., the features of the URL, and the target variable is type.
Train Test Split
The next step is to split the dataset into train and test sets. We split the dataset in an 80:20 ratio, i.e., 80% of the data is used to train the machine learning model and the remaining 20% is used to test it. The dataset is imbalanced: around 66% of the data consists of benign URLs, 5% malware, 14% phishing, and 15% defacement URLs. After a purely random split, the distribution of the categories may differ between the train and test sets, which can strongly affect the performance of the machine learning model. To maintain the same proportion of the target variable in both sets, stratification is needed.
The stratify parameter makes the split so that the proportion of values in each resulting sample is the same as the proportion of values in the array passed to the parameter.
To understand the basic difference between the shuffle and random_state parameters, you can follow this discussion on Stack Overflow.
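The effect of stratification can be seen on a toy target that mimics the class mix described above. This is a sketch using scikit-learn's train_test_split; the toy data, the integer class codes, and the random_state value are assumptions made for illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy target with the approximate class mix described above:
# 66% class 0, 15% class 1, 14% class 2, 5% class 3 (1,000 samples).
y = [0] * 660 + [1] * 150 + [2] * 140 + [3] * 50
X = [[i] for i in range(len(y))]  # placeholder feature rows

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 80:20 split
    stratify=y,       # keep class proportions equal in both parts
    shuffle=True,     # shuffle before splitting (the default)
    random_state=5,   # arbitrary seed, makes the split reproducible
)

print(Counter(y_train))
print(Counter(y_test))
```

With stratify set, the 200 test rows keep the 66/15/14/5 mix, so even the smallest class is represented in the test set in proportion to its share of the data.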
The code for building machine learning models is shared below.
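The original listing is not reproduced here, so what follows is only a minimal sketch of the fit-and-predict workflow, not the case study's actual code. It trains scikit-learn's GradientBoostingClassifier on synthetic four-class data; the data, hyperparameters, and seeds are all assumptions. XGBoost's XGBClassifier and LightGBM's LGBMClassifier expose the same fit and predict methods and could be swapped in for the model object.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic four-class data standing in for the lexical feature matrix.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=6,
    n_classes=4, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Illustrative hyperparameters, not the settings used in the case study.
model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"test accuracy: {accuracy:.3f}")
```

The accuracy printed here reflects the synthetic data only and says nothing about the results reported for the real dataset.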
After fitting the models as shown above, we made predictions on the test set. The performance of Light GBM, XGBoost, and GBDT is shown below.
From the above figure, it is evident that XGBoost shows the best performance in terms of test accuracy. XGBoost attains the highest accuracy of 96.1%, Light GBM attains 95.6% while GBDT attains 93.6%.
Further, to find which features are most critical in classifying malicious URLs, we computed the average feature importance by combining the feature importances of Light GBM, XGBoost, and GBDT. After creating the dataframe of average feature importances, we plotted the top-performing features, which are shown below.
From the above figure, it can be observed that the top 5 features in classifying malicious URLs are as follows:
- First directory length (fd_length)
- Length of a hostname (hostname_length)
- The number of directories in a URL (count_dir)
- The number of digits in URL (count_digits)
- The actual length of the URL (url_length)
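Averaging feature importances across models can be sketched as follows. The numbers are invented for illustration and are not the case study's results. Normalising each model's importances to sum to 1 before averaging is an assumption, made because different libraries may report importances on different scales (for example split counts versus normalised gain).

```python
# Hypothetical raw importances: invented values, NOT the reported results.
raw_importances = {
    "model_a": {"fd_length": 120, "hostname_length": 90,
                "count_dir": 60, "url_length": 30},
    "model_b": {"fd_length": 0.40, "hostname_length": 0.20,
                "count_dir": 0.30, "url_length": 0.10},
    "model_c": {"fd_length": 0.25, "hostname_length": 0.35,
                "count_dir": 0.15, "url_length": 0.25},
}


def normalize(importances):
    """Rescale one model's importances so that they sum to 1."""
    total = sum(importances.values())
    return {name: value / total for name, value in importances.items()}


normalized = {model: normalize(imp) for model, imp in raw_importances.items()}
feature_names = list(next(iter(normalized.values())))

# Mean of the normalised importances across all models, per feature.
average_importance = {
    name: sum(normalized[model][name] for model in normalized) / len(normalized)
    for name in feature_names
}

# Features ranked from most to least important on average.
ranked = sorted(average_importance.items(), key=lambda item: item[1],
                reverse=True)
for name, value in ranked:
    print(f"{name}: {value:.3f}")
```

With real models, the per-model dictionaries would be built from each fitted estimator's feature_importances_ attribute paired with the feature column names.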
The full code of this case study is available at this GitHub repo.