Automate Data Profiling with a Few Lines of Code
Data profiling is an essential step in any data science problem. It examines the different variables in a dataset and gives an overall statistical summary of the data. But it is also a very time-consuming process: the exploratory data analysis (EDA) it involves typically takes around 30% of the time in a data science project, because data profiling requires not only univariate analysis but also multivariate analysis to uncover hidden patterns in the data.
In this post, we will discuss three Python libraries that automate the whole data profiling process and save data scientists a lot of time. These libraries help in creating detailed data profiling reports, visualizing the dataset, and comparing two different datasets. The top three Python data profiling libraries are as follows:
- Pandas Profiling
- Sweetviz
- AutoViz
1. Pandas Profiling
Pandas Profiling is a Python library that creates automated EDA reports in just two lines of code. For demonstration purposes, we use the Chronic Kidney Disease dataset available in the UCI Machine Learning Repository. This dataset consists of 25 columns and 400 rows, so analyzing each and every variable manually would take a lot of time. Let's see how the pandas-profiling library can perform automated data profiling for us.
To install the pandas-profiling library, just run the pip install pandas-profiling command on the command line. After installing the library, you can import it with the following commands.
Next, we read the dataset that we want to analyze.
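Reading the data is a standard pandas call; with the file on disk it would simply be pd.read_csv("kidney_disease.csv"). So that the snippet runs on its own, a few illustrative stand-in rows with CKD-style column names (an assumption, not the real file) are used here:

```python
import io
import pandas as pd

# Stand-in for pd.read_csv("kidney_disease.csv"): a few illustrative rows
# with CKD-style columns so the snippet runs without the file on disk.
csv_data = """age,bp,sg,al,su,class
48,80,1.020,1,0,ckd
7,50,1.010,4,0,ckd
62,80,1.010,2,3,ckd
48,70,1.005,0,0,notckd
"""
df = pd.read_csv(io.StringIO(csv_data))
print(df.shape)  # (4, 6) here; the real dataset is (400, 25)
```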
In the next step, run the following command to generate the automated data profiling report of the dataset inside the Jupyter notebook itself.
If you want the full data profiling report as an HTML file, you can use the commands below.
Running the above command takes barely one to two minutes to generate the profiling report. The generated report has seven important sections to look out for.
1. Overview of the dataset
This is the first section of the report. Here you get overall information about the dataset in terms of the number of observations, the percentage of missing records, the number of columns of each data type, and so on.
2. Data Quality section
The data quality section is available under the warnings section. It lists data quality issues in the dataset point by point. As you can see in the image below, it has found two columns, albumin and sugar, which have more than 50% and 80% zero values respectively.
3. Variables distribution
In this section, you get the distribution of each column in the dataset.
Apart from that, if you click the Toggle details button, you get more details about the column in terms of statistics, histogram, common values, and extreme values.
4. Correlations among variables
In this section you get a correlation matrix plotted across all the numeric variables in the dataset. By default it shows the Pearson correlation, but you can view other types of correlation as well, such as Spearman's, Kendall's, Phik, and Cramér's V.
5. Missing Values
In this section you get the missing-values matrix for all the columns in the dataset.
6. Variable Interaction
In this section we can do multivariate analysis of different features of the dataset by selecting the features as shown below.
7. Sample Rows
In this section you see the first 10 and last 10 rows of the dataset.
2. Sweetviz
Sweetviz is a Python library that automates data profiling and creates high-density visualizations. It also helps in comparing datasets and drawing inferences from them.
Here we will analyze the same dataset that we used for pandas-profiling.
After running the above code, you will be redirected to eda.html, where you can view the full data profiling report generated by Sweetviz.
When you click on any particular feature, you get more information about its distribution, most frequent values, smallest values, largest values, numerical associations, and so on.
Apart from that, if you want to compare all the feature distributions against the target variable, run the following command.
As you can see, the target variable class is highlighted in black, and every feature's distribution is shown with respect to the target variable class.
3. AutoViz
The AutoViz library in Python plots visualizations of all the features automatically. To use AutoViz, run the following command.
First, it shows the overall statistics and information about the dataset.
The following types of visualizations are plotted after running the above command.
Normally, producing such visualizations involves writing a fair amount of Python code, but this one-liner automates them and saves a lot of plotting and coding time.
Thanks for reading this post. I hope you liked it. If so, please like this post and share it with more data science professionals.
Tell me which Python library you love to use for data profiling and data visualization. 🙂