A common way to make a data quality report in Python

Data quality is a fundamental issue for business intelligence. The reliability of your analysis, and by extension the decisions you make based on that analysis, depends on the quality of the data you use.

A data quality report provides objective measures of the quality of your data, making it a critical first step in the business intelligence process.

To create the report, we are going to build a number of DataFrames from our dataset and then merge them together at the end. The parts of our report will include the following:

1. Available columns
2. For each column:
* Data type
* Count of missing values
* Count of present values
* Number of unique values
* Minimum value
* Maximum value

I have to say that the Chinese government's open data is quite limited, so I had to crawl some of it myself.

Let's first look at the full code:

import pandas as pd
from pymongo import MongoClient

client = MongoClient('localhost', 27017)

db = client.smallbusiness
collection = db.company

data = collection.find()

company = pd.DataFrame(list(data))

# available columns
columns = pd.DataFrame(list(company.columns.values))

# 1: data type of each column
data_types = pd.DataFrame(company.dtypes, columns=['Data Type'])

# 2: count of missing values
missing_data_counts = pd.DataFrame(company.isnull().sum(),
                                   columns=['Missing Values'])

# 3: count of present values
present_data_counts = pd.DataFrame(company.count(), columns=['Present Values'])

# 4: number of unique values
unique_value_counts = pd.DataFrame(columns=['Unique Values'])
for v in list(company.columns.values):
    unique_value_counts.loc[v] = [company[v].nunique()]

# 5: minimum value
min_values = pd.DataFrame(columns=['Minimum Value'])
for v in list(company.columns.values):
    min_values.loc[v] = [company[v].min()]

# 6: maximum value
max_values = pd.DataFrame(columns=['Maximum Value'])
for v in list(company.columns.values):
    max_values.loc[v] = [company[v].max()]

data_quality_report = (data_types.join(present_data_counts)
                                 .join(missing_data_counts)
                                 .join(unique_value_counts)
                                 .join(min_values)
                                 .join(max_values))
print("Data Quality Report")
print("Total records: {}".format(len(company.index)))
print(data_quality_report)

The result of running it:

Step by step:

The first thing we do is import the Python libraries we'll need to create the report. Once we've created a DataFrame from the MongoDB collection, we're ready to create each of the DataFrames that will make up the report:

# Create a dataframe of the columns in the company dataframe
columns = pd.DataFrame(list(company.columns.values))

To create a DataFrame of the columns, we iterate over the column values in the company DataFrame. We use list() to create a list of the column names and use the result to build the columns DataFrame.

# Create a dataframe of the data type of each column
data_types = pd.DataFrame(company.dtypes, columns=['Data Type'])

We next create a DataFrame of the data type of each column. To do this, we use the dtypes attribute to get the data types in the DataFrame, and pass columns=['Data Type'] to specify the name of the column in our new DataFrame.
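To make the pattern concrete, here is a minimal sketch on a toy DataFrame (the column names and values are invented for illustration):

```python
import pandas as pd

# Toy stand-in for the company data
df = pd.DataFrame({"name": ["Acme", "Beta"], "founded": [1999, 2005]})

# Same pattern as above: wrap the dtypes Series in a new DataFrame
data_types = pd.DataFrame(df.dtypes, columns=["Data Type"])
print(data_types)
```

The index of the result is the column names, which is what later lets us join all the pieces together.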

# Create a dataframe with the count of missing values in each column
missing_data_counts = pd.DataFrame(company.isnull().sum(),
                                   columns=['Missing Values'])

To create the DataFrame of missing data counts, we chain together two methods provided by a pandas DataFrame, isnull() and sum(). This returns, for each column, the count of cells with missing data. We then put those counts into the 'Missing Values' column of the new DataFrame.

# Create a dataframe with the count of present values in each column
present_data_counts = pd.DataFrame(company.count(),
                                   columns=['Present Values'])

Next, we create a DataFrame with the per-column count of cells that contain a value. To do this, we simply call count() on the DataFrame, which by default returns the number of non-null values in each column.
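As a quick sanity check on these two counts, here is a toy example (the data is invented) showing how isnull().sum() and count() complement each other:

```python
import pandas as pd

# Toy data with one missing value in the "city" column
df = pd.DataFrame({"name": ["Acme", "Beta", "Gamma"],
                   "city": ["Beijing", None, "Shanghai"]})

missing = pd.DataFrame(df.isnull().sum(), columns=["Missing Values"])
present = pd.DataFrame(df.count(), columns=["Present Values"])

# For every column, missing + present equals the total row count
print(missing.join(present))
```

This invariant (missing + present = total records) is a useful cross-check when reading the final report.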

# Create a dataframe with the count of unique values in each column
unique_value_counts = pd.DataFrame(columns=['Unique Values'])
for v in list(company.columns.values):
    unique_value_counts.loc[v] = [company[v].nunique()]

The unique value count DataFrame is a bit more complicated to create. We first create an empty DataFrame with a single column, 'Unique Values'. We then build a Python list of the column names in the company DataFrame and loop through it. In the loop, we:

* Use nunique() to get a count of the unique values in the given column
* Use the DataFrame's .loc indexer to address the row by its label, such as 'Company Index'
* Assign the unique value count to that row of our DataFrame.
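Incidentally, pandas can do this without the explicit loop: DataFrame.nunique() returns the per-column unique counts in one call. A minimal sketch on toy data (values invented):

```python
import pandas as pd

# "Acme" and 1999 each appear twice, so both columns have 2 unique values
df = pd.DataFrame({"name": ["Acme", "Beta", "Acme"],
                   "year": [1999, 2005, 1999]})

# One call replaces the whole loop; the result is a Series indexed by column
unique_value_counts = pd.DataFrame(df.nunique(), columns=["Unique Values"])
print(unique_value_counts)
```

The loop version in the post produces the same result; the one-liner is simply more idiomatic in recent pandas versions.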

# Create a dataframe with the minimum value in each column
min_values = pd.DataFrame(columns=['Minimum Value'])
for v in list(company.columns.values):
    min_values.loc[v] = [company[v].min()]

Similar to what we did when creating the unique_value_counts DataFrame, to create the min_values DataFrame we loop through the list of column names in the company DataFrame, get the minimum value of each column using the min() function, and insert it into min_values beside the appropriate column name.

# Create a dataframe with the maximum value in each column
max_values = pd.DataFrame(columns=['Maximum Value'])
for v in list(company.columns.values):
    max_values.loc[v] = [company[v].max()]

We create the max_values DataFrame just as we did min_values, except that we use the max() function to get the maximum value in each column.
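One caveat: under Python 3, calling min() or max() on an object column that mixes types (say, integers and strings) raises a TypeError. A defensive variant of the loop, assuming you would rather record a placeholder than crash (the data here is invented):

```python
import pandas as pd

# "mixed" holds both an int and a string, so max() cannot compare them
df = pd.DataFrame({"name": ["Beta", "Acme"], "mixed": [1, "two"]})

max_values = pd.DataFrame(columns=["Maximum Value"])
for v in df.columns:
    try:
        max_values.loc[v] = [df[v].max()]
    except TypeError:
        # Comparison between mixed types is undefined; record a placeholder
        max_values.loc[v] = [None]
print(max_values)
```

The same guard applies to the min_values loop.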

# Merge all the dataframes together by the index
data_quality_report = (data_types.join(present_data_counts)
                                 .join(missing_data_counts)
                                 .join(unique_value_counts)
                                 .join(min_values)
                                 .join(max_values))

With all of our DataFrames created, we merge them all together starting with the data_types DataFrame. Pandas allows us to chain together statements, so rather than having to write multiple lines of code, we can use a join() statement for each of our DataFrames.

Something to note here is that since each of our DataFrames has exactly the same index values, which in this example are the column names of the company DataFrame, we don’t need to provide any arguments to the join statements; we simply call them, passing the DataFrame as the only argument.

This joining by index is very similar to the way a join statement works in a relational database. If your DataFrames have the same index column, you can simply join them together as we’ve done here. If, however, you have a primary/foreign key type of a relationship, you can use those keys to perform an SQL-like join.
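For the key-based case, pandas' merge() is the SQL-like tool. A minimal sketch with invented primary/foreign key tables:

```python
import pandas as pd

# Each company row carries a foreign key into the cities table
companies = pd.DataFrame({"name": ["Acme", "Beta"], "city_id": [1, 2]})
cities = pd.DataFrame({"city_id": [1, 2], "city": ["Beijing", "Shanghai"]})

# Inner join on the shared city_id key, analogous to SQL's JOIN ... ON
joined = companies.merge(cities, on="city_id")
print(joined)
```

For our report, though, the plain index-based join() is all we need, since every piece shares the same index.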

# Print out a nice report
print("Data Quality Report")
print("Total records: {}".format(len(company.index)))
print(data_quality_report)

Finally, we print out the report and discover the quality of our data. For good measure, we also print out the total record count, which provides the context for all the counts in our report.

Another thing to note is that we've included the minimum and maximum values of our object-type columns. They are included only because they are part of the report we are creating. When presenting this information to others, you can tell them to disregard the minimum and maximum values of those columns. However, we want to keep all the columns there to report on the other metrics.
