Importing a CSV file into a Pandas DataFrame

Pandas is an open-source, high-performance library that provides easy-to-use data structures and data analysis tools for Python. Pandas was created to aid in the analysis of time series data, and has become a standard in the Python community. Not only does it provide data structures, such as a Series and a DataFrame, that help with all aspects of data science, it also has built-in analysis methods which we’ll use later.

Before we can start cleaning and standardizing data using Pandas, we need to get the data into a Pandas DataFrame, the primary data structure of Pandas. You can think of a DataFrame as something like an Excel spreadsheet: it has rows and columns. Once data is in a DataFrame, we can use the full power of Pandas to manipulate and query it.
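To make the spreadsheet analogy concrete, here is a minimal sketch that builds a small DataFrame by hand. The column names and values are invented for illustration and are not taken from the data file used in this post.

```python
import pandas as pd

# Hypothetical sample data; the column names are made up for illustration.
df = pd.DataFrame(
    {
        "name": ["Acme", "Globex", "Initech"],
        "employees": [120, 45, 300],
    }
)

print(df.shape)                    # (number of rows, number of columns)
print(list(df.columns))            # the column labels
print(df[df["employees"] > 100])   # rows that satisfy a condition
```

Each dictionary key becomes a column and each list position becomes a row, which is roughly how a sheet with a header row is laid out.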

Pandas provides a highly configurable function—read_csv()—that we’ll use to import our data. On a modern laptop with 4+ GB of RAM, we can easily and quickly import the entire accidents dataset, more than 7 million rows.
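As a rough illustration of how configurable read_csv() is, the sketch below uses two of its options, usecols and chunksize, which can help when a file is too large to load comfortably in one go. The in-memory CSV text and its column names are assumptions standing in for a real file.

```python
import io
import pandas as pd

# Stand-in for a large CSV file; the columns are hypothetical.
csv_text = "id,city,severity\n1,Leeds,2\n2,York,3\n3,Leeds,1\n4,Hull,2\n"

# usecols limits which columns are parsed; chunksize makes read_csv()
# return an iterator of DataFrames with at most that many rows each.
reader = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["id", "severity"],
    chunksize=2,
)

total_rows = 0
for chunk in reader:
    total_rows += len(chunk)

print(total_rows)
```

For a file that fits in memory, as described above, neither option is needed and a single read_csv() call is enough.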

The following code shows how to import a CSV file into a Pandas DataFrame:

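import pandas as pd
import numpy as np  # not used below; commonly imported alongside pandas

# Path to the CSV file
data_file = '~/Downloads/data/company.csv'

# read_csv() replaces DataFrame.from_csv(), which was removed in pandas 1.0.
# tupleize_cols was removed in the same release, so it is omitted here.
raw_data = pd.read_csv(data_file,
                       header=0,       # first row holds the column names
                       sep=',',        # fields are comma-separated
                       index_col=0,    # first column becomes the row labels
                       encoding=None)  # None means the default encoding

# Show the first five rows
print(raw_data.head())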

Running it prints the first five rows of the DataFrame.

A few notes on the code:

In order to use Pandas, we need to import it. NumPy is imported alongside it by convention, although this example does not use it directly:
// import pandas as pd
// import numpy as np
Next we set the path to our data file. In this case, I've used a path relative to my home directory (~). I suggest using the full path to the file in production applications:
// data_file = '~/Downloads/data/company.csv'
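One way to get a full path, sketched here with the standard library's pathlib, is to expand the "~" explicitly before handing the path to Pandas. This is only one option, not necessarily how the original code was run.

```python
from pathlib import Path

# Same path as above; "~" stands for the current user's home directory.
data_file = Path("~/Downloads/data/company.csv").expanduser()

print(data_file)                # the path with "~" expanded
print(data_file.is_absolute())  # whether the result is an absolute path
print(data_file.exists())       # False unless the file is actually there
```

The resulting Path object can be passed to read_csv() in place of the string.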
After that, we use the read_csv() function to import the data. We've passed a number of arguments to the function:
>>header: The row number to use as the column names (0 is the first row)
>>sep: The delimiter that separates the fields, here a comma
>>index_col: The column to use as the row labels of the DataFrame
>>encoding: The character encoding to use when reading the file, such as 'utf-8'
>>tupleize_cols: Whether to leave a list of tuples on the columns as is instead of converting it to a MultiIndex; this argument was deprecated and then removed in pandas 1.0
Finally, we print out the top five rows of the DataFrame using the head() method.
// print(raw_data.head())
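To see what these arguments do without needing the data file, here is a small self-contained sketch that reads CSV text from memory. The semicolon-separated sample data is hypothetical, and encoding is left out because the text is already decoded.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data standing in for a real file.
csv_text = "id;name;city\n10;Acme;Leeds\n20;Globex;York\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    header=0,      # the first row holds the column names
    sep=";",       # fields are separated by semicolons in this sample
    index_col=0,   # use the first column ("id") as the row labels
)

print(df.head())
print(df.loc[20, "name"])  # look up a value by row label and column name
```

Because index_col=0 turns the id column into the index, rows are addressed by their id rather than by position.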
Copyright © 2010 - C++ Technology. All Rights Reserved.
