A Pandas data frame is a data object that stores tabular data. It acts like a spreadsheet in Microsoft Excel: each row represents a sample, and each column holds a different piece of information about that sample. Data frames are widely used for reading and storing labeled data because they carry two indexes, one along the rows and one along the columns, that store label information. Furthermore, different columns of a data frame can have different data types.
The easiest way to generate a data frame is usually the Pandas DataFrame() function. Pass it a dictionary: the keys become the column labels and the values become the column values of the data frame.
#Import Pandas and Numpy module
import pandas as pd
import numpy as np
#create a dictionary
dict1 = {'name' : ['wilson', 'shirley', 'mico', 'mia', 'miaomiao'],
         'age' : [32, 31, 8, 3, 13],
         'gender' : ['male', 'female', 'male', 'male', 'male']}
#create a data frame by inputting dictionary
df1 = pd.DataFrame(dict1)
df1
#output
       name  age  gender
0    wilson   32    male
1   shirley   31  female
2      mico    8    male
3       mia    3    male
4  miaomiao   13    male
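To check the earlier point that columns can hold different types, the dtypes attribute lists the type Pandas inferred for each column. The sketch below uses a shortened copy of the data above; the names people, df and mean_age are only illustrative.

```python
import pandas as pd

# shortened, illustrative copy of the dictionary used above
people = {'name': ['wilson', 'shirley', 'mico'],
          'age': [32, 31, 8],
          'gender': ['male', 'female', 'male']}
df = pd.DataFrame(people)

# dtypes reports the data type pandas inferred for each column:
# an integer type for 'age', a text/object type for 'name' and 'gender'
print(df.dtypes)

# numeric columns support arithmetic directly
mean_age = df['age'].mean()
print(mean_age)
```

The exact dtype names printed depend on the Pandas version and platform, so they are not shown here.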
You do not have to pass every key of the dictionary to DataFrame() when creating a data frame. With the 'columns' option you can select only the key-value pairs you want.
#select key-value pairs from a dictionary, to create a data frame
df2 = pd.DataFrame(dict1, columns=['name', 'age'])
df2
#output
       name  age
0    wilson   32
1   shirley   31
2      mico    8
3       mia    3
4  miaomiao   13
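Two further behaviors of the 'columns' option are worth knowing; the following is a minimal sketch. The order of the list sets the column order, and a label with no matching key produces a column of missing values (NaN). The label 'height' and the names df_reordered and df_missing are made up for this illustration.

```python
import pandas as pd

dict1 = {'name': ['wilson', 'shirley', 'mico', 'mia', 'miaomiao'],
         'age': [32, 31, 8, 3, 13],
         'gender': ['male', 'female', 'male', 'male', 'male']}

# the order of the list decides the order of the columns
df_reordered = pd.DataFrame(dict1, columns=['age', 'name'])
print(df_reordered.columns.tolist())

# 'height' is not a key of dict1 (hypothetical label),
# so that column is created and filled with NaN
df_missing = pd.DataFrame(dict1, columns=['name', 'height'])
print(df_missing['height'].isna().all())
```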
In the previous examples, Pandas automatically added integer index labels for the rows when creating the data frame. You can define the row labels yourself with the 'index' option of the DataFrame() function.
#manually define row label index, by setting option 'index'
df3 = pd.DataFrame(dict1, index=['p1', 'p2', 'p3', 'p4', 'p5'])
df3
#output
        name  age  gender
p1    wilson   32    male
p2   shirley   31  female
p3      mico    8    male
p4       mia    3    male
p5  miaomiao   13    male
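One reason to set row labels is that rows can then be looked up by label with .loc. A short sketch, rebuilding df3 so it runs on its own; the names row and age are illustrative:

```python
import pandas as pd

dict1 = {'name': ['wilson', 'shirley', 'mico', 'mia', 'miaomiao'],
         'age': [32, 31, 8, 3, 13],
         'gender': ['male', 'female', 'male', 'male', 'male']}
df3 = pd.DataFrame(dict1, index=['p1', 'p2', 'p3', 'p4', 'p5'])

# select a whole row by its label; the result is a Series
row = df3.loc['p3']
print(row['name'])

# select a single cell by row label and column label
age = df3.loc['p2', 'age']
print(age)
```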
In many cases, a data frame is generated from a NumPy array. You can then set both the 'index' and 'columns' options as needed.
#generating a data frame by inputting a Numpy array
#setting index and columns labels manually
df4 = pd.DataFrame(np.arange(12).reshape((4,3)),
                   index=['p1', 'p2', 'p3', 'p4'],
                   columns=['Person', 'Age', 'Sex'])
df4
#output
    Person  Age  Sex
p1       0    1    2
p2       3    4    5
p3       6    7    8
p4       9   10   11
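When building from an array, the number of labels has to match the array's shape: one index label per row and one column label per column. The sketch below illustrates this; the variable names (arr, error_raised) are illustrative, and the mismatch case assumes Pandas reports the problem as a ValueError.

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape((4, 3))  # 4 rows, 3 columns
df4 = pd.DataFrame(arr,
                   index=['p1', 'p2', 'p3', 'p4'],
                   columns=['Person', 'Age', 'Sex'])

# the data frame keeps the shape of the array
print(df4.shape)

# labels map onto array positions: row 'p2' is arr[1], column 'Age' is arr[:, 1]
print(df4.loc['p2', 'Age'])

# passing only three row labels for four rows is rejected
error_raised = False
try:
    pd.DataFrame(arr, index=['p1', 'p2', 'p3'])
except ValueError:
    error_raised = True
print(error_raised)
```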