When a data frame is created in R, usually only parts of it are used later. A subset of the dataset, by row, by column, or both, can be created easily in R. In the following example, only the columns ‘Math’, ‘Physics’, and ‘Chemistry’ are selected and assigned to a new object, using either column indices or column names.
#show the full dataset
head(grade)
#output
StudentID First Last Gender Country Age Math Physics
1 1 James Zhang Male US 23 73 70
2 2 Wilson Li Male UK 26 95 999
3 3 Richard Nuan Ye Male UK 35 77 83
4 4 Mary Deng Female US 21 60 99
5 5 Jason Wilson Male UK 19 77 89
6 6 Jennifer Hopkin Female UK 43 79 64
Chemistry Date
1 87 10/31/08
2 83 03/16/08
3 92 05/22/08
4 84 01/24/09
5 93 07/30/09
6 83 04/05/09
#select only three columns, using column indices
test<-grade[,c(7:9)]
> head(test)
#output
Math Physics Chemistry
1 73 70 87
2 95 999 83
3 77 83 92
4 60 99 84
5 77 89 93
6 79 64 83
#alternative way, using column names
test<-grade[,c("Math","Physics","Chemistry")]
> head(test)
#output
Math Physics Chemistry
1 73 70 87
2 95 999 83
3 77 83 92
4 60 99 84
5 77 89 93
6 79 64 83
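To select a single column, pass the column index inside the square brackets without a comma. This form returns a one-column data frame; use grade[[8]] or grade$Physics to get the column as a vector.
#Physics column subset, returned as a one-column data frame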
grade[8]
Physics
1 70
2 999
3 83
4 99
5 89
6 64
7 99
8 87
9 92
10 93
11 90
12 63
13 76
14 78
15 69
16 83
17 79
18 73
19 72
20 73
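The difference between the single-column forms can be checked on a small example. This is a minimal sketch: the data frame `df` and its values are made up for illustration and only stand in for `grade`.

```r
# Hypothetical three-row data frame standing in for 'grade'
df <- data.frame(Math = c(73, 95, 77),
                 Physics = c(70, 999, 83),
                 Chemistry = c(87, 83, 92))

one_col_df  <- df[2]                  # single bracket, no comma: one-column data frame
as_vector   <- df[[2]]                # double bracket: plain numeric vector
also_vector <- df[, 2]                # with a comma, a single column is dropped to a vector
kept_df     <- df[, 2, drop = FALSE]  # drop = FALSE keeps the data frame structure

class(one_col_df)  # "data.frame"
class(as_vector)   # "numeric"
```

Which form to use depends on what comes next: functions such as mean() expect a vector, while merging or binding columns usually expects a data frame.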
A subset of the data frame can also be returned by excluding unwanted columns with the ‘!’ or ‘-’ symbol.
#dropping Math and Chemistry, using '!' symbol
> dvars <- names(grade) %in% c("Math", "Chemistry")
> test <- grade[!dvars]
> head(test)
#output
StudentID First Last Gender Country Age Physics Date
1 1 James Zhang Male US 23 70 10/31/08
2 2 Wilson Li Male UK 26 999 03/16/08
3 3 Richard Nuan Ye Male UK 35 83 05/22/08
4 4 Mary Deng Female US 21 99 01/24/09
5 5 Jason Wilson Male UK 19 89 07/30/09
6 6 Jennifer Hopkin Female UK 43 64 04/05/09
#alternatively, drop the same columns (7 and 9) with the '-' symbol
> test <- grade[c(-7,-9)]
> head(test)
#output
StudentID First Last Gender Country Age Physics Date
1 1 James Zhang Male US 23 70 10/31/08
2 2 Wilson Li Male UK 26 999 03/16/08
3 3 Richard Nuan Ye Male UK 35 83 05/22/08
4 4 Mary Deng Female US 21 99 01/24/09
5 5 Jason Wilson Male UK 19 89 07/30/09
6 6 Jennifer Hopkin Female UK 43 64 04/05/09
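Negative indices depend on column positions, which change if the file layout changes. One possible variation, sketched below on a made-up data frame `df` (not the real `grade` data), is to drop columns by name with setdiff(), or to look the positions up from the names with match(). The sketch assumes every name in `drop_cols` actually exists in the data frame.

```r
# Hypothetical data frame standing in for 'grade'
df <- data.frame(StudentID = 1:3,
                 Math = c(73, 95, 77),
                 Physics = c(70, 999, 83),
                 Chemistry = c(87, 83, 92))

drop_cols <- c("Math", "Chemistry")

# Keep every column whose name is not in drop_cols
by_name <- df[setdiff(names(df), drop_cols)]

# Same columns, using negative positions looked up from the names
by_pos <- df[-match(drop_cols, names(df))]

names(by_name)  # "StudentID" "Physics"
names(by_pos)   # "StudentID" "Physics"
```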
To select a subset of rows (observations), use row indices directly or apply a logical condition to the rows. The following examples show these operations.
#Selecting observations 3-6 using indices directly
> test<-grade[3:6,]
> test
#output
StudentID First Last Gender Country Age Math Physics
3 3 Richard Nuan Ye Male UK 35 77 83
4 4 Mary Deng Female US 21 60 99
5 5 Jason Wilson Male UK 19 77 89
6 6 Jennifer Hopkin Female UK 43 79 64
Chemistry Date
3 92 05/22/08
4 84 01/24/09
5 93 07/30/09
6 83 04/05/09
#select male students younger than 25
> test <- grade[grade$Gender=="Male" & grade$Age < 25,]
> test
#output
StudentID First Last Gender Country Age Math Physics Chemistry
1 1 James Zhang Male US 23 73 70 87
5 5 Jason Wilson Male UK 19 77 89 93
15 15 Phil Yao Male UK 21 69 69 83
Date
1 10/31/08
5 07/30/09
15 10/15/08
#select grades tested from October 2008 through February 2009
> grade$Date <- as.Date(grade$Date, "%m/%d/%y")
> startdate <- as.Date("2008-10-01")
> enddate <- as.Date("2009-02-28")
> test <- grade[which(grade$Date >= startdate & grade$Date <= enddate),]
> test
#output
StudentID First Last Gender Country Age Math Physics
1 1 James Zhang Male US 23 73 70
4 4 Mary Deng Female US 21 60 99
7 7 Kari Gjendem Female US 37 87 99
8 8 Wenche Dale Female US 28 95 87
11 11 Michael Chen Male UK 42 83 90
13 13 Jennifer Jones Male US 27 79 76
14 14 Gary Grant Female UK 35 90 78
15 15 Phil Yao Male UK 21 69 69
Chemistry Date
1 87 2008-10-31
4 84 2009-01-24
7 67 2008-11-24
8 93 2008-10-02
11 77 2008-10-24
13 82 2008-10-29
14 92 2008-10-24
15 83 2008-10-15
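The date example wraps its condition in which(), while the earlier Gender and Age example does not. The two forms differ only when the condition evaluates to NA. The sketch below uses a made-up data frame `df` with one missing age to show the difference; it is an illustration, not output from the `grade` data.

```r
# Hypothetical data with one missing Age
df <- data.frame(Name = c("A", "B", "C", "D"),
                 Age = c(23, NA, 19, 40))

# Plain logical indexing: the NA comparison produces a row filled with NA
with_na <- df[df$Age < 25, ]

# which() returns only the positions where the test is TRUE
without_na <- df[which(df$Age < 25), ]

nrow(with_na)     # 3 (two matches plus one all-NA row)
nrow(without_na)  # 2
```

If the data may contain missing values, the which() form avoids the extra NA rows.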
R also provides the function ‘subset’, which performs a subset operation in a single statement. The following example selects the Math and Chemistry scores of male students younger than 25.
#set working directory
setwd("d:\\RStatistics-Tutorial")
> vartype<-c("character", "character", "character", "character", "character", "numeric","numeric", "numeric","numeric","character")
#read file from working directory and create a data frame
> grade <- read.table("University-NA.csv", colClasses=vartype, header=TRUE, sep=",")
#select Math and Chemistry for male students younger than 25
> test <- subset(grade, Gender=="Male" & Age <25, select=c(Math, Chemistry))
> test
#output
Math Chemistry
1 73 87
5 77 93
15 69 83
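For comparison, the same selection can be written with square brackets. The sketch below uses a made-up data frame `df` (an assumption for illustration, not the contents of University-NA.csv) and includes one missing age to show that subset() leaves out rows where the condition is NA.

```r
# Hypothetical data with one missing Age
df <- data.frame(Gender = c("Male", "Female", "Male", "Male"),
                 Age = c(23, 21, NA, 35),
                 Math = c(73, 60, 77, 79),
                 Chemistry = c(87, 84, 93, 83))

# subset(): column names are written bare; rows where the test is NA are dropped
via_subset <- subset(df, Gender == "Male" & Age < 25,
                     select = c(Math, Chemistry))

# Bracket form: which() drops the NA case explicitly, columns are given by name
via_bracket <- df[which(df$Gender == "Male" & df$Age < 25),
                  c("Math", "Chemistry")]

via_subset   # one row: Math 73, Chemistry 87
via_bracket  # the same single row
```

The R help page for subset() describes it as a convenience function for interactive use, so the bracket form may be the safer choice inside functions and scripts.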