Subsetting datasets in R

Once a data frame has been created in R, later steps often need only part of it. A subset can be taken by row, by column, or by both. In the following example, only the ‘Math’, ‘Physics’, and ‘Chemistry’ columns are selected and assigned to a new object, first by column index and then by column name.

#show the full dataset
> head(grade)
#output
  StudentID    First    Last Gender Country Age Math Physics
1         1    James   Zhang   Male      US  23   73      70
2         2   Wilson      Li   Male      UK  26   95     999
3         3  Richard Nuan Ye   Male      UK  35   77      83
4         4     Mary    Deng Female      US  21   60      99
5         5    Jason  Wilson   Male      UK  19   77      89
6         6 Jennifer  Hopkin Female      UK  43   79      64
  Chemistry     Date
1        87 10/31/08
2        83 03/16/08
3        92 05/22/08
4        84 01/24/09
5        93 07/30/09
6        83 04/05/09
#select only three columns, using column indices
> test <- grade[, 7:9]
> head(test)
#output
  Math Physics Chemistry
1   73      70        87
2   95     999        83
3   77      83        92
4   60      99        84
5   77      89        93
6   79      64        83
#alternative way, using variable names
> test <- grade[, c("Math", "Physics", "Chemistry")]
> head(test)
#output
  Math Physics Chemistry
1   73      70        87
2   95     999        83
3   77      83        92
4   60      99        84
5   77      89        93
6   79      64        83
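
The two approaches above can be compared side by side. The following is a minimal sketch on a hypothetical miniature data frame named `scores` (its name and values are invented for illustration and are not the `grade` file used in this post), assuming only base R.

```r
# Hypothetical miniature of the 'grade' data (values invented for illustration)
scores <- data.frame(
  StudentID = 1:4,
  Gender    = c("Male", "Male", "Female", "Male"),
  Age       = c(23, 26, 21, 19),
  Math      = c(73, 95, 60, 77),
  Physics   = c(70, 88, 99, 89),
  Chemistry = c(87, 83, 84, 93)
)

# Select by position: columns 4 to 6
by_index <- scores[, 4:6]

# Select by name: keeps working if the column order changes later
by_name <- scores[, c("Math", "Physics", "Chemistry")]

# Check that both approaches return the same data frame
identical(by_index, by_name)

# Rows and columns can also be subset in a single step
first_two <- scores[1:2, c("StudentID", "Math")]
first_two
```

Selecting by name is usually the safer choice in scripts, since positions silently shift when columns are added or reordered.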

To select a single column, pass its index or name in square brackets without a comma. This form returns a one-column data frame; use grade[[8]] or grade$Physics when a plain vector is needed.

#Physics column subset, returned as a one-column data frame
> grade[8]
   Physics
1       70
2      999
3       83
4       99
5       89
6       64
7       99
8       87
9       92
10      93
11      90
12      63
13      76
14      78
15      69
16      83
17      79
18      73
19      72
20      73
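
The difference between getting a data frame and getting a vector is easy to miss, so here is a short sketch of the common single-column forms. The `scores` data frame is hypothetical and its values are invented for illustration.

```r
# Hypothetical data (values invented for illustration)
scores <- data.frame(Math = c(73, 95, 60), Physics = c(70, 88, 99))

as_df   <- scores["Physics"]                  # single bracket, no comma: one-column data frame
as_vec1 <- scores[["Physics"]]                # double bracket: plain numeric vector
as_vec2 <- scores$Physics                     # $ shorthand: the same vector
as_vec3 <- scores[, "Physics"]                # with a comma: simplified to a vector by default
keep_df <- scores[, "Physics", drop = FALSE]  # drop = FALSE keeps the data frame

is.data.frame(as_df)   # data frame form
is.numeric(as_vec1)    # vector form
mean(as_vec1)          # vector functions work directly on the vector forms
```

The data frame forms are convenient when the result is passed on to functions that expect a data frame; the vector forms are convenient for arithmetic and summary functions.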

A subset of the data frame can also be returned by excluding unwanted columns with the ‘!’ or ‘-’ symbol.

#dropping Math and Chemistry, using '!' symbol
> dvars <- names(grade) %in% c("Math", "Chemistry")
> test <- grade[!dvars]
> head(test)
#output
  StudentID    First    Last Gender Country Age Physics     Date
1         1    James   Zhang   Male      US  23      70 10/31/08
2         2   Wilson      Li   Male      UK  26     999 03/16/08
3         3  Richard Nuan Ye   Male      UK  35      83 05/22/08
4         4     Mary    Deng Female      US  21      99 01/24/09
5         5    Jason  Wilson   Male      UK  19      89 07/30/09
6         6 Jennifer  Hopkin Female      UK  43      64 04/05/09
#alternatively, drop columns 7 and 9 (Math and Chemistry) with the '-' symbol
> test <- grade[c(-7,-9)]
> head(test)
#output
  StudentID    First    Last Gender Country Age Physics     Date
1         1    James   Zhang   Male      US  23      70 10/31/08
2         2   Wilson      Li   Male      UK  26     999 03/16/08
3         3  Richard Nuan Ye   Male      UK  35      83 05/22/08
4         4     Mary    Deng Female      US  21      99 01/24/09
5         5    Jason  Wilson   Male      UK  19      89 07/30/09
6         6 Jennifer  Hopkin Female      UK  43      64 04/05/09
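
Note that the ‘-’ symbol works with numeric positions only; applying it to a character vector of names raises an error. The sketch below shows three ways to drop columns by name, using a hypothetical `scores` data frame with invented values.

```r
# Hypothetical data (values invented for illustration)
scores <- data.frame(
  StudentID = 1:3,
  Math      = c(73, 95, 60),
  Physics   = c(70, 88, 99),
  Chemistry = c(87, 83, 84)
)

drop_cols <- c("Math", "Chemistry")

# 1. Logical negation: keep every column whose name is NOT in drop_cols
kept_a <- scores[!(names(scores) %in% drop_cols)]

# 2. setdiff(): compute the names to keep, then select by name
kept_b <- scores[setdiff(names(scores), drop_cols)]

# 3. Negative positions looked up from the names instead of hard-coded.
#    This assumes every name in drop_cols exists; a missing name gives NA
#    from match() and the subset then fails.
kept_c <- scores[-match(drop_cols, names(scores))]

names(kept_a)
names(kept_b)
names(kept_c)
```

Looking positions up from names avoids hard-coded numbers such as 7 and 9, which stop being correct as soon as the column layout changes.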

To select a subset by rows (observations), use row indices directly or a logical condition on the rows. The following examples show both.

#Selecting observations 3-6 using indices directly
> test<-grade[3:6,]  
> test
#output
  StudentID    First    Last Gender Country Age Math Physics
3         3  Richard Nuan Ye   Male      UK  35   77      83
4         4     Mary    Deng Female      US  21   60      99
5         5    Jason  Wilson   Male      UK  19   77      89
6         6 Jennifer  Hopkin Female      UK  43   79      64
  Chemistry     Date
3        92 05/22/08
4        84 01/24/09
5        93 07/30/09
6        83 04/05/09
#select male students younger than 25
> test <- grade[grade$Gender=="Male" & grade$Age < 25,]
> test
#output
   StudentID First   Last Gender Country Age Math Physics Chemistry
1          1 James  Zhang   Male      US  23   73      70        87
5          5 Jason Wilson   Male      UK  19   77      89        93
15        15  Phil    Yao   Male      UK  21   69      69        83
       Date
1  10/31/08
5  07/30/09
15 10/15/08
#select grades tested from October 2008 through February 2009
> grade$Date <- as.Date(grade$Date, "%m/%d/%y")
> startdate <- as.Date("2008-10-01")
> enddate <- as.Date("2009-02-28")
> test <- grade[which(grade$Date >= startdate & grade$Date <= enddate),]
> test
#output
   StudentID    First    Last Gender Country Age Math Physics
1          1    James   Zhang   Male      US  23   73      70
4          4     Mary    Deng Female      US  21   60      99
7          7     Kari Gjendem Female      US  37   87      99
8          8   Wenche    Dale Female      US  28   95      87
11        11  Michael    Chen   Male      UK  42   83      90
13        13 Jennifer   Jones   Male      US  27   79      76
14        14     Gary   Grant Female      UK  35   90      78
15        15     Phil     Yao   Male      UK  21   69      69
   Chemistry       Date
1         87 2008-10-31
4         84 2009-01-24
7         67 2008-11-24
8         93 2008-10-02
11        77 2008-10-24
13        82 2008-10-29
14        92 2008-10-24
15        83 2008-10-15
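
The date example above wraps its condition in which(), while the gender and age example does not. The two forms behave differently when the condition contains missing values, as this sketch illustrates. The `scores` data frame is hypothetical, with one Age deliberately set to NA.

```r
# Hypothetical data with one missing Age (values invented for illustration)
scores <- data.frame(
  StudentID = 1:4,
  Gender    = c("Male", "Male", "Female", "Male"),
  Age       = c(23, NA, 21, 19)
)

# The condition is NA for student 2 because the Age is unknown
cond <- scores$Gender == "Male" & scores$Age < 25

# Plain logical indexing returns an all-NA row for each NA in the condition
with_na <- scores[cond, ]

# which() returns only the positions where the condition is TRUE
no_na <- scores[which(cond), ]

nrow(with_na)
nrow(no_na)
```

When a column may contain missing values, wrapping the condition in which() or testing !is.na() explicitly avoids the spurious NA rows.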

R also provides the subset() function, which performs row and column selection in a single statement. In the following example, the Math and Chemistry scores of male students younger than 25 are selected.

#set working directory
> setwd("d:\\RStatistics-Tutorial")
> vartype <- c("character", "character", "character", "character", "character", "numeric", "numeric", "numeric", "numeric", "character")
#read file from working directory and create a data frame
> grade <- read.table("University-NA.csv", colClasses=vartype, header=TRUE, sep=",")
#Math and Chemistry for male students younger than 25 are selected
> test <- subset(grade, Gender=="Male" & Age < 25, select=c(Math, Chemistry))
> test
#output
   Math Chemistry
1    73        87
5    77        93
15   69        83
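
The select argument of subset() also accepts column ranges and exclusions, and rows whose condition evaluates to NA are dropped. The following sketch shows these behaviours on a hypothetical `scores` data frame with invented values, not on the University-NA.csv file read above.

```r
# Hypothetical data with one missing Age (values invented for illustration)
scores <- data.frame(
  StudentID = 1:4,
  Gender    = c("Male", "Male", "Female", "Male"),
  Age       = c(23, NA, 21, 19),
  Math      = c(73, 95, 60, 77),
  Physics   = c(70, 88, 99, 89),
  Chemistry = c(87, 83, 84, 93)
)

# Same pattern as above: rows by condition, columns by unquoted name.
# The row with a missing Age is dropped, because subset() treats NA as FALSE.
young_males <- subset(scores, Gender == "Male" & Age < 25,
                      select = c(Math, Chemistry))

# A contiguous range of columns can be written as first:last
range_sel <- subset(scores, Age < 25, select = Math:Chemistry)

# A leading minus excludes a column; the condition can be omitted
no_id <- subset(scores, select = -StudentID)

young_males
names(range_sel)
names(no_id)
```

subset() is convenient for interactive work; inside functions, bracket indexing is generally preferred because subset() evaluates column names in a non-standard way.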

wilsonzhang746
