Generating summary statistics is a routine step in data analysis. For example, one may calculate the mean cost and mean income for each year. In R, the aggregate() function provides an easy way to calculate summary statistics of variables by group in a data frame. The basic form of the function is aggregate(df, by, FUN), where

df is a data frame,

by is a list of variables used for grouping, and

FUN is the function applied to each group.
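As a minimal sketch of this basic form, the snippet below uses a small made-up data frame. The names `toy`, `group`, and `value` are illustrative only and are not part of the tutorial data. It computes the mean of `value` for each level of `group`.

```r
# Hypothetical toy data, not the tutorial's data set
toy <- data.frame(
  group = c("a", "a", "b", "b"),
  value = c(1, 3, 10, 20)
)

# df  = the columns to summarize (here only 'value')
# by  = a list of grouping variables; naming the element labels the output column
# FUN = the summary function applied to each group
res <- aggregate(toy["value"], by = list(group = toy$group), FUN = mean)
print(res)
```

The result has one row per group, with the grouping column first and the summarized column after it.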

In the following code example, we first create a data frame ‘test’ containing test scores in mathematics, physics and chemistry for different students. We then use the aggregate() function to compute the mean test scores by the variables ‘Gender’ and ‘Country’.

# to set working directory
setwd("d:\\RStatistics-Tutorial")   

#create a grade data frame
vartype<-c("character", "character", "character", "character", "character", "numeric","numeric", "numeric","numeric","character")

grade <- read.table("University-Fullname-full.csv", colClasses=vartype, header=TRUE, sep=",")                                      

#to create a data frame 'test'
test<-grade[,c(4,5,7:9)]
test$Gender<-as.factor(test$Gender)
test$Country<-as.factor(test$Country)

#to show first observations of data frame 'test'
head(test)
#output
  Gender Country Math Physics Chemistry
1   Male      US   73      70        87
2 Female      UK   95      76        83
3   Male      UK   77      83        92
4 Female      US   60      99        84
5   Male      UK   77      89        93
6 Female      UK   79      64        83

#aggregate data: calculate the mean of all variables
#by Gender and Country
agg <- aggregate(test, by = list(test$Gender, 
     test$Country), FUN = mean, na.rm = TRUE)

#to show result
agg
  Group.1 Group.2 Gender Country  Math  Physics Chemistry
1  Female      UK     NA      NA 88.75 74.25000  81.75000
2    Male      UK     NA      NA 76.50 82.75000  86.25000
3  Female      US     NA      NA 78.00 86.42857  82.85714
4    Male      US     NA      NA 74.20 75.00000  78.40000
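The NA entries in the ‘Gender’ and ‘Country’ columns above arise because aggregate() applies FUN to every column of the data frame, including the two factor columns, and mean() returns NA (with a warning) for a factor. A minimal sketch of that behavior, using a made-up factor `g` that is not part of the tutorial data:

```r
# Hypothetical factor, for illustration only
g <- factor(c("Male", "Female", "Male"))

# mean() is not defined for factors: it warns and returns NA.
# suppressWarnings() is used here only to keep the output clean.
m <- suppressWarnings(mean(g))
print(m)
```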

We can see that the result contains the mean scores for mathematics, physics, and chemistry for each combination of Gender and Country. However, the grouping columns are labeled ‘Group.1’ and ‘Group.2’ by default. aggregate() lets us customize these labels in the resulting data frame. In the following code example, we do this by naming the elements of the list() passed to by, and we drop the two redundant columns ‘Gender’ and ‘Country’, which contain only NA, by indexing the data frame with [-c(1,2)].

#a better solution: remove the redundant variables
#Gender and Country
#and customize the column names for the groups

#using aggregate() in R to generate mean scores
#for Math, Physics, Chemistry by Gender and Country
agg <- aggregate(test[-c(1,2)], 
      by = list(Gender=test$Gender, 
      Country=test$Country), 
      FUN = mean, na.rm = TRUE)

#to show result
agg

  Gender Country  Math  Physics Chemistry
1 Female      UK 88.75 74.25000  81.75000
2   Male      UK 76.50 82.75000  86.25000
3 Female      US 78.00 86.42857  82.85714
4   Male      US 74.20 75.00000  78.40000
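Because the example above depends on a CSV file, here is a self-contained sketch of the same idea. It uses only the six rows shown in the head(test) output and omits the Chemistry column, so its means differ from the full result. The name `toy` is illustrative. The second call uses the formula interface of aggregate() as a possible alternative. It labels the grouping columns from the variable names automatically, but by default it drops rows with missing values instead of relying on na.rm.

```r
# The six rows printed by head(test), retyped by hand (Chemistry omitted)
toy <- data.frame(
  Gender  = c("Male", "Female", "Male", "Female", "Male", "Female"),
  Country = c("US", "UK", "UK", "US", "UK", "UK"),
  Math    = c(73, 95, 77, 60, 77, 79),
  Physics = c(70, 76, 83, 99, 89, 64)
)

# Named list elements become the labels of the grouping columns
by_list <- aggregate(toy[c("Math", "Physics")],
                     by = list(Gender = toy$Gender, Country = toy$Country),
                     FUN = mean, na.rm = TRUE)

# Formula interface: summarized variables on the left, grouping variables on the right
by_formula <- aggregate(cbind(Math, Physics) ~ Gender + Country,
                        data = toy, FUN = mean)

print(by_list)
print(by_formula)
```

Both calls should give the same group means here, since the toy data has no missing values.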
