Summary statistics by group are a routine part of data analysis; for example, you may want the mean cost and income for each year. In R, the aggregate() function provides an easy way to calculate summary statistics of variables by specific groups in a data frame. The basic form of the function is aggregate(df, by, FUN), where
df is a data frame,
by is a list of variables used for grouping, and
FUN is the function applied to each group.
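As a minimal sketch of the basic form, the cost-and-income case mentioned above can be written as follows. The data frame and its values are hypothetical and serve only to illustrate the three arguments.

```r
# Hypothetical data: costs and income recorded in two different years
df <- data.frame(
  Year   = c(2020, 2020, 2021, 2021),
  Cost   = c(10, 20, 30, 50),
  Income = c(100, 200, 300, 500)
)

# df  : the columns to summarize (Cost and Income)
# by  : a list holding the grouping variable (Year)
# FUN : the function applied to each group (mean)
res <- aggregate(df[c("Cost", "Income")], by = list(df$Year), FUN = mean)

# The grouping column gets the default label 'Group.1'
print(res)
```

Only the numeric columns are passed as the first argument here, so the grouping variable itself is not averaged.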
In the following code example, we first create a data frame ‘test’ containing test scores in mathematics, physics, and chemistry for different students. We then use aggregate() to compute the mean scores grouped by the variables ‘Gender’ and ‘Country’.
# to set working directory
#create a grade data frame
vartype<-c("character", "character", "character", "character", "character", "numeric","numeric", "numeric","numeric","character")
grade <- read.table("University-Fullname-full.csv", colClasses=vartype, header=TRUE, sep=",")
#to create a data frame 'test'
#to show first observations of data frame 'test'
head(test)
  Gender Country Math Physics Chemistry
1   Male      US   73      70        87
2 Female      UK   95      76        83
3   Male      UK   77      83        92
4 Female      US   60      99        84
5   Male      UK   77      89        93
6 Female      UK   79      64        83
#aggregate data, calculate mean value of all variables
#by Gender and Country
agg <- aggregate(test, by = list(test$Gender,
test$Country), FUN = mean, na.rm = TRUE)
#to show result
agg
  Group.1 Group.2 Gender Country  Math  Physics Chemistry
1  Female      UK     NA      NA 88.75 74.25000  81.75000
2    Male      UK     NA      NA 76.50 82.75000  86.25000
3  Female      US     NA      NA 78.00 86.42857  82.85714
4    Male      US     NA      NA 74.20 75.00000  78.40000
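The Gender and Country columns in this output hold only NA. A minimal sketch of the likely reason, assuming these two columns are stored as character vectors in 'test' (the vector below is hypothetical): mean() cannot average text, so it returns NA with a warning.

```r
# Hypothetical character vector standing in for a column such as Gender
gender <- c("Male", "Female", "Male")

# mean() on character data returns NA and emits a warning;
# the warning is suppressed here only to keep the output clean
m <- suppressWarnings(mean(gender))

print(is.na(m))   # TRUE
```

Because aggregate() applies FUN to every column of the data frame it is given, any non-numeric column passed along with the scores ends up as NA in each group.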
The result contains the mean scores in mathematics, physics, and chemistry for each combination of Gender and Country. However, the grouping columns are labeled ‘Group.1’ and ‘Group.2’ by default. aggregate() lets you customize these labels in the resulting data frame. In the following code example, we do this by naming the elements of the list() passed to by, and we drop the two redundant columns ‘Gender’ and ‘Country’ by appending [-c(1,2)] to the data frame.
#a better solution: remove the redundant variables
#Gender and Country
#and customize column names for groups
#using aggregate() in R to generate mean scores
#for Math, Physics, Chemistry by Gender and Country
# www.rdatacode.com
agg <- aggregate(test[-c(1,2)],
by = list(Gender=test$Gender,
Country=test$Country),
FUN = mean, na.rm = TRUE)
#to show result
agg
  Gender Country  Math  Physics Chemistry
1 Female      UK 88.75 74.25000  81.75000
2   Male      UK 76.50 82.75000  86.25000
3 Female      US 78.00 86.42857  82.85714
4   Male      US 74.20 75.00000  78.40000
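The named-group call can also be tried without the CSV file. The sketch below rebuilds 'test' from only the six rows printed earlier, so its group means differ from the full-data output above; it is an illustration under that assumption, not a reproduction of the original result.

```r
# Rebuild 'test' from the six rows shown in the head() output
# (assumption: these six rows only, not the full data set)
test <- data.frame(
  Gender    = c("Male", "Female", "Male", "Female", "Male", "Female"),
  Country   = c("US", "UK", "UK", "US", "UK", "UK"),
  Math      = c(73, 95, 77, 60, 77, 79),
  Physics   = c(70, 76, 83, 99, 89, 64),
  Chemistry = c(87, 83, 92, 84, 93, 83)
)

# Drop the two character columns and label the groups explicitly
agg <- aggregate(test[-c(1, 2)],
                 by = list(Gender = test$Gender, Country = test$Country),
                 FUN = mean, na.rm = TRUE)

# Four rows, one per Gender/Country combination,
# e.g. Female/UK: Math 87, Physics 70, Chemistry 83
print(agg)
```

Naming the list elements gives the grouping columns readable labels, and passing only the numeric columns avoids the NA columns seen in the first attempt.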