In data analysis, we often need to assign new values to a variable based on one or more conditions. This operation is called recoding a variable. Two of the most common recoding tasks in R are setting certain values to missing (NA) and converting a continuous variable into a categorical one.
1. Recode values to missing values (NA)
Suppose we read a CSV data file from the working directory into R. In this file, several observations were coded as ‘999’ in the survey.
StudentID,First,Last,Gender,Country,Age,Math,Physics,Chemistry,Date
1,James,Zhang,Male,US,23,73,70,87,10/31/08
2,Wilson,Li,Male,UK,26,95,999,83,03/16/08
3,Richard,Nuan Ye,Male,UK,35,77,83,92,05/22/08
4,Mary,Deng,Female,US,21,60,99,84,01/24/09
5,Jason,Wilson,Male,UK,19,77,89,93,07/30/09
6,Jennifer,Hopkin,Female,UK,43,79,64,83,04/05/09
7,Kari,Gjendem,Female,US,37,87,99,67,11/24/08
8,Wenche,Dale,Female,US,28,95,87,93,10/02/08
9,Jane,Larsen,Female,US,19,73,92,84,06/05/09
10,Steinar,Hansen,Male,UK,25,66,93,999,08/01/08
11,Michael,Chen,Male,UK,42,83,90,77,10/24/08
12,Josef,Curton,Male,US,32,71,63,96,11/08/09
13,Jennifer,Jones,Male,US,27,79,76,82,10/29/08
14,Gary,Grant,Female,UK,35,90,78,92,10/24/08
15,Phil,Yao,Male,UK,21,69,69,83,10/15/08
16,Nora,Spears,Female,US,29,79,83,76,03/11/09
17,Goril,Nordmann,Female,UK,36,91,79,69,05/24/08
18,Lisa,Bondvik,Female,US,39,65,73,87,07/09/09
19,Guri,Olsen,Female,US,24,87,72,89,08/12/09
20,Martin,Jones,Male,US,25,82,73,62,999
The following code shows how we can recode those values to ‘NA’.
# to set working directory
setwd("d:\\RStatistics-Tutorial")
#to set column type before reading file
vartype <- c("character", "character", "character", "character", "character", "numeric", "numeric", "numeric", "numeric", "character")
#read file into a data frame
grade <- read.table("University-NA.csv", colClasses=vartype, header=TRUE, sep=",")
#to show the first several observations of the data frame
head(grade)
StudentID First Last Gender Country Age Math Physics Chemistry
1 1 James Zhang Male US 23 73 70 87
2 2 Wilson Li Male UK 26 95 999 83
3 3 Richard Nuan Ye Male UK 35 77 83 92
4 4 Mary Deng Female US 21 60 99 84
5 5 Jason Wilson Male UK 19 77 89 93
6 6 Jennifer Hopkin Female UK 43 79 64 83
#to show the structure and variables of the data frame
str(grade)
'data.frame': 20 obs. of 10 variables:
$ StudentID: chr "1" "2" "3" "4" ...
$ First : chr "James" "Wilson" "Richard" "Mary" ...
$ Last : chr "Zhang" "Li" "Nuan Ye" "Deng" ...
$ Gender : chr "Male" "Male" "Male" "Female" ...
$ Country : chr "US" "UK" "UK" "US" ...
$ Age : num 23 26 35 21 19 43 37 28 19 25 ...
$ Math : num 73 95 77 60 77 79 87 95 73 66 ...
$ Physics : num 70 999 83 99 89 64 99 87 92 93 ...
$ Chemistry: num 87 83 92 84 93 83 67 93 84 999 ...
$ Date : chr "10/31/08" "03/16/08" "05/22/08" "01/24/09" ...
#recode variable, setting field with '999' to 'NA'
grade$Math[grade$Math == 999] <- NA
grade$Physics[grade$Physics == 999] <- NA
grade$Chemistry[grade$Chemistry == 999] <- NA
grade$Date[grade$Date == "999"] <- NA
#to test if there are NA in the several variables
is.na(grade[,7:10])
Math Physics Chemistry Date
[1,] FALSE FALSE FALSE FALSE
[2,] FALSE TRUE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE
[7,] FALSE FALSE FALSE FALSE
[8,] FALSE FALSE FALSE FALSE
[9,] FALSE FALSE FALSE FALSE
[10,] FALSE FALSE TRUE FALSE
[11,] FALSE FALSE FALSE FALSE
[12,] FALSE FALSE FALSE FALSE
[13,] FALSE FALSE FALSE FALSE
[14,] FALSE FALSE FALSE FALSE
[15,] FALSE FALSE FALSE FALSE
[16,] FALSE FALSE FALSE FALSE
[17,] FALSE FALSE FALSE FALSE
[18,] FALSE FALSE FALSE FALSE
[19,] FALSE FALSE FALSE FALSE
[20,] FALSE FALSE FALSE TRUE
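As an alternative to recoding after the fact, the sentinel value can be declared when the file is read. The sketch below is a minimal illustration rather than part of the workflow above: it assumes an inline extract of three rows from the data file (the names csv_text and grade_small are made up for this example) and relies on the na.strings argument of read.csv.

```r
# A small inline extract of the data, so the example runs without a file.
csv_text <- "StudentID,Math,Physics,Chemistry,Date
2,95,999,83,03/16/08
10,66,93,999,08/01/08
20,82,73,62,999"

# na.strings marks every field equal to "999" as NA while reading.
grade_small <- read.csv(text = csv_text, na.strings = "999",
                        colClasses = c("character", "numeric", "numeric",
                                       "numeric", "character"))

# Compact check: number of NA values per column.
colSums(is.na(grade_small))

# Summary functions need na.rm = TRUE once NA values are present.
mean(grade_small$Physics, na.rm = TRUE)
```

Note that na.strings applies to every column, so a genuine value of 999 anywhere in the file (a StudentID, for instance) would also become NA. When that is a risk, recoding column by column as shown above is the safer choice.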
2. Recode a continuous variable to a categorical variable
The data frame contains a continuous variable, Age. The code below recodes the values of ‘Age’ into a new categorical variable, ‘agecat’.
#create a new variable 'agecat' recoded from variable 'Age':
#students younger than 31 years old are 'Young',
#all others are 'MiddleAged'
grade$agecat[grade$Age < 31] <- "Young"
grade$agecat[grade$Age >= 31] <- "MiddleAged"
#show again data frame structure and variable type
str(grade)
'data.frame': 20 obs. of 11 variables:
$ StudentID: chr "1" "2" "3" "4" ...
$ First : chr "James" "Wilson" "Richard" "Mary" ...
$ Last : chr "Zhang" "Li" "Nuan Ye" "Deng" ...
$ Gender : chr "Male" "Male" "Male" "Female" ...
$ Country : chr "US" "UK" "UK" "US" ...
$ Age : num 23 26 35 21 19 43 37 28 19 25 ...
$ Math : num 73 95 77 60 77 79 87 95 73 66 ...
$ Physics : num 70 NA 83 99 89 64 99 87 92 93 ...
$ Chemistry: num 87 83 92 84 93 83 67 93 84 NA ...
$ Date : chr "10/31/08" "03/16/08" "05/22/08" "01/24/09" ...
$ agecat : chr "Young" "Young" "MiddleAged" "Young" ...
#to show first several observations of the data frame
head(grade)
StudentID First Last Gender Country Age Math Physics Chemistry
1 1 James Zhang Male US 23 73 70 87
2 2 Wilson Li Male UK 26 95 NA 83
3 3 Richard Nuan Ye Male UK 35 77 83 92
4 4 Mary Deng Female US 21 60 99 84
5 5 Jason Wilson Male UK 19 77 89 93
6 6 Jennifer Hopkin Female UK 43 79 64 83
Date agecat
1 10/31/08 Young
2 03/16/08 Young
3 05/22/08 MiddleAged
4 01/24/09 Young
5 07/30/09 Young
6 04/05/09 MiddleAged
#to show the frequency of 'agecat'
table(grade$agecat)
MiddleAged Young
8 12
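The same recoding can also be sketched with cut(), which returns a factor rather than a character vector. This is an alternative to the indexed assignment used above, not the method of this tutorial. The vector age below simply copies the 20 Age values from the data file so that the example runs on its own.

```r
# The 20 Age values from the data file above.
age <- c(23, 26, 35, 21, 19, 43, 37, 28, 19, 25,
         42, 32, 27, 35, 21, 29, 36, 39, 24, 25)

# right = FALSE makes the intervals left-closed: [-Inf, 31) and [31, Inf),
# which matches the rules Age < 31 and Age >= 31.
agecat <- cut(age, breaks = c(-Inf, 31, Inf),
              labels = c("Young", "MiddleAged"), right = FALSE)

# Frequency of each category, listed in the order of the factor levels.
table(agecat)
```

Because right = FALSE, a student aged exactly 31 falls into 'MiddleAged', and a missing age stays NA instead of being assigned to either group.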