Recode variables in R


In data analysis, we often need to assign new values to a variable based on one or more conditions; this operation is called recoding. Two of the most common recoding tasks in R are setting certain values to missing (NA) and converting a continuous variable into a categorical one.

1. Recode values to missing values (NA)

Suppose we read a CSV data file from the working directory into R. In this file, several observations from the survey have been coded as ‘999’ to mark them as missing.

StudentID,First,Last,Gender,Country,Age,Math,Physics,Chemistry,Date
1,James,Zhang,Male,US,23,73,70,87,10/31/08
2,Wilson,Li,Male,UK,26,95,999,83,03/16/08
3,Richard,Nuan Ye,Male,UK,35,77,83,92,05/22/08
4,Mary,Deng,Female,US,21,60,99,84,01/24/09
5,Jason,Wilson,Male,UK,19,77,89,93,07/30/09
6,Jennifer,Hopkin,Female,UK,43,79,64,83,04/05/09
7,Kari,Gjendem,Female,US,37,87,99,67,11/24/08
8,Wenche,Dale,Female,US,28,95,87,93,10/02/08
9,Jane,Larsen,Female,US,19,73,92,84,06/05/09
10,Steinar,Hansen,Male,UK,25,66,93,999,08/01/08
11,Michael,Chen,Male,UK,42,83,90,77,10/24/08
12,Josef,Curton,Male,US,32,71,63,96,11/08/09
13,Jennifer,Jones,Male,US,27,79,76,82,10/29/08
14,Gary,Grant,Female,UK,35,90,78,92,10/24/08
15,Phil,Yao,Male,UK,21,69,69,83,10/15/08
16,Nora,Spears,Female,US,29,79,83,76,03/11/09
17,Goril,Nordmann,Female,UK,36,91,79,69,05/24/08
18,Lisa,Bondvik,Female,US,39,65,73,87,07/09/09
19,Guri,Olsen,Female,US,24,87,72,89,08/12/09
20,Martin,Jones,Male,US,25,82,73,62,999

The following code shows how we can recode those values to ‘NA’.

# to set working directory
setwd("d:\\RStatistics-Tutorial")    

#to set the column types before reading the file
vartype <- c("character", "character", "character", "character", "character", "numeric", "numeric", "numeric", "numeric", "character")

#read file into a data frame
grade <- read.table("University-NA.csv", colClasses=vartype, header=TRUE, sep=",")  
  
#to show the first several observations of the data frame                                  
head(grade)
   StudentID    First     Last Gender Country Age Math Physics Chemistry
1          1    James    Zhang   Male      US  23   73      70        87
2          2   Wilson       Li   Male      UK  26   95     999        83
3          3  Richard  Nuan Ye   Male      UK  35   77      83        92
4          4     Mary     Deng Female      US  21   60      99        84
5          5    Jason   Wilson   Male      UK  19   77      89        93
6          6 Jennifer   Hopkin Female      UK  43   79      64        83

#to show the structure and variables of the data frame
str(grade)
'data.frame':	20 obs. of  10 variables:
 $ StudentID: chr  "1" "2" "3" "4" ...
 $ First    : chr  "James" "Wilson" "Richard" "Mary" ...
 $ Last     : chr  "Zhang" "Li" "Nuan Ye" "Deng" ...
 $ Gender   : chr  "Male" "Male" "Male" "Female" ...
 $ Country  : chr  "US" "UK" "UK" "US" ...
 $ Age      : num  23 26 35 21 19 43 37 28 19 25 ...
 $ Math     : num  73 95 77 60 77 79 87 95 73 66 ...
 $ Physics  : num  70 999 83 99 89 64 99 87 92 93 ...
 $ Chemistry: num  87 83 92 84 93 83 67 93 84 999 ...
 $ Date     : chr  "10/31/08" "03/16/08" "05/22/08" "01/24/09" ...

#recode variable, setting field with '999' to 'NA'
grade$Math[grade$Math == 999] <- NA
grade$Physics[grade$Physics == 999] <- NA
grade$Chemistry[grade$Chemistry == 999] <- NA
grade$Date[grade$Date == "999"] <- NA
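When several columns share the same missing-value code, the assignments can be combined into one. Below is a minimal sketch, assuming a small stand-in data frame built from three rows of the sample file; the names `scores` and `score_cols` are illustrative and not part of the code above.

```r
# Stand-in data frame holding the scores of students 2, 4 and 10
# (the name 'scores' is illustrative only)
scores <- data.frame(
  Math      = c(95, 60, 66),
  Physics   = c(999, 99, 93),
  Chemistry = c(83, 84, 999)
)

# columns that use 999 as the missing-value code
score_cols <- c("Math", "Physics", "Chemistry")

# compare all selected columns with 999 at once and set the matches to NA
scores[score_cols][scores[score_cols] == 999] <- NA

print(scores)
```

Only exact matches are replaced, so a genuine score of 99 is left unchanged.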


#to check for NA values in columns 7 to 10
is.na(grade[,7:10])
       Math Physics Chemistry  Date
 [1,] FALSE   FALSE     FALSE FALSE
 [2,] FALSE    TRUE     FALSE FALSE
 [3,] FALSE   FALSE     FALSE FALSE
 [4,] FALSE   FALSE     FALSE FALSE
 [5,] FALSE   FALSE     FALSE FALSE
 [6,] FALSE   FALSE     FALSE FALSE
 [7,] FALSE   FALSE     FALSE FALSE
 [8,] FALSE   FALSE     FALSE FALSE
 [9,] FALSE   FALSE     FALSE FALSE
[10,] FALSE   FALSE      TRUE FALSE
[11,] FALSE   FALSE     FALSE FALSE
[12,] FALSE   FALSE     FALSE FALSE
[13,] FALSE   FALSE     FALSE FALSE
[14,] FALSE   FALSE     FALSE FALSE
[15,] FALSE   FALSE     FALSE FALSE
[16,] FALSE   FALSE     FALSE FALSE
[17,] FALSE   FALSE     FALSE FALSE
[18,] FALSE   FALSE     FALSE FALSE
[19,] FALSE   FALSE     FALSE FALSE
[20,] FALSE   FALSE     FALSE  TRUE
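The recoding can also be done while the file is being read, using the `na.strings` argument of `read.table`. The sketch below assumes that 999 is never a valid value in any column, because `na.strings` applies to every field. It supplies three rows of the sample file inline through `text=` so that it runs without the CSV file; the names `csv_text`, `sample_grade` and `na_count` are illustrative.

```r
# Three rows taken from the sample file, supplied inline via 'text='
# so that the sketch runs without the CSV file on disk
csv_text <- "StudentID,Math,Physics,Chemistry,Date
2,95,999,83,03/16/08
10,66,93,999,08/01/08
20,82,73,62,999"

# na.strings = "999" converts every field that is exactly 999 to NA on import
sample_grade <- read.table(text = csv_text, header = TRUE, sep = ",",
                           na.strings = "999", stringsAsFactors = FALSE)

# count the missing values in each column
na_count <- colSums(is.na(sample_grade))
print(na_count)
```

Compared with the full TRUE/FALSE matrix from `is.na`, `colSums(is.na(...))` gives a shorter summary: one count of missing values per column.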

2. Recode a continuous variable to a categorical variable

The data frame contains a continuous variable, ‘Age’. The code below recodes the values of ‘Age’ into a new categorical variable, ‘agecat’.

#create a new variable 'agecat', recoded from variable 'Age':
#students younger than 31 years old are 'Young',
#all others are 'MiddleAged'
grade$agecat[grade$Age < 31] <- "Young" 
grade$agecat[grade$Age >= 31] <- "MiddleAged"

#show again data frame structure and variable type
str(grade)
'data.frame':	20 obs. of  11 variables:
 $ StudentID: chr  "1" "2" "3" "4" ...
 $ First    : chr  "James" "Wilson" "Richard" "Mary" ...
 $ Last     : chr  "Zhang" "Li" "Nuan Ye" "Deng" ...
 $ Gender   : chr  "Male" "Male" "Male" "Female" ...
 $ Country  : chr  "US" "UK" "UK" "US" ...
 $ Age      : num  23 26 35 21 19 43 37 28 19 25 ...
 $ Math     : num  73 95 77 60 77 79 87 95 73 66 ...
 $ Physics  : num  70 NA 83 99 89 64 99 87 92 93 ...
 $ Chemistry: num  87 83 92 84 93 83 67 93 84 NA ...
 $ Date     : chr  "10/31/08" "03/16/08" "05/22/08" "01/24/09" ...
 $ agecat   : chr  "Young" "Young" "MiddleAged" "Young" ...

#to show first several observations of the data frame
head(grade)
  StudentID    First    Last Gender Country Age Math Physics Chemistry
1         1    James   Zhang   Male      US  23   73      70        87
2         2   Wilson      Li   Male      UK  26   95      NA        83
3         3  Richard Nuan Ye   Male      UK  35   77      83        92
4         4     Mary    Deng Female      US  21   60      99        84
5         5    Jason  Wilson   Male      UK  19   77      89        93
6         6 Jennifer  Hopkin Female      UK  43   79      64        83
      Date     agecat
1 10/31/08      Young
2 03/16/08      Young
3 05/22/08 MiddleAged
4 01/24/09      Young
5 07/30/09      Young
6 04/05/09 MiddleAged

#to show the frequency of 'agecat'
table(grade$agecat)

MiddleAged      Young 
         8         12
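Base R's `cut()` function offers another way to bin a continuous variable, and it returns a factor rather than a character vector. A minimal sketch using the ages of the first six students follows; the standalone vectors `age` and `agecat` are illustrative stand-ins for the data frame columns.

```r
# Ages of the first six students in the sample file
age <- c(23, 26, 35, 21, 19, 43)

# breaks define the intervals [-Inf, 31) and [31, Inf);
# right = FALSE closes each interval on the left, so an age of 31 is 'MiddleAged'
agecat <- cut(age, breaks = c(-Inf, 31, Inf),
              labels = c("Young", "MiddleAged"), right = FALSE)

# frequency of each category, listed in the order of the factor levels
table(agecat)
```

With this approach, adding a third age group only requires one more break point and one more label.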


Published by wilsonzhang746 on April 3, 2024
