Dealing with missing values in R


In R, the value ‘NA’ represents a missing value. Suppose we read a csv file from the working directory into a data frame, and several fields in that file contain the value ‘999’, a placeholder code for values that went missing for various reasons during the survey and data collection. The following code example shows how to create the data frame and then recode ‘999’ to ‘NA’ in R.

# to set working directory
setwd("d:\\RStatistics-Tutorial")      

#read the csv file from the working directory into a data frame,
#specifying the type of each column variable with colClasses
vartype<-c("character", "character", "character", "character", "character", "numeric","numeric", "numeric","numeric","character")
grade <- read.table("University-NA.csv", colClasses=vartype, header=TRUE, sep=",") 

#to show the first several observations of the data frame                                     
head(grade)

  StudentID    First     Last Gender Country Age Math Physics Chemistry
1          1    James    Zhang   Male      US  23   73      70        87
2          2   Wilson       Li   Male      UK  26   95     999        83
3          3  Richard  Nuan Ye   Male      UK  35   77      83        92
4          4     Mary     Deng Female      US  21   60      99        84
5          5    Jason   Wilson   Male      UK  19   77      89        93
6          6 Jennifer   Hopkin Female      UK  43   79      64        83

#to show structure of the data frame
str(grade)
'data.frame':	20 obs. of  10 variables:
 $ StudentID: chr  "1" "2" "3" "4" ...
 $ First    : chr  "James" "Wilson" "Richard" "Mary" ...
 $ Last     : chr  "Zhang" "Li" "Nuan Ye" "Deng" ...
 $ Gender   : chr  "Male" "Male" "Male" "Female" ...
 $ Country  : chr  "US" "UK" "UK" "US" ...
 $ Age      : num  23 26 35 21 19 43 37 28 19 25 ...
 $ Math     : num  73 95 77 60 77 79 87 95 73 66 ...
 $ Physics  : num  70 999 83 99 89 64 99 87 92 93 ...
 $ Chemistry: num  87 83 92 84 93 83 67 93 84 999 ...
 $ Date     : chr  "10/31/08" "03/16/08" "05/22/08" "01/24/09" ...

#recode value '999' as NA
grade$Math[grade$Math == 999] <- NA
grade$Physics[grade$Physics == 999] <- NA
grade$Chemistry[grade$Chemistry == 999] <- NA 
grade$Date[grade$Date == "999"] <- NA
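
As an alternative to recoding after the fact, the placeholder can be converted while the file is being read. The sketch below is only an illustration: it assumes a tiny inlined sample (the hypothetical `csv_text`) instead of University-NA.csv, and uses the `na.strings` argument of read.table().

```r
# Hypothetical miniature of the csv file, inlined so the sketch runs on its own.
csv_text <- "StudentID,First,Math,Physics,Chemistry
1,James,73,70,87
2,Wilson,95,999,83
3,Richard,77,83,999"

# na.strings lists the field values that read.table() should turn into NA.
# Caution: it applies to every column, so a genuine 999 anywhere in the file
# would also become NA.
scores <- read.table(text = csv_text, header = TRUE, sep = ",",
                     na.strings = c("NA", "999"))
print(scores)
```

This is convenient when ‘999’ can never be a valid value in any column; when only some columns use the code, recoding column by column as above is the safer choice.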

The function is.na() identifies missing values in an object. It returns an object of the same shape as its argument, filled with logical values: TRUE where the field is a missing value and FALSE otherwise.

#identify which values in columns 7-10 of the data frame are missing
is.na(grade[,7:10]) 
      Math Physics Chemistry  Date
 [1,] FALSE   FALSE     FALSE FALSE
 [2,] FALSE    TRUE     FALSE FALSE
 [3,] FALSE   FALSE     FALSE FALSE
 [4,] FALSE   FALSE     FALSE FALSE
 [5,] FALSE   FALSE     FALSE FALSE
 [6,] FALSE   FALSE     FALSE FALSE
 [7,] FALSE   FALSE     FALSE FALSE
 [8,] FALSE   FALSE     FALSE FALSE
 [9,] FALSE   FALSE     FALSE FALSE
[10,] FALSE   FALSE      TRUE FALSE
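
A full TRUE/FALSE matrix is hard to scan for larger data frames, so it is often summarised. The following is a minimal sketch on a made-up three-row data frame (the name `scores` and its values are assumptions for illustration, not the tutorial's data).

```r
# Small made-up data frame with two missing values.
scores <- data.frame(
  Math      = c(73, 95, 77),
  Physics   = c(70, NA, 83),
  Chemistry = c(87, 83, NA)
)

# Count the missing values in each column: TRUE counts as 1, FALSE as 0.
na_per_column <- colSums(is.na(scores))
print(na_per_column)

# Row numbers of the observations that contain at least one NA.
rows_with_na <- which(!complete.cases(scores))
print(rows_with_na)

# Single TRUE/FALSE answer: does the data frame contain any NA at all?
has_na <- any(is.na(scores))
print(has_na)
```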

Many functions in R have an option for excluding missing values before the operation is carried out. For example, in the following code the missing values are excluded first, and then the remaining values are summed.

#Excluding missing values from analyses
sum(grade$Physics, na.rm=TRUE)
[1] 1532
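
To see why the option matters, the sketch below compares the same calls with and without na.rm on a short made-up vector (the values in `physics` are assumptions for illustration only).

```r
# Made-up scores with one missing value.
physics <- c(70, NA, 83, 99)

# Without na.rm the NA propagates, so the result is itself NA.
total_with_na <- sum(physics)
print(total_with_na)

# With na.rm = TRUE the NA is dropped before summing: 70 + 83 + 99.
total <- sum(physics, na.rm = TRUE)
print(total)

# mean() accepts the same option and averages the three remaining values.
average <- mean(physics, na.rm = TRUE)
print(average)
```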

Alternatively, you can remove every observation that has a missing value in any variable. The following code shows that the sample size of the data frame shrinks once the observations with missing values are removed.

#remove any observation with missing data and assign to 
#a new data frame
test <- na.omit(grade)    
test
   StudentID    First     Last Gender Country Age Math Physics Chemistry
1          1    James    Zhang   Male      US  23   73      70        87
3          3  Richard  Nuan Ye   Male      UK  35   77      83        92
4          4     Mary     Deng Female      US  21   60      99        84
5          5    Jason   Wilson   Male      UK  19   77      89        93
6          6 Jennifer   Hopkin Female      UK  43   79      64        83
7          7     Kari  Gjendem Female      US  37   87      99        67
8          8   Wenche     Dale Female      US  28   95      87        93
9          9     Jane   Larsen Female      US  19   73      92        84
11        11  Michael     Chen   Male      UK  42   83      90        77
12        12    Josef   Curton   Male      US  32   71      63        96

#to show structure of the data frame
str(test)
'data.frame':	17 obs. of  10 variables:
 $ StudentID: chr  "1" "3" "4" "5" ...
 $ First    : chr  "James" "Richard" "Mary" "Jason" ...
 $ Last     : chr  "Zhang" "Nuan Ye" "Deng" "Wilson" ...
 $ Gender   : chr  "Male" "Male" "Female" "Male" ...
 $ Country  : chr  "US" "UK" "US" "UK" ...
 $ Age      : num  23 35 21 19 43 37 28 19 42 32 ...
 $ Math     : num  73 77 60 77 79 87 95 73 83 71 ...
 $ Physics  : num  70 83 99 89 64 99 87 92 90 63 ...
 $ Chemistry: num  87 92 84 93 83 67 93 84 77 96 ...
 $ Date     : chr  "10/31/08" "05/22/08" "01/24/09" "07/30/09" ...
 - attr(*, "na.action")= 'omit' Named int [1:3] 2 10 20
  ..- attr(*, "names")= chr [1:3] "2" "10" "20"
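
The "na.action" attribute shown at the end of the str() output records which rows were dropped. The sketch below, on a made-up four-row data frame (the name `scores` and its values are assumptions), reads that attribute back and also shows a gentler alternative that drops rows only when one chosen column is missing.

```r
# Made-up data frame: row 2 lacks Physics, row 3 lacks Math.
scores <- data.frame(
  StudentID = c("1", "2", "3", "4"),
  Math      = c(73, 95, NA, 60),
  Physics   = c(70, NA, 83, 99),
  stringsAsFactors = FALSE
)

# na.omit() drops every row that has an NA in any column.
complete <- na.omit(scores)

# The row numbers that were removed are stored in the "na.action" attribute.
dropped_rows <- as.integer(attr(complete, "na.action"))
print(dropped_rows)

# Alternative: keep every row that has a Physics score, even if Math is NA.
has_physics <- scores[!is.na(scores$Physics), ]
print(has_physics)
```

Dropping rows by a single column keeps more observations, which can matter when only that variable enters the analysis.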


Published by wilsonzhang746 on April 6, 2024
