In R, the value NA represents a missing value. Suppose we read a CSV file from the working directory into a data frame, and several fields in that file hold the value ‘999’, a code used during the survey and data collection to mark values that are missing for various reasons. The following code example shows how to create the data frame and then recode ‘999’ to NA.
# to set working directory
setwd("d:\\RStatistics-Tutorial")
# read the file from the working directory into a data frame,
# specifying the type of each column variable
vartype<-c("character", "character", "character", "character", "character", "numeric","numeric", "numeric","numeric","character")
grade <- read.table("University-NA.csv", colClasses=vartype, header=TRUE, sep=",")
#to show the first several observations of the data frame
head(grade)
StudentID First Last Gender Country Age Math Physics Chemistry
1 1 James Zhang Male US 23 73 70 87
2 2 Wilson Li Male UK 26 95 999 83
3 3 Richard Nuan Ye Male UK 35 77 83 92
4 4 Mary Deng Female US 21 60 99 84
5 5 Jason Wilson Male UK 19 77 89 93
6 6 Jennifer Hopkin Female UK 43 79 64 83
#to show structure of the data frame
str(grade)
'data.frame': 20 obs. of 10 variables:
$ StudentID: chr "1" "2" "3" "4" ...
$ First : chr "James" "Wilson" "Richard" "Mary" ...
$ Last : chr "Zhang" "Li" "Nuan Ye" "Deng" ...
$ Gender : chr "Male" "Male" "Male" "Female" ...
$ Country : chr "US" "UK" "UK" "US" ...
$ Age : num 23 26 35 21 19 43 37 28 19 25 ...
$ Math : num 73 95 77 60 77 79 87 95 73 66 ...
$ Physics : num 70 999 83 99 89 64 99 87 92 93 ...
$ Chemistry: num 87 83 92 84 93 83 67 93 84 999 ...
$ Date : chr "10/31/08" "03/16/08" "05/22/08" "01/24/09" ...
#recode value '999' as NA
grade$Math[grade$Math == 999] <- NA
grade$Physics[grade$Physics == 999] <- NA
grade$Chemistry[grade$Chemistry == 999] <- NA
grade$Date[grade$Date == "999"] <- NA
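When the missing-value code is the same in every column, the recoding can also be done at import time. The sketch below is a minimal illustration, assuming a small inline CSV (the `csv_text` string and the `scores` data frame are hypothetical stand-ins for the real file): the `na.strings` argument of `read.csv()` turns every field equal to "999" into NA while reading.

```r
# Hypothetical inline data standing in for University-NA.csv
csv_text <- "StudentID,Math,Physics,Date
1,73,70,10/31/08
2,95,999,03/16/08
3,77,83,999"

# na.strings converts every field that is exactly "999" to NA while reading
scores <- read.csv(text = csv_text, na.strings = "999",
                   colClasses = c("character", "numeric", "numeric", "character"))

# Physics in row 2 and Date in row 3 are now NA
scores
```

Note that `na.strings` applies to all columns, so this approach only fits when ‘999’ can never be a legitimate value anywhere in the file (for example, as a StudentID). Otherwise the column-by-column recoding is the safer choice.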
The function is.na() tests whether an object contains missing values. It returns a logical object with the same dimensions as its argument, where TRUE marks a missing value and FALSE marks a value that is present.
# check whether columns 7 to 10 of the data frame contain missing values
is.na(grade[,7:10])
Math Physics Chemistry Date
[1,] FALSE FALSE FALSE FALSE
[2,] FALSE TRUE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE
[7,] FALSE FALSE FALSE FALSE
[8,] FALSE FALSE FALSE FALSE
[9,] FALSE FALSE FALSE FALSE
[10,] FALSE FALSE TRUE FALSE
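A full TRUE/FALSE matrix becomes hard to read once the data frame grows, so it is often more convenient to summarize it. The following is a small sketch on a hypothetical data frame (`scores` is made up for illustration, not the `grade` data above) that counts missing values per column and locates the affected rows.

```r
# Hypothetical small data frame with a few missing values
scores <- data.frame(
  Math      = c(73, 95, NA),
  Physics   = c(70, NA, 83),
  Chemistry = c(87, 83, 92)
)

# TRUE counts as 1, so column sums give the number of NAs per column
na_counts <- colSums(is.na(scores))
na_counts

# anyNA() answers the yes/no question for the whole object
has_missing <- anyNA(scores)

# which() returns the row positions of the missing values in one column
missing_rows <- which(is.na(scores$Physics))
```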
Many functions in R have an option for excluding missing values before the operation is carried out. For example, the following code shows that with na.rm=TRUE the missing values are dropped first and the remaining values are then summed.
#Excluding missing values from analyses
sum(grade$Physics, na.rm=TRUE)
[1] 1532
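To see why the na.rm option matters, it helps to compare the result with and without it. The sketch below uses a short hypothetical vector (`physics` is invented for illustration): without na.rm, a single NA makes the whole sum NA.

```r
physics <- c(70, NA, 83, 99)             # hypothetical scores

no_rm   <- sum(physics)                  # NA, because one element is missing
total   <- sum(physics, na.rm = TRUE)    # 252
avg     <- mean(physics, na.rm = TRUE)   # 84, averaged over the 3 valid scores
n_valid <- sum(!is.na(physics))          # 3
```

The same na.rm argument is accepted by other summary functions such as mean(), max(), min() and sd().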
An alternative way of dealing with missing values is to remove every observation that has a missing value in any variable. The following code shows that the sample size of the data frame is reduced after the observations with missing values are removed.
#remove any observation with missing data and assign to
#a new data frame
test <- na.omit(grade)
test
StudentID First Last Gender Country Age Math Physics Chemistry
1 1 James Zhang Male US 23 73 70 87
3 3 Richard Nuan Ye Male UK 35 77 83 92
4 4 Mary Deng Female US 21 60 99 84
5 5 Jason Wilson Male UK 19 77 89 93
6 6 Jennifer Hopkin Female UK 43 79 64 83
7 7 Kari Gjendem Female US 37 87 99 67
8 8 Wenche Dale Female US 28 95 87 93
9 9 Jane Larsen Female US 19 73 92 84
11 11 Michael Chen Male UK 42 83 90 77
12 12 Josef Curton Male US 32 71 63 96
#to show structure of the data frame
str(test)
'data.frame': 17 obs. of 10 variables:
$ StudentID: chr "1" "3" "4" "5" ...
$ First : chr "James" "Richard" "Mary" "Jason" ...
$ Last : chr "Zhang" "Nuan Ye" "Deng" "Wilson" ...
$ Gender : chr "Male" "Male" "Female" "Male" ...
$ Country : chr "US" "UK" "US" "UK" ...
$ Age : num 23 35 21 19 43 37 28 19 42 32 ...
$ Math : num 73 77 60 77 79 87 95 73 83 71 ...
$ Physics : num 70 83 99 89 64 99 87 92 90 63 ...
$ Chemistry: num 87 92 84 93 83 67 93 84 77 96 ...
$ Date : chr "10/31/08" "05/22/08" "01/24/09" "07/30/09" ...
- attr(*, "na.action")= 'omit' Named int [1:3] 2 10 20
..- attr(*, "names")= chr [1:3] "2" "10" "20"
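Dropping every incomplete row can discard more data than necessary when only some columns matter for the analysis. As a hedged sketch on a hypothetical data frame (`scores` is invented for illustration), complete.cases() returns a logical vector that can be restricted to selected columns before subsetting.

```r
# Hypothetical data frame with missing values in different columns
scores <- data.frame(
  StudentID = c("1", "2", "3", "4"),
  Math      = c(73, 95, NA, 60),
  Physics   = c(70, NA, 83, 99),
  Date      = c("10/31/08", "03/16/08", "05/22/08", NA),
  stringsAsFactors = FALSE
)

# TRUE where a row has no NA in any column; keeps the same rows as na.omit()
keep_all <- complete.cases(scores)
complete_rows <- scores[keep_all, ]

# Require only the score columns to be present, so a missing Date is tolerated
keep_scores <- complete.cases(scores[, c("Math", "Physics")])
score_rows <- scores[keep_scores, ]
```

Here complete_rows keeps only student 1, while score_rows keeps students 1 and 4, because student 4 is missing only the Date.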