Missing values

Introduction

When we don't know the value a variable takes, we say its value is missing, indicated by NA.

Remarks

Missing values are represented by the symbol NA (not available). Impossible values (e.g., as a result of sqrt(-1)) are represented by the symbol NaN (not a number).

Examining missing data

anyNA reports whether any missing values are present; while is.na reports missing values elementwise:

vec <- c(1, 2, 3, NA, 5)

anyNA(vec)
# [1] TRUE
is.na(vec)
# [1] FALSE FALSE FALSE  TRUE FALSE

ìs.na returns a logical vector that is coerced to integer values under arithmetic operations (with FALSE=0, TRUE=1). We can use this to find out how many missing values there are:

sum(is.na(vec))
# [1] 1

Extending this approach, we can use colSums and is.na on a data frame to count NAs per column:

colSums(is.na(airquality))
#   Ozone Solar.R    Wind    Temp   Month     Day 
#      37       7       0       0       0       0 

The naniar package (currently on github but not CRAN) offers further tools for exploring missing values.

Omitting or replacing missing values

Recoding missing values

Regularly, missing data isn't coded as NA in datasets. In SPSS for example, missing values are often represented by the value 99.

num.vec <- c(1, 2, 3, 99, 5)
num.vec
## [1]  1  2  3 99  5

It is possible to directly assign NA using subsetting

num.vec[num.vec == 99] <- NA

However, the preferred method is to use is.na<- as below. The help file (?is.na) states:

is.na<- may provide a safer way to set missingness. It behaves differently for factors, for example.

is.na(num.vec) <- num.vec == 99

Both methods return

num.vec
## [1]  1  2  3 NA  5

Removing missing values

Missing values can be removed in several ways from a vector:

num.vec[!is.na(num.vec)]
num.vec[complete.cases(num.vec)]
na.omit(num.vec)
## [1] 1 2 3 5

Excluding missing values from calculations

When using arithmetic functions on vectors with missing values, a missing value will be returned:

mean(num.vec) # returns: [1] NA

The na.rm parameter tells the function to exclude the NA values from the calculation:

mean(num.vec, na.rm = TRUE) # returns: [1] 2.75

# an alternative to using 'na.rm = TRUE':
mean(num.vec[!is.na(num.vec)]) # returns: [1] 2.75

Some R functions, like lm, have a na.action parameter. The default-value for this is na.omit, but with options(na.action = 'na.exclude') the default behavior of R can be changed.

If it is not necessary to change the default behavior, but for a specific situation another na.action is needed, the na.action parameter needs to be included in the function call, e.g.:

 lm(y2 ~ y1, data = anscombe, na.action = 'na.exclude')

Reading and writing data with NA values

When reading tabular datasets with the read.* functions, R automatically looks for missing values that look like "NA". However, missing values are not always represented by NA. Sometimes a dot (.), a hyphen(-) or a character-value (e.g.: empty) indicates that a value is NA. The na.strings parameter of the read.* function can be used to tell R which symbols/characters need to be treated as NA values:

read.csv("name_of_csv_file.csv", na.strings = "-")

It is also possible to indicate that more than one symbol needs to be read as NA:

read.csv('missing.csv', na.strings = c('.','-'))

Similarly, NAs can be written with customized strings using the na argument to write.csv. Other tools for reading and writing tables have similar options.

TRUE/FALSE and/or NA

NA is a logical type and a logical operator with an NA will return NA if the outcome is ambiguous. Below, NA OR TRUE evaluates to TRUE because at least one side evaluates to TRUE, however NA OR FALSE returns NA because we do not know whether NA would have been TRUE or FALSE

NA | TRUE
# [1] TRUE  
# TRUE | TRUE is TRUE and FALSE | TRUE is also TRUE.

NA | FALSE
# [1] NA  
# TRUE | FALSE is TRUE but FALSE | FALSE is FALSE.

NA & TRUE
# [1] NA  
# TRUE & TRUE is TRUE but FALSE & TRUE is FALSE.

NA & FALSE
# [1] FALSE
# TRUE & FALSE is FALSE and FALSE & FALSE is also FALSE.

These properties are helpful if you want to subset a data set based on some columns that contain NA.

df <- data.frame(v1=0:9, 
                 v2=c(rep(1:2, each=4), NA, NA), 
                 v3=c(NA, letters[2:10]))

df[df$v2 == 1 & !is.na(df$v2), ]
#  v1 v2   v3
#1  0  1 <NA>
#2  1  1    b
#3  2  1    c
#4  3  1    d

df[df$v2 == 1, ]
     v1 v2   v3
#1     0  1 <NA>
#2     1  1    b
#3     2  1    c
#4     3  1    d
#NA   NA NA <NA>
#NA.1 NA NA <NA>

Using NAs of different classes

The symbol NA is for a logical missing value:

class(NA)
#[1] "logical"

This is convenient, since it can easily be coerced to other atomic vector types, and is therefore usually the only NA you will need:

x <- c(1, NA, 1)
class(x[2])
#[1] "numeric"

If you do need a single NA value of another type, use NA_character_, NA_integer_, NA_real_ or NA_complex_. For missing values of fancy classes, subsetting with NA_integer_ usually works; for example, to get a missing-value Date:

class(Sys.Date()[NA_integer_])
# [1] "Date"


2016-07-24
2017-02-07
R Language Pedia
Icon