Here's a little puzzle that might shed some light on the apparently confusing behaviour of missing values (NAs) in R:
What is NA^0 in R?
You can get the answer easily by typing at the R command line:
> NA^0
[1] 1
But the interesting question that arises is: why is it 1? Most people would expect the answer to be NA, as it is for most expressions that include NA. But here's the trick to understanding this result: think of NA not as a number, but as a placeholder for a number that exists, but whose value we don't know.
Now think of all of the numbers that could replace NA in the expression NA^0. Any positive number to the power zero is 1. Same goes for any negative number. Even zero to the power zero is defined by mathematicians to be 1 (for reasons I'm not going to go into here). So that means whatever number you substitute for NA in the expression NA^0, the answer will be 1. And so that's the answer R gives.
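You can spot-check this with a few arbitrary candidate values (my own picks, nothing special about them):
> c(-2, 0, pi, 100)^0
[1] 1 1 1 1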
There are a few other instances where using the indeterminate NA in an expression can lead to a specific non-NA result. Consider this example:
> NA || TRUE
[1] TRUE
Here, the NA is holding the place of a logical value[1], so it can only be representing TRUE or FALSE. But whatever it represents, the answer will be the same:
> TRUE || TRUE
[1] TRUE
> FALSE || TRUE
[1] TRUE
By the same token, any(x) can return TRUE even if the logical vector includes NAs, as long as x includes at least one TRUE value. Similarly, NA && FALSE is always FALSE.
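A quick illustration of both points at the prompt:
> any(c(FALSE, NA, TRUE))
[1] TRUE
> any(c(FALSE, NA))
[1] NA
> NA && FALSE
[1] FALSE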
There are a few other examples as well (if you know some, share them in the comments). But always remember: if you're ever confused by the behaviour of NA in R, think about what values it might represent, and whether changing them would change the outcome. That might explain what's going on. For more on how R handles NAs, see the R Language Definition.
[1] Footnote: I'm deliberately ignoring the storage mode of NA, which can come in logical, integer, double and character flavours. In all the examples above, it gets coerced to the type of the other elements in the expression.
David, I applaud your attempt, but I think R's handling of NA values defies explanation.
You wrote: "Now think of all of the numbers that could replace NA in the expression NA^0. Any positive number to the power zero is 1."
Allow me to change this slightly: "Now think of all of the numbers that could replace NA in the expression NA*0. Any positive number times zero is 0."
Thus, we expect NA*0 to be 0. Let's check:
R> NA * 0
[1] NA
Ahg, no.
I've seen people try to explain R's handling of NA values as being somehow consistent from a computer-science language-design point of view, but as a user who writes R scripts with lots of missing data, I claim there are some inexplicable inconsistencies with NA values in R.
Kevin Wright
Posted by: Kevin Wright | July 18, 2016 at 14:35
Just as a further example, I can sorta, kinda, maybe tolerate R doing this:
R> sum(NA, na.rm=TRUE)
[1] 0
But this borders on insanity for real-life analytic scripts:
R> prod(NA, na.rm=TRUE)
[1] 1
Posted by: Kevin Wright | July 18, 2016 at 14:46
Annoying counterpoint: if we consider NA to be a placeholder for any number, then the following should be TRUE instead of NA:
R> Inf >= NA
(instead we get NA). However, this also provides a counterpoint to the previous comment that NA * 0 should be 0; in fact, Inf * 0 is NaN.
This also leads to a result that was slightly surprising to me: Inf^0 is 1 (I was expecting NaN!)
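For reference, here is what those expressions return at the prompt:
R> Inf >= NA
[1] NA
R> Inf * 0
[1] NaN
R> Inf^0
[1] 1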
Posted by: Cliff AB | July 18, 2016 at 16:11
Hello Kevin,
I might be able to explain your results:
1) Notice that Infinity*0 is completely undefined, but it's still reasonable to define Infinity^0 as 1 - you can try this in R with Inf*0 and Inf^0
2) It's reasonable, and standard, to define the empty product as the multiplicative unit - see this: https://en.wikipedia.org/wiki/Empty_product
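You can see both conventions directly: once the NA is removed, prod() and sum() are operating on an empty vector, and the empty product is 1 while the empty sum is 0:
R> prod(numeric(0))
[1] 1
R> sum(numeric(0))
[1] 0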
Posted by: R-Stats | July 18, 2016 at 16:12
Thanks @R-Stats for the link to the Empty_product. This is exactly what I meant about the R language being designed to some ideal standard. But consider the following example. Is there any possible way that you would ever want Q1 sales to print as 0? Wouldn't you want it to be NA? Printing 0 is extremely misleading in my opinion.
R> dat <- data.frame(yr=c("Y1","Y1","Y1","Y1","Y2","Y2","Y2","Y2"),
+ qtr=c("Q1","Q2","Q3","Q4","Q1","Q2","Q3","Q4"),
+ sales=c(NA,5,5,6,NA,6,7,8))
R> tapply(dat$sales, dat$yr, FUN=sum, na.rm=TRUE)
Y1 Y2
16 21
R> tapply(dat$sales, dat$qtr, FUN=sum, na.rm=TRUE)
Q1 Q2 Q3 Q4
0 11 12 14
Posted by: Kevin Wright | July 19, 2016 at 07:11
@Kevin Wright
That makes sense. But let me present a different POV. If you're using na.rm = TRUE, shouldn't you be responsible for making sense of the absence of NA? If you do want to keep the NA, you can use
tapply(dat$sales, dat$qtr, FUN=sum, na.rm=FALSE)
which correctly results in
Q1 Q2 Q3 Q4
NA 11 12 14
Posted by: R-Stats | July 19, 2016 at 08:12
I agree that, at the very least, the result of prod(NA,na.rm=TRUE) should be documented in the help page.
I did find this nugget at ?prod :
"For historical reasons, NULL is accepted and treated as if it were numeric(0)."
So now we can all start arguing about what NULL really is :-)
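And indeed, a quick check shows NULL behaving just like an empty vector here:
> sum(NULL)
[1] 0
> prod(NULL)
[1] 1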
Posted by: Carl Witthoft | July 19, 2016 at 13:03
While I'm at it, just for fun:
> NA/NaN
[1] NA
> NaN/NA
[1] NaN
Posted by: Carl Witthoft | July 19, 2016 at 13:09
An interesting follow-up would be to find out why R claims that 0^0, Inf^0, and 1^Inf are all equal to 1, whereas it returns NA for Inf * 0, Inf - Inf, Inf/Inf, and 0/0. It seems that R is not consistent in its treatment of indeterminate forms.
Posted by: flodel | July 19, 2016 at 18:13
@flodel
That's not exactly an inconsistent treatment of indeterminate forms. That's the mathematical treatment.
0^0, Inf^0 and 1^Inf are all indeed equal to 1, in the mathematical sense. On the other hand, Inf*0, Inf - Inf, Inf/Inf and 0/0 are all indeterminate - again in the mathematical sense - which is exactly what R returns; it actually returns NaN, at least on my machine.
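A quick check of those forms (this is what a typical build of R gives, though such details can be platform-dependent):
> c(Inf * 0, Inf - Inf, Inf / Inf, 0 / 0)
[1] NaN NaN NaN NaN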
Posted by: R-Stats | July 19, 2016 at 18:42
@R-Stats, you could check http://mathworld.wolfram.com/Indeterminate.html or https://en.wikipedia.org/wiki/Indeterminate_form; both sources describe 0^0, Inf^0, 1^Inf, Inf * 0, Inf-Inf, Inf/Inf, and 0/0 as indeterminate forms.
Once R made the choice that 0^0 and Inf^0 are both equal to 1, it's understandable that it claims NA^0 is 1 as well. However, apply log() to that result and you find that log(NA^0) is not equal to 0 * log(NA).
Similarly, once R made the choice that 1^Inf is 1, it's understandable that it returns 1 for 1^NA. However, take the log() and you find that log(1^NA) is not equal to NA * log(1).
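For the record, the two pairs above look like this at the prompt:
> log(NA^0)
[1] 0
> 0 * log(NA)
[1] NA
> log(1^NA)
[1] 0
> NA * log(1)
[1] NA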
With some work, one could probably come up with more examples of surprising results like the ones above, which exploit the inconsistent way R handles the indeterminate forms I have listed. Makes you wonder why the R authors didn't decide to return NA for all these indeterminate forms.
Posted by: flodel | July 19, 2016 at 19:40
Another counterpoint is to realize that in R, NaN^0 also equals 1. Since NaN is by definition 'not a number', it can't be the case that R is using a 'placeholder for an unknown number' logic.
Posted by: Heitz | July 20, 2016 at 05:58
There seems to be a little confusion between NaN (not-a-number) and NA (R's placeholder for a missing number) in the above. R shouldn't return NA for an indeterminate form; it should (and generally does) return NaN in such cases. James Howard has a recent blog post on this topic.
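One quick way to see the distinction is that is.na() is TRUE for NaN, but is.nan() is FALSE for NA:
> is.na(NaN)
[1] TRUE
> is.nan(NA)
[1] FALSE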
I suspect R Core adopted the 0^0=1 definition because of the binomial justification, R being a stats package after all.
I can't think of any defense for NaN^0=1 though...
Posted by: David Smith | July 20, 2016 at 13:07
1. I think the post David was trying to link to was this one: https://jameshoward.us/2016/07/18/nan-versus-na-r/
2. The defense for NaN^0 = 1 comes from the hardware: https://jameshoward.us/2016/07/25/course-nan0-1/
Posted by: James P. Howard, II | July 25, 2016 at 18:54
@Kevin and NA * 0: before drawing conclusions too quickly, note that Inf * 0 is (most of the time, at least in the double-precision standard!) defined to be 'NaN', and basic arithmetic in R does follow that. So, replacing the placeholder x = NA by Inf (or -Inf!), you have cases where x * 0 is not 0... and that was the reason NA * 0 was defined to be NA (and NaN * 0 to be NaN).
And yes, it is true, one *could have* adopted the definition that all of these, including 0^NA, should return NaN ... which corresponds to typical floating-point standards ... *BUT*, and here we are back to the original posting by David Smith, in almost all math-stat applications it is very convenient to have 0^0 = 1; this goes for the border cases of the binomial, negbinomial and Poisson distributions and derived formulas, IIRC.
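As a concrete illustration of that binomial border case (my own example, not from the comment): writing the binomial density as choose(n, x) * p^x * (1-p)^(n-x) and evaluating it at x = 0, p = 0 requires the 0^0 = 1 convention, and R's dbinom() agrees:
> choose(3, 0) * 0^0 * (1 - 0)^3
[1] 1
> dbinom(0, size = 3, prob = 0)
[1] 1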
Posted by: Martin Maechler | July 26, 2016 at 06:41