Ever wanted to do a loop in R over a million elements, but felt bad that
for (i in 1:1e6) do.stuff(i)
allocated a 4Mb vector of indices you didn't actually need to store? That's where iterators come in.
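(As a sanity check, the figure holds up: a materialized million-element integer vector occupies about 4 MB. Note that recent versions of R store plain `1:n` sequences compactly, so the snippet below forces full allocation with `c()`; in the R of 2009 the vector was always fully allocated.)

```r
# Sanity check: a materialized vector of a million integer indices is ~4 MB.
# c() forces full allocation (modern R would otherwise store 1:1e6 compactly).
x <- c(1:1e6)
print(object.size(x), units = "Mb")
```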
Iterators are new to R (REvolution Computing just released the iterators package to CRAN last month), but will be familiar to programmers of languages like Java or Python. You can think of an iterator as something like a cursor or pointer to a predefined sequence of elements. Each time you access the iterator, it returns the current element being pointed to, and advances to the next one.
This is probably easier to explain with an example. We can create an iterator for a sequence of integers 1 to 5 with the icount function:
> require(iterators)
Loading required package: iterators
> i <- icount(5)
The function nextElem returns the current value of the iterator, and advances it to the next. Iterators created with icount always start at 1:
> nextElem(i)
[1] 1
> nextElem(i)
[1] 2
> nextElem(i)
[1] 3
The iterator keeps returning values until the sequence is exhausted, at which point it signals an error:
> nextElem(i)
[1] 4
> nextElem(i)
[1] 5
> nextElem(i)
Error: StopIteration
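If you ever need to drain an iterator by hand, one pattern is to catch that error with tryCatch. Here's a minimal sketch (for simplicity it treats any error as end-of-iteration, which is fine for simple cases):

```r
library(iterators)

it <- icount(3)
repeat {
  x <- tryCatch(nextElem(it), error = function(e) NULL)
  if (is.null(x)) break  # the StopIteration error was signalled: we're done
  print(x)
}
```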
So, if we wanted to make a loop of a million iterations, all we need to do is make an iterator and then loop using the foreach function (from the foreach package):
> require(foreach)
Loading required package: foreach
> m <- icount(1e6)
> foreach (i = m) %do% { do.stuff(i) }
One nice thing about this construction is that m is a very small object: you don't need to waste a bunch of RAM on index values when you only need one at a time. The other nice thing is that by replacing %do% with %dopar% you can run multiple iterations in parallel. Because the iterator m is shared among all the parallel instances, it guarantees that i takes each value between one and a million exactly once across all the iterations, even if they don't necessarily complete in sequence.
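To try the parallel version you first register a backend. Here's a minimal sketch using the doMC package (assuming you have it installed; it isn't available on Windows):

```r
library(foreach)
library(iterators)
library(doMC)

registerDoMC(cores = 2)  # register two parallel workers

# Each value from the shared iterator is handed out exactly once across the
# workers; .combine = "+" folds the per-iteration results into a single sum.
total <- foreach(i = icount(100), .combine = "+") %dopar% i
```

With %do% (or no registered backend) you'd get exactly the same answer, which is the point: the parallelism is a one-line change.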
An iterator isn't constrained to simply return integers, either. You can set up an iterator on a matrix, so that each call to nextElem returns the next row (or column) as a vector. Or, you can set up an iterator on a MySQL or Oracle database, so that each call to nextElem returns the next record in the table. Iterators can even return infinite, irregular sequences -- the sequence of all primes, for example. You can see examples of all these kinds of iterators in my recent UseR! talk.
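As one concrete example, the iter function in the iterators package builds a row-by-row iterator over a matrix:

```r
library(iterators)

m <- matrix(1:6, nrow = 3)
it <- iter(m, by = "row")
nextElem(it)  # the first row of m: values 1 and 4
nextElem(it)  # the second row: values 2 and 5
```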
You probably meant to have "1:" before the 1e6 like this
and it is "only" 4MB since that makes an int[] and not a double (object.size(1:1e6) returns 4000040 bytes on my machine).
You can of course re-write it as a while() statement without the allocation
but the control flow statements in R are pitiful so it is good to see you expanding the language. And of course your approach extends to parallel processing.
Posted by: Allan Engelhardt | July 17, 2009 at 05:54
Aha, good points, thanks Allan. I've made corrections to the original article.
Posted by: David Smith | July 17, 2009 at 06:53
i = 1
while(i <= 1e6) { do.stuff(i); i = i + 1 }
Posted by: Bob | July 18, 2009 at 03:22
Hi.
I executed the following script to try the foreach package, but encountered the error.
> library(foreach)
> foreach(i=1:3) %do% sqrt(i)
以下にエラー sqrt(i) : 添え字が許される範囲外です
This error message means "subscript out of bounds" in Japanese. My environment is as follows.
> sessionInfo()
R version 2.9.1 (2009-06-26)
i386-apple-darwin9.7.0
locale:
ja_JP.UTF-8/ja_JP.UTF-8/C/C/ja_JP.UTF-8/ja_JP.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] foreach_1.2.1 codetools_0.2-2 iterators_1.0.1
loaded via a namespace (and not attached):
[1] tcltk_2.9.1 tools_2.9.1
So I changed the LANG as following.
/Users/syou6162% LANG=C
/Users/syou6162% echo $LANG
C
And I restarted R and executed the above script. In this time, it went well.
/Users/syou6162% R
R version 2.9.1 (2009-06-26)
Copyright (C) 2009 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(foreach)
Loading required package: iterators
Loading required package: codetools
foreach: simple, scalable parallel programming from REvolution Computing
Use REvolution R for scalability, fault tolerance and more.
http://www.revolution-computing.com
> foreach(i=1:3) %do% sqrt(i)
[[1]]
[1] 1
[[2]]
[1] 1.414214
[[3]]
[1] 1.732051
So I think the foreach package depends on the locale. Could you look into the cause of this?
Posted by: Yasuhisa Yoshida | July 20, 2009 at 09:14
Thanks for letting us know about that, I agree it looks like a problem with locales. I'll have the developers take a look and get in touch with you directly.
Posted by: David Smith | July 20, 2009 at 09:27
Actually, a standard 'while' loop in R also does not consume RAM, and is a factor of 300 (three hundred) faster than 'foreach' with icount.
Foreach with icount is about a factor of 2000 (two thousand) slower than a standard for loop, so a moderate RAM gain comes at a very expensive speed loss, too expensive to be healed by parallelization.
One of the fastest AND most RAM-efficient ways of looping is chunked looping, as facilitated by the function 'chunk' in package 'bit': the same speed as 'for' but a factor of 1000 less RAM.
The examples below also show that, for example, calculating a sum with chunked looping can be up to a factor of 60,000 (sixty thousand) faster than a 'foreach' implementation employing '.combine="+"', and still saves a factor of 1000 in RAM compared to a simple call to 'sum'.
Cheers
J.O.
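In outline, the chunked pattern looks like this (assuming the bit package is installed; chunk() yields small index ranges instead of one giant index vector):

```r
library(bit)

n <- 1e6
total <- 0
for (ch in chunk(1, n, by = 1e4)) {             # 100 ranges of 10,000 indices each
  total <- total + sum(as.double(ch[1]:ch[2]))  # only one small index vector in RAM
}
# total is now n * (n + 1) / 2
```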
> require(bit)
> require(foreach)
> m <- 10000
> k <- 1000
> n <- m*k
> cat("Four ways to loop from 1 to n. Slowest foreach to fastest chunk is 1700:1 on a dual core notebook with 3GB RAM\n")
Four ways to loop from 1 to n. Slowest foreach to fastest chunk is 1700:1 on a dual core notebook with 3GB RAM
> z <- 0L; k*system.time({it <- icount(m); foreach (i = it) %do% { z <- i; NULL }})
User System verstrichen
6180 0 6320
> z <- 0L; system.time({i <- 0L; while (i < n) {i <- i + 1L; z <- i}})
User System verstrichen
20.25 0.00 20.78
> z <- 0L; system.time(for (i in 1:n) z <- i)
User System verstrichen
3.30 0.02 3.40
> z <- 0L; n <- m*k; system.time(for (ch in chunk(1, n, by=m)){for (i in ch[1]:ch[2])z <- i})
User System verstrichen
3.14 0.00 3.35
> cat("Seven ways to calculate sum(1:n). Slowest foreach to fastest chunk is 61000:1 on a dual core notebook with 3GB RAM\n")
Seven ways to calculate sum(1:n). Slowest foreach to fastest chunk is 61000:1 on a dual core notebook with 3GB RAM
> k*system.time({it <- icount(m); foreach (i = it, .combine="+") %do% { i }})
User System verstrichen
9460 0 9780
> z <- 0; k*system.time({it <- icount(m); foreach (i = it) %do% { z <- z + i; NULL }})
User System verstrichen
6280 0 6410
> z <- 0; system.time({i <- 0L; while (i < n) {i <- i + 1L; z <- z + i}})
User System verstrichen
27.06 0.00 28.40
> z <- 0; system.time(for (i in 1:n) z <- z + i)
User System verstrichen
9.47 0.02 9.62
> system.time(sum(as.double(1:n)))
User System verstrichen
0.13 0.01 0.15
> z <- 0; n <- m*k; system.time(for (ch in chunk(1, n, by=m)){for (i in ch[1]:ch[2])z <- z + i})
User System verstrichen
8.92 0.00 9.42
> z <- 0; n <- m*k; system.time(for (ch in chunk(1, n, by=m)){z <- z + sum(as.double(ch[1]:ch[2]))})
User System verstrichen
0.14 0.01 0.16
Posted by: Jens Oehlschlägel | July 31, 2009 at 10:52
Jens, thanks for the info about the "bit" package, but I don't think your timings make a lot of sense as the body of each of your loops is trivial; you're really only measuring the overhead of the different kinds of loops, which in any real application (simulations, etc) is a trivial component of the overall execution time.
The real advantage of using foreach comes in the parallelization, where any overhead in the loop is far outweighed by running iterations in parallel (provided they do more than trivial amounts of calculation). In particular, if you're writing code for others to use (say, in a package), it allows OTHERS to run your code in parallel, too. Even if you don't have a parallel system to run on, if you use foreach/%dopar% someone using your code can register multicore or NetWorkSpaces as the parallel backend and get significant speedups. (doMC, the multicore backend, is available on CRAN for all platforms except Windows; doNWS is included in REvolution R Enterprise and is available for all platforms.)
Posted by: David Smith | July 31, 2009 at 11:29