by Joseph Rickert
I was recently looking through upcoming Coursera offerings and came across the course Coding the Matrix: Linear Algebra through Computer Science Applications taught by Philip Klein from Brown University. This looks like a fine course, but why use Python to teach linear algebra? I suppose this is a blind spot of mine: MATLAB I can see. That software has a long tradition of being used in applied mathematics and engineering applications. The Linear Algebra course from MIT OpenCourseWare is based on MATLAB, and half the linear algebra books published by SIAM use MATLAB.
I would have expected MATLAB; Python seems like a stretch. But where do we stand in the R world vis-a-vis linear algebra? Well, from a pedagogical point of view there does not appear to be much material "out there" that specifically relates to teaching linear algebra with R. It seems that not much has changed since this 2009 post. A Google search yields some nice, short documents (Hyde, Højsgaard, Petris, and Carstensen) that look like the online residue from efforts to teach linear algebra using R. And, although most introductory R books have some material devoted to linear algebra (e.g. the extended Markov chain example in The Art of R Programming), one would be hard pressed to find a book entirely devoted to teaching linear algebra with R. Hands-On Matrix Algebra Using R: Active and Motivated Learning with Applications by Hrishikesh D. Vinod is the exception. (This looks like a gem waiting to be discovered.)
My guess, however, is that we will see much more introductory R material focused on linear algebra as scientists and engineers with computational needs outside of statistics proper discover R. R is, after all, well suited for doing the matrix computations associated with linear algebra, and here are some reasons:
(1) As a language designed for doing computational statistics, R is built on an efficient foundation of well-tested and trusted linear algebra code. From the very beginning, R was good at linear algebra. The vignette 2nd Introduction to the Matrix package lays out a little of the history of R’s linear algebra underpinnings:
Initially the numerical linear algebra functions in R called underlying Fortran routines from the Linpack (Dongarra et al., 1979) and Eispack (Smith et al., 1976) libraries but over the years most of these functions have been switched to use routines from the Lapack (Anderson et al., 1999) library which is the state-of-the-art implementation of numerical dense linear algebra.
(For example, the base R functions for computing eigenvalues, eigen(), Cholesky decompositions, chol(), and singular value decompositions svd() all use LAPACK or LINPACK code.)
(2) R’s notation, indexing, and operators are very close to the matrix notation that mathematicians normally use to express linear algebra, and there are base R functions for most common matrix operations (see Quick R for a summary).
(3) The way R functions operate on whole objects (vectors, matrices, arrays, etc.) models very closely the conceptual process of manipulating matrices as single entities.
(4) The seamless interplay between the data frame and matrix data structures in R makes it easy to populate matrices from the appropriate columns of heterogeneous data sets.
(5) R is an extensible language and there is a considerable amount of work being done in the R community to "go deep", to "go sparse" and to "go big".
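Points (2) through (4) are easy to see in a few lines of base R. The following sketch uses made-up toy data purely for illustration:

```r
# Matrix notation close to the mathematics: %*%, t(), solve(), crossprod()
A <- matrix(1:6, nrow = 2)         # a 2 x 3 matrix, filled column by column
B <- matrix(1:6, nrow = 3)         # a 3 x 2 matrix
A %*% B                            # matrix multiplication
t(A)                               # transpose
S <- crossprod(A)                  # t(A) %*% A, a 3 x 3 symmetric matrix
solve(S + diag(3))                 # matrix inverse (S itself is singular, so regularize)

# Whole-object operations: elementwise arithmetic on the entire matrix at once
2 * A + 1

# From a heterogeneous data frame to a numeric matrix
df <- data.frame(id = c("a", "b", "c"), x = c(1, 2, 3), y = c(4, 5, 6))
M <- as.matrix(df[, c("x", "y")])  # pull out just the numeric columns
M %*% c(1, -1)                     # computes x - y for each row
```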
Going Deep
“Going Deep” means making it easy to access the computational resources that may be necessary for building production level applications. To this end, the Rcpp package makes it easy to write R functions that call C++ code to do the heavy lifting. Moreover, the RcppArmadillo and RcppEigen packages provide direct and efficient access to the C++ Armadillo and Eigen libraries for doing linear algebra.
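As a small taste of what Rcpp makes possible, the sketch below (assuming the Rcpp package and a C++ compiler are installed) compiles a C++ function on the fly and calls it from R; vecNorm is a made-up function name for this example, not part of any package:

```r
library(Rcpp)

# Compile a small C++ function and expose it to R.
# vecNorm is a hypothetical example, not a library function.
cppFunction('
double vecNorm(NumericVector x) {
  double s = 0.0;
  for (int i = 0; i < x.size(); ++i) s += x[i] * x[i];
  return sqrt(s);
}')

vecNorm(c(3, 4))   # the Euclidean norm of (3, 4), i.e. 5
```

RcppArmadillo and RcppEigen work the same way but give the C++ side access to full linear algebra libraries rather than just R's data structures.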
Going Sparse
Modern statistical applications on even moderately large data sets often produce sparse matrices. One way to work with them in R is via Matrix, a “recommended” package that provides S4 classes for both dense and sparse matrices that extend R’s basic matrix data type. Methods for R functions that work on Matrix objects provide access to efficient linear algebra libraries including BLAS, LAPACK, CHOLMOD (which in turn uses AMD and COLAMD), and CSparse. (Getting oriented in the world of linear algebra software is not an easy task; I found this chart from an authoritative source helpful.)
The following code, which provides a very first look at the Matrix package, shows a couple of notable features: (1) the Matrix() function examines a matrix to determine its class, and (2) once the Cholesky factorization is computed it automatically becomes part of the matrix object.

# A very first look at the Matrix package
library(Matrix)
set.seed(1)
m <- 10; n <- 5               # Set the dimensions of the matrix
A <- matrix(runif(m*n), m, n) # Construct a random matrix
B <- crossprod(A)             # Make a positive-definite symmetric matrix
A; B                          # Have a look

A.M <- Matrix(A)              # Matrix() recognizes A as a dense matrix and
class(A.M)                    # automatically assigns A.M to class dgeMatrix
[1] "dgeMatrix"

B.M <- Matrix(B)              # Matrix() recognizes B as a symmetric, dense matrix
class(B.M)                    # and automatically assigns B.M to class dsyMatrix
[1] "dsyMatrix"

chol(B.M)                     # Compute the Cholesky decomposition of B.M
str(B.M)                      # The Cholesky decomposition is now part of the B.M object
Formal class 'dsyMatrix' [package "Matrix"] with 5 slots
  ..@ x       : num [1:25] 3.94 3.09 2.05 2.87 3.07 ...
  ..@ Dim     : int [1:2] 5 5
  ..@ Dimnames:List of 2
  .. ..$ : NULL
  .. ..$ : NULL
  ..@ uplo    : chr "U"
  ..@ factors :List of 1
  .. ..$ Cholesky:Formal class 'Cholesky' [package "Matrix"] with 5 slots
  .. .. .. ..@ x       : num [1:25] 1.98 0 0 0 0 ...
  .. .. .. ..@ Dim     : int [1:2] 5 5
  .. .. .. ..@ Dimnames:List of 2
  .. .. .. .. ..$ : NULL
  .. .. .. .. ..$ : NULL
  .. .. .. ..@ uplo    : chr "U"
  .. .. .. ..@ diag    : chr "N"

B.M@factors                   # Access the factorization stored in the B.M object
$Cholesky
5 x 5 Matrix of class "Cholesky"
          [,1]      [,2]      [,3]      [,4]      [,5]
[1,] 1.9845453 1.5548529 1.0323413 1.4475384 1.5451499
[2,]         . 1.1677451 0.4304767 0.5193746 0.6326812
[3,]         .         . 1.1606778 0.4354018 0.9637288
[4,]         .         .         . 0.8844295 0.1572823
[5,]         .         .         .         . 0.646782
Other contributed sparse matrix packages are SparseM, slam, and spam. For more practical information see the short tutorial from John Myles White and the Sparse Models vignette by Martin Maechler.
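To get a feel for the sparse side of the Matrix package, here is a minimal sketch: sparseMatrix() builds a matrix from (row, column, value) triplets and stores only the nonzero entries, so a matrix that would need 8 MB in dense form takes almost no memory when it holds just three values.

```r
library(Matrix)

# A 1000 x 1000 matrix with only three nonzero entries
i <- c(1, 3, 500)             # row indices
j <- c(2, 3, 999)             # column indices
x <- c(10, 20, 30)            # the nonzero values
S <- sparseMatrix(i = i, j = j, x = x, dims = c(1000, 1000))

class(S)                      # "dgCMatrix": compressed sparse column storage
object.size(S)                # tiny, versus ~8 MB for the dense equivalent
S %*% rep(1, 1000)            # ordinary matrix operations still apply
```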
Going Big
R has had the capability to work with big matrices for some time, and this continues to be an area of active development. The R packages bigmemory and biganalytics provide structures for working with matrices that are too large to fit into memory, and bigalgebra contains functions for doing linear algebra with bigmemory structures. The scidb package enables R users to do matrix computations on huge arrays stored in a SciDB database, and Revolution Analytics’ commercial distribution of R, Revolution R Enterprise (RRE), makes high performance matrix calculations readily available from R. Because RRE is compiled with the Intel Math Kernel Library, most common R functions based on linear algebra calculations automatically get a significant performance boost. However, the real linear algebra benefit RRE provides comes from the ability to compute very large matrices in seconds and seamlessly integrate them into an R workflow. A post from 2011 shows the code for doing a principal components analysis on 50 years of stock data with over 9 million observations and 2,800 stocks. The function rxCor() took only 0.57 seconds on my 8GB, 2-core laptop to compute the correlation matrix.
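A minimal sketch of the bigmemory approach (assuming the package is installed): a file-backed big.matrix keeps its data on disk, so it can be much larger than available RAM, yet it is indexed exactly like an ordinary R matrix.

```r
library(bigmemory)

# A file-backed matrix: the data live in "X.bin" on disk rather than in RAM,
# so nrow could be far larger than memory allows.
X <- filebacked.big.matrix(nrow = 1e6, ncol = 10, type = "double",
                           backingfile = "X.bin", descriptorfile = "X.desc")

X[1, ] <- rnorm(10)   # ordinary matrix indexing works
dim(X)                # 1000000 rows, 10 columns
```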
Of course these three categories are not mutually exclusive. In this short video Bryan Lewis shows how the winners of the Netflix prize used R to go deep, to go big, and to go sparse.
The Python parts are probably using NumPy, Python's matrix/array extension, not just base Python.
Well, they wouldn't have much of a book without it!
Posted by: Mildred Bonk | August 29, 2013 at 08:57
Don't forget about pbdR for going big. We've run benchmarks on terabytes of data with up to 50,000 cores.
Posted by: Drew Schmidt | August 29, 2013 at 11:10
I've almost completed the Coding the Matrix class and my understanding is that Python was used for instructional purposes, not because it is best tool for the job. Matrix and Vector classes are built from scratch. Having previously only studied theoretical linear algebra I've learned a lot getting my hands "dirty".
With respect to R and linear algebra, my main issue is having to use the "drop=FALSE" argument to prevent column and row matrices from becoming vectors, for example when you're filtering a data set and only one row is output. This was discussed by Radford Neal ("Design Flaws in R #2 — Dropped Dimensions") on his blog some time ago:
http://radfordneal.wordpress.com/2008/08/20/design-flaws-in-r-2-%E2%80%94-dropped-dimensions/
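For example:

```r
M <- matrix(1:6, nrow = 3)     # a 3 x 2 matrix
M[M[, 1] > 2, ]                # one row matches, so the result silently
                               # drops to a plain vector
M[M[, 1] > 2, , drop = FALSE]  # stays a 1 x 2 matrix
```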
Posted by: Ben W | August 29, 2013 at 11:52
Nothing that you said about R could not also be said about Python. Apart perhaps from point 2.
Python is a much cleaner and purer environment than R though which is why I'd choose it over R for linalg.
Posted by: David Heffernan | August 29, 2013 at 12:16
Great synopsis on R for linear algebra. But your comments about Python make it clear that you don't know much about how the language has increasingly been used in scientific computing for more than 10 years. The motivation came in many ways from frustrations with MATLAB, so you'll see some naming conventions and entire packages (e.g. matplotlib) influenced by MATLAB's conventions. If you read up on that history a bit (e.g. Wikipedia) then the fact that many CS and Math departments teach Python won't be as shocking.
Posted by: Josh Hemann | August 29, 2013 at 20:55
There is a lot of momentum currently for usage of python in science. The combination of numpy, scipy, pandas modules are key to that. Companies like Enthought and Continuum Analytics are also helping with that. There are also several resources for a newbie. The tutorials and several talks from a python conference (pycon,scipy) are available as videos immediately after the conference (it will be really nice if materials were available like this from useR conferences).
The question still remains whether R can do lot of the same things and your post is timely in that regard. Thanks for posting.
Posted by: Rdabbler | August 30, 2013 at 02:50
I actually took the Coursera course and I think Python was a much better alternative to R in this case.
1.) Most students came in with little to no programming experience.
2.) R's learning curve is just too steep, whereas Python is human-readable (I love R, but python is more user-friendly).
3.) No other packages / modules were used, we built a Matrix and Vector class and implemented all of the features of matrices with overloaded operators.
The purpose of the class was to teach linear algebra through building each basic step. Using a pre-built Matrix package would defeat the purpose and reduce the course to just learning terms rather than forcing yourself to understand the linear algebra well enough to be able to write it in code.
Explaining the algorithms of linear algebra through code was a lot more fun than just knowing how to run a certain module.
So, yes, R has a lot more going for it than Python for Linear Algebra. But knowing how to use a tool wasn't the purpose. You were supposed to really grok linear algebra.
Posted by: Will | August 30, 2013 at 05:05
I can second what Will said. I also took the course and we didn't use numpy or any other package, but had to code our own Matrix and Vector classes as a learning experience. Now that I am done with the class I would gladly choose to use numpy over my code since numpy is much more optimized and tested. I would be even more likely to use R since it has great support for LA applications. I have used R for a long time now, but the pandas package does look very interesting. Python seems like a really great language.
Prior to this course I took a machine learning course on Coursera that used Octave (or Matlab). Before I used R I liked Matlab better than S-Plus, but after using R for many years and then going back to Octave, I was so happy that I chose to learn R!
Posted by: Roger | August 30, 2013 at 05:27
For books, have a look at Gentle's Matrix Algebra. I have his Computational Stats book, and it is well written for a daunting subject. I expect he's done similar for LA. The Amazon reviews (not many, but it's not a Miley Cyrus topic) are all very positive. According to the ToC, and the snippet of the Preface one can sample, he uses most of the stat packs, including R.
Posted by: Robert Young | August 30, 2013 at 06:19