*by Andrew Ekstrom**Recovering physicist, applied mathematician and graduate student in applied Stats and systems engineering*

We know that R is a great system for performing statistical analysis. The price is quite nice too ;-) . As a graduate student, I need a cheap replacement for Matlab and/or Maple. Well, R can do that too. I’m running a large program that benefits from parallel processing. RRO 8.0.2 with the MKL works exceedingly well.

For a project I am working on, I need to generate a really large matrix (10,000x10,000) and raise it to really high powers (like 10^17). This is part of my effort to model chemical kinetics reactions, specifically polymers. I’m using a Markov Matrix of 5,000x5,000 and now 10,000x10,000 to simulate polymer chain growth at femptosecond timescales.

At the beginning of this winter semester, I used Maple 18 originally. I was running my program on a Windows 7 Pro computer using an intel I7 – 3700K (3.5GHz) quad core processor with 32GB of DDR3 ram. My full program took, well, WWWWWAAAAAAAYYYYYYYY TTTTTTTTOOOOOOOO LLLLLOOOONNNNGGGGGG!!!!!!!!

After a week, my computer would still be running. I also noticed that my computer would use 12% -13% of the processor power. With that in mind, I went to the local computer parts superstore and consulted with the sales staff. I ended up getting a “Gamer” rig when I purchased a new AMD FX9590 processor (4.7GHz on 8 cores) and dropped it into a new mobo. This new computer ran the same Maple program with slightly better results. It took 4-5 days to complete... assuming no one else used the computer and turned it off.

After searching for a better method (meaning better software) for running my program, I decided to try R. After looking around for a few hours, I was able to rewrite my program using R. YEAH! Using the basic R (version 3.1.2), my new program only took a few days (2-3). A nice feature of R is an improved BLAS and LAPACK and their implementation in R over Maple 18. Even though R 3.1.2 is faster than Maple 18, R only used 12%-13% of my processor.

Why do I keep bringing up the 12%-13% CPU usage? Well, it means that on my 8 core processor, only 1 core is doing all the work. (1/8 = 0.125) Imagine you go out and buy a new car. This car has a big V8 engine but, only 1 cylinder runs at a time. Even though you have 7 other cylinders in the car, they are NOT used. If that was your car, you would be furious. For a computer program, this is standard protocol. A cure for this type of silliness is to use parallel programming.

Unfortunately, I AM NOT A PROGRAMMER! I make things happen with a minimal amount of typing. I’m very likely to use “default settings” because I’m likely to mistype something and spend an hour trying to figure out, “Is that a colon or a semi colon?” So when I looked around at other websites discussing how to compile and/or install different blas and lapack for R, I started thinking, “I wish I was taking QED right now. (QED = Quantum Electro-Dynamics)” I also use Windows, most of the websites I saw discussed doing this in Linux.

That led me to Revolution Analytics RRO. I installed RRO version 8.0.2 and the MKL available from here: http://mran.revolutionanalytics.com/download/#download

RRO uses Intel’s Math Kernel Library, which is updated and upgraded to run certain types of calculations in parallel. Yes, parallel processing in Windows, which is step one of HPC (High Performance Computing) and something many of my comp sci friends and faculty said was difficult to do.

A big part of my project is raising a matrix to a power. This is a highly parallelizable process. By that I mean, calculating element A(n,n) in the new matrix does not depend upon the value of A(x,x) in the new matrix. They only care about what is in the old matrix. Using the old style (series) computing, you calculate A(1,1), then A(1,2), A(1,3) … A(n,n). With parallel programming, on my 8 core AMD processor, I can calculate A(1,1), A(1,2), A(1,3) … A(1,8) at the same time. If these calculations were “perfectly parallel” I would get my results 8 times faster. For those of us that have read other blog posts on RevolutionAnalytics.com, you know that the speed boost for parallel programming is great, but not perfect. (Almost like it follows the laws of thermodynamics.) By using RRO, I was able to run my program in R and get results for all of my calculations in 6-8 hours. That got me thinking.

If parallel processing on 8 cores instead of series processing on 1 core is a major step up, can I boost the parallel processing possibility? Yes. GPU processors like the Tesla and FirePro are nice and all but:

1) Using them with R requires programming and using Linux. Two things I don’t have time to do.

2) Entry level Tesla and Good Firepro GPUs cost a lot of money. Something I don’t have a lot of right now.

The other option is using an Intel Phi coprocessor, or two. Fortunately, when I started looking, I could pick up a Phi coprocessor for cheap. Like $155 cheap for a brand new coprocessor from an authorized retailer. The video card in my computer cost more than my 2 Phi’s. The big issue, is getting a motherboard that has the ability to handle the Phi’s. Phi coprocessors have 6+GB of ram. Most mobo’s can’t handle more than 4GB of ram through a PCI-E 3.0 slot. So, I bought a second mobo as a “hobby” project computer. This new mobo is intended for “workstations” and has 4 PCI-E 3.0 slots. That gives me enough room for a good video card and 2 Phi’s. This new Workstation PC has an Intel Xeon E5-2620V3 (2.4GHz 6-core, 12-Thread) processor, 2 Intel Xeon Phi coprocessors 31S1P (57 cores with 4 threads per core at 1.1GHz per thread for a total of 456threads) and 48Gb DDR4 Ram.

The Intel Phi coprocessors work well with the Intel MKL. The same MKL RRO uses. Which means, if I use RRO with my Phi’s, after they are properly set up, I should be good to go….. Intel doesn’t make this easy. (I cobbled together the information from 6-7 different sources. Each source had a small piece of the puzzle.) The Phi’s are definitely not “Plug and Play”. I used MPSS version 3.4 for Windows 7. I downloaded the drivers from here:

https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss#wn34rel

I had to go into the command prompt and follow some of the directions available here. (Helpful hint, use micinfo to check your Phi coprocessors after step 9 in section 2.2.3 “Updating the Flash”.)

http://registrationcenter.intel.com/irc_nas/6252/readme-windows.pdf

After many emails to Revolution Analytics staff, I was able to get the Phi’s up and running! Now, my Phi’s work harmoniously with MKL. Most of the information I needed is available here. https://software.intel.com/sites/default/files/11MIC42_How_to_Use_MKL_Automatic_Offload_0.pdf

In the paper and website above, I needed to create some environmental variables. The generic ones are:

MKL_MIC_ENABLE=1

OFFLOAD_DEVICES=*<list>*

MKL_MIC_MAX_MEMORY=2GB

MIC_ENV_PREFIX=MIC

MIC_OMP_NUM_THREADS=###

MIC_KMP_AFFINITY=balanced

Since I have 2 Phi coprocessors, my <list> is 0, 1.(At least this is the list that worked.) I set MKL_MIC_MAX_MEMORY to 8GB. ( I have the ram to do it, so why not.) MIC_OMP_NUM_THREADS = 456.

Below, is a sample program I used to benchmark Maple 2015, R and RRO on my Gamer computer and my Workstation. Between the time I started this project and now, Maple up graded their program to Maple 2015. The big breakthrough is that Maple now does parallel processing. So, I ran the program below using Maple 2015 to see how it compares to R and RRO. (I uninstalled Maple 18 in anger.) I also ran the same program on my Workstation PC to see how well the Phi coprocessors worked. Once I had everything enabled, I didn’t want to disable anything. So, I just have the one, VERY IMPRESSIVE, time for my workstation.

require("expm")

options(digits=22)

a=10000

b=0.000000001

c=matrix(0,a,a)

for ( i in 1:a){c[i,i] = 1-1.75*b}

for ( i in 1:a){c[i-1,i] = b}

for ( i in 2:a){c[i,i-1] = 0.75*b}

c[1,1]=1-b

c[a,a]=1

c[a,a-1]=0

system.time(e=c%^%100)

By using RRO instead of R, I got my results 3.12 hours faster. Considering the fact that I have several dozen more calcs like this one, saving 3hrs per calc is wonderful ;-) By using RRO instead of Maple 2015, I saved about 41 mins. By using RRO with the Phi’s on my Workstation PC, I was done in 187.3s. I saved an additional 39 mins over my Gamer Computer! When I ran my full program, it took under an hour. Compared to the days/weeks for my smaller calculations, an hour is awesome!

An interesting note on the InteL MKL. It only uses cores, not threads, on the main processor. I’m not sure how it handles the threads on the Phi coprocessors. So, my Intel Xeon processor only had 50% usage of the main processor.

Now, your big question is, “Why should I care?” I ran a 10,000x10,000 matrix and raised it to unbelievably high values. I used a brute force method to do it. Suppose that you are doing “Big Data” analysis and you have 30 columns by 2,000,000 rows. If you run a linear regression on that data, your software will use a Pseudoinverse to calculate the coefficients of your regression. A part of the pseudoinverse involves multiplying your 30x2,000,000 matrix by a 2,000,000x30 matrix and it’s all parallelizable! Squaring my matrix uses about 1.00x10^{12} operations (assuming I have my Big O calculation correct.) The pseudo inverse of your matrix uses a mere 1.80x10^{9} operations.

Some of my friends who do these sort of “Big Data” calculations using the series method built into basic R or SAS tell me that they take hours(1-2) to complete. With my workstation, I have the computational power of 17 servers that use my same Xeon processor. That calculation would take me way less than a minute.

Behold, the power of parallel processing!

Those numbers don't sound right at all. Numpy finds the product of two 5k*5k matrices in about 6.5 seconds on my macbook air, raising to 10**17 should take about a hundred multiplies (depending on what exactly the power is), which is about ten minutes.

In your other example, raising a 10k*10k matrix to the 100th power should take 10 multiplies, each lasting about 61 seconds, or about ten minutes.

Also if your matrices are symmetric, maybe try doing a partial SVD, and then you can just write down the SVD of the power matrix, and multiply it back out.

Posted by: Jake | May 19, 2015 at 09:26

Also I just noticed they were markov matrices, so you're going to have a very friendly eigenspectrum.

Posted by: Jake | May 19, 2015 at 09:30

post links from where you bought the motherboard and the phi coprocessors for so cheap

Posted by: myschizobuddy | May 19, 2015 at 14:23

I hope you realize that you can compute the 17th power of A with five matrix multiplies (compute A^2, then A^4, then A^8, then A^16, and finally A^17).

You might also be interested in pqR (see pqR-project.org), which can automatically parallelize various R operations. It probably wouldn't help for this calculation, however, which will be dominated by how well your BLAS does matrix multiplies.

By the way, if your cores can run two threads (via hyperthreading), there may appear to be twice as many processors, and it may appear that a program not using these extra processors is using only 50% of the CPU, but it's probably really using something like 85% of the CPU, since two threads running on the same core are far from being the equivalent of two threads on separate cores.

Posted by: Radford Neal | May 19, 2015 at 16:23

Looks like you already invested a lot of time for the configuration. I really think you should try linux (ubuntu can be quite good for win users), using just a little part of time you've invested already you would have a good base to start HPC without any M$ artifacts.

Posted by: jangorecki | May 19, 2015 at 16:50

The matrix shown is a projection matrices and has the property that c raised to any power is c so if that is the form of your actual matrix you can avoid the multiplication entirely.

Posted by: G. Grothendieck | May 20, 2015 at 07:43

I have been looking into the Phi cards are well. Our simulation software is in FORTRAN. I wonder what the performance would be with FORTRAN compiled with the Intel compiler specifically made for use with the Phi cores? This can be found as a bundle packages 'starting for under $5k' https://software.intel.com/en-us/xeon-phi-starter-kit

Also has anybody re-compiled the R core using the intel starting kit - specifically optimizing for the Phi cores? I assume that is what Revolution has done - to gain access to the Phi cores and intel math libraries?

Posted by: tomw | May 20, 2015 at 10:11

Hi Everyone,

@Jake

The numbers are correct. I ran them multiple times. I lost a month+ earlier in the winter semester because my original software took so long.

I tried a couple different methods earlier from my numerical analysis class. When I talked to my prof about the issues I had, it came down to float point errors. My eigenvalues values all ended up being 1.00.

I got the phi from Sabrepc.com. They have 3 phis at a low price.

@buddy

On my Gamer comouter, I get 100% utilization. One the workstation and my laptop,(both intel processors) I get 50% utilization. I run simutations from BOINC, which use 100% of my processors on all my computers. The core temps on the xeon processor and cooling fan speeds seem to correlate to 50% utilization too. One of my fans is really loud at full throttle, when I run BOINC For several hours. When I run my larger program, the fans are not that loud.

@Radford

I did download Ubantu. The phi drivers work on red hat, not ubantu. I spent a few days trying to get the phi working on ubantu. I gave up. Since I wasted so much time earlier in the semester, I didn't have a week or two more to figure out ubantu. As is, I turned in my paper 2 days before the end of the semester.

@Jang

I'm not sure what you are getting at. Matrix c is a sparse matrix where c (n, n) is approximately 0.999999998. I ran multiple iterations of the program with a=500, 1000, 2500, 5000, 10000. To cut down on programming mistakes, I used a=# to declare the size of the matrix.

@Groth

From my (poor) understanding, if you can call upon the intel mkl and set the environment variables, the automatic offload does the rest. You can optimize the programming. I have a copy of Intel's parallel suite, which I got free for a year. (Sometimes being a student isn't so bad.) I'm not sure if I even need it. I haven't done anything with it yet, that I know of.

Thanks for reading and commenting.

Posted by: Drew | May 20, 2015 at 11:06

Dear Andrew:

For your own good, learn about Sparse Arrays...

For example, using Mathematica:

a = 10000; b = 10^-9;

cc = SparseArray[{{1, 1} -> 1 - b, {a, a} -> 1, {a, a - 1} -> 1,

{i_, i_} -> 1 - 1.75*b, {i_, j_} /; i - 1 == j ->

b, {i_, j_} /; i == j - 1 -> .75}, {a, a}, 0]

It will take LESS THAN A SECOND to calculate this.

Timing[dd = MatrixPower[cc, 100]]

This are the values of the first row:

1., 75.0002, 2784.38, 68217.3, ...

Posted by: Marcos F | May 23, 2015 at 07:40

Greaat post. What moherboard did you get for the Xeon Phi coprocessors?

Posted by: diego | May 24, 2015 at 18:51