As longtime readers of this blog will know, I love optical illusions, and the checkerboard shadow illusion is one of my all-time favourites. If you're not familiar with it, here's a rendering of the illusion, done in Maya, that I found online; note that squares C3 (a white square) and B5 (a grey square) look as different as you'd expect in the top frame, but when you add a shadow-casting cylinder to the scene, the two squares are almost exactly the same shade of grey onscreen, despite what your eyes are telling you.
While the illusion is real (and demonstrates effectively that you can't always believe what your eyes tell you about colour), there's something fishy going on in this real-world recreation of the scene:
The whole point of the illusion is that the middle tile is actually white, and appears white to our brains, but is dark grey on-screen. Yet the woman in the video drags the white tile over to a dark tile, where it should definitely appear a different colour in the better-lit area. I guess they're maintaining the RGB colour of the dragged tile in CGI, but I don't think that really helps to explain the illusion.
Anyway, that's all for this week (but you can check out older Friday posts here). We're back on Monday.
(Experienced R users can always drill down to see the code behind the analysis, as you can see in this video.) Not only will this be a great way for non-programmers to access the wealth of capabilities in R (even with very large data sets, thanks to Revolution R Enterprise), it's also a great way for R programmers to make their custom data visualizations and models available to business analysts, by adding new icons to the Alteryx palette, or by publishing new workflows to the Alteryx Gallery.
The 2013 Mining Big Data Camp was held last Saturday at eBay's Town Hall Conference Center in San Jose. The San Francisco chapter of the ACM has been sponsoring this data-mining-themed “un-conference” event since 2009. Attendance this year was lighter than in years past, but the event continues to be a viable way to find out what's hot in the Silicon Valley data mining scene. The buzz this year was about Deep Learning, Data Science and R.
I stumbled into the hall just in time to watch the un-conference take shape. Greg Makowski and his team of ACM volunteers do a superb job of managing chaos. An un-conference self-organizes: people propose sessions, a show of hands decides which will fly, people volunteer or are gently prodded into leading the sessions, and a quick count decides which sessions get the larger rooms. I found myself leading two sessions: “An Overview and Introduction to R”, and a discussion on “How to become a Data Scientist”. The attitude of the participants in the R session was strictly business: “How is R organized?”, “What is the best way to learn R?”, “Show me some code.” The pragmatism and enthusiasm reflected exactly what the polls indicate: R skills have become essential to Data Mining and Data Science.
In addition to the “Data Scientist” session in which I participated, there was a parallel session, led by eBay hiring managers, on getting hired as a Data Scientist. I think the tremendous interest in this topic at the un-conference and elsewhere reflects how much momentum has built up towards establishing “Data Scientist” as a distinct job position, and also indicates how useful the title has become as a label for a fairly extensive set of interdisciplinary skills. My take is that a Data Scientist needs to be proficient in four areas:
Statistical Inference: an understanding of sampling and experimental design, at a minimum
Programming skills: sufficient to acquire and manipulate large data sets and to implement machine learning algorithms
IT skills: some knowledge of Linux and big data architectures, and of how to connect to databases, clusters, clouds and Hadoop
Business skills: how to take an insufficiently articulated business problem and shape it into a series of relevant technical questions
These are not all that different from Drew Conway’s original Venn diagram, but they include the ability to ask the right questions that Hilary Mason always so eloquently emphasizes.
While R and Data Science are in the realm of the here and now, the buzz around Deep Learning is that it might be the next really big thing. “Deep Learning” refers to using multi-layer neural nets, including Restricted Boltzmann Machines, to solve difficult tasks in machine vision, audio processing and Natural Language Processing. The basic ideas have apparently been around for quite some time, but recent advances in training these multi-layer networks have made them practical for certain classes of problems. Python seems to be the language of choice for working in this area: for example, NuPIC (the Numenta Platform for Intelligent Computing, which recently became an open-source project) is a mix of Python and C++. The two very knowledgeable eBay engineers who led the un-conference session worked through an example based on code that I think relied on the Pylearn2 library.
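The session code was Python, but if you'd rather poke at the idea from R, here's a minimal sketch assuming the CRAN deepnet package (my illustration on simulated data, not the Pylearn2 code from the session): a small deep belief network, pre-trained with stacked RBMs and then fine-tuned with backpropagation.

```r
# A minimal sketch only: assumes the CRAN 'deepnet' package, and the data are
# simulated. This is not the Pylearn2 code used in the un-conference session.
# install.packages("deepnet")
library(deepnet)

set.seed(42)

# Toy binary classification problem: 200 rows, 10 features
x <- matrix(rnorm(2000), ncol = 10)
y <- as.integer(rowSums(x[, 1:3]) > 0)

# Deep belief network: two hidden layers, RBM pre-training then backprop fine-tuning
dbn <- dbn.dnn.train(x, y,
                     hidden = c(20, 10),
                     numepochs = 10,
                     batchsize = 50,
                     learningrate = 0.8)

# Predicted probabilities and (in-sample) accuracy
p <- nn.predict(dbn, x)
mean((p > 0.5) == y)
```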
For me, the ACM un-conference brought some clarity to the complementary roles R and Python play in Data Mining, and provided concrete examples that illustrate why KDnuggets advises would-be Data Scientists to learn both languages (and SQL).
If you are interested in learning more about Deep Learning and its role in reviving the dreams of Artificial Intelligence, have a look at the two Google Tech Talks by Geoff Hinton and Andrew Ng.
Boris Chen, a data scientist for the New York Times, has been running a weekly blog since August with statistical analysis of NFL players, as fodder for Fantasy Football players around the country. Here's how he describes what he does:
My model pulls aggregated expert rankings from fantasypros, and I pass that data into a machine learning clustering algorithm called a gaussian mixture model to find tiers of players each week. Then I plot them in two dimensional space and the result is charts that let you easily decide your line up each week.
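If you'd like to experiment with the same idea in R, here's a minimal sketch using the mclust package. The data frame of aggregated expert ranks (and its mean_rank / sd_rank columns) is hypothetical, and this is my illustration rather than Boris's actual code.

```r
# A minimal sketch of tiering players with a Gaussian mixture model.
# The data are simulated and the column names hypothetical; this is not
# Boris Chen's actual code.
# install.packages("mclust")
library(mclust)

set.seed(1)

# Fake aggregated expert rankings for 60 players
players <- data.frame(
  player    = paste("Player", 1:60),
  mean_rank = sort(runif(60, 1, 60)),
  sd_rank   = runif(60, 1, 8)
)

# Fit Gaussian mixtures with 1 to 9 components; BIC chooses the number of tiers
fit <- Mclust(players[, c("mean_rank", "sd_rank")], G = 1:9)
players$tier <- fit$classification
summary(fit)

# Plot players by average rank and expert disagreement, coloured by tier
plot(players$mean_rank, players$sd_rank, col = players$tier, pch = 19,
     xlab = "Average expert rank", ylab = "SD of expert ranks")
```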
A quick heads-up that I'll be participating in an online webinar and panel discussion on the "small data" side of data science, and what Big Data practitioners can learn from statistical reasoning and expertise. Gregory Piatetsky (KDnuggets editor) will join me in the discussion. It starts at 8AM Pacific Time, and the bulk of the time will be devoted to your questions, so it should be a fun, interactive session.
To register for this online event hosted by Kalido, follow the link below.
Rexer Analytics has been conducting regular polls of data miners and analytics professionals on their software choices since 2007, and the results of the 2013 Rexer Analytics Data Miner Survey were presented at last month's Predictive Analytics World conference in Boston. Here are some highlights of the responses from the 1,039 non-vendor participants. (You can request a free copy of the full survey results from Rexer Analytics.)
R is the most popular data mining tool, used at least occasionally by 70% of those polled. This popularity holds amongst all of the subgroups in the survey as well: R remains the most-used tool amongst corporate data miners (70%), consulting data miners (73%), academic data miners (75%) and nonprofit/NGO/government data miners (67%). And while the average data miner reports using five software tools, R is also the most popular primary tool in the survey, at 24% overall.
The popularity of R is skyrocketing, and its prevalence has increased in every single Rexer poll conducted since 2007, both in overall use and as the primary tool.
Most users are satisfied with R, with more than 85% of users self-identifying as "satisfied" or "extremely satisfied". The software platforms with the most dissatisfied users are SAS and SAS Enterprise Miner.
These results are in line with other recent polls about R usage. The annual KDnuggets poll of top languages for analytics, data mining and data science, released in September, named R as the most popular software for the third year running.
You already know that R is an amazingly powerful language for data analysis, but what if you're not a programmer? Or, what if you want to make the data manipulations, visualizations or statistical models you've developed in R available to business analysts, marketers, managers or other non-programming types?
That's why Revolution Analytics has teamed up with Alteryx to bring the power of R to Alteryx's easy-to-use drag-and-drop workflow interface. From your Windows desktop or laptop, you can connect your data to icons like "Contingency Table", "Histogram" or "Decision Tree" and get results from R without a line of R programming. (If you are an R programmer, you can always drill down and see exactly the R code that's being run, and even modify it if you like.)
Alteryx has had integration with R for a while, but if you have Revolution R Enterprise installed, it will take advantage of the performance benefits of the Revolution R build. Better yet, if you're working with large data sets, you can send data to out-of-memory XDF files, and connect these large data files to R-based icons. In this case, the big-data functions of the Revolution R Enterprise ScaleR package will be used behind the scenes, to speed up computations and avoid out-of-memory errors.
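To give a flavour of what runs behind those icons, here's a minimal sketch of the kind of RevoScaleR code involved; the file and column names are hypothetical, and this isn't the actual code Alteryx generates.

```r
# A minimal sketch of RevoScaleR working against an out-of-memory XDF file.
# File names and column names are hypothetical; this is not the code that
# Alteryx itself generates.
library(RevoScaleR)   # ships with Revolution R Enterprise

# Import a large CSV into an XDF file on disk
rxImport(inData = "sales.csv", outFile = "sales.xdf", overwrite = TRUE)

# Summary statistics and a linear model, computed chunk-wise on the XDF file
rxSummary(~ revenue + region, data = "sales.xdf")
model <- rxLinMod(revenue ~ units + region, data = "sales.xdf")
summary(model)
```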
If you missed last week's webinar R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers, the slides and replay are now available for download at that link. I've also embedded the replay video below, where you can hear Michele Chambers (Revolution Analytics) and Jairam Ranganathan (Cloudera) discuss the second generation of modern analytics, predictive analytics in the context of Big Data, and how Revolution R Enterprise and Cloudera's Distribution including Apache Hadoop work together.
You've probably heard about the political trainwreck we're going through here in the US. There are many causes for what's going on, but a lot of the blame goes to the extreme polarization in US politics today. As Nate Silver notes, the degree of polarization in Congress is higher than at any point in the last 80 years. According to Gallup, 60% of Americans say a third political party is needed to break the ongoing political roadblock, but there's never been an enduring third party in this country. The reason? The winner-takes-all voting system:
Personally, I'm a fan of the Australian system: proportional representation voting ensures minor parties get some seats in parliament (and often have significant influence as the deciding bloc in coalitions), and compulsory voting means that campaigns (and policies) are targeted at the centrist majority instead of aimed at driving the single-issue extremes to the polls. And given the current state of politics here in the US, I'm warming to the idea of an all-powerful monarch that can just reboot the entire system when there's no way out of the gridlock.
As always, thanks for the comments and please send any suggestions to me at david@revolutionanalytics.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.