9 big data pain points

Do enough Hadoop and NoSQL deployments, and the same problems crop up again and again

Sometimes, there's a big hole in the side of the ship, and the industry decides to wait until the ship starts sinking in hope of selling lifeboats.

At other times, less severe flaws resemble the door in my downstairs bathroom, which opens only if you turn the handle one direction, not the other. I’ll fix it one day, although I've said that for 12 years or so.

I can count nine issues confronting the big data business that fall at either extreme ... or somewhere in between.

Big data pain point No. 1: General-use GPU programming

CPUs are still kind of expensive, at least compared to GPUs. If GPUs had better standards and fewer obscure drivers, a whole marketplace would open up. For now, the fact that GPUs cost a lot less is outweighed by the fact that they are much harder to program, and virtually impossible to program without tying yourself to one vendor's very specific model.

This is the kind of situation where someone needs to do the hard work of writing something that looks like ODBC or JDBC for GPUs, then convince AMD or Nvidia that the market is bigger than graphics cards alone. Suppose you had a general GPU binding for Spark that you didn’t have to think too hard about; suddenly, people would start building “GPGPU” clusters with reckless abandon.
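To make that concrete, here is a minimal sketch of what such a vendor-neutral binding could feel like from a Spark job. The GpuDevice trait and FallbackDevice object are invented for illustration -- no such standard API exists today, which is exactly the point -- while the Spark calls around them are the ordinary ones.

```scala
// Hypothetical sketch: a "JDBC for GPUs" seen from Spark.
// GpuDevice and FallbackDevice are invented names, not a real library.
import org.apache.spark.sql.SparkSession

// Invented abstraction: a driver-agnostic handle to whatever GPU sits on the
// executor, hiding the vendor the way JDBC hides the database.
trait GpuDevice {
  def map(data: Array[Float], kernelSource: String): Array[Float]
}

// Stand-in CPU implementation so the sketch actually runs; a real binding
// would dispatch to AMD, Nvidia, or Intel silicon behind this interface.
object FallbackDevice extends GpuDevice {
  def map(data: Array[Float], kernelSource: String): Array[Float] =
    data.map(x => x * x) // pretend the kernel squares each element
}

object GpgpuSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("gpgpu-sketch").master("local[*]").getOrCreate()
    val numbers = spark.sparkContext.parallelize(1 to 1000000).map(_.toFloat)

    // The point of a general binding: each partition hands its data to the
    // local device without the job knowing or caring whose hardware it is.
    val squared = numbers.mapPartitions { part =>
      FallbackDevice.map(part.toArray, "square_kernel").iterator
    }

    println(squared.take(5).mkString(", "))
    spark.stop()
  }
}
```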

People are already working on this. But to get the marketplace going, you need at least two ruthless competitors -- AMD and Nvidia, plus maybe Intel -- to cooperate on a standard, and at least one of them thinks secrecy is the path to competitive success. Gosh, I want one!

Big data pain point No. 2: Multiple workload scaling

You have Docker. You have YARN. You have Spark, Tez, MapReduce, and whatever comes next. You also have different pools with different priorities and jobs that come up unexpectedly. You can “autoscale” on a PaaS if you’re deploying, say, a Java WAR file, but if you were hoping to do the same with Hadoop workloads, that still takes special effort.

Plus, what about the interaction between storage and processing? Sometimes you need to expand and distribute storage temporarily. I should be able to run my “end of month” batch and have Docker images autodeploy all over the place. Then, when that load tails off, the system should undeploy them and deploy whatever else needs the resources. The application or workload should put no effort whatsoever into this.
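As a rough sketch of that behavior, here is a toy example in the same hypothetical spirit: ClusterOrchestrator and LoggingOrchestrator are invented names, not a real Docker or YARN API. The only point is that the batch declares what it needs and the platform does the scaling up and the cleaning up.

```scala
// Hypothetical sketch of "the workload asks, the platform scales."
// None of these interfaces correspond to a real scheduler API.
object EndOfMonthBatch {

  // Invented interface standing in for whatever scheduler owns the cluster.
  trait ClusterOrchestrator {
    def scaleWorkers(image: String, count: Int): Unit
    def expandStorage(gigabytes: Int): Unit
    def release(image: String): Unit
  }

  // Toy implementation so the sketch runs; a real one would talk to the scheduler.
  object LoggingOrchestrator extends ClusterOrchestrator {
    def scaleWorkers(image: String, count: Int): Unit =
      println(s"deploying $count containers of $image")
    def expandStorage(gigabytes: Int): Unit =
      println(s"attaching ${gigabytes}GB of temporary scratch storage")
    def release(image: String): Unit =
      println(s"undeploying $image and returning its resources to the pool")
  }

  def runBatch(): Unit = println("running end-of-month reports")

  def main(args: Array[String]): Unit = {
    val orchestrator: ClusterOrchestrator = LoggingOrchestrator

    // Burst up for the monthly batch...
    orchestrator.expandStorage(500)
    orchestrator.scaleWorkers("reports-batch:latest", 40)
    runBatch()

    // ...then get out of the way without the application doing anything clever.
    orchestrator.release("reports-batch:latest")
  }
}
```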
