By Dr. Hossein Eslambolchi
Big data is a reality for every IT group whose job is to provide the business with information about its customers, prospects and markets in a manner that is quick, easy and efficient. Balancing this with capital and operational cost constraints is the hardest part. The good news is that two factors are intersecting: a never-before-seen wealth of data from transactions and interactions, plus a new level of technology innovation to drive down costs.
Data from Facebook, Twitter, video and messaging gives us a strong foundation for quickly deriving results and fine-tuning market predictions. Enterprises are automated to the point where machines and sensors generate terabyte volumes each day, all of which must be collected, stored and analyzed.
Data management vendors have undergone upheaval, as evidenced by the adoption of open-source Apache Hadoop. Now the promise of a Hadoop-based infrastructure is emerging: correlate large volumes of structured and unstructured data, scale inexpensively and gain market insights quickly. Will Hadoop and its ecosystem provide new enterprise capabilities in terms of resilience, security and ease of use?
Key technology considerations for today’s CIO looking to capitalize on big (and diverse) data include:
• Coexistence with other database and data management environments. These include standard relational environments (think Oracle) and analytical data warehouses (think Teradata). The caveat: Data movement and integration are necessary, but they increase capital expenditures on extract, transform and load (ETL) tools and can also increase operational costs. A minimal ETL sketch follows this list.
• Storage and hardware. Innovative compression and data deduplication are critical to address big data head-on. Great strides have been made, and we are now seeing multiple layers of compression yielding up to a 40-fold reduction in capacity when compared to raw data. However, it’s important to consider how much of this compressed data will eventually require reinflation, and how this will affect your capacity. For example, if you are going to experience a 30 percent increase in demand for capacity upon reinflation, it may not be worth doing the compression in the first place. The back-of-the-envelope calculation after this list shows how quickly reinflation can erode the savings.
• Query and analytics. Not all data is equal, and the range of queries and business analytics varies widely depending on the use case. Having the right tools for the job is a must. In many cases, a rapid-response SQL query will be sufficient to yield the needed information (see the query sketch after this list). In other cases, a deep analytics query requires a business intelligence tool with full dashboard and visualization capabilities. Deploying the right mix of proprietary technologies alongside open-source Hadoop will help your organization realize the promise of fast analytics at scale, while keeping operational costs from spiraling.
• Scale and manageability. As organizations struggle with disparate database and analytics environments, the ability to scale up and out is important. Easy scale-out is why Hadoop has been quickly adopted by the enterprise. Massively parallel processing across low-cost commodity server clusters is key, and it requires less specialized skill sets than other data management options, which directly affects your IT resource investment. A streaming MapReduce sketch follows this list.
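
To make the ETL point concrete, here is a minimal sketch of the kind of data movement involved: pulling rows out of a relational store and staging them as flat files for Hadoop ingestion. SQLite and the `customers` table here are hypothetical stand-ins chosen purely for illustration; a real pipeline would target Oracle, Teradata or another source through a proper ETL tool.

```python
# Minimal ETL sketch: extract rows from a relational store and stage them as
# flat files for Hadoop ingestion (for example, copied into HDFS afterwards).
# The database, the `customers` table and its columns are hypothetical.
import csv
import sqlite3

def extract_to_csv(db_path: str, out_path: str) -> int:
    """Extract customer rows and write them to a CSV staging file."""
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute(
            "SELECT id, name, region, last_purchase FROM customers"
        )
        rows = cursor.fetchall()
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow([col[0] for col in cursor.description])  # header
            writer.writerows(rows)
        return len(rows)
    finally:
        conn.close()

if __name__ == "__main__":
    count = extract_to_csv("crm.db", "customers_staging.csv")
    print(f"Staged {count} rows for Hadoop ingestion")
```

Every hop like this one is another tool to license, schedule and monitor, which is where the capital and operational costs mentioned above come from.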
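The storage trade-off lends itself to a quick back-of-the-envelope calculation. The sketch below assumes a 1 PB raw data set and uses the 40-fold compression figure cited above, then shows how the effective reduction shrinks as more of the data has to live in reinflated (uncompressed) form; the volumes and fractions are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope capacity planning for compressed storage.
# Only the 40x compression ratio comes from the discussion above;
# the raw volume and reinflated fractions are illustrative assumptions.

def effective_capacity(raw_tb: float, compression_ratio: float,
                       reinflated_fraction: float) -> float:
    """Capacity needed when a fraction of the data must sit reinflated (raw)
    while the rest stays compressed."""
    compressed_tb = (raw_tb * (1 - reinflated_fraction)) / compression_ratio
    reinflated_tb = raw_tb * reinflated_fraction
    return compressed_tb + reinflated_tb

raw = 1000.0   # 1 PB of raw data (assumption)
ratio = 40.0   # 40-fold compression, per the figure above
for frac in (0.0, 0.1, 0.3):
    need = effective_capacity(raw, ratio, frac)
    print(f"reinflated fraction {frac:.0%}: {need:,.1f} TB needed "
          f"({raw / need:.1f}x effective reduction)")
```

With nothing reinflated the full 40x holds, but reinflating just 30 percent of the data drags the effective reduction down to roughly 3x, which is why the reinflation question belongs in the business case before the compression is deployed.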
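For the rapid-response SQL case, a single aggregate query is often all that is needed. The sketch below runs against SQLite with a hypothetical `orders` table as a stand-in for whatever engine actually holds the data, whether a relational warehouse or a SQL-on-Hadoop layer.

```python
# Sketch of a "rapid-response SQL query": one aggregate that answers a
# business question without a full BI/dashboard stack.
# sqlite3 and the `orders` table are hypothetical stand-ins.
import sqlite3

QUERY = """
SELECT region,
       COUNT(*)    AS order_count,
       SUM(amount) AS revenue
FROM orders
WHERE order_date >= '2012-01-01'
GROUP BY region
ORDER BY revenue DESC;
"""

conn = sqlite3.connect("sales.db")
for region, order_count, revenue in conn.execute(QUERY):
    print(f"{region}: {order_count} orders, ${revenue:,.2f}")
conn.close()
```

When the question instead calls for drill-down, trending or visualization, that is the point at which a dedicated business intelligence tool earns its cost.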
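Finally, the scale-out model itself can be illustrated with Hadoop Streaming, which lets plain scripts act as the mapper and reducer of a massively parallel job. The sketch below counts events per device in tab-separated sensor logs; the log format and field layout are assumptions made for illustration.

```python
#!/usr/bin/env python
# Sketch of Hadoop's scale-out model via Hadoop Streaming: the same script
# acts as mapper or reducer, run in parallel across commodity nodes.
# Assumed input format: device_id<TAB>timestamp<TAB>reading
import sys

def mapper():
    # Each mapper instance sees a slice of the input; emit (device_id, 1).
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            print(f"{fields[0]}\t1")

def reducer():
    # Hadoop delivers mapper output sorted by key, so counts can be summed
    # in a single pass with constant memory.
    current, count = None, 0
    for line in sys.stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = key, 0
        count += int(value or 1)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

Submitted through the Hadoop Streaming jar with the usual -input, -output, -mapper and -reducer arguments (the jar's exact location depends on your distribution), these two small functions run across however many commodity nodes the cluster provides, which is exactly the low-skill, low-cost parallelism driving Hadoop's enterprise adoption.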