Making the Most of Big Data

CIO Insight

By Dr. Hossein Eslambolchi
Date: December 2012

Big Data has been a much-discussed topic over the past few years, and to be honest, it is the first time in a long while that the tech industry's business requirements and product innovation seem to be synchronized on how to gain real business value. Decision makers are excited by the opportunities Big Data presents. Those opportunities come in the form of many more customer interactions within the marketplace; thanks to Facebook, Twitter, video and the rise of targeted, personalized messaging, there is a strong foundation for quickly deriving results and better market predictions. Enterprises have also become more automated: machines and sensors literally generate terabytes of data each day, all of which must be collected, stored and analyzed. In the communications sector alone, daily WAP logs for a Tier 1 provider run into the tens of billions of records, which amounts to several petabytes after just three months. What all of this boils down to is a flood of inbound, multi-structured data that IT organizations must manage in the most efficient and cost-effective way, while continuing to give business users access to the right information at the right time.
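
To put those volumes in perspective, here is a back-of-envelope calculation. The record count and record size below are illustrative assumptions, not figures from any specific provider, but they show how quickly raw logs reach petabyte scale:

    # Back-of-envelope sizing for raw log retention.
    # Record count and record size are illustrative assumptions only.
    RECORDS_PER_DAY = 30e9        # "tens of billions" of log records per day
    BYTES_PER_RECORD = 1024       # assume roughly 1 KB per record
    DAYS_RETAINED = 90            # roughly three months

    daily_bytes = RECORDS_PER_DAY * BYTES_PER_RECORD
    total_bytes = daily_bytes * DAYS_RETAINED

    TB = 1024 ** 4
    PB = 1024 ** 5

    print(f"Daily volume:  {daily_bytes / TB:,.1f} TB")
    print(f"90-day volume: {total_bytes / PB:,.2f} PB")

Under those assumptions the provider ingests roughly 28 TB of raw logs per day and is sitting on about 2.5 PB after three months – consistent with the "several petabytes" figure above.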

Vendors in the data management market have undergone tremendous upheaval in recent times, as witnessed by the adoption of open-source Apache Hadoop, which cut its teeth with teams of data scientists at Web 2.0 leaders such as Yahoo and LinkedIn, building on ideas Google first published. A Hadoop-based infrastructure is now seen as the optimal way to correlate volumes of structured and unstructured data, scale inexpensively and gain quick insight into what is happening in the market. In 2012, we will see Hadoop and the broader ecosystem of players deliver new capabilities that strive to make it enterprise-grade: greater resilience, stronger security, and easier deployment and use.

Key considerations for today's CIO, who must take advantage of innovative technologies in order to capitalize on big and diverse data, include:


  • Ability to co-exist alongside other database and data management environments, including standard relational databases (think Oracle) and analytical data warehouses (think Teradata). The caveat here is that while you must be able to move data easily in and out of the various environments, you should strive for an architecture that minimizes how often you actually have to do so. Data movement and integration are necessary, but frequent movement drives up capital expenditure on various ETL-type tools as well as operational costs.
  • Storage and hardware are critical components of any IT budget – in fact, storage accounts for about 17% of the overall budget – but as data volumes grow and future capacity needs must be factored in, revisiting capacity every 6-12 months can be extremely disruptive. Market-leading compression and data de-duplication are absolutely critical because they address Big Data head-on. Great strides have been made in this area, and we are now seeing multiple layers of compression yield up to a 40X reduction compared to raw data. Compression that can be queried without re-inflation is a key consideration, since a 30% performance hit on re-inflation may not be worth a small degree of compression (a rough sketch of this trade-off follows the list).
  • Query and analytics – Not all data is equal, and the range of queries and business analytics varies widely, depending of course on the use case. Having the right tool for the job is a must. In many cases a rapid-response SQL query gives you what is needed; in other cases, a long-running deep analytics query requires a BI tool with full dashboarding and visualization. With newer Hadoop environments, there is a greater need for MapReduce skills – essentially Java programming capabilities – which may be considered a luxury in many IT groups (a toy illustration of the MapReduce model also follows the list). Deploying the right mix of proprietary technologies alongside open-source Hadoop is very important if you need to realize the promise of fast analytics at scale while keeping operational costs from spiraling.
  • Which brings us to scale and manageability – As organizations struggle with disparate database and analytics environments, the ability to scale up and out becomes more important. Easy scale-out is essentially why Hadoop has been so widely adopted by the enterprise. Massively parallel processing across low-cost commodity server clusters is key, and it actually requires less specialized skill-sets, which directly lowers the required IT resource investment.
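
On the compression point above, here is a rough, illustrative sketch of the trade-off between compression ratio and query-time re-inflation cost. The raw footprint, base query time, ratios and the 30% penalty are assumptions chosen purely for the example, not measurements of any particular product:

    # Illustrative trade-off between compression ratio and query-time
    # re-inflation cost. All figures are assumptions for this sketch,
    # not measurements of any particular product.
    RAW_PB = 2.5                  # assumed raw data footprint, in petabytes
    BASE_QUERY_MINUTES = 10.0     # hypothetical query time on raw data

    options = {
        # name: (compression ratio vs. raw, query-time penalty from re-inflation)
        "query-in-place, 40X compression": (40.0, 0.00),
        "modest 3X compression, 30% re-inflation hit": (3.0, 0.30),
    }

    for name, (ratio, penalty) in options.items():
        stored_tb = RAW_PB * 1024 / ratio
        query_minutes = BASE_QUERY_MINUTES * (1 + penalty)
        print(f"{name}: ~{stored_tb:,.0f} TB stored, ~{query_minutes:.1f} min per query")

With these made-up numbers, the 40X query-in-place option stores roughly 64 TB and keeps the 10-minute query, while the modest-compression option stores over 850 TB and pays an extra three minutes on every query – which is exactly the point of the bullet above.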
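
And on the query and analytics point, the following toy, single-process Python sketch only illustrates the MapReduce programming model (map, shuffle, reduce) that those skills revolve around. Real Hadoop jobs distribute these steps across a cluster and are typically written in Java, so treat this as a conceptual illustration rather than Hadoop code:

    from collections import defaultdict

    # Toy, single-process illustration of the MapReduce model:
    # map -> shuffle/group by key -> reduce. Not Hadoop code.

    def map_phase(lines):
        """Emit (word, 1) pairs, like a mapper counting event occurrences."""
        for line in lines:
            for word in line.split():
                yield word.lower(), 1

    def shuffle(pairs):
        """Group intermediate values by key, as the framework would."""
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return grouped

    def reduce_phase(grouped):
        """Sum the counts for each key, like a reducer."""
        return {key: sum(values) for key, values in grouped.items()}

    if __name__ == "__main__":
        sample_log_lines = [
            "page_view checkout page_view",
            "checkout error page_view",
        ]
        print(reduce_phase(shuffle(map_phase(sample_log_lines))))

Running it prints {'page_view': 3, 'checkout': 2, 'error': 1} – the same word-count pattern a first Hadoop MapReduce job typically implements, just without the cluster.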

Big Data is a reality for every IT group whose job is to provide the business with information about its customers, prospects and markets in the fastest, easiest and most efficient way. Balancing all of this with both capital and operational cost constraints is the hardest part. However, the good news about Big Data is that we have never before seen such a wealth of data from transactions and interactions, and never have we witnessed this level of technology innovation to really drive down costs. I personally look forward to watching how the next decade of data management will unfold. Have a great 2012.