Your big data analytics and business intelligence project is humming along nicely for now, but accessing the data through simple extensions of existing architecture may not be the best approach for the long term.
Consider the following properties of "big data":
- On the Web (and not part of a previous in-house data feed),
- On multiple clouds,
- Involving social media such as Facebook, Twitter, and LinkedIn,
- Unprecedented in size,
- Sometimes not yet "clean" or even available,
- Mostly unstructured, and
- Involving at least a few new tools, such as Apache's Hadoop and Hive, and graph/triple stores.
Separately, each of these might constitute a variant of an existing data warehouse setup. Combined, they represent such a different operating environment that, for planning purposes, you should treat the big data architecture and your existing architecture as separate questions. In other words, first ask what architecture works best for big data analytics, given likely future needs, then ask how you can combine that with an existing – probably data warehouse-oriented – data center architecture.
So what are those likely future needs? Initial indications are that each organization will want a unique mix of the following:
- Targeted access to a medium-sized amount of customer social media data, for customer loyalty and marketing purposes; real-time data not needed;
- The same for prospects, for sales purposes; here real-time data would be valuable;
- Real-time access to a small amount of social media data on any Web user who seeks access to company data, for security purposes;
- Real-time access to "strategic threat" data such as bad publicity or disasters affecting the company, usually small-scale but sometimes very large;
- Access to humongous amounts of big data for market analytics – real-time data not needed;
- Access to large to very large amounts of social media data for industry-specific or organization-specific new product research and development. Again, real-time data not needed, but the faster the better.
This mix determines the amount of data typically sought and the required delivery speed as well as any tradeoffs between "data cleanliness" and "data timeliness."
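One way to make this concrete is to record each need with a rough data volume and a real-time flag, then derive what a chosen mix demands. The sketch below is illustrative only: the `Volume` scale, the `Need` class, the dictionary keys, and the `requirements` function are all assumptions introduced here, and the volume labels are a loose reading of the list above (each need is tagged with its largest likely volume).

```python
from dataclasses import dataclass
from enum import IntEnum


class Volume(IntEnum):
    """Rough ordinal scale for data volume (an assumed scale, for illustration)."""
    SMALL = 1
    MEDIUM = 2
    LARGE = 3
    VERY_LARGE = 4


@dataclass(frozen=True)
class Need:
    name: str
    volume: Volume   # largest amount of data this need is likely to seek
    real_time: bool  # True if real-time delivery is required or valuable


# A loose encoding of the six needs listed above (keys are hypothetical).
NEEDS = {
    "customer": Need("customer loyalty and marketing", Volume.MEDIUM, False),
    "prospect": Need("prospects and sales", Volume.MEDIUM, True),
    "security": Need("security scans", Volume.SMALL, True),
    "threat": Need("strategic threats", Volume.VERY_LARGE, True),
    "market": Need("market analytics", Volume.VERY_LARGE, False),
    "r_and_d": Need("new product R&D", Volume.VERY_LARGE, False),
}


def requirements(mix):
    """Return (largest volume, whether any need wants real-time data) for a mix."""
    chosen = [NEEDS[key] for key in mix]
    return max(n.volume for n in chosen), any(n.real_time for n in chosen)


if __name__ == "__main__":
    volume, real_time = requirements(["customer", "security"])
    print(volume.name, real_time)  # MEDIUM True
```

An organization that only wants customer and security data can plan for modest volumes with a real-time path; adding market analytics pushes the volume to the top of the scale without changing the real-time requirement.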
Now let’s consider an optimal architecture for each of these cases:
- For targeted customer-data access, you may need querying tools on each cloud, feeding a common in-house data store where the data can be analyzed without exposing it to competitors.
- For targeted prospect and sales-process data, you may need to add local databases in each cloud for rapid delivery of information about a particular target.
- For security scans, you may need alerting and single-user-targeting querying software sitting next to Hadoop and feeding directly to in-house administrators.
- For "strategic threat" data, you may need local high-performance, high-capacity databases in each cloud that coordinate with each other via federation across clouds. These databases pre-analyze threats and, if possible, automatically react to them before they reach the data center or the rest of the organization, both of which should also be fed the threat data directly.
- For market analytics, you may need cloud-local “cache” high-performance databases to pre-filter data so that it can be reduced to data-warehouse size, and, if possible, to pre-clean the data, as existing extract, transform, load (ETL) tools may not yet be adapted to this new kind of data.
- For R&D or development purposes, you may need in-house but separate analytics databases, with data federation to allow querying across clouds.
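Two ideas recur in the cases above: pre-filtering data close to where it lives, and federating queries across clouds. The following is a minimal sketch of both, assuming in-memory SQLite databases as stand-ins for cloud-local stores. The table layout, the cloud names, and the functions `make_cloud_db`, `prefilter`, and `federated_query` are hypothetical; a real deployment would use a federation product and cloud-resident databases rather than this toy setup.

```python
import sqlite3


def make_cloud_db(rows):
    """Create an in-memory database standing in for one cloud-local store."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mentions (customer TEXT, body TEXT)")
    conn.executemany("INSERT INTO mentions VALUES (?, ?)", rows)
    return conn


def prefilter(conn, customer):
    """Cloud-local step: keep one customer's rows and drop empty ("unclean") text."""
    cur = conn.execute(
        "SELECT customer, TRIM(body) FROM mentions "
        "WHERE customer = ? AND body IS NOT NULL AND TRIM(body) <> ''",
        (customer,),
    )
    return cur.fetchall()


def federated_query(clouds, customer):
    """Run the same pre-filter on every cloud and merge the reduced results."""
    merged = []
    for cloud_name, conn in clouds.items():
        for cust, body in prefilter(conn, customer):
            merged.append((cloud_name, cust, body))
    return merged


if __name__ == "__main__":
    clouds = {
        "cloud_a": make_cloud_db([("acme", " great service "), ("other", "meh")]),
        "cloud_b": make_cloud_db([("acme", ""), ("acme", "slow delivery")]),
    }
    for row in federated_query(clouds, "acme"):
        print(row)
```

Only the filtered, trimmed rows leave each cloud, so the in-house store receives far less data than the clouds hold. That is the point of the cloud-local "cache" databases described above.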
Suppose you need all six? Then you might consider:
- Data federation and cross-database querying software such as products by Composite Software and Denodo.
- High-performance, high-capacity database technologies such as in-memory and columnar, from solutions such as EMC Greenplum or Sybase IQ.
- Low-cost, flexible, cloud-adapted querying/analytics tools like Birst or Tableau.
- An in-house grid architecture for research and development (R&D).
Now, how do you combine this with your existing architecture? Approaches fall into two camps, generally depending on the size of the organization:
- Small to medium-sized businesses (SMBs) often have no – or only a vestigial – data warehouse in place. In that case, platform as a service (PaaS) architectures that run as much as possible on the cloud are a good way station while the necessary data warehousing capability begins to arrive.
- Large businesses with mainframes, small-server farms, data warehouses, data marts, and other existing infrastructure should also create a PaaS architecture, preferably using solutions offered by existing vendors such as IBM that provide a way to combine PaaS on public clouds with existing business intelligence (BI)/analytics/data-warehousing architectures. However, you should anticipate that for some needs, these solutions do not yet suffice. In those cases, the technologies listed above should provide well-fitting extensions to vendor capabilities.
Above all, do not assume that simply inhaling big data in massive amounts from one cloud directly into the data warehouse is an optimal long-term solution. While you are chasing that hope, your competitors will be using their IT to nimbly perform targeted, deeper analytics on their customers of one and drive their brands deeper into your markets. Setting up firewalls between your in-house and cloud analytics capabilities is one thing; not having any public-cloud-resident analytics is quite another. To summarize:
- Accept that some of your analytics should take place outside corporate boundaries;
- Accept that big – and "unclean" – data needs separate handling;
- Accept that big data and traditional data, for best results, should have separate but coordinated architectures.
ABOUT THE AUTHOR: Wayne Kernochan is president of Infostructure Associates, an affiliate of Valley View Ventures. Infostructure Associates aims to provide thought leadership and sound advice to vendors and users of information technology. This document is the result of Infostructure Associates-sponsored research. Infostructure Associates believes that its findings are objective and represent the best analysis available at the time of publication.
This was first published in July 2012