DWH, BI & Big Data

New Trend

A data warehouse (DWH) is an industry-proven concept for preparing data for decision-support tasks. Business Intelligence (BI) encompasses methods such as OLAP and vendor-specific tools for conducting data analyses that support decision-making and improve information resources. Data in a warehouse is aggregated around subject areas, relies on the built-in integrity and consistency features of relational databases, and usually makes the evolution of a given fact transparent. Because operational data is stored for a relatively short period of time and is usually under heavy transaction load (OLTP), it particularly benefits from the transactional features of an RDBMS. A data warehouse contains incrementally growing data extracted from an online database. This extraction is facilitated by ETL tools (such as Pentaho DI) and relies upon SQL-based techniques. Data in a DWH is structured in data marts, each containing the set of information relevant to a specific business area.
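
As a rough illustration of the ETL step, the following minimal Python sketch extracts rows from an operational table, applies a simple aggregation, and loads the result into a warehouse fact table. The database files, table names (orders, fact_sales) and columns are assumptions made up for the example and do not correspond to any specific tool such as Pentaho DI.

    import sqlite3                     # stand-in for the OLTP source and the DWH target
    from collections import defaultdict

    # Extract: read rows from the operational (OLTP) database.
    oltp = sqlite3.connect("oltp.db")  # hypothetical source database
    rows = oltp.execute(
        "SELECT order_id, customer_id, amount, order_date FROM orders"
    ).fetchall()

    # Transform: aggregate order amounts per customer and day
    # (the kind of subject-oriented aggregation a data mart would hold).
    totals = defaultdict(float)
    for order_id, customer_id, amount, order_date in rows:
        totals[(customer_id, order_date)] += amount

    # Load: write the aggregated facts into the warehouse fact table.
    dwh = sqlite3.connect("dwh.db")    # hypothetical target database
    dwh.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales "
        "(customer_id INTEGER, order_date TEXT, total_amount REAL)"
    )
    dwh.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?)",
        [(c, d, t) for (c, d), t in totals.items()],
    )
    dwh.commit()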

Data in the DWH is the information source for diverse BI techniques and tools. BI applies an empirical inspect-and-adapt process to the business consumption of data over large time intervals. One of the tools used by BI is reporting, where data selected from the DWH can be prepared for each business aspect and visualized in an end-user-friendly way. Open source reporting tools such as Pentaho Reporting or BIRT can be used to generate sophisticated BI reports. Generated reports can be exported in multiple formats, e.g. MS Office PowerPoint. One should consider, however, that such reporting tools export complex charts not in their native format but as pictures, i.e. as read-only snapshots.

The main problem BI faced for years was the handling of unstructured and semi-structured data. Such data would be stored in BLOB fields, which limits querying of its content or demands some kind of pre-processing. Additionally, data volumes grow in geometric progression. In the study "The Digital Universe Decade - Are You Ready?" (IDC on behalf of EMC, May 2010), IDC claims that the total size of digital data created and replicated will grow to 35 zettabytes by 2020.

Traditional relational data management systems are built around up-front schema definition and relational references and are not optimal for storing semi-structured data intended for later analytical use. NoSQL databases can be divided into column-oriented data stores, key/value pair databases, and document databases.
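
To make the distinction concrete, the sketch below shows the same invented customer record as it might appear in a key/value store and in a document database; all field names are purely illustrative.

    # Key/value view: an opaque value addressed by a single key.
    # The store cannot query inside the value without extra processing.
    kv_store = {
        "customer:42": '{"name": "Alice", "city": "Berlin"}',
    }

    # Document view: the store understands the structure of the value,
    # so individual fields (and nested arrays) can be queried and indexed.
    document = {
        "_id": 42,
        "name": "Alice",
        "city": "Berlin",
        "orders": [
            {"order_id": 1001, "amount": 99.5},
            {"order_id": 1002, "amount": 12.0},
        ],
    }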

Apache Cassandra, a derivative of Amazon's Dynamo, is a key/value store. It puts several high-availability ideas at the forefront, the most important of which is eventual consistency. Eventual consistency implies that there can be small intervals of inconsistency between replicated nodes as data gets updated among peer-to-peer nodes. Because of its masterless architecture, Cassandra scales easily with the addition of nodes.
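
A minimal sketch of how this trade-off surfaces in practice, assuming the DataStax Python driver and a locally running cluster; the keyspace and table names (demo, users) are made up for the example. Per-query consistency levels let the application choose between availability (ONE) and a stronger guarantee (QUORUM).

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["127.0.0.1"])     # masterless: any node can coordinate a request
    session = cluster.connect("demo")    # hypothetical keyspace

    # Fast, highly available write: one replica acknowledging is enough;
    # the remaining replicas converge later (eventual consistency).
    write = SimpleStatement(
        "INSERT INTO users (user_id, name) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.ONE,
    )
    session.execute(write, (42, "Alice"))

    # Stronger read: a quorum of replicas must agree, narrowing the
    # window of inconsistency at the cost of latency.
    read = SimpleStatement(
        "SELECT name FROM users WHERE user_id = %s",
        consistency_level=ConsistencyLevel.QUORUM,
    )
    for row in session.execute(read, (42,)):
        print(row.name)
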
The document database MongoDB provides strong consistency, a single master (per shard), a richer data model, and secondary indexes. The last two of these attributes go hand in hand: if a system allows modeling multiple domains, as would, for example, be required to build a complete web application, then it becomes necessary to query across the entire data model, which requires secondary indexes.
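
The sketch below, assuming the PyMongo driver, a local MongoDB instance and an invented articles collection, shows the richer document model together with a secondary index on a non-key field and a query that can use it.

    from pymongo import ASCENDING, MongoClient

    client = MongoClient("mongodb://localhost:27017")   # assumed local instance
    db = client["blog"]                                  # hypothetical database

    # Rich document model: nested fields and arrays in a single record.
    db.articles.insert_one({
        "title": "NoSQL in practice",
        "author": "alice",
        "tags": ["nosql", "mongodb"],
        "comments": [{"user": "bob", "text": "Nice overview"}],
    })

    # Secondary index on a field other than the primary key (_id) ...
    db.articles.create_index([("author", ASCENDING)])

    # ... which allows querying across the data model, not only by _id.
    for doc in db.articles.find({"author": "alice"}, {"title": 1, "_id": 0}):
        print(doc["title"])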

Big Data technology is often associated with the Apache open source project Hadoop. This project is a clone of Google's MapReduce framework, first published in 2004. Together with the Google File System (GFS), MapReduce has been used to scale with ever-growing data processing needs. Hadoop laid the groundwork for the rapid growth of NoSQL and, although it is best known for MapReduce and its distributed filesystem (HDFS), the name is also used for a family of related projects that fall under the umbrella of infrastructure for distributed computing and large-scale data processing.

Some of the Hadoop sub-projects:

  • Hive - A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which the runtime engine translates into MapReduce jobs) for querying the data; a query sketch follows this list.
  • HBase - A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
  • Avro - A serialization system for efficient, cross-language RPC, and persistent data storage.
  • Pig - A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
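
As a small illustration of the Hive sub-project, the following sketch submits a SQL-like query that Hive translates into MapReduce jobs over files in HDFS. It assumes the PyHive client library, a HiveServer2 instance on localhost:10000, and a hypothetical web_logs table; none of these names come from the projects listed above.

    from pyhive import hive   # assumed client library for HiveServer2

    conn = hive.Connection(host="localhost", port=10000, database="default")
    cursor = conn.cursor()

    # A SQL-like HiveQL query; Hive's runtime engine translates it into
    # MapReduce jobs that scan the files backing the table in HDFS.
    cursor.execute(
        "SELECT page, COUNT(*) AS hits FROM web_logs "
        "GROUP BY page ORDER BY hits DESC LIMIT 10"
    )
    for page, hits in cursor.fetchall():
        print(page, hits)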

MapReduce derives its ideas and inspiration from concepts in the world of functional programming, where map and reduce are commonly used functions. A MapReduce algorithm breaks up both the query and the data set into constituent parts - that is the mapping. The mapped components can then be processed simultaneously on many nodes, and the partial results are combined - reduced - to return the final result rapidly.
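
To make the map and reduce steps concrete, here is a self-contained Python word-count sketch built on the language's own map and functools.reduce. In a real Hadoop job the mapped chunks would be processed in parallel on different nodes; here they are processed sequentially for clarity, and the input text is invented.

    from collections import Counter
    from functools import reduce

    # The data set split into constituent parts (in Hadoop: input splits in HDFS).
    chunks = [
        "big data needs new tools",
        "hadoop brings mapreduce and hdfs",
        "mapreduce splits data and reduces results",
    ]

    # Map step: each chunk is turned independently into partial word counts,
    # so the chunks could be handled simultaneously on different nodes.
    def map_chunk(chunk):
        return Counter(chunk.split())

    mapped = map(map_chunk, chunks)

    # Reduce step: the partial results are merged into the final answer.
    def reduce_counts(left, right):
        return left + right          # Counter addition sums the counts per word

    word_counts = reduce(reduce_counts, mapped, Counter())
    print(word_counts.most_common(3))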