Big data architecture is the foundation for big data analytics. Think of big data architecture as an architectural blueprint of a large campus or office building. Architects begin by understanding the goals and objectives of the building project, and the advantages and limitations of different approaches. It’s not an easy task, but it’s perfectly doable with the right planning and tools.
System architects go through a similar process to plan big data architecture. They meet with stakeholders to understand the company's objectives for its big data, then plan a computing framework that covers hardware and software, data sources and formats, analytics tools, data storage, and how results will be consumed.
If you’re in the market for big data tools, see our list of the top big data companies.
Not everyone needs big data architecture. Single computing tasks rarely involve more than 100GB of data, which does not require a big data architecture. Unless you are analyzing terabytes or petabytes of data, and doing it consistently, look to a scalable server instead of a massively scale-out architecture like Hadoop. If you need analytics, consider a scalable storage array that offers native analytics for stored data.
You probably do need big data architecture if any of the following applies to you:
With use cases like these, chances are that your organization will benefit from a big data architecture expressly built for these challenging tasks. Plan for an environment that will capture, store, transform, and communicate this valuable intelligence.
Big data architecture includes mechanisms for ingesting, protecting, processing, and transforming data into filesystems or database structures. Analytics tools and analyst queries run in the environment to mine intelligence from the data, and the results are output to a variety of destinations.
The architecture has multiple layers. Let's start with the four logical layers that exist in any big data architecture.
In addition to the logical layers, four major processes operate cross-layer in the big data environment: data source connection, governance, systems management, and quality of service (QoS).
Big data architecture folds myriad different concerns into one all-encompassing plan to make the most of a company's data mining efforts.
Let’s look at a big data architecture using Hadoop as a popular ecosystem. Hadoop is open source, and several vendors and large cloud providers offer Hadoop systems and support. There are also numerous open source and commercial products that expand Hadoop capabilities.
Hadoop architecture is cluster architecture: it runs on commodity servers, with recommended configurations of dual-CPU servers with 4-8 cores each and at least 48GB of RAM. (Accelerated analytics technologies like Apache Spark speed up the environment even more.) Storage must also be highly scalable.
Another option is a cloud Hadoop environment, where the cloud provider manages the infrastructure for you. There are tradeoffs: the cloud may add latency, you'll be in a shared environment, and you risk vendor lock-in. But the cloud is an excellent choice for a new Hadoop installation, or when you know that you don't want to grow your data center racks or IT staff to support on-premises Hadoop.
Loading data onto the clusters is an ongoing event. Hadoop supports both batch data, such as files or records loaded at specific times of day, and event-driven data, such as transactional data loaded as the transactions occur. Software tools for loading source data include Apache Sqoop for batch loading and Apache Flume for event-driven data loading.
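To make the two loading patterns concrete, here is a minimal, illustrative PySpark sketch. It is not Sqoop or Flume; it simply shows a scheduled batch load alongside an event-driven load from a stream, with hypothetical paths, broker address, and topic name.

```python
# Illustrative only: contrasting batch and event-driven ingestion with Spark APIs.
# Paths, the Kafka broker, and the topic name are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingestion-sketch").getOrCreate()

# Batch loading: pull a day's worth of exported records into HDFS at a scheduled time.
daily_orders = spark.read.csv("hdfs:///landing/orders/2024-01-01/", header=True)
daily_orders.write.mode("append").parquet("hdfs:///raw/orders/")

# Event-driven loading: subscribe to a stream and land transactions as they occur.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "transactions")
          .load())
(events.writeStream
 .format("parquet")
 .option("path", "hdfs:///raw/transactions/")
 .option("checkpointLocation", "hdfs:///checkpoints/transactions/")
 .start())
```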
Your big data environment will also stage the incoming data for processing, including converting data as needed and sending it to the correct storage in the right format. Additional activities include partitioning data and assigning access controls.
Once the system has ingested, identified, and stored the data, it automatically processes it. This is a two-step process of transforming the data and analyzing it. Transforming the data simply means processing it into analytics-ready formats and/or compressing it.
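As a rough illustration of the transform step, the following sketch converts raw CSV into a compressed, analytics-ready Parquet layout. It uses PySpark for brevity, and the paths and column names are hypothetical; in classic Hadoop the same step would be a MapReduce job, described next.

```python
# Illustrative only: turning raw CSV into an analytics-ready, compressed columnar format.
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, col

spark = SparkSession.builder.appName("transform-sketch").getOrCreate()

raw = spark.read.csv("hdfs:///raw/orders/", header=True, inferSchema=True)

analytics_ready = (raw
                   .withColumn("order_date", to_date(col("order_ts")))
                   .dropDuplicates(["order_id"]))

# Parquet with snappy compression: columnar, splittable, and ready for queries.
(analytics_ready.write
 .mode("overwrite")
 .option("compression", "snappy")
 .partitionBy("order_date")
 .parquet("hdfs:///analytics/orders/"))
```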
In Hadoop, this is MapReduce territory. MapReduce is the core Hadoop component that distributes (maps) processing across nodes and aggregates (reduces) the data returned in response to a query. MapReduce achieves high performance through parallel operations across massive clusters, and its fault tolerance reassigns work from failing nodes. MapReduce works on both structured and unstructured data.
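A minimal word-count example in the Hadoop Streaming style shows the two phases. In a real job the mapper and reducer run as separate scripts across the cluster, with Hadoop handling the sort between them; this self-contained sketch runs locally on standard input.

```python
# Illustrative only: word count in the map/reduce style.
import sys
from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word seen.
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reducer(pairs):
    # Reduce phase: pairs arrive grouped by key; sum the counts per word.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    for word, total in reducer(mapper(sys.stdin)):
        print(f"{word}\t{total}")
```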
Many analysts and vendors run MapReduce with additional filters, such as adding collaborative filtering to MapReduce to identify user preferences in Twitter data. Other analytics products replace it entirely, such as Google's proprietary Cloud Dataflow.
One of Hadoop’s shining features is that once data is processed and placed, different analytics tools can operate on the unchanging data set. There is no need to re-process it for different tools, or to copy it to different locations. The same copy of data serves for all queries.
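As a small sketch of this, assuming hypothetical paths and columns, two unrelated analyst queries can run against the same processed Parquet files without copying or re-processing them:

```python
# Illustrative only: several queries sharing one processed data set.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared-dataset-sketch").getOrCreate()

orders = spark.read.parquet("hdfs:///analytics/orders/")
orders.createOrReplaceTempView("orders")

# Analyst 1: revenue by day.
revenue = spark.sql(
    "SELECT order_date, SUM(amount) AS revenue FROM orders GROUP BY order_date")

# Analyst 2: repeat-purchase behavior, against the exact same files.
repeats = spark.sql(
    "SELECT customer_id, COUNT(*) AS purchases FROM orders "
    "GROUP BY customer_id HAVING COUNT(*) > 1")
```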
Output covers a variety of destinations, including reports and dashboard visualizations for users, or triggers for next steps in business processes.
Micro- and macro-pipelines enable discrete processing steps. Micro-pipelines operate at a step-based level to create sub-processes on granular data. In a typical scenario, one source of data is customer transactional data from the company’s primary data center. The data enters Hadoop so company analysts can investigate customer churn. However, compliance is an issue because the data includes customer credit card numbers. A micro-pipeline adds a granular processing step that cleans credit card numbers from the analyst team’s reports.
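A minimal sketch of such a micro-pipeline step, assuming hypothetical column names and paths (a real PCI-compliance control would involve far more), might mask card numbers like this:

```python
# Illustrative only: scrubbing credit card numbers before data reaches analyst reports.
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace, col

spark = SparkSession.builder.appName("pii-scrub-sketch").getOrCreate()

transactions = spark.read.parquet("hdfs:///raw/transactions/")

# Mask anything that looks like a 13-16 digit card number in free-text notes,
# and drop the dedicated card-number column outright.
scrubbed = (transactions
            .withColumn("notes",
                        regexp_replace(col("notes"), r"\b\d{13,16}\b", "[REDACTED]"))
            .drop("card_number"))

scrubbed.write.mode("overwrite").parquet("hdfs:///analytics/transactions_scrubbed/")
```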
Macro-pipelines operate at the workflow level. They define 1) workflow control: the steps that make up the workflow, and 2) actions: what occurs at each step to keep the workflow moving properly.
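As one illustrative way to express a macro-pipeline, the sketch below uses Apache Airflow, a common workflow tool that is not specifically prescribed here. The DAG captures workflow control (the ordered steps), and each task defines the action taken; the commands and schedule are hypothetical.

```python
# Illustrative only: a macro-pipeline as an Airflow DAG with three ordered steps.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_churn_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(task_id="ingest",
                          bash_command="sqoop import --connect jdbc:mysql://db/orders --table orders")
    scrub = BashOperator(task_id="scrub_pii",
                         bash_command="spark-submit /jobs/scrub_cards.py")
    analyze = BashOperator(task_id="churn_model",
                           bash_command="spark-submit /jobs/churn.py")

    # Workflow control: ingest, then scrub, then analyze.
    ingest >> scrub >> analyze
```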
Big data architecture takes ongoing attention and investment. Before you run screaming for the hills, remember that a well-executed big data architecture will do much of this for you behind the scenes. You can offload even more planning and management tasks if you’re working with consultants and service providers.
Despite complexity and cost, big data architecture lets you extract vital business information from your otherwise opaque data for higher profit and lower risk. Done well, these results are more than worth the price of admission.