Are your on-premise servers slowing you down? Is your data scattered across silos due to M&A growth? These challenges are common, and this primer will help you understand how modern data infrastructure relates to the AI wave.
Why modernize data infrastructure?
Henry Ford revolutionized manufacturing with an assembly line of parts. A modern company is an assembly line of information: product info goes to sales, orders go to the customer team, fulfillment info goes to accounting, and so on. Available, accurate data enables production, while unavailable or inaccurate data blocks it. When blockages occur, humans are incredibly resourceful: we hunt down the information or develop workarounds to eventually get production going again.
AI can handle many jobs in this production line, as we discussed in October’s jobs framework, with one key assumption: data availability and access. Unlike today’s production, which depends on humans to bridge gaps, AI systems will rely entirely on data. Without modern infrastructure, AI’s potential to automate processes will remain unrealized. Consider retrieval-augmented generation (RAG), in which a model looks up relevant company data before answering: its answers are only as good as the data it can retrieve. An organization can fully participate in AI innovation only if its data infrastructure makes accurate data available.
What are the parts of a modern data stack?
The modern data stack is made up of five distinct layers: Ingestion, Storage, Processing, Analytics, and AI and Machine Learning.
Let’s dive into each layer:
1. Ingestion: Captures and brings data into the system from various sources like apps, sensors, APIs, or files. A continuous flow of raw information allows businesses to work with real-time data.
Example: Airline company JetBlue uses Fivetran to pull maintenance data from its on-premise servers into central systems, enabling predictive maintenance and reducing flight delays.
2. Storage: Organizes and securely stores data centrally to avoid silos and ensure data accessibility.
Example: Kitchen appliance maker Sub-Zero stores diagnostic data from millions of connected appliances on Snowflake, allowing the customer care department to help owners quickly solve issues over the phone, reducing the need for costly in-person service calls.
3. Processing: Transforms raw data into consistent, usable formats. This step also finds and removes errors, increasing reliability and preparing data for analysis.
Example: Solar power company Sunrun grew through acquisition, and the acquired businesses used different data formats (e.g., different conventions for contract lengths, pricing, and billing cycles). Sunrun uses dbt to transform data regardless of source, resulting in consistent, usable data across the consolidated company.
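To make the Processing step concrete, here is a minimal sketch of unifying records that follow different conventions. It is not Sunrun's or dbt's actual logic: the two sources, their field names, and their unit conventions are all hypothetical, chosen only to show what "transforming regardless of source" can look like.

```python
from dataclasses import dataclass

# Hypothetical raw records from two acquired businesses. The field names
# and unit conventions are assumptions made up for this sketch.
SOURCE_A = [{"contract_id": "A-1", "term_years": 20, "monthly_price_usd": 120.0}]
SOURCE_B = [{"id": "B-7", "term_months": 300, "annual_price_cents": 150000}]


@dataclass(frozen=True)
class Contract:
    """One shared format: terms in months, prices in dollars per month."""
    contract_id: str
    term_months: int
    monthly_price_usd: float


def from_source_a(row: dict) -> Contract:
    # Source A records terms in years and prices per month in dollars.
    return Contract(row["contract_id"], row["term_years"] * 12, row["monthly_price_usd"])


def from_source_b(row: dict) -> Contract:
    # Source B records terms in months and prices per year in cents.
    return Contract(row["id"], row["term_months"], row["annual_price_cents"] / 100 / 12)


def unify() -> list[Contract]:
    """Convert every source into the shared Contract format."""
    return [from_source_a(r) for r in SOURCE_A] + [from_source_b(r) for r in SOURCE_B]
```

Once every record is a `Contract`, downstream reports and models no longer need to know which business a record came from.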
4. Analytics: Uses the processed and stored data to find trends, patterns, and insights.
Example: Automotive parts distributor Örum uses Tableau to create and share reports of key sales metrics, allowing the team to shift their time from creating reports to distilling insights.
5. AI and Machine Learning: Similar to the Analytics layer, the AI layer leverages the stored and processed data to make predictions, automate processes and answer questions.
Example: Buy-now-pay-later company Klarna uses OpenAI to power its AI customer support chatbot. The chatbot has access to Klarna’s central database, allowing it to answer and resolve customer queries.
A sixth and final layer cuts across these five: Data Governance and Security, which ensures data privacy and compliance. Governance refers to standards and policies for secure access, while security is about guarding sensitive information against breaches.
What about Data Lakes & Warehouses?
Data lakes and warehouses fit primarily into the Storage layer, and are used depending on the types of data stored:
Data Warehouses: Popularized in the 1980s-1990s to systematize business reporting, data warehouses offer fast, SQL-compatible access for generating reports and insights. Before data is stored, it goes through a process known as ETL (extract, transform, load) to organize it. This makes warehouses slow to set up but well-organized for ongoing use. Structured data (i.e. tables) fits well in warehouses, such as information in CRMs, accounting systems, and inventory records.
Data Lakes: Popularized in the 2010s, data lakes store raw data. Data loading is fast because cleaning and transformation happen during analysis. Unstructured data—like customer communications, images, installation guides, architectural plans, or product reviews—fits well in lakes. Data lakes are ideal for machine learning use cases because they allow for raw data exploration, uncovering unexpected correlations and insights.
The combination of these two storage architectures is known as a data lakehouse.
Data Lakehouses: Introduced in the late 2010s and gaining adoption now, lakehouses accommodate both unstructured and structured data while providing capabilities for users to organize and explore data according to their needs.
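The difference between the warehouse and lake approaches described above can be sketched in a few lines. This is an illustration only, using Python's built-in `sqlite3` as a stand-in for a warehouse and a plain list as a stand-in for a lake. The event records and field names are hypothetical.

```python
import json
import sqlite3

# Hypothetical raw events; the field names are assumptions for illustration.
RAW_EVENTS = [
    '{"order_id": 1, "amount": "19.99", "note": "gift wrap"}',
    '{"order_id": 2, "amount": "5.00"}',
]


def load_warehouse_style(events: list[str]) -> sqlite3.Connection:
    """ETL: parse and clean each record BEFORE loading it into a fixed table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL NOT NULL)"
    )
    for line in events:
        record = json.loads(line)                              # extract
        row = (record["order_id"], float(record["amount"]))    # transform
        conn.execute("INSERT INTO orders VALUES (?, ?)", row)  # load
    return conn


def load_lake_style(events: list[str]) -> list[str]:
    """Lake: store the raw records untouched, so loading is just a copy."""
    return list(events)


def total_from_lake(lake: list[str]) -> float:
    """Structure is applied only at analysis time."""
    return sum(float(json.loads(line)["amount"]) for line in lake)
```

Note the trade-off: the warehouse-style table is immediately queryable with SQL but has dropped the free-text `note` field, while the lake-style copy keeps everything and defers the parsing work to whoever analyzes it later.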
How to get started with data infrastructure
Steve Jobs famously said: “You've got to start with the customer experience and work backward to the technology. You can't start with the technology and then try to figure out where to sell it.”
The same is true with data infrastructure: start with an underlying issue for your staff or customers where the root cause is data accessibility. Solving this builds the case for modernizing infrastructure, paving the way for larger AI initiatives. Examples of common issues include:
Validation across multiple data sources: Centralizing data into a unified store allows mismatches across sources (e.g. customer and manufacturer) to be flagged and resolved before impacting orders or production timelines.
Accurate inventory: Consolidating inventory data from multiple legacy systems into a central store enables real-time product availability, allowing sales teams to identify cross-selling opportunities and improve customer satisfaction.
Delivery tracking: Consolidating shipment data across regional systems into a single platform gives customers real-time tracking and delivery updates.
Personalized recommendations: Consolidating browsing behavior, purchase history, and customer feedback into a central system allows individually tailored product recommendations.
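As an illustration of the first issue in this list, validation across multiple data sources, here is a minimal sketch of flagging mismatches once two sources sit in one place. The part numbers, prices, and the `find_mismatches` helper are hypothetical. A real system would run a comparable check inside its central store.

```python
# Hypothetical part prices from two sources, keyed by part number.
customer_orders = {"P-100": 12.50, "P-200": 8.00, "P-300": 4.25}
manufacturer_catalog = {"P-100": 12.50, "P-200": 8.75, "P-400": 3.10}


def find_mismatches(
    left: dict[str, float], right: dict[str, float]
) -> dict[str, list[str]]:
    """Report keys missing from either side and keys whose values disagree."""
    shared = left.keys() & right.keys()
    return {
        "missing_in_right": sorted(left.keys() - right.keys()),
        "missing_in_left": sorted(right.keys() - left.keys()),
        "value_mismatch": sorted(k for k in shared if left[k] != right[k]),
    }
```

Calling `find_mismatches(customer_orders, manufacturer_catalog)` on the sample data flags `P-300` and `P-400` as each missing from one side and `P-200` as a price disagreement, the kind of issue that is best resolved before it reaches an order.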
In the coming AI age, modern data infrastructure is a prerequisite for competitiveness. Start by identifying a specific business challenge, and use it as a springboard to modernize your data systems. Early investment ensures your organization can innovate, scale, and move ahead as the world becomes ever more data-driven.
___
Glossary
Structured Data: Data stored in fixed fields and formats (e.g., tables in a database). It’s straightforward to query and analyze because it follows a clear schema.
Unstructured Data: Information without a predefined model or organization (e.g., text documents, images, videos). It requires extra processing to make it analytics-ready.
Data Warehouse: A highly organized repository for structured data. Designed for fast, reliable reporting and analysis (e.g., standardized dashboards, KPI tracking).
Data Lake: A large storage system that can hold raw data in any format—structured or unstructured—without enforcing a predefined schema. Flexible but needs careful management.
Data Lakehouse: A hybrid approach combining the flexibility of a data lake with the governance and performance features of a data warehouse.
Server: A powerful computer that manages network resources, handles requests, and provides data or services to other machines or users.
On Premise: Hardware and software infrastructure physically located and managed within a company’s own facilities.
Cloud: Computing services (e.g., servers, storage) delivered over the internet, offering on-demand scalability and reduced on-site hardware needs.
Multicloud: The practice of using multiple cloud services from different providers to optimize cost, performance, and reliability.
ETL (Extract, Transform, Load): The process of moving data from one system to another, cleaning and restructuring it along the way.
Metadata: Data about data—labels, tags, or descriptions—that helps users find, understand, and manage information.