TechDogs-"Data Modelling In The Age Of AI: Why Structure Still Matters"


Data Modelling In The Age Of AI: Why Structure Still Matters

By Vikramsinh Ghatge


Overview

We invited Kshitij Aranke, a data engineer with almost a decade of experience across US and UK startups and big tech companies, including Amazon, LinkedIn, and dbt Labs, to share his insights on data modelling in the age of AI. With the data landscape having gone through at least three major waves of change over the last ten years, his perspective couldn't be more timely.
 
Based on your experiences working with large tech companies and startups, how has the role of data modelling changed as companies are moving towards developing and utilizing AI-driven products and making decisions using AI? Are there still any common misconceptions companies have today about data modelling?

The biggest shift is velocity.

Data modelling used to be about structuring data for reporting: carefully designed schemas, long iteration cycles, and predictable downstream use in dashboards. It was optimized for stability.

In AI-driven environments, that model breaks down.

Today, data modelling sits directly in the critical path of product development. It’s no longer a downstream concern. It determines how quickly teams can experiment, train models, ship features, and iterate. The faster you can define, evolve, and trust your data model, the faster your AI product moves.

Modern systems demand data models that are:
 

  • Continuously evolving rather than fixed

  • Interpretable by both humans and machines

  • Designed for real-time and event-driven use cases

  • Aligned across product, analytics, and ML workflows


In other words, data modelling has shifted from a reporting discipline to a velocity enabler.

The most common misconception I still see is the belief that data modelling is no longer worth the ROI, especially in the age of AI.

This idea usually comes from two assumptions:
 

  1. Storage is cheap, so structure doesn’t matter

  2. AI can “figure it out” later


Both are fundamentally flawed.

In reality, poor data modelling creates compounded costs:
 

  • Slower experimentation cycles

  • Unreliable model outputs

  • Increased debugging and data reconciliation overhead

  • Misaligned metrics across teams


So, AI doesn’t resolve these problems; it amplifies them at scale. And the question isn’t whether data modelling is worth the investment. It’s whether you can afford the drag on velocity and trust without it. In high-performing AI teams, strong data modelling isn’t overhead; it’s a force multiplier for speed and decision quality.

Many organizations are racing to adopt AI, but they often find that their underlying data structures were designed for reporting rather than intelligent systems. In your experience, what are the most common data modelling gaps that prevent organizations from successfully building AI or machine learning applications?

Most organizations are struggling because they lack intentionality. Two patterns appear across organizations.

First, there's a complete lack of standardized data definitions. Different systems and teams store and analyze the same core entities (i.e., customers, transactions, and products), and those same teams and systems apply different criteria for what counts as a customer in their datasets. A customer as defined by one team may not match the definition used by another. Similarly, revenue, churn, and engagement metrics that were once aligned start to diverge. This creates confusion in reporting. Moreover, in the context of AI development, these differences are a significant barrier to building machine learning systems. To detect a signal in noisy data, machine learning systems require consistent definitions. When data definitions are inconsistent or contradictory, models do not learn meaningful patterns; instead, they learn patterns based on random variability.

Second, data is typically ingested directly from the source systems where it was created without being transformed to meet the requirements of its intended use. Many teams ingest raw data from operational systems and expect that either analysts or other downstream consumers will transform and understand the data at some future point. However, source systems are designed primarily to support transactional operations, not learning or inferential operations. As such, organizations are left with duplicate and conflicting records, incomplete relationships between entities, and no clearly defined structure for behavioral or event-based data. All of this means that transformation logic is pushed down into each of the various use cases where the data is used, rather than being addressed in a centralized manner.
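As a rough illustration of handling this in one central place rather than in every use case, the sketch below collapses duplicate and conflicting customer records into a single canonical record. The field names (`email`, `plan`, `updated_at`), the sample rows, and the "most recent update wins" rule are all assumptions made for this example, not something described in the interview.

```python
def canonical_customers(records):
    """Return one record per customer, keyed by normalized email.

    Assumption for this sketch: when records conflict, the one with the
    latest `updated_at` (an ISO 8601 date string) is treated as correct.
    """
    latest = {}
    for rec in records:
        # Normalize the key so "Ann@Example.com " and "ann@example.com" match.
        key = rec["email"].strip().lower()
        current = latest.get(key)
        # ISO 8601 date strings compare correctly as plain strings.
        if current is None or rec["updated_at"] > current["updated_at"]:
            latest[key] = {**rec, "email": key}
    return sorted(latest.values(), key=lambda r: r["email"])


# Hypothetical raw rows, as they might arrive from two source systems.
raw = [
    {"email": "Ann@Example.com ", "plan": "free", "updated_at": "2025-01-05"},
    {"email": "ann@example.com", "plan": "pro", "updated_at": "2025-03-01"},
    {"email": "bob@example.com", "plan": "free", "updated_at": "2025-02-10"},
]

customers = canonical_customers(raw)
for c in customers:
    print(c["email"], c["plan"])
```

Every downstream consumer that reads `customers` instead of `raw` sees the same answer to "what plan is Ann on?", which is the point of doing the transformation once.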

As a result of all this, data is fragmented both in terms of its content and how it can be used. Each team essentially builds its own version of reality with the data, resulting in further inconsistencies and inefficiencies throughout the organization. Ultimately, such fragmentation has predictable consequences for the development of AI systems. They become increasingly difficult to develop, take longer to complete, and ultimately are less likely to provide reliable results.

In many companies, the same metric can have multiple definitions depending on the team using it. As analytics expands across product, marketing, and operations, this inconsistency becomes more visible. Why are shared metric definitions important when you expand analytics across teams?

The major challenge in scaling analytics and AI capabilities across an organization is not how to achieve consistency across stakeholders. It is how to be fair to the various stakeholders within the organization.

Different departments will have legitimate reasons to define the same metric (e.g., ARR) differently. For example, a growth team may define ARR by contracts signed, whereas finance defines ARR by the actual cash received from those contracts. Both are valid, since each team is optimizing for a different outcome.
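To make the ARR example concrete, here is a minimal sketch in which both definitions run over the same contracts and produce different numbers. The `Contract` fields, the function names, and the figures are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    customer: str
    annual_value: float   # value of the signed contract per year
    cash_received: float  # cash actually collected against that contract


# Hypothetical sample data, for illustration only.
contracts = [
    Contract("acme", 120_000.0, 120_000.0),
    Contract("globex", 60_000.0, 15_000.0),
    Contract("initech", 24_000.0, 0.0),
]


def arr_signed(rows):
    """Growth view: total annual value of all signed contracts."""
    return sum(c.annual_value for c in rows)


def arr_cash(rows):
    """Finance view: total cash actually received from those contracts."""
    return sum(c.cash_received for c in rows)


print("ARR (contracts signed):", arr_signed(contracts))  # 204000.0
print("ARR (cash received):", arr_cash(contracts))       # 135000.0
```

Neither function is wrong. Giving each one its own explicit name is one simple way to keep two valid definitions from being mistaken for a single metric.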

The problem arises because both definitions exist simultaneously, but with no coordination or alignment.

At small scales, the result is confusion in the dashboard. At large scales, the problem is much worse. All the AI systems, forecasting, and experimentation that teams use to optimize their processes will be based on different versions of reality. One team will be optimizing for pipeline growth while another optimizes for cash flow certainty, and neither knows they are working at cross purposes.

Therefore, as organizations expand the use of analytics, shared metric definitions become more important, but not in a rigid, "one size fits all" way that forces a single definition for every metric.

High-performing organizations instead define metrics in a way that takes into account the needs of each stakeholder and makes the tradeoffs explicit. In other words, the goal is to provide clarity around what is being measured, why it matters, and whom it is for - not to force stakeholders to agree on a single definition of every metric.

When there is clarity around the metrics, teams can move quickly and operate in sync.

What do you think the semantic layer will mean for how organizations develop, maintain, and use common definitions and data models?

The rise of the semantic layer is as much about the need to use data at the speed of AI as it is about the tools themselves.

For most of our history, we have had definitions of metrics all over the place: on dashboards, in SQL, in specific teams' logic, etc. Each time someone analyzed something, it was built from the ground up. This is no longer tenable once you add LLMs into the mix.

What's changed is that LLMs do not calculate metrics; they only select which ones to use.

A well-designed semantic layer calculates and normalizes key metrics, dimensions, and their relationships beforehand. Instead of generating long, complicated queries or writing logic in real time to find the correct metric, an LLM can quickly select the correct metric based on its definition and provide an answer nearly instantaneously. All of the hard work has been done in advance.
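The idea can be sketched as a small registry: each metric is defined once, with a description and an owner, and a caller (human or LLM) only picks a metric by name. This is a toy illustration under stated assumptions, not the API of any real semantic-layer product; `Metric`, `SEMANTIC_LAYER`, `answer`, and the sample orders are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Metric:
    name: str
    description: str  # what is measured, in plain language
    owner: str        # who is accountable for the definition
    compute: Callable[[list], float]


# Hypothetical order rows, for illustration only.
orders = [
    {"customer": "acme", "amount": 100.0, "refunded": False},
    {"customer": "acme", "amount": 40.0, "refunded": True},
    {"customer": "globex", "amount": 60.0, "refunded": False},
]

# Each metric is defined exactly once, in one central place.
SEMANTIC_LAYER = {
    m.name: m
    for m in [
        Metric(
            "gross_revenue",
            "Sum of all order amounts",
            "finance",
            lambda rows: sum(r["amount"] for r in rows),
        ),
        Metric(
            "net_revenue",
            "Sum of order amounts, excluding refunded orders",
            "finance",
            lambda rows: sum(r["amount"] for r in rows if not r["refunded"]),
        ),
    ]
}


def answer(metric_name, rows):
    """Look up a metric by name and apply its single trusted definition."""
    metric = SEMANTIC_LAYER.get(metric_name)
    if metric is None:
        raise KeyError(f"unknown metric: {metric_name}")
    return metric.compute(rows)


# The caller only selects a name; it never writes the calculation itself.
print(answer("gross_revenue", orders))  # 200.0
print(answer("net_revenue", orders))    # 160.0
```

In this sketch the descriptions are what a model would read to choose between `gross_revenue` and `net_revenue`, while the calculation itself stays fixed and reviewable.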

The design of the semantic layer is also being transformed. The semantic layer is now the interface between human-generated questions, machine reasoning, and trusted business logic.

To that extent, it is not only about consistency anymore, but rather it is about minimizing latency and maximizing the reliability of decision-making. If your metrics are clearly defined and centrally managed, AI systems will be able to operate with greater speed and confidence. However, if they are not, you will experience reduced response times, less consistent results, and an ongoing need to validate every answer.

While that represents a great deal of progress, the difficult aspects of developing a semantic layer have not changed.

You can now define and control versions of your definitions, as well as abstractly expose them to users. But you cannot eliminate the organizational challenges associated with defining what a particular metric is intended to represent, who "owns" that metric, and when a metric should change. The issues are organizational in nature, not technological.

However, the cost of failure has increased significantly. When humans analyze data using inconsistent metric definitions, you get confusion. When AI uses the same definitions, you get fast, scalable confusion.

Having worked across both large tech companies and startups like Vouch Insurance, how does the approach to data modelling differ across organizational stages?

The biggest difference I’ve noticed isn’t just in scale or complexity - it’s how much process exists around the data.

Companies such as Amazon and LinkedIn build their data models with a strong commitment to governance, compliance, and risk management. For instance, how one treats personally identifiable information (PII) is much more than a checklist. It is about creating strong systems, processes, and automated control mechanisms that span hundreds of teams and services. Once core entities such as users, transactions, and products are defined consistently, the definitions stay consistent regardless of which team uses them. At that scale, inconsistencies in definition could have enormous ramifications, both operationally and from a regulatory standpoint. Therefore, the process of developing the data model (reviews, lineage, and standards) is just as important as the data model itself.

Conversely, early-stage startups often lack the same level of process. Speed and flexibility take precedence over everything else, so they use practical, lightweight models to address current business questions around growth, product usage, and revenue. While this approach works well for experimenting and testing new ideas, without clear standards, definitions can become fragmented as the company grows, creating a great deal of confusion and rework.

However, the startups that scale successfully recognize when the informal, "speed first" approach no longer works, usually once many teams begin to build analytics, experimentation, and AI on common datasets. At that point, implementing formalized models, governance, and processes is not optional; it is required to build and maintain trust in the data and to allow teams to keep working quickly without breaking things.

Therefore, companies that grow very large succeed through process discipline, while startups succeed by moving quickly and then adopting process at the right time. The primary difference in approach generally comes down to whether systems exist to manage sensitive data and enforce consistency at scale.

Data governance often gets a reputation for slowing down innovation. Yet without governance, data systems quickly become inconsistent and difficult to maintain. How can organizations build strong governance around data definitions while still maintaining the agility required for product development and experimentation? And what role do data engineers play in shaping governance culture?

Governance should be viewed as a living, collaborative process rather than a static, top-down control mechanism. The best method I have seen to implement this is for one team to create a data definition for its specific use case. Then, as additional teams identify similar uses for the data, they can collaborate to adopt and standardize the original definition. With this method, teams maintain ownership and speed of execution, yet the organization begins to converge on common, trusted definitions for each metric.

The key to effective governance is to focus on creating clarity rather than restrictions. For example, core metrics such as revenue, active users, or conversion rates require consistent definitions across all tools and teams. However, teams should participate in defining these metrics. When teams are involved in defining metrics, debates over numbers become discussions about how to improve definitions instead of arguments over whose numbers "win".

Data engineers play an essential role in this model. By developing reusable models, robust pipelines, and automated quality controls, data engineers make the "best way" to use data also the easiest way. Governance culture ultimately develops from the systems and tooling that data engineers build, rather than from policy documents. When implemented correctly, this enables fast experimentation and innovation while ensuring that data is consistent, reliable, and usable throughout the entire organization.
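An automated quality control can be as small as a function that a pipeline runs before publishing a table. The sketch below is one possible shape, assuming hypothetical column names (`customer_id`, `signup_date`) and two simple rules: required fields must be present, and the key must be unique.

```python
def validate(rows, key="customer_id", required=("customer_id", "signup_date")):
    """Return a list of human-readable problems; an empty list means the data passed."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        # Rule 1: required fields must be present and non-empty.
        for field in required:
            if row.get(field) in (None, ""):
                problems.append(f"row {i}: missing {field}")
        # Rule 2: the key column must be unique across rows.
        k = row.get(key)
        if k is not None and k in seen:
            problems.append(f"row {i}: duplicate {key}={k}")
        seen.add(k)
    return problems


# Hypothetical input with one duplicate key and one missing value.
rows = [
    {"customer_id": 1, "signup_date": "2025-01-01"},
    {"customer_id": 1, "signup_date": "2025-02-01"},
    {"customer_id": 2, "signup_date": None},
]

for problem in validate(rows):
    print(problem)
# row 1: duplicate customer_id=1
# row 2: missing signup_date
```

A pipeline could refuse to publish when `validate` returns anything, which turns the standard into a default behavior of the tooling rather than a rule people have to remember.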

AI and analytics systems depend heavily on structured, reliable data. When the underlying models are flawed, the effects can ripple across the organization. Have you seen examples where poor data modelling created significant downstream problems for analytics or machine learning systems?

Yes, I've seen it over and over again. When multiple teams build their analyses on the same data model and that model has inconsistent definitions, they often reach opposing conclusions and make conflicting decisions for the organization.

For example, two teams calculate the same metric (such as customer churn or revenue) but with slightly different methods. Each team is confident in its calculations and presents them as such. The result is not just a little confusion down the line: the flaws of poor data modelling can trickle into your reporting dashboards, machine learning models, and experimentation platform. This can ultimately lead to poor predictions, incorrect product development decisions, and wasted engineering resources.

The modern data stack has brought software engineering concepts, like version control, modular transformations, and testing, into analytics workflows. How have these software engineering principles influenced modern approaches to data modelling and data reliability? And do you see analytics engineering becoming a separate discipline from traditional data engineering?

Modern data modelling and reliability have been significantly influenced by software engineering disciplines, which have evolved into an entire Analytics Development Lifecycle (planning and development; testing; deployment; operation; monitoring, analysis, and discovery). The lifecycle ensures that all teams working on analytics workflows apply the same principles of software development: modularity, versioning, automated testing, and continuous monitoring.

At this point, I do not view analytics engineering as a separate discipline from data engineering. In fact, I believe that the lines are becoming blurred: domain experts in business and product will be able to perform many of the same infrastructure and engineering tasks that traditionally would have required a data engineer. This convergence will allow teams to develop and deploy faster, to create more context-aware models, and to incorporate reliability and governance directly into their development processes, instead of treating them as downstream concerns.

Even with the right architecture and tools, creating shared definitions often requires alignment across teams with different priorities. From a leadership perspective, what helps organizations create a culture where shared definitions and data clarity really stick? What leadership mistakes tend to undermine data alignment?

To establish a culture of data clarity and shared definitions that "stick", someone needs defined responsibility and accountability, i.e., someone has to be accountable for defining the data dictionary, for reconciling different metric definitions, and for helping teams establish consistent standards going forward.

Leadership plays a huge role in setting an example and fostering this kind of behavior; however, common mistakes can quickly destroy alignment. For example, rapidly rolling out new metrics or new data models without first validating those changes in a controlled A/B testing environment will likely cause confusion and misaligned reporting, and erode trust among teams.

Strong leaders strike a balance between speed and discipline: they ensure that all changes to the data model or metrics are transparent, tested, and documented, and they empower teams to define metrics collaboratively instead of imposing them from the top down. When done correctly, this makes data clarity sustainable and embeds it in an organization’s daily workflows.

Looking ahead, how do you see data modelling evolving over the next five years as AI becomes more embedded in analytics and decision-making? What skills will future data engineers need to succeed in this environment?

Over the next five years, I expect the practice of data modelling to evolve toward a highly integrated and collaborative workflow. The lines will continue to blur regarding who does what, and more will be expected of each team member. With AI becoming an increasingly important component of both analytics and decision-making, data engineers will need to move past simply developing data pipelines and managing infrastructure, and instead gain a deeper understanding of their organizations and the specific types of data that various stakeholders require to make informed decisions.

To succeed, data engineers will have to combine their technical expertise with domain knowledge. They will have to anticipate how the data they create will flow through models, how it will be interpreted, and ultimately how it will affect product and strategic decisions. In this collaborative environment, the best data engineers will function as both builders and translators of AI-based decision systems. They will ensure that these systems are powered by reliable, semantically rich data while still being able to experiment rapidly and adapt to changing business needs.

Tue, Jan 20, 2026


