Over the past two years, most of the noise around AI has focused on the model race – whose model is bigger, faster or better on benchmarks.
But as AI moves from pilots into the core of products and workflows, a familiar pattern from the early days of cloud is re‑emerging: systems are more programmable than ever, but they are also much harder to run.
Field CTO, Asia-Pacific & Japan, Datadog.
That means the most important competition in AI is shifting: from who has the “best” model to who can operate AI reliably, efficiently and safely at scale.
AI is now hitting operational limits, not model limits
Real‑world telemetry from thousands of production systems paints a clear picture. Nearly 1 in 20 AI requests fails once applications reach scale, and a majority of those failures now stem from capacity limits such as rate limits, quotas and concurrency caps, rather than from model bugs or poor accuracy. That is a very different story from the benchmark charts most teams used to obsess over.
The amount of data sent per request is also climbing. Across many production estates, median users have more than doubled their token usage, while heavy users have seen volumes grow several‑fold. That growth is both a symptom of more ambitious AI use cases and a direct driver of cost and IT infrastructure stress.
You can see the impact most clearly in what many teams now describe as GPU sprawl: fragmented fleets spread across clouds and on‑prem clusters. Some GPUs sit idle while others are consistently saturated, and there is very little correlation between where GPU hours are spent and where they create business value.
The result is familiar to anyone who lived through the early adoption of cloud computing – runaway spend, unpredictable performance and capacity crises that appear out of nowhere.
How this is playing out in APAC
Across Asia‑Pacific, and especially in ASEAN, we’re currently seeing structural pressures: AI adoption is accelerating, but operational maturity is uneven.
Singapore is further along on governance and observability, driven in part by regulatory expectations and a more mature cloud landscape. Meanwhile, markets such as Indonesia, Malaysia and Thailand are moving very fast on deployment, often pushing AI into customer‑facing services while operational practices catch up.
As organizations across these markets roll out multi‑model and agent‑based architectures, they are running into reliability issues, limited visibility and inconsistent model performance. Token usage is increasing quickly, but optimization practices, such as prompt caching and context engineering, are underutilized.
That gap between readiness and deployment is already creating operational and cost debt that will be harder to unwind later.
The four operational disciplines AI teams need
The good news is that, because the evolution of AI resembles the early days of cloud, we can predict, at least a little, where things are headed.
Now, the question AI leaders should be asking is this: which disciplines distinguish the teams that will cope best with this complexity?
In my view, there are four that teams working with AI need to adopt to see sustainable success:
1. Establish visibility and attribution
You cannot operate what you cannot see, and AI is no exception.
Teams need to see how GPU hours and tokens map to specific applications, teams and use cases, so they can connect that usage to latency, error rates and user impact.
That makes it possible to separate business‑critical workloads from background noise, and to see which services are driving cost or consuming capacity.
When usage is visible and attributable in a single view, decisions about where to optimize, protect capacity or dial back become much less emotional and much more data‑driven.
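To make the idea of attribution concrete, here is a minimal Python sketch of rolling per‑request telemetry up by team and application. The record fields and the `attribute_usage` helper are hypothetical names chosen for illustration, not part of any particular observability product, and real pipelines would read this data from traces or logs rather than an in‑memory list.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """One AI request as it might appear in telemetry (hypothetical schema)."""
    team: str
    app: str
    tokens: int
    latency_ms: float
    failed: bool


def attribute_usage(records):
    """Roll up requests into one summary row per (team, app) pair."""
    totals = defaultdict(
        lambda: {"requests": 0, "tokens": 0, "failures": 0, "latency_ms": 0.0}
    )
    for r in records:
        row = totals[(r.team, r.app)]
        row["requests"] += 1
        row["tokens"] += r.tokens
        row["failures"] += 1 if r.failed else 0
        row["latency_ms"] += r.latency_ms

    summary = {}
    for key, row in totals.items():
        summary[key] = {
            "requests": row["requests"],
            "tokens": row["tokens"],
            "error_rate": row["failures"] / row["requests"],
            "avg_latency_ms": row["latency_ms"] / row["requests"],
        }
    return summary


if __name__ == "__main__":
    # Illustrative data only; the numbers are made up.
    sample = [
        RequestRecord("search", "chatbot", 1000, 200.0, False),
        RequestRecord("search", "chatbot", 3000, 400.0, True),
        RequestRecord("ops", "summarizer", 500, 100.0, False),
    ]
    for (team, app), row in attribute_usage(sample).items():
        print(team, app, row)
```

Putting tokens, error rate and latency in the same row is the point of the sketch: it lets a team compare what a workload consumes with how it behaves for users.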
2. Enforce control and guardrails
Without guardrails, AI systems will consume as much capacity as you give them.
Practical controls include rate limits and budget caps, along with safeguards on agent behavior to stop unbounded retries, loops and poorly bounded workflows from exhausting shared resources.
These controls are about making consumption predictable and ensuring that one runaway experiment cannot impact core production services.
Without this discipline, AI programs tend to hit economic limits long before they hit technical ones. You end up with impressive prototypes, but unsustainable unit economics.
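As a rough illustration of these guardrails, the Python sketch below combines a token budget cap with a bounded retry loop. `TokenBudget`, `BudgetExceeded` and `call_with_guardrails` are hypothetical names, `RuntimeError` stands in for whatever transient error a real client raises, and charging a fixed token estimate per attempt is a simplifying assumption.

```python
class BudgetExceeded(Exception):
    """Raised when a workload would spend past its token cap."""


class TokenBudget:
    """A simple per-workload token cap (hypothetical helper)."""

    def __init__(self, cap_tokens):
        self.cap_tokens = cap_tokens
        self.used_tokens = 0

    def charge(self, tokens):
        # Refuse the spend up front instead of discovering the overrun later.
        if self.used_tokens + tokens > self.cap_tokens:
            raise BudgetExceeded(
                f"cap of {self.cap_tokens} tokens would be exceeded"
            )
        self.used_tokens += tokens


def call_with_guardrails(call, budget, tokens_per_attempt, max_attempts=3):
    """Run `call` with a hard limit on retries and on total token spend."""
    last_error = None
    for _ in range(max_attempts):
        # Every attempt is charged, so retries cannot hide from the budget.
        budget.charge(tokens_per_attempt)
        try:
            return call()
        except RuntimeError as err:  # stand-in for a transient failure
            last_error = err
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

The design choice worth noting is that the retry limit and the budget are enforced independently: a looping agent is stopped by whichever limit it reaches first, and the shared capacity behind it is protected either way.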
3. Optimize GPU utilization before scaling supply
Most teams reach for more GPUs when what they really have is a utilization problem.
GPU instances already account for a significant share of compute costs, and that proportion only grows as organizations push deeper into training and inference at scale.
But idle or underutilized GPUs create the sense of a shortage even when there is headroom in the estate. In turn, many teams can see their overall GPU bill climbing, but cannot see which workloads are driving consumption, or pinpoint the steps needed to improve efficiency.
What we learned during the early days of cloud is that in these instances, overprovisioning becomes the safest default – but then spend balloons even when there is stranded capacity in the fleet.
Treating GPU infrastructure as a first‑class system means tracking utilization so that teams can distinguish genuine capacity shortages from misallocation or fragmentation. Then, they can decide whether to free up capacity or truly add more supply.
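One simple way to picture that distinction is the Python sketch below, which looks at average utilization per GPU and separates a fleet that is genuinely full from one that is merely unevenly loaded. The `diagnose_fleet` function and its 20% and 90% thresholds are assumptions made for illustration; a real fleet would need thresholds tuned to its own workloads and a longer observation window.

```python
def diagnose_fleet(utilization_by_gpu, idle_below=0.2, saturated_above=0.9):
    """Coarse heuristic: is the fleet short of GPUs or just unevenly used?

    `utilization_by_gpu` maps a GPU identifier to its average utilization
    as a fraction between 0.0 and 1.0. Thresholds are illustrative.
    """
    idle = sorted(g for g, u in utilization_by_gpu.items() if u < idle_below)
    saturated = sorted(
        g for g, u in utilization_by_gpu.items() if u > saturated_above
    )

    if saturated and idle:
        # Some GPUs are overloaded while others sit idle: rebalance first.
        verdict = "misallocation"
    elif saturated:
        # Overloaded GPUs and nothing idle to absorb the load.
        verdict = "shortage"
    else:
        verdict = "headroom"

    return {"verdict": verdict, "idle": idle, "saturated": saturated}


if __name__ == "__main__":
    # Illustrative data only; identifiers and values are made up.
    fleet = {"cloud-a/gpu0": 0.97, "cloud-a/gpu1": 0.95, "onprem/gpu0": 0.04}
    print(diagnose_fleet(fleet))
```

Only the "shortage" verdict is an argument for buying more supply; "misallocation" points to scheduling and placement work on the capacity already in the estate.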
4. Design for efficiency at the application layer
High AI costs and failure rates often come from how applications are put together, not from the models themselves.
Inefficient patterns, poor routing across providers and unoptimized prompts all drive up token usage and increase the risk of timeouts, errors and inconsistent behavior.
But with proper visibility into prompts, agents and tools in production, teams can see how requests actually flow through the system and tune for quality, latency and cost in a controlled way.
That turns the application layer from a black box into a place where efficient engineering choices are deliberate, measurable and aligned with business outcomes.
What leaders should do in the new AI race
The early days of cloud taught us that programmability without operational discipline can be as much a liability as an advantage. AI is now at a similar inflection point: the winners will not just be those with access to the most powerful models, but those who treat AI as a long‑term engineering and operations capability.
A useful test for any organization is whether it can explain where AI spend goes, how agents behave in production and which workloads it would protect first if capacity were suddenly cut.
If the honest answer is “I don’t know yet”, then the next phase of the AI journey is clear: stop chasing the next model release, and focus on building the operational foundations that will help you scale AI safely and sustainably.
This article was produced as part of TechRadar Pro Perspectives, our channel to feature the best and brightest minds in the technology industry today.
The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing find out more here: https://www.techradar.com/pro/perspectives-how-to-submit