This week, two headlines quietly pointed toward a much bigger shift in how, and where, AI will run.
First, VSORA, a French chip startup, raised $46 million to bring a new kind of inference chip to market, one that beats GPUs on power and performance without requiring a data center. Their goal is to run smaller, smarter models everywhere: phones, cars, sensors.
Second, EdgeRunner AI raised $17.5 million to build air-gapped, on-device AI agents for the military, focused on environments with limited connectivity or strict security requirements. EdgeRunner is positioning itself at the forefront of sovereign edge AI, offering a secure, hyper-local alternative to cloud-based models.
Put together, these stories point toward the same conclusion: inference is heading to the edge. The problem? Cloud isn’t necessarily ready to scale with it.
Why inference is moving
The AI industry is shifting from cloud-centric deployment to edge computing, driven by the need for real-time processing, data privacy, and lower latency.
Running a model once it’s been trained is called inference. It’s the part that turns a photo into a label, an audio prompt into a reply, a camera feed into driving instructions. And it’s getting expensive to do that in centralized data centers. Bandwidth costs, latency, energy, and regulatory pressure are piling up. The narrative around AI has focused for too long on training. Foundation models like GPT-4 or Claude required massive compute clusters and months of optimization, but that’s not where the real volume lives. Once a model is trained, it needs to be run. Constantly. Billions of times a day.
Unlike training, which happens once in a controlled setting, inference has to happen in real time, under constraint, across many environments. And it is exploding: we don’t retrain foundation models every day, but we run them constantly, on everything from fraud checks to search suggestions to robot vision. Inference is where the cost, and the bottlenecks, now live.
That shift in volume matters. Whoever can run models faster, more efficiently, and in more places will define the next wave of AI infrastructure. Not because they built the biggest model, but because they figured out how to operate it at scale.
The compression bottleneck
If edge AI is going to scale, we need to get serious about compression. Everyone talks about deploying large language models on-device. Few are honest about what that actually requires. The basic constraint isn’t compute, but size.
Right now, the standard techniques are quantization and pruning. Both shrink the model and make inference faster. (Quantization reduces precision, say, from 32-bit floats to 8-bit integers. Pruning cuts weights.) But both come at a cost: you lose accuracy. And unlike compressing an image or a video, you’re not just dropping quality. You’re degrading decision logic.
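For intuition, here is a minimal NumPy sketch of both techniques on a toy weight matrix. The symmetric 8-bit scheme and the 30% magnitude-pruning threshold are illustrative choices, not a recommendation.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0, 0.5, size=(4, 4)).astype(np.float32)  # toy weight matrix

# Quantization: map float32 weights onto 8-bit integers with a single scale.
scale = np.abs(w).max() / 127.0          # symmetric 8-bit range [-127, 127]
w_q = np.round(w / scale).astype(np.int8)
w_deq = w_q.astype(np.float32) * scale   # what the runtime effectively computes with

# Pruning: zero out the smallest-magnitude 30% of weights.
threshold = np.quantile(np.abs(w), 0.30)
w_pruned = np.where(np.abs(w) < threshold, 0.0, w)

# The "cost" described above: both copies now differ from the original.
print("mean quantization error:", np.abs(w - w_deq).mean())
print("fraction of weights zeroed:", (w_pruned == 0).mean())
```

That reconstruction error and those zeroed weights are exactly the cost in question: in a deployed model they surface as degraded accuracy, and how much degradation is tolerable depends entirely on the task.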
The problem is that tradeoffs don’t generalize. A model compressed for a low-power sensor might be useless in a drone navigating a smoke-filled urban environment. Accuracy, latency, energy draw, and thermal limits all shift based on context. A one-size-fits-all model won’t cut it.
A 2023 study from MIT and the MIT-IBM Watson AI Lab showed that certain model compression techniques significantly reduced accuracy on downstream tasks, particularly in low-resource or noisy environments. Yet the same methods worked well in chat applications, where response precision was less critical. That gap matters.
Enter specialization
For a while, there was a belief that a single model could be fine-tuned into anything. But inference at the edge is showing the opposite. Models that work in phones are different from those used in aircraft or medical systems. They have different power budgets, failure tolerances, and real-time requirements.
This trend pushes toward specialization. It means building systems that can adapt a model to the environment it's running in. That includes compression, retraining, and monitoring—all done closer to the device.
This changes the nature of the AI stack. Instead of one model served everywhere, we are moving toward many models served in context. And that introduces a new coordination problem that infrastructure today is not ready to handle.
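As a rough illustration of what “many models served in context” means operationally, here is a sketch in Python. The profile fields, model names, and numbers are hypothetical, chosen only to show the selection step, not a real catalog.

```python
from dataclasses import dataclass

@dataclass
class DeviceProfile:
    # Hypothetical constraints an edge deployment might declare up front.
    power_budget_w: float      # sustained power available for inference
    latency_budget_ms: float   # hard deadline for a single prediction
    accuracy_floor: float      # minimum acceptable task accuracy

# Illustrative variants of the same model at different compression levels.
VARIANTS = [
    {"name": "vision-int4-tiny",  "power_w": 0.5,  "latency_ms": 8,  "accuracy": 0.86},
    {"name": "vision-int8-small", "power_w": 2.0,  "latency_ms": 15, "accuracy": 0.91},
    {"name": "vision-fp16-full",  "power_w": 12.0, "latency_ms": 40, "accuracy": 0.95},
]

def pick_variant(profile: DeviceProfile) -> dict:
    """Return the most accurate variant that fits the device's constraints."""
    feasible = [
        v for v in VARIANTS
        if v["power_w"] <= profile.power_budget_w
        and v["latency_ms"] <= profile.latency_budget_ms
        and v["accuracy"] >= profile.accuracy_floor
    ]
    if not feasible:
        raise RuntimeError("no variant fits this environment; recompress or retrain")
    return max(feasible, key=lambda v: v["accuracy"])

# A phone and a low-power sensor end up running different models for the same task.
print(pick_variant(DeviceProfile(power_budget_w=3.0, latency_budget_ms=20, accuracy_floor=0.9)))
print(pick_variant(DeviceProfile(power_budget_w=1.0, latency_budget_ms=10, accuracy_floor=0.8)))
```

The interesting work is not the selection logic itself but keeping the catalog of variants compressed, validated, and monitored per environment, which is the coordination problem described above.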
Why infra needs to adapt
Most of today’s infrastructure is still built for cloud workloads. That means fast backhaul to a central location, huge racks of GPUs, and general-purpose transport protocols like TCP or QUIC pushing traffic through big pipes.
Public cloud margins also depend on centralization. If edge inference scales, it threatens not just technical architecture but business models built on metered compute. This is why Nvidia’s grip on inference hardware is being quietly challenged by startups building silicon for one task only: running small models extremely well.
That architecture doesn’t work when you’ve got thousands of sensors or agents making decisions locally and trying to stay in sync.
Edge inference needs a different kind of infrastructure:
Power-efficient chips tuned for small batch inference.
Protocols that minimize latency, not just maximize throughput.
Smarter networking that knows when to sync and when to wait.
Distributed systems that don’t fall apart when they lose connectivity (see the sketch below).
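To make those last two points concrete, here is a minimal Python sketch of a local-first edge agent, assuming hypothetical stand-ins for the on-device model and the uplink: decisions are made immediately on the device, and results are queued and pushed upstream only when the link is actually up.

```python
import queue
import time

class EdgeNode:
    """Toy edge agent: decide locally first, reconcile with the cloud later."""

    def __init__(self, run_local_model, link_is_up, push_upstream):
        # All three callables are hypothetical stand-ins for a real
        # on-device runtime and transport layer.
        self.run_local_model = run_local_model
        self.link_is_up = link_is_up
        self.push_upstream = push_upstream
        self.pending = queue.Queue()   # results not yet acknowledged upstream

    def handle(self, observation):
        # 1. Decide locally: no round trip, no dependency on connectivity.
        decision = self.run_local_model(observation)

        # 2. Queue the result instead of blocking on the network.
        self.pending.put({"ts": time.time(), "obs": observation, "decision": decision})

        # 3. Sync opportunistically: only when the link is up, only what is queued.
        while self.link_is_up() and not self.pending.empty():
            self.push_upstream(self.pending.get())

        return decision

# Minimal stubs so the sketch runs end to end.
node = EdgeNode(
    run_local_model=lambda obs: obs > 0.5,       # pretend classifier
    link_is_up=lambda: False,                    # simulate a dropped link
    push_upstream=lambda record: print("synced", record),
)
print(node.handle(0.9))   # still returns a decision with zero connectivity
```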
This is where the next layer of innovation is starting to show up. Not in bigger models, but in how we route, compress, and coordinate smaller ones in the wild. And if we take anything away from recent compression studies, it’s that compression strategies need to be mission-specific: tailored to environment, hardware, and risk tolerance. Until we treat compression as a systems-level challenge, edge AI won’t live up to the hype.
Why not stay in the cloud?
Some will argue this is a solved problem. Networks are faster, GPUs are everywhere, and inference will always be more efficient in the cloud. Why rebuild infrastructure for the edge?
The answer depends on what you're optimizing for. The cloud is massive, centralized, and optimized for throughput. The edge is messier, noisier, but faster where it counts.
If your goal is to batch massive queries, sure, centralize. But if your product needs millisecond (or faster) response times, resilience under poor connectivity, or energy-aware compute, you can’t afford the round trip to a data center. Look at autonomous drones, industrial robotics, battlefield systems, consumer AR glasses. The future isn’t only ‘big compute’; it’s also ambient, context-aware intelligence. And that needs edge inference.
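A back-of-envelope budget makes the round-trip argument concrete. All numbers below are illustrative assumptions, not measurements:

```python
# Illustrative latency budget for a tight control loop; every figure is an assumption.
control_loop_budget_ms = 10          # e.g. a drone stabilization decision

cloud_path_ms = (
    5      # radio / last-mile uplink
    + 30   # WAN round trip to the nearest region
    + 5    # queueing plus inference in a shared cluster
)
edge_path_ms = 3                     # small quantized model on local silicon

print("cloud path fits budget:", cloud_path_ms <= control_loop_budget_ms)   # False
print("edge path fits budget:", edge_path_ms <= control_loop_budget_ms)     # True
```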
If inference stays centralized, real-time systems will either become brittle or fail outright. This is the line between a drone that can respond and one that crashes, between a fraud detection system that stops theft and one that flags it after the funds are gone.
What's next
As more companies build agentic systems—self-driving cars, co-pilots, personal assistants—they’ll run up against this wall. Compression and coordination become the core problem. And the winners won’t be those with the biggest models. They’ll be the ones who make the smallest models work everywhere, fast.
That’s why the battle for edge inference is really a battle for infrastructure. Not just chips, but the protocols, data flows, and system design underneath it all.
We're about to find out which companies can rethink those layers without breaking the entire stack in the process.
Lastly: Infrastructure isn’t a supporting layer. It’s the shape of the system itself. The firms that treat inference as a coordination problem—not just a silicon problem—will define the next era of intelligent systems. Everyone else will be fine-tuning pipelines for an architecture already collapsing under its own weight.
So my questions then become: Are we underestimating how fast infra assumptions are breaking? And what does the AI stack look like when the network, not the chip, is the bottleneck?