Why AI Will Move to the Endpoint - Architecture & Governance Magazine

By Dave Gruver, Global Field CTO, SHI

Since its inception, artificial intelligence has lived in the cloud, with prompts from users and applications routed via the web to models running on GPU clusters in cloud servers. But using these models means sacrificing privacy and control, because data has to leave the business environment to reach the model, and it means paying high processing costs. AI token costs now exceed employee salaries in some organisations, while Uber’s CTO, Praveen Neppalli Naga, admitted the company had run through its entire AI budget for 2026 in just four months.

This all points to the need to localise AI, which is why we can expect to see AI processes begin to move from the cloud to the device. While that may seem to go against the grain, the truth is we’ve been here before. Back in the 1970s and into the 1980s, mainframes were the norm: large centralised systems with limited access that ran over expensive infrastructure. But the rise of home computing in the form of PCs and laptops moved computing onto the device, resulting in the democratisation of the technology and an explosion of applications.

We’ve been here before

We’re now seeing history repeat itself. Frontier AI models, which rely on massive GPU clusters and are chiefly accessible to specialised AI researchers, are being supplanted by AI models that run locally on AI PCs and workstations. It’s an evolutionary change that has been made possible by three key breakthroughs. Firstly, AI PCs have come onto the market that feature faster processors, more memory and integrated CPUs, GPUs and NPUs designed for these workflows. At the same time, the realisation that most real-world tasks don’t require the mass computing abilities of a frontier model has seen the emergence of small language models (SLMs). And finally, we’re seeing the optimisation of models through compression and quantisation, the take-up of frameworks such as ONNX, OpenVINO and llama.cpp, and more efficient use of those CPUs, GPUs and NPUs.
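To make the quantisation point concrete, here is a minimal sketch in Python of symmetric 8-bit quantisation, where each floating-point weight is stored as a small integer plus one shared scale factor. The function names and sample weights are illustrative assumptions, not taken from any particular framework, and production toolchains use more sophisticated schemes.

```python
from typing import List, Tuple


def quantise_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map float weights to integers in [-127, 127] using one shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Assumption: a single per-tensor scale; real tools often scale per block.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantised = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantised, scale


def dequantise(quantised: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantised]


# Hypothetical weights, chosen only to illustrate the round trip.
weights = [0.42, -1.27, 0.05, 0.0]
quantised, scale = quantise_int8(weights)
restored = dequantise(quantised, scale)

# The worst-case rounding error is about half of one quantisation step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight now needs one byte rather than four, at the cost of a small rounding error per weight. That trade-off is what lets a model fit into the memory of a PC or workstation.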

Yet despite these drivers, deploying AI at the local level remains complex, with most architectures currently stitched together manually by only the hardiest of AI enthusiasts. This is because building an AI stack is multi-faceted. It requires selecting a pretrained model from a platform such as Hugging Face; running AI inference, in which the trained model is applied to new data; choosing a framework such as LLMWare, LangChain or LlamaIndex to create retrieval-augmented generation (RAG) workflows; and writing the pipeline code. Only then can model conversion and optimisation occur using a hardware acceleration ecosystem from one of the big tech providers.
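The retrieval step at the heart of a RAG workflow can be sketched in a few lines of Python. This is a toy illustration only: it ranks documents by word overlap (cosine similarity over word counts), whereas a real stack would use an embedding model and a vector database. The documents, function names and prompt wording are all hypothetical, and the final call to a local model is deliberately left out.

```python
import math
import re
from collections import Counter
from typing import List


def tokenise(text: str) -> Counter:
    """Lower-case the text and count its alphanumeric words."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: List[str], top_k: int = 1) -> List[str]:
    """Return the top_k documents most similar to the query."""
    q = tokenise(query)
    ranked = sorted(documents, key=lambda d: cosine(q, tokenise(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, documents: List[str]) -> str:
    """Place the retrieved context ahead of the question for the model."""
    context = "\n".join(retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical internal documents standing in for a company knowledge base.
DOCS = [
    "Laptops must be encrypted before they leave the office.",
    "Expense claims are due on the last working day of the month.",
    "Help desk tickets are triaged within four hours.",
]

prompt = build_prompt("When are expense claims due?", DOCS)
```

The resulting prompt would then be passed to whichever local model the stack has selected. Everything before that point runs on the device, so the documents never leave it.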

Go local or go home

That kind of build is a big ask and it goes some way to explaining why we’re not seeing widespread deployment of these local architectures. But it’s a change that is coming and organisations that don’t move with the times risk being caught out by shadow AI (i.e. AI-enabled browsers and unmanaged local models operating outside of formal controls) or saddled with cloud-based AI while their competitors harness the benefits of on-device AI. Because the benefits are considerable.

AI at the endpoint will provide not just smarter workflows, faster decision making and lower latency but also real security gains. For instance, AI-enabled data loss prevention (DLP) platforms can analyse data locally using neural processing units (NPUs), flagging and tagging sensitive information such as personally identifiable information (PII) in real time, and preventing unauthorised sharing or movement of data.
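The flag, tag and block flow can be illustrated with a short Python sketch. Note the big assumption: simple pattern rules stand in here for the on-device model that a real AI-enabled DLP platform would run, and the patterns, labels and function names are hypothetical.

```python
import re
from typing import Dict, List

# Assumption: two simple regular expressions stand in for a trained classifier.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s-]{8,}\d"),
}


def tag_pii(text: str) -> Dict[str, List[str]]:
    """Return each PII label found in the text with the matching strings."""
    tags: Dict[str, List[str]] = {}
    for label, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            tags[label] = matches
    return tags


def allow_share(text: str) -> bool:
    """Permit sharing only when no PII has been tagged."""
    return not tag_pii(text)


# Hypothetical message that a user is about to send outside the business.
message = "Contact jane.doe@example.com about the report"
tags = tag_pii(message)
blocked = not allow_share(message)
```

Because the check runs entirely on the device, the text being inspected is never sent to an external service in order to be classified.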

Analysing security events locally will also improve security telemetry and reduce reliance on external platforms and cloud-based solutions. In fact, because AI is on the device, it can even operate offline, which can be a significant advantage for workers out in the field. Plus, because the organisation can use local hardware for AI inference, operational expenses will fall, along with the cost of processing in the cloud.

We can also expect to see local AI become more personalised to the user, effectively becoming an exoskeleton that amplifies their capabilities, refines output or even proposes alternative ways of processing workloads. Running on the user’s own device also ensures that the AI will have no rights greater than those of the user, simplifying the management of the AI. All of which results in a better digital employee experience (DEX), meaning fewer help desk tickets, shorter resolution times and lower servicing and maintenance costs.

What might a local AI architecture look like?

While these are strong arguments, they don’t solve the issue at the heart of on-device AI: is there a better way to design the architecture? One solution is to leverage an integrated model inference operating system to create a managed local architecture.

Built on top of integrated backend inference engines and formats such as OpenVINO, ONNX, GGUF and QNN, this would comprise three core management disciplines that pass information between them. The first, AI Knowledge Management, oversees how data is sourced through document parsing, semantic and text search, dataset building, connection to sources and speech-to-text, as well as interfacing with a vector database server and conducting web searches.

The second is Model Lifecycle Management, which covers the way in which the model functions. It allows the business to keep a continually updated catalogue of hundreds of models, to maintain a private repository of customised or fine-tuned small language models (SLMs), and to utilise multi-modal models. Its remit also includes prompt management, generation management, embedding, classifiers, and on-device and API inferencing. Quantisation and optimisation are then controlled via a hardware platform.

The third management system is for governance and security controls. This ensures the business can test models, carry out model integrity checks, review guardrails pre- and post-deployment, verify sources, and optimise routing. It also captures inference history and compliance logs, so it can be used to prevent AI drift.
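Two of these controls, model integrity checks and inference logging, lend themselves to a brief Python sketch. It assumes that the business records a SHA-256 digest for each approved model file and that log entries are kept as JSON lines. The function names, field names and sample values are hypothetical.

```python
import hashlib
import json
import time
from typing import List


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_model(model_bytes: bytes, expected_digest: str) -> bool:
    """Check that a model file matches the digest recorded when it was approved."""
    return sha256_of(model_bytes) == expected_digest


def log_inference(log: List[str], model_id: str, prompt: str, response: str) -> None:
    """Append one compliance record; the prompt is hashed, not stored in clear."""
    entry = {
        "timestamp": time.time(),
        "model_id": model_id,
        "prompt_sha256": sha256_of(prompt.encode("utf-8")),
        "response_chars": len(response),
    }
    log.append(json.dumps(entry))


# Hypothetical model contents and catalogue digest, for illustration only.
approved_model = b"example model weights"
catalogue_digest = sha256_of(approved_model)

compliance_log: List[str] = []
if verify_model(approved_model, catalogue_digest):
    log_inference(compliance_log, "slm-demo", "Summarise this log", "Summary text")
```

A file that has been altered in any way produces a different digest and fails the check, while the accumulated log gives reviewers a history against which changes in model behaviour can be compared.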

Layered on top of all three is agent-based process orchestration, which would carry a services catalogue, process engine, process visualisation and low-code or no-code support for PCs. Finally, an interaction layer governs the user-facing or system integration elements. This houses custom chat templates, custom agents, enterprise templates, user interface apps and batch processing, has an API server and supports integration with other systems. The entire architecture is then encapsulated within the compute infrastructure (GPU, CPU, NPU).

Owning the AI

Designing an architecture in this way to support AI at the endpoint can help reduce risk and friction and increase ROI, all while making these tools more accessible to a broader set of employees. The key is integration, so that endpoint security, DEX insights and management platforms all function as a coordinated system rather than in silos. And, while it will require some significant planning and execution, deploying an architecture to support local AI will reap rich rewards. Sensitive data will remain within the environment. Network latency will be reduced, providing real-time response rates. Teams will be able to choose and customise the models they want to use, and to build and test workflows. And there will be no token fees for model usage, eliminating those costs associated with AI processing.

AI on the endpoint is therefore not just inevitable but desirable. It will enable every team to leverage the advantages of the technology, and the resulting experimentation will see thousands of workflows draw on data within the business. Whether it’s interrogating documents with a RAG query, extracting and analysing data from a PDF or CSV sheet, summarising security logs to generate an incident report or delving deep into API data to analyse the metrics and generate insights, these workflows will enable the business to derive real value from the data it already owns, truly democratising the technology.

Dave Gruver joined SHI in the role of Field CTO End User Compute in January of 2023. Prior to SHI, Dave spent 26 years at AT&T, focused on end user computing. His track record dates back to 1996, when he was deploying Windows 3.11 to call centres. Dave worked his way up to the role of AVP, responsible for all end user compute operations across the company. During his tenure he was responsible for the strategy, architecture and support of over 700,000 endpoint devices serving over 300,000 employees and contractors. He oversaw the transition from on-prem to cloud-based services and modern management practices. Dave brings that experience to SHI, where he assists SHI customers as they modernise, transform and grow their organisations and build stronger employee experiences.