
Language Without Logic: Understanding LLMs in Observability

Essay Observability LLMs Deterministic systems Oct 15, 2025 Eric Fruhinsholz

Large language models are extraordinary at explanation, summarization, and linguistic abstraction. But observability is not a language problem. It is a proof problem. This essay walks through the most common illusions and why hybrid architectures are the only sane path.

I. The Boundary Between Language and Logic

Across the observability world, there is a growing temptation to believe that Large Language Models can replace years of engineered logic. The idea sounds almost inevitable: if we give an LLM all our logs, metrics, and traces, it will “understand” the system, connect the dots, and identify the root cause. Why build correlation graphs, statistical models, or dependency maps when an LLM can read everything and tell us what went wrong?

This belief gains momentum from another popular assumption: that LLMs are uniquely good with unstructured data. Logs and traces are messy, irregular, and textual, so it seems natural to think that a model trained on billions of words could find patterns in them. But this overlooks a subtle truth. The so-called “unstructured text” used to train LLMs was not truly unstructured. It carried meaning, grammar, and internal logic. The model learned by discovering the statistical regularities that arise from this human structure: syntax, causality expressed in language, and semantic patterns that repeat across contexts. It is precisely these patterns that the model can rediscover during inference.

Those same patterns do not exist in logs or traces. Even though these artifacts are text-based, their statistical signatures are entirely different. They do not follow human grammar or semantics, and they rarely encode causal meaning in linguistic form. Instead, they exhibit mechanical repetition, timestamp noise, and arbitrary identifiers that may even fool the model into perceiving patterns that are not there. What looks like structure to a human observability engineer, such as a log format, a metric schema, or a dependency map, appears to the model as random text.

A related misconception fuels the same illusion. People assume that because LLMs can talk about data, they can analyze it. In reality, they do not compute or reason over numbers. They predict the next word that most often follows a given sequence of symbols. When their answers seem analytical, it is because they recall the appearance of analysis from patterns in their training data, not because they have performed it.

The illusion deepens because LLMs sound intelligent. When a model outputs a fluent sentence like “The latency increased because Service C’s CPU usage spiked,” it feels like analysis. It looks like reasoning. Yet inside the model, there is no logic, no inference, no causal graph, only the prediction of the next most likely token based on previous ones.

There is also a practical dimension that often goes unmentioned: cost. Each LLM inference is expensive compared to a deterministic algorithm. A correlation query that a statistical engine can execute in milliseconds may take a minute of model inference and thousands of tokens, at hundreds of times the cost. At scale, that difference is not just a technical inefficiency, it is a structural one. Using a probabilistic language model to compute what a simple correlation function can derive exactly is the analytical equivalent of heating a house with a candle: it works, but at a painful cost.

In this article, we will break that illusion through a concrete, incremental demonstration. We will start with a simple four-service system and observe how an LLM misinterprets data even when it has everything it needs. We will then show why adding more data or expanding the context window does not improve reasoning, why unstructured metrics remain opaque to linguistic models, and why causal understanding requires a completely different architecture.

The goal is not to dismiss LLMs. They are exceptional at abstraction, summarization, and linguistic interpretation, roles that make them valuable in observability pipelines when used properly. But they are not reasoning engines, and treating them as such leads to confident nonsense. Real intelligence in observability arises when structured logic and language cooperate.

The argument throughout this piece follows one guiding principle: language belongs at the edges, logic at the core.

II. The Mirage of Computation

The first illusion is that a language model performs arithmetic.

It doesn’t.

It performs pattern completion.

Ask a model “What’s 17 × 23?” and it replies 391. That looks like computation, but it isn’t multiplication. It is retrieval: the token “391” is simply the most probable continuation of the sequence “17 × 23 =”.

In practice, these examples almost always work today. The models have grown large enough and seen enough text that it is difficult to trick them with basic arithmetic. But that success only shows that the training distribution has expanded, not that reasoning has emerged. The mechanism is unchanged: the model still predicts the next most likely token, it does not compute.

Conceptually, when you try “91 × 97”, which appears less often in text, the probability map thins out and the model begins to stutter or hallucinate. The apparent intelligence vanishes the moment the pattern leaves the distribution of examples it has memorized.

Under the hood, every step depends on self-attention. The transformer’s attention mechanism was designed to overcome the limits of sequential memory. Each token compares itself to every other, assigning weights that measure how relevant past information might be for the next prediction. In theory, this should give the model a form of perfect recall, a flat memory where everything can be seen at once. To emit one new symbol, the model recomputes a dense web of relationships over the entire prompt.

Arithmetic, however, is hierarchical. Parentheses nest. Operators have precedence. Intermediates build on intermediates. A human or a calculator manages this with a stack: push on “(”, pop on “)”. A Transformer cannot. It must re-infer those pairings at every generation step through attention weights alone.
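The stack discipline a calculator uses can be sketched in a few lines of Python. This is a minimal evaluator for fully parenthesized expressions, not a production parser: one linear pass, with nesting handled implicitly by stack depth rather than by re-reading history.

```python
import operator

# Minimal sketch: push operands and operators, evaluate and pop on ")".
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def evaluate(expression: str) -> float:
    """Evaluate a fully parenthesized binary expression like '((2+3)*4)'."""
    values, ops = [], []  # explicit stacks: one linear pass, no re-scanning
    token = ""
    for ch in expression.replace(" ", ""):
        if ch.isdigit() or ch == ".":
            token += ch
            continue
        if token:                 # flush the number we just finished reading
            values.append(float(token))
            token = ""
        if ch in OPS:
            ops.append(ch)
        elif ch == ")":           # pop: combine the two most recent values
            right, left = values.pop(), values.pop()
            values.append(OPS[ops.pop()](left, right))
        # "(" needs no action: nesting is implicit in stack depth
    if token:
        values.append(float(token))
    return values[0]
```

Each closing parenthesis triggers exactly one pop, so the depth-five expression from the table below costs five stack operations instead of thousands of token-level probability updates.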

You can watch the structure collapse as depth grows:

| Expression | Runs | Depth | Average tokens generated | Typical time |
|---|---|---|---|---|
| (6885 + 4662) | 10 | 1 | ≈ 107 | < 3 s |
| ((2569 + 1311) * 487) | 10 | 2 | ≈ 340 | ≈ 4.5 s |
| (((289 + 553) * 15483) / 50124) | 10 | 3 | ≈ 5,053 | ≈ 52 s |
| ((((285 + 4563) * 499) / 50124) * 6415) | 10 | 4 | ≈ 5,976 | ≈ 49 s |
| (((((285 + 4563) * 499) / 50124) * 6415) / 878) | 10 | 5 | ≈ 6,407 | ≈ 1 min |

The test was conducted using GPT-5-2025-08-07, the latest model available at the time this article was written.

The slowdown is not software lag, it is architectural. The number of attention operations grows roughly with the square of the context length, and the number of generated tokens balloons in tandem. Each extra parenthesis forces the model to rescan and reinterpret every previous symbol, inflating both compute and cost.

An algebraic parser solves the same hierarchy in linear time; the LLM does it by repeatedly re-reading its own history, trying to guess which symbols relate. The result is a curious form of pseudo-reasoning: the model can sometimes reach the correct answer, but only after spending thousands of token-level probability updates to simulate what one stack operation would accomplish.

A computation that a student completes in seconds can trap a trillion-parameter model for minutes. Even when correct, the process is grotesquely inefficient. What a CPU executes through explicit rules, the LLM achieves by statistical imitation of language. It is not calculating; it is re-describing what calculation sounds like.

This limitation is well known in the field, which is why modern systems no longer rely on a pure language model to reason. Frameworks such as OpenAI’s function calling or the emerging Model Context Protocol (MCP) introduce an external reasoning layer. When the model encounters an equation, a code block, or a structured query, it does not solve it, it delegates it. The model recognizes the linguistic shape of a problem and routes it to a deterministic engine such as a Python interpreter, a symbolic math solver, or another tool connected through MCP.
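A minimal sketch of that delegation pattern, with the model's routing decision faked by a keyword heuristic. The names and the heuristic here are illustrative, not an actual function-calling API; the point is that the arithmetic itself runs in a deterministic engine, not in the model.

```python
import ast
import operator as op

# Deterministic "tool": a tiny AST walker for arithmetic. No eval().
SAFE_OPS = {ast.Add: op.add, ast.Sub: op.sub,
            ast.Mult: op.mul, ast.Div: op.truediv}

def safe_eval(expr: str) -> float:
    """Evaluate arithmetic by walking Python's own parse tree."""
    def walk(node):
        if isinstance(node, ast.BinOp) and type(node.op) in SAFE_OPS:
            return SAFE_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

def handle(request: str) -> str:
    # In a real system the LLM emits a structured tool call here;
    # this trivial check stands in for that routing decision (hypothetical).
    if any(c in request for c in "+-*/"):
        return str(safe_eval(request))
    return "delegate to language model"
```

The model recognizes the shape of the problem; the parser computes the answer. That division of labor is the whole point of function calling and MCP.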

These reasoning frameworks also introduce their own limitations. The model can only use the tools it has been given, and each must be described in context. A growing toolset quickly consumes the model’s input window, while the logic that decides which tool to invoke must be handcrafted and conflict free. The orchestration that enables reasoning therefore adds its own layer of complexity, and that complexity lives outside the model.

A related point often raised is the emergence of LLM-native reasoning methods such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) prompting. These techniques improve apparent reasoning by structuring the model’s text generation into intermediate steps or branching paths. Yet even here, the model does not truly compute; it extends patterns of linguistic reasoning learned from text. The logical structure comes from the orchestration around it, from the tree search, the scoring, and the pruning, not from the model itself. They remain forms of pattern completion shaped by external control. Each additional path requires new calls to the model, increasing both inference cost and energy consumption while yielding the appearance, not the substance, of deeper reasoning.

This hybrid approach works, and in practice it is the only way to make a general-purpose assistant appear intelligent. Yet it also exposes the core truth: the orchestration that enables this behavior is deterministic and rule-based. The reasoning happens elsewhere. The LLM provides language, not logic. The intelligence lives in the orchestration, not in the network itself. In other words, the model belongs at the edge, while classical algorithms remain at the core.

III. The Memory That Isn’t There

If the first illusion is that a language model can think, the second is that it can remember. In observability, the challenge is not one question, one answer, or one file. It is hundreds of millions of small pieces of information: logs, traces, metrics, and events, all emitted across thousands of nodes and thousands of services.

A single customer operation may touch hundreds of microservices, each producing thousands of spans and log lines. The resulting context is vast, fluid, and noisy.

In theory, the transformer architecture seems ideal for this. Its self-attention mechanism allows each token to look at every other, creating the impression of perfect recall. In practice, the illusion breaks. As the context window fills, relevance decays. Early information is still present in the text, but the statistical weight that connects it to the present fades. The model does not forget in the human sense; it simply stops attending, assigning less and less weight to what came before.

This effect is easy to reproduce with text. Give a model a clear clue early in a long prompt, then surround it with irrelevant sentences. As the sequence grows, the probability that it will recall or act on that clue drops sharply. Add enough noise, and it answers as if the clue never existed. The context has not vanished, it has merely drowned in its own statistics.

This matters for observability because real systems are saturated with noise. Logs are verbose. Metrics stream endlessly. Traces branch exponentially. Even a single service under load can generate more tokens than a model can meaningfully attend to. The longer the prompt, the more the signal dissolves into background probability.

You can see this effect with a simple experiment:

| Input token length | Runs | Average time | Accuracy |
|---|---|---|---|
| 7,826 | 10 | 23 s | 100% |
| 160,505 | 10 | 25 s | 50% |
| 171,363 | 10 | 26 s | 30% |
| 205,381 | 10 | 21 s | 0% |

Accuracy refers to the model’s ability to produce the correct answer. The evaluation was performed using GPT-5-2025-08-07, with a context window of 500,000 tokens (including both the question and the response) at the time this article was written. In each test, a clue was placed at the beginning of the text, and the question required the model to recall and use that information to respond correctly.
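The setup behind this experiment is straightforward to reproduce. Here is a hypothetical sketch of how such a prompt can be constructed; the clue, filler sentences, and question wording are all illustrative, not the exact text used in the runs above.

```python
import random

def build_prompt(clue: str, question: str, noise_sentences: int, seed: int = 0) -> str:
    """Plant one clue at the start, pad with filler, then ask about the clue."""
    rng = random.Random(seed)
    fillers = [  # log-like filler, meaningless but plausible (illustrative)
        "The service restarted without incident.",
        "Routine garbage collection completed.",
        "Heartbeat received from node.",
        "Cache hit ratio remained nominal.",
    ]
    noise = " ".join(rng.choice(fillers) for _ in range(noise_sentences))
    return f"{clue} {noise} Question: {question}"

prompt = build_prompt(
    clue="Note: the deployment key for today is ORANGE-7.",
    question="What is today's deployment key?",
    noise_sentences=5000,
)
# The clue is a fixed-size prefix; the noise grows without bound,
# so its relative statistical weight shrinks as the prompt lengthens.
```

Scaling `noise_sentences` up while keeping the clue fixed reproduces exactly the dilution the table measures: the answer is always present, yet recall degrades with length.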

And there lies the real difficulty. When a prompt is crafted, the author does not know what is noise and what is not, because that is precisely what the model is being asked to determine. In observability, the input is not an answer but a field of chaos from which a cause must be found. Yet for the model, every token carries equal statistical weight. It cannot know which part of the text is the clue and which part is distraction.

The result is not understanding, but correlation without hierarchy, a form of attention that sees everything and grasps nothing.

IV. The Illusion of Understanding

The most convincing illusion of all is that a language model understands what it sees. It does not. It imitates the shape of understanding. When a model reads observability data, it has no concept of what a service is, what a trace represents, or what “normal” means in the context of a workload. It knows only that certain patterns of tokens tend to appear near others.

A log line that says “timeout” and one that says “error” are connected not by system semantics but by frequency in text. The model’s sense of causality is purely linguistic.

The failure becomes clear when we ask the model to explain a realistic scenario. Imagine a collection of traces that mix long-running batch jobs with real-time service calls. The batch paths are slow by design; the real-time paths are meant to be fast. One of the real-time services begins to degrade, but the absolute latency increase is small compared to the total duration of the batch. The model compares all paths and declares that the batch job is the bottleneck since it takes the longest time. It is correct statistically and completely wrong operationally.

To make this failure tangible, consider a simple trace dataset:

Path A → B → C: real-time API called from the UI, average duration 550 ms
Path E → B → C → D: batch job, average duration 122,500 ms
Path F → B → D: async batch job, average duration 1,000 ms
Path A → B → H → D: async batch job triggered from the UI, average duration 535 ms

| Prompt | Runs | Average time | Accuracy |
|---|---|---|---|
| Generic prompt | 10 | 2 min | 0% |
| Specific prompt looking for latency | 10 | 2 min | 20% |
| Specific prompt looking for a CPU-latency correlation | 10 | 2 min | 60% |

The test was conducted using GPT-5-2025-08-07, the latest model available at the time this article was written.

In this example, Service C shows a clear increase in CPU and latency, and every path that includes C slows slightly. Yet the total duration of each path is dominated by Service D, which is a batch component that is slow by design.

A human engineer immediately notices that C is deviating while D remains stable. The language model does not. It sees that D is both slowest and common to all traces, so it concludes that D must be the cause. From its perspective, the statistical signal of D’s consistent slowness outweighs the subtler relative variation in C.
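For contrast, here is roughly what a deterministic core does with the same data: compare each service against its own baseline and rank by relative deviation rather than absolute duration. The numbers are illustrative, loosely matching the scenario above.

```python
# Per-service average latencies, before and during the incident (illustrative).
baseline_ms = {"A": 50, "B": 40, "C": 120, "D": 120000, "H": 80}
current_ms  = {"A": 52, "B": 41, "C": 340, "D": 121000, "H": 79}

def rank_deviations(baseline: dict, current: dict) -> list:
    """Return services sorted by relative latency change, largest first."""
    deltas = {
        svc: (current[svc] - baseline[svc]) / baseline[svc]
        for svc in baseline
    }
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)

ranked = rank_deviations(baseline_ms, current_ms)
# C tops the ranking (~183% slower) even though D dominates absolute time:
# dividing by each service's own baseline is what encodes "slow by design".
```

One division per service is all it takes to separate the deviating real-time component from the legitimately slow batch component, which is exactly the distinction the language model misses.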

The result is an answer that sounds correct but misses causality. The model identifies the largest number, not the meaningful change. It cannot weigh relative deviation, understand dependency, or distinguish expected latency from anomaly. It observes the data but cannot interpret the design behind it.

The model detects difference without comprehension. It recognizes that something changed but not what that change means. To it, all differences are equal and all time is the same. Without embedded knowledge of system behavior, dependencies, or intent, it cannot distinguish a healthy slow process from a degraded fast one. It sees the data but not the design.

This is the essence of the illusion. The output looks intelligent because it mimics the language of explanation: “Service D is slow, therefore it causes latency.” The words are fluent, the logic seems sound, but underneath, no reasoning occurred. What was produced was not an insight but a statistical continuation of the sentence “Latency increased because…”.

True understanding requires structure, a map of what depends on what, what normal looks like, and what deviation matters. None of that exists inside the model. The transformer predicts plausible continuations, not causal chains. It replaces inference with coherence.

And this is why, in observability, language alone cannot replace logic. The system does not need to sound intelligent, it needs to be correct.

V. The Prompt Paradox

Even if a language model could remember and reason, its answers would still depend entirely on how we ask the question. The prompt becomes the lens through which the model sees the system, and the lens shapes the world it describes.

In observability, this creates a paradox. To ask the right question, we must already suspect the kind of problem we are looking for. If we prompt the model to “find correlations between CPU and latency,” we have already decided that CPU is a likely cause. But we could have asked instead about queue length, database I/O time, thread contention, or memory allocation. Each version of the prompt defines a different reality.

The problem is not that the model fails to follow instructions, but that the instructions encode our bias. When we must describe what we want to discover, the act of prompting replaces investigation with confirmation. The search space collapses to what we can imagine, and the intelligence of the process shifts from the model to the prompter. The model does not uncover new truth; it reflects the boundaries of our own understanding.

In structured analytics, algorithms can scan hundreds of signals, rank their correlations, and surface what matters statistically. A language model cannot do that without being told what to look for. It does not explore, it completes. It extends the language of the question, not the logic of the data.
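A sketch of that structured alternative, here using a plain Pearson correlation over a handful of candidate signals. The signal names and values are invented for illustration; a real engine would scan hundreds of signals the same way.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

latency = [100, 130, 180, 260, 400]           # ms, per time bucket
signals = {                                    # synthetic candidate signals
    "cpu":          [30, 42, 55, 71, 95],      # tracks latency closely
    "queue_length": [5, 5, 6, 5, 6],           # nearly flat
    "disk_io":      [200, 190, 210, 205, 195], # uncorrelated noise
}

# Rank every signal by correlation strength: no hypothesis required.
ranked = sorted(
    ((name, abs(pearson(values, latency))) for name, values in signals.items()),
    key=lambda kv: kv[1],
    reverse=True,
)
# "cpu" surfaces at the top without anyone having asked about CPU.
```

The decisive difference is that nothing here depends on how a question was phrased: every candidate is tested, and the ranking falls out of the data.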

This is why prompting feels powerful but remains fragile. It works when we already know where to look, and it fails when we do not. True insight requires a system that can reason without instruction, one that can ask its own questions. Until then, prompting will remain a form of guided storytelling, not discovery.

VI. The Cost of the Illusion

All these experiments were small, yet each query took more than a minute to complete on average. The result was slow answers, partial truths, and sometimes confident mistakes. The computational effort behind them was disproportionate to their insight. What looked like reasoning was in fact expensive guessing.

Behind every LLM response lies an immense amount of computation. Each token prediction requires billions of matrix multiplications across thousands of GPUs. These chips run at hundreds of watts each, and a single large inference can consume as much energy as a household appliance running for several minutes. A complex multi-minute query may draw the equivalent power of keeping a light bulb on for a day.

While model efficiency continues to improve, the physics remain the same. Probabilistic reasoning at trillion-parameter scale is not free. What a traditional statistical engine can compute deterministically in milliseconds, a language model reproduces through dense probabilistic search at hundreds of times the energy cost.

The irony is that these models consume vast energy to simulate what simpler algorithms already know how to do. Every layer of linguistic inference hides layers of heat. We pay in kilowatt-hours for the appearance of understanding.

VII. The Scale of Reality

Each of these experiments was carefully designed to make a point. The examples were simple, the variables controlled, and the questions well defined. In the real world, we do not have that luxury. At Salesforce, where we monitor critical user journeys and thousands of microservices, telemetry does not arrive as a single neat dataset. It comes as a flood.

A single synthetic customer journey can generate more than 1 MB of logs across all involved services for one run. A single distributed trace can exceed 5 MB once tags, attributes, and contextual metadata are included. None of this can fit meaningfully into a model’s context window. Even if it did, interpreting it would take minutes and cost hundreds of times more than deterministic analytics.

Worse, the model would not know what to look for. It would wade through normal variability, irrelevant events, and repeated noise, searching for a pattern in data that has no defined problem. In doing so, it could easily chase false correlations or apparent anomalies that are statistically insignificant. What feels like reasoning would again be pattern completion, only now scaled to gigabytes of telemetry.

And even then, a single trace or a single set of logs tells us nothing about anomaly or cause. To detect degradation, we must compare good and bad scenarios across thousands of executions. No current LLM can reason across that quantity of hybrid structured and unstructured data in context. The limits are not only architectural but physical.

This is where the boundary becomes clear. The model is not the brain; it is the voice. The real intelligence lies in how we connect structured logic to probabilistic language. The LLM should stand at the edge, interpreting intent, summarizing evidence, and directing either human reasoning or the next deterministic step in the orchestration layer. At the core, classical systems perform the actual analysis, correlation, and detection. The brain is not the model itself but the architecture that links them together.

VIII. The Missing Nuance: Hybrid Architectures

The critique of pure LLM inference remains valid. The practical power of these models in observability comes from hybrid architectures that combine language and logic. The LLM does not replace algorithms. It translates human intent into structured actions that deterministic systems can execute.

Initial signal

Human: A growing number of customers are reporting slow performance in feature X. What is happening?

Orchestration

LLM: Uses retrieval and embeddings to surface potentially related textual information such as past incidents, problem reports, runbooks, and service ownership linked to feature X.

It does not yet know the cause but gathers prior knowledge to form an initial hypothesis and outline possible investigation paths.

This information must then be interpreted, and structured operational data must be added to the context before analysis can begin.

In practice, even the orchestration layer has limits. The model can only access the tools defined for it, and describing those tools consumes valuable context space. As the number of tools grows, the control logic that determines which one to call becomes increasingly difficult to maintain and prone to conflict.

Embeddings are not a one-time artifact. They must be created, versioned, refreshed, and re-indexed as services, schemas, and traffic patterns evolve. Without governance for drift detection, retraining schedules, and temporal indexing, the notion of similarity degrades over time and retrieval quality falls. In practice this means the platform maintains embeddings as core infrastructure, with clear ownership, reproducibility, and retention policies.

At enterprise scale, building and maintaining these embeddings is itself a frontier problem. A platform like Salesforce processes petabytes of logs and traces every day. Creating and storing embeddings for every record would be computationally and financially prohibitive. The cost of generating, hosting, and refreshing these high-dimensional representations at scale cannot be ignored. In practice, the system must rely on sampling, aggregation, and on-demand embedding to keep the semantic layer manageable. Each of these choices, however, shapes what can later be retrieved or reasoned about. The challenge is not only to maintain embeddings, but to decide which parts of the system deserve to be represented at all.

It is important to note that this process is reactive. It starts when an incident already exists and needs explanation. In parallel, independent anomaly detection, Bayesian networks, or other machine learning algorithms must continuously analyze telemetry in the background. These systems detect deviations and can proactively trigger the LLM when a pattern resembles past issues. The LLM’s role is to contextualize and communicate findings, not to perform the detection itself.

Scope resolution

Resolves feature X to its underlying services and endpoints through the service catalog and dependency map.

Determines the relevant time window and the corresponding telemetry to collect for analysis.
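That resolution step is ordinary graph traversal. A hypothetical sketch, with the catalog and dependency entries invented for illustration:

```python
from collections import deque

# Hypothetical service catalog: feature → entry-point services.
FEATURE_ENTRYPOINTS = {"feature_x": ["checkout-api"]}

# Hypothetical dependency map: service → downstream services.
DEPENDENCIES = {
    "checkout-api": ["pricing-svc", "inventory-svc"],
    "pricing-svc": ["rules-engine"],
    "inventory-svc": ["warehouse-db"],
    "rules-engine": [],
    "warehouse-db": [],
}

def resolve_scope(feature: str) -> set:
    """Breadth-first walk from a feature's entry points over the dependency map."""
    seen, queue = set(), deque(FEATURE_ENTRYPOINTS[feature])
    while queue:
        svc = queue.popleft()
        if svc in seen:
            continue
        seen.add(svc)
        queue.extend(DEPENDENCIES.get(svc, []))
    return seen

scope = resolve_scope("feature_x")
# scope is the set of services whose telemetry the deterministic core should pull
```

Nothing probabilistic happens here, which is the point: the scope must be exact, because everything downstream analyzes only what this step selects.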

Hypothesis testing

LLM: Generates structured queries to retrieve logs, metrics, and traces for the suspected services, and frames the comparison against a recent baseline provided by the deterministic core.

It inspects attributes that appear to have shifted and formulates possible explanations.

Deterministic core

Executes these queries, computes the actual deltas, anomaly scores, conditional correlations, and service impact metrics.

The results then need to be prepared and reformatted before being passed back to the model.

This preparation includes aggregation, sampling, and normalization so that the data can fit within the model’s context window and be represented in a form the model can process.
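A sketch of that preparation step, assuming simple per-window aggregation; the window size and the statistics kept are chosen arbitrarily for illustration.

```python
import statistics

def summarize(points, window_s=60):
    """Aggregate (timestamp_s, value) points into per-window min/mean/max."""
    buckets = {}
    for ts, value in points:
        buckets.setdefault(ts // window_s, []).append(value)
    return [
        {
            "window": int(w * window_s),
            "min": min(vals),
            "mean": round(statistics.fmean(vals), 2),
            "max": max(vals),
        }
        for w, vals in sorted(buckets.items())
    ]

# Illustrative raw latency samples: (timestamp in seconds, latency in ms).
raw = [(0, 100), (10, 110), (30, 500), (65, 120), (90, 125)]
summary = summarize(raw)
# The 500 ms spike survives only in the first window's "max";
# a mean-only summary would have averaged it away.
```

Every choice in this function, the window size, which statistics to keep, discards detail the model will never see, which is precisely the information loss the next paragraph describes.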

Each of these steps involves choice. We decide which time ranges to keep, which attributes to group, and which anomalies to highlight. In doing so, we inevitably lose information. Important signals may be averaged out or hidden behind normalization. By simplifying the data to make it manageable for the model, we shape the narrative it will later construct. The LLM does not see the system as it is, but as we have compressed it to be.

What reaches the model, therefore, is already an interpretation, not the raw truth. If the summarization is incomplete or biased, the model may pursue the wrong hypothesis. This is not the model’s fault; it is the cost of working around its architectural limits.

Synthesis

LLM: Summarizes what is confirmed and what remains uncertain, suggests the next check, and requests missing context when needed.

Decision

Human: Validates or adjusts the hypothesis, adds its own knowledge and analysis, and decides whether to continue the investigation or pivot to a deterministic path.

This architecture avoids the pitfalls of pure inference. The LLM stays at the edge for intent, semantic retrieval, and synthesis. The core remains deterministic for computation, correlation, and causality.

There is a crucial layer in between. Orchestration is where the logic code lives. It coordinates which data to pull, how to transform it, and when to involve the model. It also performs reduction so that raw telemetry can fit within model limits. Reduction is necessary, but it is also a filter. Each transformation drops detail. An important signal can be lost at this step.

We cannot send everything to the model and hope it will find the problem. Choosing what to include already shapes the outcome. Observability data must be curated, aggregated, and interpreted before it reaches the LLM. The intelligence of the system therefore lives in the architecture that selects, structures, and verifies what the model can see.

The hybrid model confirms the principle rather than refuting it. Language belongs at the edge, logic at the core, and orchestration stands between them as the bridge where intelligence is designed, not guessed.

IX. Language at the Edge, Logic at the Core

The experiments, the numbers, and the examples all lead to the same truth. Large language models are extraordinary linguistic machines, but language is not thought. They can describe, summarize, and hypothesize, yet they cannot reason, remember, or decide. Their strength is expression, not inference.

In observability, this distinction matters. A system that must identify the cause of failure cannot rely on a model that only predicts the next most plausible sentence. Logs, metrics, and traces demand structure, context, and comparison. They require rules, baselines, and statistical rigor. These are not weaknesses of language models; they are reminders that language is only one layer of intelligence.

The path forward is not to make LLMs bigger, but to make systems smarter around them. The real innovation lies in architecture. The LLM should serve as the interpreter, the assistant at the edge that helps humans navigate complexity, translate findings, and express hypotheses. The analytical core should remain deterministic, powered by algorithms that understand numbers, relations, and time.

Between the two lives orchestration, where intent becomes computation and data becomes narrative. This is the true brain of the system. It decides which data to retrieve, how to reduce it, and what the model can see. Each transformation is both necessary and dangerous, because reduction introduces bias and information loss. The goal is not to eliminate this filtering but to design it with awareness, so that meaning is preserved as data moves from logic to language.

When language and logic cooperate through careful orchestration, observability becomes a dialogue between explanation and evidence. The model provides fluency, the core provides truth, and the orchestration ensures that one does not distort the other. Together, they can accelerate discovery without replacing understanding.

At Salesforce, this principle defines the next generation of observability systems: language models translate intent, retrieve prior knowledge, and present findings, while deterministic analytics surface, rank, and verify the signals. The models help engineers communicate and navigate, the algorithms detect and decide. Real intelligence comes from the architecture that connects the two.

The illusion of understanding dissolves when we recognize where understanding truly lives. It does not live in words. It lives in the structure that gives them meaning.