
Why Human Judgement Has Become A Runtime Dependency And What That Means For The Future Of The Web
Overview
In this conversation, Sergey Polyashov, Chief Operating Officer at Toloka, explains why human feedback is moving beyond training pipelines and becoming a runtime dependency for agentic AI. He discusses how expert judgment is being embedded directly into workflows, why LLM-as-a-judge systems still need human calibration, and how deterministic evaluation methods can provide cleaner signals for complex enterprise tasks.
The interview also explores what agent-driven discovery means for publishers, brands, and businesses that have long depended on clicks, rankings, and SEO. As AI agents increasingly mediate decisions on behalf of users, companies will need to optimize not only for human attention but also for machine-readable trust, structured context, and reliable signals. At the center of this transformation is a simple but powerful idea: as AI becomes more autonomous, human judgment becomes more essential.
TD Editor: The internet ran on search, recommendation, and direct action as three distinct layers for two decades. You're saying that architecture is collapsing. What's actually driving that, and why now?
Those three layers were always more separate than they looked. Search had its own optimisation logic: relevance judgements and A/B tests on click-through. Recommenders lived on implicit signals: dwell time, completion rate. Conversion funnels were their own discipline. Different teams, different metrics, different cultures within the same company.
What generative agents have done is collapse all of that into a single surface. A user states an intent in natural language, and the system handles retrieval, ranking, reasoning, tool use, and transaction without a search bar or a checkout form ever appearing. That's not theoretical anymore. You can watch it happen today in OpenAI's agentic browsing tools, in Perplexity, and in what The Browser Company is doing. They've taken the entire discovery-to-purchase arc and folded it into a chat interface.
The conversion data are pretty striking: AI-search-driven traffic is converting at around 14% versus under 3% for traditional search. The user's intent hasn't changed. What's changed is that the friction is gone. But here's the problem that comes with that: when you compress ten documents into one synthesised answer, the stakes per decision go up sharply. A click-through rate tells you something in aggregate. An agent that makes the wrong purchase still logs a successful event. The failure only shows up at the level of the actual outcome. That's why human feedback has migrated from training pipelines into runtime.
TD Editor: Toloka has been deep inside RLHF pipelines for years. How has the actual function of human feedback shifted as agents become the dominant paradigm?
When InstructGPT came out in 2022, human feedback was essentially a training-time project. You gathered preference pairs, built a reward model, ran PPO, and shipped. That framing made sense for the models of that era - InstructGPT outputs scored substantially better with labellers than GPT-3, and RLHF became foundational almost overnight.
What we've learnt since is that pure RLHF, trained on short response pairs, doesn't generalise cleanly to multi-step agent workflows. Errors accumulate across tool calls. The reward model develops blind spots in exactly the places where the stakes are highest. The failure modes are well-documented at this point: reward hacking, distributional shift, evaluators who encode noise rather than real signal.
The architectural response has been to push human feedback out of the training phase and into the deployment loop. Modern human-in-the-loop systems interleave judgement continuously, using active learning to prioritise which outputs actually need review. Uncertain cases. Edge cases. High-stakes decisions. The mental model shifts from "humans label, model trains, and model deploys" to something more like a continuous dialogue where the model proposes and humans verify at low-confidence thresholds and both sides improve over time. Human judgement stops being a project milestone. It becomes infrastructure.
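The "prioritise which outputs actually need review" step can be sketched in a few lines. This is only an illustration under stated assumptions: the `AgentOutput` fields, the `select_for_review` helper, and the idea that the model exposes a usable confidence score are all hypothetical and not taken from Toloka's systems.

```python
from dataclasses import dataclass


@dataclass
class AgentOutput:
    task_id: str
    confidence: float  # assumed: a model-reported confidence in [0, 1]
    high_stakes: bool  # assumed: flagged by a policy rule upstream


def select_for_review(outputs, budget):
    """Pick up to `budget` outputs for human review.

    High-stakes items come first, then the least confident ones.
    """
    ranked = sorted(outputs, key=lambda o: (not o.high_stakes, o.confidence))
    return ranked[:budget]


outputs = [
    AgentOutput("refund-101", 0.97, False),
    AgentOutput("refund-102", 0.41, False),
    AgentOutput("wire-201", 0.88, True),
    AgentOutput("faq-301", 0.99, False),
    AgentOutput("refund-103", 0.55, False),
]

review_queue = [o.task_id for o in select_for_review(outputs, budget=3)]
print(review_queue)  # ['wire-201', 'refund-102', 'refund-103']
```

The review budget is the operational knob here: reviewers see the high-stakes and uncertain cases, while confident routine outputs pass through unreviewed.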
TD Editor: Toloka recently launched Tendem and an MCP integration that lets agents request expert judgment mid-workflow. Architecturally, why does that matter, and what does it actually look like in practice?
The thing that's changing is where in the stack humans sit. Traditionally, expert review happened after the model produced output, as a post-hoc check. With Tendem, expert judgement starts functioning as part of the execution layer itself.
Most enterprise AI systems don't fail because the model is fundamentally incapable. They fail at edge cases - moments where the agent lacks enough context or confidence to make a reliable call. A tiny error rate might look acceptable in aggregate, but in compliance, healthcare, and financial operations, even small mistakes can create real downstream risk that doesn't show up until it's expensive.
What the MCP integration enables is basically a structured escalation path. When the agent hits a low-confidence scenario or crosses a policy threshold, it requests input from a verified domain specialist in real time, gets a decision, and continues the workflow. Human expertise stops being a separate review stage and starts behaving like a production dependency, with latency expectations, auditability requirements, escalation logic, and quality guarantees. All the operational characteristics you'd normally associate with infrastructure services.
The hard problem, interestingly, isn't the integration itself. It's calibrating when the system should escalate. Too aggressive, and the workflow slows to a crawl. Too relaxed, and the agent acts autonomously in situations where it really shouldn't. Getting that threshold right is more art than science right now.
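One way to make the escalation trade-off concrete is to sweep candidate thresholds over logged cases and look at both sides of the ledger. The sketch below uses made-up data and a hypothetical `sweep_thresholds` helper. It assumes each logged case carries a confidence score and a later verdict on whether the agent was right, which is not something the interview specifies.

```python
def sweep_thresholds(cases, thresholds):
    """For each threshold, escalate every case whose confidence is below it.

    cases: list of (confidence, agent_was_correct) tuples.
    Returns one row per threshold with the escalation rate and the number
    of errors the agent would have made on cases it handled alone.
    """
    rows = []
    for t in thresholds:
        escalated = [c for c in cases if c[0] < t]
        autonomous = [c for c in cases if c[0] >= t]
        errors = sum(1 for _, correct in autonomous if not correct)
        rows.append({
            "threshold": t,
            "escalation_rate": len(escalated) / len(cases),
            "autonomous_errors": errors,
        })
    return rows


# Illustrative data only: (confidence, was the agent's decision correct?)
cases = [
    (0.95, True), (0.90, True), (0.85, False), (0.70, True),
    (0.60, False), (0.55, True), (0.40, False), (0.30, False),
]

rows = sweep_thresholds(cases, [0.5, 0.75, 0.9])
for row in rows:
    print(row)
```

On this toy data, raising the threshold from 0.5 to 0.9 removes the autonomous errors but triples the share of cases sent to a human, which is the slowdown-versus-risk tension described above.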
And the real advantage Toloka has here isn't the interface layer - plenty of companies can build an API. It's the operational network underneath: large pools of verified specialists across domains and languages, fraud prevention, response quality at scale. That's genuinely difficult to replicate.
TD Editor: LLM-as-a-judge systems are everywhere right now. How reliable are they actually, and where do they quietly break down?
It is more complicated than the marketing suggests. Honestly.
At a high level, LLM judges work reasonably well. There's a strong correlation between GPT-4-class evaluations and human preferences - Spearman correlations in the 0.8 to 0.9 range. That's good enough to make automated evaluation very attractive for benchmarking and large-scale testing, and I understand why organisations are adopting it fast.
The issue is that aggregate metrics hide where the failures actually happen. When you get into hard comparisons - cases where two models perform similarly - consistency starts to degrade. Recent stress tests showed that even the leading judge models struggle to maintain stable preferences in roughly a quarter of difficult evaluations. Close-gap scenarios, where the quality difference is genuinely subtle, produced about twice the inconsistency seen in more obvious comparisons. That's not a rounding error.
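A simple way to probe this kind of instability is to ask the judge the same comparison twice with the answer order swapped and count how often the winner changes. The stress tests mentioned above are not described in detail, so the swap-order check here is an assumption, and `toy_judge` is a hypothetical stand-in for a real judge model, with a deliberate bias toward the first answer on close calls.

```python
def toy_judge(first, second):
    """Hypothetical judge: prefers the higher score, but favours whichever
    answer is shown first when the gap is small (under 0.05)."""
    if abs(first["score"] - second["score"]) < 0.05:
        return "first"
    return "first" if first["score"] > second["score"] else "second"


def inconsistency_rate(pairs, judge):
    """Fraction of pairs whose winner changes when presentation order is swapped."""
    flips = 0
    for a, b in pairs:
        winner_ab = a["id"] if judge(a, b) == "first" else b["id"]
        winner_ba = b["id"] if judge(b, a) == "first" else a["id"]
        if winner_ab != winner_ba:
            flips += 1
    return flips / len(pairs)


pairs = [
    ({"id": "A", "score": 0.90}, {"id": "B", "score": 0.60}),  # clear gap
    ({"id": "C", "score": 0.71}, {"id": "D", "score": 0.70}),  # close gap
    ({"id": "E", "score": 0.40}, {"id": "F", "score": 0.80}),  # clear gap
    ({"id": "G", "score": 0.52}, {"id": "H", "score": 0.55}),  # close gap
]

rate = inconsistency_rate(pairs, toy_judge)
print(rate)  # 0.5
```

In this toy setup all of the flips come from the close-gap pairs, which mirrors the point that aggregate agreement can look fine while the hard comparisons stay unstable.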
It matters because the hardest evaluations are often the ones most consequential to organisations. If two systems are near the top of a leaderboard, small ranking errors affect procurement decisions, fine-tuning strategy, deployment choices. You can have benchmarks that look statistically solid in aggregate and still be unreliable at the exact decision boundary that actually matters.
Agent-as-a-judge systems are a meaningful step forward, especially for code generation - agents can reason about requirements and execution results in ways static evaluators can't. But even there, reliability depends on the evaluation framework being grounded in human-defined requirements. Remove the human reference point, and the system starts judging itself against its own assumptions. At that point you've got a loop with no external anchor. Automated evaluation will scale because the economics demand it. But it remains only as reliable as the human baseline used to calibrate it.
TD Editor: Toloka Arena uses deterministic scoring, comparing the final database state against a "golden state". Why is that such a powerful evaluation method, and where does it hit a wall?
The strength of deterministic evaluation is that it cuts through interpretive subjectivity. In the Arena manufacturing benchmark, the agent is working inside a simulated enterprise stack: ERP systems, manufacturing execution software, inventory databases, internal messaging tools, and policy documentation. The task isn't answering a question correctly. The agent has to identify the right system, extract the right data, interpret business rules that frequently interact in non-obvious ways, then act across multiple tools in the correct sequence. At the end, the system hashes the resulting database state and compares it to a known-good result. Either the workflow reached the correct operational state or it didn't.
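The hash-and-compare step can be illustrated with a small, self-contained sketch. It is not Toloka Arena's implementation: the `state_hash` and `make_db` helpers, the `inventory` table, and the choice of SQLite and SHA-256 are all assumptions made for the example.

```python
import hashlib
import sqlite3


def state_hash(conn):
    """Hash every table's rows in a canonical (sorted) order, so that
    insertion order does not affect the result."""
    digest = hashlib.sha256()
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )]
    for table in tables:
        digest.update(table.encode())
        rows = sorted(conn.execute(f'SELECT * FROM "{table}"').fetchall(), key=repr)
        for row in rows:
            digest.update(repr(row).encode())
    return digest.hexdigest()


def make_db(rows):
    """Build an in-memory database with a hypothetical inventory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE inventory (sku TEXT PRIMARY KEY, qty INTEGER)")
    conn.executemany("INSERT INTO inventory VALUES (?, ?)", rows)
    return conn


golden = make_db([("bolt-m8", 120), ("nut-m8", 80)])
agent_ok = make_db([("nut-m8", 80), ("bolt-m8", 120)])   # same state, different order
agent_bad = make_db([("bolt-m8", 120), ("nut-m8", 75)])  # one quantity is wrong

passed_ok = state_hash(agent_ok) == state_hash(golden)
passed_bad = state_hash(agent_bad) == state_hash(golden)
print(passed_ok, passed_bad)  # True False
```

Sorting the rows before hashing matters: two runs that reach the same final state through different sequences of writes should score the same, while any difference in the data itself should fail the check.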
It's conceptually similar to how competitive programming platforms evaluate submissions, scaled up for enterprise environments with multiple interconnected systems. The signal is clean. You're measuring outcomes, not passing judgement on the style or quality of the reasoning that got there.
The downside is how much it costs to do this right. Building realistic environments is slow and expensive - much slower than most people expect. Each benchmark requires domain-specific tooling, policy layers, datasets, and validated execution paths. A single environment can take months of engineering time and domain expertise to build properly.
That's probably the biggest structural limitation for the field right now. There's no shared evaluation infrastructure. What we actually need are open environments, standardised interfaces, and publicly available golden trajectories, something closer to what ImageNet became for computer vision. Without that, high-quality deterministic evaluation stays locked inside organisations wealthy enough to build it themselves.
TD Editor: What does agent-driven discovery mean for publishers, brands, and businesses built on web visibility? Is this actually the end of SEO?
SEO doesn't disappear. But its position at the centre of the web economy starts to weaken, yes. The deeper shift is that websites are increasingly being visited, evaluated, and in some cases transacted with by AI agents rather than humans directly. That changes what optimisation even means.
For years, digital strategy was built around human attention - rankings, clicks, landing pages, conversion funnels. Now companies are effectively optimising for two audiences simultaneously. Humans still need clarity, trust, and a good experience. Agents need structured information, reliable context, and signals they can interpret with high confidence. Those requirements don't always align.
That's where generative engine optimisation starts to emerge as a real discipline - not just appearing in a ranked list of results, but being surfaced inside an AI-generated answer or selected during an agent-driven decision flow. The commercial model changes too. I'd expect more transaction-based economics around agent ecosystems: commissions, preferred integrations, category partnerships, and potentially paid placement within recommendation flows. Less like keyword advertising, more like infrastructure for machine-mediated commerce.
What's genuinely interesting, though, is that this environment may actually reward credibility more consistently than traditional SEO did. Old-school search was heavily shaped by tactical manipulation: backlinks, keyword density, and content farms. AI systems aren't impossible to game, but reasoning-based evaluation works differently from a static ranking algorithm. Authentic reviews, long-term reputation, expert citations, consistent product quality - these signals are more important when an agent is trying to make a purchasing decision on behalf of a user. Surface-level optimisation often isn't enough.
Companies that built real trust and strong products before this shift are probably better positioned than those that leaned heavily on search mechanics. That's a meaningful inversion.
Tue, Jun 2, 2026
