Why Human-AI Teamwork Isn't Really Teamwork Yet

A systematic review of 105 studies reveals that 72% of human-AI interactions use passive supervision patterns, not genuine collaboration. Exploring the taxonomy of interaction patterns and what true partnership requires.


I came across this research while reading about human-AI interaction design. The paper's title—"Human-AI collaboration is not very collaborative yet"—caught my attention as it addresses a topic I find interesting: the gap between what AI systems can do and how humans actually work with them in practice. This systematic review by Gomez et al. (2024) from Johns Hopkins University examines how humans interact with AI in decision-making contexts across 105 empirical studies. The paper offers insights into interaction patterns and collaboration dynamics that I hadn't encountered before in this level of detail.

In this essay, I'll synthesise the key findings from this systematic review to examine why human-AI collaboration remains largely superficial in practice. I'll first establish the research context, then explore the core findings, organised around three pillars that explain the collaboration deficit.

So let's dive in!

Mapping the Interaction Landscape

The researchers conducted a systematic review spanning 2013-2023, analysing 105 peer-reviewed studies featuring actual user interactions with AI systems in decision-making tasks. They excluded studies focused purely on perceptions or technical contributions, concentrating instead on empirical evidence of how humans and AI actually interacted.

Their analysis covered 131 separate interaction sequences across domains including healthcare (26 studies), finance/business (15), generic tasks (20), law (6), and social media (8), and yielded a taxonomy of seven distinct interaction patterns. The researchers deliberately excluded robotics and gaming scenarios to focus on screen-based interfaces where interaction design choices were most transparent.

The central contribution was a structured vocabulary for describing and differentiating human-AI interactions. The seven patterns identified were:

  1. AI-first assistance (51% of observations): The AI prediction and decision problem were presented simultaneously, with users deciding whether to accept or reject the advice.
  2. AI-follow assistance (21%): Users made an initial prediction before seeing the AI's recommendation, allowing comparison against their preliminary judgment.
  3. Secondary assistance (12%): AI provided supplementary information (like risk scores) rather than direct predictions, which users had to interpret and apply.
  4. Request-driven AI assistance (19%): Users actively controlled when and how they received AI input, rather than having it automatically presented.
  5. AI-guided dialogic engagement (5%): AI facilitated a dialogue-like exchange, guiding users through information provision in an iterative process.
  6. User-guided interactive adjustments (7%): Users could modify the AI's outcome space through feedback, corrections, or parameter changes.
  7. Delegation (7%): Either the user or AI could delegate decision-making responsibility based on assessed confidence and capability.
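To make this vocabulary concrete, here is a minimal Python sketch of the taxonomy as an enumeration, with the reported share of observations attached to each pattern. The enum and dictionary are my own illustration of the paper's vocabulary, not code from the paper.

    from enum import Enum

    class InteractionPattern(Enum):
        # The seven interaction patterns identified by Gomez et al. (2024).
        AI_FIRST = "AI-first assistance"
        AI_FOLLOW = "AI-follow assistance"
        SECONDARY = "Secondary assistance"
        REQUEST_DRIVEN = "Request-driven AI assistance"
        DIALOGIC = "AI-guided dialogic engagement"
        USER_ADJUSTMENTS = "User-guided interactive adjustments"
        DELEGATION = "Delegation"

    # Share of observations exhibiting each pattern, as listed above.
    OBSERVED_SHARE = {
        InteractionPattern.AI_FIRST: 0.51,
        InteractionPattern.AI_FOLLOW: 0.21,
        InteractionPattern.SECONDARY: 0.12,
        InteractionPattern.REQUEST_DRIVEN: 0.19,
        InteractionPattern.DIALOGIC: 0.05,
        InteractionPattern.USER_ADJUSTMENTS: 0.07,
        InteractionPattern.DELEGATION: 0.07,
    }

A structured vocabulary like this makes it easy to tag studies consistently and to see, at a glance, how heavily the distribution skews towards the supervision-style patterns at the top of the list.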

So Why Is Human-AI Collaboration Not Really Collaborative?

Pillar 1: The Dominance of Passive Supervision Patterns

A full 72% of observed interactions fell into just two patterns—AI-first assistance (51%) and AI-follow assistance (21%). These were fundamentally supervision paradigms, not collaboration paradigms.

AI-first assistance presented users with the AI's prediction alongside the decision problem simultaneously. The user's role was reduced to accepting or rejecting the AI's recommendation. A radiologist seeing an AI's tumour detection result before examining the image themselves exemplified this pattern—convenient, but cognitively limiting.

AI-follow assistance asked users to make an initial judgment, then presented the AI's prediction. While this seemingly gave users more agency, research showed participants rarely revised their provisional diagnoses when AI disagreed with their earlier assessment—suggesting confirmation bias rather than genuine reconsideration.

Together, these patterns positioned humans as "quality control" rather than collaborative partners. As the paper noted: "The human role was limited to supervising the AI predictions" in the vast majority of studies. This represented supervision masquerading as collaboration. True collaboration required bidirectional influence where both agents actively shaped the decision-making process. In these patterns, however, information flowed one way—from AI to human—with humans relegated to gatekeeping roles. The AI operated independently, generating predictions without human input, while humans merely validated or rejected outputs they had no role in creating. This asymmetry meant neither agent learned from the other, and the distinctive strengths of human judgment remained largely untapped.

Pillar 2: The Scarcity of Interactive Functionality

Genuinely interactive patterns—those involving sustained dialogue, user-guided adjustments, or dynamic delegation—appeared in fewer than 20% of cases:

  • AI-guided dialogic engagement: Only 6 instances, mostly conversational interfaces where AI guided users through information gathering
  • User-guided interactive adjustments: Just 9 cases where humans could modify AI's outcome space
  • Request-driven AI assistance: 25 instances where users actively controlled when to receive AI input

The researchers observed a critical gap: "Much of the HCI literature on AI assistance has concentrated on intermittent scenarios (i.e., turn-taking). This is in contrast to continuous user interaction scenarios, where user input is sustained and can receive AI feedback at any given moment."

This intermittent, turn-taking approach resembled passing notes more than genuine collaboration. The distinction mattered: in these systems, each agent took discrete turns—the AI generated output, paused, and waited; the human reviewed, responded, and waited. There was no opportunity for mid-process adjustments, real-time clarifications, or iterative refinement. Real collaboration—whether human-human or human-AI—required continuous, dynamic exchange where both parties could interject, question, and build on each other's contributions in the moment. A collaborative partnership allowed for immediate course corrections when one party spotted an issue, spontaneous exploration of alternative approaches, and the fluid back-and-forth that characterised productive teamwork. Current systems rarely supported this kind of sustained, responsive interaction.

Pillar 3: The Neglect of Cognitive Realities and Human Agency

Perhaps most problematically, prevalent interaction patterns actively induced cognitive biases rather than mitigating them:

Anchoring bias in AI-first patterns: When AI predictions appeared before users formed independent judgments, that initial information disproportionately influenced final decisions—even when users knew it might be wrong. The researchers noted: "The AI-first interaction also makes the user susceptible to the 'anchoring bias', a phenomenon where a person's judgment is biased based on initial information."

Confirmation bias in AI-follow patterns: When users made initial predictions then saw AI recommendations, they tended to seek supporting evidence for their hypothesis rather than genuinely reconsidering. One cited study found participants "rarely revised their provisional diagnoses when the AI inferences differed from their earlier assessment."

Loss of agency: Direct presentation of AI inferences created a "lack of sense of agency" for users—the subjective feeling of controlling one's actions and influencing external events. When AI assistance was mandatory and automatic rather than request-driven, users became passive recipients rather than active collaborators.

These cognitive distortions undermined the foundation of effective collaboration. Anchoring meant users couldn't bring fresh perspectives to problems—the very contribution that made human judgment valuable. Confirmation bias prevented the critical evaluation that might catch AI errors or identify cases where human expertise should override algorithmic predictions. Loss of agency reduced motivation and engagement, transforming decision-making from an active reasoning process into mechanical validation.

The evaluation metrics reinforced this problem. Across domains, researchers primarily measured:

  • Objective efficacy: Decision accuracy (the most common metric)
  • Trust and reliance: User-AI agreement, compliance frequency
  • Efficiency: Decision time

Notably absent from most studies were measures of genuine collaboration quality, mutual learning, or complementary expertise utilization. The field measured whether humans accepted AI advice, not whether human-AI teams achieved outcomes neither could reach alone. This evaluation framework optimised for compliance rather than complementarity, incentivising systems that made humans more agreeable rather than more capable.

Three Key Takeaways

We've Confused AI Assistance with AI Collaboration

The paper revealed a fundamental category error. Most systems provided AI assistance—one-way information flow from AI to human—rather than AI collaboration, which required bidirectional influence and shared agency. The prevalence of AI-first and AI-follow patterns demonstrated that researchers were building recommender systems and calling them collaborative partners.

True collaboration required what the researchers called "co-creating solutions in partnership with AI systems, actively involving them in the decision-making process. By combining the strengths of each agent, human intuition and expertise synergise with AI's computational efficiency and data-driven insights." This shift would directly address the passive supervision problem: instead of humans merely validating AI outputs, both agents would contribute their distinct capabilities throughout the decision process. The AI could surface patterns in data whilst humans provided contextual judgement and domain expertise, with each agent's input actively shaping the other's reasoning. This kind of bidirectional flow would transform quality control into genuine partnership.

Interaction Design Has Been an Afterthought

The researchers identified a critical imbalance: "There is often a disproportionate emphasis on the technological advancements, overlooking the critical aspects of user interface and experience. This oversight is apparent in many empirical studies where interactions with AI agents are typically reduced to basic actions like menu selections or button clicks."

The field had focused on what information AI should provide (the purview of explainable AI) whilst neglecting how and when to present it. The lack of a common vocabulary for interaction patterns prevented systematic thinking about these design choices. Deliberate interaction design could enable the continuous, dynamic exchange missing from current systems. Rather than turn-taking, designers could create interfaces supporting sustained dialogue—where users questioned AI reasoning in real-time, adjusted parameters on the fly, and received immediate feedback on proposed alternatives. This would replace the "passing notes" dynamic with fluid conversation, allowing both agents to refine their contributions iteratively rather than in discrete, disconnected steps.

The Evaluation-Design Feedback Loop is Broken

Current evaluation practices reinforced simplistic interactions. By primarily measuring whether humans accepted AI advice (trust/compliance metrics) rather than team performance or complementary contributions, the field incentivised systems that made humans more compliant rather than more capable.

The researchers noted: "Even when labelled as high-stakes, the lack of real consequences can influence user engagement. We must think more about differences in how users behave in experimental tasks and in equivalent real-life scenarios." Different metrics would drive different designs. If researchers measured complementary performance—whether human-AI teams handled diverse cases better than either agent alone—they would optimise for systems that leveraged distinct strengths rather than induced agreement. Measuring appropriate reliance rather than blind trust would encourage designs that helped users calibrate confidence based on context. Assessing mutual learning would prioritise interactions where both agents improved over time. These evaluation shifts would address the cognitive bias and agency problems directly: systems optimised for complementarity would need to preserve human judgement rather than anchor it, maintain user agency rather than diminish it, and support active reasoning rather than passive acceptance.


Pathways Forward: Designing for Genuine Collaboration

Embrace Continuous, Bidirectional Interaction

Move beyond turn-taking paradigms to support sustained dialogue. This means:

  • Continuous feedback loops: Allow users to provide input at any moment and AI to respond dynamically
  • Mixed-initiative interaction: Both human and AI can initiate exchanges, ask questions, or suggest alternatives
  • Persistent context: Maintain shared understanding across interactions rather than treating each decision atomically
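As a rough sketch of what such a continuous, mixed-initiative loop could look like, consider a single shared event stream that both agents write to. This is purely illustrative; the event types and the trivial AI reaction are my own assumptions, not a design prescribed by the paper.

    import asyncio
    from dataclasses import dataclass

    @dataclass
    class Event:
        source: str    # "user" or "ai"
        kind: str      # e.g. "suggestion", "question", "correction"
        payload: str

    async def collaboration_loop(events: asyncio.Queue, context: list) -> None:
        # One shared event stream: either agent may interject at any moment,
        # and every event lands in a persistent context instead of being
        # treated as an isolated, atomic turn.
        while True:
            event = await events.get()
            context.append(event)
            if event.source == "user":
                # The AI reacts immediately; here it just asks a follow-up
                # question, where a real system would clarify, revise, or suggest.
                await events.put(Event("ai", "question",
                                       f"Can you say more about: {event.payload}?"))
            else:
                # Surface the AI's contribution without blocking further
                # user input, i.e. no rigid turn-taking.
                print(f"[AI {event.kind}] {event.payload}")

The point of the sketch is the shape of the loop rather than the placeholder logic: user and AI events share one channel, and the accumulated context persists across exchanges instead of resetting each turn.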

Design for Complementarity, Not Compliance

Recognise that optimal human-AI teams leverage different strengths rather than having humans rubber-stamp AI decisions:

  • Explicit capability signaling: Systems should communicate what they can and cannot do reliably, helping users understand when to trust AI vs. rely on human judgment
  • Uncertainty-aware interfaces: Show not just predictions but confidence bounds, enabling users to focus effort where AI is uncertain
  • Complementary task allocation: Automatically route straightforward cases to AI, ambiguous cases to human review, and edge cases to collaborative deliberation
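A minimal sketch of the complementary task allocation in the last bullet might route cases on the AI's self-reported confidence. The thresholds and category names below are illustrative assumptions, not values from the review.

    def route_case(ai_confidence: float, is_edge_case: bool,
                   high: float = 0.9, low: float = 0.6) -> str:
        # Route a decision based on the AI's self-reported confidence.
        # In practice the thresholds would be calibrated per domain and
        # validated against appropriate-reliance metrics, not set by hand.
        if is_edge_case:
            return "collaborative deliberation"   # both agents work the case together
        if ai_confidence >= high:
            return "ai handles"                   # straightforward case
        if ai_confidence >= low:
            return "human review"                 # ambiguous: human decides with AI input
        return "human handles"                    # AI abstains; human expertise leads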

Mitigate Rather Than Induce Cognitive Biases

Thoughtful interaction sequencing can reduce rather than amplify bias:

  • Cognitive forcing functions: Request-driven assistance (where users must explicitly ask for AI input) can reduce anchoring by ensuring users form independent assessments first
  • Secondary assistance patterns: Providing AI-generated risk scores or probabilities rather than direct recommendations forces users to interpret and apply information, maintaining engagement
  • Contrastive explanations: When AI disagrees with user assessments, show not just why AI chose its answer but why it didn't choose the user's answer
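As a small illustration of the cognitive forcing function in the first bullet, a request-driven wrapper could refuse to release the AI's prediction until the user has committed an independent assessment. This is a sketch under my own assumptions about the interface, not one of the paper's implementations.

    class RequestDrivenAssistant:
        # Cognitive forcing: the AI prediction is only released after the
        # user has recorded an independent assessment, reducing anchoring.

        def __init__(self, ai_predict):
            self._ai_predict = ai_predict       # callable: case data -> prediction
            self._user_assessments = {}

        def record_user_assessment(self, case_id, assessment):
            self._user_assessments[case_id] = assessment

        def request_ai_input(self, case_id, case_data):
            if case_id not in self._user_assessments:
                raise PermissionError(
                    "Record your own assessment before requesting AI input.")
            return self._ai_predict(case_data)

A side benefit of this ordering is that it produces exactly the paired human-first and AI judgements needed to measure appropriate reliance later.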

Support Multiple Collaboration Modes

Different tasks and contexts require different interaction patterns. Rather than committing to a single approach, design systems that flexibly support multiple modes:

  • Exploratory mode: User-guided adjustments for "what-if" analysis
  • Learning mode: AI-guided dialogue for building user expertise
  • Efficiency mode: AI-first assistance for routine cases
  • Deliberation mode: Secondary assistance for complex, ambiguous situations
  • Delegation mode: Task allocation based on confidence and capability

The taxonomy presented in this paper provides exactly this vocabulary for designing multi-modal systems.
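Reusing the InteractionPattern enum sketched earlier, mode selection could be as simple as a lookup from task context to pattern. The contexts and the mapping below are illustrative assumptions on my part, not recommendations from the review.

    # Assumes the InteractionPattern enum from the earlier sketch is in scope.
    MODE_BY_CONTEXT = {
        "what_if_analysis": InteractionPattern.USER_ADJUSTMENTS,   # exploratory mode
        "novice_user":      InteractionPattern.DIALOGIC,           # learning mode
        "routine_case":     InteractionPattern.AI_FIRST,           # efficiency mode
        "ambiguous_case":   InteractionPattern.SECONDARY,          # deliberation mode
        "capability_known": InteractionPattern.DELEGATION,         # delegation mode
    }

    def select_mode(context: str) -> InteractionPattern:
        # Fall back to request-driven assistance, which keeps the user in
        # control of when AI input appears.
        return MODE_BY_CONTEXT.get(context, InteractionPattern.REQUEST_DRIVEN)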

Evaluate Collaboration, Not Just Accuracy

Shift evaluation metrics from individual performance to team performance:

  • Complementary accuracy: Measure not just final accuracy but whether the human-AI team handles different case types better than either agent alone
  • Appropriate reliance: Track whether users correctly trust AI on cases where it's reliable and correctly override it where it's not (the paper notes several studies beginning to measure "appropriate trust")
  • Mutual learning: Assess whether humans develop better mental models of problems through interaction and whether AI improves from human feedback
  • Agency preservation: Measure whether users feel in control and understand their contributions to decisions
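As a rough sketch of how the first two metrics in this list could be scored, the functions below compute complementary accuracy and a simple version of appropriate reliance over parallel lists of labels. These operationalisations are my own, not definitions taken from the paper.

    def complementary_accuracy(team, human, ai, truth):
        # Compare team decisions against each agent alone (parallel lists of labels).
        def acc(preds):
            return sum(p == t for p, t in zip(preds, truth)) / len(truth)
        return {
            "team": acc(team),
            "human_alone": acc(human),
            "ai_alone": acc(ai),
            # True only if the team beats the best individual agent.
            "complementary": acc(team) > max(acc(human), acc(ai)),
        }

    def appropriate_reliance(final, ai, truth):
        # Fraction of cases where the user followed the AI exactly when it was
        # right and overrode it exactly when it was wrong.
        appropriate = sum((f == a) == (a == t) for f, a, t in zip(final, ai, truth))
        return appropriate / len(truth)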

One particularly promising direction:

Instead of asking "Did the human agree with AI?", ask "Did the human-AI team discover insights neither could reach independently?"

The path forward requires shifting our mental model from AI as a tool to AI as a teammate. Tools are passive instruments we control; teammates are active agents we coordinate with.

Three research priorities emerge:

  1. Develop interaction patterns for continuous collaboration, moving beyond intermittent turn-taking
  2. Study human-AI teaming in ecologically valid contexts with real consequences and sustained engagement
  3. Create evaluation frameworks centered on complementarity rather than compliance

As the researchers conclude: "The taxonomy presented here serves as a valuable resource to inform the design and development of AI-based decision support systems, ultimately fostering more productive, engaging, and user-centered collaborations."

The question isn't whether AI can make accurate predictions—increasingly, it can. The question is whether we can design interactions that make human-AI teams more capable than either agent alone. This systematic review suggests we're not there yet, but it also provides a roadmap for getting there.

The future of AI isn't about building smarter algorithms. It's about designing better partnerships.


References

Gomez, C., Cho, S. M., Ke, S., Huang, C.-M., & Unberath, M. (2024). Human-AI collaboration is not very collaborative yet: A taxonomy of interaction patterns in AI-assisted decision making from a systematic review. arXiv preprint arXiv:2310.19778v3.