From general AI platforms like ChatGPT to healthcare diagnostic chatbots, the field of human-computer interaction (HCI) has made enormous strides in replicating elements of human behavior to enhance the user experience. On Sept. 30, Michael Bernstein, an associate professor of computer science at Stanford University, gave a talk at the Center for Language and Speech Processing about current efforts to improve behavioral simulations. His talk, titled “Generative Agents: Interactive Simulacra of Human Behavior,” spotlighted Smallville, a simulated town his team built.
Behavioral simulation has been around for decades. Rooted in Thomas Schelling’s famous “model of segregation” and in agent-based modeling, these simulations use simple, easily quantifiable mathematical models to imitate human behavior. Recently, such models have been used to ask hypothetical “what-ifs,” helping researchers understand how different combinations of policies might have changed past events, such as the course of the pandemic. Behavioral simulations are increasingly prevalent in daily life as well. From educational bots that simulate conflict resolution in business classes to video games such as The Sims, it has become far more important for new AI systems to understand our social norms and integrate smoothly into our social environments.
“Every system and model makes assumptions about how design will shape behavior implicitly or explicitly. Consequently, it becomes necessary to consider moving beyond user studies and explore the second-order and third-order effects of these systems,” Bernstein explained.
However, this is a goal that is more easily articulated than achieved.
“The problem is… human behavior is complex, it’s contingent, it depends on things. But today’s models are rigid,” Bernstein explained during the talk.
Large language models (LLMs) such as ChatGPT and character.ai have opened new avenues for behavioral simulation. These models are trained on high volumes of data and are exposed to a wide range of human behaviors, so they can be prompted to take on various backgrounds, experiences and traits.
Smallville, which Bernstein and colleagues developed, is a simulated “town” that hosts 25 generative agents, each initialized with a custom-made sprite and a brief paragraph describing its identity and relationships with the other agents. The agents “behave” by first describing their actions in words, which the system then translates into concrete movements in the game world.
Researchers can interact with the agents in three ways: through dialogue, which prompts agents to recall past experiences from memory; through an “inner voice,” which lets researchers suggest actions to agents; and by manipulating the game world to see how agents respond to a change in their situation. One example Bernstein gave was of an agent throwing a Valentine’s Day party at which a “crush” developed between two other agents who attended.
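A rough sketch of how such an agent might be set up and nudged is shown below. The class, fields and the `call_llm` stub are purely illustrative assumptions, not the authors’ actual code.

```python
from dataclasses import dataclass, field

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (an API request in a real system)."""
    return f"[LLM response to: {prompt[:40]}...]"

@dataclass
class Agent:
    """Illustrative stand-in for a Smallville-style generative agent."""
    name: str
    identity: str                                   # brief natural-language biography
    memories: list[str] = field(default_factory=list)

    def observe(self, event: str) -> None:
        """Record an observation in the agent's memory stream."""
        self.memories.append(event)

    def converse(self, speaker: str, message: str) -> str:
        """Dialogue: the agent replies by drawing on its recent memories."""
        prompt = (f"{self.identity}\nRecent memories: {self.memories[-5:]}\n"
                  f"{speaker} says: {message}\nRespond in character:")
        return call_llm(prompt)

    def inner_voice(self, suggestion: str) -> None:
        """'Inner voice': a researcher plants a thought the agent treats as its own."""
        self.observe(f"{self.name} thought: {suggestion}")

isabella = Agent("Isabella", "Isabella runs the cafe and loves hosting events.")
isabella.inner_voice("I should throw a Valentine's Day party at the cafe.")
print(isabella.converse("Maria", "Any plans for this weekend?"))
```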
“There is believable, humanlike behavior that can be elicited through these kinds of models,” Bernstein remarked.
Bernstein further explained how the researchers integrated higher-level memory recall and reflection into the simulation.
“If you just give agents raw episodic memory, they’re not going to make decisions,” Bernstein said, emphasizing one of the main challenges in replicating human behavior. A standard agent memory stream stores textual records of everything the agent observes, and memories are retrieved based on recency, importance and relevance. To build on this memory, the researchers periodically retrieve small subsets of an agent’s memories and prompt it to reflect on them, letting agents step back and have “shower thoughts” that feed higher-level reflections back into the memory stream.
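Retrieval combines those three signals into a single score. The sketch below is a simplified illustration of that idea, with a made-up memory format, weights and decay rate rather than the values used in the actual system.

```python
import math

def retrieve(memories, query_embedding, now, k=5):
    """Score each memory by recency, importance and relevance; return the top k.

    Each memory is a dict with 'text', 'embedding', 'importance' (0-1) and
    'last_accessed' (a Unix timestamp). All constants here are illustrative.
    """
    def recency(m):
        hours = (now - m["last_accessed"]) / 3600
        return 0.995 ** hours                       # exponential decay over time

    def relevance(m):
        # cosine similarity between the query and the memory embedding
        dot = sum(a * b for a, b in zip(query_embedding, m["embedding"]))
        norm = (math.sqrt(sum(a * a for a in query_embedding))
                * math.sqrt(sum(b * b for b in m["embedding"])))
        return dot / norm if norm else 0.0

    scored = [(recency(m) + m["importance"] + relevance(m), m) for m in memories]
    return [m for _, m in sorted(scored, key=lambda s: s[0], reverse=True)[:k]]
```

Reflection then works on top of this: the highest-scoring recent memories are fed back through the model with a prompt asking for higher-level insights, and the answers are stored as new memories.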
The simulation architecture was evaluated across several ablation conditions: self-knowledge, memory, planning, reactions and reflections. Human evaluators ranked the responses produced under each condition, and those rankings were converted into TrueSkill ratings. A separate group of people was asked to role-play the agents and respond to the same situations, and their responses were rated the same way. Whether the architecture “worked” was judged by the percentage of generative-agent responses that matched the human responses.
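Converting rankings into TrueSkill ratings might look roughly like the sketch below, which uses the open-source `trueskill` Python package; the condition names and the example ranking are illustrative, not the study’s actual evaluation pipeline.

```python
import trueskill  # pip install trueskill

# One rating per ablation condition (labels are illustrative)
conditions = ["full architecture", "no reflection", "no planning",
              "no memory", "human role-player"]
ratings = {c: trueskill.Rating() for c in conditions}

def record_ranking(ranked_conditions):
    """Update TrueSkill ratings from one evaluator's ranking (best first)."""
    groups = [(ratings[c],) for c in ranked_conditions]
    updated = trueskill.rate(groups, ranks=list(range(len(ranked_conditions))))
    for c, (new_rating,) in zip(ranked_conditions, updated):
        ratings[c] = new_rating

# Example: one evaluator judged the full architecture most believable
record_ranking(["full architecture", "human role-player", "no reflection",
                "no planning", "no memory"])
print({c: round(r.mu, 1) for c, r in ratings.items()})
```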
Results from the study showed that the full architecture performed best; removing any key component, such as reflection or planning, degraded overall performance, as predicted. The agents still made errors, however: they would sometimes fail to retrieve relevant memories, or would embellish, adding nonfactual details that were never in their memory.
Bernstein also discussed one of the key risks of these interactive simulations: they can lead users to form parasocial relationships with generative agents. To combat this, Bernstein argued that computational agents should explicitly disclose their nature. Generative AI carries further risks as well, including deepfakes, misinformation and tailored persuasion, which Bernstein considers one of the biggest. He emphasized throughout the talk that AI should not be used as a substitute for human involvement.
“Like any method, there are things [that generative AI] is good at and bad at…you really should be engaging with stakeholders and users and communities and use [these models] to ask questions that could not have been asked otherwise,” he said.
Bernstein went on to discuss how accurately the model replicates known behavioral science findings. He described a test in which humans and human-simulating agents completed the same two-hour interview script from the American Voices Project. To evaluate agent behavior, the researchers measured the percentage of survey responses for which an agent’s answer matched the corresponding person’s actual answer, adjusting for self-inconsistency, since people don’t always answer the same question the same way twice. After also testing on experiments replicated from five behavioral studies and five economic games, the full generative agent model replicated human participants’ responses approximately 85% of the time.
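The adjustment for self-inconsistency is simple arithmetic: divide how often the agent matches the person by how often the person matches themselves when re-asked. The numbers below are invented purely to illustrate the calculation.

```python
# Illustrative numbers only: normalize agent accuracy by how consistently
# the humans themselves answer when asked the same questions twice.
agent_matches_human = 0.72      # fraction of answers where agent == participant
human_self_consistency = 0.85   # fraction where a participant matches themselves

normalized_accuracy = agent_matches_human / human_self_consistency
print(f"{normalized_accuracy:.0%}")  # ~85%: the agent is about as predictable as the person
```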
Bernstein then concluded by proposing a new model of behavior for generative agents, known as jury learning, in which generative AI simulates the attitudes of the individuals best suited to make a particular decision and aggregates their judgments into a final decision.
“We change the metaphor…instead of just saying ‘what’s the right answer?’ we say ‘pick a jury of the kinds of people who ought to be empowered to determine the behavior of this classifier,’” Bernstein clarified near the end of his talk.
Traditional classifiers that attempt to predict human behavior, such as AI toxicity detectors, are typically neural networks that classify messages in a binary manner (for example, as “toxic” vs. “not toxic”). Such binary classification flattens much of the complexity of human communication, and classifiers built this way also fail to give extra weight to the groups most adversely affected by a particular “toxic” message. Jury learning could thus add procedural legitimacy to how AI makes decisions and help ensure those decisions represent the group most affected by them.
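As a rough illustration of the aggregation idea (not Bernstein’s actual implementation: the juror model, labels and jury composition below are all invented), a jury-learning-style classifier predicts each selected individual’s judgment and then combines those judgments into a decision.

```python
from collections import Counter
from dataclasses import dataclass

def predict_label(description: str, message: str) -> str:
    """Placeholder for a model that predicts one individual's judgment."""
    return "toxic" if "idiot" in message.lower() else "not toxic"

@dataclass
class Juror:
    """A simulated individual whose judgment the classifier should reflect."""
    description: str            # e.g., demographic and experiential background

    def judge(self, message: str) -> str:
        return predict_label(self.description, message)

def jury_decision(jurors, message):
    """Aggregate predicted individual judgments into one decision (majority vote here)."""
    votes = Counter(j.judge(message) for j in jurors)
    return votes.most_common(1)[0][0]

# Composing the jury is the key design decision: it can be weighted toward the
# people most affected by the outcome (this composition is purely illustrative).
jury = [Juror("moderator from the targeted community"),
        Juror("frequent recipient of online harassment"),
        Juror("general platform user")]
print(jury_decision(jury, "You're an idiot."))
```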