Tech research company OpenAI has just released an updated version of its text-generating artificial intelligence program, called GPT-4, and demonstrated some of the language model’s new abilities. Not only can GPT-4 produce more natural-sounding text and solve problems more accurately than its predecessor. It can also process images in addition to text. But the AI is still vulnerable to some of the same problems that plagued earlier GPT models: displaying bias, overstepping the guardrails intended to prevent it from saying offensive or dangerous things and “hallucinating,” or confidently making up falsehoods not found in its training data.
On Twitter, OpenAI CEO Sam Altman described the model as the company’s “most capable and aligned” to date. (“Aligned” means it is designed to follow human ethics.) But “it is still flawed, still limited, and it still seems more impressive on first use than it does after you spend more time with it,” he wrote in the tweet.
Perhaps the most significant change is that GPT-4 is “multimodal,” meaning it works with both text and images. Although it cannot output pictures (as do generative AI models such as DALL-E and Stable Diffusion), it can process and respond to the visual inputs it receives. Annette Vee, an associate professor of English at the University of Pittsburgh who studies the intersection of computation and writing, watched a demonstration in which the new model was told to identify what was funny about a humorous image. Being able to do so means “understanding context in the image. It’s understanding how an image is composed and why and connecting it to social understandings of language,” she says. “ChatGPT wasn’t able to do that.”
A device with the ability to analyze and then describe images could be enormously valuable for people who are visually impaired or blind. For instance, a mobile app called Be My Eyes can describe the objects around a user, helping those with low or no vision interpret their surroundings. The app recently incorporated GPT-4 into a “virtual volunteer” that, according to a statement on OpenAI’s website, “can generate the same level of context and understanding as a human volunteer.”
But GPT-4’s image analysis goes beyond describing the picture. In the same demonstration Vee watched, an OpenAI representative sketched an image of a simple website and fed the drawing to GPT-4. Next the model was asked to write the code required to produce such a website—and it did. “It looked basically like what the image is. It was very, very simple, but it worked pretty well,” says Jonathan May, a research associate professor at the University of Southern California. “So that was cool.”
Even without its multimodal capability, the new program outperforms its predecessors at tasks that require reasoning and problem-solving. OpenAI says it has run both GPT-3.5 and GPT-4 through a variety of tests designed for humans, including a simulation of a lawyer’s bar exam, the SAT and Advanced Placement tests for high schoolers, the GRE for college graduates and even a couple of sommelier exams. GPT-4 achieved human-level scores on many of these benchmarks and consistently outperformed its predecessor, although it did not ace everything: it performed poorly on English language and literature exams, for example. Still, its extensive problem-solving ability could be applied to any number of real-world applications—such as managing a complex schedule, finding errors in a block of code, explaining grammatical nuances to foreign-language learners or identifying security vulnerabilities.
Additionally, OpenAI claims the new model can interpret and output longer blocks of text: more than 25,000 words at once. Although previous models were also used for long-form applications, they often lost track of what they were talking about. And the company touts the new model’s “creativity,” described as its ability to produce different kinds of artistic content in specific styles. In a demonstration comparing how GPT-3.5 and GPT-4 imitated the style of Argentine author Jorge Luis Borges in English translation, Vee noted that the more recent model produced a more accurate attempt. “You have to know enough about the context in order to judge it,” she says. “An undergraduate may not understand why it’s better, but I’m an English professor…. If you understand it from your own knowledge domain, and it’s impressive in your own knowledge domain, then that’s impressive.”
May has also tested the model’s creativity himself. He tried the playful task of ordering it to create a “backronym” (an acronym reached by starting with the abbreviated version and working backward). In this case, May asked for a cute name for his lab that would spell out “CUTE LAB NAME” and that would also accurately describe his field of research. GPT-3.5 failed to generate a relevant label, but GPT-4 succeeded. “It came up with ‘Computational Understanding and Transformation of Expressive Language Analysis, Bridging NLP, Artificial intelligence And Machine Education,’” he says. “‘Machine Education’ is not great; the ‘intelligence’ part means there’s an extra letter in there. But honestly, I’ve seen way worse.” (For context, his lab’s actual name is CUTE LAB NAME, or the Center for Useful Techniques Enhancing Language Applications Based on Natural And Meaningful Evidence). In another test, the model showed the limits of its creativity. When May asked it to write a specific kind of sonnet—he requested a form used by Italian poet Petrarch—the model, unfamiliar with that poetic setup, defaulted to the sonnet form preferred by Shakespeare.
Of course, fixing this particular issue would be relatively simple. GPT-4 merely needs to learn an additional poetic form. In fact, when humans goad the model into failing in this way, this helps the program develop: it can learn from everything that unofficial testers enter into the system. Like its less fluent predecessors, GPT-4 was originally trained on large swaths of data, and this training was then refined by human testers. (GPT stands for generative pretrained transformer.) But OpenAI has been secretive about just how it made GPT-4 better than GPT-3.5, the model that powers the company’s popular ChatGPT chatbot. According to the paper published alongside the release of the new model, “Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.” OpenAI’s lack of transparency reflects this newly competitive generative AI environment, where GPT-4 must vie with programs such as Google’s Bard and Meta’s LLaMA. The paper does go on to suggest, however, that the company plans to eventually share such details with third parties “who can advise us on how to weigh the competitive and safety considerations … against the scientific value of further transparency.”
Those safety considerations are important because smarter chatbots have the ability to cause harm: without guardrails, they might provide a terrorist with instructions on how to build a bomb, churn out threatening messages for a harassment campaign or supply misinformation to a foreign agent attempting to sway an election. Although OpenAI has placed limits on what its GPT models are allowed to say in order to avoid such scenarios, determined testers have found ways around them. “These things are like bulls in a china shop—they’re powerful, but they’re reckless,” scientist and author Gary Marcus told Scientific American shortly before GPT-4’s release. “I don’t think [version] four is going to change that.”
And the more humanlike these bots become, the better they are at fooling people into thinking there is a sentient agent behind the computer screen. “Because it mimics [human reasoning] so well, through language, we believe that—but underneath the hood, it’s not reasoning in any way similar to the way that humans do,” Vee cautions. If this illusion fools people into believing an AI agent is performing humanlike reasoning, they may trust its answers more readily. This is a significant problem because there is still no guarantee that those responses are accurate. “Just because these models say anything, that doesn’t mean that what they’re saying is [true],” May says. “There isn’t a database of answers that these models are pulling from.” Instead, systems like GPT-4 generate an answer one word at a time, with the most plausible next word informed by their training data—and that training data can become outdated. “I believe GPT-4 doesn’t even know that it’s GPT-4,” he says. “I asked it, and it said, ‘No, no, there’s no such thing as GPT-4. I’m GPT-3.’”
Now that the model has been released, many researchers and AI enthusiasts have an opportunity to probe GPT-4’s strengths and weaknesses. Developers who want to use it in other applications can apply for access, and anyone who wants to “talk” with the program will have to subscribe to ChatGPT Plus. For $20 per month, this paid program lets users choose between talking with a chatbot that runs on GPT-3.5 and one that runs on GPT-4.
Such explorations will undoubtedly uncover more potential applications—and flaws—in GPT-4. “The real question should be ‘How are people going to feel about it two months from now, after the initial shock?’” Marcus says. “Part of my advice is: let’s temper our initial enthusiasm by realizing we have seen this movie before. It’s always easy to make a demo of something; making it into a real product is hard. And if it still has these problems—around hallucination, not really understanding the physical world, the medical world, etcetera—that’s still going to limit its utility somewhat. And it’s still going to mean you have to pay careful attention to how it’s used and what it’s used for.”