On Thursday, OpenAI launched its new o1 models, offering ChatGPT users their first opportunity to test AI models that pause to “think” before responding. There has been a lot of excitement surrounding these models, dubbed “Strawberry” within OpenAI. But does Strawberry live up to its hype?

Sort of.
Compared to GPT-4o, the o1 models feel like one step forward and two steps back. ChatGPT o1 excels at reasoning through complex questions, but it costs roughly four times as much to use as GPT-4o. OpenAI’s latest model also lacks the tools, multimodal capabilities, and speed that made GPT-4o so compelling. In fact, OpenAI admits that “GPT-4o is still the best option for most prompts” on its support page, and notes elsewhere that o1 struggles with simpler tasks.

“It’s impressive, but I don’t think the improvement is that significant,” said Ravid Shwartz-Ziv, an NYU professor who studies AI models. “It’s better at certain problems, but you don’t have this across-the-board improvement.”

For all of these reasons, o1 is best reserved for the big questions. To be clear, most people don’t use generative AI for those kinds of questions today, largely because current AI models aren’t good enough. But o1 is a tentative step in that direction.

Thinking through big ideas
ChatGPT o1 is unique in that it “thinks” before answering, breaking large problems down into small steps and trying to work out when each of those steps is right or wrong. This “multi-step reasoning” isn’t entirely new (researchers have proposed it for years, and You.com uses it for difficult queries), but it has only recently become practical.
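For developers, the notable shift is that this step-by-step reasoning happens on OpenAI’s servers rather than in the prompt. Below is a minimal sketch of what a request might look like through OpenAI’s Python SDK; the model name “o1-preview” and the user-messages-only restriction reflect the launch-era API, so treat the specifics as assumptions and check the current documentation.

```python
# Minimal sketch: calling a reasoning model via OpenAI's Python SDK.
# Assumes the `openai` package is installed and OPENAI_API_KEY is set;
# "o1-preview" was the launch-era model name and may have changed.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o1-preview",
    # At launch, o1 models accepted only user messages (no system role).
    # No "think step by step" prompting is needed: the multi-step
    # reasoning happens server-side, before the visible answer comes back.
    messages=[
        {"role": "user", "content": "Plan a family Thanksgiving dinner using two ovens."}
    ],
)

print(response.choices[0].message.content)
```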

“There’s a lot of excitement in the AI community,” Kian Katanforoosh, Workera’s CEO and a Stanford professor who teaches machine learning classes, said in an interview. “If you can train a reinforcement learning algorithm paired with some of the language model techniques that OpenAI has, you can technically create step-by-step thinking and allow the AI model to walk backwards from big ideas you’re trying to work through.”

ChatGPT o1 is also unusually expensive. In most models, you pay for input and output tokens. But o1 adds a hidden process (those small steps the model breaks big problems into) that consumes a significant amount of compute you never see. OpenAI hides some details of this process to protect its competitive advantage, yet you’re still charged for it in the form of “reasoning tokens,” billed at the output rate. This underscores why you should be careful about when you use ChatGPT o1, lest you burn through a pile of tokens asking what the capital of Nevada is.
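To make the pricing concrete, here’s a hedged back-of-the-envelope calculation. The per-token prices below are o1-preview’s launch list prices ($15 per million input tokens, $60 per million output tokens) and are assumptions for illustration; the essential point is that the hidden reasoning tokens are billed at the output rate.

```python
# Back-of-the-envelope cost estimate for an o1-style request.
# Prices are assumed launch list prices for o1-preview; check
# OpenAI's pricing page for current numbers.
INPUT_PRICE = 15.00 / 1_000_000   # USD per input token
OUTPUT_PRICE = 60.00 / 1_000_000  # USD per output token

def o1_request_cost(input_tokens: int, visible_output_tokens: int,
                    reasoning_tokens: int) -> float:
    """Reasoning tokens never appear in the response, but bill as output."""
    billed_output = visible_output_tokens + reasoning_tokens
    return input_tokens * INPUT_PRICE + billed_output * OUTPUT_PRICE

# Even a tiny question gets pricey if the model "thinks" at length:
print(f"${o1_request_cost(20, 50, 3000):.4f}")  # ~$0.18, mostly reasoning
```

In other words, the visible answer can be a small fraction of what you actually pay for.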

Still, the idea of an AI model that lets you “walk backwards from big ideas” is powerful. In practice, the model is fairly good at it.

To test it out, I asked ChatGPT o1 to help my family plan Thanksgiving dinner, a task that could benefit from a bit of unbiased logic. After 12 seconds of “thinking,” the model gave me a 750+ word response, ultimately concluding that two ovens should suffice with some careful planning, allowing my family to save money and spend more time together. It also spelled out its reasoning at every step, explaining how it had accounted for external factors such as costs, family time, and oven management.

ChatGPT o1 wisely advised me to prioritize oven space at the house hosting the gathering. Oddly, it also suggested I rent a portable oven for the day. Even so, the model outperformed GPT-4o, which needed several follow-up questions about exactly which dishes I was bringing before offering bare-bones advice I found less useful.

Asking about Thanksgiving dinner may be frivolous, but it shows how this tool could be useful for breaking down complicated tasks.

I also asked ChatGPT o1 to help me plan a busy workday that involved travel to and from the airport, several in-person meetings in different locations, and time at my office. It gave me a very detailed plan, though perhaps a bit too detailed; all the added steps can feel overwhelming.

To put it simply, ChatGPT o1 overthinks. When I asked where cedar trees can be found in America, it delivered an 800-word response covering every variety of cedar in the country, scientific names included. At one point it even had to consult OpenAI’s policies, for some reason. GPT-4o handled the question far better, explaining in a few sentences that the trees can be found all over the country.

Adjusting expectations
In some ways, Strawberry was never going to live up to expectations. Reports about OpenAI’s reasoning models date back to November 2023, right around the time everyone was wondering why OpenAI’s board fired Sam Altman. That stirred speculation in the AI community that Strawberry was a form of AGI, the enlightened version of AI that OpenAI ultimately aims to build.

To clear up any confusion, Altman has confirmed that o1 is not AGI, in case you were wondering after using it. The CEO also tempered expectations for the launch, stating that “o1 is still flawed, still limited, and it still seems more impressive on first use than it does after you spend more time with it.”

The rest of the AI community is coming to terms with a less thrilling debut than anticipated.

“The hype sort of grew out of OpenAI’s control,” said Rohan Pandey, a research engineer at ReWorkd, an AI startup that creates web scrapers using OpenAI models.

He hopes o1’s reasoning ability is good enough to unlock a specific set of complex problems where GPT-4 falls short. That seems to be how most of the industry views ChatGPT o1: not as the major leap forward GPT-4 represented, but as a tool for a narrower class of hard problems.

“Everyone is expecting a step function shift in capabilities, and it’s unclear whether this represents that,” Brightwave CEO Mike Conover, who previously co-created Databricks’ AI model Dolly, said in an interview. To him, it’s that simple.

What is the value here?

The core ideas behind o1 date back years. Google used similar techniques in 2016 to create AlphaGo, the first AI system to defeat a world champion at the board game Go, notes Andy Harrison, a former Googler and CEO of the investment firm S32. AlphaGo trained by playing against itself repeatedly, essentially teaching itself until it reached superhuman ability.

He says this raises an age-old debate in the AI community.

“Camp one believes that you can automate workflows using this agentic technique,” Harrison said in an interview. “Camp two believes that if you had generalized intelligence and reasoning, you wouldn’t need the workflow and, like a human, the AI would simply make a judgment.”

Harrison puts himself in camp one, whereas camp two requires you to trust the AI to make the right decision. He doesn’t think we’re there yet.

Others, however, see o1 as less of a decision-maker and more of a tool for challenging your thinking on major issues.

Katanforoosh, Workera’s CEO, gave an example: say he’s about to interview a data scientist for a job at the company. He tells ChatGPT o1 that he only has 30 minutes and needs to assess a certain set of skills. He can then work backwards with the model to check whether his approach makes sense, and ChatGPT o1 will account for constraints like the time limit.

The question is whether this useful tool is worth its steep price. At a time when AI models keep getting cheaper, o1 is one of the first models in a long while to get more expensive.
