OpenAI o1's Arrival Is Pivotal
It represents a paradigm shift. Plus, our next webinar is on using LLMs as a prof.
[image created with DALL-E 3 via ChatGPT Plus]
Welcome to AutomatED: the newsletter on how to teach better with tech.
Each week, I share what I have learned — and am learning — about AI and tech in the university classroom. What works, what doesn't, and why.
In this week’s piece, I discuss the biggest news of the fall, this time from OpenAI: the release of a new large language model (LLM), o1. Its name is terrible but it can solve complex problems that require sequential reasoning better than PhDs in fields like physics and chemistry — and it succeeds by “pausing to think.” I outline three upshots for educators.
I also open pre-registration for my October 4th webinar on “How to Use LLMs as a Professor.”
⬆️ The New Paradigm ⬆️
This past Thursday, OpenAI announced a new series of models called “o1.”
The first model in this series, “o1-preview,” is now available in ChatGPT to Plus and Team users, with Edu and Enterprise users getting access this coming week. Its smaller, faster cousin, “o1-mini,” performs comparably on “STEM reasoning” tasks and will be available to free users soon.
While the series name is unfortunate — like most LLM names, let’s be honest — these models represent a significant step forward in the history of AI.
I repeat: this is a big moment.
If you haven’t been paying attention to each new model as it comes out because you think its gains over its predecessors are slight, fair enough!
But this time is definitely different. Read on for why…
o1 models are built on the prior model from OpenAI, 4o, but they are trained via reinforcement learning to spend more time thinking through complex problems before responding, similar to human reasoning.
That is, OpenAI reports that the focus of o1’s development was on how 4o applies its powers, not on what it can do irrespective of how well those powers are applied.
When you ask o1 models to solve a problem, they “pause to think” — in the background, they produce strings of text that express how an intelligent person would think through approaches to solving the problem, comparing a range of avenues, before pursuing the top options and retracing their steps when they hit roadblocks, until they arrive at their best answer. These strings of text act as guardrails or crutches, making it much more likely that their best answer is truly good.
More specifically, they use “chain of thought” methods that prompt engineers — including yours truly (see here and here, as well as our Insights Series for new subscribers) — have recommended to get prior models to produce better outputs.
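For the uninitiated, here is a minimal sketch of the kind of chain-of-thought prompt that prompt engineers have recommended for prior models (Python, using the OpenAI SDK; the scheduling scenario and wording are my illustration, not a prescription). The point of o1 is that it generates this sort of step-by-step reasoning on its own, so you no longer have to ask for it:

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# A chain-of-thought prompt for a prior model like 4o: we explicitly ask the
# model to lay out its reasoning before answering. o1 produces something like
# this on its own, in hidden "reasoning tokens," without being told to.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": (
                "Three professors must each teach two courses across four time "
                "slots, and no professor can teach twice in the same slot. "
                "Think step by step: list the constraints, compare possible "
                "schedules, and only then give your final schedule."
            ),
        }
    ],
)
print(response.choices[0].message.content)
```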
OpenAI explains that they have decided to hide the raw outputs of these methods but you can still get a summary of them by pressing the dropdown next to the time-of-thought counter:
An example from a department course scheduling use case of o1 that I am working on.
For those like myself counting (read: paying via API), the raw outputs of o1’s reasoning count as part of the context window (they are now called “reasoning tokens”), but they are discarded after the output is produced — not unlike a human who arrives at a conclusion, treats it as a given going forward, and forgets how they arrived at it. (More on these weeds here.)
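If you want to see how many of your billed tokens went to this hidden reasoning, the API reports it in the response's usage object. A minimal sketch, assuming the `completion_tokens_details.reasoning_tokens` field OpenAI documented at o1's launch (treat the exact field names as subject to change):

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

response = client.chat.completions.create(
    model="o1-preview",
    messages=[
        {"role": "user", "content": "Prove that the square root of 2 is irrational."}
    ],
)

usage = response.usage
reasoning = usage.completion_tokens_details.reasoning_tokens
# Reasoning tokens are billed as output tokens, but they never appear in the reply.
print("visible output tokens:", usage.completion_tokens - reasoning)
print("hidden reasoning tokens:", reasoning)
```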
To illustrate the improvements, OpenAI compares o1 to 4o via a suite of examples from ciphers to coding, English, and chemistry. You should check them out for yourself here. You can also watch videos of o1 helping a geneticist here, an autonomous coding developer here, an economist here, and a quantum physicist here. Terence Tao, the Fields Medalist mathematician, also offers his thoughts here (screencapped).
This brings me to the import of this paradigm shift for educators. There are at least three clear consequences that professors should be aware of.
Edu Upshot 1: Fading Content-Based AI-Immunity
As a result of the (big) tweak in how they are trained, o1 models perform radically better on a wide range of benchmarks than their predecessors, including very advanced and complex tasks. And they are better than PhDs or PhD students at some of these tasks.
The more sensitive a benchmark is to reasoning ability, the bigger the improvement over 4o, including from the perspective of human judges:
OpenAI, “Learning to Reason with LLMs”
Of course, the natural rejoinder at this point is that performing better than 4o at tasks like computer programming, data analysis, and mathematical calculations isn’t the same as outperforming PhDs.
Don’t hold your breath.
Here’s OpenAI summarizing their benchmark results:
In our tests, the next model update performs similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology. We also found that it excels in math and coding. In a qualifying exam for the International Mathematics Olympiad (IMO), GPT-4o correctly solved only 13% of problems, while the reasoning model scored 83%. Their coding abilities were evaluated in contests and reached the 89th percentile in Codeforces competitions.
And here are some of the graphics OpenAI provided, displaying the size of the improvements:
OpenAI, “Learning to Reason with LLMs”
I personally can verify the gains in formal logic, as o1 can reliably complete any of the formal logic problems I have assigned my undergraduates, while 4o and prior models would sometimes take wrong turns.
(As a sidenote, I have also found it to be radically better at helping me develop specific plans for complex projects, like a project I am working on to use AI to help departments schedule their courses to maximize the preferences of their faculty.)
The gains continue when it comes to AIME 2024 questions, as well as Codeforces code competition questions. Indeed, on the PhD-level science questions (GPQA Diamond), o1 outperforms the PhDs that OpenAI gathered to compare to it:
OpenAI, “Learning to Reason with LLMs”
What this means is that using advanced content to stump LLMs (and thus stymie students who aim to misuse them to complete assignments without achieving the learning objectives) has become markedly less viable, at least for reasoning-heavy fields.
When I first released our ✨Premium Guide on discouraging and preventing AI misuse, content-based immunity was one of the viable methods that I discussed. I advocated for it because of our “AI-immunity Challenge,” where we tested professors’ take-home assignments for AI-immunity and found content-based immunity increased as complexity, depth, and field-specific nuances increased.
When I updated this Guide last month, I moderated some of my recommendations, but it looks like they will require further adjustments.
Professors should experiment with o1 to test any assignments that they had previously deemed AI-immune with respect to content.
Likewise, professors should consider turning to the other methods we have discussed.
Edu Upshot 2: AI Training Gains New Twist
With the self-evident power of the o1 models on global display, we should expect there to be a bonanza of improvements from Google and Anthropic, not to mention OpenAI themselves.
The natural thought is that AI training gains new import for educators, at least for those fields affected by o1-type improvements. As many of the AI optimists expected, there are ever fewer domains where AI ignorance is productive or wise.
This is a reasonable take, but there is a wrinkle.
The gains realized by o1 models are owed primarily to their built-in use of what had been one of the most powerful prompting techniques, without the user having to prompt for it.
So, the next question to ask is:
“Do you need to know how to prompt them?”
Some experts think that the answer is “yes” and prompt engineering will become more and more a matter of being able to understand and articulate what you’re “trying to build.” At least, this is how Michelle Pokrass of OpenAI’s API team described it on the Latent Space podcast (transcript here):
Alessio Fanelli [00:43:59]: Will prompt engineering be here forever? Or is it a dying art as the models get better?
Michelle Pokrass [00:44:04]: I mean, it's like the perennial question of software engineering as well. It's like, as the models get better at coding, you know, if we hit 100 on Swebench, what does that mean? I think there will always be alpha in people who are able to clearly explain what they're trying to build. Most of engineering is like figuring out the requirements and stating what you're trying to do. And I believe this will be the case with AI as well. You're going to have to very clearly explain what you need, and some people are better than others at it. And people will always be building, it's just the tools are going to get far better.
It remains to be seen whether this sort of ability will require you to speak the “language” of LLMs (as has been the case to date, to some degree), require you to understand the fundamentals of the relevant content or use case (as I have advocated repeatedly; e.g. here and here), or require you to be excellent at managing something that “understands” without understanding yourself (a dynamic I discussed back in April).
So, my view is that, so far, the jury is still out.
Now, I should say that OpenAI provides some “advice on prompting” o1, included below:
However, those of us who have spent (too much) time working with custom GPTs and the LLMs underlying them have found that OpenAI’s advice isn’t generally very reliable, especially as use cases get more complicated.
Perhaps this latest advice is good. Perhaps not. It remains to be seen.
If it is good advice, it sounds like there are still important rules of thumb, like using formatting techniques that were previously recommended (I recommend them in our ✨Premium custom GPT Tutorial) and being careful to control how one’s prompt interfaces with the information accessed via RAG (also covered in the aforementioned Tutorial).
And surely there are more best practices beyond the four (!!!) OpenAI lists.
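To make those rules of thumb concrete, here is a minimal sketch of a delimiter-formatted prompt that passes only the retrieved passages relevant to the question, rather than everything a RAG pipeline returns. The passages, delimiters, and question are my own illustration, not OpenAI's examples:

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

# In practice these would come from your RAG pipeline, already filtered
# down to the passages most relevant to the question.
retrieved_passages = [
    "Syllabus 4.2: late work loses 10% per calendar day.",
    "Syllabus 4.3: extensions require documentation from the dean's office.",
]

prompt = (
    "Answer the student's question using only the context below.\n\n"
    "### Context ###\n"
    + "\n".join(retrieved_passages)
    + "\n\n### Question ###\n"
    "If I submit my essay two days late without documentation, what is the penalty?"
)

response = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```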
Furthermore, as noted, it isn’t clear if o1 will be better at tasks that don’t centrally involve “reasoning” as OpenAI characterizes it. For instance, per one of OpenAI’s benchmarks, apparently o1 is worse than 4o at “personal writing” while it is comparable at “editing text” and the AP English Language exam — and it isn’t clear if this is due to prompting or what.
Those professors who are training their students to use AI need to experiment heavily with o1, read widely from a range of practitioners, and continue to learn about what this means for their fields. It’s likely that students will need to use a range of models for the range of tasks they need to complete, each with its own prompt engineering needs.
Edu Upshot 3: o1 Models Have Crucial Limitations
Finally, and relatedly, it is important to note that o1 models are not as good as other models at many tasks. Here’s OpenAI on the topic:
As an early model, [o1] doesn't yet have many of the features that make ChatGPT useful, like browsing the web for information and uploading files and images. For many common cases GPT-4o will be more capable in the near term.
But for complex reasoning tasks this is a significant advancement and represents a new level of AI capability. Given this, we are resetting the counter back to 1 and naming this series OpenAI o1.
Likewise, there are “long context” environments where o1 cannot be used at all, not only because it has a 128k-token context window but also because its reasoning tokens count against that window, at least until it discards them. 128k tokens is roughly 96,000 words (at about 0.75 words per token), which sounds like a lot but isn’t once an incredibly complex train of thought is included. Matters will get worse once file uploading is allowed and the context window is crammed with information-rich documents, images, and videos.
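If you want a rough sense of how much of that window your own materials would consume before any reasoning tokens are added, you can count tokens locally with the tiktoken library. A sketch, assuming the o200k_base encoding that 4o uses (o1's tokenizer may differ) and a hypothetical file name:

```python
import tiktoken

encoding = tiktoken.get_encoding("o200k_base")  # assumption: same encoding family as 4o

with open("grant_draft.txt", encoding="utf-8") as f:  # hypothetical document
    text = f.read()

tokens = len(encoding.encode(text))
# Compare against a 128k-token context window, before any reasoning tokens.
print(f"{tokens} tokens, roughly {tokens / 128_000:.0%} of a 128k-token window")
```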
At the moment, then, Gemini 1.5 Pro and Claude 3.5 Sonnet are better for long context tasks, like drafting grant applications, analyzing swathes of literature, writing complex new content to match similar old content, and so on. Indeed, o1 makes ChatGPT no better at style, as discussed above.
o1 is not the be-all and end-all.
For now, we need to treat the AI space like a Swiss Army knife, using each model for the task it’s best suited for, while recognizing its limitations. The same goes for our students.
⏳ Our Next Webinar:
“How to Use LLMs as a Professor”
If you want to get better at using AI for…
lesson planning
creating course content (e.g. quizzes from lecture recordings)
grading and assessment
research tasks (e.g. drafting grant applications or reviewing literature)
administrative tasks
field-specific tasks
Then the next AutomatED webinar is perfect for you.
I will cover how to use large language models (LLMs) to complete a range of professorial tasks like those listed above, including effective ways to prompt long context and use retrieval augmented generation (RAG).
I will cover ChatGPT (including 4o, the new o1, and custom GPTs), Claude, Gemini 1.5 Pro, Copilot for Microsoft 365, and Gemini for Workspace.
And I will show you how to better integrate these LLMs in your workflow, so they fit neatly into how you operate on a day-to-day basis.
This is a rare window into a service I provide to professors in my one-on-one consultations.
The last AutomatED webinar, which occurred on September 6th and covered how to train your students to use AI, had 24 registrants from SUNY Fredonia to UNC-Chapel Hill and Butte-Glenn Community College.
100% of responding participants gave it an A (“Excellent!”) afterwards.
Here’s how one participant described it:
“Clearly presented, with acknowledgment that we are all learning as we go. Ample time for questions and, most significantly, thoughtful answers. Thanks!”
Another simply said “more please.”
Wish granted!
The next AutomatED webinar — “How to Use LLMs as a Professor” — will occur on Friday, October 4th from 12pm to 1:30pm Eastern Daylight Time.
Pre-registration is open now here, with a 50%-off discount available for those who sign up now (and a money-back-for-any-reason guarantee):
✨Premium subscribers can pre-register with an additional 10% discount (-$15) here:
[Link visible to Premium subscribers only]
More information can be found on our website here, where I will add a detailed webinar schedule later this week.
📬 From Our Partners:
A Newsletter That’s Just News
Receive Honest News Today
Join over 4 million Americans who start their day with 1440 – your daily digest for unbiased, fact-centric news. From politics to sports, we cover it all by analyzing over 100 sources. Our concise, 5-minute read lands in your inbox each morning at no cost. Experience news without the noise; let 1440 help you make up your own mind. Sign up now and invite your friends and family to be part of the informed.
September - Tutorial on Using AI with Canvas
September - Tutorial on All Major Functionalities of Microsoft 365 Copilot
August 21, 28 - Updated Guides on AI Assignment Design, on Discouraging AI Misuse, and on Syllabus AI Policies
What'd you think of today's newsletter?
Graham
Let's transform learning together. If you would like to consult with me, want me to develop an AI workflow to save you time, or have me present to your team, let's meet! Feel free to connect on LinkedIn, too!