AI-Immunity Challenge: Lessons from a Clinical Research Exam

What we learned from using AI to try to crack an exam built around interrelated questions, verboten content, and field-specific standards.

[image created with Dall-E 2]

Welcome to AutomatED: the newsletter on how to teach better with tech.

Each week, I share what I have learned — and am learning — about AI and tech in the university classroom. What works, what doesn't, and why.

Let’s take a look at the first assignment we tested for our AI-immunity challenge.

With the free and publicly available version of ChatGPT, we tried to crack this real Clinical Research exam in under an hour. It was a challenge because of its interrelated questions — and the verboten and unique content that they concerned — as well as the field-specific standards we had to meet.

Nonetheless, we did just well enough to earn a passing grade from the professor.

We would have done better had we used an improved version of ChatGPT and prompted it more carefully about how to meet the standards of Clinical Research. In fact, when we did just that, we earned a B+/A-.

There are a lot of takeaways here for professors in a range of fields — takeaways that I summarize at the end if you want to skip ahead. Read on for more details…

🛡️ The AI-Immunity of Assignment #1

The assignment is a written take-home mid-term exam for a Clinical Research course linked to the Master of Advanced Study (MAS) degree at the University of California, Davis academic health system (UC Davis Health).

There are four features of the assignment that are relevant to the AI-immunity challenge. (Below, I include an image of the actual assignment instructions for those who want to read them.)

  1. Interrelated Questions

The assignment has four questions that are interrelated.

The first question requires the student to describe four examples of “hypothetical but realistic” biomedical research projects that (i) fail to meet a specific international ethical standard and (ii) are different from each other along one dimension.

The second question requires the student to hold these examples fixed but ensure that the research projects fail to meet a distinct international ethical standard.

Then, for the third and fourth questions, the student must retain their examples from the first two questions — holding fixed as many of the examples’ features as possible — but alter them such that they meet international and local ethical standards.

For every question, the student must also explain why the research projects meet or fail to meet the relevant standard.

This feature of the assignment increases its AI-immunity because it can be challenging to prompt AI tools to generate responses that build on their prior responses (or, more generally, build on supplied responses).

  2. Verboten Content

The assignment’s questions involve content that is verboten in some respect — ethically dubious or wrong.

The assignment’s first two questions concern violations of international ethical standards for research involving human subjects (namely, the 2013 Helsinki Declaration and the Belmont Report).

The final two questions require the student to take their descriptions of these problematic projects and reconfigure them to conform with international standards and UC Davis’ IRB (institutional review board) standards, respectively. However, the project that conforms with UC Davis’ IRB standards must also “be a cause of at least a modest amount of evil.”

This feature of the assignment increases its AI-immunity because it can be challenging to prompt AI tools to generate responses that involve verboten content, since the developers of these tools have been working hard to prevent them from being abused.

  3. Field-Specific Standards

The assignment’s instructions heavily emphasize that each answer needs to “be as technical and specific as possible.” In this context, this means that the student needs to use the terminology of Clinical Research and their expertise/experiences to describe “hypothetical but realistic” research projects, including the specific diseases or conditions involved, the treatments to be compared, and the specifics of their proposed experimental design. And the same goes for the student’s explanations of how these research projects interface with ethical standards.

This feature of the assignment increases its AI-immunity because it can be challenging to prompt AI tools to generate responses that are (i) highly detailed and (ii) sensitive to the specifics — and terminology — of a specialized field, especially given AI tools’ tendencies to hallucinate or fabricate such responses.

  4. (Somewhat) Unique Content

As mentioned, one of the questions requires the student to describe a research project that would conform with UC Davis’ specific IRB standards. A website describing these standards is linked in the assignment instructions. The site is complex, contains multiple linked pdf files, and references some requirements unique to UC Davis’ context.

This feature of the assignment increases its AI-immunity because it can be challenging to prompt AI tools to generate responses sensitive to idiosyncratic or unique content, especially if the content is complex, sizable, or hard to convey in the prompts that one gives the AI tools. 

However, since the content is not especially unique and the assignment’s requirements are not especially sensitive to it, this feature is less central to this specific assignment’s AI-immunity.

Here are the full instructions for the assignment:

The full assignment prompt

🔮 The Professor’s Expectations

We want to thank Dr. Mark Fedyk for submitting one of his assignments to our challenge. Mark is an associate professor at UC Davis, working in the School of Medicine Bioethics Program and the School of Nursing. We really appreciate his willingness to let us test his assignment for AI-immunity, as well as his responsiveness throughout the process. Mark provided remarkably detailed comments on various aspects of his assignment and our submissions.

Mark told us that the Clinical Research course from which this assignment is taken, CLH 204, is “offered to three different student populations: students in the university's extension program, clinician researchers attached to our hospital (so physicians who have just finished a fellowship and are starting their research career), and graduate students in the biological or animal sciences in need of an elective.”

In explaining how he grades the assignment — before we gave him our first submission — Mark told us that

… the usual standard I use is to give full credit if the response is clear, well-written, and shows objective signs of what the clinicians call "thoughtfulness". If you are a patient, then the clinician you are working with is being "thoughtful" if you don't just feel, but have obvious signs that you are, understood by the clinician, and that her reasoning is responding to your reasoning in real time, and in a creative and intelligent way. I then back off 5% or even 10% for truly bad writing, obvious logical errors, factual mistakes, obvious contradictions that aren't explained or contextualized, and such like.

Dr. Mark Fedyk on the standards that he uses to grade this assignment

Given the assignment’s instructions to be “as technical and specific as possible”, a student’s answers are thoughtful only if they are creatively and intelligently responsive to this requirement.

Prior to grading our AI-generated submission, Mark told us that “plenty of [ChatGPT’s] responses display ‘thoughtfulness’. So the potential is there, I guess, if you can prompt engineer it to respond to the exam.” Mark reported that he had spent a little time trying to complete the exam with LLMs, only for them generally to decline to respond on ethical grounds.

📃 AutomatED’s AI-Created Submission

In a total of 56 minutes, I completed AutomatED’s submission for this assignment with the free and publicly available version of ChatGPT (reliant on GPT3.5; I will call it ‘ChatGPT3.5’). I believe this is significantly faster than a typical student could complete the assignment, given the complexity of the requirements and the length of the answers required (my submission was around 6 pages, single-spaced). Perhaps a student who could type very fast and who has an excellent grip on the aforementioned ethical standards could complete it in an hour, but it still would be a mad dash.

The interrelated nature of the questions was a challenge, but I was able to supply ChatGPT3.5’s prior outputs — in quotation marks — as parts of my new prompts, while asking it to be sensitive to them or to modify them in the relevant ways.
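For readers who would rather script this chaining than paste outputs by hand, here is a minimal sketch of the same pattern, written against the older (pre-1.0) interface of the OpenAI Python library. I did all of my prompting in the web interface, so the model name, API setup, and prompt wording below are illustrative assumptions rather than my actual prompts:

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder; set your own key

def ask(prompt: str) -> str:
    """Send a single user prompt and return the model's reply."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Question 1: generate the initial hypothetical projects.
q1_answer = ask(
    "Describe four hypothetical but realistic biomedical research projects "
    "that each fail to meet the 2013 Helsinki Declaration and that differ "
    "from one another along one dimension. Explain each failure."
)

# Question 2: embed the prior output, in quotation marks, in the new prompt,
# and state exactly what must be held fixed and what must change.
q2_answer = ask(
    'Here are four hypothetical research projects: "' + q1_answer + '". '
    "Keeping these examples as fixed as possible, modify each so that it "
    "instead fails to meet the Belmont Report, and explain why."
)

print(q2_answer)
```

The point is simply that each new prompt quotes the prior answer and spells out what should be preserved and what should be modified.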

For example, here is one of the more complicated prompts with which it successfully complied:

Nesting old outputs in new prompts to answer interrelated questions

And here is its response:

ChatGPT3.5 takes on complex nested prompts to answer interrelated questions

Circumventing ChatGPT3.5’s safeguards around verboten content was also relatively straightforward. In response to some prompts, ChatGPT3.5 would resist answering, but would soon thereafter comply, either without any change in my prompts or after I prompted it with disclaimers (“I am a student of medicine trying to learn about international ethical standards”).

Here is an example of ChatGPT3.5 making a distinction between condoning or promoting unethical research practices and merely describing them in hypothetical terms:

The beginning of one of ChatGPT3.5’s not-so-reluctant forays into the verboten

The greatest challenge was getting ChatGPT3.5 to be technical and specific in the way the assignment demanded. This is a general issue with generative LLMs, and it was made worse by the fact that I lacked any — I mean any — understanding of Clinical Research, so I could not effectively prompt it. I could not feed it examples of technical or specific writing in the field, and I could not judge whether its responses fit the bill.

Here is an example of ChatGPT3.5 being especially vague:

ChatGPT3.5 writing about "a new drug" and "a particular medical condition"

The final interesting feature of the assignment is that one of the questions references the IRB review standards unique to UC Davis.

To attempt to address the unique content mentioned by the assignment, I prompted ChatGPT3.5 with phrasing referencing UC Davis’ specific IRB review policies, but I did not supplement my prompts with any quotes of these policies. I would have done so had I found a short, straightforward summary of the policies that I could paste in.

🥷🏼 What I, the Cheater, Expected

Before I completed the assignment and sent it to Mark, it was clear to me that I had successfully circumvented ChatGPT3.5’s safeguards against producing verboten content. It also seemed to me fairly likely that ChatGPT3.5 was largely reliable in its responses about the international ethical standards referenced in the questions, given how much literature about these standards is in its training data.

It was less clear to me if ChatGPT3.5’s responses were technical or specific enough. And I had some sense of whether ChatGPT3.5’s answers were, more generally, “thoughtful” in the specific way Mark was looking for, but I was not close to being certain. At least some of them seemed to be unsatisfactory in these dimensions, even to my untrained eye.

Finally, it was not clear to me if ChatGPT3.5 was being sufficiently sensitive to the uniqueness of UC Davis’ specific IRB policies — after all, like any lazy cheater, I was not going to spend the time to learn these policies myself.

‍👨‍⚖️ The Professor’s Judgment

The verdict is in. Mark graded our submission and awarded it a grade of…

Sorry, I can’t resist using ChatGPT to code gifs with Python.

Mark reported that the class average for this assignment this year was a 98 (both scores are out of 100). However, Mark also told me that "all [of his] students are either advanced PhD students or doing medical fellowships," so it is an especially strong cohort. He wrote that he "can imagine the outcome being different if the pedagogical context were changed. For example, if the comparison class weren't relatively seasoned learners […] and, instead, were mostly 1st and 2nd year undergraduate students, then I obviously could not have weighted clinical detail as highly as I did."

This brings us to what went wrong for ChatGPT3.5. Here are the two main reasons that Mark gave for why ChatGPT3.5 did not do so well:

There were two especially salient differences between the real exams and the sample you sent. First, in the real exams, the level of clinical detail is higher, and the clinical details usually are described by the students in a way that settles the ethical question. Students usually describe a specific disease or condition, talk in expert terms about a specific therapy, and get into the finer details about study structure and protocols (e.g. variations of cohort studies). The sample that you sent was very thin in those areas. The second difference is what might be called, artfully, violations of thoughtfulness. That is, the sample you sent had some pretty obvious errors — or at least puzzling constructions related to — comprehension, inference, and repetition.

Dr. Mark Fedyk on where ChatGPT struggled

Here is the above-noted example of ChatGPT3.5 being especially vague, with Mark’s comments:

Dr. Mark Fedyk expressing some concerns about ChatGPT3.5’s vagueness

↩️ AutomatED’s Do-Over

After receiving Mark’s feedback, I asked him if I could submit a new answer to the first question of the assignment — that is, to provide four examples of “hypothetical but realistic” biomedical research projects that (i) fail to meet the 2013 Helsinki Declaration and (ii) are different from each other along one dimension. There were two reasons why I wanted to try again:

  • I now had a better understanding of what Mark meant when he instructed his students to “be as technical and specific as possible.” With Mark’s description of clinical detail in hand, I could better prompt ChatGPT.

  • Given that I lost points due to the lack of thoughtfulness of the responses of the free and publicly available version of ChatGPT (ChatGPT3.5), and given that the paid version of ChatGPT (reliant on GPT4; call it ‘ChatGPT4’) is much improved in precisely this way, I thought that I could use the latter to avoid these errors.

Mark agreed to grade a new answer, and I was off to the races.

Here is one of the first prompts that I tried, where I embedded Mark’s description of clinical detail in my directions to ChatGPT4:

Prompting ChatGPT4 to meet the field-specific standards of Clinical Research

One of the best responses I got occurred after I urged ChatGPT4 to take its prior response — which was responsive to a prompt like the above — and provide even further clinical detail (which I specified using Mark’s description):

ChatGPT4 providing more extensive clinical detail

Upon showing ChatGPT4’s four hypothetical research projects to Mark, he told me “That is much better!”

However, Mark wrote that while the projects proposed by ChatGPT4 were significantly more clinically detailed, the details of some of them “lack scientific, clinical, and/or ethical plausibility.” Indeed, Mark told me that some of the “studies described don’t pass a critical threshold of scientific or clinical common sense; and that fact is relevant to the ethical analysis.”

For instance, ChatGPT4 proposed a very specific therapy that would not be part of the sort of trial it described, and it proposed a randomized controlled trial for a particular project even though that sort of project would never have a design of that type. In another case, ChatGPT4 proposed a pain study involving placebos, but, as Mark put it, “a pain study almost certainly wouldn't use placebos, as that makes it pretty obvious to all involved which treatment is being offered, and it would define efficacy as ‘better than untreated pain’. In these studies, the usual standard is that of ‘non-inferiority’, meaning that efficacy would be something like ‘the new drug is at least as effective as the old drug’.”

Still, Mark reported that he was “impressed with how good the respective ethical analyses (‘discussions’) of the hypotheticals are.”

In giving his overall judgment on the ChatGPT4-generated new answer, Mark wrote as follows:

So, expressed quantitatively, if a student were to submit these, I probably wouldn't suspect that they were LLM generated, but I probably would assign them grades of around 89/100, and write margin notes clarifying the relevant clinical and scientific facts. I'd also be struck by how mechanical the submissions are; they seem to be following the same internal structure, and the errors or weaknesses in the responses also seem to have a pattern to them. Most likely I'd suspect that the student had a particularly unusual educational background, as medical schools are particularly good at training students into the clinical and ethical intricacies of trials involving novel gene therapies and historically marginalized and/or exploited populations. My first thought wouldn't be that these are LLM generated responses.

Dr. Mark Fedyk on ChatGPT4’s performance

ChatGPT4 achieved the B+/A- range

🧐 Lessons Regarding AI-Immunity

Here is what worked about the assignment with respect to AI-immunity:

  • Especially with ChatGPT3.5, it worked for the professor to demand that students’ answers meet high field-specific standards. In this case, providing sufficient and accurate “clinical detail,” as Mark puts it, is something that can be hard for generic generative LLMs to do well, especially if the assignment’s rubric is very demanding in this dimension. Clinical detail is a field-specific concept that these LLMs are not especially sensitive to, given the breadth of their training data. However, with guidance about what clinical detail is, ChatGPT4 did a much better job, although it still provided some clinical details that lacked “scientific, clinical, and/or ethical plausibility.”

  • Likewise, ChatGPT3.5 was not especially “thoughtful” — that is, its responses suffered from some errors of “comprehension, inference, and repetition,” especially as my demands involving these capacities increased. Again, this was less of a problem with ChatGPT4.

  • It worked for the professor to direct students to engage with unique content. In this case, Mark told us that one of the examples that ChatGPT3.5 gave — of a research project that would pass UC Davis’ IRB review — is “outside of the [UC Davis] IRB’s jurisdiction.”

Dr. Mark Fedyk on ChatGPT and the UC Davis IRB.

  • ChatGPT4 had trouble with unique content, too. While it did not hallucinate or fabricate in the same way that ChatGPT3.5 did, it tended to simply decline to engage with the specifics of UC Davis’ IRB guidelines:

ChatGPT4 recognizing its limitations with respect to unique content

Here is what did not work with respect to the assignment’s AI-immunity:

  • The interrelated questions did not work because there are many ways to embed the earlier responses of AI tools in new prompts so as to derive new responses that are sensitive to their old ones (and new information).

  • The reliance on verboten content did not work because there are many ways to get around AI tools’ safeguards around this sort of content. Perhaps an assignment could push farther in this dimension and find some AI-immunity, but I suspect it could be circumvented — not to mention the fact that the assignment may itself become unethical before it gets to that point.

Here are some changes that would increase the AI-immunity of this assignment, at least for the next few months:

  • If the assignment required students to focus mainly on the fine minutiae of the clinical details of their hypothetical research projects — precisely the details that even ChatGPT4 sometimes struggled with — then this would make it more challenging to use AI tools to complete.

  • If the assignment required students to cite specific passages in the 2013 Helsinki Declaration, the Belmont Report, or the UC Davis IRB guidelines, then they could not rely on many of the generic LLMs like ChatGPT alone, at least in their present form. ChatGPT4 will cite broad sections of widely available texts, like whole “articles” of the 2013 Helsinki Declaration, but will not generally provide citations that are both precise and accurate.

  • The above sourcing requirement could be heightened by requiring students to screenshot the relevant passages.

  • If students were required to base their answers on much more unique content — such as a specific part of UC Davis’ IRB guidelines or unique course material supplied by the professor in some form (ideally orally) — then they could not rely on generic LLMs like ChatGPT alone.

Here are some (big) caveats:

It is important for professors to understand several expectations we have about the future.

  • As this case showed, ChatGPT4 has already been found to be significantly more “thoughtful” than ChatGPT3.5, and this partially explains its improved performance on a variety of tests. The issues that Mark noted with ChatGPT3.5’s answers with respect to “comprehension, inference, and repetition” will likely become less and less common over the course of this year as the LLMs are honed.

  • More importantly, there are already free and publicly available AI tools that integrate generic LLMs like ChatGPT with functionality that enables them to analyze uploaded files. ChatPDF, for instance, applies the power of ChatGPT3.5 to uploaded pdfs, thereby empowering students to ask it questions about the pdf’s content. It can provide quotes of the pdf that are sensitive to the questions asked, and it can direct students to where to find these quotes (so that they could screenshot them, for example). A sketch of this kind of workflow appears after this list.

  • In fact, OpenAI is itself preparing to publicly release a companion plugin for ChatGPT called ‘code interpreter’ that enables users to upload large files which can then be analyzed. Likewise, Microsoft is working hard to release its so-called ‘Copilot’, which has similar functionality for all filetypes in the Microsoft ecosystem. These developments cut against designing assignments that rely heavily on unique content for their AI-immunity, and they indicate that professors’ energy might be better dedicated to assignments that require students to meet field-specific standards. Mark’s students, for instance, could easily upload pdfs of the 2013 Helsinki Declaration or the UC Davis IRB guidelines, but it is more challenging to get the LLMs to conjure sufficient and accurate clinical detail.
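To give a sense of how low the technical barrier already is, here is a minimal sketch of that kind of file-plus-prompt workflow. It is not how ChatPDF itself works; it simply assumes the pypdf and openai Python libraries (again the pre-1.0 interface), a hypothetical filename, and a placeholder API key:

```python
import openai
from pypdf import PdfReader

openai.api_key = "YOUR_API_KEY"  # placeholder; set your own key

# Extract the text of a locally saved PDF.
reader = PdfReader("helsinki_declaration.pdf")  # hypothetical filename
document_text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Naively truncate to stay within the model's context window, then ask for
# quotes that are sensitive to the question -- roughly what ChatPDF-style
# tools do on a student's behalf.
excerpt = document_text[:12000]
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": (
            "Here is a document:\n\n" + excerpt + "\n\n"
            "Quote the passages most relevant to the use of placebos, and "
            "tell me where in the document they appear."
        ),
    }],
)
print(response.choices[0].message.content)
```

Anything a student can download or upload can be routed into a prompt this way, which is why unique content alone is a fragile basis for AI-immunity.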

Here are our takeaways:

  • Professors should be thinking about analogs to “clinical detail” and “thoughtfulness” — namely, field-specific standards that are challenging for AI tools to meet in completing tasks like those their assignments involve. Crucially — and we cannot emphasize this enough — you should experiment to see if you can get tools like ChatGPT to generate responses that are sufficiently sensitive to your field’s standards to receive passing grades on your take-home written assignments. It is very hard to predict how AI tools will perform from the armchair (hence this challenge).

  • There are many pedagogically appropriate assignments that cannot be made AI-immune. Perhaps they should be assigned nonetheless, especially in contexts where students are not incentivized to plagiarize in the first place. Professors should reflect on their unique contexts in making these decisions.

  • Yet, as we have discussed when reflecting on the general structure of AI-immunity efforts, there are two paths to AI-immunity for assignments that are pedagogically appropriate but especially susceptible to AI plagiarism: namely, through in-class work and through pairing. Pairing requires the professor to find a second assignment or task that the student must complete in connection with the first assignment that effectively reveals, when completed, whether the student plagiarized the first with the help of AI tools. The professor’s goal is to incentivize students to complete both assignments honestly and earnestly. For instance, the professor could require students to present, discuss, or defend their answers to the original assignment in the classroom or on Zoom.

  • Then again, each professor should be asking themselves: is this assignment a case where I should be training students to use AI to complete it, rather than designing my assignments to be AI-immune? On some level, almost every field needs to take an honest look at itself in this respect. Last week, we discussed this topic at greater length and provided some considerations, as well as a decision flowchart.

🎯 🏅 The Challenge

We are still accepting new submissions for the challenge. Professors: you can submit to the AI-immunity challenge by subscribing to this newsletter and then responding to the welcome email or to one of our emailed pieces. Your response should contain your assignment and your grading rubric for it. You can read our original post on the challenge for a full description of how the process works.

In our next piece on the challenge, we take on an annotated bibliography assignment. Stay tuned…
