Using AI While Protecting Student Data

I explore five options for professors considering running student data through AI tools.

Welcome to AutomatED: the newsletter on how to teach better with tech.

Each week, I share what I have learned — and am learning — about AI and tech in the university classroom. What works, what doesn't, and why.

In this week’s edition, I sketch paths to climbing what in the last edition I called the “privacy mountain.”

The core problem I want to address today is challenging to solve but can be easily expressed: privacy issues and risks related to student data are piling up with the proliferation of AI tools, such that there is now a seemingly insurmountable mountain between professors and the full-fledged deployment of these tools in the university setting.

What has changed with the new AI paradigm?

AI tools give professors an incentive to run large quantities of student data through them, including high-quality, fine-grained data, because doing so yields more useful outputs.

But why does this bring privacy issues and risks?

Because the AI tools and their connections to the “outside” are potentially porous, either in themselves or in the hands of an ignorant or negligent (or even malicious) professor.

But why is the mountain of these issues and risks seemingly insurmountable?

Because addressing them — and climbing the mountain — requires knowledge, expertise, and, most importantly, significant intentional planning and action. The latter requires time and energy that many professors lack.

So, why not just move on from these tools?

Because, as I suggested in the last edition, the effective use of various AI tools has the potential to radically improve student outcomes via more responsive and personalized pedagogy. The possibilities are many and their impacts are far from trivial.

San Francisco, we have a problem.

In my last piece, I listed five main options for solving the problem. Below, I explain them in greater detail, with a twofold intention: first, I want to argue against the options where professors retreat from or severely limit AI use and, second, I want to offer an opinionated take on the other options. Hopefully my argumentation will help professors develop positions of their own that are sensitive to their contexts and other commitments.

Option 1: Don’t Run Student Data Through AI

One strategy would be to simply restrict AI use to professorial tasks that do not involve student data: lesson planning, faculty meeting transcription, brainstorming, general feedback on assessment (e.g. mentorship), etc. Do not use any AI tool with student data whatsoever, even if you use them elsewhere.

There are two general reasons to think that choosing this option would not be best.

First, it would significantly reduce the ability of professors to leverage the pedagogically useful aspects of AI tools that involve personalizing instruction, assessment, and communication to specific students’ needs. One of the most important dimensions in which AI tools can be leveraged is to customize the learning experience of individual learners and simulate a classroom full of highly skilled human tutors. In short, the costs of retreating are just too significant, at least for many fields.

Second, it is hard to see how it would be feasible, given how much student data is already and will be flowing through the ecosystems within which the most powerful AI tools will be operating. Nearly all universities and colleges in the US — and the world — use either the Google Workspace (GSuite) or Microsoft 365 (Office) for their core computing needs, including email. While a user must consent to integrate their Google Workspace with Bard via Bard Extensions (and probably will have to do so with Gemini when it comes out — see the Links below for more information), Google already has access to countless bits of student data through professors’ use of Gmail and other Workspace apps. The same dynamic will occur when Microsoft releases GPT-based Microsoft 365 Copilot in November, as Microsoft already has access to student data via Microsoft Outlook, Exchange, and the other Microsoft 365 apps and associated services.

Now, this is not to argue that professors should freely or recklessly run student data through any AI tool. There are undoubtedly many AI tools and many kinds of student data that should not come into contact, depending on the nature of the data in question and many other factors, as I will discuss below.

My point, rather, is that our reasons to not run any student data through an AI tool should be consistent with our other views (and the broader information technology strategies deployed at our universities), and these other views should be sensitive to the inevitability (and, dare I say, safety) of ecosystems like Google Workspace.

For example, whatever reason we have to limit Bard’s access to a given bit of student data should not also cut against running that data through, say, Gmail, unless we are opposed to the general use of apps in Google’s ecosystem for that purpose — a position that is implausible and untenable for those of us whose universities run on this ecosystem. What is the principled difference between Gmail (or Drive, Sheets, Docs, etc.) and Bard?

Option 2: Limit Data to “Completely Safe” Categories

Another strategy would be for professors to be extremely selective about what sort of student data they run through AI tools. The goal would be to select only “completely safe” categories of data for exposure to the relevant AI tools.

What qualifies as completely safe?

It is hard to give a simple and general answer to this question because of the many different contexts in which professors find themselves. However, I think good frameworks for thinking about how each of us should answer this question are provided by relatively comprehensive laws that attempt to govern the use of student data, like the US federal law FERPA. (Other examples include the European Union’s GDPR.) These laws provide insight into many of the relevant sorts of considerations in defining the safety of various categories of data.

The core idea behind FERPA is that, unless a student who is at least 18 years old explicitly consents in writing to release or share their “personally identifiable” data beyond the relevant parties in their educational institution for legitimate educational purposes, it must not be released or shared, except for a relatively small number of special cases (e.g., for certain student aid purposes or in a safety-related emergency).

Personally identifiable data, in turn, is data that could reveal the identity of the student associated with it, either in a single instance or in a sequence of disclosures, when combined with other plausibly available information or background context.

A two-column spreadsheet titled “Student 03902” with grades paired with assignment names is not necessarily personally identifiable. On its own, it does not identify Student 03902. Nonetheless, it would be personally identifiable in actuality if the spreadsheet could be associated, via other plausibly available information, with the person who is Student 03902. For example, if it is public information that Student 03902 is me because, say, 03902 is my university-assigned and public student ID number, then the student data on my grades in the spreadsheet is personally identifiable.

For my purposes here, I think a good starting point is to define completely safe student data as data that is not personally identifiable even in principle. That is, it is not possible to combine completely safe data with other plausibly available information or background context to identify which student it belongs or relates to.

A paradigm of completely safe student data would be sufficiently aggregated student data: data that has been created by combining and generalizing from individual bits of student data, such that there is no way to infer any individual student’s data from the aggregated numbers.
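To make the idea of sufficiently aggregated data concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the records, the field names, and the minimum group size of 5 are all hypothetical, and the right threshold depends on your context and your institution's policies.

```python
from statistics import mean

# Hypothetical per-student records, invented for illustration only.
records = [
    {"student": "S1", "section": "01", "score": 80},
    {"student": "S2", "section": "01", "score": 85},
    {"student": "S3", "section": "01", "score": 90},
    {"student": "S4", "section": "01", "score": 95},
    {"student": "S5", "section": "01", "score": 100},
    {"student": "S6", "section": "02", "score": 70},
    {"student": "S7", "section": "02", "score": 99},
]

MIN_GROUP_SIZE = 5  # assumed threshold, not a legal or institutional standard


def aggregate_scores(rows, min_group_size=MIN_GROUP_SIZE):
    """Return a per-section count and mean score, hiding means of small groups."""
    by_section = {}
    for row in rows:
        by_section.setdefault(row["section"], []).append(row["score"])

    summary = {}
    for section, scores in sorted(by_section.items()):
        if len(scores) < min_group_size:
            # With very few students, an average can reveal individual scores.
            summary[section] = {"n": len(scores), "mean": None}
        else:
            summary[section] = {"n": len(scores), "mean": round(mean(scores), 1)}
    return summary


summary = aggregate_scores(records)
```

Only `summary` would be passed to the AI tool, never `records`. The small-group check matters because an average over two or three students is often enough to work out what each of them scored.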

So, with this background in hand, let us ask: should we take the position that we only use completely safe student data with AI tools?

While there are benefits of this option that are not available to someone who takes Option 1 — for instance, sufficiently aggregated student data can still be useful for a variety of professorial tasks, like lesson planning or assessment analysis — my view is that it suffers from the same worries facing Option 1. Student data that is not completely safe is incredibly useful to run through AI tools, and it also already is running through the ecosystems operated by the same companies running AI tools (like Google Workspace and Bard). Again, the cost is too large and it is not feasible.

Option 3: Get Explicit Written Consent

A third strategy would be to get explicit written consent from students in order to use any of their data with AI tools.

In my experiences discussing with students whether they consent to me using their data, they always appreciate being asked, even if the data and use case are completely safe (and not covered by FERPA). Why? They tell me that they know that the university and their professors have access to so much of their data, but that they wish it were more transparent how it is being used. They appreciate being respected and included in the process of teaching them and assessing their work.

Given that I think that this general desire amongst students for greater data transparency is more than reasonable, it is my view that it is worth trying to respect. In general, professors should be honest and they should be in the position to convince (reasonable) students that their uses of student data are worthwhile and worth consenting to.

But the present question concerns the option where professors seek consent for all uses of student data with AI tools, regardless of the use case and regardless of the nature of the data (e.g., whether or not it is completely safe). It is a way to address the problem via a blanket strategy of informing students of how exactly we plan to use their data in our uses of AI tools and then getting consent for precisely those use cases. It requires changing the paradigm around consent: the default stance of the professor becomes one of consent seeking rather than one of avoiding scenarios that would require consent. Should we take this option?

As I suggest when considering Option 5 below, my view is that relying completely on consent — whether the data in question is completely safe, personally identifiable, or whatever — is too extreme.

For me, the clearest exception cases to such a broad blanket policy are uses of student data that are completely safe (not even in principle personally identifiable) where it is costly or inconvenient to get consent and there are serious benefits of me running such data through AI tools.

The next clearest are cases where the data is personally identifiable in principle, but where I can take reliable steps to make it not personally identifiable in actuality (e.g., pseudonymize it, per the below) and where it is stored in an ecosystem I am already relying on to protect personally identifiable student data like sensitive emails. And I would even go so far as to suggest that consent is not morally required or needed in cases where I am using AI tools in such an ecosystem on data that is personally identifiable.

The less clear cases are cases where the AI tools are third-party tools. Even if the relevant data is not personally identifiable in actuality, these are cases where consent should be sought because of the increased risk, the expectation of students to not have their data used this way (or their lack of expectation that it would be used this way), and so on. I will discuss this sort of mixed or combination strategy below in the discussion of Option 5…

(This is to set aside the question of conformity to my university’s data classification and usage policies, which are somewhat idiosyncratic. Professors should be aware of those institutional policies with which they must be compliant.)

Option 4: Pseudonymize or Anonymize

A fourth strategy would be for professors to pseudonymize or anonymize the student data before running it through the AI tools.

Pseudonyms are names that are unique to individuals but which do not reveal their identities, so pseudonymized student data is identical to named data except that the names it carries do not make it personally identifiable. Of course, to avoid personally identifying the students, the code or key mapping the pseudonyms to the students’ real names must not be available or accessible to outside parties.
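As a rough illustration of what this can look like in practice, here is a minimal Python sketch. The roster, the field names, and the "Student-xxxxxx" pseudonym format are hypothetical choices made for the example. It assumes the names live in a dedicated column, and it does nothing about names that appear inside free text such as essays or emails.

```python
import secrets


def build_key(names):
    """Map each real name to a random, unique pseudonym like 'Student-3f9a1c'."""
    key = {}
    used = set()
    for name in names:
        if name in key:
            continue  # the same student always keeps the same pseudonym
        pseudonym = f"Student-{secrets.token_hex(3)}"
        while pseudonym in used:  # re-draw on the rare collision
            pseudonym = f"Student-{secrets.token_hex(3)}"
        used.add(pseudonym)
        key[name] = pseudonym
    return key


def pseudonymize(rows, key):
    """Return copies of the rows with the 'name' field replaced by its pseudonym."""
    return [{**row, "name": key[row["name"]]} for row in rows]


# Hypothetical roster, invented for illustration only.
roster = [
    {"name": "Ada Example", "essay_score": 91},
    {"name": "Ben Sample", "essay_score": 78},
]

key = build_key(row["name"] for row in roster)  # keep this private
safe_rows = pseudonymize(roster, key)           # this is what the AI tool sees
```

The pseudonyms are drawn at random rather than derived from the names or ID numbers, so they cannot be reversed without `key`. That puts all of the risk on the key itself, which should be stored separately from the pseudonymized data and never pasted into the AI tool.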

Anonymized student data lacks names entirely and, unlike pseudonymized data, it can be such that it is not personally identifiable even in principle. Sufficiently aggregated data, discussed above, is anonymized.

Our sister newsletter for the K12 space, Teacher’s AIed, today posted a piece on “Simplifying Small Group Differentiation with AI” that discusses this topic via an example of pseudonymization in which the students’ names are mapped to numbers via a code or key (but the piece made me think: maybe famous people’s names would be even better as pseudonyms!).

Should professors pseudonymize or anonymize all student data before running it through AI tools?

To apply some of my arguments from above to this option, I would argue that pseudonymization should be used only when the student data needs to be run through the tool with names in order to keep track of who is who. Otherwise — if there is not a serious benefit of using named data — professors should use other strategies. Anonymization is generally going to create completely safe data, so my preceding comments about this category apply to it, too.

To be clear, this is not to argue that every time that student data needs to be run through any AI tool with names, it should be pseudonymized. If, for instance, it is permissible to run personally identifiable student data through certain AI tools’ ecosystems — like Microsoft’s — then pseudonymization is not required.

However a professor comes down on this issue, if they pseudonymize, they must use a pseudonymization code or key that is up to the task and stored safely.

Option 5: A Combination of the Above

A combination of these options would deploy different strategies for different categories of student data and different AI tools.

As can be inferred from my discussion thus far, I favor a mixed approach that combines the benefits of Options 3 and 4 in ways that align with my reasons for rejecting Options 1 and 2.

Here is what I have in mind (note: all of the below are treated as relative to a given AI tool use case where you would significantly benefit as a professor seeking to improve your pedagogy):

  1. If the student data is completely safe (i.e., the data in this context is not personally identifiable even in principle), then have no fear and use the AI tool, even if third-party.

  2. If the student data is not completely safe but also not personally identifiable even without you making any modification to it, then have no fear and use the AI tool, even if third-party, so long as this does not violate any other policy. (You may want to let your students know you are using their data in this way for purposes of transparency, but not in the stance of asking for their consent, since none is needed.)

  3. If the student data is personally identifiable unless you modify it but you can pseudonymize it or anonymize it, then pseudonymize it or anonymize it before using the AI tool. Again, you may want to inform your students even though no consent is needed, and you should be careful to carry out the process properly.

  4. If the student data is necessarily personally identifiable (i.e., cannot be pseudonymized or anonymized), then either use the AI tools within your university’s Google / Microsoft ecosystem or get explicit written consent from your students to do so.
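For readers who like to see rules written out as logic, the four cases above can be sketched as a small Python function. This is just one way to encode my own opinionated list. The category names and the `recommend` function are hypothetical, and the sketch ignores institutional policies, which may override any of these branches.

```python
from enum import Enum


class DataKind(Enum):
    COMPLETELY_SAFE = "not identifiable even in principle"
    NOT_IDENTIFIABLE_AS_IS = "not identifiable without any modification"
    MASKABLE = "identifiable unless pseudonymized or anonymized"
    NECESSARILY_IDENTIFIABLE = "cannot be pseudonymized or anonymized"


def recommend(kind, in_university_ecosystem=False, has_written_consent=False):
    """Return a short recommendation for one beneficial AI tool use case."""
    if kind is DataKind.COMPLETELY_SAFE:
        return "Use the tool, even if third-party."
    if kind is DataKind.NOT_IDENTIFIABLE_AS_IS:
        return ("Use the tool, even if third-party, if no other policy forbids it; "
                "consider informing students.")
    if kind is DataKind.MASKABLE:
        return ("Pseudonymize or anonymize first, then use the tool; "
                "consider informing students.")
    # Remaining case: the data is necessarily personally identifiable.
    if in_university_ecosystem or has_written_consent:
        return "Use the tool."
    return ("Stay within the university's Google / Microsoft ecosystem "
            "or get explicit written consent first.")
```

For example, `recommend(DataKind.NECESSARILY_IDENTIFIABLE)` returns the advice to stay within the university's ecosystem or get consent, while passing `has_written_consent=True` returns "Use the tool."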

Obviously there are other ways to combine the preceding options. For example, you may think that you should seek consent in cases where I think it is not needed, even though FERPA and your university do not require you to get it. Or perhaps you think that third-party tools should not be used in general because of their other data management practices (e.g., they use your engagement with them for training).

Closing Thoughts

I must say: all of this assumes that you are not being a fool, and do not have a decent chance of doing something foolish, with your tech practices.

In other words, I have not mentioned all sorts of considerations relevant to data security that apply beyond the AI domain: namely, everything from securing your account information (usernames and passwords) and preventing your devices from becoming infected with malware to not using Zapier to automate AI workflows in irresponsibly risky ways.

All of that goes without saying, or so I would hope!

I also have not discussed the institutional changes that I think the new AI paradigm should bring. In particular, I think the ways in which we get student consent with regard to data should be streamlined and improved. But that topic will have to wait for another edition…
