A trio of scientists from the University of North Carolina, Chapel Hill recently published pre-print artificial intelligence research showcasing how difficult it is to remove sensitive data from large language models (LLMs) such as OpenAI’s ChatGPT and Google’s Bard.
Once a model is trained, its creators cannot, for example, go back into the training database and delete specific files in order to prevent the model from outputting related results. Essentially, all the information a model is trained on exists somewhere inside its weights and parameters, where it cannot be pinpointed without actually generating outputs. This is the “black box” of AI.
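To make the contrast concrete, here is a minimal sketch, assuming the Hugging Face transformers library and the public gpt2 checkpoint (illustrative stand-ins, not the models the researchers studied). A database exposes records that can be deleted; a trained model exposes only a mass of floating-point parameters, and the sole interface to what it “knows” is generation.

```python
# A minimal sketch: a trained model has no addressable "records" to delete.
# Assumes the Hugging Face transformers library and the public gpt2
# checkpoint; both are illustrative choices, not the paper's setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Everything the model learned is spread across its parameters, with no
# mapping from any single fact back to any single number.
n_params = sum(p.numel() for p in model.parameters())
print(f"gpt2 stores everything it learned in {n_params:,} parameters")

# There is no delete operation; the only way to observe what the model
# knows is to prompt it and inspect the generated output.
inputs = tokenizer("The capital of Spain is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=5, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```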
[Image: despite being “deleted” from a model’s weights, the word “Spain” can still be conjured using reworded prompts.]

However, as the UNC researchers point out, this method, reinforcement learning from human feedback (RLHF), relies on humans finding all the flaws a model might exhibit, and even when successful, it still doesn’t “delete” the information from the model. Per the paper: “A possibly deeper shortcoming of RLHF is that a model may still know the sensitive information.”
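The reworded-prompt probe the image depicts can be approximated in a few lines. The sketch below is an assumption-laden illustration rather than the researchers’ actual attack code: it reuses the public gpt2 checkpoint as a stand-in, invents three paraphrases by hand, and simply checks whether the target string still appears in greedy completions.

```python
# Sketch of a reworded-prompt probe: checks whether a supposedly "deleted"
# fact still surfaces under paraphrased prompts. Illustrative only; this is
# not the researchers' attack code, and gpt2 is a stand-in model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def complete(prompt: str) -> str:
    """Greedy-decode a short continuation of the prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=10, do_sample=False)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

TARGET = "Spain"  # the information an editing method was meant to remove

# Hand-written paraphrases of the same underlying question; a real
# red-team effort would generate many more, by hand or with another model.
probes = [
    "Madrid is the capital of",
    "The country whose capital is Madrid is called",
    "Paella is the national dish of",
]

leaked = [p for p in probes if TARGET in complete(p)]
print(f"{len(leaked)}/{len(probes)} reworded prompts surfaced '{TARGET}'")
for p in leaked:
    print("  leaked via:", p)
```

As the caption suggests, passing even one such probe per fact is not enough: missing a single rewording means the information was never really gone.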