Haize Labs is using algorithms to jailbreak leading AI models

Regular readers of VentureBeat will know that there is a burgeoning scene of people racing to jailbreak the latest and greatest AI models from tech companies such as OpenAI, Anthropic, Mistral, Google, Meta, and others.

What’s the goal of these jailbreakers? To prompt AI models in ways that cause them to violate their built-in safeguards and produce NSFW content, even dangerous outputs — everything from detailed instructions on how to manufacture meth and bioweapons to nonconsensual porn to violent, gory imagery. In fact, we interviewed one such jailbreaker — who goes by the pseudonym “Pliny the Prompter” and @elderplinius on X — not too long ago.

But now a whole new startup has emerged to commercialize jailbreaking of LLMs on behalf of the AI companies themselves, helping them identify holes in their models’ security and alignment guardrails. Called “Haize Labs,” it launched last week with a flashy video on X showing off a number of disturbing and sometimes amusing examples of leading AI models being jailbroken to produce malicious or controversial outputs like those described above.

The startup was covered in The Washington Post and it offers what it calls the “Haize Suite,” or haizing suite, which its CEO Leonard Tang tells VentureBeat is essentially a collection of algorithms specifically designed to probe large language models (LLMs) like those that power ChatGPT and Claude for weaknesses.

Unlike Pliny the Prompter, the prolific LLM jailbreaker who chooses to remain pseudonymous, the founders of Haize Labs have identified themselves publicly: Tang, Richard Liu and Steve Li, all former classmates at Harvard University, according to Tang.

VentureBeat recently had the opportunity to interview Tang over email about Haize Labs and what it offers to companies. In the course of the questioning, Tang revealed Haize Labs already counts among its growing list of clients Anthropic, which just released its newest model Claude 3.5 Sonnet today, taking the crown from OpenAI’s GPT-4o to become the most intelligent model in the world. Read on for more about the company’s genesis and approach to AI safety.

The following interview has been lightly edited for style and grammar:

VentureBeat: What is your name, age, location, and occupation? Who are you and what do you do for money and for your hobbies?

Leonard Tang: I’m Leonard Tang, age 22, freshly graduated from undergrad (Harvard). I’m the founder and CEO of Haize Labs, based in NYC :)

What is Haize Labs, why did you launch it, and when officially (I see your blog has been active since January 2024)? What inspired you?

Haize Labs is a startup that commercializes a bunch of research on adversarial attacks and robustness I’ve been thinking about throughout undergrad. We incorporated back in December and really got going end of January. I was inspired to tackle the fundamental research of solving AI reliability and safety; and it also seemed that everybody was turning a blind eye towards this problem amidst the AI hype.

The video teaser for Haize Labs is cool and appears to show lots of jailbreaks across different modalities — text, image, video, voice, code — as does your announcement thread on X. How many different models have you jailbroken and how many different modalities, so far?

Many, many different models and many different modalities. Well into the dozens on the model side, and all the major modalities: text, audio, code, video, image, web, etc.

Is anyone else involved in Haize Labs and if so, who? How many? If not, why call it Haize Labs if it is just you?

We have an awesome co-founding team of three! I’m joined by Richard Liu and Steve Li, two of my super close friends from undergrad. We’re also supported by an awesome set of advisors and angel investors — Professors from CMU and Harvard, the founders of Okta, HuggingFace, Weights and Biases, Replit, AI and security execs at Google, Netflix, Stripe, Anduril, and so on.

What does the name “Haize Labs” mean and why did you choose it?

“Haizing” is the process of systematically testing an AI system to preemptively discover and mitigate all its failure modes. Hence the name Haize Labs. But the original lore behind the name was because we wanted to pay homage to Hayes Valley. We were also thinking about how “Mistral” came up with its name, and Mistral reminded us of Misty, which reminded us of Hazy. But Hazy Research is the lab name of one of my would-be Stanford PhD advisors, so we had to choose something different, like Haze. Then we wanted “HAI” for “Human AI” — and settled on Haize.

When did you get started jailbreaking LLMs? Did you jailbreak stuff besides LLMs before?

LLMs — two years ago. I was “jailbreaking” (doing adversarial attacks on) image classifiers since 2020. Back then I was trying to bypass Twitter’s NSFW filters haha.

What do you consider your strongest red team skills, and how did you gain expertise in them?

Our strongest skills are we know how to write really great search/optimization algorithms to do red-teaming. We’ve just spent a lot of time on this research problem.

Can you describe how you approach a new LLM or Gen AI system to find flaws? What do you look for first?

We are kind of dumb, but our algorithmic haizing suite is very smart. And so the suite is basically able to bootstrap and discover novel attack vectors for different systems without much human intervention.

Have you been contacted by AI model providers (e.g. OpenAI) or their allies (e.g. Microsoft representing OpenAI) and what have they said to you about your work?

Yes. In fact some of these providers are our customers, most notably Anthropic.

Have you been contacted by any state agencies or governments or other private contractors looking to buy jailbreaks off you and what have you told them?

Not quite the government yet, but definitely government affiliates!

Do you make any money from jailbreaking? If so, what’s the business model?

Yes! As mentioned above, we have customers who pay for our automated haizing suite. The business model is sometimes services (e.g. for foundation model providers) and sometimes SaaS (i.e. at the application layer). There, we offer CI/CD haizing and a run-time defensive solution.

Do you use AI tools regularly outside of jailbreaking and if so, which ones? What do you use them for? If not, why not?

Yes — for code gen. Big fan of Cursor! Of course ChatGPT is critical too.

Which AI models/LLMs have been easiest to jailbreak and which have been most difficult and why?

Hardest: Claude, by far! Good job Anthropic :) Easiest: Many have been super trivial to break. Of course models like Vicuna and Mistral that don’t explicitly perform safety finetuning are very easy to break.

Which jailbreaks have been your favorite so far and why?

I like “Thorn in Haizestack” because it’s a super simple and clean idea. But it just shows how standard RLHF [reinforcement learning from human feedback] is not a panacea for ensuring safety.

Editor’s Note: This jailbreak is a twist on the standard “needle-in-a-haystack” LLM test, in which unrelated text is inserted into a larger corpus of text and an LLM is asked to recall that inserted information as distinct from the text surrounding it. In this case, the “thorn” is information that falls outside of the LLM’s guardrails, such as NSFW or dangerous content.
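
For readers unfamiliar with the standard test, the following is a minimal Python sketch of how a needle-in-a-haystack prompt is typically assembled and scored. It illustrates the ordinary benchmark with a harmless needle, not Haize Labs’ jailbreak; the function names, the filler sentence, and the “secret code” needle are all hypothetical choices made for illustration.

```python
def build_haystack(filler: str, needle: str, n_sentences: int, depth: float) -> str:
    """Place `needle` among `n_sentences` copies of `filler`.

    `depth` is the relative insertion point: 0.0 puts the needle first,
    1.0 puts it last. All names here are illustrative assumptions.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [filler] * n_sentences
    position = round(depth * n_sentences)  # index at which the needle is inserted
    sentences.insert(position, needle)
    return " ".join(sentences)


def recalled(model_answer: str, expected: str) -> bool:
    """Score a reply: did the model's answer contain the expected fact?"""
    return expected.lower() in model_answer.lower()


# Example: a benign needle placed halfway through ten filler sentences.
haystack = build_haystack(
    filler="The sky is blue.",
    needle="The secret code is 7421.",
    n_sentences=10,
    depth=0.5,
)
prompt = haystack + "\n\nWhat is the secret code?"
```

In a real evaluation, `prompt` would be sent to a model and the reply passed to `recalled`, usually across many context lengths and depths.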

Are you concerned about facing any legal or civil action or ramifications of jailbreaking? E.g. AI model provider companies suing you or trying to bring charges?

Nope! I would say they were already very excited to work with us, and more are excited to work with us :)

What do you say to those who view AI and jailbreaking of it as dangerous or unethical? Especially in light of the controversy around Taylor Swift’s AI deepfakes from the jailbroken Microsoft Designer powered by DALL-E 3? I note some of the jailbreak examples you showed off in the Haize Labs video included NSFW content.

The whole point of Haize Labs is to go on the offense so that we can provide a defensive solution. Of course, the examples we released were meant to be more on the provocative side — but these are precisely the things we help prevent from happening.

On X, you note “we used our haizing suite” to jailbreak various LLMs. What is this suite?

The haizing suite is a suite of haizing algorithms, i.e. search and optimization algorithms that crawl the space of inputs to your LLM, guided with the objective of producing a harmful model output. There are a bunch of different algorithms: evolutionary programming, reinforcement learning, multi-turn simulations, VAE-guided fuzzing, gradient-based methods, MCTS-esque methods, LP solvers, all sorts of fun things.
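
To give a rough sense of what “search over the space of inputs, guided by an objective” means, here is a generic toy sketch of an evolutionary search in Python. It is not Haize Labs’ code, and nothing in the interview describes their implementation. The objective here is a deliberately harmless stand-in (matching a fixed target string), and every name, parameter and default below is a hypothetical choice for illustration.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def toy_score(candidate: str, target: str = "hello world") -> int:
    """Hypothetical stand-in objective: count positions that match a fixed target."""
    return sum(a == b for a, b in zip(candidate, target))


def mutate(candidate: str, rng: random.Random) -> str:
    """Replace one randomly chosen character with a random letter or space."""
    i = rng.randrange(len(candidate))
    return candidate[:i] + rng.choice(ALPHABET) + candidate[i + 1:]


def evolve(length: int = 11, population_size: int = 50,
           generations: int = 500, seed: int = 0) -> str:
    """Keep the top fifth of each generation and refill with mutated copies."""
    rng = random.Random(seed)
    population = [
        "".join(rng.choice(ALPHABET) for _ in range(length))
        for _ in range(population_size)
    ]
    for _ in range(generations):
        population.sort(key=toy_score, reverse=True)
        survivors = population[: population_size // 5]
        if toy_score(survivors[0]) == length:
            break  # a candidate matches the target at every position
        children = [
            mutate(rng.choice(survivors), rng)
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children  # survivors carry over unchanged
    return max(population, key=toy_score)


best = evolve()
print(best, toy_score(best))
```

Because the survivors are carried over unchanged, the best score never decreases from one generation to the next. In a red-teaming setting the scoring function would instead be some measure of how far a model’s response departs from its intended behavior, which is an assumption about the general approach rather than a detail Tang provided.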

I also see on your website and X you are offering “early access,” presumably to the Haize Suite, via online sign-up form. Who is this designed for, who are you hoping applies, what will they receive by signing up, and will you be charging them money in exchange for providing them something?

This is our free (but very selective) Beta! We want folks who are thinking about how to adopt AI safely and responsibly to test out the platform. Think CISOs, developers, compliance buyers, and so on. They get access to our haizing suite, where they can haize any model against any set of harmful behaviors.

I also note you are taking requests online for jailbreaks of models. Do you plan to attempt and succeed at all requested jailbreaks? Is there any type of requested jailbreak you would refuse due to it being too dangerous? If so, what kind? Also, why are you taking requests to jailbreak LLMs and how long do you plan to keep doing it? Are you charging people money for them or not, and why?

Not charging money! We just want to bring more awareness to this problem and see how people are thinking about safety. We absolutely plan to succeed at all requested jailbreaks. We will of course exercise rational judgement vis-à-vis what is too dangerous and not. For example, things like CSAM and revenge porn are simply immediately off the table. We plan on doing this for quite a while, maybe forever :)