How Jeli is improving incident response by exploring failure

When something goes wrong with an application or service, there can be a lot of finger pointing, accusations and overall stress for IT professionals.

Nora Jones, founder and CEO of Jeli, knows the pain of incident response well. Jones has spent much of the last decade in the IT trenches, including nearly two years as a senior software engineer at Jet.com, which was acquired by Walmart in 2016. Jones spent two years in a similar role at Netflix and also had a seven-month stint as head of chaos engineering at Slack. Time and again, she ran into the same issues.

“I kept getting hired by places that were in trouble as they were scaling a lot and they were having a ton of incidents. And when that happens, employees get really distressed and things end up getting worse,” Jones told VentureBeat. “I kept getting hired to solve the same problems and I would come in and build the same tool, and I would help get the organization thinking about their incidents in a more positive way.”

Jones used her experience to found incident response vendor Jeli in 2019 and has been growing the company steadily over the last three years. Today, the company hit a major milestone, announcing that it has raised $15 million in a Series A round of funding. The round was led by Addition, with participation from Boldstart Ventures, Heavybit and Harrison Metal.

From chaos to organized incident response

At Netflix, Jones helped lead the streaming media company’s efforts around chaos engineering.

Chaos engineering is an IT approach in which failure conditions, such as a disabled cluster node, are injected into a workflow to see how resilient an application service is and whether it can recover from unexpected events. While Jones has more experience than most with chaos engineering, that’s not the focus for Jeli, though it has helped to inspire part of the platform’s approach.
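As a rough illustration of the idea, the Python sketch below disables one node in a toy cluster and checks whether requests still succeed. The Cluster class and run_experiment function are hypothetical names invented for this example; they do not represent tooling from Netflix, Slack or Jeli, and a real experiment would target live infrastructure rather than an in-memory model.

```python
import random


class Cluster:
    """Toy cluster (hypothetical): a request succeeds if any node is healthy."""

    def __init__(self, node_names):
        self.healthy = {name: True for name in node_names}

    def disable(self, name):
        self.healthy[name] = False

    def restore(self, name):
        self.healthy[name] = True

    def handle_request(self):
        return any(self.healthy.values())


def run_experiment(cluster, rng, requests=100):
    """Disable one randomly chosen node, then measure the request success rate."""
    victim = rng.choice(sorted(cluster.healthy))
    cluster.disable(victim)
    try:
        successes = sum(cluster.handle_request() for _ in range(requests))
    finally:
        cluster.restore(victim)  # always undo the injected failure
    return victim, successes / requests


if __name__ == "__main__":
    cluster = Cluster(["node-a", "node-b", "node-c"])
    victim, rate = run_experiment(cluster, random.Random())
    print(f"disabled {victim}, success rate {rate:.0%}")
```

In this toy model, a cluster with spare nodes keeps serving requests when one is disabled, while a single-node cluster does not. That is the kind of difference a chaos experiment is meant to surface.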

Jones said that what she thought she was doing with chaos engineering was building tools that would automate things.

“What I really realized was by implementing chaos engineering, people were learning more about their own systems,” she said. “The real beauty of it was that they were learning about their different failure scenarios.”

Those failure scenarios helped organizations learn more about what they actually care about in terms of application and service delivery. Jones said that she also came to realize there was a need to evolve beyond just chaos engineering, which is largely about testing potential failure scenarios. Rather, there was a need to better understand actual failures that organizations experienced and how they reacted to them.

“What we’re trying to do is help companies understand how it was possible for failures to even occur,” Jones said. “We’re really helping organizations learn from the incidents they’ve already had and then we surface patterns behind some of the incidents.”

A snapshot of the Jeli incident panel. Image source: Jeli.

Jones added that an organization could choose to use one of the identified failure patterns that comes from a Jeli investigation and then use that pattern in a chaos engineering exercise to test resilience.

How listening and learning are the foundations of Jeli

Jones originally chose the name Jeli because it was one she could get a domain for. She said that after the company was founded, she came up with a more elegant meaning for the name: Jeli is now an acronym that stands for Jointly Everyone Learns from Incidents.

The acronym also helps to explain how the Jeli platform works. In Jones’ view, the thing that differentiates Jeli is that it analyzes how different members of an IT organization communicate with each other.

“When someone has an incident, they will start talking to each other about what happened on a Zoom call or in a Slack channel,” Jones said. “There’s a lot of value in how people talk to each other. When there’s an emergency situation, all rules and procedures kind of go out the window and everyone’s just trying to do what they can to stop the bleeding, but there’s actually real data in there.”

The data that can be analyzed includes how long it took to get the right people involved in the response, as well as how long it took for an issue to be declared an actual incident. Another potential data point is how much time was spent in the diagnosis phase versus how much was spent remediating the incident.
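To make those measurements concrete, here is a minimal Python sketch that derives them from a handful of timestamps. The IncidentTimeline fields and the choice of where each phase starts and ends are assumptions made for this example, not Jeli's data model; in practice, the timestamps would have to be recovered from chat logs or call records.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict


@dataclass
class IncidentTimeline:
    """Hypothetical timestamps for one incident (not Jeli's schema)."""
    first_signal: datetime       # first sign that something was wrong
    declared: datetime           # issue formally declared an incident
    responders_joined: datetime  # the right people joined the response
    diagnosed: datetime          # cause understood well enough to act
    resolved: datetime           # remediation finished


def summarize(t: IncidentTimeline) -> Dict[str, timedelta]:
    """Compute the durations mentioned above under this sketch's assumptions."""
    return {
        # Assumption: both clocks start at the first signal.
        "time_to_declare": t.declared - t.first_signal,
        "time_to_assemble": t.responders_joined - t.first_signal,
        # Assumption: diagnosis runs from declaration until the cause is understood.
        "diagnosis": t.diagnosed - t.declared,
        "remediation": t.resolved - t.diagnosed,
    }


if __name__ == "__main__":
    day = datetime(2022, 1, 1)  # made-up example date
    timeline = IncidentTimeline(
        first_signal=day.replace(hour=10, minute=0),
        declared=day.replace(hour=10, minute=12),
        responders_joined=day.replace(hour=10, minute=20),
        diagnosed=day.replace(hour=11, minute=5),
        resolved=day.replace(hour=11, minute=30),
    )
    for name, duration in summarize(timeline).items():
        print(name, duration)
```

Comparing these durations across many incidents, rather than reading one in isolation, is what would reveal patterns such as responders consistently being paged late.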

Far too often, the cause of an incident is simply labeled as a lack of patching or a service misconfiguration. Jones emphasized that incidents are often more complex, and that it’s imperative for organizations to understand why an incident occurred.

“It bothers me when I see a report saying an incident was a simple line of code or it was an engineer hitting the wrong button,” Jones said. “There is a reason that line of code existed and there’s a reason that the engineer hit the wrong button and so I want more from those stories.”
