An Economist Explains Why So Many Business Ideas Fail to Scale

For most of history, business operated on instinct. Enterprises thrived or failed based on the intuition of the men and women who ran them. Entrepreneurs were celebrated for their innate understanding of markets. “Trust your gut” remains a slogan.

In recent years, however, businesses have embraced data to help make decisions, relying on the power of percentages to shape strategy. Much like the Moneyball revolution in sports, in which analytics replaced folk wisdom, executives have acknowledged that the gut isn’t always reliable. Sometimes it helps to have proof.
[Photo: John List. Courtesy of John List]

But for John List, a behavioral economist who has worked with dozens of companies, using data isn’t enough. Too often, he says, it’s deployed in limited ways, or used to justify a predetermined outcome. Too often, the result is that the idea won’t scale.

Scaling is the subject of List’s latest book, The Voltage Effect, an engaging attempt by an academic to encourage business people to include some basic economic principles in their strategic thinking.

List, a professor at the University of Chicago and former White House economist, is an evangelist for experimental economics, the practice of testing theories with real-world experiments. He has used experiments to explore the racial biases of auto mechanics, the ethics of baseball card dealers and whether women in matriarchal tribal cultures are more competitive than men (he found they are).

I first wrote about List in 2011, when he was launching what was then his most audacious experiment. With $10 million from hedge fund billionaire Ken Griffin, List and fellow economists Roland Fryer and Steven Levitt, of Freakonomics fame, created a pre-school in a low-income neighborhood to test whether parents or teachers had more influence on the academic success of children. Over four years, more than 2,000 families participated in the experiments run through the Chicago Heights Early Childhood Center (CHECC). Among the outcomes was an understanding that paying parents up to $7,000 a year to participate in a Parents Academy with workshops about child-rearing strategies produced significant benefits for their children’s academic success.

Since then, List has put his experiments into practice, working at Uber, where he headed up an “Ubernomics” team that encouraged the company to add tipping, as well as at Uber rival Lyft and, as of this year, Walmart.

In a recent interview, I asked List about the lessons of CHECC, why it’s hard to consider ideas at scale, and why businesses are reluctant to consider economic theories. Our conversation has been lightly edited for length and clarity.

Observer: So, why did you write The Voltage Effect?

John List: The book’s roots go back to when you and I first met and talked about CHECC. We got the great results from CHECC and around 2015, I started selling the results to policy makers. And I was met with a lot of skepticism. Policy makers would say, ‘looks like a great program, but don’t expect it to happen at scale.’ I would ask why, and they would say, ‘it just doesn’t have the silver bullet.’ And then I would say, well, what do you mean by that? And they would say, ‘we’re not really sure, but all of the experts tell us their programs will work and they end up being a fraction of themselves when you scale them.’

Shocking.

At that moment I kind of stepped back and said, what do we do as academics? Usually in academia, what we do is we run a program and we give our program its best shot of working. It’s an efficacy test. And then we write it up and get it into a good publication, we get tenure, we get grant money and it happens all over again. But is an efficacy test the right way to change the world if you want to change it at scale? And then I started wondering about the importance of scale. And I realized that every discussion I had been having at Uber at the time, where I was a chief economist, was a scaling discussion.

When I worked in the White House, it was a lot about scaling. When I worked for various firms, it was always in the foreground: Will this idea scale? So I started to say, well, maybe I should start an academic research agenda on scaling. And then I realized that, you know, I write these academic papers and maybe only four people will read them. So that was the come-to-Jesus moment where I said, I’m gonna write a popular book and give it a shot.

I believe many people are now willing to say that scaling is a science. People would say things like ‘move fast and break things,’ ‘fake it till you make it,’ ‘throw spaghetti against the wall, and whatever sticks, cook it.’ That’s the business world, but government was basically the same thing, that it’s a gut feeling.

For the people who aren’t familiar, which is 99.9% of the readers of the Observer, can you explain the outcome of CHECC: what worked and didn’t work there, and what scaled and what didn’t scale?

I think CHECC in general worked. It moved both cognitive and executive function skills. Now, the parent academy only worked for Latinos. It didn’t work for white or Black families. And that’s a teaching moment because if you want to scale the parent academy, it can scale to Hispanic families. But unless it changes, it won’t scale to any other families. And that’s an important moment in scaling: trying to figure out who your program works for.

The other thing we learned is our program needed good teachers. So our program can scale as long as we have good teachers. If you horizontally scale, that’s fine. Here’s what I mean by horizontally scale: I have one school in Chicago Heights, one school in Cincinnati, one school in Dayton, one school in Denver, etc. If I only need to hire 30 good teachers, I can do it one per city. But if I want to scale that in Chicago and hire 30,000 good teachers, I’m done. So with vertical scaling, I failed with CHECC. With horizontal scaling, I produced something.

What’s the theory for why it works with Hispanic families and not with others?

I don’t want to get in trouble here, but I think it’s because Hispanic families have more intact families that have more substitutable inputs. Invariably, it’s the mother in all of these families who is the go-to person in the parent academy. If the mother can’t make it in a Hispanic family, dad’s pretty good, grandma’s pretty good, auntie’s pretty good. But in the white and Black families, there’s less of that. So it’s really instructive about the types of programs that you can actually run. A lot of times people say ‘it’s a minority family, it’s a minority solution.’ It’s not true.

It sounds like you learned some pretty valuable lessons about scaling from the CHECC experience. What are the obvious ones that a reader could take away?

One is: always generate policy-based evidence.

The way that we’ve set up science, it’s called evidence-based policy. And it’s basically taking evidence from an efficacy test and seeing whether it scales or not. So policy-based evidence changes around the ordering by using backward induction. What I mean by that is, look at what your inputs are going to have to be at scale, and test them in the original Petri dish. Does your idea work with those inputs in place? That basically is policy-based evidence, because it’s what your idea is going to have to face if it becomes a policy. We never do that, ever. And it’s strange because if you really want to change the world, that’s where you would start. You would say, ‘Okay, what types of people and what types of situations does my idea have to work in?’

We don’t do that. We do the reverse. We say under the best-case situation, will the idea work? Steve Levitt and I had probably our biggest fight over hiring teachers for CHECC. He wanted to hire the very best teachers because he said, ‘look, you can’t go back to Griffin with a program that didn’t work and we can never get a program published in a good academic journal if it didn’t work.’ And I said, ‘No, no, no, no, no. I want to hire teachers exactly like Chicago Heights would hire teachers.’ I was half right. Because I was thinking about horizontal scaling, not vertical. To be completely right, I would want to hire some teachers like (how Chicago Heights’ school district hired them) and then some really bad teachers, the ones who I’m going to have to hire if I vertically scale.

I could see the appeal of wanting to produce a program that works, because it’s never going to get off the ground unless you could show some results. So better to sort of manipulate the evidence to get the best possible result. Then you could sort of worry about scaling later, but your point is that’s not going to work.

I like your intuition, because that’s been the academics’ intuition for five decades. Here’s why it doesn’t work: One, if somebody wants to go back now and reproduce CHECC to do that treatment arm that you want, it’s another $10 million. They won’t do it. It’s too expensive. Two, typically we do A/B testing, right? I’m just asking for option C. Have option B be your efficacy test, so get your big result, so you can go brag about it to people. But I want option C to include the critical features that you’re going to face at scale. And then your relationship between B and C tells you the reality, right? This is what policy makers want to know. And then if it doesn’t work for option C, you need to reconfigure. Or understand that you can just horizontally scale, which is useful information.

So in the case of CHECC, option C would be making sure you had enough bad teachers?

Yeah.
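
The comparison List describes can be sketched in a few lines of Python. This is a minimal illustration and not List's own method: the outcome scores are made up (they are not CHECC data), and the names `effect` and `retained_share` are hypothetical. Option A is the control group, option B is the best-case efficacy arm, and option C is the arm run with the inputs expected at scale.

```python
from statistics import mean

# Hypothetical outcome scores for illustration only; not data from CHECC.
control = [50, 52, 48, 51, 49]        # option A: no program
efficacy_arm = [60, 63, 58, 61, 58]   # option B: best-case inputs
scale_arm = [53, 55, 51, 54, 52]      # option C: inputs expected at scale


def effect(treated, baseline):
    """Difference in mean outcome between a treated arm and the baseline."""
    return mean(treated) - mean(baseline)


def retained_share(effect_b, effect_c):
    """Fraction of the best-case effect that survives scale-like conditions."""
    if effect_b == 0:
        raise ValueError("efficacy arm shows no effect; share is undefined")
    return effect_c / effect_b


effect_b = effect(efficacy_arm, control)
effect_c = effect(scale_arm, control)
print(f"B vs A: {effect_b:.1f}, C vs A: {effect_c:.1f}, "
      f"retained: {retained_share(effect_b, effect_c):.0%}")
```

With these invented numbers, the program gains 10 points under best-case conditions but only 3 under scale-like conditions, so 30% of the effect is retained. A real analysis would also need sample sizes and uncertainty estimates, which this sketch leaves out.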

Is there another example of a program that didn’t scale?

Do you have one of those smart thermostats in your home?

Yeah. We have a Nest.

So the engineers promise that if people put smart thermostats in their homes, we’re going to save tons of carbon emissions. Because it’s going to moderate the temperatures in our homes. That was all based on engineering estimates. We have now tested the smart thermostat with all kinds of households in California. They signed up, we sent half of them the smart thermostat, the other half we left as a control group. What we find is exactly zero energy savings from the smart thermostat.

Well, what happened? The engineers assumed that the end user was Commander Spock. And the end user is really Homer Simpson. So Homer Simpson goes in and reconfigures the default or the presets (on the thermostat). So what they needed to do is try out a few people and the friendliness of the machine itself, and make sure that what they’re scaling into was the right people with the right instructions and user friendliness. That’s a perfect example of Option C. Option B was Commander Spock, Option C is Homer Simpson.

What’s the best way to transfer this theoretical understanding of how businesses could be smarter and better use data to actual companies? It does seem like it really takes a kind of full embrace like what Uber did with you to bring these ideas on board.

What’s kind of interesting is in government, the agencies are chock-full of people who really know the literature. In the business world that’s starting, but it’s way behind. Because if you have a really good person who can read the literature and bring those ideas forward in a translatable way, that can work too.

Do you think business’s reluctance to embrace theory is what you were talking about earlier: ‘I made it this far on my gut, I don’t need any egghead to tell me what to do’? Or is it they just want to be lean and they don’t want things slowing them down?

That’s part of it. Part of it is that people underestimate the role that luck has played in their outcomes. And if you think you already have all the answers and know how to scale stuff, why do you need some economists? We’ve got this figured out. And then the other one is they do think there’s a true cost to it. Why should we want to run an experiment? That’s too costly. But it’s the opposite. Because if you go another day without knowing the truth, the opportunity cost is huge. Right? So they’re thinking about the problem wrong and saying it’s too costly. And by the way, I don’t think my book slows people down at all. In fact, I think it can speed things up because you can be more confident in what you can scale and what you can’t scale. This tells you where to look and it will tell you which ideas at least have a shot. I mean, you have to execute, of course. But it tells you if the idea even has a shot.

I was wondering if we could look at a case study you gave in the book, which is a pretty compelling one, which is Jamie Oliver’s restaurant chain. [Jamie Oliver’s chain of Italian restaurants, initially successful, expanded too quickly and declined rapidly after Oliver was no longer involved in their operation.] If you were advising him at the beginning, what could he have done differently?

So from the beginning, we would’ve noticed that he was the secret sauce. And we would’ve said, look, one fact is that unique humans don’t scale. So what are we gonna do? What you can do is you can try to systematize that unique human. Let’s think now about Uber. Uber could scale because an average Joe or Jane can drive. You don’t need Dale Earnhardt Jr. or Danica Patrick or Michael Schumacher. If you needed one of those, you’re done.

But now let’s say you did need one of those. How can you systematize that? That might be autonomous vehicles. So when autonomous comes, you’re systematizing the uniqueness. Now you have a chance. So now let’s go to Jamie Oliver and say, okay, what is it about your uniqueness? And can we systematize it? In some cases you can; in other cases you can’t, as with chefs.

How would you have identified that he was the secret ingredient to the whole operation’s success like that? That didn’t jump off the page.

I would’ve done exit surveys when he was the chef and when he was not the chef: How much did you like your meal? What did you like about your meal? I would’ve found that he’s getting all fives. And the person under him is getting the threes and I’m like, wow. You know what’s gonna happen here. We’re gonna try to scale this thing up and if Jamie’s not there…

So just like at CHECC, I want to figure out what are the critical inputs, and then you have to put those critical inputs in place at the same levels that you’re going to get when you scale. And that’s what people don’t do because they don’t want their ideas to fail. But if you don’t want your ideas to fail, it will never scale.