
Rethinking historical stress tests: from remembered crises to detectable market regimes


12 Jun 2026

Assumed audience: Anyone who has seen a stress test result — "in a 2008 scenario, this portfolio loses 23%" — and wondered where that scenario came from. No prior risk knowledge needed. If you assemble scenario sets for a living, the question at the center of this piece might be one you've asked yourself for years.

A historical stress test replays a past crisis against today's portfolio: if 2008 happened again tomorrow, what would happen to these investments? It's one of the most widely used tools in risk management. It also involves a step that has traditionally been hard to formalize: deciding which crises to replay, and where each one starts and ends.

Until now, that decision has relied on human judgment and shared memory — which is natural, but leaves open questions about completeness and consistency.

Edgelab's research offers an objective method. It defines market stress in precise, quantitative terms, scans the full history of a market, and finds every stress episode — famous or forgotten — ranked by severity. This article explains how it works and what it changes.


1. What a historical stress test is

Before an engineer signs off on a bridge, she asks: what's the worst this bridge will ever face? The heaviest load, the strongest wind, the hundred-year flood. Then she designs for it.

Risk management asks the same question about portfolios. And markets, conveniently, keep records. So one of the most natural ways to test a portfolio is to replay history against it: take the market movements from a past crisis — the 2008 financial crisis, the dot-com crash, March 2020 — and apply them to today's holdings. The output is concrete and easy to communicate: in a repeat of 2008, this portfolio loses 23%.
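The replay itself is simple arithmetic. Here is a minimal sketch, assuming a portfolio described by asset-class weights and a scenario described by one return per asset class. The function name `scenario_return`, the weights, and the crisis returns are all made up for illustration; they are not figures from the research.

```python
# Hypothetical example: replay a past episode against today's holdings.
# All weights and returns below are illustrative, not real market data.

def scenario_return(weights, scenario_returns):
    """Portfolio return if each asset repeated its return from the episode."""
    missing = set(weights) - set(scenario_returns)
    if missing:
        raise KeyError(f"no scenario return for: {sorted(missing)}")
    # Weighted sum of the per-asset returns observed during the episode.
    return sum(w * scenario_returns[asset] for asset, w in weights.items())


# Today's portfolio, as fractions of total value (assumed numbers).
weights = {"equities": 0.50, "bonds": 0.40, "oil": 0.10}

# Returns each asset class experienced over the replayed episode (assumed).
crisis_returns = {"equities": -0.40, "bonds": 0.05, "oil": -0.30}

result = scenario_return(weights, crisis_returns)
print(f"Portfolio return in the replayed scenario: {result:.1%}")
```

A production risk engine would reprice each instrument rather than multiply weights by returns, but the weighted sum is enough to show where the headline number comes from.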

This is a historical stress test. Regulators expect it. Clients understand it. Risk teams run it routinely.

There is just one step in the process that has never had a formal answer: which pieces of history should be replayed?

2. The problem: scenario lists are written from memory

Try a small experiment. Ask three colleagues when the 2008 crisis began. One will say March, when Bear Stearns collapsed. Another will say September, when Lehman fell. A third will point to the cracks that appeared in 2007. All three answers are defensible — and that's precisely what makes the question interesting. Move the start of a scenario by a few weeks and the loss it predicts for a portfolio can change substantially. Two equally careful teams, using the same crisis, can produce different results.
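The effect of the start date is easy to see on a toy series. The sketch below uses invented monthly index levels and a hypothetical helper `window_return`; the two windows end at the same point and differ only in where the scenario is said to begin.

```python
def window_return(prices, start, end):
    """Simple return from prices[start] to prices[end] (list positions)."""
    return prices[end] / prices[start] - 1.0


# Hypothetical index levels, one per month. Not real market data.
prices = [100, 104, 101, 95, 97, 88, 70, 62, 66]

# Same end point (the trough), two defensible starting points.
early_start = window_return(prices, 1, 7)  # scenario begins at the peak
late_start = window_return(prices, 4, 7)   # scenario begins three months later

print(f"Start at the peak:    {early_start:.1%}")
print(f"Start 3 months later: {late_start:.1%}")
```

Both windows describe "the same crisis", yet the later start reports a smaller loss, which is exactly the inconsistency two careful teams can run into.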

There's a second, subtler dimension. Any list of crises assembled by people naturally reflects the crises those people experienced — and the ones that made headlines, which are overwhelmingly equity events. A bond index or an oil index has lived through its own difficult periods, some of which never became household names. Those episodes are just as real, and for a portfolio holding those assets, just as relevant. They're simply harder to find by recollection alone.

And there is a third issue that follows from the first two: a hand-curated list can never demonstrate its own completeness. You can check that every scenario on the list is reasonable. You cannot check that nothing reasonable was left off.

None of this is a criticism of how anyone works. Relying on memory and judgment was not a methodological choice — it was the only option available, because no formal alternative existed. The interesting question is what becomes possible once one does.

3. What the research does differently

The research provides that alternative in two steps: it defines what a stress event is, then finds all of them.

First, the definitions. The everyday word "stress" covers several distinct things, so the paper separates them along two axes: how long an episode lasts, and which direction it moves. That yields four clean types.

A crash is short and sharp — heavy losses over days or weeks.

A crisis is long and grinding — losses stretching over months.

And because the method is even-handed, the upside gets the same rigor:

A rally is a short, fast climb.

A recovery is the long climb back.

This is more than a naming exercise. It changes what stress testing is allowed to see. Stress is no longer only "how bad can losses get?" It becomes a broader question: what have been the historically extreme market regimes, in both directions, over different horizons?

The two-sided view matters most for portfolios with structured products, options, short positions, or other asymmetric exposures, where an extreme upside move can hurt as much as an extreme downside move.

Each type has a precise quantitative definition that anyone can inspect, challenge, and apply consistently.
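The two-axis idea can be sketched in a few lines. The paper's actual quantitative definitions are not reproduced in this article, so everything numeric here is an assumption: the 60-day cutoff between "short" and "long", the `Episode` class, and the `classify` function are illustrative names and values only.

```python
from dataclasses import dataclass

# Assumed cutoff between "days or weeks" and "months". The real threshold
# is set by the paper's definitions, which this sketch does not know.
SHORT_HORIZON_MAX_DAYS = 60


@dataclass(frozen=True)
class Episode:
    start_day: int       # index of the first trading day
    end_day: int         # index of the last trading day
    total_return: float  # cumulative return over the episode

    @property
    def duration(self):
        return self.end_day - self.start_day


def classify(episode):
    """Map an episode to one of the four types by horizon and direction."""
    short = episode.duration <= SHORT_HORIZON_MAX_DAYS
    down = episode.total_return < 0  # a zero return is treated as "up" here
    if down:
        return "crash" if short else "crisis"
    return "rally" if short else "recovery"


print(classify(Episode(start_day=0, end_day=15, total_return=-0.25)))
print(classify(Episode(start_day=0, end_day=300, total_return=-0.45)))
print(classify(Episode(start_day=0, end_day=20, total_return=0.18)))
print(classify(Episode(start_day=0, end_day=400, total_return=0.60)))
```

The point of writing the rule down is that the cutoff becomes a visible parameter someone can challenge, rather than an unstated habit.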

Second, the detector. To find these episodes automatically, you need a way to measure how stressed a market is at any moment. The natural gauge is volatility: how strongly prices are moving, the closest thing markets have to a fever reading. The paper measures it with an LM-ARCH (long-memory ARCH) process, the same model family Edgelab's risk engine is built on, and one designed to capture how market volatility changes over time.

Think of the result as a seismograph for markets. A seismograph doesn't remember earthquakes; it records them — continuously, impartially, all of them. Run this instrument over the full history of a market, and every episode that meets the definitions surfaces, with its exact start date, end date, and severity.
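To make the seismograph picture concrete, here is a deliberately simplified sketch. It does not implement LM-ARCH; it swaps in a plain exponentially weighted moving average (EWMA) of squared returns as a stand-in volatility gauge, and it flags an episode wherever that gauge sits above a fixed threshold. The decay of 0.94, the threshold, the seeding choice, and the return series are all assumptions for illustration.

```python
import math


def ewma_volatility(returns, decay=0.94):
    """EWMA volatility estimate, one value per return (stand-in for LM-ARCH)."""
    if not returns:
        return []
    var = returns[0] ** 2  # assumed seed: the first squared return
    vols = []
    for r in returns:
        var = decay * var + (1.0 - decay) * r * r
        vols.append(math.sqrt(var))
    return vols


def detect_episodes(vols, threshold):
    """Return (start, end) index pairs of maximal runs where vol > threshold."""
    episodes, start = [], None
    for i, v in enumerate(vols):
        if v > threshold and start is None:
            start = i                        # a run begins
        elif v <= threshold and start is not None:
            episodes.append((start, i - 1))  # the run ended on the previous day
            start = None
    if start is not None:                    # a run still open at the end
        episodes.append((start, len(vols) - 1))
    return episodes


# Hypothetical daily returns: calm, five turbulent days, then calm again.
returns = [0.002] * 20 + [-0.05, 0.04, -0.06, 0.05, -0.04] + [0.002] * 40

vols = ewma_volatility(returns)
print(detect_episodes(vols, threshold=0.01))
```

On this toy series the detected episode begins on the first turbulent day and ends well after the last one, because the EWMA forgets slowly. How start and end dates are pinned down precisely is part of what the paper's definitions specify; this sketch only shows the mechanical idea of scanning a gauge instead of consulting memory.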

Finding stress events becomes an act of computation. The dates stop being a matter of interpretation. The list stops depending on who compiled it.

One clarification worth making: the method doesn't remove judgment from stress testing — it relocates it. Someone still decides how "crash" and "crisis" are defined, and someone still decides which detected episodes matter for a given portfolio and a given client. What changes is where that judgment lives. Instead of being re-exercised from memory in every scenario meeting, it is stated once, explicitly, in definitions that anyone can inspect and challenge. That is a different and better place for judgment to sit.

4. What the results showed

The methodology was applied to four very different markets: the MSCI World index, a regional stock index, a bond index, and an oil index. Three findings stand out.

The famous crises show up — and so do the quiet ones. The episodes everyone would name appear in the results, which is reassuring: the definitions capture what intuition means by "crisis." But the scan also surfaces episodes that rarely appear on standard scenario lists — genuine periods of market stress that simply never entered the shared vocabulary. The historical record turns out to be richer than the well-known highlights.

Each market has its own stress history. The most difficult moments of the oil index are not the most difficult moments of the bond index, and neither matches the familiar equity calendar. Laid out side by side, the four histories make a simple point vividly: a scenario set designed around equity events tells only part of the story for a multi-asset portfolio.

Severity becomes comparable. Because every episode is measured with the same yardstick, crises can be ranked rather than just listed. The question "should we include the Asian crisis?" gets a quantitative companion: "how severe was it, compared with everything else this market has been through?" Scenario selection stops being a matter of which events feel important and becomes a matter of evidence.
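Once every episode carries a number, ranking is a one-line sort. The sketch below assumes severity is simply the size of the cumulative loss over the episode; the paper's own yardstick may differ, and the episode labels, returns, and the `rank_by_severity` helper are hypothetical.

```python
def rank_by_severity(episodes, top_n=None):
    """Sort episodes from most to least severe (most negative return first)."""
    ranked = sorted(episodes, key=lambda ep: ep["total_return"])
    return ranked if top_n is None else ranked[:top_n]


# Hypothetical detected crises for one market. Not results from the paper.
detected = [
    {"label": "episode A", "total_return": -0.18},
    {"label": "episode B", "total_return": -0.07},
    {"label": "episode C", "total_return": -0.52},
    {"label": "episode D", "total_return": -0.31},
]

for rank, ep in enumerate(rank_by_severity(detected, top_n=3), start=1):
    print(f"{rank}. {ep['label']}: {ep['total_return']:.0%}")
```

With a ranking in hand, "should we include this episode?" becomes a question about where it sits in the list, not about whether anyone in the room remembers it.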

5. What this means for wealth managers

The findings translate into concrete changes in how stress testing is done day to day.

Scenario libraries can be complete and stay complete. Instead of maintaining a hand-curated list of crises, a risk system can hold the full set of detected episodes for every relevant market, and extend it automatically as new data arrives. When the next turbulent period occurs, it's identified, dated, and added by the same method — no meeting required to decide whether it "counts."

Scenarios can be matched to what the portfolio actually holds. A portfolio rich in bonds can be tested against the most severe episodes in bond market history; a commodity exposure against the oil index's own record. For multi-asset portfolios — the norm in wealth management — this means the stress test reflects each component's real history rather than a single equity-centric calendar.
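As a rough illustration of that matching, the sketch below picks, for each market a portfolio actually holds, the most severe episodes from that market's own library. The library contents, the 5% minimum weight, and the function name `scenarios_for_portfolio` are assumptions made for the example.

```python
def scenarios_for_portfolio(weights, library, min_weight=0.05, top_n=2):
    """For each held market, return the names of its most severe episodes."""
    selected = {}
    for market, weight in weights.items():
        # Skip negligible positions and markets with no detected episodes.
        if abs(weight) < min_weight or market not in library:
            continue
        worst_first = sorted(library[market], key=lambda ep: ep[1])
        selected[market] = [name for name, _ in worst_first[:top_n]]
    return selected


# Hypothetical per-market libraries: (episode name, cumulative return).
library = {
    "equities": [("eq-1", -0.50), ("eq-2", -0.33), ("eq-3", -0.20)],
    "bonds": [("bd-1", -0.08), ("bd-2", -0.15)],
    "oil": [("oil-1", -0.65), ("oil-2", -0.40)],
}

# A bond-heavy portfolio with a small oil position (assumed weights).
weights = {"equities": 0.25, "bonds": 0.72, "oil": 0.03}

print(scenarios_for_portfolio(weights, library))
```

Here the oil position falls below the assumed minimum weight and is skipped, while bonds and equities are each tested against their own worst episodes rather than a shared equity calendar.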

Results become comparable across portfolios, teams, and time. Because every scenario has objective boundaries, "the 2008 scenario" means the same dates in every report. A relationship manager comparing two client portfolios, or the same portfolio across two quarters, is comparing apples with apples.

Scenario choice becomes explainable. When a client or a regulator asks why these scenarios were chosen, the answer can be specific: these are the five most severe crises this market has experienced, identified and ranked by a published, reproducible method. The reasoning behind the test becomes as transparent as the result.

Client conversations gain context. Severity ranking lets a result be framed meaningfully: not just "in this scenario, you would lose 14%" but "this scenario is among the most severe episodes this market has experienced in the available history, and this is how your portfolio would have behaved." That gives the number a place in history. It helps the client understand not only the loss, but the extremity of the event being tested.

Stress testing has always answered the question "what would happen if the past returned?" This research strengthens the question itself by making sure the past being replayed is complete, precisely dated, and relevant to the portfolio at hand. At Edgelab, this methodology is part of how we build that foundation, so that every element of a stress test, including the choice of scenario, can be traced back to data.

Based on the research paper: "A Quantitative Approach to Historical Stress Tests," The Journal of Portfolio Management, Vol. 51(9), 234–255 (2025).


P.S. A stress test asks how a portfolio would fare if a known storm returned. A related question is whether your day-to-day risk number — VaR — reacts fast enough when conditions change in real time. We have a piece on back-testing methodology that takes that question apart, and another on why long-term portfolio projections need better models than the one invented in 1900.

