The Venture Feedback Arbitrage
Most discussions about venture's feedback cycles focus on the obvious: there's a massive delay between investment decisions and ground-truth outcomes. But this framing misses something crucial about how learning actually happens in venture.
Let's think about venture capital as two loosely coupled reinforcement learning systems:
- System 1: seeing and picking deals
- System 2: deploying capital and helping your deals win
The conventional wisdom correctly identifies that System 2 has an extremely sparse reward function: your actual ground-truth signal (exits, markups, company success) arrives on a 7-10 year lag. This lag, combined with the density of actions relative to the sparsity of rewards, creates a classic temporal credit assignment problem.
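To see how badly the signal dilutes, here is a toy back-of-envelope in Python. The action counts, discount factor, and reward scale are all invented for illustration, not calibrated to any real fund:

```python
# Toy illustration of temporal credit assignment in System 2: a fund takes
# many actions, then observes a single reward (an exit) ~8 years later.
# All numbers below are hypothetical.

GAMMA = 0.95            # assumed annual discount factor
ACTIONS_PER_YEAR = 125  # sourcing + portfolio actions (made up)
YEARS_TO_EXIT = 8
EXIT_REWARD = 1.0       # normalized terminal reward

total_actions = ACTIONS_PER_YEAR * YEARS_TO_EXIT

# Under naive uniform credit assignment, each action receives a vanishing
# share of one heavily discounted reward: far too weak a signal to learn from.
discounted_reward = EXIT_REWARD * GAMMA ** YEARS_TO_EXIT
credit_per_action = discounted_reward / total_actions

print(f"discounted reward after {YEARS_TO_EXIT} years: {discounted_reward:.3f}")
print(f"naive credit per action: {credit_per_action:.6f}")  # ~0.0007
```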
Here's the insight that framing misses: System 1 is far more tractable from an RL perspective. While your ultimate reward signal (successful investments) is still delayed, you can construct dense intermediate rewards that provide much richer learning signals.
Consider the state-action-reward chain in early-stage sourcing (sketched in code after the list):
- States: Your network, reputation, thesis areas
- Actions: Hosting events, publishing content, making other investments (each of which feeds back into your network)
- Intermediate Rewards: Meeting quality founders, competitive round access, info flow from key nodes
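Here is a minimal encoding of that loop as an MDP. The state features, action effects, and reward weights are invented placeholders, not a calibrated model; the point is only that every proxy in the reward function is observable within weeks rather than years:

```python
from dataclasses import dataclass

@dataclass
class SourcingState:
    network_strength: float  # proxy for network quality, 0-1
    reputation: float        # proxy for brand/reputation, 0-1
    thesis_fit: float        # how developed the thesis areas are, 0-1

ACTIONS = ["host_event", "publish_content", "make_investment"]

def intermediate_reward(s: SourcingState) -> float:
    """Dense reward built from proxies observable within weeks."""
    founder_meetings = 10 * s.network_strength          # quality founder meetings
    round_access = 5 * s.reputation * s.thesis_fit      # competitive round access
    info_flow = 3 * s.network_strength * s.reputation   # info flow from key nodes
    return founder_meetings + round_access + info_flow

def step(s: SourcingState, action: str) -> tuple[SourcingState, float]:
    """Each action feeds back into the network, shifting future state."""
    if action == "host_event":
        s = SourcingState(min(1.0, s.network_strength + 0.05),
                          s.reputation, s.thesis_fit)
    elif action == "publish_content":
        s = SourcingState(s.network_strength,
                          min(1.0, s.reputation + 0.05), s.thesis_fit)
    elif action == "make_investment":
        s = SourcingState(min(1.0, s.network_strength + 0.03),
                          min(1.0, s.reputation + 0.03),
                          min(1.0, s.thesis_fit + 0.05))
    return s, intermediate_reward(s)
```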
You can actually instrument this system. Smart firms build attribution systems that track the complete causal chain: [Action] → [Network Node] → [Deal Flow] → [Quality Meetings] → [Competitive Access]
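As a hedged sketch, one row of such an attribution log might look like the following; the schema and field names are illustrative, not any real firm's system:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class AttributionRecord:
    """One link in the [Action] -> [Network Node] -> [Deal Flow] ->
    [Quality Meetings] -> [Competitive Access] chain."""
    action: str                  # e.g. "hosted infra dinner", "published thesis post"
    action_date: date
    network_node: str            # person or community the action reached
    deals_referred: list[str] = field(default_factory=list)
    quality_meetings: int = 0    # founder meetings that cleared the bar
    competitive_rounds_accessed: int = 0

records = [
    AttributionRecord("published infra thesis post", date(2024, 3, 1),
                      "OSS maintainer community",
                      deals_referred=["DealA", "DealB"],
                      quality_meetings=4, competitive_rounds_accessed=1),
]

# Aggregate intermediate reward per action type to see which behaviors
# actually move the chain.
meetings_by_action: dict[str, int] = {}
for r in records:
    meetings_by_action[r.action] = (
        meetings_by_action.get(r.action, 0) + r.quality_meetings)
print(meetings_by_action)
```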
In these causal chains, a few actions are disproportionately rewarding, such as investing in a "supernode" (someone with high-degree connectivity in a prospective ecosystem) or staking out a strong nonconsensus position in long-form writing.
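One way to operationalize "supernode" is degree centrality over a referral graph. A minimal sketch, using a made-up network:

```python
# Rank people in a referral graph by degree centrality; the highest-degree
# vertices are supernode candidates. The network below is entirely made up.

network = {
    "alice": ["bob", "carol", "dan", "erin", "frank"],
    "bob":   ["alice", "carol"],
    "carol": ["alice", "bob", "dan"],
    "dan":   ["alice", "carol"],
    "erin":  ["alice"],
    "frank": ["alice"],
}

# Degree centrality: connections normalized by the maximum possible degree.
n = len(network)
centrality = {person: len(neighbors) / (n - 1)
              for person, neighbors in network.items()}

supernodes = sorted(centrality, key=centrality.get, reverse=True)[:2]
print(supernodes)  # ['alice', 'carol']: alice is the supernode here
```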
The RL framing reveals something: early-stage venture isn't just about maximizing expected value over a 10-year horizon. It's about learning which actions in your current state maximize the probability of encountering high-value states in your deal-flow MDP (Markov decision process).
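That claim is easy to check on a toy model. In the value-iteration sketch below, reward is paid only on entering a high-value "hot" deal-flow state, so with a discount factor near 1 each state's value approximates the probability of eventually reaching hot deal flow; the states, actions, and transition probabilities are all invented:

```python
# Value iteration on a tiny deal-flow MDP. "hot" is absorbing and reward 1
# is paid only on entering it, so V(s) ~ P(reach hot from s) when GAMMA ~ 1.
# All transition probabilities are invented for illustration.

STATES = ["cold", "warm", "hot"]
ACTIONS = ["broad_outreach", "targeted_cultivation"]
GAMMA = 0.99

# P[state][action] = list of (next_state, probability)
P = {
    "cold": {"broad_outreach":       [("cold", 0.6), ("warm", 0.4)],
             "targeted_cultivation": [("cold", 0.8), ("warm", 0.1), ("hot", 0.1)]},
    "warm": {"broad_outreach":       [("cold", 0.3), ("warm", 0.6), ("hot", 0.1)],
             "targeted_cultivation": [("warm", 0.5), ("hot", 0.5)]},
    "hot":  {a: [("hot", 1.0)] for a in ACTIONS},  # absorbing, no further reward
}

def reward(s: str, s2: str) -> float:
    return 1.0 if s != "hot" and s2 == "hot" else 0.0

V = {s: 0.0 for s in STATES}
for _ in range(1000):
    V = {s: max(sum(p * (reward(s, s2) + GAMMA * V[s2]) for s2, p in P[s][a])
                for a in ACTIONS)
         for s in STATES}

policy = {s: max(ACTIONS, key=lambda a: sum(p * (reward(s, s2) + GAMMA * V[s2])
                                            for s2, p in P[s][a]))
          for s in STATES}
print(V)
print(policy)
```

With these made-up numbers the converged policy is broad outreach from the cold state and targeted cultivation once warm: the optimizer steers toward states from which high-value deal flow is most reachable, which is exactly the behavior the framing predicts.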
The best firms have figured this out. They maintain tight feedback loops on their sourcing system while accepting the inevitable sparsity of their investment-outcome rewards. They've effectively decomposed the learning problem into separable components with different temporal structures.