Cartesia: AI comes for Voice

Written by

Radhika Malik

Published

March 11, 2025

Company journey

Text-based LLMs may have kick-started the GenAI mania, but voice AI is quickly becoming a multi-billion-dollar industry in its own right. Cartesia is paving the way with its novel work on State Space Models (SSMs).

This huge market opportunity, along with the rapid advancement of the technology, is why we invested in Cartesia’s Series A with Kleiner Perkins, Lightspeed, Index Ventures, Factory HQ, Samsung Ventures and others.

Transformers are at the heart of the current GenAI wave: the architecture forms the basis of most modern LLMs and is incredibly powerful. However, transformers have known limitations.

They struggle with long inputs and outputs, which drives up inference latency and cost. Several techniques have been developed to bring down costs and extend practical context window sizes, but the limitation is inherent to the model architecture.
They have no concept of memory, so techniques like RAG are needed to pass in the context for each inference.

These limitations are especially pronounced when dealing with complex, noisy, high-dimensional data such as audio and video. With their research on SSMs, the Cartesia team demonstrated a novel architecture that is much more efficient to run and that stores information about each interaction in its internal state as a form of working memory. SSMs can compress long sequences of data into this working memory and reason efficiently over those sequences.
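To make the "working memory" idea concrete, here is a minimal sketch of a textbook linear state space recurrence in Python. It is an illustration only, not Cartesia's architecture: the function name `ssm_scan` and the parameters `a`, `b`, and `c` are hypothetical, and real SSMs learn these parameters and use far larger states. The point it shows is that the state has a fixed size no matter how long the input sequence is.

```python
from typing import List, Sequence, Tuple


def ssm_scan(
    inputs: Sequence[float],
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Run a toy diagonal linear state space model over a sequence.

    For each input x_t (all names here are illustrative assumptions):
        h_t = a * h_{t-1} + b * x_t    (elementwise update of the state)
        y_t = sum(c * h_t)             (readout from the state)

    Returns the outputs and the final state.
    """
    # The state is the model's working memory; its size is fixed by len(a).
    state = [0.0] * len(a)
    outputs: List[float] = []
    for x in inputs:
        # Fold the new input into the existing state in constant time.
        state = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, state, b)]
        outputs.append(sum(c_i * h_i for c_i, h_i in zip(c, state)))
    return outputs, state


if __name__ == "__main__":
    # A single impulse decays through the state over later steps.
    ys, final_state = ssm_scan([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[1.0])
    print(ys)

    # A much longer sequence still leaves a state of the same size.
    _, long_state = ssm_scan([0.1] * 10_000, a=[0.9, 0.5], b=[1.0, 1.0], c=[1.0, 1.0])
    print(len(long_state))
```

Each step costs the same amount of work and memory regardless of how much input came before, which is the property the paragraph above contrasts with transformers, where the context passed to the model grows with the sequence.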

The first modality the team is tackling is audio, building a full platform for voice AI. Cartesia’s first text-to-speech model, Sonic, showed best-in-class quality with incredibly fast response times and deep output controllability. The team also built an inference stack and platform that can run their audio models blazingly fast on a variety of infrastructure, in the cloud or on device. Sonic has already seen adoption across every enterprise segment, underscoring the surging demand for fast, reliable, and human-like voice AI. With Sonic-2, these capabilities get even stronger, further helping developers bring best-in-class voice cloning, text-to-speech, and other voice AI capabilities into their real-time applications.

And this is just the beginning. The real world is full of noisy, complex signals where SSMs shine. We’re excited to help Cartesia build the next AI platform for voice and other real-time modalities, one that runs on every device.

Welcome to DTC, Cartesia.

Radhika & Team DTC