EchoNest Architecture

EchoNest is a hierarchical language model that organizes computation into three nested tiers (EchoUnit, MicroNest, MacroNest) rather than stacking identical transformer blocks. Separating local pattern recognition from global context integration is intended to keep the model efficient to train on CPU and edge hardware.

Architecture diagram, from the output (top) down to the base unit (bottom):

Output projection: weight-tied vocab projection (weight tying)
GlobalEchoCore: SwiGLU routing layer (SwiGLU FFN)
MacroNest: stack of MicroNests + cross-nest attention (causal MHA)
MicroNest: parallel pool of EchoUnits (softmax aggregation)
EchoUnit: pre-norm LSTM + resonance gate (LSTM, causal)

EchoUnit

Base building block

Each EchoUnit passes its input through a pre-normalization layer, then through a single LSTM, and applies a resonance gate — a learned scalar confidence score derived from the final LSTM hidden state. The output is scaled by that confidence, so units that have high certainty about their predictions contribute proportionally more to the result. The LSTM's sequential nature makes each unit inherently causal, processing tokens strictly left-to-right.

The resonance gate is what distinguishes this unit from a plain LSTM layer. Rather than passing every LSTM output through at full strength, the unit learns when its own output should be trusted. During training, units specialize: some develop high confidence on syntactic patterns, others on semantic ones.
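The gating step described above can be sketched in a few lines. This is a minimal illustration, not the EchoNest implementation: the LSTM and pre-norm are omitted, `hidden_states` stands in for the LSTM's per-token outputs, and the names `resonance_gate`, `gate_w`, and `gate_b`, as well as the choice of a sigmoid to squash the score into (0, 1), are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def resonance_gate(hidden_states, gate_w, gate_b):
    """Scale every LSTM output by one scalar confidence.

    hidden_states: list of per-token hidden vectors (stand-in for LSTM output)
    gate_w, gate_b: hypothetical learned gate parameters (vector and bias)
    """
    # The confidence is read from the final hidden state, as the text describes.
    final_h = hidden_states[-1]
    confidence = sigmoid(sum(w * h for w, h in zip(gate_w, final_h)) + gate_b)
    # Every position is scaled by the same scalar.
    scaled = [[confidence * h for h in step] for step in hidden_states]
    return scaled, confidence


hidden = [[0.2, -0.1], [0.5, 0.3]]
scaled, conf = resonance_gate(hidden, gate_w=[1.0, -1.0], gate_b=0.0)
```

One design point worth noting: if "final" means the last time step, a score computed this way depends on the whole sequence. A strictly causal variant would compute one score per position from that position's own hidden state. The text above does not say which reading applies.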

MicroNest

Parallel aggregation

A MicroNest is a pool of parallel EchoUnits, each of which independently processes the same input sequence. Their outputs are blended by a softmax over the units' confidence scores, so the model learns which units to trust for which kinds of input rather than averaging them uniformly.

This design avoids the uniformity assumption of simple averaging. A MicroNest with four EchoUnits can route a code-like sequence primarily through the unit that has specialized in structural patterns, while routing natural language through a different unit entirely — all learned end-to-end without explicit routing supervision.
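The softmax-weighted blending can be sketched as follows. This assumes one scalar confidence per unit for the whole sequence and that each unit returns its unscaled output alongside that score; the function name `aggregate_units` and both of those choices are illustrative assumptions rather than details confirmed by the description above.

```python
import math


def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_units(unit_outputs, confidences):
    """Blend per-unit outputs using softmax weights over unit confidences.

    unit_outputs: list over units, each a [seq_len][d_model] nested list
    confidences:  one scalar score per unit
    """
    weights = softmax(confidences)
    seq_len = len(unit_outputs[0])
    d_model = len(unit_outputs[0][0])
    blended = [[0.0] * d_model for _ in range(seq_len)]
    for w, out in zip(weights, unit_outputs):
        for t in range(seq_len):
            for d in range(d_model):
                blended[t][d] += w * out[t][d]
    return blended, weights


# Two units, one token, two features; the first unit is more confident.
blended, weights = aggregate_units(
    [[[1.0, 0.0]], [[0.0, 1.0]]],
    confidences=[2.0, 0.0],
)
```

With equal confidences the weights become uniform and the result reduces to a plain average, which is the baseline the text contrasts against.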

MacroNest

Vertical stack

A MacroNest is a vertical stack of MicroNests connected by causal multi-head self-attention and a feed-forward layer. After all MicroNests produce their outputs, a cross-nest attention pass lets token positions exchange information across all parallel processing streams. A causal mask is applied so that no token can attend to future positions. A residual connection around the attention and a pre-norm GELU feed-forward block stabilize gradients throughout.

The cross-nest attention is where long-range dependencies are resolved. Individual EchoUnits see only local sequential structure via their LSTMs; the MacroNest attention layer is what allows the model to relate a pronoun at position 200 back to its referent at position 5.
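The causal masking can be illustrated with a single-head sketch. It deliberately leaves out the query/key/value projections, the multiple heads, the residual connection, and the feed-forward block; the function name `causal_self_attention` is hypothetical. The point is only that position i is scored against positions 0..i and never against later ones.

```python
import math


def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: lists of seq_len vectors. Position i attends only to j <= i.
    """
    d_k = len(k[0])
    d_v = len(v[0])
    out = []
    for i, q_i in enumerate(q):
        # Score only the visible prefix; future positions are never scored,
        # which is equivalent to masking them with -inf before the softmax.
        scores = [
            sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append(
            [sum(w * v[j][d] for j, w in enumerate(weights)) for d in range(d_v)]
        )
    return out


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = causal_self_attention(x, x, x)
```

Because the first position can see only itself, `y[0]` equals `x[0]`, and changing the last token leaves the outputs at earlier positions unchanged.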

GlobalEchoCore

Global routing

The GlobalEchoCore is a single SwiGLU feed-forward block applied after the MacroNest. It acts as a global routing layer that conditions final token representations on full-depth context before the vocabulary projection step. SwiGLU is chosen here because it tends to converge faster in practice than standard ReLU or GELU feed-forward networks.

The GlobalEchoCore sees representations that have already been shaped by every level of the hierarchy. Its role is to perform a final, high-level transformation — compressing all the information gathered across MicroNests and attention layers into token embeddings ready for projection onto the vocabulary.
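For readers unfamiliar with SwiGLU, the block computes a SiLU-gated product of two linear projections, followed by a projection back to the model width. The sketch below applies it to a single token vector; the weight names `w_gate`, `w_up`, and `w_down` are conventional labels chosen for illustration, and biases and any surrounding normalization or residual connection are left out because the description above does not specify them.

```python
import math


def silu(z):
    # SiLU (also called swish): z * sigmoid(z)
    return z / (1.0 + math.exp(-z))


def matvec(w, x):
    """Multiply a [rows][cols] matrix by a length-cols vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]  # element-wise gating
    return matvec(w_down, hidden)


# Toy shapes: model width 2, hidden width 3.
w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_down = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
out = swiglu_ffn([1.0, 2.0], w_gate, w_up, w_down)
```

The gate branch decides, per hidden feature, how much of the `w_up` branch passes through, which is what distinguishes SwiGLU from a plain two-layer ReLU or GELU network.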

Output Projection

Vocabulary projection

The output projection shares weights with the input token embedding (weight tying). This reduces total parameter count and ensures the shared matrix receives gradient signal for every vocabulary entry on every training step, not only for the tokens that appear in the batch, which improves sample efficiency at small model scales.

Weight tying is particularly impactful at the 340M parameter scale where EchoNest-1 operates. The shared matrix must simultaneously encode a good geometric representation of tokens (for input) and a good discriminative function over the vocabulary (for output), which acts as a useful inductive bias.
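Weight tying amounts to using one matrix in two directions: row lookup on the way in, and a dot product against every row on the way out. The sketch below shows both uses plus the parameter saving. The helper names are hypothetical, and the vocabulary size and model width in the final line are placeholder values for the arithmetic only, not EchoNest-1's actual configuration.

```python
def embed(token_ids, embedding):
    """Input side: look up one row of the shared matrix per token."""
    return [embedding[t] for t in token_ids]


def project_to_vocab(hidden, embedding):
    """Output side: logits = hidden @ embedding^T, using the same matrix."""
    return [sum(h * e for h, e in zip(hidden, row)) for row in embedding]


def tied_param_savings(vocab_size, d_model):
    """Parameters avoided by not storing a separate output matrix."""
    return vocab_size * d_model


# Toy shared matrix: 3 vocabulary entries, width 2.
embedding = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h = embed([2], embedding)[0]
logits = project_to_vocab(h, embedding)

# Placeholder sizes, chosen only to illustrate the arithmetic.
saved = tied_param_savings(vocab_size=32000, d_model=1024)  # 32,768,000
```

Because every row of the matrix takes part in the logits, each row receives gradient from the output loss on every step, even for tokens absent from the input.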

Summary

EchoNest is a hybrid recurrent-attention model. The LSTM units handle local, sequential structure efficiently; the MacroNest attention handles long-range dependencies; and the confidence gating mechanism lets the model dynamically weight which processing pathways to trust. The design prioritizes trainability on modest hardware over raw scale.