COLD GPT-2 Chat

This tool provides an interface for activation steering, a technique for real-time causal intervention in the text generation of a GPT-2 Small model. The steering mechanism attaches a forward hook to the post-activation state of the 5th layer's MLP block. At the final token position of the user's prompt, the hook intercepts the 3,072-dimensional activation vector and adds a user-defined amplification value to the activations of a pre-selected group of neurons. The modified vector is then passed to the subsequent layers of the model, causally influencing the final text output. The 'BASELINE' response shows the model's unaltered, deterministic generation, while the 'COLD' response shows the output produced after the intervention.
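The arithmetic of the intervention can be illustrated with a minimal, dependency-free sketch. It is not the tool's actual code: the function name `steer_last_token`, the neuron indices in `team`, and the amplification value are hypothetical, and plain Python lists stand in for the activation tensor that a real forward hook would receive.

```python
from typing import List, Sequence

HIDDEN_DIM = 3072  # width of the MLP activation vector described above


def steer_last_token(
    activations: List[List[float]],
    neuron_ids: Sequence[int],
    amplification: float,
) -> List[List[float]]:
    """Return a copy of `activations` (seq_len x hidden) in which
    `amplification` has been added to the chosen neurons at the final
    token position only. Earlier positions are left unchanged."""
    if not activations:
        raise ValueError("activations must contain at least one token position")
    steered = [list(row) for row in activations]  # copy, do not mutate the input
    last = steered[-1]
    for idx in neuron_ids:
        if not 0 <= idx < len(last):
            raise IndexError(f"neuron index {idx} is out of range")
        last[idx] += amplification
    return steered


# Toy stand-in for the post-activation MLP state of a 4-token prompt.
acts = [[0.0] * HIDDEN_DIM for _ in range(4)]
team = [38, 100, 2000]  # hypothetical team; only J5-38 is named in this document
steered = steer_last_token(acts, team, amplification=5.0)
```

In a hook-based implementation the same addition would presumably be applied to the tensor the hook receives, with the hook returning the modified tensor so that later layers see the steered values.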

The neuron teams used in this tool were identified through a multi-stage, bottom-up research process designed to overcome the problem of high-frequency "landmark" neurons that dominate simple activation analyses. The process began with identifying a hyper-specialist neuron, J5-38, whose minimal trigger was discovered to be the precise phrase "it was cold". To map the broader circuit this specialist belonged to, we developed a method of "Ablation Spectrometry." First, the atomic triggers for the main landmark neurons were empirically verified (`?` for J5-1888, `the` for J5-1790). Then, for a suite of 14 "cold"-related probes, the influence vectors of these landmarks were programmatically subtracted from each probe's activation vector. The top 30 neurons from each of these "ablated" vectors were recorded, and the frequency of each neuron's appearance was tallied to create a final "Ablation Fingerprint," revealing the most consistent members of the 'PHYSICAL COLD' constellation.
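The tallying step described above can be sketched as follows. This is an illustration under stated assumptions, not the original analysis code: the function names are hypothetical, each landmark's "influence vector" is assumed to be a precomputed vector of the same length as a probe's activation vector, subtraction is assumed to be elementwise, and "top" is assumed to mean largest activation value. The toy data uses 6 neurons and k=2 in place of 3,072 neurons and k=30.

```python
from collections import Counter
from typing import List, Sequence


def ablate(probe: Sequence[float], landmarks: Sequence[Sequence[float]]) -> List[float]:
    """Subtract every landmark influence vector from a probe's activation vector."""
    out = list(probe)
    for landmark in landmarks:
        out = [a - b for a, b in zip(out, landmark)]
    return out


def top_k(vector: Sequence[float], k: int) -> List[int]:
    """Indices of the k largest entries, highest first."""
    return sorted(range(len(vector)), key=lambda i: vector[i], reverse=True)[:k]


def ablation_fingerprint(
    probes: Sequence[Sequence[float]],
    landmarks: Sequence[Sequence[float]],
    k: int = 30,
) -> Counter:
    """Count how often each neuron appears in the top k of an ablated probe."""
    counts: Counter = Counter()
    for probe in probes:
        counts.update(top_k(ablate(probe, landmarks), k))
    return counts


# Toy data: neuron 0 plays the role of a landmark that dominates every probe.
landmark_vectors = [[9.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
probe_vectors = [
    [9.5, 0.1, 2.0, 0.0, 1.0, 0.2],
    [9.2, 0.3, 1.5, 0.1, 0.0, 2.5],
    [9.9, 0.0, 3.0, 1.2, 0.1, 0.4],
]
fingerprint = ablation_fingerprint(probe_vectors, landmark_vectors, k=2)
print(fingerprint.most_common(1))  # neuron 2 appears in all three top-2 lists
```

Without the subtraction, neuron 0 would top every list; after it, the consistently active neuron 2 surfaces, which is the effect the fingerprint is meant to capture.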

The four teams available represent different sub-groups of this discovered constellation, each with distinct functional hypotheses and observed causal effects.

The 'Brute Force (Top 8)' team was selected based on the highest frequency of appearance in the ablation fingerprint; it is hypothesized to contain the highest-magnitude activators along the temperature polarity axis. In practice, this team demonstrates powerful but crude causal effects, capable of overriding strong contexts (e.g., "The desert sun is beating down, the air is" becomes "...the air is cold, and the sun is cold") but often at the cost of generative stability.

The 'Minimalist Coordinators (Top 2)' team isolates the two most frequent neurons from the fingerprint. This pair is hypothesized to define the principal axis of the "cold" subspace, possibly due to high directional coherence in their output weights. This team acts as a subtle steering mechanism, producing more coherent and logical outputs (e.g., "The summer sun is fiery, the season is" becomes "...the season is cold, the sun is hot...") while largely avoiding anomalous side effects.

The 'Specialists Only' team consists of low-frequency neurons identified through probe-by-probe analysis. These are high-specificity feature units hypothesized to encode distinct facets of the cold concept (lexical, environmental). This team produces nuanced, thematic effects rather than direct overrides, such as altering "The night was" to "...dark and cold, and the wind was blowing...".

Finally, the 'Full Roster (Combined)' team integrates all of the above neurons. This represents the complete cold-aligned ensemble, combining high-amplitude generalists with high-specificity feature units. It yields the most powerful causal effects but also the most complex polysemantic behavior, directly and creatively integrating a latent "girl" concept into its outputs (e.g., "The first line of the novel is" becomes "...is a girl's voice.").