syleetolow/s3ae · Hugging Face

This repository contains the trained parameters of the Sentence-level, Supervised, Sparse AutoEncoder (S3AE) proposed in the paper "Emergence of psychopathological computations in large language models". Code for the S3AE architecture, along with usage examples, can be found in this GitHub repository.

S3AE was trained on residual-stream activations from the 10th layer of the instruction-tuned Gemma 2 27B model, using a proprietary synthetic dataset labeled with psychopathology symptoms. The weights are stored in bfloat16, and the hidden dimension is 8 times the width of the LLM residual stream.
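As a rough illustration of the sizing described above, the hidden dimension can be derived from the residual-stream width. The sketch below assumes a residual width of 4608 for Gemma 2 27B (taken from the public model configuration, not stated in this card); the function and constant names are hypothetical and are not part of the S3AE codebase.

```python
# Sketch only: the names below are illustrative, not the authors' API.

# Assumption: Gemma 2 27B uses a residual-stream width (hidden_size) of 4608.
GEMMA2_27B_RESIDUAL_WIDTH = 4608

# Stated in this card: the S3AE hidden dimension is 8x the residual width.
S3AE_EXPANSION_FACTOR = 8


def s3ae_hidden_dim(residual_width: int, expansion: int = S3AE_EXPANSION_FACTOR) -> int:
    """Return the autoencoder hidden size for a given residual-stream width."""
    if residual_width <= 0 or expansion <= 0:
        raise ValueError("residual_width and expansion must be positive")
    return residual_width * expansion


if __name__ == "__main__":
    print(s3ae_hidden_dim(GEMMA2_27B_RESIDUAL_WIDTH))
```

If the loaded weights report a different input width, that value should be preferred over the assumed constant.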

The 1st through 17th dimensions of the S3AE hidden features correspond, in order, to activations of the following thoughts:

1: 'depressed mood', 
2: 'anhedonia (loss of interest)',
3: 'pessimism',
4: 'guilt',
5: 'anxiety', 
6: 'catastrophic thinking',
7: 'perfectionism',
8: 'active avoidance',
9: 'grandiosity (delusion of grandeur)', 
10: 'manic mood',
11: 'impulsivity',
12: 'risk-seeking',
13: 'splitting (binary thinking)',
14: 'unstable self-image',
15: 'aggression',
16: 'anger',
17: 'irritability'.

Dimensions 7, 13, and 14 were not used for the paper's analysis.
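
The mapping above can be applied to a hidden feature vector roughly as follows. This is a minimal sketch, assuming the card's 1-based dimension numbers correspond to 0-based indices 0 through 16 of the hidden vector; `symptom_activations` and the constants are hypothetical helpers, not functions from the S3AE repository.

```python
from typing import Dict, Sequence

# Labels in the order listed in this card (dimension 1 first).
SYMPTOM_LABELS = [
    "depressed mood",
    "anhedonia (loss of interest)",
    "pessimism",
    "guilt",
    "anxiety",
    "catastrophic thinking",
    "perfectionism",
    "active avoidance",
    "grandiosity (delusion of grandeur)",
    "manic mood",
    "impulsivity",
    "risk-seeking",
    "splitting (binary thinking)",
    "unstable self-image",
    "aggression",
    "anger",
    "irritability",
]

# 1-based dimensions that were not used in the paper's analysis.
UNUSED_IN_PAPER = {7, 13, 14}


def symptom_activations(
    hidden: Sequence[float], include_unused: bool = False
) -> Dict[str, float]:
    """Map the first 17 hidden features to their symptom labels.

    Assumption: dimension k in the card is index k - 1 of `hidden`.
    """
    if len(hidden) < len(SYMPTOM_LABELS):
        raise ValueError("hidden vector must have at least 17 entries")
    result = {}
    for dim, label in enumerate(SYMPTOM_LABELS, start=1):
        if not include_unused and dim in UNUSED_IN_PAPER:
            continue  # skip dimensions excluded from the paper's analysis
        result[label] = float(hidden[dim - 1])
    return result
```

Passing `include_unused=True` returns all 17 labeled dimensions, which may be useful when inspecting the features outside the scope of the paper's analysis.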