models module
Language model wrapper module.
This module provides wrappers around different language models for extracting the log-probability vectors needed to compute the correlation dimension.
- class corrdim.models.GPT2Wrapper(model_size='gpt2', **kwargs)[source]
Bases: TransformersModelWrapper
Specialized wrapper for GPT-2 models.
- Parameters:
model_size (str)
- class corrdim.models.LLaMAWrapper(model_name='meta-llama/Llama-2-7b-hf', **kwargs)[source]
Bases: TransformersModelWrapper
Specialized wrapper for LLaMA models.
- Parameters:
model_name (str)
- class corrdim.models.LanguageModelWrapper[source]
Bases: ABC
Abstract base class for language model wrappers.
- abstractmethod get_log_probabilities(text, context_length=None, dim_reduction=None, stride=1, show_progress=False)[source]
Extract log-probability vectors for each token position.
- Parameters:
text (str) – Input text
context_length (int) – Maximum context length
dim_reduction (int)
stride (int) – Sampling interval over token positions
show_progress (bool)
- Returns:
Array of sampled log-probability vectors of shape (sampled_seq_len, vocab_size)
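To make the return value concrete, here is a minimal pure-Python sketch of what a "log-probability vector" is: one log-softmax over the vocabulary per token position. The logits and the 4-token vocabulary are made-up toy values, and `log_softmax` is a hypothetical helper written for illustration, not part of corrdim. The real method returns a tensor, not nested lists.

```python
import math


def log_softmax(logits):
    """Turn one position's raw scores into log-probabilities (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


# Toy logits: 3 token positions over a hypothetical vocabulary of 4 tokens.
toy_logits = [
    [2.0, 1.0, 0.5, -1.0],
    [0.0, 0.0, 0.0, 0.0],
    [-3.0, 4.0, 1.0, 0.0],
]

# One log-probability vector per position: shape (seq_len, vocab_size) = (3, 4).
log_prob_vectors = [log_softmax(row) for row in toy_logits]
```

Each row exponentiates to a probability distribution over the vocabulary, which is the kind of per-position vector the documented shape (sampled_seq_len, vocab_size) describes.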
- class corrdim.models.TransformersModelWrapper(model_name, tokenizer=None, device=None, forward_chunk_size=512, **kwargs)[source]
Bases: LanguageModelWrapper
Wrapper for Hugging Face Transformers models.
- Parameters:
model_name (str)
tokenizer (object | None)
device (str | None)
forward_chunk_size (int)
- decode(tokens, **kwargs)[source]
Decode tokens to text.
- Parameters:
tokens (List[int])
- Return type:
str
- get_log_probabilities(text, context_length=None, dim_reduction=None, stride=1, show_progress=False)
Extract sampled log-probability vectors.
- Parameters:
text (str) – Input text
context_length (int) – Maximum context length
stride (int) – Sampling interval over token positions. Keep every stride-th token vector.
dim_reduction (int)
show_progress (bool)
- Returns:
Array of sampled log-probability vectors of shape (sampled_seq_len, vocab_size)
- Return type:
torch.Tensor
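The stride parameter described above can be pictured as plain slicing over token positions. The sketch below is an illustration only: `sample_positions` is a hypothetical helper, and starting at position 0 is an assumption, since the documentation does not say which offset the library uses.

```python
def sample_positions(vectors, stride=1):
    """Keep every stride-th per-position vector (assumed to start at position 0)."""
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    return vectors[::stride]


# Ten positions, represented here by their indices instead of real vectors.
positions = list(range(10))
kept = sample_positions(positions, stride=3)
```

Under this assumption, a sequence of 10 positions with stride=3 yields 4 sampled vectors, so sampled_seq_len is roughly the sequence length divided by stride, rounded up.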
- get_log_probabilities_batch(texts, context_length=None, dim_reduction=None, stride=1, show_progress=False, batch_size=None)
Encode texts using Hugging Face batched forward (padded batch, attention_mask).
Returns one tensor per input string; rows can differ in length (sampled_seq_len) when token lengths differ. This API only supports short sequences: every input must satisfy len(tokens) <= context_length.
- Parameters:
texts (List[str])
context_length (int | None)
dim_reduction (int | None)
stride (int)
show_progress (bool)
batch_size (int | None)
- Return type:
List[torch.Tensor]
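As a rough picture of the padded-batch idea and the short-sequence constraint, the sketch below pads already-tokenized inputs to a common length and builds the matching attention mask. It is not the library's implementation: `pad_batch`, the pad id of 0, and right-side padding are all assumptions made for illustration.

```python
def pad_batch(token_ids_list, context_length, pad_id=0):
    """Right-pad token id lists to equal length and build a 0/1 attention mask."""
    if not token_ids_list:
        return [], []
    # Mirror the documented constraint: len(tokens) <= context_length.
    for ids in token_ids_list:
        if len(ids) > context_length:
            raise ValueError(
                f"input has {len(ids)} tokens, exceeding context_length={context_length}"
            )
    max_len = max(len(ids) for ids in token_ids_list)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in token_ids_list]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [
        [1] * len(ids) + [0] * (max_len - len(ids)) for ids in token_ids_list
    ]
    return input_ids, attention_mask


# Two toy inputs of different token lengths (ids are made up).
input_ids, attention_mask = pad_batch([[5, 6, 7], [8]], context_length=4)
```

Because padded positions carry no real tokens, a per-input result would only cover the unmasked positions, which is consistent with the returned tensors differing in length across inputs.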
- corrdim.models.create_model_wrapper(model_name, tokenizer=None, device=None, **kwargs)[source]
Factory function to create appropriate model wrapper.
- Parameters:
model_name (str) – Name of the model
tokenizer (object | None) – Tokenizer instance
device (str | None) – Device to run on
**kwargs – Additional arguments
- Returns:
Appropriate model wrapper instance
- Return type: