models module

Language model wrapper module.

This module provides wrappers around different language models for extracting the log-probability vectors needed for correlation dimension computation.

class corrdim.models.GPT2Wrapper(model_size='gpt2', **kwargs)

Bases: TransformersModelWrapper

Specialized wrapper for GPT-2 models.

Parameters:

model_size (str)

class corrdim.models.LLaMAWrapper(model_name='meta-llama/Llama-2-7b-hf', **kwargs)

Bases: TransformersModelWrapper

Specialized wrapper for LLaMA models.

Parameters:

model_name (str)

class corrdim.models.LanguageModelWrapper

Bases: ABC

Abstract base class for language model wrappers.

abstractmethod get_log_probabilities(text, context_length=None, dim_reduction=None, stride=1, show_progress=False)

Extract log-probability vectors for each token position.

Parameters:
  • text (str) – Input text

  • context_length (int | None) – Maximum context length

  • dim_reduction (int | None)

  • stride (int) – Sampling interval over token positions; every stride-th token vector is kept.

  • show_progress (bool)

Returns:

Array of sampled log-probability vectors of shape (sampled_seq_len, vocab_size)
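
To illustrate the abstract interface, here is a minimal sketch of a subclass. It does not import corrdim: the base class below is a local stand-in that only mirrors the documented signature, and UniformToyWrapper, its whitespace tokenization, and its four-token vocabulary are hypothetical. The result is a plain list of lists rather than an array, and context_length and dim_reduction are accepted but ignored.

```python
from abc import ABC, abstractmethod
import math


class LanguageModelWrapper(ABC):
    """Local stand-in that mirrors the documented abstract interface."""

    @abstractmethod
    def get_log_probabilities(self, text, context_length=None,
                              dim_reduction=None, stride=1,
                              show_progress=False):
        """Return one log-probability vector per sampled token position."""


class UniformToyWrapper(LanguageModelWrapper):
    """Hypothetical wrapper: a uniform distribution over a tiny vocabulary."""

    def __init__(self, vocab=("a", "b", "c", "d")):
        self.vocab = tuple(vocab)

    def get_log_probabilities(self, text, context_length=None,
                              dim_reduction=None, stride=1,
                              show_progress=False):
        # Whitespace tokenization, for illustration only.
        tokens = text.split()
        # context_length and dim_reduction are ignored in this toy example.
        log_p = -math.log(len(self.vocab))
        row = [log_p] * len(self.vocab)
        # One vector per kept position; keep every stride-th position.
        return [list(row) for _ in tokens[::stride]]


wrapper = UniformToyWrapper()
vectors = wrapper.get_log_probabilities("a b c d a", stride=2)
print(len(vectors), len(vectors[0]))  # (sampled_seq_len, vocab_size)
```

Because the method is declared abstract, instantiating the base class directly raises TypeError; only subclasses that implement get_log_probabilities can be created.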

class corrdim.models.TransformersModelWrapper(model_name, tokenizer=None, device=None, forward_chunk_size=512, **kwargs)

Bases: LanguageModelWrapper

Wrapper for Hugging Face Transformers models.

Parameters:
  • model_name (str)

  • tokenizer (object | None)

  • device (str | None)

  • forward_chunk_size (int)

decode(tokens, **kwargs)

Decode tokens to text.

Parameters:

tokens (List[int])

Return type:

str

encode(text, **kwargs)

Tokenize text.

Parameters:

text (str)

Return type:

List[int]

get_log_probabilities(text, context_length=None, dim_reduction=None, stride=1, show_progress=False)

Extract sampled log-probability vectors.

Parameters:
  • text (str) – Input text

  • context_length (int | None) – Maximum context length

  • stride (int) – Sampling interval over token positions; every stride-th token vector is kept.

  • dim_reduction (int | None)

  • show_progress (bool)

Returns:

Array of sampled log-probability vectors of shape (sampled_seq_len, vocab_size)

Return type:

torch.Tensor
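
The effect of stride on the first dimension of the result can be sketched in plain Python. This assumes that sampling starts at position 0 and then keeps every stride-th position; the starting offset is not stated here, so treat it as an assumption. The helper names sample_positions and sampled_seq_len are hypothetical.

```python
def sample_positions(seq_len, stride=1):
    """Indices kept when every stride-th position is retained.

    Assumes sampling starts at index 0 (not confirmed by the reference).
    """
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    return list(range(0, seq_len, stride))


def sampled_seq_len(seq_len, stride=1):
    """Number of vectors that remain after stride sampling."""
    return len(sample_positions(seq_len, stride))


print(sample_positions(10, stride=3))
print(sampled_seq_len(10, stride=3))
```

Under this assumption, a larger stride shrinks the first dimension roughly by a factor of stride, while the second dimension of each vector is unaffected.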

get_log_probabilities_batch(texts, context_length=None, dim_reduction=None, stride=1, show_progress=False, batch_size=None)

Process several texts with a single batched Hugging Face forward pass (padded batch with attention_mask).

Returns one tensor per input string; the tensors can differ in their first dimension (sampled_seq_len) when token lengths differ. This API supports short sequences only: every input must satisfy len(tokens) <= context_length.

Parameters:
  • texts (List[str])

  • context_length (int | None)

  • dim_reduction (int | None)

  • stride (int)

  • show_progress (bool)

  • batch_size (int | None)

Return type:

List[torch.Tensor]
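
Since every input must satisfy len(tokens) <= context_length, it can help to partition the texts before calling the batched method. The following is a minimal sketch: split_by_length is a hypothetical helper, and str.split stands in for a real tokenizer such as the wrapper's encode method. How the library reacts to an over-long input is not stated here, so the sketch simply keeps such inputs apart.

```python
def split_by_length(texts, encode, context_length):
    """Partition texts into those that fit the batched API and those that do not.

    encode is any callable that maps a string to a sequence of tokens.
    """
    fits, too_long = [], []
    for text in texts:
        n_tokens = len(encode(text))
        if n_tokens <= context_length:
            fits.append(text)
        else:
            too_long.append(text)
    return fits, too_long


# str.split is a stand-in tokenizer, used for illustration only.
short, long_ = split_by_length(["a b", "a b c d"], str.split, context_length=3)
print(short, long_)
```

The texts in the first list could then be passed to get_log_probabilities_batch, while each text in the second list could be processed on its own with get_log_probabilities.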

corrdim.models.create_model_wrapper(model_name, tokenizer=None, device=None, **kwargs)

Factory function that creates the appropriate model wrapper for a given model name.

Parameters:
  • model_name (str) – Name of the model

  • tokenizer (object | None) – Tokenizer instance

  • device (str | None) – Device to run on

  • **kwargs – Additional arguments

Returns:

Appropriate model wrapper instance

Return type:

LanguageModelWrapper
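
A typical call sequence, based only on the signatures documented on this page, might look like the sketch below. The helper extract_log_probs is hypothetical, and it is an assumption that the factory accepts the name 'gpt2' (taken from the GPT2Wrapper default). The import is placed inside the function so that defining the helper does not require corrdim or any model weights.

```python
def extract_log_probs(text, model_name="gpt2", stride=1):
    """Build a wrapper via the factory and return sampled log-probability vectors."""
    # Lazy import: corrdim is only needed when the helper is actually called.
    from corrdim.models import create_model_wrapper

    # tokenizer and device are left at their documented defaults (None).
    wrapper = create_model_wrapper(model_name)
    return wrapper.get_log_probabilities(text, stride=stride)
```

Calling extract_log_probs("some text", stride=2) would load the model on first use, so the first call may be slow.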