corrdim package
CorrDim: Correlation Dimension for Language Models
A library for computing correlation dimension of autoregressive large language models, based on the research paper “Correlation Dimension of Auto-regressive Large Language Models” (NeurIPS 2025).
- class corrdim.CurveResult(sequence_length: 'int', epsilons: 'np.ndarray', corrints: 'np.ndarray')[source]
Bases: object
- Parameters:
sequence_length (int)
epsilons (numpy.ndarray)
corrints (numpy.ndarray)
- corrints: numpy.ndarray
- epsilons: numpy.ndarray
- sequence_length: int
- class corrdim.DimensionResult(sequence_length: 'int', epsilons: 'np.ndarray', corrints: 'np.ndarray', corrdim: 'float', fit_r2: 'float', epsilons_linear_region: 'np.ndarray', corrints_linear_region: 'np.ndarray', linear_region_bounds: 'Tuple[Optional[float], Optional[float]]' = (None, None))[source]
Bases: object
- Parameters:
sequence_length (int)
epsilons (numpy.ndarray)
corrints (numpy.ndarray)
corrdim (float)
fit_r2 (float)
epsilons_linear_region (numpy.ndarray)
corrints_linear_region (numpy.ndarray)
linear_region_bounds (Tuple[float | None, float | None])
- corrdim: float
- corrints: numpy.ndarray
- corrints_linear_region: numpy.ndarray
- epsilons: numpy.ndarray
- epsilons_linear_region: numpy.ndarray
- fit_r2: float
- linear_region_bounds: Tuple[float | None, float | None] = (None, None)
- sequence_length: int
- class corrdim.ProgressiveCurveResult(sequence_length: 'int', epsilons: 'np.ndarray', corrints_progressive: 'np.ndarray')[source]
Bases: object
- Parameters:
sequence_length (int)
epsilons (numpy.ndarray)
corrints_progressive (numpy.ndarray)
- corrints_progressive: numpy.ndarray
- epsilons: numpy.ndarray
- sequence_length: int
- class corrdim.ProgressiveDimensionResult(sequence_length, epsilons, skip_prefix_tokens, measure_every_tokens, by_prefix)[source]
Bases: object
Correlation dimensions fitted at subsampled prefix indices after one progressive pass.
- Parameters:
sequence_length (int)
epsilons (numpy.ndarray)
skip_prefix_tokens (int)
measure_every_tokens (int)
by_prefix (Dict[int, DimensionResult])
- by_prefix: Dict[int, DimensionResult]
- property corrdims: Dict[int, float]
- epsilons: numpy.ndarray
- measure_every_tokens: int
- sequence_length: int
- skip_prefix_tokens: int
- corrdim.auto_linear_region_bounds(sequence_length, epsilons, corrints)[source]
- Parameters:
sequence_length (int)
epsilons (numpy.ndarray)
corrints (numpy.ndarray)
- Return type:
Tuple[float, float]
- corrdim.available_corrint_backends()[source]
Return availability of known backends.
- Return type:
Dict[str, bool]
- corrdim.clamp(values, reference, low, high)[source]
Filter values by a reference range, returning filtered values and reference.
- Parameters:
values (torch.Tensor | numpy.ndarray)
reference (torch.Tensor | numpy.ndarray)
low (float)
high (float)
- Return type:
Tuple[torch.Tensor | numpy.ndarray, torch.Tensor | numpy.ndarray]
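The filtering described for clamp() can be illustrated with a short NumPy sketch. This is not the library's implementation: the helper name clamp_by_reference is hypothetical, and treating both bounds as inclusive is an assumption, since the reference above does not say how the endpoints are handled.

```python
import numpy as np


def clamp_by_reference(values, reference, low, high):
    """Keep the entries whose reference value lies in [low, high].

    Hypothetical helper; inclusive bounds are an assumption.
    """
    values = np.asarray(values)
    reference = np.asarray(reference)
    # Boolean mask built from the reference array, applied to both arrays.
    mask = (reference >= low) & (reference <= high)
    return values[mask], reference[mask]


# Example: keep only the epsilons whose correlation integral is in [0.005, 0.5].
epsilons = np.array([0.1, 0.2, 0.4, 0.8])
corrints = np.array([0.001, 0.01, 0.1, 0.9])
kept_eps, kept_c = clamp_by_reference(epsilons, corrints, low=0.005, high=0.5)
```

A typical use of this kind of filter is to restrict a curve to the part where the correlation integral is neither saturated nor dominated by noise before fitting a slope.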
- corrdim.correlation_counts(vecs, epsilons, vecs_other=None, *, backend=None, **kwargs)[source]
Return counts (unnormalized); implemented by the selected backend module.
- Parameters:
vecs (torch.FloatTensor)
epsilons (torch.FloatTensor)
vecs_other (torch.FloatTensor | None)
backend (str | CorrIntBackend | None)
kwargs (Any)
- Return type:
torch.Tensor
- corrdim.correlation_integral(vecs, epsilons, vecs_other=None, return_counts=False, *, backend=None, **kwargs)[source]
Return correlation integral; if return_counts=True, return counts instead.
- Parameters:
vecs (torch.FloatTensor)
epsilons (torch.FloatTensor)
vecs_other (torch.FloatTensor | None)
return_counts (bool)
backend (str | CorrIntBackend | None)
kwargs (Any)
- Return type:
torch.Tensor
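For readers unfamiliar with the quantity itself, the following is a minimal NumPy sketch of a correlation integral in the Grassberger-Procaccia sense: the fraction of point pairs closer than each threshold. It is a brute-force illustration, not the backend code. The function name, the Euclidean distance, the strict "<" comparison, and the normalization by the number of unordered pairs are all assumptions; the library's backends may differ on each of these.

```python
import numpy as np


def correlation_integral_sketch(vecs, epsilons):
    """Return (corrints, counts) for a set of points and a list of thresholds.

    Hypothetical helper: Euclidean distance and unordered-pair
    normalization are assumptions, not documented library behaviour.
    """
    vecs = np.asarray(vecs, dtype=np.float64)
    eps = np.asarray(epsilons, dtype=np.float64)
    n = vecs.shape[0]
    # All pairwise Euclidean distances (O(n^2) memory; fine for a sketch).
    diffs = vecs[:, None, :] - vecs[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Keep each unordered pair once (strict upper triangle).
    pair_dists = dists[np.triu_indices(n, k=1)]
    # counts[k] = number of pairs with distance below eps[k].
    counts = (pair_dists[None, :] < eps[:, None]).sum(axis=1)
    return counts / (n * (n - 1) / 2), counts


points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 4.0]])
corrints, counts = correlation_integral_sketch(points, [0.5, 1.2, 2.0, 10.0])
```

In this sketch the unnormalized counts correspond to what return_counts=True is described as returning, and dividing by the number of pairs gives the integral.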
- corrdim.curve_from_text(text, model, tokenizer=None, context_length=None, dim_reduction=8192, stride=1, epsilon_range=(1e-20, 1e+20), num_epsilon=1024, block_size=512, show_progress=False, precision=torch.float32, backend=None, forward_chunk_size=None, **model_kwargs)[source]
- Parameters:
text (str)
model (str | LanguageModelWrapper)
tokenizer (object | None)
context_length (int | None)
dim_reduction (int | None)
stride (int)
epsilon_range (Tuple[float, float])
num_epsilon (int)
block_size (int)
show_progress (bool)
precision (torch.dtype)
backend (str | None)
forward_chunk_size (int | None)
- Return type:
CurveResult
- corrdim.curve_from_texts(texts, model, tokenizer=None, context_length=None, dim_reduction=8192, stride=1, epsilon_range=(1e-20, 1e+20), num_epsilon=1024, block_size=512, show_progress=False, precision=torch.float32, backend=None, batch_size=None, forward_chunk_size=None, **model_kwargs)[source]
- Parameters:
texts (list[str])
model (str | LanguageModelWrapper)
tokenizer (object | None)
context_length (int | None)
dim_reduction (int | None)
stride (int)
epsilon_range (Tuple[float, float])
num_epsilon (int)
block_size (int)
show_progress (bool)
precision (torch.dtype)
backend (str | None)
batch_size (int | None)
forward_chunk_size (int | None)
- Return type:
list[CurveResult]
- corrdim.curve_from_vectors(vectors, epsilon_range=(1e-20, 1e+20), num_epsilon=1024, block_size=512, show_progress=False, backend=None)[source]
- Parameters:
vectors (torch.Tensor)
epsilon_range (Tuple[float, float])
num_epsilon (int)
block_size (int)
show_progress (bool)
backend (str | None)
- Return type:
CurveResult
- corrdim.curve_from_vectors_batch(vectors_batch, epsilon_range=(1e-20, 1e+20), num_epsilon=1024, block_size=512, show_progress=False, backend=None)[source]
- Parameters:
vectors_batch (torch.Tensor)
epsilon_range (Tuple[float, float])
num_epsilon (int)
block_size (int)
show_progress (bool)
backend (str | None)
- Return type:
list[CurveResult]
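The curve functions take epsilon_range and num_epsilon but the reference does not say how the thresholds are spaced. Given the default range of (1e-20, 1e+20), a logarithmic grid is the natural reading, and the sketch below builds one with NumPy. Both the helper name epsilon_grid and the logarithmic spacing are assumptions.

```python
import numpy as np


def epsilon_grid(epsilon_range=(1e-20, 1e20), num_epsilon=1024):
    """Build a log-spaced grid of thresholds (spacing is an assumption)."""
    low, high = epsilon_range
    # geomspace places points evenly on a logarithmic scale, endpoints included.
    return np.geomspace(low, high, num=num_epsilon)


grid = epsilon_grid((1e-2, 1e2), num_epsilon=5)
default_grid = epsilon_grid()
```

With a grid like this, narrowing epsilon_range to the scale of the actual pairwise distances spends the num_epsilon points where the curve is informative.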
- corrdim.estimate_dimension_from_curve(curve, correlation_integral_range=None, epsilon_range=None)[source]
- Parameters:
curve (CurveResult)
correlation_integral_range (Tuple[float, float] | None)
epsilon_range (Tuple[float, float] | None)
- Return type:
DimensionResult
- corrdim.estimate_dimension_from_curves(curves, correlation_integral_range=None, epsilon_range=None)[source]
- Parameters:
curves (list[CurveResult])
correlation_integral_range (Tuple[float, float] | None)
epsilon_range (Tuple[float, float] | None)
- Return type:
list[DimensionResult]
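The fields of DimensionResult (corrdim, fit_r2, and the linear-region arrays) suggest a straight-line fit on a log-log plot of the curve. The sketch below shows that standard procedure in NumPy: restrict the curve to a range of correlation-integral values, fit log C against log epsilon, and report the slope and R². It is an illustration under stated assumptions, not the library's fitting code; the function name is hypothetical, and how the library picks the linear region when no range is given is not covered here.

```python
import numpy as np


def fit_dimension_sketch(epsilons, corrints, correlation_integral_range):
    """Return (slope, r2) of log C versus log epsilon inside the given C range.

    Hypothetical helper; ordinary least squares is an assumption.
    """
    eps = np.asarray(epsilons, dtype=np.float64)
    c = np.asarray(corrints, dtype=np.float64)
    low, high = correlation_integral_range
    # Keep points inside the requested range; drop zeros so the log is defined.
    mask = (c >= low) & (c <= high) & (c > 0)
    x, y = np.log(eps[mask]), np.log(c[mask])
    slope, intercept = np.polyfit(x, y, deg=1)
    residual = y - (slope * x + intercept)
    r2 = 1.0 - residual.var() / y.var()
    return slope, r2


# Synthetic curve with a known scaling exponent of 2.5, saturating at 1.
eps = np.geomspace(1e-3, 1.0, 50)
c = np.minimum(eps ** 2.5, 1.0)
dim, r2 = fit_dimension_sketch(eps, c, (1e-6, 1e-1))
```

On real curves the fitted slope depends on the chosen range, which is why the result keeps the linear-region arrays and bounds alongside the estimate.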
- corrdim.get_corrint_backend()[source]
Get the current default backend (AUTO is resolved).
- Return type:
str
- corrdim.measure_text(text, model, tokenizer=None, truncation_tokens=None, context_length=None, dim_reduction=8192, stride=1, correlation_integral_range=None, epsilon_range=None, num_epsilon=1024, block_size=512, show_progress=False, precision=torch.float16, backend=None, forward_chunk_size=None, **model_kwargs)[source]
- Parameters:
text (str)
model (str | LanguageModelWrapper)
tokenizer (object | None)
truncation_tokens (int | None)
context_length (int | None)
dim_reduction (int | None)
stride (int)
correlation_integral_range (Tuple[float, float] | None)
epsilon_range (Tuple[float, float] | None)
num_epsilon (int)
block_size (int)
show_progress (bool)
precision (torch.dtype)
backend (str | None)
forward_chunk_size (int | None)
- Return type:
DimensionResult
- corrdim.measure_text_progressive(text, model, tokenizer=None, truncation_tokens=None, skip_prefix_tokens=100, measure_every_tokens=None, context_length=None, dim_reduction=8192, stride=1, correlation_integral_range=None, epsilon_range=None, num_epsilon=1024, block_size=512, show_progress=False, precision=torch.float16, backend=None, forward_chunk_size=None, **model_kwargs)[source]
Compute progressive curves once, then fit correlation dimension at sampled prefixes.
For each index i in range(skip_prefix_tokens, sequence_length, step), uses row corrints_progressive[i] with the shared epsilons grid. Results are in by_prefix (i → DimensionResult). If measure_every_tokens is None, step is chosen from sequence_length: < 100 → 1, < 1000 → 10, otherwise 100. Other arguments follow measure_text() / progressive_curve_from_text().
- Parameters:
text (str)
model (str | LanguageModelWrapper)
tokenizer (object | None)
truncation_tokens (int | None)
skip_prefix_tokens (int)
measure_every_tokens (int | None)
context_length (int | None)
dim_reduction (int | None)
stride (int)
correlation_integral_range (Tuple[float, float] | None)
epsilon_range (Tuple[float, float] | None)
num_epsilon (int)
block_size (int)
show_progress (bool)
precision (torch.dtype)
backend (str | None)
forward_chunk_size (int | None)
- Return type:
ProgressiveDimensionResult
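The prefix-sampling rule described for measure_text_progressive() can be written out as a few lines of plain Python. The helper names default_step and prefix_indices are hypothetical; the thresholds and the range() call follow the description above.

```python
def default_step(sequence_length):
    """Step used when measure_every_tokens is None, per the rule above."""
    if sequence_length < 100:
        return 1
    if sequence_length < 1000:
        return 10
    return 100


def prefix_indices(sequence_length, skip_prefix_tokens=100, measure_every_tokens=None):
    """Indices i at which a dimension would be fitted (hypothetical helper)."""
    if measure_every_tokens is None:
        step = default_step(sequence_length)
    else:
        step = measure_every_tokens
    return list(range(skip_prefix_tokens, sequence_length, step))


indices = prefix_indices(450)        # 100, 110, ..., 440
short = prefix_indices(50)           # empty: the sequence ends before the skip
long_indices = prefix_indices(2500)  # 100, 200, ..., 2400
```

Note that with the default skip_prefix_tokens=100, a sequence of 100 tokens or fewer yields no indices under this rule, so skip_prefix_tokens would need to be lowered for short texts.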
- corrdim.measure_texts(texts, model, tokenizer=None, truncation_tokens=None, context_length=None, dim_reduction=8192, stride=1, correlation_integral_range=None, epsilon_range=None, num_epsilon=1024, block_size=512, show_progress=False, precision=torch.float16, backend=None, batch_size=None, forward_chunk_size=None, **model_kwargs)[source]
- Parameters:
texts (list[str])
model (str | LanguageModelWrapper)
tokenizer (object | None)
truncation_tokens (int | None)
context_length (int | None)
dim_reduction (int | None)
stride (int)
correlation_integral_range (Tuple[float, float] | None)
epsilon_range (Tuple[float, float] | None)
num_epsilon (int)
block_size (int)
show_progress (bool)
precision (torch.dtype)
backend (str | None)
batch_size (int | None)
forward_chunk_size (int | None)
- Return type:
list[DimensionResult]
- corrdim.measure_texts_progressive(texts, model, tokenizer=None, truncation_tokens=None, skip_prefix_tokens=100, measure_every_tokens=None, context_length=None, dim_reduction=8192, stride=1, correlation_integral_range=None, epsilon_range=None, num_epsilon=1024, block_size=512, show_progress=False, precision=torch.float16, backend=None, batch_size=None, forward_chunk_size=None, **model_kwargs)[source]
Like measure_text_progressive() for several strings; batches log-probability extraction when supported.
- Parameters:
texts (list[str])
model (str | LanguageModelWrapper)
tokenizer (object | None)
truncation_tokens (int | None)
skip_prefix_tokens (int)
measure_every_tokens (int | None)
context_length (int | None)
dim_reduction (int | None)
stride (int)
correlation_integral_range (Tuple[float, float] | None)
epsilon_range (Tuple[float, float] | None)
num_epsilon (int)
block_size (int)
show_progress (bool)
precision (torch.dtype)
backend (str | None)
batch_size (int | None)
forward_chunk_size (int | None)
- Return type:
list[ProgressiveDimensionResult]
- corrdim.progressive_correlation_counts(vecs, epsilons, *, backend=None, **kwargs)[source]
Progressive counts over sequence prefixes (if implemented by backend).
- Parameters:
vecs (torch.FloatTensor)
epsilons (torch.FloatTensor)
backend (str | CorrIntBackend | None)
kwargs (Any)
- Return type:
torch.Tensor
- corrdim.progressive_correlation_integral(vecs, epsilons, return_counts=False, *, backend=None, **kwargs)[source]
Progressive correlation integral over sequence prefixes (if implemented by backend).
- Parameters:
vecs (torch.FloatTensor)
epsilons (torch.FloatTensor)
return_counts (bool)
backend (str | CorrIntBackend | None)
kwargs (Any)
- Return type:
torch.Tensor
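One way to obtain the correlation integral for every prefix in a single pass is to count, for each new vector, how many earlier vectors fall within each threshold, and then take a cumulative sum. The NumPy sketch below illustrates that idea. It is not the backend implementation: the function name, the Euclidean distance, the strict "<" comparison, and the convention that row i covers the first i + 1 vectors are all assumptions.

```python
import numpy as np


def progressive_correlation_integral_sketch(vecs, epsilons):
    """Return (corrints, counts), each of shape (n, len(epsilons)).

    Hypothetical helper; row i is assumed to describe the first i + 1 vectors.
    """
    vecs = np.asarray(vecs, dtype=np.float64)
    eps = np.asarray(epsilons, dtype=np.float64)
    n = vecs.shape[0]
    new_pairs = np.zeros((n, eps.size))
    for j in range(1, n):
        # Distances from vector j to all earlier vectors.
        d = np.linalg.norm(vecs[:j] - vecs[j], axis=1)
        new_pairs[j] = (d[:, None] < eps[None, :]).sum(axis=0)
    # Pairs inside a prefix = pairs contributed by each of its members so far.
    counts = np.cumsum(new_pairs, axis=0)
    sizes = np.arange(1, n + 1)
    total_pairs = (sizes * (sizes - 1) / 2)[:, None]
    # A single-vector prefix has no pairs; leave its row at zero.
    corrints = np.divide(counts, total_pairs,
                         out=np.zeros_like(counts), where=total_pairs > 0)
    return corrints, counts


points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 4.0]])
prog_corrints, prog_counts = progressive_correlation_integral_sketch(
    points, [0.5, 1.2, 2.0, 10.0]
)
```

The last row equals the correlation integral of the full set, so the progressive result contains the non-progressive curve as a special case.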
- corrdim.progressive_curve_from_text(text, model, tokenizer=None, context_length=None, dim_reduction=8192, stride=1, epsilon_range=(1e-20, 1e+20), num_epsilon=1024, block_size=512, show_progress=False, precision=torch.float32, backend=None, forward_chunk_size=None, **model_kwargs)[source]
- Parameters:
text (str)
model (str | LanguageModelWrapper)
tokenizer (object | None)
context_length (int | None)
dim_reduction (int | None)
stride (int)
epsilon_range (Tuple[float, float])
num_epsilon (int)
block_size (int)
show_progress (bool)
precision (torch.dtype)
backend (str | None)
forward_chunk_size (int | None)
- Return type:
ProgressiveCurveResult
- corrdim.progressive_curve_from_texts(texts, model, tokenizer=None, context_length=None, dim_reduction=8192, stride=1, epsilon_range=(1e-20, 1e+20), num_epsilon=1024, block_size=512, show_progress=False, precision=torch.float32, backend=None, batch_size=None, forward_chunk_size=None, **model_kwargs)[source]
- Parameters:
texts (list[str])
model (str | LanguageModelWrapper)
tokenizer (object | None)
context_length (int | None)
dim_reduction (int | None)
stride (int)
epsilon_range (Tuple[float, float])
num_epsilon (int)
block_size (int)
show_progress (bool)
precision (torch.dtype)
backend (str | None)
batch_size (int | None)
forward_chunk_size (int | None)
- Return type:
list[ProgressiveCurveResult]
- corrdim.progressive_curve_from_vectors(vectors, epsilon_range=(1e-20, 1e+20), num_epsilon=1024, block_size=512, show_progress=False, backend=None)[source]
- Parameters:
vectors (torch.Tensor)
epsilon_range (Tuple[float, float])
num_epsilon (int)
block_size (int)
show_progress (bool)
backend (str | None)
- Return type:
ProgressiveCurveResult
- corrdim.progressive_curve_from_vectors_batch(vectors_batch, epsilon_range=(1e-20, 1e+20), num_epsilon=1024, block_size=512, show_progress=False, backend=None)[source]
- Parameters:
vectors_batch (torch.Tensor)
epsilon_range (Tuple[float, float])
num_epsilon (int)
block_size (int)
show_progress (bool)
backend (str | None)
- Return type:
list[ProgressiveCurveResult]
- corrdim.reduce_dimension(vectors, num_groups=8192, method='group_add')[source]
- Parameters:
vectors (torch.Tensor | numpy.ndarray)
num_groups (int)
method (str)
- Return type:
torch.Tensor | numpy.ndarray
- corrdim.set_corrint_backend(backend='auto')[source]
Set process-wide default backend; returns the resolved backend name.
- Parameters:
backend (str | CorrIntBackend | None)
- Return type:
str
- corrdim.text_to_vectors(text, model, tokenizer=None, context_length=None, dim_reduction=8192, stride=1, show_progress=False, precision=torch.float32, forward_chunk_size=None, **model_kwargs)[source]
Extract log-probability vectors from text using model.
This is the public entry point for vector extraction; the returned tensor has shape (sampled_seq_len, reduced_vocab_size) and can be passed directly to curve_from_vectors() or progressive_curve_from_vectors().
- Parameters:
text (str) – Input text.
model (str | LanguageModelWrapper) – HuggingFace model name/ID (str) or a pre-built LanguageModelWrapper instance.
tokenizer (object | None) – Tokenizer instance (only used when model is a string).
context_length (int | None) – Maximum context length for the model.
dim_reduction (int | None) – Vocabulary grouping size for dimensionality reduction.
stride (int) – Keep every stride-th token vector.
show_progress (bool) – Show a progress bar during inference.
precision (torch.dtype) – Output tensor dtype.
forward_chunk_size (int | None) – Number of tokens per forward-pass chunk. Reduce this value (e.g. 128) on systems with limited VRAM. Only effective when model is a string; for wrapper instances set the attribute directly.
**model_kwargs – Extra keyword arguments forwarded to the model loader when model is a string.
- Return type:
torch.Tensor