corrdim package

CorrDim: Correlation Dimension for Language Models

A library for computing correlation dimension of autoregressive large language models, based on the research paper “Correlation Dimension of Auto-regressive Large Language Models” (NeurIPS 2025).

class corrdim.CurveResult(sequence_length: int, epsilons: numpy.ndarray, corrints: numpy.ndarray)[source]

Bases: object

Parameters:
  • sequence_length (int)

  • epsilons (numpy.ndarray)

  • corrints (numpy.ndarray)

corrints: numpy.ndarray
epsilons: numpy.ndarray
sequence_length: int
class corrdim.DimensionResult(sequence_length: int, epsilons: numpy.ndarray, corrints: numpy.ndarray, corrdim: float, fit_r2: float, epsilons_linear_region: numpy.ndarray, corrints_linear_region: numpy.ndarray, linear_region_bounds: Tuple[float | None, float | None] = (None, None))[source]

Bases: object

Parameters:
  • sequence_length (int)

  • epsilons (numpy.ndarray)

  • corrints (numpy.ndarray)

  • corrdim (float)

  • fit_r2 (float)

  • epsilons_linear_region (numpy.ndarray)

  • corrints_linear_region (numpy.ndarray)

  • linear_region_bounds (Tuple[float | None, float | None])

corrdim: float
corrints: numpy.ndarray
corrints_linear_region: numpy.ndarray
epsilons: numpy.ndarray
epsilons_linear_region: numpy.ndarray
fit_r2: float
linear_region_bounds: Tuple[float | None, float | None] = (None, None)
sequence_length: int
class corrdim.ProgressiveCurveResult(sequence_length: int, epsilons: numpy.ndarray, corrints_progressive: numpy.ndarray)[source]

Bases: object

Parameters:
  • sequence_length (int)

  • epsilons (numpy.ndarray)

  • corrints_progressive (numpy.ndarray)

corrints_progressive: numpy.ndarray
epsilons: numpy.ndarray
sequence_length: int
class corrdim.ProgressiveDimensionResult(sequence_length, epsilons, skip_prefix_tokens, measure_every_tokens, by_prefix)[source]

Bases: object

Correlation dimensions fitted at subsampled prefix indices after one progressive pass.

Parameters:
  • sequence_length (int)

  • epsilons (numpy.ndarray)

  • skip_prefix_tokens (int)

  • measure_every_tokens (int)

  • by_prefix (Dict[int, DimensionResult])

by_prefix: Dict[int, DimensionResult]
property corrdims: Dict[int, float]
epsilons: numpy.ndarray
measure_every_tokens: int
sequence_length: int
skip_prefix_tokens: int
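
The relationship between by_prefix and the corrdims property can be sketched as follows, assuming the property simply extracts the corrdim field of each fitted result. FitStub and ProgressiveStub are hypothetical stand-ins written for this illustration; they are not the library's classes and keep only the fields needed here.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class FitStub:
    """Hypothetical stand-in for DimensionResult; only two fields are kept."""
    corrdim: float
    fit_r2: float


@dataclass
class ProgressiveStub:
    """Hypothetical stand-in for ProgressiveDimensionResult."""
    by_prefix: Dict[int, FitStub]

    @property
    def corrdims(self) -> Dict[int, float]:
        # Assumption: the property maps each prefix index to its fitted dimension.
        return {i: fit.corrdim for i, fit in self.by_prefix.items()}


stub = ProgressiveStub(by_prefix={100: FitStub(5.0, 0.99), 200: FitStub(6.5, 0.98)})
dims = stub.corrdims
```

Under that assumption, corrdims is a convenience view for plotting dimension against prefix length, while by_prefix keeps the full fit (including fit_r2) for each prefix.
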
corrdim.auto_linear_region_bounds(sequence_length, epsilons, corrints)[source]
Parameters:
  • sequence_length (int)

  • epsilons (numpy.ndarray)

  • corrints (numpy.ndarray)

Return type:

Tuple[float, float]

corrdim.available_corrint_backends()[source]

Return availability of known backends.

Return type:

Dict[str, bool]

corrdim.clamp(values, reference, low, high)[source]

Filter values by a reference range, returning filtered values and reference.

Parameters:
  • values (torch.Tensor | numpy.ndarray)

  • reference (torch.Tensor | numpy.ndarray)

  • low (float)

  • high (float)

Return type:

Tuple[torch.Tensor | numpy.ndarray, torch.Tensor | numpy.ndarray]
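
The filtering that clamp describes can be illustrated with a small pure-Python sketch. It assumes the bounds are inclusive and that values and reference are aligned element by element; neither detail is stated above, and clamp_sketch is a hypothetical helper working on plain lists rather than tensors or arrays.

```python
from typing import List, Sequence, Tuple


def clamp_sketch(values: Sequence[float], reference: Sequence[float],
                 low: float, high: float) -> Tuple[List[float], List[float]]:
    """Keep positions where low <= reference <= high (inclusive bounds are an assumption)."""
    if len(values) != len(reference):
        raise ValueError("values and reference must have the same length")
    kept = [(v, r) for v, r in zip(values, reference) if low <= r <= high]
    return [v for v, _ in kept], [r for _, r in kept]


# Example: keep the epsilons whose correlation integral lies in [0.005, 0.5].
eps = [0.1, 0.2, 0.4, 0.8]
corr = [0.001, 0.01, 0.1, 0.9]
eps_kept, corr_kept = clamp_sketch(eps, corr, low=0.005, high=0.5)
```

A filter of this shape is a natural way to restrict a curve to a chosen correlation-integral range before fitting a slope.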

corrdim.clear_model_cache()[source]
Return type:

None

corrdim.correlation_counts(vecs, epsilons, vecs_other=None, *, backend=None, **kwargs)[source]

Return counts (unnormalized); implemented by the selected backend module.

Parameters:
  • vecs (torch.FloatTensor)

  • epsilons (torch.FloatTensor)

  • vecs_other (torch.FloatTensor | None)

  • backend (str | CorrIntBackend | None)

  • kwargs (Any)

Return type:

torch.Tensor

corrdim.correlation_integral(vecs, epsilons, vecs_other=None, return_counts=False, *, backend=None, **kwargs)[source]

Return correlation integral; if return_counts=True, return counts instead.

Parameters:
  • vecs (torch.FloatTensor)

  • epsilons (torch.FloatTensor)

  • vecs_other (torch.FloatTensor | None)

  • return_counts (bool)

  • backend (str | CorrIntBackend | None)

  • kwargs (Any)

Return type:

torch.Tensor
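
To make the difference between counts and the correlation integral concrete, here is a minimal pure-Python sketch. It assumes Euclidean distance, a non-strict threshold (distance <= epsilon), unordered pairs, and normalization by the number of pairs; the library's metric and normalization are not specified above, so treat these as illustrative choices. Both function names are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence


def correlation_counts_sketch(vecs: Sequence[Sequence[float]],
                              epsilons: Sequence[float]) -> List[int]:
    """Count unordered pairs (i < j) whose distance is <= each epsilon."""
    dists = [math.dist(a, b) for a, b in combinations(vecs, 2)]
    return [sum(d <= eps for d in dists) for eps in epsilons]


def correlation_integral_sketch(vecs: Sequence[Sequence[float]],
                                epsilons: Sequence[float]) -> List[float]:
    """Normalize the counts by the number of unordered pairs (an assumption)."""
    n_pairs = len(vecs) * (len(vecs) - 1) // 2
    return [c / n_pairs for c in correlation_counts_sketch(vecs, epsilons)]


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
counts = correlation_counts_sketch(points, [0.5, 1.0, 2.0])
integral = correlation_integral_sketch(points, [0.5, 1.0, 2.0])
```

This brute-force version is quadratic in the number of vectors; parameters such as block_size in the functions of this module suggest the real backends process pairs in blocks, but that is an inference rather than something the text confirms.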

corrdim.curve_from_text(text, model, tokenizer=None, context_length=None, dim_reduction=8192, stride=1, epsilon_range=(1e-20, 1e+20), num_epsilon=1024, block_size=512, show_progress=False, precision=torch.float32, backend=None, forward_chunk_size=None, **model_kwargs)[source]
Parameters:
  • text (str)

  • model (str | LanguageModelWrapper)

  • tokenizer (object | None)

  • context_length (int | None)

  • dim_reduction (int | None)

  • stride (int)

  • epsilon_range (Tuple[float, float])

  • num_epsilon (int)

  • block_size (int)

  • show_progress (bool)

  • precision (torch.dtype)

  • backend (str | None)

  • forward_chunk_size (int | None)

Return type:

CurveResult

corrdim.curve_from_texts(texts, model, tokenizer=None, context_length=None, dim_reduction=8192, stride=1, epsilon_range=(1e-20, 1e+20), num_epsilon=1024, block_size=512, show_progress=False, precision=torch.float32, backend=None, batch_size=None, forward_chunk_size=None, **model_kwargs)[source]
Parameters:
  • texts (list[str])

  • model (str | LanguageModelWrapper)

  • tokenizer (object | None)

  • context_length (int | None)

  • dim_reduction (int | None)

  • stride (int)

  • epsilon_range (Tuple[float, float])

  • num_epsilon (int)

  • block_size (int)

  • show_progress (bool)

  • precision (torch.dtype)

  • backend (str | None)

  • batch_size (int | None)

  • forward_chunk_size (int | None)

Return type:

list[CurveResult]

corrdim.curve_from_vectors(vectors, epsilon_range=(1e-20, 1e+20), num_epsilon=1024, block_size=512, show_progress=False, backend=None)[source]
Parameters:
  • vectors (torch.Tensor)

  • epsilon_range (Tuple[float, float])

  • num_epsilon (int)

  • block_size (int)

  • show_progress (bool)

  • backend (str | None)

Return type:

CurveResult

corrdim.curve_from_vectors_batch(vectors_batch, epsilon_range=(1e-20, 1e+20), num_epsilon=1024, block_size=512, show_progress=False, backend=None)[source]
Parameters:
  • vectors_batch (torch.Tensor)

  • epsilon_range (Tuple[float, float])

  • num_epsilon (int)

  • block_size (int)

  • show_progress (bool)

  • backend (str | None)

Return type:

list[CurveResult]

corrdim.estimate_dimension_from_curve(curve, correlation_integral_range=None, epsilon_range=None)[source]
Parameters:
  • curve (CurveResult)

  • correlation_integral_range (Tuple[float, float] | None)

  • epsilon_range (Tuple[float, float] | None)

Return type:

DimensionResult
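
The idea behind fitting a dimension to a curve can be sketched as an ordinary least-squares line in log-log coordinates, whose slope plays the role of corrdim and whose coefficient of determination plays the role of fit_r2. This is a generic illustration, not the library's fitting procedure: how the linear region is selected is not described above, and fit_loglog_slope is a hypothetical helper.

```python
import math
from typing import Sequence, Tuple


def fit_loglog_slope(epsilons: Sequence[float],
                     corrints: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of log(corrint) against log(epsilon); returns (slope, r2)."""
    xs = [math.log(e) for e in epsilons]
    ys = [math.log(c) for c in corrints]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, 1.0 - ss_res / ss_tot


# Synthetic curve that scales as epsilon squared, so the slope should be close to 2.
eps = [0.01, 0.02, 0.04, 0.08, 0.16]
corr = [e ** 2 for e in eps]
slope, r2 = fit_loglog_slope(eps, corr)
```

In practice only the points inside the selected linear region (compare epsilons_linear_region and corrints_linear_region on DimensionResult) would be passed to such a fit.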

corrdim.estimate_dimension_from_curves(curves, correlation_integral_range=None, epsilon_range=None)[source]
Parameters:
  • curves (list[CurveResult])

  • correlation_integral_range (Tuple[float, float] | None)

  • epsilon_range (Tuple[float, float] | None)

Return type:

list[DimensionResult]

corrdim.get_corrint_backend()[source]

Get the current default backend (AUTO is resolved).

Return type:

str

corrdim.measure_text(text, model, tokenizer=None, truncation_tokens=None, context_length=None, dim_reduction=8192, stride=1, correlation_integral_range=None, epsilon_range=None, num_epsilon=1024, block_size=512, show_progress=False, precision=torch.float16, backend=None, forward_chunk_size=None, **model_kwargs)[source]
Parameters:
  • text (str)

  • model (str | LanguageModelWrapper)

  • tokenizer (object | None)

  • truncation_tokens (int | None)

  • context_length (int | None)

  • dim_reduction (int | None)

  • stride (int)

  • correlation_integral_range (Tuple[float, float] | None)

  • epsilon_range (Tuple[float, float] | None)

  • num_epsilon (int)

  • block_size (int)

  • show_progress (bool)

  • precision (torch.dtype)

  • backend (str | None)

  • forward_chunk_size (int | None)

Return type:

DimensionResult
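
A minimal usage sketch based only on the signature above. The model name "gpt2" is an assumption chosen for illustration (the model parameter accepts a string or a LanguageModelWrapper), and measure_one_text is a hypothetical wrapper. Running it requires the corrdim package and access to the model weights.

```python
from typing import Tuple


def measure_one_text(text: str, model_name: str = "gpt2") -> Tuple[float, float]:
    """Return (corrdim, fit_r2) for one text; "gpt2" is an illustrative assumption."""
    import corrdim  # imported lazily so this sketch loads without the package

    # All other arguments keep the defaults listed in the signature above.
    result = corrdim.measure_text(text, model_name, show_progress=True)
    return result.corrdim, result.fit_r2
```

Checking fit_r2 alongside corrdim is a reasonable habit: a low value suggests the fitted region was not close to linear.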

corrdim.measure_text_progressive(text, model, tokenizer=None, truncation_tokens=None, skip_prefix_tokens=100, measure_every_tokens=None, context_length=None, dim_reduction=8192, stride=1, correlation_integral_range=None, epsilon_range=None, num_epsilon=1024, block_size=512, show_progress=False, precision=torch.float16, backend=None, forward_chunk_size=None, **model_kwargs)[source]

Compute progressive curves once, then fit correlation dimension at sampled prefixes.

For each index i in range(skip_prefix_tokens, sequence_length, step), uses row corrints_progressive[i] with the shared epsilons grid. Results are in by_prefix (i → DimensionResult). If measure_every_tokens is None, step is chosen from sequence_length: < 100 → 1, < 1000 → 10, otherwise 100. Other arguments follow measure_text() / progressive_curve_from_text().

Parameters:
  • text (str)

  • model (str | LanguageModelWrapper)

  • tokenizer (object | None)

  • truncation_tokens (int | None)

  • skip_prefix_tokens (int)

  • measure_every_tokens (int | None)

  • context_length (int | None)

  • dim_reduction (int | None)

  • stride (int)

  • correlation_integral_range (Tuple[float, float] | None)

  • epsilon_range (Tuple[float, float] | None)

  • num_epsilon (int)

  • block_size (int)

  • show_progress (bool)

  • precision (torch.dtype)

  • backend (str | None)

  • forward_chunk_size (int | None)

Return type:

ProgressiveDimensionResult
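
The prefix sampling described for measure_text_progressive() can be sketched in a few lines. The sketch assumes the default step is 1 for sequences shorter than 100 tokens, 10 for sequences shorter than 1000 tokens, and 100 otherwise; this is a reading of the description, not a copy of the library's code. default_step and prefix_indices are hypothetical helpers.

```python
from typing import List, Optional


def default_step(sequence_length: int) -> int:
    """Assumed default: 1 below 100 tokens, 10 below 1000 tokens, else 100."""
    if sequence_length < 100:
        return 1
    if sequence_length < 1000:
        return 10
    return 100


def prefix_indices(sequence_length: int, skip_prefix_tokens: int = 100,
                   measure_every_tokens: Optional[int] = None) -> List[int]:
    """Indices i at which a dimension would be fitted, per the range() in the description."""
    step = measure_every_tokens if measure_every_tokens is not None else default_step(sequence_length)
    return list(range(skip_prefix_tokens, sequence_length, step))


short = prefix_indices(150)    # step 10
long = prefix_indices(2500)    # step 100
```

Note that with the default skip_prefix_tokens=100, a sequence of 100 tokens or fewer yields no indices under this reading, so by_prefix would be empty.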

corrdim.measure_texts(texts, model, tokenizer=None, truncation_tokens=None, context_length=None, dim_reduction=8192, stride=1, correlation_integral_range=None, epsilon_range=None, num_epsilon=1024, block_size=512, show_progress=False, precision=torch.float16, backend=None, batch_size=None, forward_chunk_size=None, **model_kwargs)[source]
Parameters:
  • texts (list[str])

  • model (str | LanguageModelWrapper)

  • tokenizer (object | None)

  • truncation_tokens (int | None)

  • context_length (int | None)

  • dim_reduction (int | None)

  • stride (int)

  • correlation_integral_range (Tuple[float, float] | None)

  • epsilon_range (Tuple[float, float] | None)

  • num_epsilon (int)

  • block_size (int)

  • show_progress (bool)

  • precision (torch.dtype)

  • backend (str | None)

  • batch_size (int | None)

  • forward_chunk_size (int | None)

Return type:

list[DimensionResult]

corrdim.measure_texts_progressive(texts, model, tokenizer=None, truncation_tokens=None, skip_prefix_tokens=100, measure_every_tokens=None, context_length=None, dim_reduction=8192, stride=1, correlation_integral_range=None, epsilon_range=None, num_epsilon=1024, block_size=512, show_progress=False, precision=torch.float16, backend=None, batch_size=None, forward_chunk_size=None, **model_kwargs)[source]

Like measure_text_progressive() for several strings; batches log-probability extraction when supported.

Parameters:
  • texts (list[str])

  • model (str | LanguageModelWrapper)

  • tokenizer (object | None)

  • truncation_tokens (int | None)

  • skip_prefix_tokens (int)

  • measure_every_tokens (int | None)

  • context_length (int | None)

  • dim_reduction (int | None)

  • stride (int)

  • correlation_integral_range (Tuple[float, float] | None)

  • epsilon_range (Tuple[float, float] | None)

  • num_epsilon (int)

  • block_size (int)

  • show_progress (bool)

  • precision (torch.dtype)

  • backend (str | None)

  • batch_size (int | None)

  • forward_chunk_size (int | None)

Return type:

list[ProgressiveDimensionResult]

corrdim.progressive_correlation_counts(vecs, epsilons, *, backend=None, **kwargs)[source]

Progressive counts over sequence prefixes (if implemented by backend).

Parameters:
  • vecs (torch.FloatTensor)

  • epsilons (torch.FloatTensor)

  • backend (str | CorrIntBackend | None)

  • kwargs (Any)

Return type:

torch.Tensor

corrdim.progressive_correlation_integral(vecs, epsilons, return_counts=False, *, backend=None, **kwargs)[source]

Progressive correlation integral over sequence prefixes (if implemented by backend).

Parameters:
  • vecs (torch.FloatTensor)

  • epsilons (torch.FloatTensor)

  • return_counts (bool)

  • backend (str | CorrIntBackend | None)

  • kwargs (Any)

Return type:

torch.Tensor
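
A progressive computation over prefixes can reuse earlier work, because extending a prefix by one vector only adds the pairs that involve the new vector. The pure-Python sketch below illustrates this under several assumptions that the text above does not confirm: Euclidean distance, a non-strict threshold, normalization by the number of unordered pairs, and row i covering the first i + 1 vectors. Both function names are hypothetical.

```python
import math
from typing import List, Sequence


def progressive_counts_sketch(vecs: Sequence[Sequence[float]],
                              epsilons: Sequence[float]) -> List[List[int]]:
    """Row i holds pair counts among vecs[0..i] (the row convention is an assumption)."""
    running = [0] * len(epsilons)
    rows: List[List[int]] = []
    for i, v in enumerate(vecs):
        # Only pairs that involve the newly added vector change the counts.
        for j in range(i):
            d = math.dist(vecs[j], v)
            for k, eps in enumerate(epsilons):
                if d <= eps:
                    running[k] += 1
        rows.append(list(running))
    return rows


def progressive_integral_sketch(vecs: Sequence[Sequence[float]],
                                epsilons: Sequence[float]) -> List[List[float]]:
    """Divide each row by the number of pairs in its prefix; row 0 has no pairs."""
    out: List[List[float]] = []
    for i, row in enumerate(progressive_counts_sketch(vecs, epsilons)):
        n_pairs = i * (i + 1) // 2  # pairs among the first i + 1 vectors
        out.append([c / n_pairs if n_pairs else 0.0 for c in row])
    return out


points = [(0.0,), (1.0,), (3.0,)]
rows = progressive_counts_sketch(points, [1.0, 2.0, 3.0])
integrals = progressive_integral_sketch(points, [1.0, 2.0, 3.0])
```

The total cost matches a single full pass over all pairs, which is why one progressive pass is cheaper than recomputing the integral separately for every prefix.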

corrdim.progressive_curve_from_text(text, model, tokenizer=None, context_length=None, dim_reduction=8192, stride=1, epsilon_range=(1e-20, 1e+20), num_epsilon=1024, block_size=512, show_progress=False, precision=torch.float32, backend=None, forward_chunk_size=None, **model_kwargs)[source]
Parameters:
  • text (str)

  • model (str | LanguageModelWrapper)

  • tokenizer (object | None)

  • context_length (int | None)

  • dim_reduction (int | None)

  • stride (int)

  • epsilon_range (Tuple[float, float])

  • num_epsilon (int)

  • block_size (int)

  • show_progress (bool)

  • precision (torch.dtype)

  • backend (str | None)

  • forward_chunk_size (int | None)

Return type:

ProgressiveCurveResult

corrdim.progressive_curve_from_texts(texts, model, tokenizer=None, context_length=None, dim_reduction=8192, stride=1, epsilon_range=(1e-20, 1e+20), num_epsilon=1024, block_size=512, show_progress=False, precision=torch.float32, backend=None, batch_size=None, forward_chunk_size=None, **model_kwargs)[source]
Parameters:
  • texts (list[str])

  • model (str | LanguageModelWrapper)

  • tokenizer (object | None)

  • context_length (int | None)

  • dim_reduction (int | None)

  • stride (int)

  • epsilon_range (Tuple[float, float])

  • num_epsilon (int)

  • block_size (int)

  • show_progress (bool)

  • precision (torch.dtype)

  • backend (str | None)

  • batch_size (int | None)

  • forward_chunk_size (int | None)

Return type:

list[ProgressiveCurveResult]

corrdim.progressive_curve_from_vectors(vectors, epsilon_range=(1e-20, 1e+20), num_epsilon=1024, block_size=512, show_progress=False, backend=None)[source]
Parameters:
  • vectors (torch.Tensor)

  • epsilon_range (Tuple[float, float])

  • num_epsilon (int)

  • block_size (int)

  • show_progress (bool)

  • backend (str | None)

Return type:

ProgressiveCurveResult

corrdim.progressive_curve_from_vectors_batch(vectors_batch, epsilon_range=(1e-20, 1e+20), num_epsilon=1024, block_size=512, show_progress=False, backend=None)[source]
Parameters:
  • vectors_batch (torch.Tensor)

  • epsilon_range (Tuple[float, float])

  • num_epsilon (int)

  • block_size (int)

  • show_progress (bool)

  • backend (str | None)

Return type:

list[ProgressiveCurveResult]

corrdim.reduce_dimension(vectors, num_groups=8192, method='group_add')[source]
Parameters:
  • vectors (torch.Tensor | numpy.ndarray)

  • num_groups (int)

  • method (str)

Return type:

torch.Tensor | numpy.ndarray

corrdim.set_corrint_backend(backend='auto')[source]

Set process-wide default backend; returns the resolved backend name.

Parameters:

backend (str | CorrIntBackend | None)

Return type:

str
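
The three backend helpers in this module can be combined as in the sketch below, which uses only the documented signatures and the documented default value "auto". No concrete backend names are assumed, since none are listed above; pick_backend is a hypothetical wrapper and needs the corrdim package at call time.

```python
from typing import Dict, Tuple


def pick_backend() -> Tuple[Dict[str, bool], str, str]:
    """Report backend availability, select the automatic default, and read it back."""
    import corrdim  # imported lazily so this sketch loads without the package

    available = corrdim.available_corrint_backends()  # backend name -> is it usable
    resolved = corrdim.set_corrint_backend("auto")    # returns the resolved name
    current = corrdim.get_corrint_backend()           # current process-wide default
    return available, resolved, current
```

Because the default is process-wide, functions that take backend=None would be expected to pick it up; passing backend explicitly to an individual call avoids depending on global state.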

corrdim.text_to_vectors(text, model, tokenizer=None, context_length=None, dim_reduction=8192, stride=1, show_progress=False, precision=torch.float32, forward_chunk_size=None, **model_kwargs)[source]

Extract log-probability vectors from text using model.

This is the public entry point for vector extraction; the returned tensor has shape (sampled_seq_len, reduced_vocab_size) and can be passed directly to curve_from_vectors() or progressive_curve_from_vectors().

Parameters:
  • text (str) – Input text.

  • model (str | LanguageModelWrapper) – HuggingFace model name/ID (str) or a pre-built LanguageModelWrapper instance.

  • tokenizer (object | None) – Tokenizer instance (only used when model is a string).

  • context_length (int | None) – Maximum context length for the model.

  • dim_reduction (int | None) – Vocabulary grouping size for dimensionality reduction.

  • stride (int) – Keep every stride-th token vector.

  • show_progress (bool) – Show a progress bar during inference.

  • precision (torch.dtype) – Output tensor dtype.

  • forward_chunk_size (int | None) – Number of tokens per forward-pass chunk. Reduce this value (e.g. 128) on systems with limited VRAM. Only effective when model is a string; for wrapper instances set the attribute directly.

  • **model_kwargs – Extra keyword arguments forwarded to the model loader when model is a string.

Return type:

torch.Tensor
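
Since the returned tensor can be passed directly to curve_from_vectors(), the step-by-step pipeline can be sketched as follows. Only signatures documented on this page are used; the model name "gpt2" and the wrapper name dimension_via_vectors are assumptions for illustration, and forward_chunk_size=128 follows the suggestion above for systems with limited VRAM.

```python
def dimension_via_vectors(text: str, model_name: str = "gpt2"):
    """Extract vectors, build the curve, then fit; "gpt2" is an illustrative assumption."""
    import corrdim  # imported lazily so this sketch loads without the package

    # Shape is (sampled_seq_len, reduced_vocab_size) according to the description.
    vectors = corrdim.text_to_vectors(text, model_name, forward_chunk_size=128)
    curve = corrdim.curve_from_vectors(vectors)
    fit = corrdim.estimate_dimension_from_curve(curve)
    return tuple(vectors.shape), fit.corrdim, fit.fit_r2
```

Splitting the pipeline this way lets the extracted vectors be reused, for example to build both a regular and a progressive curve without running the model twice.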