corrdim package

CorrDim: Correlation Dimension for Language Models

A library for computing correlation dimension of autoregressive large language models, based on the research paper “Correlation Dimension of Auto-regressive Large Language Models” (NeurIPS 2025).

class corrdim.CurveResult(sequence_length: 'int', epsilons: 'np.ndarray', corrints: 'np.ndarray')[source]

Bases: object

Parameters:
  • sequence_length (int)

  • epsilons (numpy.ndarray)

  • corrints (numpy.ndarray)

corrints: numpy.ndarray
epsilons: numpy.ndarray
sequence_length: int
class corrdim.DimensionResult(sequence_length: 'int', epsilons: 'np.ndarray', corrints: 'np.ndarray', corrdim: 'float', fit_r2: 'float', epsilons_linear_region: 'np.ndarray', corrints_linear_region: 'np.ndarray', linear_region_bounds: 'Tuple[Optional[float], Optional[float]]' = (None, None))[source]

Bases: object

Parameters:
  • sequence_length (int)

  • epsilons (numpy.ndarray)

  • corrints (numpy.ndarray)

  • corrdim (float)

  • fit_r2 (float)

  • epsilons_linear_region (numpy.ndarray)

  • corrints_linear_region (numpy.ndarray)

  • linear_region_bounds (Tuple[float | None, float | None])

corrdim: float
corrints: numpy.ndarray
corrints_linear_region: numpy.ndarray
epsilons: numpy.ndarray
epsilons_linear_region: numpy.ndarray
fit_r2: float
linear_region_bounds: Tuple[float | None, float | None] = (None, None)
sequence_length: int
class corrdim.LanguageModelWrapper[source]

Bases: ABC

Abstract base class for language model wrappers.

abstractmethod get_log_probabilities(text, context_length=None, dim_reduction=None, stride=1, show_progress=False)[source]

Extract log-probability vectors for each token position.

Parameters:
  • text (str) – Input text

  • context_length (int) – Maximum context length

  • batch_size – Batch size for processing

  • dim_reduction (int)

  • stride (int)

  • show_progress (bool)

Returns:

Array of sampled log-probability vectors of shape (sampled_seq_len, vocab_size)

class corrdim.ProgressiveCurveResult(sequence_length: 'int', epsilons: 'np.ndarray', corrints_progressive: 'np.ndarray')[source]

Bases: object

Parameters:
  • sequence_length (int)

  • epsilons (numpy.ndarray)

  • corrints_progressive (numpy.ndarray)

corrints_progressive: numpy.ndarray
epsilons: numpy.ndarray
sequence_length: int
class corrdim.ProgressiveDimensionResult(sequence_length, epsilons, skip_prefix_tokens, measure_every_tokens, by_prefix)[source]

Bases: object

Correlation dimensions fitted at subsampled prefix indices after one progressive pass.

Parameters:
  • sequence_length (int)

  • epsilons (numpy.ndarray)

  • skip_prefix_tokens (int)

  • measure_every_tokens (int)

  • by_prefix (Dict[int, DimensionResult])

by_prefix: Dict[int, DimensionResult]
property corrdims: Dict[int, float]
epsilons: numpy.ndarray
measure_every_tokens: int
sequence_length: int
skip_prefix_tokens: int
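
The corrdims property is typed Dict[int, float] next to by_prefix (Dict[int, DimensionResult]). A minimal sketch of the likely relationship, assuming corrdims simply exposes the corrdim field of each per-prefix fit; FitSketch and corrdims_view are hypothetical stand-ins, not library names:

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class FitSketch:
    """Hypothetical stand-in for the two scalar fields of a DimensionResult."""
    corrdim: float
    fit_r2: float


def corrdims_view(by_prefix: Dict[int, FitSketch]) -> Dict[int, float]:
    # Assumption: each prefix index maps to the corrdim of its fitted result.
    return {prefix: fit.corrdim for prefix, fit in by_prefix.items()}


by_prefix = {100: FitSketch(2.1, 0.99), 200: FitSketch(2.4, 0.98)}
dims = corrdims_view(by_prefix)
```
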
corrdim.auto_linear_region_bounds(sequence_length, epsilons, corrints)[source]
Parameters:
  • sequence_length (int)

  • epsilons (numpy.ndarray)

  • corrints (numpy.ndarray)

Return type:

Tuple[float, float]

corrdim.available_corrint_backends()[source]

Return availability of known backends.

Return type:

Dict[str, bool]

corrdim.clamp(values, reference, low, high)[source]

Filter values by a reference range, returning filtered values and reference.

Parameters:
  • values (torch.Tensor | numpy.ndarray)

  • reference (torch.Tensor | numpy.ndarray)

  • low (float)

  • high (float)

Return type:

Tuple[torch.Tensor | numpy.ndarray, torch.Tensor | numpy.ndarray]
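
The filtering described above can be sketched in plain Python. This is an illustration only: it assumes the bounds are inclusive and that the reference is filtered alongside the values, neither of which is stated in this reference. clamp_by_reference is a hypothetical name and works on lists rather than tensors or arrays.

```python
def clamp_by_reference(values, reference, low, high):
    """Keep entries whose reference value lies in [low, high] (inclusive bounds assumed)."""
    kept = [(v, r) for v, r in zip(values, reference) if low <= r <= high]
    filtered_values = [v for v, _ in kept]
    filtered_reference = [r for _, r in kept]
    return filtered_values, filtered_reference


# Example: keep only the epsilons whose correlation integral is in [0.01, 0.5].
epsilons = [0.1, 1.0, 10.0, 100.0]
corrints = [0.001, 0.02, 0.4, 1.0]
eps_kept, ci_kept = clamp_by_reference(epsilons, corrints, 0.01, 0.5)
```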

corrdim.clear_model_cache()[source]
Return type:

None

corrdim.correlation_counts(vecs, epsilons, vecs_other=None, *, backend=None, **kwargs)[source]

Return counts (unnormalized); implemented by the selected backend module.

Parameters:
  • vecs (torch.FloatTensor)

  • epsilons (torch.FloatTensor)

  • vecs_other (torch.FloatTensor | None)

  • backend (str | CorrIntBackend | None)

  • kwargs (Any)

Return type:

torch.Tensor

corrdim.correlation_integral(vecs, epsilons, vecs_other=None, return_counts=False, *, backend=None, **kwargs)[source]

Return correlation integral; if return_counts=True, return counts instead.

Parameters:
  • vecs (torch.FloatTensor)

  • epsilons (torch.FloatTensor)

  • vecs_other (torch.FloatTensor | None)

  • return_counts (bool)

  • backend (str | CorrIntBackend | None)

  • kwargs (Any)

Return type:

torch.Tensor
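
For orientation, the relationship between counts and the correlation integral can be sketched without any backend. The sketch assumes Euclidean distance, a strict "distance < epsilon" test, and normalization by the number of unordered pairs; the library's backends may use a different metric or convention. correlation_integral_sketch is a hypothetical name.

```python
import math


def correlation_integral_sketch(vecs, epsilons):
    """Return (counts, integrals) over all unordered pairs of vectors."""
    n = len(vecs)
    # Pairwise Euclidean distances for i < j (assumed metric).
    dists = [math.dist(vecs[i], vecs[j]) for i in range(n) for j in range(i + 1, n)]
    num_pairs = len(dists)
    # Unnormalized counts: pairs closer than each epsilon.
    counts = [sum(1 for d in dists if d < eps) for eps in epsilons]
    # Normalized correlation integral: fraction of pairs closer than epsilon.
    integrals = [c / num_pairs for c in counts]
    return counts, integrals


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
counts, integrals = correlation_integral_sketch(points, [0.5, 1.2, 2.0, 10.0])
```

The quadratic pair loop is only suitable for small inputs; the block_size argument on the curve functions suggests the library processes pairs in blocks.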

corrdim.curve_from_text(text, model, tokenizer=None, context_length=None, dim_reduction=8192, stride=1, epsilon_range=(1e-20, 1e+20), num_epsilon=1024, block_size=512, show_progress=False, precision=torch.float32, backend=None, **model_kwargs)[source]
Parameters:
  • text (str)

  • model (str | LanguageModelWrapper)

  • tokenizer (object | None)

  • context_length (int | None)

  • dim_reduction (int | None)

  • stride (int)

  • epsilon_range (Tuple[float, float])

  • num_epsilon (int)

  • block_size (int)

  • show_progress (bool)

  • precision (torch.dtype)

  • backend (str | None)

Return type:

CurveResult

corrdim.curve_from_texts(texts, model, tokenizer=None, context_length=None, dim_reduction=8192, stride=1, epsilon_range=(1e-20, 1e+20), num_epsilon=1024, block_size=512, show_progress=False, precision=torch.float32, backend=None, **model_kwargs)[source]
Parameters:
  • texts (list[str])

  • model (str | LanguageModelWrapper)

  • tokenizer (object | None)

  • context_length (int | None)

  • dim_reduction (int | None)

  • stride (int)

  • epsilon_range (Tuple[float, float])

  • num_epsilon (int)

  • block_size (int)

  • show_progress (bool)

  • precision (torch.dtype)

  • backend (str | None)

Return type:

list[CurveResult]

corrdim.curve_from_vectors(vectors, epsilon_range=(1e-20, 1e+20), num_epsilon=1024, block_size=512, show_progress=False, backend=None)[source]
Parameters:
  • vectors (torch.Tensor)

  • epsilon_range (Tuple[float, float])

  • num_epsilon (int)

  • block_size (int)

  • show_progress (bool)

  • backend (str | None)

Return type:

CurveResult
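
The default epsilon_range spans 40 orders of magnitude, so a logarithmically spaced grid is the natural choice for the num_epsilon points. This reference does not state the spacing, so the sketch below is an assumption; log_spaced_epsilons is a hypothetical helper and requires num_epsilon >= 2.

```python
import math


def log_spaced_epsilons(epsilon_range, num_epsilon):
    """Return num_epsilon points evenly spaced in log10 between the two bounds."""
    low, high = epsilon_range
    log_low, log_high = math.log10(low), math.log10(high)
    step = (log_high - log_low) / (num_epsilon - 1)
    return [10.0 ** (log_low + k * step) for k in range(num_epsilon)]


# Five points from 1e-2 to 1e2, one per decade.
grid = log_spaced_epsilons((1e-2, 1e2), 5)
```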

corrdim.curve_from_vectors_batch(vectors_batch, epsilon_range=(1e-20, 1e+20), num_epsilon=1024, block_size=512, show_progress=False, backend=None)[source]
Parameters:
  • vectors_batch (torch.Tensor)

  • epsilon_range (Tuple[float, float])

  • num_epsilon (int)

  • block_size (int)

  • show_progress (bool)

  • backend (str | None)

Return type:

list[CurveResult]

corrdim.estimate_dimension_from_curve(curve, correlation_integral_range=None, epsilon_range=None)[source]
Parameters:
  • curve (CurveResult)

  • correlation_integral_range (Tuple[float, float] | None)

  • epsilon_range (Tuple[float, float] | None)

Return type:

DimensionResult
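
A correlation dimension is conventionally read off as the slope of log C(epsilon) against log epsilon over a linear region. The sketch below shows that fit with an ordinary least-squares line and its R^2, restricted by optional epsilon and correlation-integral ranges. It is an illustration under those assumptions, not the library's fitting routine; fit_loglog_slope is a hypothetical name.

```python
import math


def fit_loglog_slope(epsilons, corrints, epsilon_range=None, correlation_integral_range=None):
    """Least-squares slope of log C(eps) against log eps, plus the fit's R^2."""
    xs, ys = [], []
    for eps, ci in zip(epsilons, corrints):
        if eps <= 0 or ci <= 0:
            continue  # the logarithm is undefined here
        if epsilon_range is not None and not (epsilon_range[0] <= eps <= epsilon_range[1]):
            continue
        if correlation_integral_range is not None and not (
            correlation_integral_range[0] <= ci <= correlation_integral_range[1]
        ):
            continue
        xs.append(math.log(eps))
        ys.append(math.log(ci))
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two points inside the selected region")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all selected epsilons are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return slope, r2


# Synthetic curve C(eps) = 0.5 * eps**1.5, so the fitted slope should be close to 1.5.
synthetic_eps = [0.01 * 2 ** k for k in range(8)]
synthetic_corrints = [0.5 * eps ** 1.5 for eps in synthetic_eps]
slope, r2 = fit_loglog_slope(synthetic_eps, synthetic_corrints)
```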

corrdim.estimate_dimension_from_curves(curves, correlation_integral_range=None, epsilon_range=None)[source]
Parameters:
  • curves (list[CurveResult])

  • correlation_integral_range (Tuple[float, float] | None)

  • epsilon_range (Tuple[float, float] | None)

Return type:

list[DimensionResult]

corrdim.measure_text(text, model, tokenizer=None, truncation_tokens=None, context_length=None, dim_reduction=8192, stride=1, correlation_integral_range=None, epsilon_range=None, num_epsilon=1024, block_size=512, show_progress=False, precision=torch.float16, backend=None, **model_kwargs)[source]
Parameters:
  • text (str)

  • model (str | LanguageModelWrapper)

  • tokenizer (object | None)

  • truncation_tokens (int | None)

  • context_length (int | None)

  • dim_reduction (int | None)

  • stride (int)

  • correlation_integral_range (Tuple[float, float] | None)

  • epsilon_range (Tuple[float, float] | None)

  • num_epsilon (int)

  • block_size (int)

  • show_progress (bool)

  • precision (torch.dtype)

  • backend (str | None)

Return type:

DimensionResult

corrdim.measure_text_progressive(text, model, tokenizer=None, truncation_tokens=None, skip_prefix_tokens=100, measure_every_tokens=None, context_length=None, dim_reduction=8192, stride=1, correlation_integral_range=None, epsilon_range=None, num_epsilon=1024, block_size=512, show_progress=False, precision=torch.float16, backend=None, **model_kwargs)[source]

Compute progressive curves once, then fit correlation dimension at sampled prefixes.

For each index i in range(skip_prefix_tokens, sequence_length, step), uses row corrints_progressive[i] with the shared epsilons grid. Results are in by_prefix (i → DimensionResult). If measure_every_tokens is None, step is chosen from sequence_length: < 100 → 1, < 1000 → 10, otherwise 100. Other arguments follow measure_text() / progressive_curve_from_text().

Parameters:
  • text (str)

  • model (str | LanguageModelWrapper)

  • tokenizer (object | None)

  • truncation_tokens (int | None)

  • skip_prefix_tokens (int)

  • measure_every_tokens (int | None)

  • context_length (int | None)

  • dim_reduction (int | None)

  • stride (int)

  • correlation_integral_range (Tuple[float, float] | None)

  • epsilon_range (Tuple[float, float] | None)

  • num_epsilon (int)

  • block_size (int)

  • show_progress (bool)

  • precision (torch.dtype)

  • backend (str | None)

Return type:

ProgressiveDimensionResult
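
The prefix sampling rule in the description can be sketched as follows. The default step thresholds are read here as: sequence_length below 100 gives step 1, below 1000 gives step 10, otherwise 100; that reading is an assumption. default_step and sampled_prefix_indices are hypothetical helper names.

```python
def default_step(sequence_length):
    """Step used when measure_every_tokens is None (thresholds are an assumed reading)."""
    if sequence_length < 100:
        return 1
    if sequence_length < 1000:
        return 10
    return 100


def sampled_prefix_indices(sequence_length, skip_prefix_tokens=100, measure_every_tokens=None):
    """Indices i at which a dimension would be fitted from corrints_progressive[i]."""
    step = measure_every_tokens if measure_every_tokens is not None else default_step(sequence_length)
    return list(range(skip_prefix_tokens, sequence_length, step))


indices = sampled_prefix_indices(450)            # step defaults to 10
short_indices = sampled_prefix_indices(80)       # shorter than skip_prefix_tokens
custom_indices = sampled_prefix_indices(450, measure_every_tokens=150)
```

With the default skip_prefix_tokens of 100, a sequence of 100 tokens or fewer yields no sampled indices under this rule, so by_prefix would be empty.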

corrdim.measure_texts(texts, model, tokenizer=None, context_length=None, dim_reduction=8192, stride=1, correlation_integral_range=None, epsilon_range=None, num_epsilon=1024, block_size=512, show_progress=False, precision=torch.float16, backend=None, **model_kwargs)[source]
Parameters:
  • texts (list[str])

  • model (str | LanguageModelWrapper)

  • tokenizer (object | None)

  • context_length (int | None)

  • dim_reduction (int | None)

  • stride (int)

  • correlation_integral_range (Tuple[float, float] | None)

  • epsilon_range (Tuple[float, float] | None)

  • num_epsilon (int)

  • block_size (int)

  • show_progress (bool)

  • precision (torch.dtype)

  • backend (str | None)

Return type:

list[DimensionResult]

corrdim.progressive_correlation_counts(vecs, epsilons, *, backend=None, **kwargs)[source]

Progressive counts over sequence prefixes (if implemented by backend).

Parameters:
  • vecs (torch.FloatTensor)

  • epsilons (torch.FloatTensor)

  • backend (str | CorrIntBackend | None)

  • kwargs (Any)

Return type:

torch.Tensor

corrdim.progressive_correlation_integral(vecs, epsilons, return_counts=False, *, backend=None, **kwargs)[source]

Progressive correlation integral over sequence prefixes (if implemented by backend).

Parameters:
  • vecs (torch.FloatTensor)

  • epsilons (torch.FloatTensor)

  • return_counts (bool)

  • backend (str | CorrIntBackend | None)

  • kwargs (Any)

Return type:

torch.Tensor
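
A progressive computation reuses work across prefixes: when vector i arrives, only its distances to the earlier vectors need to be added to the running counts. The sketch below illustrates this under assumed conventions (Euclidean distance, strict "<" test, row i covering vectors 0..i inclusive); the backends may index or normalize differently. Both function names are hypothetical.

```python
import math


def progressive_counts_sketch(vecs, epsilons):
    """Row i holds, per epsilon, the number of close pairs among vecs[0..i]."""
    running = [0] * len(epsilons)
    rows = []
    for i, v in enumerate(vecs):
        # Only the pairs involving the newly added vector are new.
        for j in range(i):
            d = math.dist(v, vecs[j])
            for k, eps in enumerate(epsilons):
                if d < eps:
                    running[k] += 1
        rows.append(list(running))
    return rows


def progressive_integral_sketch(vecs, epsilons):
    """Normalize each row by the number of unordered pairs in that prefix."""
    result = []
    for i, row in enumerate(progressive_counts_sketch(vecs, epsilons)):
        pairs = i * (i + 1) // 2  # pairs among i + 1 vectors
        result.append([c / pairs if pairs else 0.0 for c in row])
    return result


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
rows = progressive_counts_sketch(points, [1.2, 2.0, 10.0])
integral_rows = progressive_integral_sketch(points, [1.2, 2.0, 10.0])
```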

corrdim.progressive_curve_from_text(text, model, tokenizer=None, context_length=None, dim_reduction=8192, stride=1, epsilon_range=(1e-20, 1e+20), num_epsilon=1024, block_size=512, show_progress=False, precision=torch.float32, backend=None, **model_kwargs)[source]
Parameters:
  • text (str)

  • model (str | LanguageModelWrapper)

  • tokenizer (object | None)

  • context_length (int | None)

  • dim_reduction (int | None)

  • stride (int)

  • epsilon_range (Tuple[float, float])

  • num_epsilon (int)

  • block_size (int)

  • show_progress (bool)

  • precision (torch.dtype)

  • backend (str | None)

Return type:

ProgressiveCurveResult

corrdim.progressive_curve_from_texts(texts, model, tokenizer=None, context_length=None, dim_reduction=8192, stride=1, epsilon_range=(1e-20, 1e+20), num_epsilon=1024, block_size=512, show_progress=False, precision=torch.float32, backend=None, **model_kwargs)[source]
Parameters:
  • texts (list[str])

  • model (str | LanguageModelWrapper)

  • tokenizer (object | None)

  • context_length (int | None)

  • dim_reduction (int | None)

  • stride (int)

  • epsilon_range (Tuple[float, float])

  • num_epsilon (int)

  • block_size (int)

  • show_progress (bool)

  • precision (torch.dtype)

  • backend (str | None)

Return type:

list[ProgressiveCurveResult]

corrdim.progressive_curve_from_vectors(vectors, epsilon_range=(1e-20, 1e+20), num_epsilon=1024, block_size=512, show_progress=False, backend=None)[source]
Parameters:
  • vectors (torch.Tensor)

  • epsilon_range (Tuple[float, float])

  • num_epsilon (int)

  • block_size (int)

  • show_progress (bool)

  • backend (str | None)

Return type:

ProgressiveCurveResult

corrdim.progressive_curve_from_vectors_batch(vectors_batch, epsilon_range=(1e-20, 1e+20), num_epsilon=1024, block_size=512, show_progress=False, backend=None)[source]
Parameters:
  • vectors_batch (torch.Tensor)

  • epsilon_range (Tuple[float, float])

  • num_epsilon (int)

  • block_size (int)

  • show_progress (bool)

  • backend (str | None)

Return type:

list[ProgressiveCurveResult]

corrdim.reduce_dimension(vectors, num_groups=8192, method='group_add')[source]
Parameters:
  • vectors (torch.Tensor | numpy.ndarray)

  • num_groups (int)

  • method (str)

Return type:

torch.Tensor | numpy.ndarray
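
This reference does not describe what the 'group_add' method does. One plausible reading, shown purely as an assumption, is that coordinates are partitioned into num_groups groups and summed within each group. The sketch assigns coordinates to groups by index modulo num_groups; the library's actual grouping, and whether it sums raw values or something else, may differ. group_add_sketch is a hypothetical name.

```python
def group_add_sketch(vector, num_groups):
    """Sum coordinates into num_groups buckets (assignment by index modulo is assumed)."""
    reduced = [0.0] * num_groups
    for index, value in enumerate(vector):
        reduced[index % num_groups] += value
    return reduced


# Five coordinates reduced to two groups: indices 0, 2, 4 and indices 1, 3.
reduced = group_add_sketch([1.0, 2.0, 3.0, 4.0, 5.0], 2)
```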

corrdim.set_corrint_backend(backend='auto')[source]

Set process-wide default backend; returns the resolved backend name.

Parameters:

backend (str | CorrIntBackend | None)

Return type:

str