Concepts

What CorrDim measures

Given a text and an autoregressive language model, CorrDim measures the text’s global structural complexity as perceived by that model.

At a high level:

  • repetitive or degenerate text tends to have a lower correlation dimension

  • ordinary fluent text tends to have a higher dimension

  • richer long-range structure can produce an even higher dimension

CorrDim is therefore best treated as a sequence-level geometric signal, not as a replacement for perplexity.

How the pipeline works

CorrDim typically follows four steps:

  1. Convert text into a sequence of next-token log-probability vectors.

  2. Optionally reduce the vocabulary dimension.

  3. Compute a correlation-integral curve over a range of epsilon thresholds.

  4. Fit the slope in log-log space to estimate the correlation dimension.

In Python, these stages map roughly to:

  • curve_from_text(...) or curve_from_vectors(...) for curve construction

  • estimate_dimension_from_curve(...) for slope fitting

  • measure_text(...) when you want both steps wrapped into one call

  • measure_text_progressive(...) when you want fitted dimensions at subsampled prefix lengths after a single progressive curve pass

Backend model

CorrDim exposes multiple backends for correlation-integral computation:

  • triton: Triton kernels

  • pytorch: pure PyTorch implementation

  • pytorch_fast: PyTorch variant optimized for distance computation

  • auto: resolve automatically, preferring triton when available and otherwise pytorch

You can select the backend with an environment variable:

export CORRDIM_CORRINT_BACKEND=pytorch

Or in Python:

import corrdim

resolved = corrdim.set_corrint_backend("auto")
print("Using backend:", resolved)
print(corrdim.available_corrint_backends())

If you do not set anything, CorrDim defaults to triton.

API layers

The library is intentionally split into layers:

  • high-level API: measure_text, measure_texts, measure_text_progressive

  • curve API: curve_from_text, curve_from_texts, curve_from_vectors

  • progressive API: progressive_curve_from_text, progressive_curve_from_vectors; fitted dimensions along prefixes use measure_text_progressiveProgressiveDimensionResult

  • raw backend API: correlation_counts, correlation_integral, progressive_correlation_integral

Use the highest layer that still gives you the outputs you need.