# Concepts

## What CorrDim measures

Given a text and an autoregressive language model, CorrDim measures the text's **global structural complexity** as perceived by that model.

At a high level:

- repetitive or degenerate text tends to have a lower correlation dimension
- ordinary fluent text tends to have a higher dimension
- richer long-range structure can produce an even higher dimension

CorrDim is therefore best treated as a sequence-level geometric signal, not as a replacement for perplexity.

## How the pipeline works

CorrDim typically follows four steps:

1. Convert text into a sequence of next-token log-probability vectors.
2. Optionally reduce the vocabulary dimension.
3. Compute a correlation-integral curve over a range of epsilon thresholds.
4. Fit the slope in log-log space to estimate the correlation dimension.
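
Steps 3 and 4 can be illustrated with a small NumPy sketch. This is not CorrDim's implementation: the helper names `toy_correlation_integral` and `toy_fit_dimension` are hypothetical, the distance is assumed to be Euclidean, and synthetic points on a line stand in for the log-probability vectors of step 1.

```python
import numpy as np


def toy_correlation_integral(vectors, epsilons):
    """Fraction of distinct point pairs closer than each epsilon (hypothetical helper)."""
    n = len(vectors)
    # Dense pairwise Euclidean distances; fine for a toy example, too
    # memory-hungry for long sequences of vocabulary-sized vectors.
    diffs = vectors[:, None, :] - vectors[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Keep each unordered pair once and drop the zero self-distances.
    pair_dists = dists[np.triu_indices(n, k=1)]
    return np.array([(pair_dists < eps).mean() for eps in epsilons])


def toy_fit_dimension(epsilons, c_values):
    """Slope of log C(eps) against log eps, fitted by least squares (hypothetical helper)."""
    mask = c_values > 0  # log(0) is undefined, so skip empty thresholds
    slope, _intercept = np.polyfit(np.log(epsilons[mask]), np.log(c_values[mask]), 1)
    return slope


# Synthetic stand-in data: random points on a straight line embedded in 3-D.
rng = np.random.default_rng(0)
t = rng.random(800)
points = t[:, None] * (np.array([1.0, 2.0, 2.0]) / 3.0)  # unit direction vector

epsilons = np.logspace(-2, -1, 10)
curve = toy_correlation_integral(points, epsilons)
dim = toy_fit_dimension(epsilons, curve)
print(f"estimated dimension: {dim:.2f}")  # expected to be close to 1 for a line
```

The fitted slope depends on the chosen epsilon range: thresholds that are too large saturate the curve, and thresholds that are too small leave too few pairs to count.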

In Python, these stages map roughly to:

- `curve_from_text(...)` or `curve_from_vectors(...)` for curve construction
- `estimate_dimension_from_curve(...)` for slope fitting
- `measure_text(...)` when you want curve construction and slope fitting wrapped into one call
- `measure_text_progressive(...)` when you want dimensions fitted at subsampled prefix lengths from a single progressive curve pass

## Backend model

CorrDim exposes multiple backends for correlation-integral computation:

- `triton`: Triton kernels
- `pytorch`: pure PyTorch implementation
- `pytorch_fast`: PyTorch variant optimized for distance computation
- `auto`: resolve automatically, preferring `triton` when available and otherwise `pytorch`

You can select the backend with an environment variable:

```bash
export CORRDIM_CORRINT_BACKEND=pytorch
```

Or in Python:

```python
import corrdim

resolved = corrdim.set_corrint_backend("auto")
print("Using backend:", resolved)
print(corrdim.available_corrint_backends())
```

If you do not set anything, CorrDim defaults to `triton`.
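
The `auto` rule described above (prefer `triton` when available, otherwise `pytorch`) can be sketched as follows. This is an assumption about how such a rule might be written, not CorrDim's actual code: `resolve_auto_backend` is a hypothetical helper, and "available" is assumed here to mean only that the `triton` package is importable.

```python
import importlib.util


def resolve_auto_backend(requested: str) -> str:
    """Hypothetical helper: map "auto" to a concrete backend name."""
    if requested != "auto":
        # An explicit choice is passed through unchanged.
        return requested
    # Assumption: availability is approximated by the package being importable.
    has_triton = importlib.util.find_spec("triton") is not None
    return "triton" if has_triton else "pytorch"


print(resolve_auto_backend("auto"))
```

A real check would probably also need to confirm that a compatible GPU is present, which this sketch does not attempt.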

## API layers

The library is intentionally split into layers:

- high-level API: `measure_text`, `measure_texts`, `measure_text_progressive`
- curve API: `curve_from_text`, `curve_from_texts`, `curve_from_vectors`
- progressive API: `progressive_curve_from_text`, `progressive_curve_from_vectors`; for fitted dimensions along prefixes, use `measure_text_progressive`, which returns a `ProgressiveDimensionResult`
- raw backend API: `correlation_counts`, `correlation_integral`, `progressive_correlation_integral`

Use the highest layer that still gives you the outputs you need.