Metrics

table_evaluator.metrics.column_correlations(dataset_a: DataFrame, dataset_b: DataFrame, categorical_columns: list[str] | None, theil_u=True)

Column-wise correlation calculation between dataset_a and dataset_b.

Parameters:

dataset_a (pd.DataFrame) – First DataFrame
dataset_b (pd.DataFrame) – Second DataFrame
categorical_columns (list[str] | None) – The columns containing categorical values
theil_u (bool) – Whether to use Theil’s U. If False, use Cramér’s V.

Returns:

Mean correlation between all columns.

Return type:

float
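As a rough illustration of the Theil’s U option, the sketch below computes the uncertainty coefficient U(x | y) for two categorical sequences using only the standard library. The helper names conditional_entropy and theils_u and the toy columns are hypothetical; this is not taken from table_evaluator and may differ from how the library computes or aggregates its correlations.

```python
import math
from collections import Counter


def conditional_entropy(x, y):
    """H(x | y) in nats for two equal-length sequences of labels."""
    n = len(x)
    y_counts = Counter(y)
    xy_counts = Counter(zip(x, y))
    entropy = 0.0
    for (_, y_val), joint in xy_counts.items():
        p_xy = joint / n
        p_y = y_counts[y_val] / n
        entropy += p_xy * math.log(p_y / p_xy)
    return entropy


def theils_u(x, y):
    """Uncertainty coefficient U(x | y): the share of H(x) explained by y."""
    n = len(x)
    h_x = -sum((c / n) * math.log(c / n) for c in Counter(x).values())
    if h_x == 0:
        return 1.0  # x is constant, so there is no uncertainty left to explain
    return (h_x - conditional_entropy(x, y)) / h_x


# Hypothetical toy columns.
colour = ["red", "red", "blue", "blue"]
shape = ["dot", "dot", "bar", "bar"]  # fully determined by colour
noise = ["a", "b", "a", "b"]          # independent of colour

u_dependent = theils_u(colour, shape)    # close to 1
u_independent = theils_u(colour, noise)  # close to 0
```

Unlike Cramér’s V, Theil’s U is asymmetric: U(x | y) and U(y | x) generally differ.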

table_evaluator.metrics.euclidean_distance(y_true: ndarray | Series, y_pred: ndarray | Series) → float

Returns the Euclidean distance between y_true and y_pred.

Parameters:

y_true (numpy.ndarray | pd.Series) – The ground truth values.
y_pred (numpy.ndarray | pd.Series) – The predicted values.

Returns:

The Euclidean distance.

Return type:

float
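For reference, the Euclidean distance is the square root of the summed squared differences. The snippet below is a minimal NumPy sketch of that formula with made-up inputs; it does not call table_evaluator and is not necessarily how the library implements the function.

```python
import numpy as np

y_true = np.array([0.0, 0.0])
y_pred = np.array([3.0, 4.0])

# L2 norm of the residual vector: sqrt(3^2 + 4^2)
distance = float(np.sqrt(np.sum((y_true - y_pred) ** 2)))
print(distance)  # 5.0
```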

table_evaluator.metrics.jensenshannon_distance(colname: str, real_col: Series, fake_col: Series, bins: int = 25) → Dict[str, Any]

Calculate the Jensen-Shannon distance between real and fake data columns.

This function bins the data, calculates probability distributions, and then computes the Jensen-Shannon distance between these distributions.

Parameters:

colname (str) – Name of the column being analyzed.
real_col (pd.Series) – Series containing the real data.
fake_col (pd.Series) – Series containing the fake data.
bins (int, optional) – Number of bins to use for discretization. Defaults to 25.

Returns:

A dictionary containing:

’col_name’: Name of the column.
’js_distance’: The calculated Jensen-Shannon distance.

Return type:

Dict[str, Any]

Note

The number of bins is capped at the length of the real column to avoid empty bins.
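The steps described above (bin, normalize, compare) can be sketched with NumPy alone. The function name js_distance_sketch, the shared bin edges, and the natural-log base are assumptions made for illustration; the library’s actual binning strategy and log base may differ.

```python
import numpy as np


def js_distance_sketch(real, fake, bins=25):
    """Jensen-Shannon distance between two binned samples (natural log)."""
    real = np.asarray(real, dtype=float)
    fake = np.asarray(fake, dtype=float)
    bins = min(bins, len(real))  # cap the bin count at the real column's length

    # Assumption: both histograms share edges spanning the combined range.
    edges = np.histogram_bin_edges(np.concatenate([real, fake]), bins=bins)
    p = np.histogram(real, bins=edges)[0].astype(float)
    q = np.histogram(fake, bins=edges)[0].astype(float)
    p /= p.sum()
    q /= q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # Kullback-Leibler divergence, skipping bins where a is zero.
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return float(np.sqrt(max(divergence, 0.0)))


same = js_distance_sketch([0, 1, 2], [0, 1, 2])         # identical samples
disjoint = js_distance_sketch([0, 1, 2], [10, 11, 12])  # no overlap
```

With the natural logarithm, the distance ranges from 0 for identical binned distributions to sqrt(ln 2), roughly 0.83, for distributions with no overlapping bins.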

table_evaluator.metrics.js_distance_df(real: DataFrame, fake: DataFrame, numerical_columns: List[str]) → DataFrame

Calculate Jensen-Shannon distances between real and fake data for numerical columns.

This function computes the Jensen-Shannon distance for each numerical column in parallel using joblib’s Parallel and delayed functions.

Parameters:

real (pd.DataFrame) – DataFrame containing the real data.
fake (pd.DataFrame) – DataFrame containing the fake data.
numerical_columns (List[str]) – List of column names to compute distances for.

Returns:

A DataFrame with column names as index and Jensen-Shannon distances as values.

Return type:

pd.DataFrame

Raises:

AssertionError – If the columns in real and fake DataFrames are not identical.

table_evaluator.metrics.kolmogorov_smirnov_test(col_name: str, real_col: Series, fake_col: Series) → Dict[str, Any]

Perform Kolmogorov-Smirnov test on real and fake data columns.

Parameters:

col_name (str) – Name of the column being tested.
real_col (pd.Series) – Series containing the real data.
fake_col (pd.Series) – Series containing the fake data.

Returns:

A dictionary containing:

’col_name’: Name of the column.
’statistic’: The KS statistic.
’p-value’: The p-value of the test.
’equality’: ‘identical’ if p-value > 0.01, else ‘different’.

Return type:

Dict[str, Any]
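A dictionary of the shape described above could be assembled roughly as follows. This sketch assumes a two-sample test via scipy.stats.ks_2samp, which may not be the call the library uses; the column name "age" and the synthetic samples are hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
real_col = rng.normal(loc=0.0, scale=1.0, size=500)
fake_col = rng.normal(loc=3.0, scale=1.0, size=500)  # clearly shifted sample

result = ks_2samp(real_col, fake_col)
report = {
    "col_name": "age",  # hypothetical column name
    "statistic": float(result.statistic),
    "p-value": float(result.pvalue),
    # Same 0.01 threshold as documented above.
    "equality": "identical" if result.pvalue > 0.01 else "different",
}
print(report["equality"])  # different
```

Note that ‘identical’ here only means the test found no significant difference at the 0.01 level, not that the two columns hold the same values.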

table_evaluator.metrics.mean_absolute_error(y_true: ndarray, y_pred: ndarray) → floating[Any]

Returns the mean absolute error between y_true and y_pred.

Parameters:

y_true – numpy.ndarray with the ground truth values.
y_pred – numpy.ndarray with the predicted values.

Returns:

Mean absolute error (float).
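As a quick reference for the formula, the mean absolute error averages the absolute element-wise differences. The following is a plain NumPy sketch with made-up values, not a call into table_evaluator.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([2.0, 2.0, 5.0])

# Absolute errors are 1, 0 and 2, so their mean is 1.
mae = float(np.mean(np.abs(y_true - y_pred)))
print(mae)  # 1.0
```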

table_evaluator.metrics.mean_absolute_percentage_error(y_true: ndarray | Series, y_pred: ndarray | Series)

Returns the mean absolute percentage error between y_true and y_pred. Raises ValueError if y_true contains zero values.

Parameters:

y_true (numpy.ndarray | pd.Series) – The ground truth values.
y_pred (numpy.ndarray | pd.Series) – The predicted values.

Returns:

Mean absolute percentage error.

Return type:

float
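The zero check matters because each error is divided by the corresponding true value. A minimal sketch, assuming the result is expressed as a fraction rather than multiplied by 100 (the library’s scaling is not stated here); mape_sketch is a hypothetical name.

```python
import numpy as np


def mape_sketch(y_true, y_pred):
    """Mean absolute percentage error as a fraction (0.1 means 10%)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if np.any(y_true == 0):
        # Division by a zero ground-truth value is undefined.
        raise ValueError("y_true contains zero values")
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))


# Both predictions are off by 10% of the true value.
value = mape_sketch([100.0, 200.0], [110.0, 180.0])
```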

table_evaluator.metrics.rmse(y_true: ndarray | Series, y_pred: ndarray | Series) → ndarray | Series

Returns the root mean squared error between y_true and y_pred.

Parameters:

y_true – numpy.ndarray with the ground truth values.
y_pred – numpy.ndarray with the predicted values.

Returns:

Root mean squared error (float).
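For comparison with the mean absolute error, the root mean squared error squares each difference before averaging, so large errors weigh more heavily. A NumPy sketch of the formula with made-up values, independent of the library’s implementation:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 8.0])

# Squared errors are 0, 0, 0 and 16; their mean is 4 and its square root is 2.
rmse_value = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
print(rmse_value)  # 2.0
```

On the same inputs the mean absolute error would be 1.0, which shows how a single large miss is penalized more by RMSE.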