TableEvaluator

class table_evaluator.table_evaluator.TableEvaluator(real: DataFrame, fake: DataFrame, cat_cols: List[str] | None = None, unique_thresh: int = 0, metric: str | Callable = 'pearsonr', verbose: bool = False, n_samples: int | None = None, name: str | None = None, seed: int = 1337, sample: bool = False, infer_types: bool = True)

Class for evaluating synthetic data. It takes the real and fake data and lets the user run a full evaluation with the evaluate method. Finer-grained evaluations are available through the individual metric methods and the visual_evaluation method.

__init__(real: DataFrame, fake: DataFrame, cat_cols: List[str] | None = None, unique_thresh: int = 0, metric: str | Callable = 'pearsonr', verbose: bool = False, n_samples: int | None = None, name: str | None = None, seed: int = 1337, sample: bool = False, infer_types: bool = True)

Initialize the TableEvaluator with real and fake datasets.

Parameters:
  • real (pd.DataFrame) – Real dataset.

  • fake (pd.DataFrame) – Synthetic dataset.

  • cat_cols (Optional[List[str]], optional) – Columns to be evaluated as discrete. If provided, unique_thresh is ignored. Defaults to None.

  • unique_thresh (int, optional) – Threshold for automatic type detection: numeric columns with at most this many unique values are treated as categorical. Defaults to 0.

  • metric (str | Callable, optional) – Metric for evaluating linear relations. Defaults to “pearsonr”.

  • verbose (bool, optional) – Whether to print verbose output. Defaults to False.

  • n_samples (Optional[int], optional) – Number of samples to evaluate. If None, takes the minimum length of both datasets. Defaults to None.

  • name (Optional[str], optional) – Name of the TableEvaluator, used in plotting functions. Defaults to None.

  • seed (int, optional) – Random seed for reproducibility. Defaults to 1337.

  • sample (bool, optional) – Whether to sample the datasets down to n_samples. Defaults to False.

  • infer_types (bool, optional) – Whether to automatically infer column data types. Defaults to True.
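
Example (a minimal usage sketch; the CSV paths and the cat_cols names are hypothetical placeholders):

    import pandas as pd
    from table_evaluator.table_evaluator import TableEvaluator

    # Any two dataframes with identical columns work here.
    real = pd.read_csv('real_data.csv')
    fake = pd.read_csv('synthetic_data.csv')

    evaluator = TableEvaluator(real, fake, cat_cols=['color', 'size'], verbose=True)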

_determine_columns(cat_cols: List[str] | None) → Tuple[List[str], List[str]]

Determine numerical and categorical columns based on the provided data.

_fill_missing_values()

Fill missing values in the datasets.

_set_sample_size(n_samples: int | None) → int

Set the number of samples to evaluate.

_validate_dataframes()

Ensure that the real and fake dataframes have the same columns.

basic_statistical_evaluation() → float

Calculate the correlation coefficient between the basic properties of self.real and self.fake using Spearman's rho. Spearman's is used because these values can differ greatly in magnitude, and a rank correlation is more resilient to outliers.

Returns:

correlation coefficient

Return type:

float
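
Example (a rough sketch of the underlying idea; it assumes the “basic properties” are per-column summary statistics, which is an implementation detail of the library):

    import pandas as pd
    from scipy import stats

    def basic_stat_correlation(real: pd.DataFrame, fake: pd.DataFrame) -> float:
        # Assumed statistics: mean, median, std, and variance per numeric column.
        num_cols = real.select_dtypes(include='number').columns
        real_stats = real[num_cols].agg(['mean', 'median', 'std', 'var']).to_numpy().ravel()
        fake_stats = fake[num_cols].agg(['mean', 'median', 'std', 'var']).to_numpy().ravel()
        # Spearman's rho is rank-based, so values of very different
        # magnitudes do not dominate the comparison.
        rho, _ = stats.spearmanr(real_stats, fake_stats)
        return float(rho)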

column_correlations()

Wrapper function around metrics.column_correlation.

Returns:

Column correlations between self.real and self.fake.

convert_numerical() → Tuple[DataFrame, DataFrame]

Convert the datasets to numerical representations while making sure they have identical columns. This can be a problem with categorical columns that have many values or very unbalanced values.

Returns:

Real and fake dataframes factorized using the pandas factorize function.

Return type:

Tuple[pd.DataFrame, pd.DataFrame]
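
Example (a sketch of the factorization idea, assuming both frames should share the same integer codes per categorical column; factorize_pair is an illustrative helper, not the library's internals):

    import pandas as pd

    def factorize_pair(real: pd.DataFrame, fake: pd.DataFrame, cat_cols: list) -> tuple:
        real, fake = real.copy(), fake.copy()
        for col in cat_cols:
            # Factorize the real column, then reuse the same category
            # ordering for the fake column; unseen categories become -1.
            codes, uniques = pd.factorize(real[col])
            real[col] = codes
            fake[col] = pd.Categorical(fake[col], categories=uniques).codes
        return real, fake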

convert_numerical_one_hot() → Tuple[DataFrame, DataFrame]

Convert the datasets to numerical representations while making sure they have identical columns. This can be a problem with categorical columns that have many values or very unbalanced values.

Returns:

Real and fake dataframes with categorical columns one-hot encoded and binary columns factorized.

Return type:

Tuple[pd.DataFrame, pd.DataFrame]

correlation_correlation() → float

Calculate the correlation coefficient between the association matrices of self.real and self.fake using self.comparison_metric.

Returns:

The correlation coefficient

Return type:

float

correlation_distance(how: str = 'euclidean') → float

Calculate the distance between the correlation matrices of the real and fake data using a given metric.

Parameters:

how – metric to measure distance. Choose from [euclidean, mae, rmse].

Returns:

distance between the association matrices in the chosen evaluation metric. Default: Euclidean
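
Example (a sketch of the three distance options, assuming both association matrices are same-shaped numpy arrays, e.g. from DataFrame.corr()):

    import numpy as np

    def matrix_distance(m_real: np.ndarray, m_fake: np.ndarray, how: str = 'euclidean') -> float:
        diff = m_real - m_fake
        if how == 'euclidean':
            return float(np.sqrt(np.sum(diff ** 2)))
        if how == 'mae':
            return float(np.mean(np.abs(diff)))
        if how == 'rmse':
            return float(np.sqrt(np.mean(diff ** 2)))
        raise ValueError(f'unknown metric: {how}')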

estimator_evaluation(target_col: str, target_type: str = 'class', kfold: bool = False) → float

Method to do a full estimator evaluation, including training. An estimator is either a regressor or a classifier, depending on the task. Two sets of estimators are created, S_r and S_f, for the real and fake data respectively. S_r is trained on self.real and S_f on self.fake. Both are then evaluated on their own test set and on the other's test set. If target_type is regr, we compute the correlation between the RMSE scores using Pearson's r. If target_type is class, we calculate F1 scores and return 1 - MAPE(F1_r, F1_f).

Parameters:
  • target_col (str) – The column to be considered as the target for both regression and classification tasks.

  • target_type (str) – The type of task. Can be either “class” or “regr”.

  • kfold (bool) – If True, performs 5-fold CV. If False, trains on 80% and tests on 20% of the data once.

Returns:

Correlation value or 1 - MAPE.

Return type:

float
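
Example (a usage sketch; 'income' is a hypothetical target column, and evaluator is a constructed TableEvaluator instance):

    # Classification target: returns 1 - MAPE(F1_r, F1_f).
    score = evaluator.estimator_evaluation(target_col='income', target_type='class', kfold=True)
    print(f'Estimator score: {score:.3f}')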

evaluate(target_col: str, target_type: str = 'class', metric: str | None = None, verbose: bool | None = None, n_samples_distance: int = 20000, kfold: bool = False, notebook: bool = False, return_outputs: bool = False) → Dict | None

Run a full evaluation of the real and fake datasets, including attribute correlations, estimator scores, and row distances. Attribute correlations are computed with a given metric; all metrics from scipy.stats are available.

Parameters:
  • target_col (str) – Column to use for predictions with estimators.

  • target_type (str, optional) – Type of task to perform on the target_col. Can be either “class” for classification or “regr” for regression. Defaults to “class”.

  • metric (str | None, optional) – Overwrites self.metric. Scoring metric for the attributes. By default Pearson’s r is used. Alternatives include Spearman rho (scipy.stats.spearmanr) or Kendall Tau (scipy.stats.kendalltau). Defaults to None.

  • n_samples_distance (int, optional) – The number of samples to take for the row distance. See the documentation of TableEvaluator.row_distance for details. Defaults to 20000.

  • kfold (bool, optional) – Use a 5-fold CV for the ML estimators if set to True. Train/Test on 80%/20% of the data if set to False. Defaults to False.

  • notebook (bool, optional) – Render the results more nicely in a Python notebook. Defaults to False.

  • verbose (bool | None, optional) – Whether to print verbose logging. Defaults to None.

  • return_outputs (bool, optional) – Will omit printing and instead return a dictionary with all results. Defaults to False.

Returns:

A dictionary containing evaluation results if return_outputs is True, otherwise None.

Return type:

Dict
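
Example (a usage sketch with the same hypothetical 'income' target):

    results = evaluator.evaluate(target_col='income', target_type='class', return_outputs=True)
    # results is a dictionary with all evaluation results; nothing is printed.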

fit_estimators()

Fit self.r_estimators and self.f_estimators to real and fake data, respectively.

get_copies(return_len: bool = False) → DataFrame | int

Check whether any rows from the real data also occur in the fake data.

Parameters:

return_len (bool) – Whether to return the length of the copied rows or not.

Returns:

Dataframe containing the copied rows if return_len=False, else an integer indicating the number of copied rows.

Return type:

Union[pd.DataFrame, int]

get_duplicates(return_values: bool = False) → Tuple[DataFrame, DataFrame] | Tuple[int, int]

Return duplicates within each dataset.

Parameters:

return_values (bool) – Whether to return the duplicate values in the datasets. If False, the lengths are returned.

Returns:

If return_values is True, returns a tuple of DataFrames with duplicates. If return_values is False, returns a tuple of integers representing the lengths of those DataFrames.

Return type:

Union[Tuple[pd.DataFrame, pd.DataFrame], Tuple[int, int]]
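
Example (a usage sketch combining both checks):

    n_copied = evaluator.get_copies(return_len=True)     # real rows that reappear in the fake data
    n_dup_real, n_dup_fake = evaluator.get_duplicates()  # within-dataset duplicate counts
    print(f'{n_copied} copied rows, {n_dup_real} / {n_dup_fake} internal duplicates')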

pca_correlation(lingress: bool = False)

Calculate the relation between the PCA explained variance values of the real and fake data. Because these values can span several orders of magnitude, the recent implementation uses MAPE on the log values instead of a regression metric such as Pearson's r.

Parameters:

lingress (bool) – Whether to use a linear regression (Pearson's r) instead of MAPE(log).

Returns:

The correlation coefficient if lingress=True, otherwise 1 - MAPE(log(real), log(fake)).

Return type:

float
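
Example (a sketch of the MAPE-on-logs score, assuming ev_real and ev_fake hold the PCA explained variance values of the real and fake data):

    import numpy as np

    def pca_mape_score(ev_real: np.ndarray, ev_fake: np.ndarray) -> float:
        log_r, log_f = np.log(ev_real), np.log(ev_fake)
        mape = np.mean(np.abs((log_r - log_f) / log_r))
        return float(1.0 - mape)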

plot_correlation_difference(plot_diff=True, fname=None, show: bool = True, **kwargs)

Plot the association matrices for each table and, if chosen, the difference between them.

Parameters:
  • plot_diff – whether to plot the difference

  • fname – If not None, saves the plot under this file name.

  • kwargs – kwargs for sns.heatmap
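
Example (a usage sketch; extra keyword arguments such as annot are passed through to sns.heatmap):

    evaluator.plot_correlation_difference(plot_diff=True, fname='corr_diff.png', annot=False)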

plot_cumsums(nr_cols=4, fname: PathLike | None = None, show: bool = True)

Plot the cumulative sums for all columns in the real and fake dataset. The height of each row scales with the length of the labels. Each plot contains the values of a real column and the corresponding fake column.

Parameters:

fname (PathLike, optional) – If not None, saves the plot under this file name.

plot_distributions(nr_cols: int = 3, fname: PathLike | None = None, show: bool = True)

Plot distribution plots for all columns in the real and fake dataset. The height of each row of plots scales with the length of the labels. Each plot contains the values of a real column and the corresponding fake column.

Parameters:

fname (PathLike, optional) – If not None, saves the plot under this file name.

plot_mean_std(fname=None, show: bool = True)

Class wrapper function for plotting the mean and std using plots.plot_mean_std.

Parameters:

fname (str, optional) – If not None, saves the plot under this file name.

plot_pca(fname: PathLike | None = None, show: bool = True)

Plot the first two components of a PCA of the real and fake data.

Parameters:

fname (PathLike, optional) – If not None, saves the plot under this file name.

row_distance(n_samples: int | None = None) → Tuple[number, number]

Calculate mean and standard deviation distances between self.fake and self.real.

Parameters:

n_samples – Number of samples to take for the evaluation. Compute time grows quadratically with the number of samples.

Returns:

(mean, std) of these distances.
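
Example (a usage sketch; keep n_samples modest because of the quadratic cost):

    mean_dist, std_dist = evaluator.row_distance(n_samples=5000)
    print(f'mean row distance: {mean_dist:.4f} (std {std_dist:.4f})')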

score_estimators()

Get F1 scores of self.r_estimators and self.f_estimators on both the real and fake test sets.

Returns:

dataframe with the results for each estimator on each data test set.

visual_evaluation(save_dir: PathLike | None = None, show: bool = True, **kwargs)

Plot all visual evaluation metrics. Includes plotting the mean and standard deviation, cumulative sums, correlation differences and the PCA transform.

Parameters:
  • save_dir (PathLike | None) – Directory path to save images.

  • show (bool) – Whether to display the plots.

  • **kwargs – Additional keyword arguments for matplotlib.

Returns:

None
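
Example (a usage sketch; 'plots' is an arbitrary output directory):

    evaluator.visual_evaluation(save_dir='plots', show=False)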