DataFrame Comparison¶
The hera.utils.dataframeutils module provides compareDataframeConfigurations, a function that identifies parameters that differ across multiple named datasets. It supports several input formats and can return results in either long or wide (pivoted) format.
Importing¶
from hera.utils.dataframeutils import compareDataframeConfigurations
Basic Example -- List of Tuples¶
The simplest input is a list of (name, dataframe) tuples. Each DataFrame must have a parameter-name column and a value column:
import pandas as pd
from hera.utils.dataframeutils import compareDataframeConfigurations

# Two configurations that differ in solver.dt and mesh.resolution.
df_a = pd.DataFrame({
    "parameterName": ["solver.dt", "solver.maxIter", "mesh.resolution"],
    "value": [0.1, 500, 100],
})
df_b = pd.DataFrame({
    "parameterName": ["solver.dt", "solver.maxIter", "mesh.resolution"],
    "value": [0.2, 500, 200],
})

# Each tuple is (dataset name, DataFrame).
diff = compareDataframeConfigurations(
    [("baseline", df_a), ("refined", df_b)],
    parameterName="parameterName",
    valueName="value",
)
print(diff)
The result is a wide-format DataFrame showing only the rows where the value differs between baseline and refined (here solver.dt and mesh.resolution). solver.maxIter is excluded because it is the same in both.
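The filtering rule described above can be illustrated with plain pandas. The following is a minimal sketch, not hera's implementation: the helper `differing_parameters` is hypothetical, and it assumes each dataset lists every parameter at most once.

```python
import pandas as pd


def differing_parameters(named_frames, parameterName="parameterName", valueName="value"):
    """Hypothetical helper: return long-format rows for parameters whose values differ."""
    # Stack all datasets into one long frame, tagging each row with its dataset name.
    combined = pd.concat(
        [df.assign(datasetName=name) for name, df in named_frames],
        ignore_index=True,
    )
    # Count distinct values per parameter; more than one means the datasets disagree.
    distinct = combined.groupby(parameterName)[valueName].transform("nunique")
    return combined[distinct > 1].reset_index(drop=True)


df_a = pd.DataFrame({
    "parameterName": ["solver.dt", "solver.maxIter", "mesh.resolution"],
    "value": [0.1, 500, 100],
})
df_b = pd.DataFrame({
    "parameterName": ["solver.dt", "solver.maxIter", "mesh.resolution"],
    "value": [0.2, 500, 200],
})

sketch = differing_parameters([("baseline", df_a), ("refined", df_b)])
print(sorted(sketch["parameterName"].unique()))
```

Note that `nunique` ignores missing values, so this sketch does not flag a parameter that is absent from one dataset. How compareDataframeConfigurations treats that case is not covered here.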
Wide vs. Long Format¶
By default the output is wide (pivoted), with one column per dataset. Pass longFormat=True to get long format instead:
diff = compareDataframeConfigurations(
    [("baseline", df_a), ("refined", df_b)],
    parameterName="parameterName",
    valueName="value",
    longFormat=True,
)

  datasetName    parameterName  value
0    baseline  mesh.resolution    100
1     refined  mesh.resolution    200
2    baseline        solver.dt    0.1
3     refined        solver.dt    0.2
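For readers who receive long-format output and want to reshape it themselves, pandas' `DataFrame.pivot` can do so. The sketch below rebuilds the long-format rows shown above by hand and pivots them into one column per dataset. Whether this matches the exact layout of hera's own wide output is an assumption.

```python
import pandas as pd

# Long-format rows copied by hand from the example output above.
long_diff = pd.DataFrame({
    "datasetName": ["baseline", "refined", "baseline", "refined"],
    "parameterName": ["mesh.resolution", "mesh.resolution", "solver.dt", "solver.dt"],
    "value": [100, 200, 0.1, 0.2],
})

# One row per parameter, one column per dataset.
wide_diff = long_diff.pivot(index="parameterName", columns="datasetName", values="value")
print(wide_diff)
```

`pivot` raises an error if a (parameter, dataset) pair appears more than once, which makes it a useful check that the long-format rows are unique.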
Input from a Single DataFrame¶
If your data is already in a single long-format DataFrame with a column distinguishing datasets, pass it directly:
combined = pd.DataFrame({
    "run": ["A", "A", "B", "B"],
    "parameterName": ["dt", "maxIter", "dt", "maxIter"],
    "value": [0.01, 100, 0.02, 100],
})
diff = compareDataframeConfigurations(
    combined,
    datasetName="run",
    parameterName="parameterName",
    valueName="value",
)
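If the runs start out as separate DataFrames, a combined long-format frame with a dataset column can be assembled with `pandas.concat`. This is a sketch in plain pandas: the `run` column name matches the example above, while the `run_frames` dictionary is hypothetical.

```python
import pandas as pd

# Hypothetical per-run frames, keyed by run name.
run_frames = {
    "A": pd.DataFrame({"parameterName": ["dt", "maxIter"], "value": [0.01, 100]}),
    "B": pd.DataFrame({"parameterName": ["dt", "maxIter"], "value": [0.02, 100]}),
}

# Passing a dict to concat uses its keys as an outer index level, named "run" here;
# reset_index then turns that level into a regular column.
combined = (
    pd.concat(run_frames, names=["run"])
    .reset_index(level="run")
    .reset_index(drop=True)
)
print(combined)
```

The resulting frame has the same columns as the `combined` example above, so it could be passed with datasetName="run" in the same way.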
Making Columns Compatible with pandas.query¶
Dot-separated parameter names (e.g., solver.dt) are awkward to use in DataFrame.query(): an unquoted dot is parsed as attribute access, so such names must be wrapped in backticks. Set changeDotToUnderscore=True to convert dots to underscores:
diff = compareDataframeConfigurations(
    [("a", df_a), ("b", df_b)],
    parameterName="parameterName",
    valueName="value",
    changeDotToUnderscore=True,
)
# Columns like "solver.dt" become "solver_dt"
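To see why underscores help, consider a table whose columns are parameter names. The frames below are built by hand for illustration; their layout is an assumption, not captured hera output. In `DataFrame.query`, a dotted column name needs backtick quoting, while the underscore form can be written directly.

```python
import pandas as pd

# Hand-built table; the layout (one row per dataset, one column per parameter) is assumed.
dotted = pd.DataFrame(
    {"solver.dt": [0.1, 0.2], "mesh.resolution": [100, 200]},
    index=["baseline", "refined"],
)
# Mimic the dot-to-underscore renaming with plain pandas.
underscored = dotted.rename(columns=lambda c: c.replace(".", "_"))

# Dotted names must be wrapped in backticks inside query expressions.
quoted_hits = dotted.query("`solver.dt` > 0.15")
# Underscore names are valid identifiers and need no quoting.
plain_hits = underscored.query("solver_dt > 0.15")

print(list(plain_hits.index))
```

Both queries select the same row. The underscore form is simply easier to type and read in longer expressions.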
Parameters Reference¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| data | list or DataFrame | -- | Datasets to compare. See input formats above. |
| datasetName | str | "datasetName" | Column or key identifying each dataset. |
| parameterName | str | "parameterName" | Column containing parameter paths. |
| valueName | str | "value" | Column containing parameter values. |
| indexList | list | None | Additional grouping columns (sub-indices). |
| longFormat | bool | False | Return long format instead of wide. |
| changeDotToUnderscore | bool | False | Replace dots with underscores in parameter names. |