Key Concepts¶
A gentle introduction to the main ideas behind Hera. Understanding these concepts will help you work effectively with the platform.
The Big Picture: A Metadata-Driven Data Lake¶
At its core, Hera is a data lake with a metadata layer. Your data — weather observations, simulation outputs, GIS files, processed results — lives as ordinary files on disk. Hera doesn't move or copy those files. Instead, it stores metadata documents in MongoDB that describe each piece of data: what it is, where it lives, what format it's in, and any additional properties you attach to it.
┌─────────────────────────────────────────────────────┐
│                       MongoDB                       │
│                                                     │
│  ┌─────────────────────────────────────────┐        │
│  │  Document                               │        │
│  │  projectName: "WindStudy"               │        │
│  │  type: "WeatherStation"                 │        │
│  │  dataFormat: "parquet"                  │        │
│  │  resource: "/data/station_A.parquet"  ──┼──┐     │
│  │  desc:                                  │  │     │
│  │    station: "A"                         │  │     │
│  │    location: "Haifa"                    │  │     │
│  │    elevation: 120                       │  │     │
│  │    period: "2023-2024"                  │  │     │
│  └─────────────────────────────────────────┘  │     │
│                                               │     │
└───────────────────────────────────────────────┼─────┘
                                                │
                    ┌───────────────────────────┘
                    ▼
         ┌─────────────────────┐
         │  /data/             │
         │    station_A.parquet│  ← actual data on disk
         │    station_B.parquet│
         │    dem_30m.tif      │
         └─────────────────────┘
The desc field is a free-form dictionary: you can attach any metadata you need, nested to any depth, and later query it at any level of nesting. This is what sets Hera apart from simply organizing files in folders: every piece of data is searchable by any combination of its metadata fields. The dataFormat field tells Hera how to load the file; see resource types for all supported formats.
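To make the "tree-like structure" concrete, the sketch below flattens a nested desc dictionary into one path per leaf value. It is plain Python and not Hera's implementation: the `flatten_desc` helper, the nested `site` key, and the double-underscore path separator (borrowed from the MongoEngine query convention) are all assumptions made for illustration.

```python
# Illustrative only: plain Python, not part of Hera.
def flatten_desc(desc, prefix=""):
    """Flatten a nested metadata dict into {path: value} pairs.

    Path segments are joined with "__" (a MongoEngine-style separator,
    assumed here for illustration).
    """
    flat = {}
    for key, value in desc.items():
        path = f"{prefix}__{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into sub-dictionaries, extending the path.
            flat.update(flatten_desc(value, path))
        else:
            flat[path] = value
    return flat


# Hypothetical nested variant of the station metadata shown above.
desc = {
    "station": "A",
    "site": {"location": "Haifa", "elevation": 120},
    "period": "2023-2024",
}

paths = flatten_desc(desc)
for path, value in paths.items():
    print(path, "=", value)
```

Each resulting path names one leaf of the tree, which is the unit a query at "any level of nesting" would address.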
A small example: store and retrieve¶
from hera import Project
proj = Project(projectName="WindStudy")
# Store: register a file with metadata
proj.addMeasurementsDocument(
    resource="/data/station_A.parquet",
    dataFormat="parquet",
    type="WeatherStation",
    desc={
        "station": "A",
        "location": "Haifa",
        "elevation": 120,
        "period": "2023-2024",
        "variables": ["temperature", "wind_speed", "wind_direction"]
    }
)
# Retrieve: find data by any metadata field
docs = proj.getMeasurementsDocuments(type="WeatherStation", location="Haifa")
# Load the actual data into a pandas DataFrame
df = docs[0].getData()
print(df.head())
You can query by any combination of fields: for example, all stations above 100 m elevation, all data from 2024, or all parquet files of a certain type. The metadata acts as a catalog for your data lake.
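The idea of a metadata catalog can be sketched without a database. The example below filters an in-memory list of metadata dictionaries by equality and by comparison. It is a toy model, not Hera's query engine: the `matches` helper, the `__gt`-style suffixes (modelled on MongoEngine query operators), and stations B and C with their values are hypothetical.

```python
import operator

# Comparison suffixes modelled on MongoEngine-style query operators
# (an assumption for this sketch, not a statement about Hera's API).
OPERATORS = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


def matches(desc, **criteria):
    """Return True if a metadata dict satisfies every criterion."""
    for key, expected in criteria.items():
        field, _, suffix = key.partition("__")
        if field not in desc:
            return False
        # No suffix means plain equality; otherwise look up the comparison.
        compare = OPERATORS[suffix] if suffix else operator.eq
        if not compare(desc[field], expected):
            return False
    return True


# Hypothetical catalog; only station A comes from the example above.
catalog = [
    {"station": "A", "location": "Haifa", "elevation": 120},
    {"station": "B", "location": "Haifa", "elevation": 35},
    {"station": "C", "location": "Yavneel", "elevation": 230},
]

high_stations = [d["station"] for d in catalog if matches(d, elevation__gt=100)]
high_in_haifa = [
    d["station"] for d in catalog if matches(d, location="Haifa", elevation__gt=100)
]
print(high_stations)  # ['A', 'C']
print(high_in_haifa)  # ['A']
```

Because the files themselves are never opened during filtering, a catalog lookup like this stays cheap no matter how large the underlying data is.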
Three collections for different roles¶
Hera organizes documents into three collections based on their role in the scientific workflow:
| Collection | Role | Example |
|---|---|---|
| Measurements | Raw input data that comes from the real world | Weather station files, GIS shapefiles, sensor readings |
| Simulations | Results produced by computational models | OpenFOAM output, dispersion model results |
| Cache | Derived or intermediate results | Processed statistics, function return values, aggregations |
This separation helps you understand the provenance of any piece of data — where it came from and what role it plays.
For details on adding data, data formats, the type field, and querying, see Working with Data.
Toolkits: Portals to Specific Data Types¶
While Projects give you raw access to all your data, Toolkits provide a domain-specific lens. A toolkit understands a particular kind of data — how to read it, process it, and visualize it.
Think of toolkits as portals: each one is focused on a specific data type and knows the right questions to ask about it.
from hera import toolkitHome
# The meteorology toolkit knows how to work with weather station data
# Tip: if you created the project with `hera-project project create`, you can omit projectName
meteo = toolkitHome.getToolkit(toolkitHome.METEOROLOGY_LOWFREQ, projectName="WindStudy")
# It manages named, versioned datasets
df = meteo.getDataSourceData("station_haifa")
# It provides domain-specific analysis
enriched = meteo.analysis.addDatesColumns(df, datecolumn="datetime")
hourly = meteo.analysis.calcHourlyDist(enriched, field="wind_speed")
# And domain-specific visualization
meteo.presentation.seasonalPlots.plotSeasonalHourly(enriched, field="wind_speed")
Without a toolkit, you'd query the raw project and handle all the data loading, processing, and plotting yourself. With a toolkit, the domain knowledge is built in:
| Without toolkit (raw Project) | With toolkit |
|---|---|
| proj.getMeasurementsDocuments(type="WeatherStation", station="A") | meteo.getDataSourceData("station_haifa") |
| Manual pandas processing | meteo.analysis.addDatesColumns(df, ...) |
| Manual matplotlib plotting | meteo.presentation.dailyPlots.plotScatter(df, ...) |
| You manage versions manually | meteo.getDataSourceData("station_haifa", version=(1,0,0)) |
Each toolkit adds three things on top of the Project data layer:
- Data Sources — versioned, named datasets (no need to remember query filters)
- Analysis — domain-specific processing methods
- Presentation — domain-specific visualizations
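The three layers listed above can be pictured as a small object that holds named data and exposes `analysis` and `presentation` attributes. The sketch below is a toy model under stated assumptions: every `Toy*` class and the `hourly_mean` and `hourly_table` methods are hypothetical, and only the attribute names and `getDataSourceData` mirror the calls shown earlier.

```python
from statistics import mean


class ToyAnalysis:
    """Analysis layer: domain-specific processing."""

    def hourly_mean(self, records, field):
        """Average `field` per hour of day; records are dicts with an 'hour' key."""
        by_hour = {}
        for record in records:
            by_hour.setdefault(record["hour"], []).append(record[field])
        return {hour: mean(values) for hour, values in sorted(by_hour.items())}


class ToyPresentation:
    """Presentation layer: domain-specific output (plain text here, not plots)."""

    def hourly_table(self, hourly):
        return "\n".join(f"{hour:02d}:00  {value:.1f}" for hour, value in hourly.items())


class ToyToolkit:
    """Bundles named data sources with analysis and presentation layers."""

    def __init__(self, data_sources):
        self._data_sources = data_sources
        self.analysis = ToyAnalysis()
        self.presentation = ToyPresentation()

    def getDataSourceData(self, name):
        return self._data_sources[name]


# Hypothetical wind-speed records standing in for a real data source.
toolkit = ToyToolkit({
    "station_haifa": [
        {"hour": 0, "wind_speed": 2.0},
        {"hour": 0, "wind_speed": 4.0},
        {"hour": 12, "wind_speed": 6.5},
    ]
})
records = toolkit.getDataSourceData("station_haifa")
hourly = toolkit.analysis.hourly_mean(records, field="wind_speed")
table = toolkit.presentation.hourly_table(hourly)
print(table)
```

Keeping the layers as separate attributes means a caller can load data by name, process it, and render it without knowing how any of the three is implemented.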
For the full details on projects — directory-based auto-detection, configuration, counters, and all data manipulation methods — see the Projects page.
Available Toolkits¶
Hera ships with built-in toolkits for several domains:
| Domain | Toolkits | What they do |
|---|---|---|
| GIS | Topography, Buildings, LandCover, Demography | Elevation data, building footprints, land classification, population |
| Meteorology | MeteoLowFreq, MeteoHighFreq | Weather station data (hourly and high-frequency) |
| Simulations | OpenFOAM, LSM, Gaussian, WindProfile | CFD, dispersion modeling, wind profiles |
| Risk | RiskAssessment | Agent-based hazard and casualty modeling |
| Data | experiment, dataToolkit | Experiment workflows, repository management |
You can also register your own custom toolkits. See the Toolkit Catalog for detailed documentation of each toolkit.
Data Sources¶
A Data Source is a versioned, named dataset managed by a toolkit. It is the primary way toolkits organize their data within a project.
Each data source has:
- Name: a human-readable identifier (e.g., "YAVNEEL", "Israel_DEM")
- Version: a 3-tuple (major, minor, patch) for tracking changes
- Resource: the actual data (file path, JSON, etc.)
- Data format: how to read the resource (parquet, netcdf, string, etc.)
# `topo` is assumed to be a toolkit instance obtained earlier via toolkitHome.getToolkit(...)
# List data sources for a toolkit
topo.getDataSourceList()
# ['Israel_DEM', 'SRTM_30m']
# Get a specific version
ds = topo.getDataSourceData("Israel_DEM", version=(1, 0, 0))
# Get the default version (latest or explicitly set)
ds = topo.getDataSourceData("Israel_DEM")
# View all data sources as a table
topo.getDataSourceTable()
Versioning¶
Multiple versions of the same data source can coexist. When you request data without specifying a version, Hera returns:
- The default version, if one is stored in the project config
- Otherwise the latest version (the highest version number), which Hera then saves as the default so that subsequent calls keep returning the same version
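The resolution rule above can be sketched in a few lines. This is a minimal model, not Hera's code: the `resolve_version` function and the plain dictionary standing in for the project config are assumptions.

```python
def resolve_version(available, config, name):
    """Pick the version to load for data source `name`.

    available: list of (major, minor, patch) tuples stored for the source
    config:    dict of pinned defaults, updated when a default is recorded
    """
    if not available:
        raise KeyError(f"no versions stored for {name!r}")
    if name in config:
        return config[name]
    latest = max(available)  # tuples compare element by element
    config[name] = latest    # pin it so later calls stay stable
    return latest


config = {}
versions = [(1, 0, 0), (1, 2, 0), (1, 10, 0)]

first = resolve_version(versions, config, "Israel_DEM")
versions.append((2, 0, 0))  # a newer version is added later
second = resolve_version(versions, config, "Israel_DEM")
print(first, second)  # (1, 10, 0) (1, 10, 0)
```

Two details are worth noting. Comparing versions as integer tuples ranks (1, 10, 0) above (1, 2, 0), which a string comparison of "1.10.0" and "1.2.0" would get wrong. And once the default is pinned, adding version (2, 0, 0) does not silently change what an unversioned request returns.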
Repositories¶
A Repository is a JSON file that describes a collection of data sources and their locations. Repositories make it easy to share and reproduce project setups — instead of manually adding data sources one by one, you load a repository file that configures everything at once.
# Register a repository
hera-project repository add my_repository.json
# Load it into a project
hera-project repository load my_repository MY_PROJECT
A repository JSON maps toolkit names to their data sources, configurations, and documents:
{
  "MeteoLowFreq": {
    "Config": { "defaultStation": "YAVNEEL" },
    "Datasource": {
      "YAVNEEL": {
        "isRelativePath": "True",
        "item": {
          "resource": "data/yavneel.parquet",
          "dataFormat": "parquet"
        }
      }
    }
  }
}
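As a rough sketch of how such a file could be consumed, the example below parses the same JSON and lists each data source with its resolved path. The `list_data_sources` helper and the `/repos/wind` base directory are hypothetical, and resolving relative paths against the directory that holds the repository file is an assumption, not documented Hera behavior.

```python
import json
from pathlib import PurePosixPath

REPOSITORY_JSON = """
{
  "MeteoLowFreq": {
    "Config": { "defaultStation": "YAVNEEL" },
    "Datasource": {
      "YAVNEEL": {
        "isRelativePath": "True",
        "item": {
          "resource": "data/yavneel.parquet",
          "dataFormat": "parquet"
        }
      }
    }
  }
}
"""


def list_data_sources(repository, base_dir):
    """Yield (toolkit, name, resolved_resource, dataFormat) for each data source."""
    for toolkit_name, section in repository.items():
        for name, entry in section.get("Datasource", {}).items():
            resource = entry["item"]["resource"]
            # The flag is the string "True" in the example, not a JSON boolean.
            if str(entry.get("isRelativePath", "False")).lower() == "true":
                resource = str(PurePosixPath(base_dir) / resource)
            yield toolkit_name, name, resource, entry["item"]["dataFormat"]


repository = json.loads(REPOSITORY_JSON)
sources = list(list_data_sources(repository, base_dir="/repos/wind"))
for source in sources:
    print(source)
```

Storing relative paths is what makes a repository portable: the same JSON file works wherever the data directory travels alongside it.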
How it all fits together¶
- You ask ToolkitHome for a toolkit by name
- ToolkitHome finds and instantiates the right Toolkit class
- The toolkit is bound to a Project, giving it access to MongoDB and the file system
- You work with Data Sources through the toolkit — loading data, running analysis, creating visualizations
- Repositories let you define and share entire project setups as JSON files
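The lookup-and-bind flow in the first three steps resembles a registry pattern, sketched below. All `Toy*` classes are hypothetical stand-ins, and the string value of `METEOROLOGY_LOWFREQ` is an assumption; how ToolkitHome actually discovers toolkit classes is not described here.

```python
class ToyProject:
    """Stand-in for a Project: would hold the MongoDB and file-system access."""

    def __init__(self, projectName):
        self.projectName = projectName


class ToyMeteoToolkit:
    """Stand-in for a toolkit bound to a project."""

    def __init__(self, project):
        self.project = project


class ToyToolkitHome:
    """Maps toolkit names to classes and instantiates them on request."""

    METEOROLOGY_LOWFREQ = "MeteoLowFreq"  # assumed value, for illustration

    def __init__(self):
        self._registry = {}

    def register(self, name, toolkit_class):
        self._registry[name] = toolkit_class

    def getToolkit(self, name, projectName):
        try:
            toolkit_class = self._registry[name]
        except KeyError:
            known = sorted(self._registry)
            raise ValueError(f"unknown toolkit {name!r}; known: {known}") from None
        # Bind the new toolkit instance to a project.
        return toolkit_class(ToyProject(projectName))


home = ToyToolkitHome()
home.register(ToyToolkitHome.METEOROLOGY_LOWFREQ, ToyMeteoToolkit)

meteo = home.getToolkit(home.METEOROLOGY_LOWFREQ, projectName="WindStudy")
print(type(meteo).__name__, meteo.project.projectName)
```

A registry like this is also a natural place for custom toolkits to plug in, since registering a new class requires no change to the lookup code.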
For the full technical details, see the Core Concepts page in the Developer Guide.