Key Concepts

This page is a gentle introduction to the main ideas behind Hera. Understanding these concepts will help you work effectively with the platform.

The Big Picture: A Metadata-Driven Data Lake

At its core, Hera is a data lake with a metadata layer. Your data — weather observations, simulation outputs, GIS files, processed results — lives as ordinary files on disk. Hera doesn't move or copy those files. Instead, it stores metadata documents in MongoDB that describe each piece of data: what it is, where it lives, what format it's in, and any additional properties you attach to it.

┌─────────────────────────────────────────────────────┐
│                    MongoDB                          │
│                                                     │
│  ┌─────────────────────────────────────────┐        │
│  │ Document                                │        │
│  │   projectName: "WindStudy"              │        │
│  │   type: "WeatherStation"                │        │
│  │   dataFormat: "parquet"                 │        │
│  │   resource: "/data/station_A.parquet" ──┼──┐     │
│  │   desc:                                 │  │     │
│  │     station: "A"                        │  │     │
│  │     location: "Haifa"                   │  │     │
│  │     elevation: 120                      │  │     │
│  │     period: "2023-2024"                 │  │     │
│  └─────────────────────────────────────────┘  │     │
│                                               │     │
└───────────────────────────────────────────────┼─────┘
                                                │
                    ┌───────────────────────────┘
                    ▼
         ┌─────────────────────┐
         │  /data/             │
         │   station_A.parquet │  ← actual data on disk
         │   station_B.parquet │
         │   dem_30m.tif       │
         └─────────────────────┘

The desc field is a free-form dictionary: you can attach any metadata you need, forming a tree-like structure that you can later query at any level of nesting. This is what makes Hera different from simply organizing files in folders: every piece of data is searchable by any combination of its metadata fields.

The dataFormat field tells Hera how to load the file; see resource types for all supported formats.

A small example: store and retrieve

from hera import Project

proj = Project(projectName="WindStudy")

# Store: register a file with metadata
proj.addMeasurementsDocument(
    resource="/data/station_A.parquet",
    dataFormat="parquet",
    type="WeatherStation",
    desc={
        "station": "A",
        "location": "Haifa",
        "elevation": 120,
        "period": "2023-2024",
        "variables": ["temperature", "wind_speed", "wind_direction"]
    }
)

# Retrieve: find data by any metadata field
docs = proj.getMeasurementsDocuments(type="WeatherStation", location="Haifa")

# Load the actual data into a pandas DataFrame
df = docs[0].getData()
print(df.head())

You can query by any combination of fields: all stations above 100 m elevation, all data from 2024, all parquet files of a certain type, and so on. The metadata acts as a catalog for your data lake.
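
To show what "catalog" means here, the following sketch filters a small in-memory list of metadata records by a combination of fields. It is a stand-in for the concept only: the `catalog` list, the `find` helper, the second station's location and elevation, and the `geotiff` format label are hypothetical, and the real queries run against MongoDB through the project methods shown above.

```python
# Hypothetical in-memory catalog; in Hera these records live in MongoDB.
catalog = [
    {"type": "WeatherStation", "dataFormat": "parquet",
     "resource": "/data/station_A.parquet",
     "desc": {"station": "A", "location": "Haifa", "elevation": 120}},
    {"type": "WeatherStation", "dataFormat": "parquet",
     "resource": "/data/station_B.parquet",
     "desc": {"station": "B", "location": "Acre", "elevation": 15}},
    {"type": "Topography", "dataFormat": "geotiff",
     "resource": "/data/dem_30m.tif",
     "desc": {"resolution_m": 30}},
]


def _accepts(value, wanted):
    """Equality for plain values; call `wanted` if it is a predicate."""
    return wanted(value) if callable(wanted) else value == wanted


def find(records, doc_type=None, data_format=None, **desc_filters):
    """Return the records whose top-level fields and desc entries all match."""
    found = []
    for record in records:
        if doc_type is not None and record["type"] != doc_type:
            continue
        if data_format is not None and record["dataFormat"] != data_format:
            continue
        desc = record.get("desc", {})
        if all(key in desc and _accepts(desc[key], wanted)
               for key, wanted in desc_filters.items()):
            found.append(record)
    return found


# All weather stations above 100 m elevation.
high_stations = find(catalog, doc_type="WeatherStation",
                     elevation=lambda metres: metres > 100)
print([record["resource"] for record in high_stations])
# ['/data/station_A.parquet']
```

Only the metadata is inspected during the search; the files themselves are opened later, when you ask a matching document for its data.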

Three collections for different roles

Hera organizes documents into three collections based on their role in the scientific workflow:

Collection	Role	Example
Measurements	Raw input data that comes from the real world	Weather station files, GIS shapefiles, sensor readings
Simulations	Results produced by computational models	OpenFOAM output, dispersion model results
Cache	Derived or intermediate results	Processed statistics, function return values, aggregations

This separation helps you understand the provenance of any piece of data — where it came from and what role it plays.

For details on adding data, data formats, the type field, and querying, see Working with Data.

Toolkits: Portals to Specific Data Types

While Projects give you raw access to all your data, Toolkits provide a domain-specific lens. A toolkit understands a particular kind of data — how to read it, process it, and visualize it.

Think of toolkits as portals: each one is focused on a specific data type and knows the right questions to ask about it.

from hera import toolkitHome

# The meteorology toolkit knows how to work with weather station data
# Tip: if you created the project with `hera-project project create`, you can omit projectName
meteo = toolkitHome.getToolkit(toolkitHome.METEOROLOGY_LOWFREQ, projectName="WindStudy")

# It manages named, versioned datasets
df = meteo.getDataSourceData("station_haifa")

# It provides domain-specific analysis
enriched = meteo.analysis.addDatesColumns(df, datecolumn="datetime")
hourly = meteo.analysis.calcHourlyDist(enriched, field="wind_speed")

# And domain-specific visualization
meteo.presentation.seasonalPlots.plotSeasonalHourly(enriched, field="wind_speed")

Without a toolkit, you'd query the raw project and handle all the data loading, processing, and plotting yourself. With a toolkit, the domain knowledge is built in:

Without toolkit (raw Project)	With toolkit
`proj.getMeasurementsDocuments(type="WeatherStation", station="A")`	`meteo.getDataSourceData("station_haifa")`
Manual pandas processing	`meteo.analysis.addDatesColumns(df, ...)`
Manual matplotlib plotting	`meteo.presentation.dailyPlots.plotScatter(df, ...)`
You manage versions manually	`meteo.getDataSourceData("station_haifa", version=(1,0,0))`

Each toolkit adds three things on top of the Project data layer:

Data Sources — versioned, named datasets (no need to remember query filters)
Analysis — domain-specific processing methods
Presentation — domain-specific visualizations
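
The three layers can be pictured with a toy class. This is a sketch of the structure only, assuming nothing about how Hera's toolkits are implemented: `ToyToolkit`, `ToyAnalysis`, `ToyPresentation`, and their methods are hypothetical names, and the "plot" is plain text so the example has no dependencies.

```python
from dataclasses import dataclass, field
from statistics import mean


class ToyAnalysis:
    """Analysis layer: domain-specific processing."""

    def hourly_mean(self, rows, field_name):
        # Group the values of `field_name` by hour, then average each group.
        by_hour = {}
        for row in rows:
            by_hour.setdefault(row["hour"], []).append(row[field_name])
        return {hour: mean(values) for hour, values in sorted(by_hour.items())}


class ToyPresentation:
    """Presentation layer: domain-specific output (text instead of a plot)."""

    def bar_chart(self, series):
        return "\n".join(f"{hour:02d}h {'#' * round(value)}"
                         for hour, value in series.items())


@dataclass
class ToyToolkit:
    """Data-source layer plus the two helper layers attached as attributes."""

    data_sources: dict = field(default_factory=dict)
    analysis: ToyAnalysis = field(default_factory=ToyAnalysis)
    presentation: ToyPresentation = field(default_factory=ToyPresentation)

    def get_data_source_data(self, name):
        return self.data_sources[name]


toolkit = ToyToolkit(data_sources={
    "station_haifa": [
        {"hour": 0, "wind_speed": 2.0},
        {"hour": 0, "wind_speed": 4.0},
        {"hour": 1, "wind_speed": 5.0},
    ]
})

rows = toolkit.get_data_source_data("station_haifa")
hourly = toolkit.analysis.hourly_mean(rows, "wind_speed")
print(toolkit.presentation.bar_chart(hourly))
```

The call pattern (named data first, then `analysis`, then `presentation`) is the same shape as the meteorology example above, which is the point of the sketch.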

For the full details on projects — directory-based auto-detection, configuration, counters, and all data manipulation methods — see the Projects page.

Available Toolkits

Hera ships with built-in toolkits for several domains:

Domain	Toolkits	What they do
GIS	Topography, Buildings, LandCover, Demography	Elevation data, building footprints, land classification, population
Meteorology	MeteoLowFreq, MeteoHighFreq	Weather station data (hourly and high-frequency)
Simulations	OpenFOAM, LSM, Gaussian, WindProfile	CFD, dispersion modeling, wind profiles
Risk	RiskAssessment	Agent-based hazard and casualty modeling
Data	experiment, dataToolkit	Experiment workflows, repository management

You can also register your own custom toolkits. See the Toolkit Catalog for detailed documentation of each toolkit.

Data Sources

A Data Source is a versioned, named dataset managed by a toolkit. It is the primary way toolkits organize their data within a project.

Each data source has:

Name — a human-readable identifier (e.g., "YAVNEEL", "Israel_DEM")
Version — a 3-tuple (major, minor, patch) for tracking changes
Resource — the actual data (file path, JSON, etc.)
Data format — how to read the resource (parquet, netcdf, string, etc.)

# `topo` is assumed to be a topography toolkit instance obtained
# from toolkitHome.getToolkit(...), as in the meteorology example above.

# List data sources for a toolkit
topo.getDataSourceList()
# ['Israel_DEM', 'SRTM_30m']

# Get a specific version
ds = topo.getDataSourceData("Israel_DEM", version=(1, 0, 0))

# Get the default version (latest or explicitly set)
ds = topo.getDataSourceData("Israel_DEM")

# View all data sources as a table
topo.getDataSourceTable()

Versioning

Multiple versions of the same data source can coexist. When you request data without specifying a version, Hera returns:

The default version, if one is stored in the project config
Otherwise, the latest version (the highest version number), which Hera then saves as the default so that subsequent calls keep returning the same version
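
The resolution rule can be written down in a few lines. This is a sketch of the rule as described here, not Hera's code: `resolve_version` is a hypothetical helper, and a plain dictionary stands in for the project config.

```python
def resolve_version(name, available, defaults):
    """Pick the version to load for the data source `name`.

    available: maps a data source name to its (major, minor, patch) tuples
    defaults:  stand-in for the project config; updated in place
    """
    if name in defaults:
        return defaults[name]
    # Tuples compare element by element, so (1, 10, 0) > (1, 2, 0).
    latest = max(available[name])
    defaults[name] = latest  # remember the choice for later calls
    return latest


available = {"Israel_DEM": [(1, 0, 0), (1, 2, 0), (1, 10, 0)]}
defaults = {}

first = resolve_version("Israel_DEM", available, defaults)
print(first)   # (1, 10, 0)

# A newer version appears later; the saved default still wins.
available["Israel_DEM"].append((2, 0, 0))
second = resolve_version("Israel_DEM", available, defaults)
print(second)  # (1, 10, 0)
```

Saving the resolved version is what keeps results reproducible: adding a newer version to the project does not silently change what an existing script loads.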

# Set a default version
topo.setDataSourceDefaultVersion("Israel_DEM", version=(1, 0, 0))

Repositories

A Repository is a JSON file that describes a collection of data sources and their locations. Repositories make it easy to share and reproduce project setups — instead of manually adding data sources one by one, you load a repository file that configures everything at once.

# Register a repository
hera-project repository add my_repository.json

# Load it into a project
hera-project repository load my_repository MY_PROJECT

A repository JSON maps toolkit names to their data sources, configurations, and documents:

{
    "MeteoLowFreq": {
        "Config": { "defaultStation": "YAVNEEL" },
        "Datasource": {
            "YAVNEEL": {
                "isRelativePath": "True",
                "item": {
                    "resource": "data/yavneel.parquet",
                    "dataFormat": "parquet"
                }
            }
        }
    }
}
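
To get a feel for the structure, the sketch below reads the JSON above with Python's standard `json` module and lists the data sources it declares. The `list_data_sources` helper and the `/repos/wind` base directory are hypothetical, and resolving relative resources against the directory that holds the repository file is an assumption about the intent of `isRelativePath`, not a statement of what Hera does internally.

```python
import json
from pathlib import PurePosixPath

repository_text = """
{
    "MeteoLowFreq": {
        "Config": { "defaultStation": "YAVNEEL" },
        "Datasource": {
            "YAVNEEL": {
                "isRelativePath": "True",
                "item": {
                    "resource": "data/yavneel.parquet",
                    "dataFormat": "parquet"
                }
            }
        }
    }
}
"""


def list_data_sources(repository, base_dir):
    """Return (toolkit, name, resource, dataFormat) for every data source."""
    entries = []
    for toolkit_name, section in repository.items():
        for source_name, source in section.get("Datasource", {}).items():
            item = source["item"]
            resource = item["resource"]
            # The flag is stored as the string "True" in the example above.
            if str(source.get("isRelativePath", "False")).lower() == "true":
                # Assumption: relative paths are taken from `base_dir`.
                resource = str(PurePosixPath(base_dir) / resource)
            entries.append((toolkit_name, source_name, resource,
                            item["dataFormat"]))
    return entries


entries = list_data_sources(json.loads(repository_text), "/repos/wind")
print(entries)
# [('MeteoLowFreq', 'YAVNEEL', '/repos/wind/data/yavneel.parquet', 'parquet')]
```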

How it all fits together

You ask ToolkitHome for a toolkit by name
ToolkitHome finds and instantiates the right Toolkit class
The toolkit is bound to a Project, giving it access to MongoDB and the file system
You work with Data Sources through the toolkit — loading data, running analysis, creating visualizations
Repositories let you define and share entire project setups as JSON files
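
The first three steps amount to a registry lookup followed by construction. The sketch below shows that pattern in isolation, under the assumption that a name maps to a class and the class receives a project. Every name in it (`ToyToolkitHome`, `ToyProject`, `ToyMeteo`, `ToyTopography`, `register`, `get_toolkit`) is hypothetical and only loosely modelled on the `toolkitHome.getToolkit(...)` call shown earlier.

```python
class ToyProject:
    """Stand-in for a project: the gateway to metadata and files."""

    def __init__(self, name):
        self.name = name


class ToyToolkit:
    """Base class: every toolkit is bound to exactly one project."""

    def __init__(self, project):
        self.project = project


class ToyMeteo(ToyToolkit):
    pass


class ToyTopography(ToyToolkit):
    pass


class ToyToolkitHome:
    """Registry that maps toolkit names to toolkit classes."""

    def __init__(self):
        self._registry = {}

    def register(self, name, toolkit_cls):
        self._registry[name] = toolkit_cls

    def get_toolkit(self, name, project_name):
        # Step 1-2: find the class registered under `name`.
        try:
            toolkit_cls = self._registry[name]
        except KeyError:
            raise ValueError(f"Unknown toolkit: {name!r}") from None
        # Step 3: bind the new toolkit instance to a project.
        return toolkit_cls(ToyProject(project_name))


home = ToyToolkitHome()
home.register("MeteoLowFreq", ToyMeteo)
home.register("Topography", ToyTopography)

meteo = home.get_toolkit("MeteoLowFreq", "WindStudy")
print(type(meteo).__name__, meteo.project.name)  # ToyMeteo WindStudy
```

Registering a class under a new name is also the natural extension point for custom toolkits in a design like this one.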

For the full technical details, see the Core Concepts page in the Developer Guide.