Creating Nested Catalogs¶
Portolan supports hierarchical catalog structures where directories automatically become subcatalogs. This is useful for organizing large datasets by theme, region, or time period.
Quick Start¶
# Organize your data into themed directories
mkdir -p my-catalog/{climate,environment,housing}
cp climate-data/*.parquet my-catalog/climate/
cp env-data/*.parquet my-catalog/environment/
# Initialize and add everything
cd my-catalog
portolan init --auto --title "My Regional Data"
portolan add . --workers 4
# Add metadata and generate documentation
portolan metadata init
# Edit .portolan/metadata.yaml with your info
portolan readme --recursive
How Directory Structure Maps to STAC¶
Portolan infers the catalog hierarchy from your directory layout:
my-catalog/                        # Root catalog (catalog.json)
├── climate/                       # Subcatalog (climate/catalog.json)
│   ├── temperature/               # Collection (climate/temperature/collection.json)
│   │   └── temperature.parquet
│   └── precipitation/             # Collection
│       └── precipitation.parquet
└── demographics/                  # Subcatalog
    └── census-2020/               # Collection
        └── census.parquet
When you run portolan add ., Portolan:
- Creates catalog.json at the root with links to subcatalogs
- Creates catalog.json in each intermediate directory (subcatalogs)
- Creates collection.json + item metadata in leaf directories (collections)
- Generates versions.json for tracking at each level
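The directory-to-role mapping described above can be sketched in a few lines of Python. This is an illustration of the rule as stated here, not Portolan's implementation: the function name `classify_directories`, the `DATA_SUFFIXES` set, and the rule "a directory that directly contains data files is a collection" are all assumptions made for the example.

```python
import tempfile
from pathlib import Path

# Assumed set of data file extensions; Portolan's real list may differ.
DATA_SUFFIXES = {".parquet", ".tif", ".tiff"}


def classify_directories(root: Path) -> dict[str, str]:
    """Guess the STAC role of each directory under root from its contents."""
    roles = {".": "catalog"}  # the root always holds catalog.json
    for directory in sorted(p for p in root.rglob("*") if p.is_dir()):
        rel = directory.relative_to(root)
        if any(part.startswith(".") for part in rel.parts):
            continue  # skip hidden directories such as .portolan
        has_data = any(
            child.is_file() and child.suffix.lower() in DATA_SUFFIXES
            for child in directory.iterdir()
        )
        # Directories holding data files become collections; the rest are subcatalogs.
        roles[rel.as_posix()] = "collection" if has_data else "subcatalog"
    return roles


# Rebuild the example layout in a temporary directory and classify it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "my-catalog"
    for rel_path in (
        "climate/temperature/temperature.parquet",
        "climate/precipitation/precipitation.parquet",
        "demographics/census-2020/census.parquet",
    ):
        data_file = root / rel_path
        data_file.parent.mkdir(parents=True, exist_ok=True)
        data_file.touch()
    roles = classify_directories(root)

for path, role in roles.items():
    print(f"{path}: {role}")
```

Under these assumptions, `climate` and `demographics` come out as subcatalogs and the three leaf directories as collections, matching the tree above.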
Bulk Adding Files¶
Process many files efficiently with parallel workers:
portolan add . --workers 4 --verbose
The --verbose flag shows progress for each file. Without it, only changed/added files appear.
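As a rough mental model of what a worker pool does, the sketch below fans per-file work out across four threads with Python's standard `concurrent.futures`. It is not Portolan's code: `describe_file` is a hypothetical stand-in for metadata extraction, and Portolan may use processes or a different scheduling strategy.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import PurePosixPath


def describe_file(path: str) -> dict:
    """Hypothetical stand-in for per-file metadata extraction."""
    p = PurePosixPath(path)
    return {"collection": p.parent.name, "asset": p.name}


def describe_all(paths: list[str], workers: int = 4) -> list[dict]:
    """Process files in parallel; results keep the order of the input paths."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(describe_file, paths))


paths = [
    "climate/temperature/temperature.parquet",
    "climate/precipitation/precipitation.parquet",
    "demographics/census-2020/census.parquet",
]
results = describe_all(paths, workers=4)
for entry in results:
    print(entry)
```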
Metadata and READMEs¶
Setting Up Metadata¶
portolan metadata init
This creates .portolan/metadata.yaml with required fields (contact, license) and optional fields (citation, keywords, source URL, known issues).
Example:
contact:
  name: "Data Team"
  email: "data@example.org"
license: "CC-BY-4.0"
license_url: "https://creativecommons.org/licenses/by/4.0/"
keywords:
  - climate
  - regional data
  - open data
source_url: "https://data.example.org/"
processing_notes: "Converted from Shapefile to GeoParquet with Hilbert sorting."
known_issues: "Temporal extent not specified for most datasets."
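If you want a quick sanity check before generating READMEs, the required fields named above (contact, license) can be verified with a few lines of Python. This is a minimal sketch, not Portolan's own validation: it assumes the YAML has already been loaded into a dict (for example with PyYAML's `yaml.safe_load`), and `missing_required_fields` is a hypothetical helper name.

```python
# Required top-level fields, as listed in the text above.
REQUIRED_FIELDS = ("contact", "license")


def missing_required_fields(metadata: dict) -> list[str]:
    """Return the required fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not metadata.get(field)]


# The example metadata from above, written as the dict a YAML loader would produce.
metadata = {
    "contact": {"name": "Data Team", "email": "data@example.org"},
    "license": "CC-BY-4.0",
    "keywords": ["climate", "regional data", "open data"],
}

print(missing_required_fields(metadata))
print(missing_required_fields({"keywords": ["climate"]}))
```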
Generating READMEs¶
portolan readme --recursive
This generates README.md files at every level — root catalog, subcatalogs, and collections. Metadata from the root cascades down, so you only need to edit one metadata.yaml for consistent attribution across all READMEs.
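One way to picture the cascade is as a dictionary merge in which collection-level values, when present, win over root values. The sketch below assumes a shallow merge; whether Portolan merges nested keys such as `contact` field by field is not stated here, and `effective_metadata` is a hypothetical name.

```python
from typing import Optional


def effective_metadata(root: dict, override: Optional[dict] = None) -> dict:
    """Combine root metadata with optional collection-level overrides."""
    merged = dict(root)            # start from the root values
    merged.update(override or {})  # assumption: overrides replace whole keys
    return merged


root_meta = {
    "contact": {"name": "Data Team", "email": "data@example.org"},
    "license": "CC-BY-4.0",
}
collection_meta = {"known_issues": "Temporal extent not specified."}

merged = effective_metadata(root_meta, collection_meta)
print(merged["license"])
print(merged["known_issues"])
```

A collection with no metadata.yaml of its own simply inherits the root values unchanged.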
To preview without writing:
portolan readme --stdout
Validation¶
Check the catalog structure and data formats:
portolan check --verbose
This validates:
- STAC metadata completeness
- Cloud-native format compliance (GeoParquet, COG)
- Provisional datetime warnings (items without explicit dates)
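To illustrate the last check, the sketch below flags STAC items that carry no explicit date. It applies the general STAC item rule (either `datetime` is set, or both `start_datetime` and `end_datetime` are) and says nothing about how Portolan itself records a provisional datetime; the function names and sample items are hypothetical.

```python
def has_explicit_datetime(item: dict) -> bool:
    """True if the item has a datetime, or a full start/end range."""
    props = item.get("properties", {})
    if props.get("datetime"):
        return True
    return bool(props.get("start_datetime") and props.get("end_datetime"))


def items_missing_dates(items: list[dict]) -> list[str]:
    """Return the ids of items that would deserve a datetime warning."""
    return [
        item.get("id", "<no id>")
        for item in items
        if not has_explicit_datetime(item)
    ]


items = [
    {"id": "temperature", "properties": {"datetime": "2020-01-01T00:00:00Z"}},
    {"id": "precipitation", "properties": {"datetime": None}},
    {
        "id": "census",
        "properties": {
            "datetime": None,
            "start_datetime": "2020-01-01T00:00:00Z",
            "end_datetime": "2020-12-31T23:59:59Z",
        },
    },
]

print(items_missing_dates(items))
```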
Example: The Hague Open Data¶
A real-world example with 6 thematic subcatalogs and 23 collections:
den-haag/
├── catalog.json
├── climate/ # 3 collections: heat maps, climate scores
├── environment/ # 7 collections: air quality, noise, soil
├── housing/ # 1 collection: energy labels
├── infrastructure/ # 3 collections: waste, zones, storage
├── nature/ # 7 collections: species, habitats, trees
└── water/ # 2 collections: gauges, water bodies
Created with:
portolan init --auto --title "The Hague Open Data" \
--description "Municipal open data from Den Haag, Netherlands"
portolan add . --workers 4
portolan metadata init
# Edit .portolan/metadata.yaml
portolan readme --recursive
portolan check
Cloning Remote Nested Catalogs¶
Clone nested catalogs from object storage:
# Clone recursively discovers all subcatalogs and collections
portolan clone s3://bucket/nested-catalog ./local-copy --profile my-profile
Portolan automatically traverses each subcatalog's catalog.json to find the actual collections. For the my-catalog structure shown under "How Directory Structure Maps to STAC", clone would find:
- climate/temperature (collection)
- climate/precipitation (collection)
- demographics/census-2020 (collection)
The intermediate subcatalogs (climate/catalog.json, demographics/catalog.json) are traversed but are not treated as collections.
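The traversal can be sketched as a recursive walk over STAC `child` links. This is a minimal illustration, not Portolan's clone code: the catalog documents are held in an in-memory dict rather than fetched from object storage, and `find_collections` is a hypothetical name. The `type` values and `rel: "child"` links follow the STAC specification.

```python
import posixpath


def find_collections(documents: dict[str, dict], path: str = "catalog.json") -> list[str]:
    """Follow child links from a catalog and return the collection directories."""
    doc = documents[path]
    if doc.get("type") == "Collection":
        return [posixpath.dirname(path)]
    found = []
    for link in doc.get("links", []):
        if link.get("rel") != "child":
            continue  # ignore parent, root, self and item links
        # Resolve the href relative to the document that contains it.
        child = posixpath.normpath(posixpath.join(posixpath.dirname(path), link["href"]))
        found.extend(find_collections(documents, child))
    return found


# In-memory stand-in for the files of the example catalog.
documents = {
    "catalog.json": {"type": "Catalog", "links": [
        {"rel": "child", "href": "./climate/catalog.json"},
        {"rel": "child", "href": "./demographics/catalog.json"},
    ]},
    "climate/catalog.json": {"type": "Catalog", "links": [
        {"rel": "parent", "href": "../catalog.json"},
        {"rel": "child", "href": "./temperature/collection.json"},
        {"rel": "child", "href": "./precipitation/collection.json"},
    ]},
    "demographics/catalog.json": {"type": "Catalog", "links": [
        {"rel": "child", "href": "./census-2020/collection.json"},
    ]},
    "climate/temperature/collection.json": {"type": "Collection", "links": []},
    "climate/precipitation/collection.json": {"type": "Collection", "links": []},
    "demographics/census-2020/collection.json": {"type": "Collection", "links": []},
}

collections = find_collections(documents)
print(collections)
```

Only documents whose `type` is `Collection` end up in the result; catalogs are passed through on the way down.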
Restoring Missing Files¶
If you accidentally delete local data files, run pull with --restore to re-download them.
By default, the pull operation limits concurrency to 8 files × 4 chunks to avoid overwhelming home networks. Compare a normal pull with a restore pull:
# Normal pull - won't download if versions match
portolan pull s3://bucket/my-catalog
# Restore pull - re-downloads missing files even when versions match
portolan pull s3://bucket/my-catalog --restore
The --restore flag checks whether each asset exists locally and downloads any that are missing, regardless of version metadata.
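The difference between the two modes can be summarized as a small decision function. This is a sketch of the behavior described above, not Portolan's implementation: `plan_pull`, the version strings, and the rule that a version mismatch re-fetches every asset are assumptions for illustration.

```python
def plan_pull(remote_version, local_version, remote_assets, local_files, restore=False):
    """Return the asset paths a pull would download."""
    if remote_version != local_version:
        # Assumption: a version change means every asset is fetched again.
        return sorted(remote_assets)
    if restore:
        # Versions match, so only fetch assets that are missing locally.
        return sorted(set(remote_assets) - set(local_files))
    return []  # versions match and no restore requested: nothing to do


remote_assets = [
    "climate/temperature/temperature.parquet",
    "climate/precipitation/precipitation.parquet",
]
# precipitation.parquet was deleted locally by accident.
local_files = ["climate/temperature/temperature.parquet"]

normal = plan_pull("v2", "v2", remote_assets, local_files)
restored = plan_pull("v2", "v2", remote_assets, local_files, restore=True)
print(normal)
print(restored)
```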
Tips¶
Start flat, restructure later. You can reorganize directories and re-run portolan add . — Portolan regenerates the STAC hierarchy from the current structure.
One metadata.yaml for consistency. Root-level metadata cascades to all READMEs. Only create collection-level metadata.yaml files when you need overrides.
Use --workers for large catalogs. Parallel processing significantly speeds up metadata extraction for catalogs with many files.