Configuration¶
Portolan uses two configuration mechanisms:
.portolan/config.yaml— Non-sensitive settings (conversion, PMTiles, backend).envfile or environment variables — Credentials (remote, profile, region)
Quick Start¶
# Create .env file in catalog root (never pushed to remote)
cat > .env << 'EOF'
PORTOLAN_REMOTE=s3://my-bucket/catalog
PORTOLAN_PROFILE=production
PORTOLAN_REGION=us-west-2
EOF
# Verify configuration
portolan config list
Security
Credentials (remote, profile, region) cannot be stored in config.yaml because that file gets pushed to remote storage. This applies to both catalog-level and collection-level config (e.g., collections.demo.remote is also blocked). Use environment variables or a .env file in your catalog root instead. The .env file is automatically ignored and never uploaded.
Backend (Enterprise)¶
By default, Portolan uses a file-based backend (versions.json) for version tracking. For enterprise deployments requiring ACID transactions, distributed locking, and advanced versioning features, install the portolake plugin:
uv add portolake
# or: pip install portolake
Then configure the backend:
# .portolan/config.yaml
backend: iceberg
Or initialize a new catalog with the Iceberg backend:
portolan init --backend iceberg
Version Management Commands¶
With the Iceberg backend, additional commands become available:
# Show current version of a collection
portolan version current boundaries
# List all versions
portolan version list boundaries
# Rollback to a previous version (instant, uses Iceberg snapshots)
portolan version rollback boundaries 1.0.0
# Remove old versions, keeping N most recent
portolan version prune boundaries --keep 5
Backend-specific commands
The portolan version subcommands require the iceberg backend. Running them with the default file backend will display an error message.
See the portolake documentation for full setup instructions and enterprise features.
Setting Configuration¶
Credentials (via .env or environment)¶
Credentials are sensitive settings that cannot be stored in config.yaml:
# Option 1: .env file (recommended for local development)
cat > .env << 'EOF'
PORTOLAN_REMOTE=s3://my-bucket/catalog
PORTOLAN_PROFILE=production
PORTOLAN_REGION=us-west-2
EOF
# Option 2: Environment variables (for CI/CD)
export PORTOLAN_REMOTE=s3://my-bucket/catalog
export PORTOLAN_PROFILE=production
export PORTOLAN_REGION=us-west-2
# View current settings (reads from env/.env)
portolan config list
Other Settings (via config.yaml)¶
Non-sensitive settings are stored in .portolan/config.yaml:
# Set backend type
portolan config set backend iceberg
# Set conversion options
portolan config set conversion.extensions.preserve "['shp', 'gpkg']"
# View current settings
portolan config list
Configuration Precedence¶
Settings are resolved in this order (highest to lowest):
- CLI argument (
--remote s3://...) - Environment variable (
PORTOLAN_REMOTE=s3://...) - Collection config (in
collections:section) - Catalog config (top-level in config.yaml)
- Built-in default
Conversion Configuration¶
Control how Portolan handles different file formats during check and convert operations.
Use Cases¶
| Scenario | Configuration |
|---|---|
| Force-convert FlatGeobuf to GeoParquet | extensions.convert: [fgb] |
| Keep Shapefiles as-is | extensions.preserve: [shp] |
| Preserve everything in archive/ | paths.preserve: ["archive/**"] |
Full Example¶
# .portolan/config.yaml
# Note: remote/profile/region go in .env, not here
conversion:
extensions:
# Force-convert these cloud-native formats to GeoParquet
convert:
- fgb # FlatGeobuf
# Keep these formats as-is (don't convert)
preserve:
- shp # Shapefiles
- gpkg # GeoPackage
paths:
# Glob patterns for files to preserve regardless of format
preserve:
- "archive/**" # Everything in archive/
- "regulatory/*.shp" # Regulatory shapefiles
- "legacy/**" # Legacy data directory
Extension Overrides¶
extensions.convert¶
Force-convert cloud-native formats to GeoParquet. Use when:
- You want consistent columnar format for analytics
- Your tooling prefers GeoParquet over FlatGeobuf
conversion:
extensions:
convert:
- fgb # FlatGeobuf -> GeoParquet
extensions.preserve¶
Keep convertible formats as-is. Use when:
- Regulatory requirements mandate original format
- Downstream tools require specific formats
- You're preserving archival data
conversion:
extensions:
preserve:
- shp # Keep Shapefiles
- gpkg # Keep GeoPackage
- geojson # Keep GeoJSON
Path Patterns¶
Use glob patterns to override behavior for specific directories or files.
conversion:
paths:
preserve:
- "archive/**" # All files in archive/ and subdirectories
- "regulatory/*.shp" # Only .shp files in regulatory/
- "**/*.backup.geojson" # Any .backup.geojson file
Pattern syntax:
*matches any characters except/**matches any characters including/?matches any single character
Precedence: Path patterns override extension rules. A FlatGeobuf file in archive/ will be preserved even if extensions.convert: [fgb] is set.
COG Settings¶
Configure Cloud-Optimized GeoTIFF conversion parameters. By default, Portolan uses ADR-0019 defaults (DEFLATE compression, predictor=2, 512×512 tiles, nearest resampling).
conversion:
cog:
compression: JPEG # DEFLATE (default), JPEG, LZW, ZSTD, WEBP
quality: 95 # Quality 1-100 (applies to JPEG and WEBP)
tile_size: 512 # Internal tile size in pixels
predictor: 2 # 1=none, 2=horizontal (default), 3=floating point
resampling: nearest # Overview resampling: nearest, bilinear, cubic, etc.
generate_thumbnail: true # Auto-generate JPEG thumbnail (default: true)
thumbnail_max_size: 512 # Max dimension in pixels (default: 512)
thumbnail_quality: 75 # JPEG quality 1-100 (default: 75)
Validation
Invalid settings produce warnings but don't block conversion. Quality is clamped to 1-100, and unknown compression/resampling values are passed through to let rio-cogeo handle errors.
Thumbnails
When generate_thumbnail is enabled, a JPEG thumbnail is created next to each converted COG (e.g., data.tif → data.thumb.jpg). The thumbnail is automatically picked up by portolan scan with roles: ["thumbnail"], following STAC best practices.
Vector Settings¶
Configure spatial optimization for GeoParquet conversion. Uses geoparquet-io's fluent Table API for spatial indexing, sorting, and partitioning.
conversion:
vector:
spatial_index: h3 # h3 | quadkey | s2 | a5 | kdtree | none (default: none)
resolution: auto # auto | explicit int (default: auto)
sort: hilbert # hilbert | quadkey | none (default: none)
add_bbox: true # Add bbox struct column (default: false)
partition: false # Produce hive-partitioned output (default: false)
Resolution defaults
When resolution: auto, geoparquet-io uses sensible defaults per index type (H3: 9, Quadkey: 13, S2: 13, A5: 15, KD-tree: 9 iterations). Explicit values override these defaults.
Spatial Index Types¶
| Index | Description | Resolution Range |
|---|---|---|
h3 |
Uber H3 hexagonal cells | 0-15 (default: 9) |
quadkey |
Bing Maps tile IDs | 0-23 (default: 13) |
s2 |
Google S2 spherical cells | 0-30 (default: 13) |
a5 |
A5 hierarchical grid | 0-30 (default: 15) |
kdtree |
KD-tree balanced spatial splits | 1-20 iterations (default: 9) |
Use Cases¶
| Scenario | Configuration |
|---|---|
| Analytics queries (spatial filtering) | spatial_index: h3, add_bbox: true |
| Optimal row group statistics | sort: hilbert, add_bbox: true |
| Partitioned output for large files | spatial_index: kdtree, partition: true |
| Web map tiling (PMTiles input) | spatial_index: quadkey, sort: quadkey |
Partitioning vs Auto-Partitioning¶
conversion.vector.partition: true— Always produce hive-partitioned output during conversionpartitioning.enabled: true— Auto-partition files exceedingthreshold_gb(see Spatial Partitioning)
These are complementary. Use conversion.vector for consistent spatial optimization, partitioning for size-based auto-splitting.
Use Cases¶
| Scenario | Configuration |
|---|---|
| RGB imagery (smaller files) | compression: JPEG, quality: 95 |
| Elevation data (lossless) | compression: DEFLATE, predictor: 3 |
| Analytics (fast reads) | compression: LZW, tile_size: 256 |
| Disable thumbnails | generate_thumbnail: false |
| Large thumbnails for preview | thumbnail_max_size: 1024, thumbnail_quality: 90 |
Available Compression Methods¶
| Method | Best For | Notes |
|---|---|---|
DEFLATE |
General use (default) | Lossless, universal compatibility |
LZW |
Fast compression/decompression | Lossless, slightly larger files |
ZSTD |
High compression ratio | Lossless, requires GDAL 2.3+ |
JPEG |
RGB imagery | Lossy, smallest files for photos |
WEBP |
Web display | Lossy, modern browsers only |
PMTiles Generation¶
Generate vector tile overviews from GeoParquet assets for efficient web map rendering.
# .portolan/config.yaml
pmtiles.enabled: true # Auto-generate during add (default: false)
pmtiles.min_zoom: 0 # Minimum zoom level (default: auto-detect)
pmtiles.max_zoom: 14 # Maximum zoom level (default: auto-detect)
pmtiles.precision: 6 # Coordinate decimal precision (default: 6)
pmtiles.layer: boundaries # Layer name in output (default: filename)
pmtiles.attribution: "© OpenStreetMap contributors"
External dependency
PMTiles generation requires tippecanoe installed and in PATH:
- macOS:
brew install tippecanoe - Ubuntu:
apt install tippecanoe
Also requires the optional pmtiles extra: pip install portolan-cli[pmtiles]
Commands¶
# Generate PMTiles during add
portolan add boundaries/ --pmtiles
# Force regeneration even if up-to-date
portolan add boundaries/ --pmtiles --force-pmtiles
# Check for missing PMTiles (produces warning, not error)
portolan check
How It Works¶
- Uses gpio-pmtiles wrapper around tippecanoe
- PMTiles stored alongside source GeoParquet (e.g.,
data.parquet→data.pmtiles) - Registered as collection-level asset with role
["overview"] - Tracked in
versions.jsonfor push - Skips regeneration if PMTiles newer than source (mtime check)
Settings Reference¶
| Setting | Default | Description |
|---|---|---|
pmtiles.enabled |
false |
Auto-generate during add command |
pmtiles.min_zoom |
auto | Minimum zoom level (tippecanoe default: 0) |
pmtiles.max_zoom |
auto | Maximum zoom level (tippecanoe default: 14) |
pmtiles.layer |
filename | Layer name in PMTiles output |
pmtiles.precision |
6 |
Coordinate decimal precision |
pmtiles.attribution |
gpio default | Attribution HTML for tiles |
pmtiles.bbox |
none | Bounding box filter: "minx,miny,maxx,maxy" |
pmtiles.where |
none | SQL WHERE clause for filtering features |
pmtiles.include_cols |
all | Comma-separated columns to include in tiles |
pmtiles.src_crs |
metadata | Override source CRS if metadata is incorrect |
Filtering Example¶
# Only include specific columns in tiles (reduces file size)
pmtiles.include_cols: "name,population,geometry"
# Filter features with SQL WHERE clause
pmtiles.where: "population > 10000"
# Clip to bounding box (minx,miny,maxx,maxy)
pmtiles.bbox: "-122.5,37.5,-122.0,38.0"
When to Use¶
- Web map applications requiring fast tile rendering
- Collections with GeoParquet assets intended for visual display
- When
portolan checkwarns about missing PMTiles
PMTiles are optional
PMTiles are derivatives for rendering, not the canonical data format. GeoParquet remains the source of truth. Missing PMTiles produce a validation warning, not an error.
Spatial Partitioning¶
Split large GeoParquet files into spatially-organized partitions for better query performance. Per OGC best practices, files over 2GB should be partitioned.
# .portolan/config.yaml
partitioning.enabled: true # Enable auto-partitioning during add (default: true)
partitioning.prompt: true # Ask before partitioning in interactive mode (default: true)
partitioning.threshold_gb: 2 # Size threshold in GB (default: 2.0)
partitioning.strategy: kdtree # Partitioning strategy (default: kdtree)
partitioning.target_rows: 120000 # Target rows per partition (default: 120,000)
With partitioning.enabled: true, large files are automatically partitioned during portolan add:
$ portolan add large-dataset.parquet
Found 1 file(s) exceeding 2.0 GB threshold:
large-dataset.parquet (4.23 GB)
Partition large files into spatial chunks? [Y/n] y
Set partitioning.prompt: false to partition without asking.
Commands¶
# Preview partition strategy without creating files
portolan partition buildings.parquet --preview
# Partition with default settings (kdtree, 120k rows/partition)
portolan partition buildings.parquet output/
# Custom target rows
portolan partition data.parquet output/ --target-rows 50000
How It Works¶
- Uses geoparquet-io KD-tree partitioning
- Creates Hive-style directory structure per ADR-0031
- Each partition becomes a STAC Item with its own bbox
- Collection gets a glob asset for bulk access (e.g.,
s3://bucket/collection/*.parquet)
Output Structure¶
collection/
├── collection.json # Glob asset for bulk access
├── kdtree_cell=001/
│ ├── item.json # STAC Item with partition bbox
│ └── data.parquet
├── kdtree_cell=002/
│ ├── item.json
│ └── data.parquet
└── ...
Settings Reference¶
| Setting | Default | Description |
|---|---|---|
partitioning.enabled |
true |
Enable auto-partitioning during portolan add |
partitioning.prompt |
true |
Ask before partitioning in interactive mode |
partitioning.threshold_gb |
2.0 |
File size threshold in GB |
partitioning.strategy |
kdtree |
Spatial partitioning strategy |
partitioning.target_rows |
120000 |
Target rows per partition |
Why KD-tree?
KD-tree is data-driven: partitions adapt to actual feature density, producing balanced partition sizes. Grid-based strategies (H3, S2, quadkey) are planned but not yet implemented.
STAC GeoParquet Settings¶
Generate items.parquet for collections with many items, enabling efficient spatial/temporal queries without N HTTP requests.
# .portolan/config.yaml
parquet.enabled: true # Auto-generate during add (default: false)
parquet.threshold: 100 # Hint when items exceed threshold (default: 100)
Flat key syntax
Config keys use dot notation as literal keys (e.g., parquet.enabled), not nested YAML mappings.
Commands¶
# Generate items.parquet for a collection
portolan stac-geoparquet -c eurosat
# Preview without creating files
portolan stac-geoparquet -c eurosat --dry-run
# Auto-generate during add
portolan add imagery/ --stac-geoparquet
How It Works¶
- Uses stac-geoparquet library
- Adds
items.parquetas a collection-level asset (per ADR-0031) and link withrel: items - Enables spatial filtering with a single HTTP request (vs N requests for items)
| Setting | Default | Description |
|---|---|---|
parquet.enabled |
false |
Auto-generate during add command |
parquet.threshold |
100 |
Show hint when items exceed threshold |
When to Use¶
- Collections with >100 items (e.g., satellite imagery time series)
- Raster collections with many scenes
- Partitioned vector datasets
Known Limitation
For existing catalogs with thousands of items, push after generating items.parquet may be slow (#329). This affects incremental updates to large catalogs. New catalogs and small catalogs work normally.
Collection-Level Configuration¶
Override settings for specific collections using the collections: section:
# .portolan/config.yaml
# Note: collection-level credential overrides go in .env
# PORTOLAN_REMOTE and PORTOLAN_PROFILE are catalog-wide
collections:
analytics:
conversion:
extensions:
convert: [fgb] # Force GeoParquet for analytics queries
archive:
conversion:
extensions:
preserve: [shp, gpkg, geojson] # Preserve all original formats
This approach works well for most catalogs. For large catalogs with many collections, see Hierarchical Configuration below.
Hierarchical Configuration (Optional)¶
For large catalogs or when different maintainers manage different collections, you can optionally create .portolan/ folders at collection or subcatalog levels:
catalog/
.portolan/
config.yaml # Catalog defaults
demographics/
.portolan/
config.yaml # Collection-specific overrides (optional)
collection.json
historical/ # Subcatalog
.portolan/
config.yaml # Subcatalog defaults (optional)
census-1990/
collection.json
This is entirely optional. Benefits include:
- Scalability: Avoids one giant config file with 100+ collection entries
- Ownership: Collection maintainers edit their own folder without touching root
- Git-friendly: Changes to one collection don't create merge conflicts in root
Inheritance Rules¶
Settings are inherited from parent levels. Child values override parent values:
# catalog/.portolan/config.yaml
backend: file
pmtiles.enabled: true
# catalog/demographics/.portolan/config.yaml
pmtiles.enabled: false # Overrides parent (no PMTiles for this collection)
# backend inherited from catalog
Credentials are catalog-wide
Credential settings (remote, profile, region) are set via .env at catalog root and apply to all collections. Per-collection credential overrides are not supported.
Precedence¶
When both approaches are used, folder config takes precedence over collections: section:
CLI > Env var > Collection folder config > Subcatalog folder config >
Root collections: section > Catalog config > Default
When to Use Each Approach¶
| Approach | Best For |
|---|---|
collections: section |
Small catalogs, simple overrides |
| Hierarchical folders | Large catalogs, multiple maintainers, verbose metadata |
Most users should start with collections: and only add per-collection .portolan/ folders when needed
Environment Variables¶
Credential Settings (required via env/.env)¶
These sensitive settings must use environment variables or .env files—they cannot be stored in config.yaml:
| Setting | Environment Variable | Notes |
|---|---|---|
remote |
PORTOLAN_REMOTE |
S3/GCS/Azure URL |
aws_profile |
PORTOLAN_AWS_PROFILE |
AWS credential profile |
profile |
PORTOLAN_PROFILE |
Alias for aws_profile |
region |
PORTOLAN_REGION |
AWS region for S3 |
Other Settings (optional via env)¶
Non-sensitive settings can also be set via environment variables, which override config.yaml:
| Setting | Environment Variable |
|---|---|
backend |
PORTOLAN_BACKEND |
pmtiles.enabled |
PORTOLAN_PMTILES_ENABLED |
Precedence: CLI arguments > Environment variables > config.yaml > Defaults
Setting Aliases¶
Some settings have aliases for convenience:
| Canonical Name | Alias |
|---|---|
aws_profile |
profile |
Both PORTOLAN_AWS_PROFILE and PORTOLAN_PROFILE environment variables work interchangeably.
Note
Aliases apply to environment variables only. Credential settings (aws_profile, profile, remote, region) cannot be stored in config files per the sensitive-settings rule.
Metadata Enrichment¶
In addition to config.yaml, Portolan supports .portolan/metadata.yaml for human-enrichable metadata that supplements STAC.
Purpose¶
STAC provides machine-extractable metadata (title, description, extent, columns). metadata.yaml adds human-only fields that can't be derived automatically:
| Field | Purpose |
|---|---|
contact |
Accountability (name, email) |
license |
SPDX identifier (e.g., CC-BY-4.0, MIT) |
citation |
Academic citation text |
doi |
Zenodo/DataCite DOI |
known_issues |
Data quality caveats |
source_url |
Link to original data source |
processing_notes |
Documentation of transformations applied |
keywords |
Tags for search/discovery (rendered as badges) |
attribution |
Credit to data provider or organization |
authors |
List of authors with name, optional ORCID and email |
related_dois |
List of related DOIs for linked publications |
citations |
List of citation strings for referencing |
upstream_version |
Version string of upstream data source |
upstream_version_url |
URL to upstream version (e.g., Zenodo record) |
Quick Start¶
# Generate template
portolan metadata init
# Validate required fields
portolan metadata validate
# Generate README from STAC + metadata
portolan readme
Example¶
# .portolan/metadata.yaml
contact:
name: Data Team
email: data@example.org
license: CC-BY-4.0
# Optional enrichment fields
license_url: https://creativecommons.org/licenses/by/4.0/
citation: "Census Bureau (2024). Demographics Dataset. DOI: 10.5281/zenodo.1234567"
doi: 10.5281/zenodo.1234567
known_issues: "Coverage gaps in rural areas for 2020 data."
# Provenance and discovery
source_url: https://data.census.gov/demographics
processing_notes: |
- Reprojected from NAD83 to EPSG:4326
- Simplified geometries for web display
- Joined with income data from ACS 2020
keywords:
- census
- demographics
- population
attribution: "U.S. Census Bureau"
# Author and citation metadata
authors:
- name: Jane Doe
orcid: 0000-0001-2345-6789
email: jane.doe@university.edu
- name: John Smith
related_dois:
- 10.5281/zenodo.1234567
- 10.1000/related-paper
citations:
- "Doe, J. (2024). Census Analysis Methods. J. Demographics, 1(1), 1-10."
upstream_version: "2024.1"
upstream_version_url: https://data.census.gov/releases/2024.1
Required Fields¶
Only two fields are required in metadata.yaml:
contact.nameandcontact.email- Who maintains this datalicense- SPDX identifier (validated against common licenses)
Title and description come from STAC metadata (set during portolan init).
Hierarchical Inheritance¶
Like config.yaml, metadata.yaml supports hierarchical resolution:
catalog/
.portolan/
metadata.yaml # Default contact and license
demographics/
.portolan/
metadata.yaml # Override or add collection-specific fields
Child values override parent values. Use this to set catalog-wide defaults (license, contact) while adding collection-specific fields (known_issues, citation).
README Generation¶
The portolan readme command generates README.md by combining:
From STAC (automatic):
- Title, description
- Spatial/temporal coverage
- Schema columns (from table:columns)
- Bands (from eo:bands, raster:bands)
- Files with checksums
- Code examples based on format
From metadata.yaml (human): - License, contact - Authors (with ORCID links) - Citation, DOI, related DOIs - Upstream version (with optional URL) - Known issues - Source URL, processing notes - Keywords (as shields.io badges with proper URL encoding) - Attribution
# Generate README.md
portolan readme
# Preview without writing
portolan readme --stdout
# Check if README is up-to-date (for CI)
portolan readme --check
# Generate for catalog and all collections
portolan readme --recursive
Catalog-level README: When run at catalog root, generates an index README with: - Aggregated spatial extent (envelope of all collections) - Aggregated temporal extent (earliest to latest) - List of collections with links
Data Defaults¶
When source files lack certain metadata (nodata values, temporal info), you can specify defaults in metadata.yaml:
# .portolan/metadata.yaml
defaults:
temporal:
year: 2025 # Items default to 2025-01-01
# Or explicit bounds:
# start: "2025-04-15"
# end: "2025-05-30"
raster:
nodata: 0 # Uniform nodata for all bands
# Or per-band:
# nodata: [0, 0, 255]
Behavior:
| Scenario | Result |
|---|---|
| Source file has value | File value used (defaults don't override) |
| Source file lacks value | Default applied |
| CLI flag provided | CLI flag overrides default |
| No default, no source value | Field left null |
Validation:
temporal.yearmust be an integer between 1800 and 2100temporal.start/temporal.endmust be valid ISO dates (YYYY-MM-DD)- Specifying both
yearandstartis an error (use one or the other) raster.nodatamust be a finite number (no NaN or Infinity)- Per-band nodata lists must match the raster's band count exactly
See the Metadata Defaults Guide for detailed usage.