Configuration¶

Portolan uses two configuration mechanisms:

.portolan/config.yaml — Non-sensitive settings (conversion, PMTiles, backend)
.env file or environment variables — Credentials (remote, profile, region)

Quick Start¶

# Create .env file in catalog root (never pushed to remote)
cat > .env << 'EOF'
PORTOLAN_REMOTE=s3://my-bucket/catalog
PORTOLAN_PROFILE=production
PORTOLAN_REGION=us-west-2
EOF

# Verify configuration
portolan config list

Security

Credentials (remote, profile, region) cannot be stored in config.yaml because that file gets pushed to remote storage. This applies to both catalog-level and collection-level config (e.g., collections.demo.remote is also blocked). Use environment variables or a .env file in your catalog root instead. The .env file is automatically ignored and never uploaded.

Backend (Enterprise)¶

By default, Portolan uses a file-based backend (versions.json) for version tracking. For enterprise deployments requiring ACID transactions, distributed locking, and advanced versioning features, install the portolake plugin:

uv add portolake
# or: pip install portolake

Then configure the backend:

# .portolan/config.yaml
backend: iceberg

Or initialize a new catalog with the Iceberg backend:

portolan init --backend iceberg

Version Management Commands¶

With the Iceberg backend, additional commands become available:

# Show current version of a collection
portolan version current boundaries

# List all versions
portolan version list boundaries

# Rollback to a previous version (instant, uses Iceberg snapshots)
portolan version rollback boundaries 1.0.0

# Remove old versions, keeping N most recent
portolan version prune boundaries --keep 5

Backend-specific commands

The portolan version subcommands require the iceberg backend. Running them with the default file backend will display an error message.

See the portolake documentation for full setup instructions and enterprise features.

Setting Configuration¶

Credentials (via .env or environment)¶

Credentials are sensitive settings that cannot be stored in config.yaml:

# Option 1: .env file (recommended for local development)
cat > .env << 'EOF'
PORTOLAN_REMOTE=s3://my-bucket/catalog
PORTOLAN_PROFILE=production
PORTOLAN_REGION=us-west-2
EOF

# Option 2: Environment variables (for CI/CD)
export PORTOLAN_REMOTE=s3://my-bucket/catalog
export PORTOLAN_PROFILE=production
export PORTOLAN_REGION=us-west-2

# View current settings (reads from env/.env)
portolan config list

Other Settings (via config.yaml)¶

Non-sensitive settings are stored in .portolan/config.yaml:

# Set backend type
portolan config set backend iceberg

# Set conversion options
portolan config set conversion.extensions.preserve "['shp', 'gpkg']"

# View current settings
portolan config list

Configuration Precedence¶

Settings are resolved in this order (highest to lowest):

CLI argument (--remote s3://...)
Environment variable (PORTOLAN_REMOTE=s3://...)
Collection config (in collections: section)
Catalog config (top-level in config.yaml)
Built-in default
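
The resolution order above can be sketched as a first-match lookup across layers. This is a minimal illustration, not Portolan's actual implementation: `resolve_setting` and its parameters are hypothetical names, and the `env` mapping is assumed to already hold values keyed by setting name (for example `remote` rather than `PORTOLAN_REMOTE`).

```python
from typing import Any, Mapping, Optional


def resolve_setting(
    key: str,
    cli: Optional[Mapping[str, Any]] = None,
    env: Optional[Mapping[str, Any]] = None,
    collection: Optional[Mapping[str, Any]] = None,
    catalog: Optional[Mapping[str, Any]] = None,
    defaults: Optional[Mapping[str, Any]] = None,
) -> Any:
    """Return the value from the highest-priority layer that defines `key`."""
    # Layers are checked from highest to lowest precedence.
    for layer in (cli, env, collection, catalog, defaults):
        if layer is not None and key in layer:
            return layer[key]
    return None  # hypothetical choice: an unset key resolves to None


# Usage: the environment value wins over the catalog config.
backend = resolve_setting(
    "backend",
    env={"backend": "iceberg"},
    catalog={"backend": "file"},
    defaults={"backend": "file"},
)
```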

Conversion Configuration¶

Control how Portolan handles different file formats during check and convert operations.

Use Cases¶

Scenario	Configuration
Force-convert FlatGeobuf to GeoParquet	`extensions.convert: [fgb]`
Keep Shapefiles as-is	`extensions.preserve: [shp]`
Preserve everything in archive/	`paths.preserve: ["archive/**"]`

Full Example¶

# .portolan/config.yaml
# Note: remote/profile/region go in .env, not here

conversion:
  extensions:
    # Force-convert these cloud-native formats to GeoParquet
    convert:
      - fgb      # FlatGeobuf

    # Keep these formats as-is (don't convert)
    preserve:
      - shp      # Shapefiles
      - gpkg     # GeoPackage

  paths:
    # Glob patterns for files to preserve regardless of format
    preserve:
      - "archive/**"           # Everything in archive/
      - "regulatory/*.shp"     # Regulatory shapefiles
      - "legacy/**"            # Legacy data directory

Extension Overrides¶

`extensions.convert`¶

Force-convert cloud-native formats to GeoParquet. Use when:

You want consistent columnar format for analytics
Your tooling prefers GeoParquet over FlatGeobuf

conversion:
  extensions:
    convert:
      - fgb       # FlatGeobuf -> GeoParquet

`extensions.preserve`¶

Keep convertible formats as-is. Use when:

Regulatory requirements mandate original format
Downstream tools require specific formats
You're preserving archival data

conversion:
  extensions:
    preserve:
      - shp       # Keep Shapefiles
      - gpkg      # Keep GeoPackage
      - geojson   # Keep GeoJSON

Path Patterns¶

Use glob patterns to override behavior for specific directories or files.

conversion:
  paths:
    preserve:
      - "archive/**"           # All files in archive/ and subdirectories
      - "regulatory/*.shp"     # Only .shp files in regulatory/
      - "**/*.backup.geojson"  # Any .backup.geojson file

Pattern syntax:

* matches any characters except /
** matches any characters including /
? matches any single character

Precedence: Path patterns override extension rules. A FlatGeobuf file in archive/ will be preserved even if extensions.convert: [fgb] is set.
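
The pattern syntax and precedence rule can be sketched in Python as follows. The helper names (`glob_to_regex`, `decide_action`) are hypothetical, and the matching is a simplified reading of the three rules listed above rather than Portolan's own matcher. One further assumption: when an extension appears in both lists, this sketch checks `preserve` first.

```python
import re
from pathlib import PurePosixPath


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a pattern using the documented rules for *, ** and ?."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")       # ** matches any characters including /
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")    # * matches any characters except /
            i += 1
        elif pattern[i] == "?":
            parts.append(".")        # ? matches any single character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def decide_action(path, convert_exts, preserve_exts, preserve_paths):
    """Return 'preserve', 'convert', or 'default' for a relative POSIX path."""
    # Path patterns are checked first, so they override extension rules.
    if any(glob_to_regex(p).fullmatch(path) for p in preserve_paths):
        return "preserve"
    ext = PurePosixPath(path).suffix.lstrip(".").lower()
    if ext in preserve_exts:
        return "preserve"
    if ext in convert_exts:
        return "convert"
    return "default"
```

With `convert_exts={"fgb"}` and `preserve_paths=["archive/**"]`, a call such as `decide_action("archive/2020/roads.fgb", ...)` returns `"preserve"`, while the same file outside `archive/` returns `"convert"`.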

COG Settings¶

Configure Cloud-Optimized GeoTIFF conversion parameters. Unless overridden, Portolan applies the ADR-0019 defaults (DEFLATE compression, predictor=2, 512×512 tiles, nearest resampling).

conversion:
  cog:
    compression: JPEG      # DEFLATE (default), JPEG, LZW, ZSTD, WEBP
    quality: 95            # Quality 1-100 (applies to JPEG and WEBP)
    tile_size: 512         # Internal tile size in pixels
    predictor: 2           # 1=none, 2=horizontal (default), 3=floating point
    resampling: nearest    # Overview resampling: nearest, bilinear, cubic, etc.
    generate_thumbnail: true   # Auto-generate JPEG thumbnail (default: true)
    thumbnail_max_size: 512    # Max dimension in pixels (default: 512)
    thumbnail_quality: 75      # JPEG quality 1-100 (default: 75)

Validation

Invalid settings produce warnings but don't block conversion. Quality is clamped to 1-100, and unknown compression/resampling values are passed through to let rio-cogeo handle errors.
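
As a rough illustration of this warn-and-continue behavior, the sketch below clamps `quality` and lets unknown compression values pass through unchanged. The function name `normalize_cog_settings` and the warning messages are hypothetical; only the clamping range and the pass-through rule come from the text above.

```python
import warnings

# Compression methods listed in this document.
KNOWN_COMPRESSION = {"DEFLATE", "JPEG", "LZW", "ZSTD", "WEBP"}


def normalize_cog_settings(settings: dict) -> dict:
    """Return a copy of `settings` with quality clamped to 1-100."""
    result = dict(settings)

    quality = result.get("quality")
    if quality is not None:
        clamped = max(1, min(100, int(quality)))
        if clamped != quality:
            warnings.warn(f"quality {quality} is out of range; using {clamped}")
        result["quality"] = clamped

    compression = result.get("compression")
    if compression is not None and str(compression).upper() not in KNOWN_COMPRESSION:
        # Unknown values are not rejected here; the downstream tool decides.
        warnings.warn(f"unknown compression {compression!r}; passing it through")

    return result
```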

Thumbnails

When generate_thumbnail is enabled, a JPEG thumbnail is created next to each converted COG (e.g., data.tif → data.thumb.jpg). The thumbnail is automatically picked up by portolan scan with roles: ["thumbnail"], following STAC best practices.

Vector Settings¶

Configure spatial optimization for GeoParquet conversion. Uses geoparquet-io's fluent Table API for spatial indexing, sorting, and partitioning.

conversion:
  vector:
    spatial_index: h3     # h3 | quadkey | s2 | a5 | kdtree | none (default: none)
    resolution: auto      # auto | explicit int (default: auto)
    sort: hilbert         # hilbert | quadkey | none (default: none)
    add_bbox: true        # Add bbox struct column (default: false)
    partition: false      # Produce hive-partitioned output (default: false)

Resolution defaults

When resolution: auto, geoparquet-io uses sensible defaults per index type (H3: 9, Quadkey: 13, S2: 13, A5: 15, KD-tree: 9 iterations). Explicit values override these defaults.

Spatial Index Types¶

Index	Description	Resolution Range
`h3`	Uber H3 hexagonal cells	0-15 (default: 9)
`quadkey`	Bing Maps tile IDs	0-23 (default: 13)
`s2`	Google S2 spherical cells	0-30 (default: 13)
`a5`	A5 hierarchical grid	0-30 (default: 15)
`kdtree`	KD-tree balanced spatial splits	1-20 iterations (default: 9)
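
The defaults and ranges in this table can be expressed as a small lookup. This is a sketch under assumptions: `resolve_resolution` is a hypothetical helper, and raising `ValueError` for out-of-range values is this example's choice, not documented geoparquet-io behavior.

```python
# Default resolution and allowed range per index type, copied from the table.
DEFAULT_RESOLUTION = {"h3": 9, "quadkey": 13, "s2": 13, "a5": 15, "kdtree": 9}
RESOLUTION_RANGE = {
    "h3": (0, 15),
    "quadkey": (0, 23),
    "s2": (0, 30),
    "a5": (0, 30),
    "kdtree": (1, 20),  # iterations rather than a cell resolution
}


def resolve_resolution(index: str, resolution="auto") -> int:
    """Map `resolution: auto` to the per-index default, or validate an explicit value."""
    if index not in DEFAULT_RESOLUTION:
        raise ValueError(f"unknown spatial index: {index!r}")
    if resolution == "auto":
        return DEFAULT_RESOLUTION[index]
    value = int(resolution)
    low, high = RESOLUTION_RANGE[index]
    if not low <= value <= high:
        raise ValueError(f"{index} resolution must be between {low} and {high}")
    return value
```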

Use Cases¶

Scenario	Configuration
Analytics queries (spatial filtering)	`spatial_index: h3`, `add_bbox: true`
Optimal row group statistics	`sort: hilbert`, `add_bbox: true`
Partitioned output for large files	`spatial_index: kdtree`, `partition: true`
Web map tiling (PMTiles input)	`spatial_index: quadkey`, `sort: quadkey`

Partitioning vs Auto-Partitioning¶

conversion.vector.partition: true — Always produce hive-partitioned output during conversion
partitioning.enabled: true — Auto-partition files exceeding threshold_gb (see Spatial Partitioning)

These are complementary. Use conversion.vector for consistent spatial optimization, partitioning for size-based auto-splitting.

Use Cases (COG)¶

Scenario	Configuration
RGB imagery (smaller files)	`compression: JPEG`, `quality: 95`
Elevation data (lossless)	`compression: DEFLATE`, `predictor: 3`
Analytics (fast reads)	`compression: LZW`, `tile_size: 256`
Disable thumbnails	`generate_thumbnail: false`
Large thumbnails for preview	`thumbnail_max_size: 1024`, `thumbnail_quality: 90`

Available Compression Methods¶

Method	Best For	Notes
`DEFLATE`	General use (default)	Lossless, universal compatibility
`LZW`	Fast compression/decompression	Lossless, slightly larger files
`ZSTD`	High compression ratio	Lossless, requires GDAL 2.3+
`JPEG`	RGB imagery	Lossy, smallest files for photos
`WEBP`	Web display	Lossy, modern browsers only

PMTiles Generation¶

Generate vector tile overviews from GeoParquet assets for efficient web map rendering.

# .portolan/config.yaml
pmtiles.enabled: true     # Auto-generate during add (default: false)
pmtiles.min_zoom: 0       # Minimum zoom level (default: auto-detect)
pmtiles.max_zoom: 14      # Maximum zoom level (default: auto-detect)
pmtiles.precision: 6      # Coordinate decimal precision (default: 6)
pmtiles.layer: boundaries # Layer name in output (default: filename)
pmtiles.attribution: "© OpenStreetMap contributors"

External dependency

PMTiles generation requires tippecanoe installed and in PATH:

macOS: brew install tippecanoe
Ubuntu: apt install tippecanoe

Also requires the optional pmtiles extra: pip install portolan-cli[pmtiles]

Commands¶

# Generate PMTiles during add
portolan add boundaries/ --pmtiles

# Force regeneration even if up-to-date
portolan add boundaries/ --pmtiles --force-pmtiles

# Check for missing PMTiles (produces warning, not error)
portolan check

How It Works¶

Uses gpio-pmtiles wrapper around tippecanoe
PMTiles stored alongside source GeoParquet (e.g., data.parquet → data.pmtiles)
Registered as collection-level asset with role ["overview"]
Tracked in versions.json for push
Skips regeneration if PMTiles newer than source (mtime check)
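
The mtime check in the last step can be sketched as below. `pmtiles_is_stale` is a hypothetical helper name; the only behavior taken from the list above is that regeneration is skipped when the PMTiles file is newer than its source.

```python
import os


def pmtiles_is_stale(source_path: str, pmtiles_path: str) -> bool:
    """Return True when the PMTiles file is missing or not newer than its source."""
    if not os.path.exists(pmtiles_path):
        return True
    # Regeneration is skipped only when the PMTiles file is strictly newer.
    return os.path.getmtime(pmtiles_path) <= os.path.getmtime(source_path)
```

A caller would compare `data.parquet` against `data.pmtiles` and regenerate only when this returns `True` (or when a force flag is given).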

Settings Reference¶

Setting	Default	Description
`pmtiles.enabled`	`false`	Auto-generate during `add` command
`pmtiles.min_zoom`	auto	Minimum zoom level (tippecanoe default: 0)
`pmtiles.max_zoom`	auto	Maximum zoom level (tippecanoe default: 14)
`pmtiles.layer`	filename	Layer name in PMTiles output
`pmtiles.precision`	`6`	Coordinate decimal precision
`pmtiles.attribution`	gpio default	Attribution HTML for tiles
`pmtiles.bbox`	none	Bounding box filter: `"minx,miny,maxx,maxy"`
`pmtiles.where`	none	SQL WHERE clause for filtering features
`pmtiles.include_cols`	all	Comma-separated columns to include in tiles
`pmtiles.src_crs`	metadata	Override source CRS if metadata is incorrect

Filtering Example¶

# Only include specific columns in tiles (reduces file size)
pmtiles.include_cols: "name,population,geometry"

# Filter features with SQL WHERE clause
pmtiles.where: "population > 10000"

# Clip to bounding box (minx,miny,maxx,maxy)
pmtiles.bbox: "-122.5,37.5,-122.0,38.0"

When to Use¶

Web map applications requiring fast tile rendering
Collections with GeoParquet assets intended for visual display
When portolan check warns about missing PMTiles

PMTiles are optional

PMTiles are derivatives for rendering, not the canonical data format. GeoParquet remains the source of truth. Missing PMTiles produce a validation warning, not an error.

Spatial Partitioning¶

Split large GeoParquet files into spatially-organized partitions for better query performance. Per OGC best practices, files over 2GB should be partitioned.

# .portolan/config.yaml
partitioning.enabled: true       # Enable auto-partitioning during add (default: true)
partitioning.prompt: true        # Ask before partitioning in interactive mode (default: true)
partitioning.threshold_gb: 2     # Size threshold in GB (default: 2.0)
partitioning.strategy: kdtree    # Partitioning strategy (default: kdtree)
partitioning.target_rows: 120000 # Target rows per partition (default: 120,000)

With partitioning.enabled: true, large files are automatically partitioned during portolan add:

$ portolan add large-dataset.parquet

Found 1 file(s) exceeding 2.0 GB threshold:
  large-dataset.parquet (4.23 GB)

Partition large files into spatial chunks? [Y/n] y

Set partitioning.prompt: false to partition without asking.
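
The size check behind this prompt can be sketched as a simple filter. The helper name `files_over_threshold` is hypothetical, and the example assumes 1 GB = 1024³ bytes; the document does not say which definition Portolan uses.

```python
BYTES_PER_GB = 1024 ** 3  # assumption: binary gigabytes


def files_over_threshold(sizes_in_bytes: dict, threshold_gb: float = 2.0) -> list:
    """Return (name, size_in_gb) pairs for files strictly larger than the threshold."""
    limit = threshold_gb * BYTES_PER_GB
    return [
        (name, size / BYTES_PER_GB)
        for name, size in sizes_in_bytes.items()
        if size > limit
    ]
```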

Commands¶

# Preview partition strategy without creating files
portolan partition buildings.parquet --preview

# Partition with default settings (kdtree, 120k rows/partition)
portolan partition buildings.parquet output/

# Custom target rows
portolan partition data.parquet output/ --target-rows 50000

How It Works¶

Uses geoparquet-io KD-tree partitioning
Creates Hive-style directory structure per ADR-0031
Each partition becomes a STAC Item with its own bbox
Collection gets a glob asset for bulk access (e.g., s3://bucket/collection/*.parquet)

Output Structure¶

collection/
├── collection.json          # Glob asset for bulk access
├── kdtree_cell=001/
│   ├── item.json            # STAC Item with partition bbox
│   └── data.parquet
├── kdtree_cell=002/
│   ├── item.json
│   └── data.parquet
└── ...
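
Directory names in this layout follow the Hive `key=value` convention. A reader could split them as shown in this sketch; `parse_hive_segment` is a hypothetical helper, and the value is kept as a string so zero padding such as `001` is preserved.

```python
def parse_hive_segment(segment: str) -> tuple:
    """Split a Hive-style directory name such as 'kdtree_cell=001' into (key, value)."""
    key, separator, value = segment.partition("=")
    if not separator or not key or not value:
        raise ValueError(f"not a Hive partition segment: {segment!r}")
    return key, value
```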

Settings Reference¶

Setting	Default	Description
`partitioning.enabled`	`true`	Enable auto-partitioning during `portolan add`
`partitioning.prompt`	`true`	Ask before partitioning in interactive mode
`partitioning.threshold_gb`	`2.0`	File size threshold in GB
`partitioning.strategy`	`kdtree`	Spatial partitioning strategy
`partitioning.target_rows`	`120000`	Target rows per partition

Why KD-tree?

KD-tree is data-driven: partitions adapt to actual feature density, producing balanced partition sizes. Grid-based strategies (H3, S2, quadkey) are planned but not yet implemented.

STAC GeoParquet Settings¶

Generate items.parquet for collections with many items, enabling efficient spatial/temporal queries without N HTTP requests.

# .portolan/config.yaml
parquet.enabled: true     # Auto-generate during add (default: false)
parquet.threshold: 100    # Hint when items exceed threshold (default: 100)

Flat key syntax

Config keys use dot notation as literal keys (e.g., parquet.enabled), not nested YAML mappings.

Commands¶

# Generate items.parquet for a collection
portolan stac-geoparquet -c eurosat

# Preview without creating files
portolan stac-geoparquet -c eurosat --dry-run

# Auto-generate during add
portolan add imagery/ --stac-geoparquet

How It Works¶

Uses stac-geoparquet library
Adds items.parquet as a collection-level asset (per ADR-0031) and link with rel: items
Enables spatial filtering with a single HTTP request (vs N requests for items)

Setting	Default	Description
`parquet.enabled`	`false`	Auto-generate during `add` command
`parquet.threshold`	`100`	Show hint when items exceed threshold

When to Use¶

Collections with >100 items (e.g., satellite imagery time series)
Raster collections with many scenes
Partitioned vector datasets

Known Limitation

For existing catalogs with thousands of items, push after generating items.parquet may be slow (#329). This affects incremental updates to large catalogs. New catalogs and small catalogs work normally.

Collection-Level Configuration¶

Override settings for specific collections using the collections: section:

# .portolan/config.yaml
# Note: credentials (remote, profile, region) go in .env, not here
# PORTOLAN_REMOTE and PORTOLAN_PROFILE are catalog-wide; per-collection overrides are not supported

collections:
  analytics:
    conversion:
      extensions:
        convert: [fgb]  # Force GeoParquet for analytics queries

  archive:
    conversion:
      extensions:
        preserve: [shp, gpkg, geojson]  # Preserve all original formats

This approach works well for most catalogs. For large catalogs with many collections, see Hierarchical Configuration below.

Hierarchical Configuration (Optional)¶

For large catalogs or when different maintainers manage different collections, you can optionally create .portolan/ folders at collection or subcatalog levels:

catalog/
  .portolan/
    config.yaml           # Catalog defaults
  demographics/
    .portolan/
      config.yaml         # Collection-specific overrides (optional)
    collection.json
  historical/             # Subcatalog
    .portolan/
      config.yaml         # Subcatalog defaults (optional)
    census-1990/
      collection.json

This is entirely optional. Benefits include:

Scalability: Avoids one giant config file with 100+ collection entries
Ownership: Collection maintainers edit their own folder without touching root
Git-friendly: Changes to one collection don't create merge conflicts in root

Inheritance Rules¶

Settings are inherited from parent levels. Child values override parent values:

# catalog/.portolan/config.yaml
backend: file
pmtiles.enabled: true

# catalog/demographics/.portolan/config.yaml
pmtiles.enabled: false  # Overrides parent (no PMTiles for this collection)
# backend inherited from catalog
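
The inheritance rule can be sketched in two steps: list the config files from the catalog root down to the collection, then merge them so that later (child) values win. Both helpers are hypothetical and use flat dot-notation keys as in the example above; how Portolan actually discovers and merges these files is not specified here.

```python
from pathlib import PurePosixPath


def config_chain(catalog_root: str, collection_dir: str) -> list:
    """Return candidate config paths ordered from catalog root to collection."""
    root = PurePosixPath(catalog_root)
    # relative_to raises ValueError if the collection is outside the catalog.
    relative = PurePosixPath(collection_dir).relative_to(root)
    chain = [root / ".portolan" / "config.yaml"]
    current = root
    for part in relative.parts:
        current = current / part
        chain.append(current / ".portolan" / "config.yaml")
    return chain


def merge_levels(levels: list) -> dict:
    """Merge parsed configs, parent first; child values override parent values."""
    merged = {}
    for level in levels:
        merged.update(level)
    return merged


# Usage with the two files shown above, already parsed into dicts.
effective = merge_levels([
    {"backend": "file", "pmtiles.enabled": True},   # catalog
    {"pmtiles.enabled": False},                      # demographics
])
```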

Credentials are catalog-wide

Credential settings (remote, profile, region) are set via .env at catalog root and apply to all collections. Per-collection credential overrides are not supported.

Precedence¶

When both approaches are used, folder config takes precedence over collections: section:

CLI > Env var > Collection folder config > Subcatalog folder config >
  Root collections: section > Catalog config > Default

When to Use Each Approach¶

Approach	Best For
`collections:` section	Small catalogs, simple overrides
Hierarchical folders	Large catalogs, multiple maintainers, verbose metadata

Most users should start with collections: and only add per-collection .portolan/ folders when needed.

Environment Variables¶

Credential Settings (required via env/.env)¶

These sensitive settings must use environment variables or .env files—they cannot be stored in config.yaml:

Setting	Environment Variable	Notes
`remote`	`PORTOLAN_REMOTE`	S3/GCS/Azure URL
`aws_profile`	`PORTOLAN_AWS_PROFILE`	AWS credential profile
`profile`	`PORTOLAN_PROFILE`	Alias for `aws_profile`
`region`	`PORTOLAN_REGION`	AWS region for S3

Other Settings (optional via env)¶

Non-sensitive settings can also be set via environment variables, which override config.yaml:

Setting	Environment Variable
`backend`	`PORTOLAN_BACKEND`
`pmtiles.enabled`	`PORTOLAN_PMTILES_ENABLED`

Precedence: CLI arguments > Environment variables > config.yaml > Defaults
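
Both rows in the table follow the same naming pattern: a `PORTOLAN_` prefix, dots replaced by underscores, and upper case. The sketch below generalizes that pattern; whether every setting follows it is an assumption, and `env_var_name` is a hypothetical helper.

```python
def env_var_name(setting: str) -> str:
    """Derive the environment variable name for a dot-notation setting key."""
    return "PORTOLAN_" + setting.replace(".", "_").upper()
```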

Setting Aliases¶

Some settings have aliases for convenience:

Canonical Name	Alias
`aws_profile`	`profile`

Both PORTOLAN_AWS_PROFILE and PORTOLAN_PROFILE environment variables work interchangeably.

Note

Aliases apply to environment variables only. Credential settings (aws_profile, profile, remote, region) cannot be stored in config files per the sensitive-settings rule.

Metadata Enrichment¶

In addition to config.yaml, Portolan supports .portolan/metadata.yaml for human-enrichable metadata that supplements STAC.

Purpose¶

STAC provides machine-extractable metadata (title, description, extent, columns). metadata.yaml adds human-only fields that can't be derived automatically:

Field	Purpose
`contact`	Accountability (name, email)
`license`	SPDX identifier (e.g., CC-BY-4.0, MIT)
`citation`	Academic citation text
`doi`	Zenodo/DataCite DOI
`known_issues`	Data quality caveats
`source_url`	Link to original data source
`processing_notes`	Documentation of transformations applied
`keywords`	Tags for search/discovery (rendered as badges)
`attribution`	Credit to data provider or organization
`authors`	List of authors with name, optional ORCID and email
`related_dois`	List of related DOIs for linked publications
`citations`	List of citation strings for referencing
`upstream_version`	Version string of upstream data source
`upstream_version_url`	URL to upstream version (e.g., Zenodo record)

Quick Start¶

# Generate template
portolan metadata init

# Validate required fields
portolan metadata validate

# Generate README from STAC + metadata
portolan readme

Example¶

# .portolan/metadata.yaml
contact:
  name: Data Team
  email: data@example.org

license: CC-BY-4.0

# Optional enrichment fields
license_url: https://creativecommons.org/licenses/by/4.0/
citation: "Census Bureau (2024). Demographics Dataset. DOI: 10.5281/zenodo.1234567"
doi: 10.5281/zenodo.1234567
known_issues: "Coverage gaps in rural areas for 2020 data."

# Provenance and discovery
source_url: https://data.census.gov/demographics
processing_notes: |
  - Reprojected from NAD83 to EPSG:4326
  - Simplified geometries for web display
  - Joined with income data from ACS 2020
keywords:
  - census
  - demographics
  - population
attribution: "U.S. Census Bureau"

# Author and citation metadata
authors:
  - name: Jane Doe
    orcid: 0000-0001-2345-6789
    email: jane.doe@university.edu
  - name: John Smith
related_dois:
  - 10.5281/zenodo.1234567
  - 10.1000/related-paper
citations:
  - "Doe, J. (2024). Census Analysis Methods. J. Demographics, 1(1), 1-10."
upstream_version: "2024.1"
upstream_version_url: https://data.census.gov/releases/2024.1

Required Fields¶

Only two fields are required in metadata.yaml:

contact.name and contact.email - Who maintains this data
license - SPDX identifier (validated against common licenses)

Title and description come from STAC metadata (set during portolan init).
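
A check for the required fields could look like the sketch below. It only tests for presence; `missing_required_fields` is a hypothetical helper and does not reproduce the SPDX validation that `portolan metadata validate` is described as performing.

```python
def missing_required_fields(metadata: dict) -> list:
    """Return the dotted names of required fields that are absent or empty."""
    missing = []
    contact = metadata.get("contact")
    if not isinstance(contact, dict):
        contact = {}
    for field in ("name", "email"):
        if not contact.get(field):
            missing.append(f"contact.{field}")
    if not metadata.get("license"):
        missing.append("license")
    return missing
```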

Hierarchical Inheritance¶

Like config.yaml, metadata.yaml supports hierarchical resolution:

catalog/
  .portolan/
    metadata.yaml         # Default contact and license
  demographics/
    .portolan/
      metadata.yaml       # Override or add collection-specific fields

Child values override parent values. Use this to set catalog-wide defaults (license, contact) while adding collection-specific fields (known_issues, citation).

README Generation¶

The portolan readme command generates README.md by combining:

From STAC (automatic):

Title, description
Spatial/temporal coverage
Schema columns (from table:columns)
Bands (from eo:bands, raster:bands)
Files with checksums
Code examples based on format

From metadata.yaml (human):

License, contact
Authors (with ORCID links)
Citation, DOI, related DOIs
Upstream version (with optional URL)
Known issues
Source URL, processing notes
Keywords (as shields.io badges with proper URL encoding)
Attribution

# Generate README.md
portolan readme

# Preview without writing
portolan readme --stdout

# Check if README is up-to-date (for CI)
portolan readme --check

# Generate for catalog and all collections
portolan readme --recursive

Catalog-level README: When run at catalog root, generates an index README with:

Aggregated spatial extent (envelope of all collections)
Aggregated temporal extent (earliest to latest)
List of collections with links

Data Defaults¶

When source files lack certain metadata (nodata values, temporal info), you can specify defaults in metadata.yaml:

# .portolan/metadata.yaml
defaults:
  temporal:
    year: 2025              # Items default to 2025-01-01
    # Or explicit bounds:
    # start: "2025-04-15"
    # end: "2025-05-30"

  raster:
    nodata: 0               # Uniform nodata for all bands
    # Or per-band:
    # nodata: [0, 0, 255]

Behavior:

Scenario	Result
Source file has value	File value used (defaults don't override)
Source file lacks value	Default applied
CLI flag provided	CLI flag overrides default
No default, no source value	Field left null

Validation:

temporal.year must be an integer between 1800 and 2100
temporal.start/temporal.end must be valid ISO dates (YYYY-MM-DD)
Specifying both year and start is an error (use one or the other)
raster.nodata must be a finite number (no NaN or Infinity)
Per-band nodata lists must match the raster's band count exactly
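
These rules can be sketched as a validator that returns a list of error messages. The function name, its `band_count` parameter, and the message wording are hypothetical; only the rules themselves come from the list above.

```python
import math
from datetime import date


def validate_defaults(defaults: dict, band_count=None) -> list:
    """Check the `defaults:` block against the documented validation rules."""
    errors = []
    temporal = defaults.get("temporal") or {}

    year = temporal.get("year")
    if year is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(year, bool) or not isinstance(year, int) or not 1800 <= year <= 2100:
            errors.append("temporal.year must be an integer between 1800 and 2100")
        if temporal.get("start") is not None:
            errors.append("specify either temporal.year or temporal.start, not both")

    for key in ("start", "end"):
        value = temporal.get(key)
        if value is not None:
            try:
                date.fromisoformat(str(value))
            except ValueError:
                errors.append(f"temporal.{key} must be a valid ISO date (YYYY-MM-DD)")

    nodata = (defaults.get("raster") or {}).get("nodata")
    if nodata is not None:
        values = nodata if isinstance(nodata, list) else [nodata]
        for v in values:
            is_number = isinstance(v, (int, float)) and not isinstance(v, bool)
            if not is_number or not math.isfinite(v):
                errors.append("raster.nodata must be a finite number")
                break
        if isinstance(nodata, list) and band_count is not None and len(nodata) != band_count:
            errors.append("per-band nodata list must match the raster's band count")

    return errors
```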

See the Metadata Defaults Guide for detailed usage.
