Extracting Data from ArcGIS Services¶
Portolan can extract data directly from ArcGIS REST services:
- FeatureServer/MapServer: Vector data → GeoParquet files
- ImageServer: Raster imagery → Cloud-Optimized GeoTIFF (COG) tiles
Quick Start¶
# Extract all layers from a FeatureServer
portolan extract arcgis https://services.arcgis.com/.../FeatureServer ./output
# Extract tiles from an ImageServer (uses bbox to limit area)
portolan extract arcgis https://example.com/.../ImageServer ./output --bbox "minx,miny,maxx,maxy"
# Preview what would be extracted (dry run)
portolan extract arcgis URL --dry-run
Service Types¶
Portolan auto-detects the service type from the URL:
| URL Pattern | Service Type | Output Format |
|---|---|---|
| .../FeatureServer | Vector features | GeoParquet |
| .../MapServer | Vector features | GeoParquet |
| .../ImageServer | Raster imagery | COG tiles |
MapServer Raster Limitation
MapServer endpoints only support vector extraction. Even if a MapServer hosts imagery, Portolan extracts the vector tile indexes and boundaries—not the imagery itself.
This is a fundamental ArcGIS limitation: MapServer's /export endpoint returns a rendered 8-bit visualization, not the source raster data. To extract source imagery, you need an ImageServer endpoint.
If you're working with a data provider that only offers MapServer (e.g., PASDA), contact them to request ImageServer access, or use their download links for the source data.
FeatureServer / MapServer Extraction¶
Basic Usage¶
Point Portolan at any ArcGIS FeatureServer or MapServer URL:
portolan extract arcgis \
https://services.arcgis.com/fLeGjb7u4uXqeF9q/ArcGIS/rest/services/Census_2020/FeatureServer \
./census_2020
This will:
- Discover all layers in the service
- Extract each layer to GeoParquet format
- Apply Hilbert spatial sorting for efficient queries
- Initialize a Portolan catalog with STAC metadata
- Add via provenance links to each collection (pointing to source layer URL)
- Seed .portolan/metadata.yaml with values from the service
- Generate an extraction report in .portolan/extraction-report.json
Filtering Layers¶
Use glob patterns to extract specific layers:
# Include only layers matching patterns
portolan extract arcgis URL --layers "Census*,Transport*"
# Exclude layers matching patterns
portolan extract arcgis URL --exclude-layers "*_Archive,*_Backup"
# Combine include and exclude
portolan extract arcgis URL --layers "Census*" --exclude-layers "*_2010"
Pattern syntax uses fnmatch:
- * matches any characters
- ? matches a single character
- Examples: sdn_*, *_2024, cod_ab_*
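To see how include and exclude patterns combine, here is a minimal Python sketch using the standard-library fnmatch module. The select_layers helper is hypothetical and is not Portolan's implementation. It assumes case-sensitive matching and that exclusions are applied after inclusions, which may differ from what the CLI actually does.

```python
from fnmatch import fnmatchcase


def select_layers(names, include=None, exclude=None):
    """Keep names that match any include pattern and no exclude pattern.

    Hypothetical helper: the order of evaluation and the case sensitivity
    are assumptions, not documented Portolan behavior.
    """
    include = include or ["*"]  # no include patterns means "everything"
    exclude = exclude or []
    selected = []
    for name in names:
        if not any(fnmatchcase(name, pattern) for pattern in include):
            continue
        if any(fnmatchcase(name, pattern) for pattern in exclude):
            continue
        selected.append(name)
    return selected


layers = ["Census_2010", "Census_2020", "Transport_Roads", "Parcels_Archive"]
# Mirrors: --layers "Census*" --exclude-layers "*_2010"
print(select_layers(layers, include=["Census*"], exclude=["*_2010"]))
# ['Census_2020']
```

Trying patterns against a list of layer names this way can be a quick check before running a dry run against the real service.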
Output Structure¶
Each layer becomes a collection with the parquet file as a collection-level asset:
output/
├── .portolan/
│ ├── extraction-report.json # Extraction metadata
│ └── metadata.yaml # Pre-seeded with service metadata
├── catalog.json # STAC catalog
├── census_block_groups/
│ ├── collection.json
│ └── census_block_groups.parquet
└── census_tracts/
├── collection.json
└── census_tracts.parquet
Provenance Tracking¶
Each extracted collection includes a via link in its collection.json pointing to the original ArcGIS layer URL. This provides data lineage for auditing and reproducibility:
{
"links": [
{
"rel": "via",
"href": "https://services.arcgis.com/.../FeatureServer/0",
"type": "text/html",
"title": "Source ArcGIS layer: Census Block Groups"
}
]
}
The via link uses the layer-specific URL (service URL + layer index), not just the service root.
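For auditing, the via link can be read back from a collection file with a few lines of Python. This is a sketch that relies only on the links structure shown above. The find_via_links helper is hypothetical, and the elided URL is kept exactly as it appears in the example.

```python
import json


def find_via_links(collection):
    """Return the href of every link whose rel is "via"."""
    return [
        link["href"]
        for link in collection.get("links", [])
        if link.get("rel") == "via"
    ]


# Same shape as the collection.json excerpt above.
example = json.loads("""
{
  "links": [
    {
      "rel": "via",
      "href": "https://services.arcgis.com/.../FeatureServer/0",
      "type": "text/html",
      "title": "Source ArcGIS layer: Census Block Groups"
    }
  ]
}
""")

print(find_via_links(example))
# ['https://services.arcgis.com/.../FeatureServer/0']
```

In practice you would load the dictionary from a collection.json file on disk, for example with json.loads(Path(...).read_text()).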
Auto-Seeded Metadata¶
The extraction process automatically seeds metadata at two levels:
Catalog-level (.portolan/metadata.yaml): Seeded from ArcGIS service metadata:
| ArcGIS Field | metadata.yaml Field |
|---|---|
| copyrightText | attribution |
| documentInfo.Author | contact.name |
| documentInfo.Keywords | keywords |
| serviceDescription | processing_notes |
| accessInformation | known_issues |
| Service URL | source_url |
Collection-level (<collection>/.portolan/metadata.yaml): Each extracted layer gets its own metadata file with layer-specific description from the ArcGIS layer details API. For nested extractions (services-root mode), metadata is written to the correct collection directory regardless of nesting depth.
Fields that require human input (like contact.email and license) are marked with TODO placeholders:
contact:
name: "Philadelphia GIS Team" # Auto-filled from service
email: "TODO: Add value" # Needs human input
license: "TODO: Add value" # Needs SPDX identifier
source_url: "https://services.arcgis.com/..."
attribution: "City of Philadelphia"
Won't Overwrite Existing Files
If .portolan/metadata.yaml already exists, extraction will not overwrite it. This preserves any manual edits you've made.
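Because the placeholders share a fixed marker string, a plain-text scan is enough to list the fields that still need attention. The sketch below uses only the standard library and is not a YAML parser. The find_todo_fields helper is hypothetical, and it assumes each placeholder sits on a single key: value line as in the example above.

```python
def find_todo_fields(yaml_text, marker="TODO: Add value"):
    """Return (line_number, key) pairs for values that are still placeholders.

    Plain-text scan, not a YAML parser: it assumes one "key: value" pair
    per line, as in the seeded metadata.yaml example.
    """
    todos = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        content = line.split("#", 1)[0]  # ignore trailing comments
        if marker in content:
            key = content.split(":", 1)[0].strip()
            todos.append((number, key))
    return todos


sample = '''contact:
  name: "Philadelphia GIS Team"  # Auto-filled from service
  email: "TODO: Add value"       # Needs human input
license: "TODO: Add value"       # Needs SPDX identifier
'''

print(find_todo_fields(sample))
# [(3, 'email'), (4, 'license')]
```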
ImageServer Extraction¶
Basic Usage¶
Extract raster imagery from an ArcGIS ImageServer:
portolan extract arcgis \
https://sampleserver6.arcgisonline.com/arcgis/rest/services/Toronto/ImageServer \
./toronto-imagery
This will:
- Query service metadata (extent, CRS, pixel size, bands)
- Compute a tile grid covering the service extent
- Download tiles via exportImage API
- Convert each tile to Cloud-Optimized GeoTIFF (COG)
- Create STAC items for each tile with spatial metadata
- Generate an extraction report
- Seed metadata.yaml with service metadata (source URL, attribution, keywords)
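The tile grid step can be pictured with a short Python sketch. This is an illustration under stated assumptions, not Portolan's actual algorithm: the tile_grid function is hypothetical, the grid is anchored at the lower-left corner of the extent, tiles at the upper and right edges are clipped to the extent, and row and column numbering here is not claimed to match the tile_0_0 naming used on disk.

```python
import math


def tile_grid(extent, pixel_size, tile_size=4096):
    """Yield (row, col, bbox) for tiles covering extent.

    extent is (minx, miny, maxx, maxy) in the service CRS and pixel_size
    is the ground distance per pixel. Origin and ordering are assumptions.
    """
    minx, miny, maxx, maxy = extent
    span = pixel_size * tile_size  # ground distance covered by one tile edge
    cols = math.ceil((maxx - minx) / span)
    rows = math.ceil((maxy - miny) / span)
    for row in range(rows):
        for col in range(cols):
            x0 = minx + col * span
            y0 = miny + row * span
            # Clip edge tiles so they do not extend past the service extent.
            yield row, col, (x0, y0, min(x0 + span, maxx), min(y0 + span, maxy))


tiles = list(tile_grid((0, 0, 10000, 5000), pixel_size=1.0, tile_size=4096))
print(len(tiles))  # 3 columns x 2 rows = 6
```

A sketch like this also shows why --bbox matters for large services: halving the extent in each direction cuts the number of tiles to roughly a quarter.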
Limiting Extraction Area¶
For large ImageServers, use --bbox to extract a subset:
# WGS84 coordinates (minx,miny,maxx,maxy = lon,lat order) - auto-converted to service CRS
portolan extract arcgis URL --bbox "-75.17,39.95,-75.15,39.97"
# Or explicit service CRS coordinates (Web Mercator in this example)
portolan extract arcgis URL --bbox "-8367886,4858679,-8365659,4861583"
# Override CRS auto-detection with --bbox-crs
portolan extract arcgis URL --bbox "100,200,300,400" --bbox-crs "EPSG:2269"
Bbox format: The --bbox flag expects coordinates as minx,miny,maxx,maxy (longitude,latitude order for WGS84). When values fall within the WGS84 range (-180/180, -90/90), Portolan automatically reprojects them to the service's native CRS. Use --bbox-crs to specify an explicit CRS and skip auto-detection.
Overriding CRS Detection
If you're working with a local CRS (like State Plane) where coordinates happen to fall in the WGS84 range, use --bbox-crs to specify the exact CRS and skip auto-detection.
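The range check described above can be sketched in Python to show where the ambiguity comes from. Both functions are hypothetical illustrations of the heuristic as documented, not Portolan's code, and the reprojection step itself is omitted.

```python
def parse_bbox(text):
    """Parse "minx,miny,maxx,maxy" into a tuple of four floats."""
    parts = [float(part) for part in text.split(",")]
    if len(parts) != 4:
        raise ValueError("bbox needs exactly four comma-separated numbers")
    minx, miny, maxx, maxy = parts
    if minx >= maxx or miny >= maxy:
        raise ValueError("bbox must satisfy minx < maxx and miny < maxy")
    return (minx, miny, maxx, maxy)


def looks_like_wgs84(bbox):
    """True when every value fits the lon/lat ranges (-180/180, -90/90)."""
    minx, miny, maxx, maxy = bbox
    return (
        -180 <= minx <= 180 and -180 <= maxx <= 180
        and -90 <= miny <= 90 and -90 <= maxy <= 90
    )


print(looks_like_wgs84(parse_bbox("-75.17,39.95,-75.15,39.97")))         # True
print(looks_like_wgs84(parse_bbox("-8367886,4858679,-8365659,4861583")))  # False
# Small projected coordinates are indistinguishable from lon/lat values:
print(looks_like_wgs84(parse_bbox("10,20,30,40")))                        # True
```

The last case is the one where --bbox-crs is needed, since no range check can tell small projected coordinates apart from degrees.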
ImageServer Options¶
# Tile size in pixels (default: 4096, auto-adjusted if exceeds service limit)
portolan extract arcgis URL --tile-size 2048
# Maximum concurrent downloads (default: 4)
portolan extract arcgis URL --max-concurrent 8
# COG compression (default: deflate)
portolan extract arcgis URL --compression jpeg # Good for RGB imagery
# Custom collection name (default: 'tiles')
portolan extract arcgis URL --collection-name "naip-philly-2024"
Tile Size Validation
Portolan automatically fetches the service's maximum allowed tile dimensions during discovery. If your --tile-size exceeds this limit, it's auto-adjusted down with a warning—no more cryptic "bad magic bytes" errors.
Output Structure¶
Raster data uses item-level assets — each tile becomes a STAC item:
output/
├── .portolan/
│ ├── config.yaml
│ ├── extraction-report.json
│ ├── imageserver-resume.json # For resuming interrupted extractions
│ └── metadata.yaml # Seeded from service metadata
├── catalog.json
└── tiles/ # Collection name (customizable via --collection-name)
├── collection.json
├── versions.json
├── tile_0_0/
│ ├── tile_0_0.json # STAC item with tile bbox
│ └── tile_0_0.tif # COG asset
├── tile_0_1/
│ ├── tile_0_1.json
│ └── tile_0_1.tif
└── ...
Use --collection-name to give the collection a meaningful name instead of the generic "tiles".
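Given this layout, the tile items in a collection can be enumerated with a short Python sketch. It assumes the directory structure shown above (one directory per tile, holding a JSON item with the same name) and that each item carries the standard STAC bbox field. The list_tile_items helper is hypothetical.

```python
import json
from pathlib import Path


def list_tile_items(collection_dir):
    """Return (item_id, bbox) for every tile item under a collection directory.

    Assumes the layout <collection>/<tile>/<tile>.json shown above.
    """
    items = []
    for item_path in sorted(Path(collection_dir).glob("*/*.json")):
        if item_path.stem != item_path.parent.name:
            continue  # skip JSON files that are not the tile's own item
        item = json.loads(item_path.read_text())
        items.append((item_path.stem, item.get("bbox")))
    return items
```

For example, list_tile_items("output/tiles") would return one entry per tile directory, which is handy for checking coverage after an interrupted run.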
Metadata After Extraction¶
Extraction automatically seeds .portolan/metadata.yaml with values from the ArcGIS service (source URL, description, attribution, keywords). Fields that require human input are marked with TODO: Add value:
# Auto-seeded .portolan/metadata.yaml (example)
contact:
name: "TODO: Add value" # Required - add your name
email: "TODO: Add value" # Required - add your email
license: "TODO: Add value" # Required - add SPDX identifier (e.g., CC-BY-4.0)
source_url: https://example.com/.../ImageServer # Auto-populated
attribution: "Copyright © 2024 Example Org" # Auto-populated from copyrightText
Complete the TODO fields, then generate the README:
# Generate README from STAC + metadata.yaml
portolan readme tiles
metadata.yaml Supplements STAC
The metadata.yaml file supplements STAC metadata—it doesn't replace or modify it. This separation keeps machine-extracted metadata (STAC) distinct from human-enriched metadata (metadata.yaml).
There is no portolan metadata generate command because metadata.yaml is for human enrichment only. The README command (portolan readme) reads from both STAC (machine-extracted) and metadata.yaml (human-enriched) to generate documentation.
Overriding auto-seeded metadata
To replace auto-seeded values, edit the generated metadata.yaml directly. The file is never overwritten on subsequent extractions.
Common Options¶
These options work for both FeatureServer and ImageServer:
Controlling Extraction¶
# Request timeout in seconds (default: 60 for vectors, 120 for rasters)
portolan extract arcgis URL --timeout 120
# Retry failed requests (default: 3 attempts)
portolan extract arcgis URL --retries 5
Resume Failed Extractions¶
If an extraction fails partway through, resume from where you left off:
# Initial extraction (fails partway)
portolan extract arcgis URL ./output
# Resume - skips already-extracted layers/tiles
portolan extract arcgis URL ./output --resume
Dry Run Mode¶
Preview what would be extracted without downloading any data:
portolan extract arcgis URL --dry-run
JSON Output¶
For automation and scripts, use JSON output:
portolan extract arcgis URL --json
Non-Interactive Mode¶
Skip confirmation prompts (useful in scripts):
portolan extract arcgis URL --auto
Extraction Report¶
The extraction report (.portolan/extraction-report.json) contains:
- Source URL and extraction timestamp
- Metadata extracted from the ArcGIS service
- Per-layer/tile results: status, count, file size, duration, any errors
- Summary: totals for succeeded, failed, skipped
Example (FeatureServer):
{
"extraction_date": "2024-03-15T10:30:00Z",
"source_url": "https://services.arcgis.com/.../FeatureServer",
"summary": {
"total_layers": 5,
"succeeded": 4,
"failed": 1,
"total_features": 125000,
"total_size_bytes": 45000000
}
}
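In scripts, the summary block can be turned into a one-line status message. The Python sketch below reads only the fields that appear in the FeatureServer example above. The summarize helper is hypothetical, and reports for ImageServer extractions may use different field names.

```python
import json


def summarize(report):
    """Format the summary block of a FeatureServer extraction report."""
    summary = report["summary"]
    size_mb = summary["total_size_bytes"] / 1_000_000
    return (
        f'{summary["succeeded"]}/{summary["total_layers"]} layers succeeded, '
        f'{summary["failed"]} failed, '
        f'{summary["total_features"]} features, {size_mb:.1f} MB'
    )


# Same content as the example report above.
report = json.loads("""
{
  "extraction_date": "2024-03-15T10:30:00Z",
  "source_url": "https://services.arcgis.com/.../FeatureServer",
  "summary": {
    "total_layers": 5,
    "succeeded": 4,
    "failed": 1,
    "total_features": 125000,
    "total_size_bytes": 45000000
  }
}
""")

print(summarize(report))
# 4/5 layers succeeded, 1 failed, 125000 features, 45.0 MB
```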
Extracting from Services Root¶
You can point Portolan at an entire ArcGIS services directory to extract all services at once:
portolan extract arcgis \
https://services.arcgis.com/{org_id}/ArcGIS/rest/services \
./output --services "Census*,Demographics*"
Filtering Services¶
# Include only services matching patterns
portolan extract arcgis URL --services "Census*,Transport*"
# Exclude services matching patterns
portolan extract arcgis URL --exclude-services "*_Archive,*_Test"
Output Structure¶
Services root extraction creates a nested catalog structure. Single-layer services are flattened to avoid redundant nesting:
Single-layer service (flattened):
bag_woonfunctie/ # Collection directly
├── collection.json
└── bag_woonfunctie.parquet
Multi-layer service (nested):
woontypering/ # Subcatalog
├── catalog.json
├── woontypering_2020/
│ ├── collection.json
│ └── woontypering_2020.parquet
└── woontypering_2021/
├── collection.json
└── woontypering_2021.parquet
This keeps directory structures clean while preserving hierarchy where it matters.
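The flattening rule can be expressed as a small Python sketch that predicts the paths for a service. The plan_layout function is hypothetical. In particular, the flattened case here names the directory after the single layer, which is an assumption, since the example above does not say whether the service name or the layer name is used.

```python
def plan_layout(service_name, layer_names):
    """Return the relative paths expected for one service.

    Single-layer services become a collection directly; multi-layer
    services become a subcatalog with one collection per layer.
    """
    if len(layer_names) == 1:
        layer = layer_names[0]  # assumption: directory is named after the layer
        return [f"{layer}/collection.json", f"{layer}/{layer}.parquet"]
    paths = [f"{service_name}/catalog.json"]
    for layer in layer_names:
        paths.append(f"{service_name}/{layer}/collection.json")
        paths.append(f"{service_name}/{layer}/{layer}.parquet")
    return paths


for path in plan_layout("woontypering", ["woontypering_2020", "woontypering_2021"]):
    print(path)
```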
Tips¶
Finding ArcGIS Services¶
ArcGIS services are typically found at URLs like:
- https://services.arcgis.com/{org_id}/ArcGIS/rest/services/{service_name}/FeatureServer
- https://gis.example.com/arcgis/rest/services/{folder}/{service_name}/ImageServer
You can browse available services at the root:
https://services.arcgis.com/{org_id}/ArcGIS/rest/services
Large Services¶
For services with many layers or large datasets:
- Use --dry-run first to see what will be extracted
- Filter with --layers (vectors) or --bbox (rasters)
- Use --resume if extraction is interrupted
- Increase parallelism with --workers (vectors) or --max-concurrent (rasters)
Error Handling¶
If a layer/tile fails to extract:
- The extraction continues with remaining items
- Failed items are recorded in the report with error details
- Use --resume to retry only failed items
Requirements¶
- geoparquet-io — Vector extraction (automatically installed)
- rio-cogeo — COG conversion (automatically installed)
- Network access to the ArcGIS service