Background & Summary

Canopy height models (CHMs) are spatially explicit representations of the vertical structure of an environment, measured relative to the ground surface. These models provide detailed information about the structure, arrangement, and organization of vegetation and the built-up environment; they are used in numerous applications, including land management and conservation, carbon and climate change modeling, landscape and habitat monitoring, disaster risk assessment and management, and geospatial analysis and modeling1,2,3,4,5. Because CHMs are commonly derived from airborne lidar (i.e., Light Detection and Ranging) instruments, their availability is often limited to local or regional acquisitions and to single snapshots in time. Spaceborne lidar instruments (e.g., GEDI and ICESat-2) provide increased and often repeated coverage, but with the tradeoff of coarser ground sampling distances.

To overcome these challenges and produce CHMs at scale, recent approaches have combined lidar-derived CHMs with multispectral optical or radar imagery. Combining spaceborne GEDI-derived CHMs with Landsat, Sentinel-2, or Sentinel-1 imagery has resulted in modeled estimates of canopy height at global and regional scales6,7,8,9. Aerial lidar-derived CHMs have also been used successfully, most commonly with high-resolution aerial or satellite imagery. Wagner et al.10 produced sub-meter canopy height estimates for the state of California by producing their own aerial lidar-derived CHMs and training a model with United States Department of Agriculture (USDA) National Agriculture Imagery Program (NAIP) images. Notably, Tolan et al.11 produced sub-meter global estimates of canopy height through self-supervised learning on Maxar satellite imagery and subsequent training with off-the-shelf CHMs (both airborne and spaceborne).

Because tree height reduces uncertainty in woody plant carbon modeling, scientists have focused CHM development and modeling efforts on forested ecosystems. Other ecosystems are often neglected or excluded entirely from model training, which reduces model accuracy and utility in those ecosystems. Rangelands (inclusive of grasslands, savannas, and shrublands) have received little attention in CHM modeling efforts, even though they are a dominant land cover12 and canopy height measurements are valuable to numerous modeling pursuits and on-the-ground management3. The inherent heterogeneity of rangelands (i.e., a mixture of grasses, shrubs, trees, or lack thereof) often requires fine-resolution CHMs, which reduce uncertainty and are more suitable for application13.

The paucity of high-resolution, aerial-derived CHMs presents difficulties when training models for broader-scale application. In the United States, the United States Geological Survey (USGS) works with partners to collect aerial lidar for the 3D Elevation Program (3DEP)14,15. Although the data are collected in different regions, at different times, and by different contractors, and are ultimately processed into various derived products (primarily digital elevation models), the lidar data are publicly available for independent CHM production. The overhead of retrieving and processing these data, however, can be challenging.

We used the USGS 3DEP lidar collection to produce a geographically large, but spatially disparate, CHM dataset. We focused our efforts on United States rangelands, but ensured that other dominant land covers are included. Our dataset comprises 22,796,764 CHM images, each spatially paired with a USDA NAIP image.

Methods

Location sampling

Given the availability of USGS 3DEP lidar data and USDA NAIP imagery, we focused dataset development on the conterminous United States (CONUS). We stratified location sampling by Environmental Protection Agency level three ecoregions and National Land Cover Database (NLCD; 2019 release) dominant classes. Land cover subclasses were aggregated to their dominant class (e.g., deciduous, evergreen, and mixed forest were aggregated to forest), with the exception of pasture and cultivated crops. Our baseline sampling was 50,000 locations per class within each ecoregion. To ensure greater representation of rangelands, we increased sampling of the herbaceous and shrubland classes by 4x and of the pasture class by 2x. We reduced sampling of the water class to 0.1x the baseline to limit its abundance.
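The per-class sampling targets described above can be illustrated with a short sketch. It is not the code used to build the dataset: the class labels are illustrative strings rather than official NLCD codes, and it assumes that each multiplier (including 0.1x for water) is applied directly to the 50,000-location baseline.

```python
BASELINE = 50_000  # baseline locations per class within one ecoregion

# Multipliers relative to the baseline, taken from the text.
# The class labels are illustrative, not official NLCD class codes.
MULTIPLIERS = {
    "herbaceous": 4.0,
    "shrubland": 4.0,
    "pasture": 2.0,
    "water": 0.1,  # assumption: 0.1x is a multiplier on the baseline
}


def target_samples(land_cover_class: str, baseline: int = BASELINE) -> int:
    """Return the target number of sampling locations for one class in one ecoregion."""
    # Classes without an explicit multiplier keep the baseline.
    return round(baseline * MULTIPLIERS.get(land_cover_class, 1.0))


if __name__ == "__main__":
    for name in ("forest", "herbaceous", "shrubland", "pasture", "water"):
        print(name, target_samples(name))
```

These are upper bounds per ecoregion; a class that occupies few pixels in an ecoregion may yield fewer locations than its target.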

To maintain a minimum distance of 240 m between sampling locations across all classes, we aggregated the NLCD to 240 m resolution by calculating the mode of all pixels within the aggregation unit. This sampling produced approximately 30.5 million locations, which were further reduced by availability of lidar data and NAIP imagery.
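As a minimal sketch of the mode aggregation, the function below collapses each block of pixels to its most frequent class code. It assumes the native 30 m NLCD resolution, so a 240 m unit is an 8 x 8 block. The function name and the tie-breaking rule (smallest class code wins) are assumptions made for illustration; the text does not specify how ties were handled.

```python
from collections import Counter


def aggregate_mode(grid, factor=8):
    """Aggregate a 2D grid of class codes by the mode of each factor x factor block.

    Rows and columns that do not fill a complete block are dropped.
    """
    n_rows = len(grid) // factor
    n_cols = len(grid[0]) // factor if grid else 0
    out = []
    for bi in range(n_rows):
        row_out = []
        for bj in range(n_cols):
            counts = Counter(
                grid[r][c]
                for r in range(bi * factor, (bi + 1) * factor)
                for c in range(bj * factor, (bj + 1) * factor)
            )
            top = max(counts.values())
            # Assumption: ties are broken by the smallest class code.
            row_out.append(min(code for code, n in counts.items() if n == top))
        out.append(row_out)
    return out
```

Drawing at most one location per aggregated cell, for example at the cell center, would then keep sampled locations at least one cell width (240 m) apart.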

Lidar data

USGS 3DEP lidar data are available as LAZ format tiles via USGS rockyweb (https://rockyweb.usgs.gov/) and Amazon Web Services (AWS) cloud storage (https://registry.opendata.aws/usgs-lidar/). Many of these data are also available in Entwine Point Tile (EPT) format via AWS cloud storage. The EPT format is cloud-friendly and streamable, allowing users to easily retrieve and process lidar data based on geographic location or other parameters. We used the EPT format to construct this dataset.
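To illustrate how EPT allows retrieval by geographic location, the sketch below builds a PDAL pipeline definition that reads only the points inside a window around one location and rasterizes height above ground. It is one possible approach, not necessarily the workflow used for this dataset. The stages `readers.ept`, `filters.hag_nn`, and `writers.gdal` are existing PDAL stages, but the function name, window size, resolution, and output path are placeholder assumptions, and the coordinates must be in the coordinate reference system of the EPT resource.

```python
import json


def ept_chm_pipeline(ept_url, x, y, half_width=120.0, resolution=1.0,
                     out_path="chm.tif"):
    """Build a PDAL pipeline (as a dict) for a CHM window centered on (x, y).

    half_width, resolution, and out_path are illustrative placeholders.
    """
    xmin, xmax = x - half_width, x + half_width
    ymin, ymax = y - half_width, y + half_width
    return {
        "pipeline": [
            {
                # Stream only the points that fall inside the window.
                "type": "readers.ept",
                "filename": ept_url,
                "bounds": f"([{xmin}, {xmax}], [{ymin}, {ymax}])",
            },
            # Height above ground from nearest ground-classified points.
            {"type": "filters.hag_nn"},
            {
                # Rasterize the maximum height per cell.
                "type": "writers.gdal",
                "filename": out_path,
                "dimension": "HeightAboveGround",
                "output_type": "max",
                "resolution": resolution,
            },
        ]
    }


if __name__ == "__main__":
    # The URL is a placeholder, not a real work unit.
    pipeline = ept_chm_pipeline("https://example.com/ept.json", 1000.0, 2000.0)
    print(json.dumps(pipeline, indent=2))
```

The printed JSON could be saved to a file and run with the `pdal pipeline` command, assuming PDAL is installed and the point cloud carries ground classifications.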

USGS 3DEP lidar data are published by work unit. Using the USGS Work unit Extent Spatial Metadata (WESM), we selected work units that met the following criteria: (1) lidar data were collected from 2014 through 2023; (2) the Quality Level (QL) was QL2 or lower (see Table 1); and (3) the LPC category was “Meets”, “Meets with variance”, or “Expected to meet”, which refers to the data’s ability to meet 3DEP specifications. We then selected work units available in EPT format and buffered their perimeters inward by 200 m to reduce edge effects. Finally, we retained only sampling locations that intersected these work units, reducing the number of sampling locations to approximately 23.2 million.
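The selection criteria above can be expressed as a simple filter. In this sketch, each work unit is a plain dictionary, and the key names (`collect_start`, `collect_end`, `ql`, `lpc_category`, `has_ept`) are hypothetical stand-ins rather than the actual WESM attribute names. It also assumes that "QL2 or lower" means a numeric quality level of 2 or less, and that both the start and end of collection must fall within 2014 through 2023.

```python
from datetime import date

ACCEPTED_LPC = {"Meets", "Meets with variance", "Expected to meet"}
FIRST_DAY = date(2014, 1, 1)
LAST_DAY = date(2023, 12, 31)


def keep_work_unit(unit: dict) -> bool:
    """Return True if a work unit satisfies all selection criteria.

    The dictionary keys are hypothetical, not official WESM field names.
    """
    return (
        FIRST_DAY <= unit["collect_start"]
        and unit["collect_end"] <= LAST_DAY
        and unit["ql"] <= 2  # assumption: numeric QL, where 2 stands for QL2
        and unit["lpc_category"] in ACCEPTED_LPC
        and unit["has_ept"]
    )


if __name__ == "__main__":
    example = {
        "collect_start": date(2019, 4, 1),
        "collect_end": date(2019, 6, 15),
        "ql": 2,
        "lpc_category": "Meets",
        "has_ept": True,
    }
    print(keep_work_unit(example))
```

The inward buffer is a separate geometric step. With a geometry library such as Shapely, it would typically be a negative buffer (for example, `footprint.buffer(-200)`) applied in a projected coordinate system with meter units.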