Coordinates Series A, No. 2
Unlocking the Mysteries of the Bounding Box
Douglas R. Caldwell
Persistent URL for citation: http://purl.oclc.org/coordinates/a2.htm
Date of Publication: 08/29/05
Douglas R. Caldwell (e-mail: Douglas.R.Caldwell@erdc.usace.army.mil) is employed as a cartographer and geospatial analyst at the US Army Engineer Research & Development Center, Topographic Engineering Center, Research Division, Information Generation and Management Branch, 7701 Telegraph Road, Alexandria, VA 22315.
 Figure 4. Bounding Boxes for a Dataset with Geographic Coordinates Projected to Albers Equal Area. The red box shows the effects of projecting the box defined only by the four corner coordinates. The blue box has been densified with additional vertices.

The Projection Problem arises because a bounding box defined by corner coordinates is an undersampled representation of the true bounding box. It provides information at the corner locations, but no information along the lines connecting the corners. When a bounding box defined by the four corners is projected, the lines connecting the corners remain straight lines, whereas the true edges, which follow lines of latitude and longitude, may become curves in the projected coordinate system. This means that queries against this box may have an incorrect extent and risk missing areas that should be included, which is a significant problem. In addition, areas outside the original bounding box may be included in the search. This is less of a problem, as the areal coverage of a bounding box is already understood to be an approximation greater than the area of the feature. In the example in Figure 4, the bounding box for the United States misses the southern tips of Florida and Texas.

There are two potential solutions to the Projection Problem. The first solution is to add vertices to the bounding box before projecting it. This ‘densification’ supports a more accurate representation of the projected boundary and should be done whenever using the bounding box to determine the extent of the data. This solution is appropriate when the original data are no longer available. If the data are available, a second solution is to generate a new bounding box after reprojecting the data.
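
The densification step can be illustrated with a minimal Python sketch. It is not the procedure used to produce Figure 4; the function names, the spherical form of the Albers Equal Area Conic equations, the projection parameters (central meridian -96, standard parallels 29.5 and 45.5, latitude of origin 23, Earth radius 6371 km), and the box coordinates are all assumptions chosen for illustration.

```python
import math

def densify_box(west, south, east, north, n=20):
    """Return the box boundary as a closed ring with n segments per edge."""
    corners = [(west, south), (east, south), (east, north), (west, north)]
    ring = []
    for i in range(4):
        (x0, y0), (x1, y1) = corners[i], corners[(i + 1) % 4]
        for k in range(n):
            t = k / n
            ring.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    ring.append(ring[0])  # close the ring
    return ring

def albers(lon, lat, lon0=-96.0, lat0=23.0, sp1=29.5, sp2=45.5, radius=6371.0):
    """Spherical Albers Equal Area Conic (assumed parameters); returns km."""
    phi, phi0, p1, p2 = (math.radians(v) for v in (lat, lat0, sp1, sp2))
    n = (math.sin(p1) + math.sin(p2)) / 2
    c = math.cos(p1) ** 2 + 2 * n * math.sin(p1)
    rho = radius * math.sqrt(c - 2 * n * math.sin(phi)) / n
    rho0 = radius * math.sqrt(c - 2 * n * math.sin(phi0)) / n
    theta = n * math.radians(lon - lon0)
    return rho * math.sin(theta), rho0 - rho * math.cos(theta)

def projected_extent(ring):
    """Extent (min x, min y, max x, max y) of the projected ring."""
    xs, ys = zip(*(albers(lon, lat) for lon, lat in ring))
    return min(xs), min(ys), max(xs), max(ys)

box = (-125.0, 24.0, -66.0, 50.0)  # illustrative geographic extent
corners_only = projected_extent(densify_box(*box, n=1))
densified = projected_extent(densify_box(*box, n=50))
print("corners only:", [round(v) for v in corners_only])
print("densified:   ", [round(v) for v in densified])
```

Under these assumptions the densified ring reaches farther south in projected coordinates than the four corners do, which is the kind of underestimation described above.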

Approximation Assessment

The Approximation Assessment, or measure of how well bounding boxes approximate the coverage of a feature, is the final issue. This is especially important for applications involving bounding boxes used to estimate the area of a feature, as poor approximations will lead to larger numbers of non-relevant results. Approximation effectiveness is analyzed using the Bounding Box Factor, which is the ratio of the area of the bounding box to the area of the feature. The Bounding Box Factor ranges from 1, where the bounding box and the feature are identical, to infinity, where the bounding box is infinitely larger than the feature.
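
As a small illustration of the ratio, the following Python sketch computes the Bounding Box Factor for a single-part polygon given as a list of planar (x, y) vertices. The function names are hypothetical, and the shoelace formula is used here only as a convenient way to obtain the feature area; it is not necessarily how the areas in this study were calculated.

```python
def polygon_area(ring):
    """Planar area by the shoelace formula; ring is a list of (x, y)."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:] + ring[:1]):
        total += x0 * y1 - x1 * y0
    return abs(total) / 2

def bounding_box_factor(ring):
    """Ratio of the bounding box area to the feature area."""
    xs, ys = zip(*ring)
    box_area = (max(xs) - min(xs)) * (max(ys) - min(ys))
    return box_area / polygon_area(ring)

square = [(0, 0), (2, 0), (2, 2), (0, 2)]
triangle = [(0, 0), (4, 0), (0, 3)]
print(bounding_box_factor(square))    # 1.0: box and feature coincide
print(bounding_box_factor(triangle))  # 2.0: box is twice the feature area
```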

In order to better understand the range of values for the Bounding Box Factor, tests were run on three different datasets, representing political and natural features at multiple levels of aggregation. These included datasets for Census Tracts, Ecoregions, and Hydrologic Units. Because the Bounding Box Factor can change for different projections of the same feature or dataset, all the data were projected to an Albers Equal Area projection to allow the direct comparison of areal measurements.

Census Tracts

The Census Tract dataset is the 2004 Edition of the U.S. Census Tracts produced by Geographic Data Technology for ESRI and distributed on the ESRI Data and Maps CD-ROM (See Figure 5). It covers all 50 states and the District of Columbia. Data were analyzed at four levels: Census Tract component parts, Census Tracts, Counties, and States. Some Census Tracts are multipart features, so they were broken into their component parts to analyze data at its atomic level.

Figure 5. Map of Continental US Census Tracts (2004) – Tracts.

Census Tracts (2004) Bounding Box Factor

Geography Level     Feature Count  Minimum   Maximum      Mean      Standard Deviation
Tracts (Component)  66906          1.003001  43.402738    1.870394  0.772512
Tracts (Multipart)  65344          1.003001  3158.839174  1.938390  13.185470
Counties            3141           1.004461  42.077043    1.609442  0.897699
States              51             1.083702  11.852433    2.085811  1.533791

Table 1. Bounding Box Statistics for Census Tract Data

The Census Tract Bounding Box Factor data are reported in Table 1. The data have the lowest overall mean values of all datasets for the Bounding Box Factor. This is not unexpected, as Census Tracts are designed by humans using guidelines that place an emphasis on compactness. [14] The minimum Bounding Box Factors are close to 1, meaning that the bounding box very closely approximates the shape of the feature. This dataset also has the largest maximum value for the Bounding Box Factor, which occurs at the multipart, unpopulated Census Tract with a FIPS Code of 09009000000. This Census Tract, located in New Haven County, Connecticut, has two parts separated by more than 20 miles, with a total area of 0.049 square miles and a bounding box area of 156.212 square miles (See Figure 6).

Figure 6. Multipart Census Tract 09009000000 Shown With Bounding Box.
Although they are difficult to see, the two small features shown in red at the southwest and northeast corners
of the bounding box are the components of the Census Tract.

Ecoregions

The Ecoregions dataset is the USDA Forest Service dataset for ‘Ecoregions and Subregions of the United States, Puerto Rico, and the U.S. Virgin Islands,’ published in 2004 (See Figure 7). [15] According to the metadata accompanying the dataset:

This data set shows ecoregions, which are ecosystems of regional extent, in the United States, Puerto Rico, and the U.S. Virgin Islands. Four levels of detail are included to show a hierarchy of ecosystems. The largest ecosystems are domains, which are groups of related climates and are differentiated based on precipitation and temperature. Divisions represent the climates within domains and are differentiated based on precipitation levels and patterns as well as temperature. Divisions are subdivided into provinces, which are differentiated based on vegetation or other natural land covers. The finest level of detail is described by subregions, called sections, which are subdivisions of provinces based on terrain features.

The dataset covers all 50 states and the District of Columbia. Some Sections are multipart features, so they were broken into their component parts similar to the Census Tract data.

Figure 7. Map of Continental United States Ecoregions and Subregions - Sections.

Ecoregions (2004) Bounding Box Factor

Geography Level      Feature Count  Minimum   Maximum     Mean        Standard Deviation
Section (Component)  3072           1.204887  40.362136   2.139069    1.798168
Section (Multipart)  193            1.339489  103.225697  3.458186    7.501049
Province             52             1.453908  48.111396   5.051460    6.812033
Division             25             1.453908  15.184596   5.031901    3.206748
Domain               4              1.471328  448.317529  114.876144  192.520238

Table 2. Bounding Box Statistics for Ecoregions Data

The Ecoregions Bounding Box Factor data are reported in Table 2. The Ecoregions data have the highest mean values for the Bounding Box Factor of all the datasets. The mean value at the domain level is the highest of any dataset because there are only four overlapping, multipart features at this level (See Figure 8). This accounts for the very high standard deviation value as well. The minimum Bounding Box Factors remain close to 1, but are higher than the Census Tract minimum Bounding Box Factors.

Figure 8. Bounding Boxes for Domain Level Ecoregions Data
Displayed With Section-Level Background Map.

Hydrologic Units

The 1:2,000,000-Scale Hydrologic Unit Boundaries dataset is the 2002 Edition produced by the US Geological Survey (See Figure 9). [16] Data were analyzed at four levels: Cataloging Unit, Accounting Unit, Subregion, and Region. Some Cataloging Units are multipart features, so they were broken into their component parts similar to the other datasets.

Figure 9. 1:2,000,000-Scale Hydrologic Unit Boundaries – Cataloging Units.

Hydrologic Unit Boundaries (2002) Bounding Box Factor

Geography Level              Feature Count  Minimum   Maximum     Mean      Standard Deviation
Cataloging Unit (Component)  5347           1.153610  40.362118   2.281707  1.605163
Cataloging Unit (Multipart)  2262           1.149064  149.092545  2.644819  5.263063
Accounting Unit              379            1.149064  133.335001  3.171363  7.734320
Subregion                    222            1.149064  133.335001  3.093111  8.883635
Region                       22             1.645682  133.335001  8.562469  27.246000

Table 3. Bounding Box Statistics for Hydrologic Unit Data

The Hydrologic Unit Bounding Box Factor data are reported in Table 3. The Hydrologic Unit values generally fall between the Census Tract and Ecoregions values. The minimum Bounding Box Factors remain close to 1, but are higher than the Census Tract minimum Bounding Box Factors and lower than the Ecoregions values. The mean values across the levels range from just over two to approximately eight and a half. The maximum values for the Accounting Unit, Subregion, and Region are identical because they represent the same set of features at all levels: a multipart set of features with Accounting Unit, Subregion, and Region codes of 0.

Factors Affecting Approximation Accuracy

There are two key factors that determine the Bounding Box Factor, and hence the effectiveness of the bounding box as an approximation of the areal coverage of a feature. These are the presence of multiple component parts and the shape/orientation of the feature.

Of these two factors, the presence of multiple components contributes most to high Bounding Box Factors. This has been discussed previously, with the most extreme example shown in Figure 6. Conditions where component parts are small and widely separated lead to the most extreme cases. In the case of the United States, states like Alaska, Hawaii, and Michigan all have high Bounding Box Factors. This is especially problematic for Alaska, which suffers from both multiple components and a Global Gotcha (it crosses the 180 degree meridian).

Bounding boxes for multiple component features should be treated cautiously. In cases like Hawaii, where the components are separated but no intervening features exist, the multiple components will have little impact on searches that use a bounding box. Other cases, like the domain-level Ecoregions data shown in Figure 8, involve overlapping bounding boxes. In these cases, searches based on bounding boxes will return excessive irrelevant information.
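
The effect of small, widely separated components can be illustrated with a toy Python example. The two unit squares below are invented for illustration and are not drawn from any of the datasets; the helper names are hypothetical.

```python
def ring_area(ring):
    """Planar area of one component by the shoelace formula."""
    pairs = zip(ring, ring[1:] + ring[:1])
    return abs(sum(x0 * y1 - x1 * y0 for (x0, y0), (x1, y1) in pairs)) / 2

def box_area(points):
    """Area of the axis-aligned bounding box around a set of points."""
    xs, ys = zip(*points)
    return (max(xs) - min(xs)) * (max(ys) - min(ys))

# A hypothetical multipart feature: two unit squares at opposite corners.
parts = [
    [(0, 0), (1, 0), (1, 1), (0, 1)],
    [(99, 99), (100, 99), (100, 100), (99, 100)],
]

feature_area = sum(ring_area(p) for p in parts)
all_points = [pt for p in parts for pt in p]
multipart_factor = box_area(all_points) / feature_area
component_factors = [box_area(p) / ring_area(p) for p in parts]

print(multipart_factor)   # 5000.0 for the feature as a whole
print(component_factors)  # [1.0, 1.0] for the components taken separately
```

Each component is perfectly approximated by its own bounding box, yet the single box around the whole feature is thousands of times larger than the feature.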

Feature shape and orientation also contribute to higher Bounding Box Factors, although to a much smaller degree (the highest Bounding Box Factor for a disaggregated feature was 43.402738, while the highest Bounding Box Factor for an aggregated feature was 3158.839174).

A graph of the Bounding Box Factor versus Circularity helps to understand the relationship between the shape of an object and its Bounding Box Factor (See Figure 10). The measure of shape used for this analysis was Circularity, which is defined as 4 * PI * area / (perimeter * perimeter).[17] Circularity ranges from 0 for a long, thin feature approximating a line to 1 for a circle.
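
The Circularity formula can be written out as a short Python sketch for a planar polygon. The function name and the sample shapes are illustrative assumptions; the values in this study were calculated with the ArcGIS extension cited in note 17, not with this code.

```python
import math

def circularity(ring):
    """4 * PI * area / (perimeter * perimeter) for a list of (x, y) vertices."""
    edges = list(zip(ring, ring[1:] + ring[:1]))
    area = abs(sum(x0 * y1 - x1 * y0 for (x0, y0), (x1, y1) in edges)) / 2
    perimeter = sum(math.hypot(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in edges)
    return 4 * math.pi * area / (perimeter * perimeter)

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
sliver = [(0, 0), (100, 0), (100, 1), (0, 1)]  # long, thin rectangle

print(round(circularity(square), 4))  # 0.7854, i.e. PI / 4
print(round(circularity(sliver), 4))  # 0.0308
```

The long, thin rectangle scores close to 0, while the square scores PI / 4, still short of the value of 1 reached only by a circle.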

Figure 10. Graph of Bounding Box Factor Versus Circularity. This shows the relationship
between the Bounding Box Factor and Circularity for the 1000 largest Bounding Box scores.

An examination of the graph of Bounding Box Factor scores versus Circularity shows that high Bounding Box Factor scores (above 10) for disaggregated features in the Census Tract data are always associated with low Circularity (below 0.2), but that it is possible to have low Bounding Box Factor scores associated with low Circularity scores as well. The feature in Figure 11, a Census Tract component for a tract in Nueces County, Texas, has a Circularity of 0.017458 and a Bounding Box Factor of 2.217171. The low Circularity is due to the highly crenulated outline.

Figure 11. Disaggregated Census Tract Feature with Low Circularity and Low Bounding Box Factor.

The Census Tract component part with the highest Bounding Box Factor, a value of 43.402738, is shown in Figure 12. This barrier island in Carteret County, North Carolina, has a long, thin shape running from southwest to northeast. Shape is not the only factor affecting the Bounding Box Factor; orientation is also a consideration. If the feature in Figure 12 were oriented vertically or horizontally, rather than diagonally, the Bounding Box Factor would be significantly lower.
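
The role of orientation can be shown with a simple Python sketch that rotates a long, thin rectangle and recomputes its Bounding Box Factor. The 10 by 1 rectangle and the function names are illustrative assumptions; they are not a model of the barrier island in Figure 12.

```python
import math

def rotated_rectangle(length, width, angle_deg):
    """Corners of a length x width rectangle rotated about the origin."""
    a = math.radians(angle_deg)
    corners = [(0, 0), (length, 0), (length, width), (0, width)]
    return [(x * math.cos(a) - y * math.sin(a),
             x * math.sin(a) + y * math.cos(a)) for x, y in corners]

def rectangle_bbox_factor(length, width, angle_deg):
    """Bounding Box Factor of the rotated rectangle."""
    xs, ys = zip(*rotated_rectangle(length, width, angle_deg))
    box_area = (max(xs) - min(xs)) * (max(ys) - min(ys))
    return box_area / (length * width)  # rotation does not change the area

for angle in (0, 15, 30, 45, 90):
    print(angle, round(rectangle_bbox_factor(10, 1, angle), 2))
```

For this rectangle the factor is 1 when the long axis is horizontal or vertical and peaks at 6.05 when it runs diagonally at 45 degrees, even though the shape itself never changes.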

Figure 12. Disaggregated Census Tract Feature with the Highest Bounding Box Factor.

Summary

The bounding box is a fundamental component of numerous computational geometry algorithms, indexing schemes such as R-trees, and metadata. As a surrogate for a feature, it provides exact information on the extent or limits of the feature and approximates the coverage. Despite its ubiquity and simplicity, the bounding box remains a subtle and nuanced entity, showing once again that "spatial is special."

This paper has explored the underlying nature of the bounding box so that users can be aware of its promise and pitfalls. When dealing with bounding boxes and metadata, the initial problem is the Content Quandary. The bounding box may reflect the extent of the data collection area or the data within the dataset. Both are useful, but convey very different information. Information on the data collection area describes the completeness of the data set, specifically the areas for which data were collected. It describes the total area over which information about the phenomenon is available. A bounding box covering the extent of data within a dataset specifies where the phenomenon occurs and is most useful for understanding whether or not the data overlap a specific area of interest.

Other factors, like the Global Gotcha and the Projection Problem, highlight situations where a bounding box may overestimate or underestimate the extents and coverage of a feature. The Global Gotcha occurs when features exist both east and west of the 180-degree meridian. Their bounding boxes are overestimated, due to the artifact created by defining longitude from 180 degrees west to 180 degrees east. The Projection Problem introduces a different issue. When the bounding box is projected, the locations of the four corners are projected and connected with straight lines. This can potentially lead to both overestimation and underestimation of the extents of the feature. Underestimation is the greater concern, as queries against a projected bounding box may not locate features that would be found within the extents of the unprojected bounding box. The solution is for users to densify the boundary lines before projecting the data.
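
The longitude artifact behind the Global Gotcha can be sketched in Python. One possible workaround, assumed here for illustration and not proposed in this paper, is to take the narrowest arc of longitude that covers all the points rather than the simple minimum and maximum. The function name and the sample longitudes are hypothetical.

```python
def longitude_extent(lons):
    """Return (west, east) of the narrowest arc covering all longitudes."""
    ordered = sorted(set(lon % 360 for lon in lons))
    best_gap, west, east = -1.0, None, None
    for i, lon in enumerate(ordered):
        nxt = ordered[(i + 1) % len(ordered)]
        gap = (nxt - lon) % 360  # empty arc between neighbouring longitudes
        if gap > best_gap:
            # The extent starts just after the largest empty arc.
            best_gap, west, east = gap, nxt, lon

    def to_signed(v):
        return (v + 180) % 360 - 180  # back to the -180..180 convention

    return to_signed(west), to_signed(east)

# Hypothetical points on both sides of the 180-degree meridian.
lons = [172.0, 179.5, -179.0, -170.0]

naive_width = max(lons) - min(lons)
west, east = longitude_extent(lons)
wrapped_width = (east - west) % 360

print(naive_width)    # 358.5 degrees, nearly the whole globe
print(west, east)     # 172.0 -170.0
print(wrapped_width)  # 18.0 degrees
```

A box reported this way has a west value greater than its east value, so any software consuming it must be prepared to interpret that as a crossing of the 180-degree meridian.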

Finally, in the Approximation Assessment, the paper examined the quality of the bounding box as an estimator for the area of its associated feature. Bounding boxes for Census Tracts, Ecoregions, and Hydrologic Units were examined at multiple geography levels using the Bounding Box Factor, which is the ratio of the area of the bounding box to the area of the feature. Mean Bounding Box Factors ranged from a low of 1.609442 for the Census Tracts at the county level to a maximum of 114.876144 for the Ecoregions at the domain level. The mean Bounding Box Factors for all the datasets, except the Ecoregions at the domain level, fell between about 1.6 and 8.6. Multipart features with small but widely separated component parts lead to the greatest Bounding Box Factor scores. The Hawaiian Islands are a good example of this type of feature. Less important, but still significant, factors are the shape and orientation of the feature. Long, thin shapes angled diagonally have high Bounding Box Factor scores and low Circularity.

In summary, all geospatial features require special treatment, even simple features like the bounding box. Users should take care to understand the strengths and limitations of their data when undertaking any form of analysis.

Acknowledgements

The author would like to thank Allan Wiley, Jackie Bryant, Jim Auclair, Valerie Carney, and Eric Zimmerman of the U.S. Army Engineer Research and Development Center, Topographic Engineering Center, for their support and comments on draft versions of the paper. The author would also like to thank the anonymous reviewers for their constructive comments and David Allen (Stony Brook University, State University of New York) for his editorial assistance.

References

1. MiMi.hu, "Minimum Bounding Rectangle," MiMi.hu, http://en.mimi.hu/gis/minimum_bounding_rectangle.html

2. Open GIS Consortium, Inc., OpenGIS® Geography Markup Language (GML) Implementation Specification, OGC 02-023r4, Version 3.0, January 29, 2003. https://portal.opengeospatial.org/files/?artifact_id=7174

3. Ibid.

4. Sunday, Dan, “Bounding Containers for Polygons, Polyhedra, and Point Sets (2D & 3D)”, softSurfer.com, http://geometryalgorithms.com/Archive/algorithm_0107/algorithm_0107.htm

5. Ibid.

6. Ibid.

7. Koperski, Krzysztof, “Spatial Data Structures, Computations, and Queries,” Simon Fraser University, http://db.cs.sfu.ca/GeoMiner/survey/html/node4.html

8. Dublin Core Metadata Initiative, “DCMI Box Encoding Scheme: specification of the spatial limits of a place, and methods for encoding this in a text string,” http://dublincore.org/documents/dcmi-box/index.shtml.

9. Federal Geographic Data Committee. FGDC-STD-001-1998. Content Standard for Digital Geospatial Metadata (Revised June 1998) (Washington, D.C.: Federal Geographic Data Committee, 1998), 5.

10. International Organization for Standardization. ISO 19115. Geographic information – Metadata. First Edition (Geneva, Switzerland: ISO, May 1, 2003).

11. Dublin Core Metadata Initiative, "DCMI Box Encoding Scheme."

12. Longley, Paul A., Michael F. Goodchild, David J. Maguire, and David Rhind. Geographic Information Systems and Science (Chichester, U.K.: John Wiley & Sons, 2001), 5-6.

13. Biological Data Working Group, Federal Geographic Data Committee and USGS Biological Resources Division. FGDC-STD-001.1-1999. Content Standard for Digital Geospatial Metadata Part 1: Biological Data Profile (Washington, D.C.: Federal Geographic Data Committee), 1-13.

14. For a discussion on compactness and Census tracts, see Mark Monmonier, "Gauging Compactness" in Bushmanders & Bullwinkles (Chicago: University of Chicago Press, 2001), 64-76.

15. The “Ecoregions and Subregions of the United States, Puerto Rico, and the U.S. Virgin Islands” from the US Forest Service is available online at http://www.fs.fed.us/institute/ecoregions/eco_download.html

16. The “1:2,000,000-Scale Hydrologic Unit Boundaries” from the US Geological Survey is available online at http://water.usgs.gov/GIS/huc.html

17. Circularity is defined in the online documentation of the Data Interoperability Extension for ArcGIS. This extension was used to calculate the circularity for the Census Tracts.