The increasing interest in and availability of geo-referenced health data brings with it the need for methods to analyze them properly, taking into account the spatial correlation of outcomes in nearby places. Recognition of spatial influences in statistical inference dates back to some of the earliest developments of modern statistical methods, leading, for example, to notions of randomized plot designs for agricultural field trials.
Development of theoretical methods for spatially referenced data includes point process models, spatial prediction, and spatial lattice models in fields such as agriculture, entomology, bacteriology, cosmology, mining, and meteorology. Methods for analyzing measurements taken at fixed point locations as random processes grew from independent developments by Matheron (Principles of geostatistics, Economic Geology) and Gandin for analyzing geologic data.
Although the areas of spatial statistics, statistical computing, and GIS have all developed substantially from the 1960s through today, these developments have been, and continue to be, largely separate and independent of one another. Application of spatial statistical methods is more common now that both GIS and spatial statistical software packages are widely available.
While there are several texts focused on statistical methods for spatial health data and on health applications of GIS, there is a growing need for guidance in combining the two areas, in particular in selecting and properly using the appropriate statistical techniques for different types of geo-referenced health data.
Complicating factors
We specifically focus on spatial and spatio-temporal statistical methods appropriate for observational human health data, not clinical trials or data from other types of designed experiments.
A challenging but common problem with this type of data is the difficulty of obtaining accurate exposure and disease outcome data for the time and place most relevant to the disease. Health consequences are the result of a continuum of multiple and varied exposures, which often occur over a long period of time and in various places. An increasing concern in this field is protecting the privacy and confidentiality of study subjects. While all researchers agree that this is important, it is often difficult to reconcile with the data requirements of a proper analysis.
In particular, spatial data analysis and mapping of results are often hampered by the lack of specific addresses. Data collection agencies and medical facilities are imposing increasingly strict requirements for data release and often identify a place (usually the patient's address) only to the level of a broad administrative unit. In addition to these general concerns regarding the analysis of health data, the spatial analysis of cancer data poses unique challenges. Most cancers develop over a period of 20 to 30 years and are the result of multiple exposures interacting with the individual's genetic susceptibility.
Few Americans live in a single place for decades – migration presents the problem of which residential address to use for a case’s location. Because latencies differ by cancer type and most likely by an individual’s susceptibility, little guidance is available for this question. The rarity of cancer also causes a sparse data problem for analysis, both for detecting clusters in data with high spatial variability and for communication of results without violating confidentiality.
John Snow's illustration of his theorized cause of cholera in London via a map of case residences was possible because of the large number of cases in a small geographic area with a single, precisely located exposure. The detection of clusters of a rare disease such as cancer requires sophisticated statistical tools that filter out the potentially confounding effects of age, spatially varying population density, and mobility. As pointed out by Waller and Gotway (Waller LA, Gotway CA. Applied Spatial Statistics for Public Health Data. New York: John Wiley & Sons; 2004), different statistical methods answer different questions and require care in appropriate application and interpretation. These and other limitations of spatial data analysis are discussed further in the accompanying article by Jacquez (Jacquez GM. Current practices in spatial analysis of cancer: flies in the ointment. International Journal of Health Geographics. 2004; 3). Despite such concerns, important discoveries in cancer research do result from spatial data analysis.
Although U.S. mortality data had been published in tabular form for many years, it was not until mortality rates were mapped in 1975 that spatial patterns emerged, such as the cluster of high oral cancer rates in southeastern states, later found to be due to smokeless tobacco use. Later, a number of clusters of childhood leukemia were identified. Although environmental, genetic, and viral hypotheses have been proposed, the cause of most of these clusters remains unclear.
These studies illustrate the potential impact of spatial data analysis on medical research. Finally, in order to ultimately improve public health, the results of complex analyses of geo-referenced cancer data must be disseminated to those in a position to take action, such as state epidemiologists and local cancer control specialists. The maps below show the geographic patterns of mortality rates for all cancers combined in the U.S. (1970-74), for white males and for white females.