Table of Contents | Director's Message | Executive Summary | CGD Achievements
Education and Outreach | Community Service | Awards | Publications | People | ASR 2004 Home


Geophysical Statistics Project Narrative

The mission of the Geophysical Statistics Project (GSP) is to encourage the application of statistical analysis and the development of new statistical methods in the geophysical sciences. To fulfill this goal GSP must also be engaged in basic statistical research including mathematical statistics and probability theory. Although GSP is administered through the Climate and Global Dynamics Division, this project is an NCAR-wide effort, and the research reported in the following sections reflects this breadth of scientific collaboration across the Center’s divisions. A large part of GSP's impact is through multi-year projects with substantial scientific involvement. For this reason many of the research results below are based on efforts continuing from last year’s report. Project leadership is provided by Douglas Nychka (GSP). Co-Principal investigators for the project are Richard Katz (Environmental and Societal Impacts Group, ESIG), Joseph Tribbia (Climate Dynamics and Predictability, CDP) and Jeffrey Anderson (Data Assimilation Initiative, DAI).

Spatial and Spatio-Temporal Processes

Multiresolution covariance models

Tomoko Matsuo (GSP/DAI), in collaboration with Nychka, has developed a parametrization for a family of covariances based on a wavelet representation and implemented an algorithm to estimate the covariance from irregularly sampled spatial data. Flexible classes of nonstationary covariance functions are important for the analysis of historical climate data or atmospheric soundings, sparse measurements of the upper atmosphere, and as a background (or prior) covariance in data assimilation. The basic idea behind a wavelet representation is that the random field is a composition of basis functions with localized support multiplied by random coefficients. The main contribution of this modeling project is the construction of a covariance among the wavelet coefficients that is much simpler than the covariance of the original field, can be adaptive over space, and has many zero elements. One result of this project is a statistical parametrization of the square root of the coefficient covariance matrix (H) that can accurately approximate the shape/scale Matern family of one dimensional covariances. Preliminary results also suggest that this parametrization can be extended to a two-dimensional family including anisotropy. The sparsity of the H matrix makes it possible to compute the likelihood of the covariance efficiently for large spatial datasets using sparse matrix algorithms. Moreover, a Monte Carlo Expectation/Maximiation (EM) algorithm has also be shown to work for irregularly sampled data. Here the E step is accomplished using conditional simulation from the spatial model to estimate the missing values of the field on a regular grid. The M step takes advantage of fast wavelet algorithms for regular grids, and unlike Fourier techniques, does not require tapering at the boundaries.

Precipitation Extremes

Daniel Cooley (GSP and University of Colorado), in collaboration with Ulrike Schneider (GSP) Linda Mearns (ESIG), and Philippe Naveau (University of Colorado), is developing statistical models to characterize the spatial distribution of extreme rainfall. The behavior of univariate extremes can be easily modeled parametrically, but little is known how the univariate distribution can vary over space. In addition, no parametric model exists to describe the family of multivariate extreme value distributions. This problem is exacerbated when considering processes of extreme values, such as a field of annual maximum precipitation across a region.

As a joint project with ESIG, new statistical methodology is being developed for extreme precipitation data from Colorado's Front Range. The goal is to improve on methods currently in use by the United States Weather Service in the construction of regional precipitation atlases. A Bayesian hierarchical model is proposed in which the parameters of the extreme value distribution at each location are spatially dependent and possibly dependent on other covariates (such as the annual precipitation, slope and elevation).  From this model one obtains a spatial prediction of precipitation return levels, along with ensemble-based measures of uncertainty. Another data motivated project arises from regional climate model (RCM) experiments. Unlike observational data, the RCM output is a complete field, but due to limited computing resources, does not cover a long period of time. The goal is to quantify the extreme behavior of the RCM simulations of daily precipitation based on statistical extreme value distributions. One important scientific issue is how extremes change according to the spatial scale of the output grid. Preliminary results for the Pacific Northwest Laboratory (PNL) RCM, forced by the NCAR Parallel Climate Model, show coherent spatial patterns of return levels for simulations of current climate.

Some theoretical questions related to spatial dependence extremes have also been investigated. Spatial dependencies are often modeled with the variogram; however, use of the variogram requires that second moments be finite, which is often not the case with extreme value distributions. However, the madogram (an L1 version of the variogram) only requires that first moments be finite, and it also has convenient relationships with maxima. Preliminary results suggest that the madogram can be used to estimate spatial dependence parameters for fields of stationary max-stable random fields.

cooley.jpg (153314 bytes)

This figure shows the estimated 50-year return levels based the PNL RCM simulation of current climate. The 50-year return level is the value that has a 1/50 probability of occuring in a given year and is estimated for each grid cell of the RCM model grid. These quantiles were found by fitting a generalized Pareto distribution (GPD) using maximum likelihood to the grid cell time series. To reduce the variability of estimating the GPD parameters across the grid, a local smoothing approach was taken. For a given location a local likelihood was used that weighted the log likelihood of the central grid cell by 1/2 and the log likelihoods of the neighboring grid cells by 1/8. Maximizing this criterion results in more stable parameter estimates across space.

Determining Extremes in Daily Ozone

Eric Gilleland (Research Applications Program) and Nychka have studied the problem of estimating the spatial field comprising the ozone standard proposed by the United States Environmental Protection Agency (EPA). The standard for an ozone monitoring station is an extreme order statistic (fourth highest) of the daily ozone averages observed over the summer season. When the fourth highest daily average (FHDA) exceeds 84 parts per billion, EPA deems that air quality standards have been violated. This is a nonlinear statistic that has different spatial structure from daily ozone measurements and is possibly not normally distributed. Despite these difficulties, the goal is to estimate the spatial field of the standard at unmonitored locations.

In this period previous work was extended to include a spatial model for the tail of the daily ozone distribution. A generalized Pareto distribution (GPD) was used to model ozone extremes as a function of spatial. Here the region had a common shape parameter, and the scale parameter varied according to a spatial model. To reduce the complexity of the model, the spatial covariance for the GPD scale was thin-plate spline model. The posterior modes for the GPD parameters have been determined for the North Carolina study region during the 1997 ozone season. Given the GPD distribution one can calculate the probability that a given location will have an FHDA statistic exceeding 84. The probability surface estimated through this model is very similar in structure to a spatial analysis obtained by a space-time model applied to the daily ozone measurements (These results are reported in the 2003 GSP Annual Science Report.). The fact that an extremes approach agrees with a more conventional space time model is surprising and reinforces the credibility of each method.

FHDAp.jpg (195812 bytes)

This figure summarizes the probability of the FHDA ozone statistics exceeding 84 PPB based on two statistical methods and estimated from monitoring data for 1997. The right image is based on a space-time model for daily values and the distribution of the FHDA statistics is inferred using a Monte Carlo conditional simulation. The left image is the probability of exceedance based on a generalized Pareto distribution where the scale parameter is allowed to vary smoothly over space.

Regression and Classification

Analysis of microarray data

Reinhard Furrer (GSP) and Stephan Sain (University of Colorado at Denver) applied techniques based on sparse matrix methods for tapered covariances to genomic microarray data. This is a technology transfer project that showcases the use of methods developed for large geophysical datasets to an important area in the life sciences. Microarrays are an array of DNA or protein samples that can be hybridized with probes to study patterns of gene expression. They provide a powerful assay to determine how a large number of genes within a cell interact with each other. Typically, a single microarray includes over 400,000 individual observations, too much data for classical analysis techniques. To estimate the effects of different genes, it is appropriate to use a statistical mixed model with a random spatial component. Tapering the covariance matrix in statistical calculations is known to be an accurate approximation and makes the analysis feasible. By taking advantage of the sparse nature of such tapered covariance matrices, back-fitting is used to estimate the fixed and random model parameters. Results are demonstrated on an experiment using microarrays to build a profile of differentially expressed genes relating to cerebral vascular malformations, an important cause of hemorrhagic stroke and seizures. This approach can also be applied to other problems in the environmental and biological sciences that involve the analysis of large and spatially structured data.

microarray.jpg (458764 bytes)

This figure illustrates the decomposition of log gene expression of a single microarray. Each pixel in these images represents a single location on the microarray, and images decompose the response into an overall mean, column and row effects, a spatial field, the gene expression itself, and a remaining error field. Here the spatial field is seen to represent a large amount of the variability compared to the error field.

Stochastic Multiresolution Models for Turbulence

This project is a collaboration among Brandon Whitcher (formerly of GSP, now with GlaxoSmithKline), Jeff Weiss (University of Colorado at Boulder), Thomas Lee (Colorado State University, CSU), and Curtis Storlie (CSU). With the current state of computing technology, it is relatively straightforward to generate high-resolution numerical integrations of the Naiver-Stokes (N-S) equations, the fundamental physical equations that describe (nonreactive) fluid motion. One feature in the simulation of geophysical type fluids is the presence of eddies and vortices. It has become of interest in the past five years to describe these coherent structures (i.e., the number, size, shape, amplitude of vortices), and our goal is to formulate a statistical model for these structures. The initial statistical problem is to segment the time slices from a numerical simulation of turbulence to reliably identify vortices. The statistics estimated from the vortices allow one to verify some of the assumptions made by Weiss and James McWilliams (University of California at Los Angeles) in the derivation of scaling arguments. The vortex field (vorticity is the curl of velocity) is decomposed as a sum of individual models for the vortices where the individual modes use a multi-resolution basis. This model for the field is formulated as a multivariate multiple linear regression in the wavelet domain, and individual vortex models are isolated through a model selection strategy. The identification procedure depends on the usual kind of truncation ideas in wavelet regression and has been implemented on a parallel computer architecture to make analysis of a reasonable numerical experiment feasible. The amount of truncation is determined by generalized cross-validation, in which the number of effective parameters is obtained via a thresholding rule.

A second problem is to use sequences of images to track the movements of vortices. A novel likelihood-based approach is currently being investigated that allows for the birth, death, and emerging of vortices. Results from small-scale numerical experiments demonstrate the promising empirical properties of this new approach. Future research focuses include the studies of (i) the empirical performance of the approach for large scale tracking problems, and (ii) the statistical properties of the estimated movements of the vortices. Several scientific goals from this extension are to quantify the interactions among vortices in simulations of turbulence and to track storms and eddies from more realistic numerical experiments for the atmosphere and ocean.

lee_time25.jpg (68512 bytes) Figure 1

lee_time35.jpg (58253 bytes) Figure 2

These two figures (figure 1 and figure 2) are snapshots of the vorticity field from a simulation of 2-d quasi-geostrophic flow taken at two different times. The statistical problem is to readily identify the vortices in this image as distinct coherent elements and to track the movements of the identified coherent elements. The crosses mark the locations of all the detected structures and agree with a subjective visual inspection of the images.

leesummary.jpg (40197 bytes)

This figure summarizes the properties of vortices from different time slices of a numerical simluaton of 2-D turbulence. Here the vortices have been identifed from the vorticity field using the methods illustrated in the previous figures. From scaling arguments one expects that the number of vortices, the average vorticity (circulation), the average radius, and the maximum vorticity to have a linear relationship when the variable and time are logged. The empirical slopes of these linear relationships are useful for characterizing dynamics due to coherent structures such as vortices.

Statistical Computing

The KriSp package for large spatial datasets

Furrer has written a package called ‘KriSp’ (Kriging with Sparse matrices) for the high-level language R. The functions are based on similar methods and classes of the well-established package ‘fields’ (Nychka, et al.). The package allows one to handle large spatial prediction problems where observations occur at irregular locations. The approach is to introduce sparsity in the covariance matrix by tapering with a compactly supported (positive definite) kernel (e.g., one finds the direct product of the covariance matrix and the tapering matrix). Although this technique is common in atmospheric data assimilation, it is not well used in general. The theoretical background was established in the previous year stating that, given the correct choice of a taper (e.g., matching the tail behavior of the spectral density of the covariance function), the approximation is asymptotically optimal. The potential speedup, based a sparse approximations and implemented KriSp, is striking suggesting that a spatial prediction (Kriging) can be computed for several thousand points interactively in the R environment. Also, the tapering approximation is quite accurate sacrificing little statistical efficiency compared to the exact calculation.

Rmatlab.jpg (98247 bytes)

This figure is an example of the computational time for the key step in a spatial prediction problem. Different lines indicate the time for solving ten linear systems for a 1-d Kriging problem based on two different high-level languages (Matlab and R) and different levels of tapering. For the R package the time difference between the exact solution verses using a tapered, sparse approximation is more that a factor of 100 for 1000 spatial locations.

homepage.jpg (228131 bytes)

This figure is a copy of the KriSp homepage and gives downloading instructions and a concise tutorial.

Model Validation and Synthesis

Combining multi-model regional climate projections

This project is a collaboration among Furrer, Nychka, Sain, and Tom Wigley (Climate Analysis Section). The objective is to develop a Bayesian mixed effects model to combine the results of different climate model simulations and quantify the uncertainty. This research is in the context of synthesizing the numerical experiments from different climate modeling centers in a way that produces credible probabilistic forecasts of future climate change. The novelty of this approach is the use of gridded, high-resolution data within a spatial framework, and is a generalization of Claudia Tebaldi's (ESIG) former work (with collaboration with Mearns, and Richard Smith, University of North Carolina). This single hierarchical Bayes model can be used to quantify uncertainty of atmosphere-ocean general the circulation model’s (AOGCM) responses on different scales, from regions of several hundred miles up to continental size. A pilot dataset used for this work is the set of 16 AOGCM experiments runs contained in the SCENGEN/MAGICC (A Regional Climate Scenario Generator/Model for the Assessment of Greenhouse-gas Induced Climate Change) package and has a resolution of 5 x 5 degrees.

The posterior distributions are obtained with computer-intensive Markov Chain Monte Carlo (MCMC) simulations with present and future mean temperatures from the periods 1971–1990 and 2031–2050.  Preliminary results suggest that the AOGCM results are quite heterogeneous to the extent that spatial models for the model disagreement have significant nonstationary and possibly non-Gaussian components. It is planned to switch to the new high-resolution data used for the Fourth Intergovernmental Panel on Climate Change report and to extend the model to the bivariate case (temperature and precipitation). The coding of this model in the R statistical language will facilitate the use of these methods by a wide audience.

posthistcolo.jpg (38487 bytes)

This figure gives posterior density estimates of the 20-year average temperature climate change between 2031–2050 and 1971–1990 of a 5 x 5 degree region centered in eastern Colorado. The red curve represents the posterior density of mean climate change synthesize from all 16 models and borrowing information from adjacent grid points. The tick marks locate the actual differences for the models, the "data" for this grid box. As an example, the light blue (CAR Climate System Model) and yellow (ECM4) are the posterior densities for two models.

Parameter estimation for intermediate climate models

This project is collaborative research among Chris Forest (Massachusetts Institute of Technology, MIT), Bruno Sanso (University of California at Santa Cruz), Nychka, Tebaldi, and Dorin Drignei (GSP) and is intended to revise the method for estimating the uncertainty in climate system properties from Forest, et al. (2002). From a statistical point of view, the goal is to estimate the parameters of a nonlinear regression model where the regression function is expensive to evaluate and the response has correlated errors. Some of this work draws on the design and analysis of computer experiments, but has the additional complexity that a run of the climate model only yields a realization of the climate and so contains some sampling error. To apply a fully Bayesian approach, we first approximate the response of the MIT intermediate climate model with a statistical model that provides a response surface in the uncertain parameter space. The three-dimensional parameter space is defined as climate sensitivity (CS), rate of deep-ocean heat uptake (KV), and the net aerosol forcing (FA), and covers the three major uncertain quantities that affect the ability to simulate accurately the 20th century climate record. The availability of this response surface permits one to sample the joint posterior distribution of the parameters using a MCMC technique. This approach facilitates the testing of methodologies for performing a more computationally intensive project using the complete MIT climate model, which is infeasible with current computer resources. Currently, the choice of parameters to evaluate the climate model is not driven by knowledge of the regression response surface.  This statistical approach would lead to computational designs that specify a minimal number of physical parameter vectors required to achieve a desired precision of the parameter estimates.

Previous work has used an error covariance estimated from long control runs for all other combinations of the above physical parameters. It is possible that the error covariance is dependent on these parameters and a better approach would be to construct a global statistical model that allows the error covariance to vary as a function of the climate model parameters.

interp.jpg (133490 bytes)

This figure is an example of output of temperature change from MIT model and the interpolated temperatures, showing the performance of the interpolation algorithm. The plotted quantity is surface temperature change, averaged over 4 zonal bands (4 plots), and 5 decades (along the x-axis of each plot). Black solid line, MIT model ensemble mean, dashed lines individual ensemble members, red line interpolated temperatures. This specific case corresponds to the setting CS=0.5° C, KV=40cm2/s and FA=-0.5 W/m2

Table of Contents | Director's Message | Executive Summary | [DIV] Achievements
Education and Outreach | Community Service | Awards | Publications | People | ASR 2004 Home