Geophysical Statistics Project Narrative
The mission of the Geophysical Statistics Project (GSP) is
to encourage the application of statistical analysis and the development of new
statistical methods in the geophysical sciences. To fulfill this goal GSP must also be
engaged in basic statistical research including mathematical statistics and probability
theory. Although GSP is administered through the Climate and Global Dynamics Division,
this project is an NCAR-wide effort, and the research reported in the following sections
reflects this breadth of scientific collaboration across the Centers divisions. A
large part of GSP's impact is through multi-year projects with substantial scientific
involvement. For this reason many of the research results below are based on efforts
continuing from last years report. Project leadership is provided by Douglas Nychka
(GSP). Co-Principal investigators for the project are Richard Katz (Environmental and
Societal Impacts Group, ESIG), Joseph Tribbia (Climate Dynamics and Predictability, CDP)
and Jeffrey Anderson (Data Assimilation Initiative, DAI).
Spatial and Spatio-Temporal Processes
Multiresolution covariance models
Tomoko Matsuo (GSP/DAI), in collaboration with Nychka,
has developed a parametrization for a family of covariances based on a wavelet
representation and implemented an algorithm to estimate the covariance from irregularly
sampled spatial data. Flexible classes of nonstationary covariance functions are important
for the analysis of historical climate data or atmospheric soundings, sparse measurements
of the upper atmosphere, and as a background (or prior) covariance in data assimilation.
The basic idea behind a wavelet representation is that the random field is a composition
of basis functions with localized support multiplied by random coefficients. The main
contribution of this modeling project is the construction of a covariance among the
wavelet coefficients that is much simpler than the covariance of the original field, can
be adaptive over space, and has many zero elements. One result of this project is a
statistical parametrization of the square root of the coefficient covariance matrix (H)
that can accurately approximate the shape/scale Matern family of one dimensional
covariances. Preliminary results also suggest that this parametrization can be extended to
a two-dimensional family including anisotropy. The sparsity of the H matrix makes it
possible to compute the likelihood of the covariance efficiently for large spatial
datasets using sparse matrix algorithms. Moreover, a Monte Carlo Expectation/Maximiation
(EM) algorithm has also be shown to work for irregularly sampled data. Here the E step is
accomplished using conditional simulation from the spatial model to estimate the missing
values of the field on a regular grid. The M step takes advantage of fast wavelet
algorithms for regular grids, and unlike Fourier techniques, does not require tapering at
the boundaries.
Precipitation Extremes
Daniel Cooley (GSP and University of Colorado), in
collaboration with Ulrike Schneider (GSP) Linda Mearns (ESIG), and Philippe Naveau
(University of Colorado), is developing statistical models to characterize the spatial
distribution of extreme rainfall. The behavior of univariate extremes can be easily
modeled parametrically, but little is known how the univariate distribution can vary over
space. In addition, no parametric model exists to describe the family of multivariate
extreme value distributions. This problem is exacerbated when considering processes of
extreme values, such as a field of annual maximum precipitation across a region.
As a joint project with ESIG, new statistical methodology
is being developed for extreme precipitation data from Colorado's Front Range. The goal is
to improve on methods currently in use by the United States Weather Service in the
construction of regional precipitation atlases. A Bayesian hierarchical model is proposed
in which the parameters of the extreme value distribution at each location are spatially
dependent and possibly dependent on other covariates (such as the annual precipitation,
slope and elevation). From this model one obtains a spatial prediction of
precipitation return levels, along with ensemble-based measures of uncertainty. Another
data motivated project arises from regional climate model (RCM) experiments. Unlike
observational data, the RCM output is a complete field, but due to limited computing
resources, does not cover a long period of time. The goal is to quantify the extreme
behavior of the RCM simulations of daily precipitation based on statistical extreme value
distributions. One important scientific issue is how extremes change according to the
spatial scale of the output grid. Preliminary results for the Pacific Northwest Laboratory
(PNL) RCM, forced by the NCAR Parallel Climate Model, show coherent spatial patterns of
return levels for simulations of current climate.
Some theoretical questions related to spatial dependence
extremes have also been investigated. Spatial dependencies are often modeled with the
variogram; however, use of the variogram requires that second moments be finite, which is
often not the case with extreme value distributions. However, the madogram (an L1 version
of the variogram) only requires that first moments be finite, and it also has convenient
relationships with maxima. Preliminary results suggest that the madogram can be used to
estimate spatial dependence parameters for fields of stationary max-stable random fields.

This figure shows the estimated
50-year return levels based the PNL RCM simulation of current climate. The 50-year return
level is the value that has a 1/50 probability of occuring in a given year and is
estimated for each grid cell of the RCM model grid. These quantiles were found by fitting
a generalized Pareto distribution (GPD) using maximum likelihood to the grid cell time
series. To reduce the variability of estimating the GPD parameters across the grid, a
local smoothing approach was taken. For a given location a local likelihood was used that
weighted the log likelihood of the central grid cell by 1/2 and the log likelihoods of the
neighboring grid cells by 1/8. Maximizing this criterion results in more stable parameter
estimates across space.
Determining Extremes in Daily Ozone
Eric Gilleland (Research Applications Program) and
Nychka have studied the problem of estimating the spatial field comprising the ozone
standard proposed by the United States Environmental Protection Agency (EPA). The standard
for an ozone monitoring station is an extreme order statistic (fourth highest) of the
daily ozone averages observed over the summer season. When the fourth highest daily
average (FHDA) exceeds 84 parts per billion, EPA deems that air quality standards have
been violated. This is a nonlinear statistic that has different spatial structure from
daily ozone measurements and is possibly not normally distributed. Despite these
difficulties, the goal is to estimate the spatial field of the standard at unmonitored
locations.
In this period previous work was extended to include a
spatial model for the tail of the daily ozone distribution. A generalized Pareto
distribution (GPD) was used to model ozone extremes as a function of spatial. Here the
region had a common shape parameter, and the scale parameter varied according to a spatial
model. To reduce the complexity of the model, the spatial covariance for the GPD scale was
thin-plate spline model. The posterior modes for the GPD parameters have been determined
for the North Carolina study region during the 1997 ozone season. Given the GPD
distribution one can calculate the probability that a given location will have an FHDA
statistic exceeding 84. The probability surface estimated through this model is very
similar in structure to a spatial analysis obtained by a space-time model applied to the
daily ozone measurements (These results are reported in the 2003 GSP Annual Science
Report.). The fact that an extremes approach agrees with a more conventional space time
model is surprising and reinforces the credibility of each method.

This figure summarizes the
probability of the FHDA ozone statistics exceeding 84 PPB based on two statistical methods
and estimated from monitoring data for 1997. The right image is based on a space-time
model for daily values and the distribution of the FHDA statistics is inferred using a
Monte Carlo conditional simulation. The left image is the probability of exceedance based
on a generalized Pareto distribution where the scale parameter is allowed to vary smoothly
over space.
Regression and Classification
Analysis of microarray data
Reinhard Furrer (GSP) and Stephan Sain (University of
Colorado at Denver) applied techniques based on sparse matrix methods for tapered
covariances to genomic microarray data. This is a technology transfer project that
showcases the use of methods developed for large geophysical datasets to an important area
in the life sciences. Microarrays are an array of DNA or protein samples that can be
hybridized with probes to study patterns of gene expression. They provide a powerful assay
to determine how a large number of genes within a cell interact with each other.
Typically, a single microarray includes over 400,000 individual observations, too much
data for classical analysis techniques. To estimate the effects of different genes, it is
appropriate to use a statistical mixed model with a random spatial component. Tapering the
covariance matrix in statistical calculations is known to be an accurate approximation and
makes the analysis feasible. By taking advantage of the sparse nature of such tapered
covariance matrices, back-fitting is used to estimate the fixed and random model
parameters. Results are demonstrated on an experiment using microarrays to build a profile
of differentially expressed genes relating to cerebral vascular malformations, an
important cause of hemorrhagic stroke and seizures. This approach can also be applied to
other problems in the environmental and biological sciences that involve the analysis of
large and spatially structured data.

This figure illustrates the
decomposition of log gene expression of a single microarray. Each pixel in these images
represents a single location on the microarray, and images decompose the response into an
overall mean, column and row effects, a spatial field, the gene expression itself, and a
remaining error field. Here the spatial field is seen to represent a large amount of the
variability compared to the error field.
Stochastic Multiresolution Models for Turbulence
This project is a collaboration among Brandon Whitcher
(formerly of GSP, now with GlaxoSmithKline), Jeff Weiss (University of Colorado at
Boulder), Thomas Lee (Colorado State University, CSU), and Curtis Storlie (CSU). With the
current state of computing technology, it is relatively straightforward to generate
high-resolution numerical integrations of the Naiver-Stokes (N-S) equations, the
fundamental physical equations that describe (nonreactive) fluid motion. One feature in
the simulation of geophysical type fluids is the presence of eddies and vortices. It has
become of interest in the past five years to describe these coherent structures (i.e., the
number, size, shape, amplitude of vortices), and our goal is to formulate a statistical
model for these structures. The initial statistical problem is to segment the time slices
from a numerical simulation of turbulence to reliably identify vortices. The statistics
estimated from the vortices allow one to verify some of the assumptions made by Weiss and
James McWilliams (University of California at Los Angeles) in the derivation of scaling
arguments. The vortex field (vorticity is the curl of velocity) is decomposed as a sum of
individual models for the vortices where the individual modes use a multi-resolution
basis. This model for the field is formulated as a multivariate multiple linear regression
in the wavelet domain, and individual vortex models are isolated through a model selection
strategy. The identification procedure depends on the usual kind of truncation ideas in
wavelet regression and has been implemented on a parallel computer architecture to make
analysis of a reasonable numerical experiment feasible. The amount of truncation is
determined by generalized cross-validation, in which the number of effective parameters is
obtained via a thresholding rule.
A second problem is to use sequences of images to track the
movements of vortices. A novel likelihood-based approach is currently being investigated
that allows for the birth, death, and emerging of vortices. Results from small-scale
numerical experiments demonstrate the promising empirical properties of this new approach.
Future research focuses include the studies of (i) the empirical performance of the
approach for large scale tracking problems, and (ii) the statistical properties of the
estimated movements of the vortices. Several scientific goals from this extension are to
quantify the interactions among vortices in simulations of turbulence and to track storms
and eddies from more realistic numerical experiments for the atmosphere and ocean.
Figure 1
Figure 2
These two figures (figure 1
and figure 2) are snapshots of the vorticity field from a
simulation of 2-d quasi-geostrophic flow taken at two different times. The statistical
problem is to readily identify the vortices in this image as distinct coherent elements
and to track the movements of the identified coherent elements. The crosses mark the
locations of all the detected structures and agree with a subjective visual inspection of
the images.

This figure summarizes the
properties of vortices from different time slices of a numerical simluaton of 2-D
turbulence. Here the vortices have been identifed from the vorticity field using the
methods illustrated in the previous figures. From scaling arguments one expects that the
number of vortices, the average vorticity (circulation), the average radius, and the
maximum vorticity to have a linear relationship when the variable and time are logged. The
empirical slopes of these linear relationships are useful for characterizing dynamics due
to coherent structures such as vortices.
Statistical Computing
The KriSp package for large spatial datasets
Furrer has written a package called KriSp (Kriging
with Sparse matrices) for the high-level language R. The functions are based on
similar methods and classes of the well-established package fields (Nychka, et
al.). The package allows one to handle large spatial prediction problems where
observations occur at irregular locations. The approach is to introduce sparsity in the
covariance matrix by tapering with a compactly supported (positive definite) kernel (e.g.,
one finds the direct product of the covariance matrix and the tapering matrix). Although
this technique is common in atmospheric data assimilation, it is not well used in general.
The theoretical background was established in the previous year stating that, given the
correct choice of a taper (e.g., matching the tail behavior of the spectral density of the
covariance function), the approximation is asymptotically optimal. The potential speedup,
based a sparse approximations and implemented KriSp, is striking suggesting that a spatial
prediction (Kriging) can be computed for several thousand points interactively in the R
environment. Also, the tapering approximation is quite accurate sacrificing little
statistical efficiency compared to the exact calculation.

This figure is an example of
the computational time for the key step in a spatial prediction problem. Different lines
indicate the time for solving ten linear systems for a 1-d Kriging problem based on two
different high-level languages (Matlab and R) and different levels of tapering. For the R
package the time difference between the exact solution verses using a tapered, sparse
approximation is more that a factor of 100 for 1000 spatial locations.

This figure is a copy of
the KriSp homepage and gives downloading instructions and a concise tutorial.
Model Validation and Synthesis
Combining multi-model regional climate projections
This project is a collaboration among Furrer, Nychka,
Sain, and Tom Wigley (Climate Analysis Section). The objective is to develop a Bayesian
mixed effects model to combine the results of different climate model simulations and
quantify the uncertainty. This research is in the context of synthesizing the numerical
experiments from different climate modeling centers in a way that produces credible
probabilistic forecasts of future climate change. The novelty of this approach is the use
of gridded, high-resolution data within a spatial framework, and is a generalization of
Claudia Tebaldi's (ESIG) former work (with collaboration with Mearns, and Richard Smith,
University of North Carolina). This single hierarchical Bayes model can be used to
quantify uncertainty of atmosphere-ocean general the circulation models (AOGCM)
responses on different scales, from regions of several hundred miles up to continental
size. A pilot dataset used for this work is the set of 16 AOGCM experiments runs contained
in the SCENGEN/MAGICC (A Regional Climate Scenario Generator/Model for the Assessment of
Greenhouse-gas Induced Climate Change) package and has a resolution of 5 x 5 degrees.
The posterior distributions are obtained with
computer-intensive Markov Chain Monte Carlo (MCMC) simulations with present and future
mean temperatures from the periods 19711990 and 20312050. Preliminary
results suggest that the AOGCM results are quite heterogeneous to the extent that spatial
models for the model disagreement have significant nonstationary and possibly non-Gaussian
components. It is planned to switch to the new high-resolution data used for the Fourth
Intergovernmental Panel on Climate Change report and to extend the model to the bivariate
case (temperature and precipitation). The coding of this model in the R statistical
language will facilitate the use of these methods by a wide audience.

This figure gives
posterior density estimates of the 20-year average temperature climate change between
20312050 and 19711990 of a 5 x 5 degree region centered in eastern Colorado.
The red curve represents the posterior density of mean climate change synthesize from all
16 models and borrowing information from adjacent grid points. The tick marks locate the
actual differences for the models, the "data" for this grid box. As an example,
the light blue (CAR Climate System Model) and yellow (ECM4) are the posterior densities
for two models.
Parameter estimation for intermediate climate
models
This project is collaborative research among Chris Forest
(Massachusetts Institute of Technology, MIT), Bruno Sanso (University of California at
Santa Cruz), Nychka, Tebaldi, and Dorin Drignei (GSP) and is intended to revise the method
for estimating the uncertainty in climate system properties from Forest, et al. (2002).
From a statistical point of view, the goal is to estimate the parameters of a nonlinear
regression model where the regression function is expensive to evaluate and the response
has correlated errors. Some of this work draws on the design and analysis of computer
experiments, but has the additional complexity that a run of the climate model only yields
a realization of the climate and so contains some sampling error. To apply a fully
Bayesian approach, we first approximate the response of the MIT intermediate climate model
with a statistical model that provides a response surface in the uncertain parameter
space. The three-dimensional parameter space is defined as climate sensitivity (CS), rate
of deep-ocean heat uptake (KV), and the net aerosol forcing (FA), and covers the three
major uncertain quantities that affect the ability to simulate accurately the 20th century
climate record. The availability of this response surface permits one to sample the joint
posterior distribution of the parameters using a MCMC technique. This approach facilitates
the testing of methodologies for performing a more computationally intensive project using
the complete MIT climate model, which is infeasible with current computer resources.
Currently, the choice of parameters to evaluate the climate model is not driven by
knowledge of the regression response surface. This statistical approach would lead
to computational designs that specify a minimal number of physical parameter vectors
required to achieve a desired precision of the parameter estimates.
Previous work has used an error covariance estimated from
long control runs for all other combinations of the above physical parameters. It is
possible that the error covariance is dependent on these parameters and a better approach
would be to construct a global statistical model that allows the error covariance to vary
as a function of the climate model parameters.

This figure is an example of
output of temperature change from MIT model and the interpolated temperatures, showing the
performance of the interpolation algorithm. The plotted quantity is surface temperature
change, averaged over 4 zonal bands (4 plots), and 5 decades (along the x-axis of each
plot). Black solid line, MIT model ensemble mean, dashed lines individual ensemble
members, red line interpolated temperatures. This specific case corresponds to the setting
CS=0.5° C, KV=40cm2/s and FA=-0.5 W/m2 |