
RAP Achievements

M. Statistics Applications and Forecast Verification

[Background] [R-verification package]
[Use of cross-validation]
[An object-oriented approach]
[Verification of CIP]
[Verification of oceanic weather cloud-top predictions]


1. Background


The RAP Verification Group (VG) provides independent verification of aviation weather forecasting systems and other types of forecasts that are being developed at NCAR and other laboratories. The VG works closely with other verification groups [e.g., the Forecast Verification Branch at NOAA’s Forecast Systems Laboratory (FSL)] to evaluate the forecasting capabilities of experimental products and products that are being considered for operational status. During the past year, the VG supported or was responsible for major evaluations of two in-flight icing algorithms, a convection forecasting system, and a turbulence forecasting algorithm. These evaluations were conducted as part of the process of transitioning these algorithms to experimental, and eventually operational, status.

In addition to these algorithm evaluations, the VG develops new statistical techniques and verification approaches that are appropriate for aviation weather forecasts and observations. In particular, because aviation weather forecasts are non-standard and because the phenomena of concern (e.g., icing, turbulence) often are difficult to observe, it frequently is necessary to identify alternative observations and to develop new approaches. Moreover, in many cases standard verification approaches are not able to provide the diagnostic information needed to make optimal use of the forecasts or to improve them. In addition, finer-scale spatial forecasts can be strongly penalized for relatively small temporal or spatial errors. New verification approaches are needed to cope with these various issues.

The studies described in the following sections represent a small sample of the work undertaken in this area during the past year. Research also continued in many other areas, including development of approaches to evaluate ceiling and visibility forecasts, the use of new observations to evaluate turbulence forecasts, applications of extreme value theory, and the development of methods to use aircraft situation information for verification.



2. R-verification package

Matt Pocernich of the RAP VG developed a new forecast verification software package using the R programming language. R is an open source statistical language that has been widely embraced by the statistical community. Developed by statisticians around the world, more than 350 packages on a huge variety of topics have been contributed to the R library for general use. The verification package has been added to this collection of packages and is available to the general public via the R-project website. Providing software directly to the general public offers the opportunity to reach a larger audience, thereby getting more immediate feedback on algorithms and fulfilling NCAR's mission to foster the transfer of knowledge and technology. This method of technology transfer follows a precedent established by NCAR's Geophysical Statistical Initiative, which has transferred other software packages to the R user community.

Functions in the package perform both routine verification tasks and more innovative, experimental methods. Routine tasks include receiver operating characteristic (ROC) plots, attributes diagrams and reliability plots. Figure M-1 provides an example of a ROC plot produced by the package. In the R environment, additional information about the verification process is also available, such as the p-values for the curves and data values. Other functions in the package are more research-oriented and include methods developed outside of NCAR. For example, during a visit in Spring 2004, Barbara Casati (Recherche en Prévision Numérique, Canada) contributed a spatial scale-intensity skill score function. This function is used to verify spatial forecasts, taking into account the effects of scale on a skill score. William Briggs (GIM, Weill Cornell Medical College) contributed an approach that accounts for measurement error in calculating a skill score. Tilmann Gneiting (University of Washington) contributed a continuous ranked probability score function for evaluating probabilistic forecasts. While the verification package was developed primarily to study meteorological forecasts, it has been written in a generic way so that it can be useful across many disciplines. The functions and plots are designed to provide verification analyses for a variety of types of forecasts and observations, namely binary, continuous, probabilistic and distribution.

An overview of this package, as well as others developed at NCAR, was presented by Eric Gilleland and M. Pocernich at useR! 2004, the first R users' conference, which took place during Summer 2004. This conference was attended by more than 200 statisticians and statistical programmers from around the world. A more detailed presentation and demonstration of the package was given at the International Verification Methods Workshop in Montreal. During this presentation, weather-oriented examples were given. Workshop attendees provided many useful suggestions as well as offers to submit additional functions. Internally, members of RAP's verification group have conducted a series of informal seminars intended to encourage scientists to use R.

Figure M-1. Typical Receiver Operating Characteristic Plot produced by the *roc.plot* function in the R-verification package.



3. Use of cross-validation in forecast verification

Ideally, the observations used for verification of forecast products are collected independently from the observations used to create the forecast. Unfortunately, independent collection is not possible in some cases. In this situation, cross-validation techniques can be used. Such techniques are commonly used in statistics, especially in the development of statistical models. They have been adapted for use in forecast verification.

Cross-validation takes a single set of collected data and splits it into two sets. The training set is used to create the forecast while a testing set is used to verify the forecast. For large amounts of data, splitting the data once into testing and training sets may provide enough cases both for forecast development and for verification. However, for smaller sets of data, a single split leaves insufficient cases for training, testing or both. In this case, cross-validation can be repeated several times, each time leaving the majority of the data for forecasting and a small number of cases for verification. Using this method, a quality forecast is created and a large enough verification set is available.

For cross-validation to work ideally, several assumptions must be met. The testing and training sets must be independent of each other. Observations in the testing and training sets must come from the same distribution. The testing set observations must be both accurate and unbiased. Most weather data violate some or all of these criteria.

Tressa Fowler and others in the RAP VG applied cross-validation methods for verification of the Juneau Airport Wind System (JAWS) prototype. All observations were collected by research aircraft, and these observations were used in both the system creation and verification. Measurements taken during a single flight are correlated in time. A single instrument took all measurements, so a systematic bias in the measurements is possible.

The measurements were divided by location and wind regime, into twelve hazard areas and three wind regimes. Forecasts were created separately for each combination of hazard area and wind regime, so verification was also done separately. Some areas had a large number of cases while others had a small number. Different cross-validation strategies were used in each area, depending on how many cases were available. For example, one area had 21 cases. Cross-validation was run for this area 21 times, each time leaving out a single case for verification. Thus, each forecast was created using 20 cases, but 21 cases were available for verification. Table N-1 shows verification statistics for both the cross-validated (JAWS A) and overall version (JAWS B). The statistics include the probability of detecting a turbulence event (PODy), the probability of detecting a non-event (PODn), the false alarm ratio (FAR), and the true skill statistic (TSS). Verification statistics for the JAWS system using cross-validation were somewhat worse than the statistics calculated when cross-validation was not used. This is expected, since cross-validation should give a more independent and realistic estimate of the forecast capability.
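The leave-one-out procedure and the four statistics can be illustrated with a minimal Python sketch. It is not the JAWS verification code: the function names, the toy "forecast system" (a threshold fitted at the midpoint between class means), and the standard 2x2 contingency-table definitions of the scores are all assumptions made for illustration.

```python
from typing import Dict, Iterator, List, Sequence, Tuple


def leave_one_out_splits(n_cases: int) -> Iterator[Tuple[List[int], int]]:
    """Yield (training indices, held-out index), holding out each case once."""
    for held_out in range(n_cases):
        training = [i for i in range(n_cases) if i != held_out]
        yield training, held_out


def fit_threshold(values: Sequence[float], events: Sequence[bool]) -> float:
    """Hypothetical toy forecast system: midpoint between the class means."""
    yes = [v for v, e in zip(values, events) if e]
    no = [v for v, e in zip(values, events) if not e]
    return 0.5 * (sum(yes) / len(yes) + sum(no) / len(no))


def cross_validated_forecasts(values: Sequence[float],
                              events: Sequence[bool]) -> List[bool]:
    """Forecast each case with a system fitted on all the other cases."""
    forecasts = []
    for training, held_out in leave_one_out_splits(len(values)):
        threshold = fit_threshold([values[i] for i in training],
                                  [events[i] for i in training])
        forecasts.append(values[held_out] > threshold)
    return forecasts


def contingency_stats(forecasts: Sequence[bool],
                      observations: Sequence[bool]) -> Dict[str, float]:
    """PODy, PODn, FAR and TSS from paired Yes/No forecasts and observations.

    Assumes at least one Yes observation, one No observation and one
    Yes forecast, so that no denominator is zero.
    """
    pairs = list(zip(forecasts, observations))
    hits = sum(1 for f, o in pairs if f and o)
    misses = sum(1 for f, o in pairs if not f and o)
    false_alarms = sum(1 for f, o in pairs if f and not o)
    correct_nulls = sum(1 for f, o in pairs if not f and not o)
    pod_y = hits / (hits + misses)
    pod_n = correct_nulls / (correct_nulls + false_alarms)
    far = false_alarms / (hits + false_alarms)
    return {"PODy": pod_y, "PODn": pod_n, "FAR": far,
            "TSS": pod_y + pod_n - 1.0}
```

With 21 cases, `leave_one_out_splits(21)` produces 21 splits of 20 training cases each, and the pooled held-out forecasts give 21 forecast-observation pairs for `contingency_stats`. The relation TSS = PODy + PODn - 1 used here is consistent with the values reported in Table N-1.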

Cross-validation methods have also been adapted for use in verification of the National Ceiling and Visibility (NCV) analysis product. The verification will be completed in the coming fiscal year.

Cross-validation methods were used for forecast verification when better methods were not possible. By using cross-validation, a greater degree of independence between the forecasts and observations was achieved. Thus, estimates of forecast performance are more accurate. These methods will be applied again in the future when the nature of the observations available for verification makes cross-validation a beneficial option.

  Table N-1

              PODy          PODn    FAR           TSS
              (56 cases)            (59 cases)
  JAWS (A)    0.46          0.98    0.56          0.44
  JAWS (B)    0.58          0.99    0.31          0.57




4. An object-oriented approach for verifying spatial precipitation and convective forecasts

In recent years, the need for alternative approaches to evaluate spatial forecasts of precipitation and convection has become apparent. This need is partly associated with the desire for verification measures that are more clearly tied to the operational usefulness of the forecasts and that can provide diagnostic information about the quality of the forecasts, for feedback to forecast developers, forecasters, and forecast users. In addition, as forecasts have moved to finer grid resolution, it has become clear that standard verification approaches are often not able to provide meaningful measures of their capabilities.

In response to these needs, in a cross-divisional (MMM and RAP) project led by Barbara Brown and Chris Davis, the VG has developed an object-oriented verification approach that is applicable to human-generated (generally polygonal) and automated (generally gridded) forecasts. With this approach, forecast and observed precipitation/convective areas are reduced to regions of interest that can be compared to one another in a meaningful way. In general, the object-oriented approach requires several steps:

1) objects are identified in the observation and forecast fields;

2) basic attributes of the individual objects (e.g., location, size, orientation, underlying precipitation distribution) are measured;

3) adjacent objects or objects in close proximity that naturally “belong” together are merged into composite objects;

4) forecast and observed objects are optimally matched to each other;

5) relevant measures of the similarities and differences between the matched shapes are computed, and can be summarized across a set of forecasts.
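As a rough illustration of steps 1, 2 and 4, the following Python sketch identifies objects by simple thresholding and 4-connected labeling, measures size and centroid, and pairs forecast and observed objects by nearest centroid. These choices are assumptions made for brevity and are hypothetical stand-ins: the sketch omits the merging step and does not reproduce the binary image-matching or fuzzy logic techniques developed in this project.

```python
import math
from collections import deque
from typing import Dict, List, Tuple

Cell = Tuple[int, int]


def identify_objects(field: List[List[float]], threshold: float) -> List[List[Cell]]:
    """Step 1: group 4-connected grid cells at or above threshold into objects."""
    n_rows, n_cols = len(field), len(field[0])
    seen = [[False] * n_cols for _ in range(n_rows)]
    objects = []
    for r in range(n_rows):
        for c in range(n_cols):
            if seen[r][c] or field[r][c] < threshold:
                continue
            cells, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:  # breadth-first flood fill from the seed cell
                y, x = queue.popleft()
                cells.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < n_rows and 0 <= nx < n_cols
                            and not seen[ny][nx] and field[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            objects.append(cells)
    return objects


def object_attributes(cells: List[Cell]) -> Dict[str, object]:
    """Step 2: basic attributes (size in grid cells, centroid as (row, col))."""
    size = len(cells)
    centroid = (sum(r for r, _ in cells) / size, sum(c for _, c in cells) / size)
    return {"size": size, "centroid": centroid}


def match_objects(forecast_objs: List[List[Cell]], observed_objs: List[List[Cell]],
                  max_distance: float) -> List[Tuple[int, int, float]]:
    """Step 4 (simplified): greedy nearest-centroid matching within max_distance.

    Returns (forecast index, observed index, centroid distance) triples.
    """
    pairs, used = [], set()
    for fi, f_cells in enumerate(forecast_objs):
        fy, fx = object_attributes(f_cells)["centroid"]
        best, best_dist = None, max_distance
        for oi, o_cells in enumerate(observed_objs):
            if oi in used:
                continue
            oy, ox = object_attributes(o_cells)["centroid"]
            dist = math.hypot(fy - oy, fx - ox)
            if dist <= best_dist:
                best, best_dist = oi, dist
        if best is not None:
            used.add(best)
            pairs.append((fi, best, best_dist))
    return pairs
```

Differences between the attributes of each matched pair (for example, centroid displacement or size ratio) can then be summarized across many cases, which is the role of step 5.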

During the past year, two approaches for the matching and merging steps were developed. E. Gilleland and Randy Bullock tested both: one based on a binary image-matching technique and another based on fuzzy logic.

The object-oriented approach with fuzzy logic matching was applied to gridded precipitation forecasts from a 22-km version of the Weather Research and Forecasting (WRF) model for two summers (2001 and 2002), and to precipitation observations based on the Stage IV multi-sensor precipitation dataset. Figure M-2 shows an example of one of the cases included in this evaluation. For this example, four composite observed and forecast shapes were identified. Comparisons of the matched forecast and observed shapes indicated that the forecast objects were generally located too far to the north and west. The median forecast precipitation values were somewhat too large, whereas the 90th percentile precipitation values were somewhat too small, indicating that the model forecast did not represent the observed extreme precipitation very well. These results provide much more information about the forecast performance than can be obtained using standard verification statistics such as the probability of detection and the critical success index. B. Brown and C. Davis have summarized the statistics for all cases, and found similar characteristics of the forecast performance.

A similar approach was also developed for application to human-generated forecasts by John Halley Gotway and R. Bullock. An example of an application to the convective significant meteorological advisories (C-SIGMETs; operational convective advisories issued by the National Weather Service) is shown in Figure M-3. In this case the optimal locations of the forecasts were identified using the object-oriented techniques; the resulting improvement in verification scores represents the performance of a “practically perfect” forecast.

In summary, these new techniques show promise for providing useful diagnostic information regarding the performance of spatial forecasts. While further development and testing are still needed, the progress this year has led to a methodology that can begin providing feedback on the capabilities of a variety of types of forecasts.

Figure M-2. Example case for 12-h WRF precipitation forecasts valid on 2 July 2001 at 0000 UTC: (a) precipitation objects defined by convolution-threshold approach; (b) merged and matched areas. Merged areas are identified by the red band joining them; matched WRF and Stage IV objects have the same letter identifier. Note there are two small Stage IV objects that were not matched to a WRF object.

Figure M-3. Example of a C-SIGMET object optimization for 2-h C-SIGMETs. Green polygons represent the original forecasts; blue polygons represent the optimized forecasts. “Before” and “after” verification statistics are shown in the lower right corner.



5. Verification of CIP at high altitudes


The Current Icing Potential (CIP) is a diagnostic in-flight icing algorithm that is currently classified as operational at altitudes less than 18 kft. Several recent formal and informal studies have been performed to verify CIP using observations of icing from pilot reports (PIREPs) and research aircraft. The results of these evaluations showed the efficiency of CIP in detecting icing conditions at altitudes below 18 kft. In an effort to examine the algorithm’s performance at altitudes above 18 kft, Ben Bernstein initiated a supplemental PIREP collection program in collaboration with Skywest Airlines from 13 – 25 August 2003. The purpose of this effort was to obtain a consistent observational data set that could be compared to and used in conjunction with regularly archived PIREPs over the same time period. Richard Bateman was responsible for decoding the Skywest PIREPs into a form that could be easily analyzed. These reports proved extremely valuable in assessing the accuracy of the regularly archived PIREPs, which have less certainty, especially with respect to the time and location of the report. An evaluation of CIP above 18 kft for the 13 – 25 August 2003 period was performed by Mike Chapman and Cory Wolff.

M. Chapman also performed an evaluation of CIP at high altitudes for the period 01 January – 31 March 2003 using regular PIREPs as observations in order to evaluate CIP’s performance over several winter months. The study was designed to assess the overall performance of CIP during two different seasons at altitudes greater than 18 kft, and to examine in greater depth some specific cases that had varied verification results.

Verification was accomplished by comparing the icing potential field to PIREPs of positive and negative icing at altitudes greater than 18,000 feet. The four grid points surrounding the PIREP and at 1-kft flight levels above and below it were examined. The methods used in the evaluation of CIP were based on standard forecast verification techniques that treat icing forecasts and observations as Yes/No values. Icing diagnoses produced by CIP can be converted into a set of Yes/No values by applying a variety of thresholds. PODy and PODn are the primary verification statistics included in this evaluation. They are estimates of the proportions of Yes and No observations that are correctly diagnosed. Together, PODy and PODn measure the ability of the forecasts to discriminate between (or correctly categorize) Yes and No icing observations. PODy and PODn can be combined into an overall measure of this discrimination capability, the True Skill Statistic (TSS), which is also known as the Hanssen-Kuipers discrimination statistic. The relationship between PODy and 1-PODn for different thresholds is the basis for the verification approach known as Signal Detection Theory (SDT). This relationship can be represented for a given algorithm by the curve joining the (1-PODn, PODy) points for different thresholds. The resulting curve is known as the Relative Operating Characteristic (ROC) curve in SDT.
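The construction of a ROC curve from thresholded diagnoses can be sketched as follows. This is a minimal Python illustration, not the code used in the evaluation: the function names are hypothetical, a diagnosis is assumed to count as Yes when its value is at or above the threshold, and the area under the curve is approximated with the trapezoidal rule after adding the (0, 0) and (1, 1) end points.

```python
from typing import List, Sequence, Tuple


def pod_pair(scores: Sequence[float], observed: Sequence[bool],
             threshold: float) -> Tuple[float, float]:
    """Return (PODy, PODn) when scores at or above threshold are treated as Yes.

    Assumes the observations contain at least one Yes and one No.
    """
    n_yes = sum(1 for o in observed if o)
    n_no = len(observed) - n_yes
    hits = sum(1 for s, o in zip(scores, observed) if o and s >= threshold)
    correct_nulls = sum(1 for s, o in zip(scores, observed)
                        if not o and s < threshold)
    return hits / n_yes, correct_nulls / n_no


def roc_points(scores: Sequence[float], observed: Sequence[bool],
               thresholds: Sequence[float]) -> List[Tuple[float, float]]:
    """Return sorted (1 - PODn, PODy) points, including the two end points."""
    points = {(0.0, 0.0), (1.0, 1.0)}
    for threshold in thresholds:
        pod_y, pod_n = pod_pair(scores, observed, threshold)
        points.add((1.0 - pod_n, pod_y))
    return sorted(points)


def area_under_curve(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under the curve joining consecutive ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

For any single threshold, the TSS follows directly as PODy + PODn - 1. An area of 0.5 corresponds to the diagonal (no discrimination), and an area of 1.0 to perfect discrimination between Yes and No observations.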

The statistical verification results indicate that the CIP performed well at levels above 18 kft during the 13 – 25 August 2003 time period, based on both the standard PIREPs and the Skywest Airlines supplemental PIREPs. For example, when using a relatively high threshold of 0.60, CIP captured 78% of the YES observations and 71% of the NO observations. Figure M-4 is a ROC curve comparing the August 2003 study to the Winter 2003 study with both curves showing a verification of CIP above 18 kft. The plot for August has an area under the curve of 0.76, which shows that CIP has some skill at discriminating between YES and NO observations.

The verification results were less impressive for CIP during the 01 January – 31 March 2003 time period. When compared to CIP for all levels during this period (Figure M-5), the results for high altitudes showed notably less skill, with an area under the curve of 0.59 versus 0.76 for CIP including all levels. A possible reason for this poor result is that the interest maps included in CIP for this study only provide a possibility for icing at temperatures down to -22°C. Another possibility is the lack of reports at the higher altitudes (>24 kft), where only 15% of the observations for moderate-or-greater (MOG) icing were associated with the 18 – 30 kft layer.

In the future, a similar study will be conducted for the newest version of CIP, which includes the new interest maps and uses the 20-km RUC. Because the same sets of observations can be used, the new analyses will provide a valuable comparison to this study.

Figure M-4. ROC diagram for CIP for two different time periods (13 – 25 Aug 2003 and 01 Jan – 31 Mar 2003) for altitudes > 18 kft.

Figure M-5. ROC diagram for CIP (01 Jan – 31 Mar 2003) for all levels (CIP-Winter) and levels greater than 18,000 feet (CIP-WinterHigh).



6. Verification of oceanic weather cloud-top height predictions

One challenge in evaluating oceanic weather diagnoses and forecasts for aviation weather elements is finding independent datasets for use in the verification analyses. The CTOP product, developed by the Oceanic Weather Product Development Team (OWPDT) of the FAA/AWRP, uses a combination of infrared (IR) reflectance values from GOES satellites with temperature and pressure profiles from the Global Forecast System (GFS). The focus of this evaluation is to ensure that the CTOP is consistent with other standard estimates of cloud-top height.

Because cloud-top heights are not observed directly, matching observations for verification must be inferred from data sources not used in the CTOP algorithm. Agnes Takacs et al. (2004) summarized the global observational datasets available for verifying a variety of oceanic weather forecasts. Because of its availability, the GOES sounder-based cloud-top pressure (CTP) product produced by NESDIS was chosen as the primary dataset for this evaluation. In addition, radiosonde and radar observations available over the CONUS, coastal areas and islands were used to derive cloud-top height estimates.

After taking radiosonde drift into consideration, several methods of determining cloud-top heights were used, such as the Wang and Rossow method (1995; RCTP) based on changes in relative humidity. Radar echo-top heights (ET) are another observational dataset used in this evaluation. A literature survey by A. Takacs found that RCTP and ET observations of cloud-top height are expected to provide at least the lower bounds on the estimated CTOP values, while pilot report observations of cloud tops, also considered over the CONUS, are likely to provide an upper bound. Additionally, the equilibrium level and the equilibrium level using virtual temperature from radiosondes were used as estimates for cloud-top heights.

Robert Hueftle matched the RCTP, ET, National Environmental Satellite, Data, and Information Service (NESDIS) CTP, and CTOP values collected during Spring 2004 by temporal and spatial proximity, using a point-to-point approach. This matching was done over the Pacific, the Gulf of Mexico, Mexico, Hawaii, and the Caribbean, and the results were compiled into an extensive database for comparing cloud-top heights between datasets. E. Gilleland and T. Fowler began initial statistical investigations into the relationships between cloud-top height estimates using this dataset. Additional data were also collected during August/September over CONUS as well as the aforementioned domains. A. Takacs performed qualitative comparisons using maps to compare and contrast datasets over individual cases. One such example, presented in Figure M-6 (a, b and c), shows the radar echo top, the CTOP product, and the NESDIS CTP products, respectively, for the case of Hurricane Frances (3 September 2004). Using the rich datasets available from these varied observation types, a variety of verification statistics are being computed for comparison of the CTOP values with each type of observation. These verification statistics essentially represent a robust measure of the association between the CTOP and the other types of cloud-top measures.
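A point-to-point match by temporal and spatial proximity can be sketched in Python as below. The record layout, function names, and tolerance parameters are hypothetical; the actual matching criteria used to build the database are not stated here. Distances are computed with the haversine formula on a spherical Earth of radius 6371 km.

```python
import math
from typing import NamedTuple, Optional, Sequence


class Estimate(NamedTuple):
    """One cloud-top estimate (hypothetical record layout)."""
    time_s: float      # seconds since an arbitrary reference time
    lat: float         # degrees north
    lon: float         # degrees east
    height_kft: float  # estimated cloud-top height


def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine distance in km between two points on a spherical Earth."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2.0 * earth_radius_km * math.asin(math.sqrt(a))


def match_nearest(target: Estimate, candidates: Sequence[Estimate],
                  max_km: float, max_seconds: float) -> Optional[Estimate]:
    """Return the spatially closest candidate within both tolerances, or None."""
    best, best_km = None, max_km
    for cand in candidates:
        if abs(cand.time_s - target.time_s) > max_seconds:
            continue  # outside the time window
        dist = great_circle_km(target.lat, target.lon, cand.lat, cand.lon)
        if dist <= best_km:
            best, best_km = cand, dist
    return best
```

Applying `match_nearest` to each CTOP value against each observation type would yield the paired samples from which association statistics can be computed.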

Verification of oceanic weather cloud-top height diagnoses


(a) Echo tops detected by the Miami WSR-88D radar (08:56:25 UTC on 9/3/2004).
(b) CTOP product over the Gulf of Mexico – South America region with Hurricane Frances (08:56:25 UTC on 9/3/2004).


(c) NESDIS Cloud-Top Pressure (CTP) products for smaller geographical regions of one of the four GOES sounder sectors with Hurricane Frances (08:56:25 UTC on 9/3/2004).

Figure M-6 (a, b and c). Examples of the verification datasets. Red circle shows a 234 km radius around the Miami radar station where radiosonde observations are also available.

---

References

Takacs, A., B. Brown, and J. Mahoney, 2004: Verification of oceanic weather diagnoses and forecasts for aviation weather elements. Preprints, 84th Annual Meeting of the American Meteorological Society. Seattle, WA 11-15 January, American Meteorological Society (Boston).

Wang, J., and W. B. Rossow, 1995: Determination of cloud vertical structure from upper-air observations. Journal of Applied Meteorology, 34, 2243-2258.
