RAP Achievements
M. Statistics Applications and Forecast Verification
In addition to these algorithm evaluations, the VG develops new statistical techniques and verification approaches that are appropriate for aviation weather forecasts and observations. In particular, because aviation weather forecasts are non-standard and because the phenomena of concern (e.g., icing, turbulence) often are difficult to observe, it frequently is necessary to identify alternative observations and to develop new approaches. Moreover, in many cases standard verification approaches are not able to provide the diagnostic information needed to make optimal use of the forecasts or to improve them. In addition, finer-scale spatial forecasts can be strongly penalized for relatively small temporal or spatial errors. New verification approaches are needed to cope with these various issues. These studies represent a small sample
of the work that has been undertaken in this area in the last year.
Research also continued in many other areas, including development of
approaches to evaluate ceiling and visibility forecasts, the use of
new observations to evaluate turbulence forecasts, applications of extreme
value theory, and the development of methods to use aircraft situation
information for verification.

Matt Pocernich of the RAP VG developed a new forecast verification software package using the R programming language. R is an open source statistical language that has been widely embraced by the statistical community. Statisticians around the world have contributed more than 350 packages on a wide variety of topics to the R library for general use. The verification package has been added to this collection of packages and is available to the general public via the R-project website. Providing software directly to the general public offers the opportunity to reach a larger audience, thereby getting more immediate feedback on algorithms and fulfilling NCAR's mission to foster the transfer of knowledge and technology. This method of technology transfer follows a precedent established by NCAR's Geophysical Statistical Initiative, which has transferred other software packages to the R user community.

Functions in the package perform both routine verification tasks and more innovative and experimental methods. Routine tasks include receiver operating characteristic (ROC) plots, attributes diagrams and reliability plots. Figure M-1 provides an example of a ROC plot produced by the package. In the R environment, additional information about the verification process is also available, such as the p-values for the curves and data values.

Other functions included in the package are more research-oriented in nature, and include methods developed outside of NCAR. For example, during a visit in Spring 2004, Barbara Casati (Recherche en Prévision Numérique, Canada) contributed a spatial scale-intensity skill score function. This function is used to verify spatial forecasts, taking into account the effects of scale on a skill score. William Briggs (GIM, Weill Cornell Medical College) has contributed an approach that accounts for measurement error in calculating a skill score.
Tilmann Gneiting (University of Washington) contributed a continuous ranked probability score function for evaluating probabilistic forecasts.

While the verification package was primarily developed to study meteorological forecasts, it has been created in a generic way so that it can be useful across many disciplines. The functions and plots are designed to provide verification analyses for a variety of types of forecasts and observations, namely binary, continuous, probabilistic and distribution.

An overview of this package as well as others developed at NCAR was presented by Eric Gilleland and M. Pocernich at useR! 2004, the first R users' conference, which took place during Summer 2004. This conference was attended by more than 200 statisticians and statistical programmers from around the world. A more detailed presentation and demonstration of the package was given at the International Verification Methods Workshop in Montreal. During this presentation, weather-oriented examples were given. Workshop attendees provided many useful suggestions as well as offers to submit additional functions. Internally, members of RAP's verification group have conducted a series of informal seminars intended to encourage scientists to use R.
Figure M-1. Typical Receiver Operating Characteristic Plot produced by the *roc.plot* function in the R-verification package.
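As an illustration of the kind of probabilistic score mentioned above, the following minimal Python sketch evaluates the continuous ranked probability score (CRPS) for a forecast expressed as a normal distribution. It is not the code contributed to the R package; the Gaussian form of the forecast, the function name `crps_gaussian` and the example numbers are assumptions made for illustration only. Lower scores are better.

```python
import math


def crps_gaussian(mu, sigma, obs):
    """CRPS of a normal forecast N(mu, sigma^2) against one observation.

    Uses the standard closed form
        CRPS = sigma * ( z*(2*Phi(z) - 1) + 2*phi(z) - 1/sqrt(pi) ),
    where z = (obs - mu) / sigma, and phi and Phi are the standard
    normal density and cumulative distribution function.
    """
    z = (obs - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))


# Hypothetical example: two forecasts with the same mean but different
# spread, verified against the same (made-up) observation.
observation = 10.5
sharp_score = crps_gaussian(10.0, 1.0, observation)
broad_score = crps_gaussian(10.0, 4.0, observation)

print(f"sharp forecast CRPS: {sharp_score:.3f}")
print(f"broad forecast CRPS: {broad_score:.3f}")
```

In this example the sharper forecast receives the lower (better) score because the observation falls close to its mean; an observation far from the mean would penalize the sharp forecast more heavily.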
Ideally, the observations used for verification of forecast products are collected independently from the observations used to create the forecast. Unfortunately, independent collection is not possible in some cases. In this situation, cross-validation techniques can be used. Such techniques are commonly used in statistics, especially in the development of statistical models. They have been adapted for use in forecast verification.

Cross-validation takes a single set of collected data and splits it into two sets. The training set is used to create the forecast while the testing set is used to verify the forecast. For large amounts of data, splitting the data once into testing and training sets may provide enough cases both for forecast development and for verification. However, for smaller sets of data, a single split leaves insufficient cases for training, testing or both. In this case, cross-validation can be repeated several times, each time leaving the majority of the data for forecasting and a small number of cases for verification. Using this method, a quality forecast is created and a large enough verification set is available.

For cross-validation to work ideally, several assumptions must be met. The testing and training sets must be independent of each other. Observations in the testing and training sets must come from the same distribution. The testing set observations must be both accurate and unbiased. Most weather data violate some or all of these criteria.

Tressa Fowler and others in the RAP VG applied cross-validation methods for verification of the Juneau Airport Wind System (JAWS) prototype. All observations were collected by research aircraft, and these observations were used in both the system creation and verification. Measurements taken during a single flight are correlated in time. A single instrument took all measurements, so a systematic bias in the measurements is possible.
The measurements were divided up by location and wind regime, into twelve hazard areas and three wind regimes. Forecasts were created separately for each combination of hazard area and wind regime, so verification of the forecasts was also done separately. Some areas had a large number of cases while others had a small number of cases. Varying cross-validation strategies were used in each area, depending on how many cases were available. For example, one area had 21 cases. Cross-validation was run for this area 21 times, each time leaving a single point for verification. Thus, each forecast was created using 20 cases, but 21 cases were available for verification.

Table N-1 shows verification statistics for both the cross-validated version (JAWS A) and the overall version (JAWS B). The statistics include the probability of detecting a turbulence event (PODy), the probability of detecting a non-event (PODn), the false alarm ratio (FAR), and the true skill statistic (TSS). Verification statistics for the JAWS system using cross-validation were somewhat worse than the statistics calculated when cross-validation was not used. This is expected, since cross-validation should give a more independent and realistic estimate of the forecast capability.

Cross-validation methods have also been adapted for use in verification of the National Ceiling and Visibility (NCV) analysis product. The verification will be completed in the coming fiscal year.

Cross-validation methods were used for forecast verification when better methods were not possible. By using cross-validation, a greater degree of independence between the forecasts and observations was achieved. Thus, estimates of forecast performance are more accurate. These methods will be applied again in the future when the nature of the observations available for verification makes cross-validation a beneficial option.

Table N-1
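The leave-one-out scheme described above for the area with 21 cases can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the JAWS procedure: the 21 data values are made up, and the "forecast" is a hypothetical single-predictor threshold rule (`fit_threshold`) that stands in for the real system. The sketch computes PODy, PODn, FAR and TSS both with cross-validation and with all cases used for training and verification.

```python
def contingency(forecasts, observations):
    """Return (hits, misses, false_alarms, correct_negatives)."""
    hits = misses = false_alarms = correct_negs = 0
    for f, o in zip(forecasts, observations):
        if f and o:
            hits += 1
        elif o:
            misses += 1
        elif f:
            false_alarms += 1
        else:
            correct_negs += 1
    return hits, misses, false_alarms, correct_negs


def scores(hits, misses, false_alarms, correct_negs):
    """PODy, PODn, FAR and TSS from contingency counts (NaN if undefined)."""
    nan = float("nan")
    pody = hits / (hits + misses) if hits + misses else nan
    podn = (correct_negs / (correct_negs + false_alarms)
            if correct_negs + false_alarms else nan)
    far = false_alarms / (hits + false_alarms) if hits + false_alarms else nan
    return {"PODy": pody, "PODn": podn, "FAR": far, "TSS": pody + podn - 1.0}


def fit_threshold(predictor, observed):
    """Hypothetical 'forecast system': choose the predictor threshold
    that gives the best TSS on the training cases."""
    best_t, best_tss = max(predictor), float("-inf")
    for t in sorted(set(predictor)):
        tss = scores(*contingency([x >= t for x in predictor], observed))["TSS"]
        if tss > best_tss:
            best_t, best_tss = t, tss
    return best_t


def leave_one_out(predictor, observed):
    """Forecast each case from a rule trained on all the other cases."""
    forecasts = []
    for i in range(len(predictor)):
        train_x = predictor[:i] + predictor[i + 1:]
        train_y = observed[:i] + observed[i + 1:]
        forecasts.append(predictor[i] >= fit_threshold(train_x, train_y))
    return forecasts


# Made-up data for one hazard area: 21 cases of a predictor (e.g. a wind
# speed) and whether a turbulence event was observed.
predictor = [3.1, 4.0, 4.4, 5.2, 5.9, 6.3, 6.8, 7.1, 7.7, 8.0, 8.4,
             9.0, 9.3, 9.9, 10.4, 11.0, 11.6, 12.1, 12.8, 13.5, 14.2]
observed = [False, False, False, False, False, False, True, False, False,
            True, False, True, True, False, True, True, True, True, True,
            True, True]

# Cross-validated verification: 21 forecasts, each built from 20 cases.
cv_forecasts = leave_one_out(predictor, observed)
cv_scores = scores(*contingency(cv_forecasts, observed))

# Verification without cross-validation: train and verify on all 21 cases.
full_threshold = fit_threshold(predictor, observed)
full_scores = scores(*contingency([x >= full_threshold for x in predictor],
                                  observed))

print("cross-validated:", cv_scores)
print("all cases:      ", full_scores)
```

Because the rule verified without cross-validation has already seen every case it is scored on, its statistics tend to be optimistic, which is the effect described in the text.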
In recent years, the need for alternative approaches to evaluate spatial forecasts of precipitation and convection has become apparent. This need is partly associated with the desire for verification measures that are more clearly tied to the operational usefulness of the forecasts and that can provide diagnostic information about the quality of the forecasts, for feedback to forecast developers, forecasters, and forecast users. In addition, as forecasts have moved to finer grid resolution, it has become clear that standard verification approaches are often not able to provide meaningful measures of their capabilities.

In response to these needs, in a cross-divisional (MMM and RAP) project led by Barbara Brown and Chris Davis, the VG has developed an object-oriented verification approach that is applicable to human-generated (generally polygonal) and automated (generally gridded) forecasts. With this approach, forecast and observed precipitation/convective areas are reduced to regions of interest that can be compared to one another in a meaningful way. In general, the object-oriented approach requires several steps, beginning with the identification of objects in the observation and forecast fields and including the merging and matching of those objects. During the past year, E. Gilleland and Randy Bullock developed and tested two approaches for the matching and merging steps, one based on a binary image-matching technique and another based on application of fuzzy logic.

The object-oriented approach using the fuzzy logic matching approach was applied to gridded precipitation forecasts from a 22-km version of the Weather Research and Forecasting (WRF) model for two summers (2001 and 2002), and precipitation observations based on the Stage IV multi-sensor precipitation dataset. Figure M-2 shows an example of one of the cases included in this evaluation. For this example, four composite observed and forecast shapes were identified. Comparisons of the matched forecast and observed shapes indicated that the forecast objects were generally located too far to the north and west; the median forecast precipitation values were somewhat too large, whereas the 90th percentile precipitation values were somewhat too small, indicating the model forecast did not represent the observed extreme precipitation very well. These results clearly provide much more information about the forecast performance than can be obtained using standard verification statistics such as the probability of detection and the critical success index. B. Brown and C. Davis have summarized the statistics for all cases, and found similar characteristics of the forecast performance.

A similar approach was also developed for application to human-generated forecasts by John Halley Gotway and R. Bullock. An example of an application to the convective significant meteorological advisories (C-SIGMETs; operational convective advisories issued by the National Weather Service) is shown in Figure M-3.
In this case the optimal locations of the forecasts were identified using the object-oriented techniques; the resulting improvement in verification scores represents the performance of a practically perfect forecast. In summary, these new techniques show promise for providing useful diagnostic information regarding the performance of spatial forecasts. While further development and testing are still needed, the progress this year has led to a methodology that can begin providing feedback on the capabilities of a variety of types of forecasts.
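To make the object-oriented idea concrete, the following Python sketch identifies objects in a gridded field by smoothing and thresholding (in the spirit of the convolution-threshold approach named in the Figure M-2 caption) and then pairs forecast and observed objects. It is a simplified illustration, not the VG implementation: the box filter, the 4-connected grouping, and especially the nearest-centroid matching (used here in place of the binary image-matching and fuzzy logic techniques described above) are assumptions, as are all function names and the toy 8 x 8 fields.

```python
import math


def smooth(field, radius=1):
    """Box-filter convolution: mean over a (2r+1) x (2r+1) window,
    clipped at the grid edges."""
    ny, nx = len(field), len(field[0])
    out = [[0.0] * nx for _ in range(ny)]
    for j in range(ny):
        for i in range(nx):
            vals = [field[jj][ii]
                    for jj in range(max(0, j - radius), min(ny, j + radius + 1))
                    for ii in range(max(0, i - radius), min(nx, i + radius + 1))]
            out[j][i] = sum(vals) / len(vals)
    return out


def find_objects(field, threshold, radius=1):
    """Return objects as lists of (row, col) cells where the smoothed
    field is >= threshold, grouped by 4-connectivity (flood fill)."""
    sm = smooth(field, radius)
    ny, nx = len(sm), len(sm[0])
    seen = [[False] * nx for _ in range(ny)]
    objects = []
    for j in range(ny):
        for i in range(nx):
            if seen[j][i] or sm[j][i] < threshold:
                continue
            seen[j][i] = True
            stack, cells = [(j, i)], []
            while stack:
                y, x = stack.pop()
                cells.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    yy, xx = y + dy, x + dx
                    if (0 <= yy < ny and 0 <= xx < nx
                            and not seen[yy][xx] and sm[yy][xx] >= threshold):
                        seen[yy][xx] = True
                        stack.append((yy, xx))
            objects.append(cells)
    return objects


def centroid(cells):
    """Mean (row, col) position of an object's cells."""
    n = len(cells)
    return (sum(y for y, _ in cells) / n, sum(x for _, x in cells) / n)


def match_objects(fcst_objs, obs_objs, max_dist):
    """Pair each forecast object with the nearest observed object whose
    centroid lies within max_dist grid lengths. Returns tuples of
    (forecast index, observed index, row offset, column offset), with
    offsets defined as forecast centroid minus observed centroid."""
    pairs = []
    for fi, f_cells in enumerate(fcst_objs):
        fy, fx = centroid(f_cells)
        best = None
        for oi, o_cells in enumerate(obs_objs):
            oy, ox = centroid(o_cells)
            dist = math.hypot(fy - oy, fx - ox)
            if dist <= max_dist and (best is None or dist < best[0]):
                best = (dist, oi, fy - oy, fx - ox)
        if best is not None:
            pairs.append((fi, best[1], best[2], best[3]))
    return pairs


def make_field(ny, nx, rows, cols, value):
    """Toy field: 'value' inside the given row/column ranges, zero elsewhere."""
    return [[value if (j in rows and i in cols) else 0.0 for i in range(nx)]
            for j in range(ny)]


# Toy example: the forecast blob is displaced two rows and two columns
# toward lower indices relative to the observed blob.
forecast = make_field(8, 8, range(2, 4), range(2, 4), 10.0)
observed = make_field(8, 8, range(4, 6), range(4, 6), 10.0)

fcst_objs = find_objects(forecast, threshold=4.0)
obs_objs = find_objects(observed, threshold=4.0)
pairs = match_objects(fcst_objs, obs_objs, max_dist=4.0)
print(pairs)
```

If row 0 is taken as the northern edge and column 0 as the western edge of the grid (an assumption of this toy setup), the negative offsets correspond to a forecast object located too far to the north and west, which is the kind of diagnostic information the object-oriented approach is meant to provide.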
Figure M-2. Example case for 12-h WRF precipitation forecasts valid on 2 July 2001 at 0000 UTC: (a) precipitation objects defined by convolution-threshold approach; (b) merged and matched areas. Merged areas are identified by the red band joining them; matched WRF and Stage 4 objects have the same letter identifier. Note there are two small Stage 4 objects that were not matched to a WRF object.
Figure M-3. Example of a C-SIGMET object optimization for 2-h C-SIGMETs. Green polygons represent the original forecasts; blue polygons represent the optimized forecasts. Before and after verification statistics are shown in the lower right corner.
M. Chapman also performed an evaluation of CIP at high altitudes for the period 01 January to 31 March 2003 using regular PIREPs as observations in order to evaluate CIP's performance over several winter months. The study was designed to assess the overall performance of CIP during two different seasons at altitudes greater than 18 kft, and to examine in greater depth some specific cases that had varied verification results. Verification was accomplished by comparing the icing potential field to PIREPs of positive and negative icing at altitudes greater than 18,000 feet. The four grid points surrounding the PIREP and at 1-kft flight levels above and below it were examined.

The methods used in the evaluation of CIP were based on standard techniques of forecast verification that treat icing forecasts and observations as Yes/No values. Icing diagnoses produced by CIP can be converted into a set of Yes/No values by applying a variety of thresholds. PODy and PODn are the primary verification statistics that are included in this evaluation. They are estimates of the proportions of Yes and No observations that are correctly diagnosed. Together, PODy and PODn measure the ability of the forecasts to discriminate between (or correctly categorize) Yes and No icing observations. PODy and PODn can be combined into an overall measure of this discrimination capability, the True Skill Statistic (TSS), which is also known as the Hanssen-Kuipers discrimination statistic.

The relationship between PODy and 1-PODn for different thresholds is the basis for the verification approach known as Signal Detection Theory (SDT). This relationship can be represented for a given algorithm by the curve joining the (1-PODn, PODy) points for different thresholds. The resulting curve is known as the Relative Operating Characteristic (ROC) curve in SDT.
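The threshold-based statistics described above can be sketched in Python as follows. This is a minimal illustration, not the evaluation code used in the study: the icing potential values, the Yes/No reports, the thresholds and the function names are all made up. The sketch converts a continuous diagnosis into Yes/No values at each threshold, computes PODy, PODn and TSS, builds the (1-PODn, PODy) points of the ROC curve, and estimates the area under the curve with the trapezoidal rule.

```python
def pod_scores(potential, icing_observed, threshold):
    """PODy, PODn and TSS when values >= threshold are treated as Yes."""
    yes_forecast = [p >= threshold for p in potential]
    n_yes = sum(1 for o in icing_observed if o)
    n_no = len(icing_observed) - n_yes
    hits = sum(1 for f, o in zip(yes_forecast, icing_observed) if f and o)
    correct_no = sum(1 for f, o in zip(yes_forecast, icing_observed)
                     if not f and not o)
    pody = hits / n_yes        # proportion of Yes observations diagnosed Yes
    podn = correct_no / n_no   # proportion of No observations diagnosed No
    return pody, podn, pody + podn - 1.0


def roc_points(potential, icing_observed, thresholds):
    """Sorted (1-PODn, PODy) points, including the (0,0) and (1,1) end points."""
    points = [(0.0, 0.0), (1.0, 1.0)]
    for t in thresholds:
        pody, podn, _ = pod_scores(potential, icing_observed, t)
        points.append((1.0 - podn, pody))
    return sorted(points)


def roc_area(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up example: diagnosed icing potential at ten report locations and
# whether each report indicated icing (True) or no icing (False).
potential = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.55, 0.65, 0.80, 0.90]
icing_observed = [False, False, False, True, False,
                  False, True, True, False, True]
thresholds = [0.15, 0.25, 0.40, 0.60, 0.75]

for t in thresholds:
    pody, podn, tss = pod_scores(potential, icing_observed, t)
    print(f"threshold {t:.2f}: PODy={pody:.2f} PODn={podn:.2f} TSS={tss:.2f}")

points = roc_points(potential, icing_observed, thresholds)
area = roc_area(points)
print(f"area under ROC curve: {area:.3f}")
```

An area of 0.5 corresponds to the diagonal line of a diagnosis with no ability to discriminate between Yes and No observations, and an area of 1.0 corresponds to perfect discrimination.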
The statistical verification results indicate that the CIP performed well at levels above 18 kft during the 13-25 August 2003 time period, based on both the standard PIREPs and the Skywest Airlines supplemental PIREPs. For example, when using a relatively high threshold of 0.60, CIP captured 78% of the YES observations and 71% of the NO observations. Figure M-4 is a ROC curve comparing the August 2003 study to the Winter 2003 study with both curves showing a verification of CIP above 18 kft. The plot for August has an area under the curve of 0.76, which shows that CIP has some skill at discriminating between YES and NO observations.

The verification results were less impressive for CIP during the 01 January to 31 March 2003 time period. When compared to CIP for all levels during this period (Figure M-5), the results for high altitudes showed notably less skill with an area under the curve of 0.59 versus 0.76 for CIP including all levels. A possible reason for this poor result could be that the interest maps included in CIP for this study only provide a possibility for icing at temperatures down to -22°C. Another possibility is the lack of reports at the higher altitudes (> 24 kft), where only 15% of the observations for MOG icing were associated with the 18-30 kft layer.

In the future, a similar study will be applied to the newest version of CIP, which includes the new interest maps and utilizes the 20-km RUC. Because the same sets of observations can be used, the new analyses will provide a valuable comparison to this study.
Figure M-4. ROC diagram for CIP for two different time periods (13-25 Aug 2003 and 01 Jan - 31 Mar 2003) for altitudes > 18 kft.
Figure M-5. ROC diagram for CIP (01 Jan - 31 Mar 2003) for all levels (CIP-Winter) and levels greater than 18,000 feet (CIP-WinterHigh).
Verification of oceanic weather cloud-top height diagnoses
One challenge in evaluating oceanic weather diagnoses and forecasts for aviation weather elements is finding independent datasets for use in the verification analyses. The CTOP product, developed by the Oceanic Weather Product Development Team (OWPDT) of the FAA/AWRP, uses a combination of infrared (IR) reflectance values from GOES satellites with temperature and pressure profiles from the Global Forecast System (GFS). The focus of this evaluation is to ensure that the CTOP is consistent with other standard estimates of cloud-top height.
Figure M-6 (a, b and c). Examples of the verification datasets. (a) Echo tops detected by the Miami WSR-88D radar (08:56:25 UTC on 9/3/2004). (c) NESDIS Cloud-Top Height Pressure (CTP) products for smaller geographical regions of one of the four GOES sounder sectors with Hurricane Frances (08:56:25 UTC on 9/3/2004). The red circle shows a 234 km radius around the Miami radar station, where radiosonde observations are also available.
References
Takacs, A., B. Brown, and J. Mahoney, 2004: Verification of oceanic weather diagnoses and forecasts for aviation weather elements. Preprints, 84th Annual Meeting of the American Meteorological Society, Seattle, WA, 11-15 January, American Meteorological Society (Boston).
Wang, J., and W. B. Rossow, 1995: Determination of cloud vertical structure from upper-air observations. Journal of Applied Meteorology, 34, 2243-2258.