## Abstract

The aim of this study was to evaluate the predictive fit of probability distributions to annual maximum flood data, and in particular to evaluate (1) which combination of distribution and estimation method gives the best fit and (2) whether the answer to (1) depends on record length. These aims were achieved by assessing the sensitivity to record length of the predictive performance of several probability distributions. A bootstrapping approach was used by resampling (with replacement) record lengths of 30 to 90 years (50 resamples for each record length) from the original record and fitting distributions to these subsamples. Subsequently, the fits were evaluated according to several goodness-of-fit measures and to the variability of the predicted flood quantiles. Our initial hypothesis that shorter records favor two-parameter distributions was not clearly supported. The ordinary moments method was the most stable while providing equivalent goodness-of-fit.

- bootstrapping
- design floods
- flood frequency analysis
- probability distributions
- reliability
- stability

## INTRODUCTION

The motivation for this study is the need to revise guidelines for design flood estimation in Norway. According to Norwegian dam safety regulations (Lovdata 2010), dam safety should be evaluated for floods with 500 or 1,000 years return periods, depending on an individual dam safety class. According to building regulations (TEK10 2016), buildings and infrastructure should resist or be protected from floods with 20, 200, or 1,000 years return periods, depending on the consequences of flooding. Flood inundation maps used for land use planning are also based on design flood estimates.

Existing guidelines are given in Midttømme *et al.* (2011) and Castellarin *et al.* (2012), and summarized in Table 1. The approach is based on using annual maximum floods, and the recommendations depend on the length of the local data record. A minimum of 30 years of local observations is required for local flood frequency analysis and at least 50 years of data should be available to use three-parameter distributions. The Gumbel (two parameters) and generalized extreme value (GEV) (three parameters) are the preferred distributions. More recently, Glad *et al.* (2014) found the generalized logistic (GL) to be the preferred distribution for annual maximum floods in small catchments.

Other guidelines for flood frequency estimation include the USA (Stedinger & Griffis 2008, 2011), Australia (Ball *et al.* 2016), and Europe (Castellarin *et al.* 2012). The four distributions that are most commonly used for annual maximum floods are the GEV distribution (Australia, Austria, Cyprus, Germany, France, Italy, Lithuania, Slovakia, Spain) with the Gumbel distribution (Finland, Greece) as a special case, the GL (UK) and the log-Pearson III (USA, Australia, Lithuania, Poland, Slovenia). Two-component Gumbel distributions are recommended in Italy and Spain in order to account for different flood generating processes.

Four methods are commonly used to estimate distribution parameters: ordinary moments, linear moments, maximum likelihood (ML), and Bayesian. The method of linear moments has been recommended for its robustness with small sample sizes (Hosking 1990). In recent years, Bayesian flood frequency estimation has gained increased attention in the research community (e.g., Coles & Tawn 1996; Gaál *et al.* 2010; Gaume *et al.* 2010; Renard *et al.* 2013b), and is recommended in the operational guidelines in Australia (see Chapter 2.6.3 in Ball *et al.* 2016). The benefit of the Bayesian method is the flexibility in model formulation, the possibility to include prior and/or regional knowledge in the local estimation, and the possibility to account for errors in rating curves (Ball *et al.* 2016).

For many cases the streamflow record is either non-existing or much shorter than the target return period. In order to predict flood quantiles in ungauged catchments or to reduce the estimation uncertainty for high flood quantiles, three different strategies can be followed (Merz & Blöschl 2008): (i) use flood data from several locations within a region (e.g., Dalrymple 1960); (ii) use historical, (e.g., Benson 1950) and/or paleo-hydrological information (e.g., Benito & O'Connor 2013); or (iii) use causal information, i.e., by combining precipitation statistics with precipitation–runoff models (e.g., Lawrence *et al.* 2014).

The recommendations provided in the national guidelines should preferably be based on systematic evaluations. A recent example is provided in Kochanek *et al.* (2014) where local, regional, and local-regional flood frequency analysis, as well as local and regional applications of a simulation approach are systematically compared resulting in recommendations. Renard *et al.* (2013a) provide a short review of evaluation frameworks and distinguish between simulation-based and data-based frameworks. In the simulation-based approach, the true distribution is known, and Monte-Carlo-generated samples from the true distribution are used to assess the performance of different distributions and/or parameter estimation methods (e.g., Hosking *et al.* 1985). It is especially useful for assessing robustness (e.g., Stedinger & Cohn 1986) and evaluating the estimates of standard errors (e.g., Cohn *et al.*, 2001). For data-based approaches, the true distribution is not known, and the aim of the evaluation is to assess if the observations might be realizations of the estimated distribution. Goodness-of-fit tests combined with split-sample or cross-validation are used in order to assess the predictive performance of the fitted distribution. The goodness-of-fit criteria measure the reliability, i.e., how well the model fits to (independent) data. Renard *et al.* (2013a) introduced ‘stability’ as an additional criterion. It measures the sensitivity of the design flood estimates to different subsets of data. Design flood estimates that depend strongly on the underlying data might lead to re-assessment of the design flood. This can, for example, result in large costs for dam owners as the design of dams has to be re-assessed every 20 years. Stability is therefore an important criterion in order to choose between the most reliable models.

The aim of this study is to perform a systematic evaluation of the predictive performance of local flood frequency distributions and estimation methods applied to annual maximum data. The results will later be used as a foundation for recommendations in new guidelines.

In this study, we wanted to answer the following research questions:

Which combination of distribution and estimation method best fits the data?

Does the answer to (1) depend on local data availability?

To answer these questions, we set up a test bench for local flood frequency analysis using data-based evaluation methods inspired by Renard *et al.* (2013a) by using a bootstrapping approach where we systematically evaluated how the predictive performance depends on record length.

## STUDY AREA AND DATA

We used annual maximum floods from 529 streamflow stations of the Norwegian hydrological database ‘Hydra II’. We present here a brief summary of the dataset and associated quality control methods, which are described in detail in Engeland *et al.* (2016). All data influenced by river regulations were removed. In addition, quality controls of the data including quality assessment by the field hydrologist and of the rating curve for high flows, were used to select flood data with a sufficient quality. For all gauging stations, we extracted a set of catchment properties (for details see Engeland *et al.* 2016). Figure 1 shows the histogram for record length, catchment areas, lake percentage, mean annual temperature and precipitation and the rain contribution to floods. Figure 2 presents a map of mean annual precipitation, temperature and floods and the rain contribution to floods. All climatological descriptors are based on the gridded temperature and precipitation data product in SeNorge (www.senorge.no). In this study, we used 280 stations which have at least 30 years of record. Only 103 stations have more than 50 years of data. The catchment area spans between 0.5 and 20,300 km^{2} with 163 km^{2} as the median. The presence of lakes influences flood sizes, and 262 of the catchments have more than 1% of the catchment area covered by lakes. For these catchments the median lake percentage is 6.1. The mean annual precipitation ranges from 408 to 3,137 mm with 1,047 mm as the median. We see a strong west–east gradient with the highest precipitation on the west coast. The mean annual temperature ranges from −3.73 to 7.62 °C with 0.56 °C as the median. The temperatures are influenced by elevation as well as latitude (temperature decrease with elevation and longitude). The relative contribution of rain was estimated by calculating the ratio of accumulated rain and snowmelt in a time window prior to each flood and then averaging these ratios over all floods (for details see Engeland *et al.* 2016). Rainfall processes dominate most coastal catchments and none of the catchments are completely dominated by snowmelt. A majority of stations, i.e., those where contribution from snow melt is important, show a prevalence of floods in spring and very few floods during winter. The catchments dominated by rain floods do not show a clear seasonal pattern by frequently displaying floods in summer and winter. Both the flood records and the catchment properties datasets (catchment area, record length, mean annual runoff and several other catchment descriptors) are available upon request to the authors.

## METHODS

### Distributions

We evaluated five probability distributions (Supplementary materials, Table S1): GEV, Gumbel, Pearson III, gamma, and the GL distribution. The equations for the quantile functions and the probability density functions (pdf) are provided in Supplementary materials, Tables S2 and S3 (Tables S1–S3 are available with the online version of this paper); below we provide the equations for the distribution functions. See also Bezak *et al.* (2014) for a recent overview.

#### GEV distribution

The extreme value theorem is also known as the Fisher–Tippett theorem, which says that the maximum value from a sample of independent and identically distributed (iid) random variables follows the GEV distribution (e.g., (Fisher & Tippett 1928; Embrechts *et al.* 1997):
1where *m* is a location parameter, *α* a scale parameter, and *k* a shape parameter. Defined on the region . The mean exists if *k* > −1.0, and the variance if *k* > −0.5. The shape parameter *k* is important in the GEV distribution as it shapes the tail of the distribution. A negative value indicates a heavy tail, whereas positive values describe a light tail and an upper limit for the variable *x*.

#### Gumbel distribution

The Gumbel distribution is a special case of the GEV distribution (shape parameter *k* = 0) and is written as:
2where *m* is a location parameter and a scale parameter.

This distribution is often recommended for small datasets. Maximum values of random variables, with an exponential like upper tail (e.g. Normal, lognormal, Gamma), will theoretically follow a Gumbel distribution.

#### Generalized logistic

The GL distribution (Hosking & Wallis 1997) is recommended for flood frequency estimation in the UK (Robson & Reed 1999) and was recently recommended for predicting floods in small ungauged catchments in Norway (Glad *et al.* 2014). The distribution is a re-parameterization of the log-logistic distribution (Ahmad *et al.* 1988), and has some similarities to the GEV distribution, as shown in Equation (3):
3where *m* is a location parameter, *α* scale parameter and *k* a shape parameter. As for the GEV distribution, the GL distribution has an upper bound of *k* > 0. This is the case only when the skewness is negative whereas for the GEV distribution, there is also an upper bound for positive skewness, i.e., L-skewness <0.17 (Robson & Reed 1999). Thus, for flood data we could expect the shape parameter to be between −0.5 and 0.2.

#### Gamma distribution

The gamma distribution is a flexible two-parameter distribution often used in environmental sciences:
4Here, *Γ* denotes the complete gamma function and *γ* the lower incomplete gamma function.

#### Pearson III

The Pearson type III distributions are given as:
5where *m* is a location parameter, *α* a scale parameter, and *k* a shape parameter. For *m* = 0, the P3 distribution reduces to the gamma distribution. Applied to log-transformed floods, this distribution is recommended for flood frequency analysis in the USA (Stedinger & Griffis 2008; Dawdy *et al.* 2012) and Australia (Haddad & Rahman 2008). Prior distributions are given in Reis & Stedinger (2005).

### Fitting methods

Four methods for fitting the distributions to observed data were used: ordinary moments, linear moments, ML and Bayesian estimation.

#### Ordinary moments (O-moments)

The method of ordinary moments means that the moments (mean, variance, and skewness) are estimated based on the data, and subsequently, the parameters of the selected distribution are calculated based on a theoretical relationship between the moments and the distribution parameters. Two-parameter distributions need the estimates of mean and standard deviation whereas the three-parameter distributions also require an estimate of the skewness. The specific equations for each distribution used in this study are given in Bezak *et al.* (2014) and are also provided in Supplementary materials, Table S4 (available online).

#### Linear moments (L-moments)

The method of linear moments is a popular method in hydrology since it is a direct analog to the method of moments, easy to apply and the parameter estimates are less sensitive to outliers in the data (Hosking 1990). As for the O-moments, the linear moments are estimated from the data, and subsequently, the parameters of the selected distribution are calculated based on a theoretical relationship between the L-moments and the distribution parameters. The specific equations for each distribution used in this study are given in Hosking (1990), and are also provided in Supplementary materials, Table S5 (available online).

#### Maximum likelihood

The ML method chooses the values of the parameters' estimates that maximize the probability of the data sample. This probability is the product of the probability density function evaluated at all observations (with a common parameter set) and is called the likelihood function *l(θ|x)* of the parameters *θ* given data ** x**. The objective is to maximize this function. The likelihood functions are specified in Bezak

*et al.*(2014). For numerical reasons, the log-likelihood (and not the likelihood) is maximized. For distributions used in flood frequency analysis, numerical optimization is needed for estimating the parameters. For small samples, the ML estimator is known to be more biased and to give larger estimation uncertainty compared to the two moment estimators for the GEV distribution (Hosking

*et al.*1985; Madsen

*et al.*1997). It might also provide absurd estimates of the shape parameter (Martins & Stedinger 2000). Those issues are most conveniently minimized by adding a prior likelihood for the shape parameter (Coles & Dixon 1999; Martins & Stedinger 2000). An alternative estimation approach is suggested in Laio (2004). Finally, the shape parameter of the Pearson type III distribution is challenging to estimate using the ML approach (Arora & Singh 1989). An estimation strategy is suggested in Laio (2004).

#### Bayesian estimation

Bayes theorem combines the knowledge brought by the prior distribution and the data (through the likelihood) into the posterior distribution of parameters, whose pdf is noted *p*:
6

The Bayesian method might include prior knowledge that could be expert knowledge or regional information (e.g., Kuczera 1982; Gaume *et al.* 2010). It is possible to express the prior knowledge on the estimated quantiles, i.e., design floods (Coles & Tawn 1996). It can be extended to non-stationary models accounting for trends or shifts in extremes (Benito *et al.* 2004; Renard *et al.* 2006; 2013a), and to include historical information in the estimation (e.g., Reis & Stedinger 2005; Viglione *et al.* 2013).

The Bayesian method allows the calculation of predictive distributions, confidence intervals, and the median or mean of return levels based on the posterior sample from the distribution of parameters (Coles *et al.* 2003; Renard *et al.* 2013b). For the application of the Bayesian approach, we specified non-informative priors except for the shape parameters in the GEV and GL distributions. Those were normally distributed with mean and standard deviations specified as *N*(*0, 0.2*) and *N*(*−0.15, 0.175*), respectively. The non-informative prior in location parameter was proportional to a constant whereas the scale parameter was proportional to a constant on the log-transformed scale. The prior for the GEV parameters was suggested in Renard *et al.* (2013a), whereas the prior for the GL parameters was obtained from scatter plots of the L-moment skewness for flood data in the UK (Robson & Reed 1999). In the Results section, we use the mode of the resulting posterior distribution.

### Evaluation methods

We followed the evaluation strategy specified by Renard *et al.* (2013a) and evaluated goodness-of-fit according to both reliability and stability indices. Reliability evaluates how well the estimated model predicts return levels whereas stability measures to what degree the design flood estimates depend on the data used for estimation. The reliability can only be evaluated for a return period within the length of the data records whereas the stability can be analyzed for any return period.

The approach used in Renard *et al.* (2013a) is based on a split sample cross-validation test where, at each station *s*, each sample is in turn used for estimation and evaluation. The aim of this study is to assess performance as a function of record length *l*. We therefore chose a bootstrapping strategy by drawing, with replacement, 50 random samples (noted *m*) for each record length *l* sampled every five years between 30 and 90 years (30, 35, 40 …). Subsequently, for each sample, we fitted a distribution *F _{l,s,m}*, and derived the associated return levels

*X*and evaluation scores

_{T,l,s,m}*H*where

_{T,l,s,m}*T*is the return period. The complete original flood data at each station were used for evaluation. Results were averaged over all subsamples to obtain average scores for each record length

*H*. To yield general conclusions, station-specific results were then averaged over all sites and groups of similar sites in order to obtain evaluation score

_{T,l,s,*}*H*as a function of record length. Both the fitted distribution parameters and the return levels were used for evaluation, as described below.

_{T,l,*,*}#### Stability

The stability measure is a property of the statistical model only and we can thus evaluate it for any return period, including those greatly exceeding the length of record. Here, we evaluated the stability by calculating the coefficient of variation (CV) of the return levels for each site *s*, each resampling record length *l*, and each return period *T* over all subsamples *m* = 1, …, 50 : CV_{T,l,s,*}_{.} Subsequently, we calculated the average coefficient of variation over all sites: CV* _{T,l,*,*}*. This allowed us to show CV as a function of record length for individual sites as well as averaged over several sites.

#### Reliability: evaluation of distributions

The Anderson–Darling (AD) test measures the integral of the distance between empirical and fitted cumulative distribution functions. Here, is the fitted cumulative distribution to subsample *m* for record length *l* at site *s* and is the empirical cumulative distribution at site *s* with *n* data. It places more importance on the tail of the distribution than the Kolmogorov–Smirnov (KS) test:
7

The KS test evaluates how well an empirical distribution fits to a parametric one. The statistics is based on the maximum distance between the two cumulative distributions and should therefore be as small as possible: 8

#### Reliability: evaluation of thresholds

Since the aim of flood frequency analysis is to assess critical design flood, it is relevant to evaluate the fitted distributions according to how well they predict thresholds.

The Brier score (BS) (Brier 1950) is commonly used for evaluation, and was used in this paper for evaluating the predicted T-years event for flood frequency distributions. The BS compares the predicted probability of the exceedance of a threshold (given by ) to actual exceedance of the threshold by independent data (given by : 9where is the threshold defined by a return period T and is an indicator function that is 1 if and otherwise 0.

The quantile score (QS) compares observed floods to the estimated flood quantile for a given return period *T* and gives the difference a low weight if the observed flood is smaller than the estimated quantile:
10

Since the shortest records have 30 years of data, BS and QS were evaluated for return periods up to 30 years (2, 5, 15, 20, and 30). The thresholds in the BS equation were estimated for each station by applying the Hazen plotting position shown in Equation (11) (Makkonen 2008):
11where *i* is the rank of the observation , *n* is the number of observations, and is the estimated cumulative probability. According to Stedinger *et al.* (1993), the Hazen plotting position is a traditional choice that is least specific to a particular distribution.

#### Reliability: evaluation of empirical L-moments

The L-moment ratio diagram compares sample estimates of *τ*_{2}, *τ*_{3}, and *τ*_{4} (standard deviation, skewness, and kurtosis) to the theoretical population for parametric distributions by plotting the relationship between *τ*_{4} and *τ*_{3} for three-parameter distributions and between *τ*_{3} and *τ*_{2} for two-parameter distributions. It was introduced by Hosking (1990), and approximations for several distributions are given in Hosking & Wallis (1997). The advantage of this evaluation is that we visually compare how several theoretical distributions fit to our data sample, and it has become a standard tool in regional flood frequency analysis (Peel *et al.* 2001).

## RESULTS

### Estimation computational chain and open access to results

Based on the methods presented above, our research approach was highly multi-dimensional and involved saving a great amount of data. For this reason, we chose to save the input and model data into a NetCDF database. The full computational chain was carried out with the R software (R Core Team 2016)). The following libraries were used: *RNetCDF* (Michna & Woods 2016) for managing the NetCDF files, *doSNOW* (Revolution Analytics & Weston 2015a) and doMC for parallel backend on Windows and Linux, respectively, *foreach* (Revolution Analytics & Weston 2015b) for parallel computation. In addition, the following libraries were used for fitting the distributions: *evd* (Stephenson 2002), *nsRFA* (Viglione 2014), *fitdistrplus* (Delignette-Muller & Dutang 2015), *ismev* (Heffernan & Stephenson 2016), and *pracma* (Borchers 2017). For the Bayesian inference, we created MCMC chains of length 5,000 and did not discard any simulations. Two packages were created to facilitate the re-usability of this work. Code and data are available at https://github.com/NVE/FlomKart and https://github.com/NVE/fitdistrib. Given the size and multidimensionality of both NetCDF files (estimated parameters and goodness-of-fit indices), an easy-to-use visualization tool was required to analyze the data. The R package *Shiny* (Cheng *et al.* 2016) was used to create a browser-based graphical user interface. In addition, the following libraries were used to create the graphical interface: *shinyBS* (Bailey 2015), *leaflet* (Cheng & Xie 2016), *DT* (Xie 2015), and *formattable* (Ren & Russell 2015).

The code of this visualization tool was organized as in the R package available at: https://github.com/NVE/FlomKart_ShinyApp. For every station, key plots can be drawn to compare the modeled probability distribution to the empirical distribution of data, and the evaluation criteria are shown for each station. Since we were interested in extracting general conclusions for this study, we chose to present results aggregated over all stations.

### Station averaged results

We start by presenting the evaluation of reliability as average values over all stations and subsamples. The reliability measures, i.e., KS test statistics, AD test statistics, BS for a threshold corresponding to the overall 20 year return period, and QS for 20 year return periods, are shown in Figures 3–6, respectively. All 280 stations with more than 30 years of data were used, and the reliability measures are plotted as a function of the length of the subsample used for estimating distribution parameters. This allowed us to evaluate how the performance depends on the length of the available data. We made one subplot for each distribution and one line for each estimation procedure. In these plots, the lowest value indicates the best performance.

The evaluation according to stability is shown in Figure 7, where the average CV in return levels is plotted as a function of record length. The calculation of the CV was based on the 50 subsamples for each record length. All distributions and methods become more stable as record length increases.

In order to summarize the relative performance of the different distributions and estimation methods, Figure 8 contains a subplot of each of the performance measures. For each distribution, the estimation method providing the best performance was selected. For the three-parameter distributions, we excluded the ML methods from the reliability criteria since it was only marginally performing better and provided unstable results. When selecting the estimation methods for the CV, we excluded the method of moments from the three-parameter distributions, since this method never obtained the most reliable predictions. Figure 8 thus allows us to compare the performance of the different distributions for the estimation method that performed best for each of them.

The L-moments ratios plotted in Figure 9 give a good visual impression of the spread in L-kurtosis and L-skewness across all stations. A moving average of L-skewness along L-kurtosis removes much of the scatter and thus helps analyze the data.

## DISCUSSION

The first research question raised in the Introduction sought to determine which combination of distribution and estimation method best fits the data. From the results presented herein, we see that it is difficult to disentangle the performance of the estimation methods from the performance of the distributions, and that the combinations of estimation method and distribution that give the best performance vary between the performance measures. The interpretation of the results in order to answer the research questions, is therefore, challenging.

From the performance of the reliability criteria, we see that the best estimation methods for the three-parameter distributions perform, in general, equally well or better than the best estimation methods for two-parameter distributions for all record lengths (Figure 8). The gain in using a three-parameter distribution increases with record length. The only exception is the QS, where the Gumbel distribution is equally good as the three-parameter distributions (Figure 8). Among the three-parameter distributions, the GEV and the GL distributions give the best performance. The GL distribution is better than the GEV distribution for the BS, whereas for the other two scores, the GEV distribution slightly outperforms the GL distribution. The GL distribution seems to be more challenging to estimate than the GEV distribution, since it is rather sensitive to the estimation methods used. Taking into account the stability criterion, the method of moments is most stable with the GL distribution. However, choosing to look only at the L-moments and Bayesian estimators that are the most reliable, we see that the difference in stability between the GEV and GL distributions is small (Figure 7). This indicates a slight preference for the GEV distribution.

Concerning the choice of estimation methods, the ML method should not be used in combination with three-parameter distributions since this combination provides very unstable results (Figure 7) and is, in some cases, only marginally better than the Bayesian and L-moment approaches (Figures 4–6). The method of moments is the most stable method for all distributions (Figure 7), but it also provides the most unreliable results for several scores (Figures 4–6). For all three-parameter distributions, either the L-moments or the Bayesian methods is preferred (Figure 8).

An unexpected result is the relatively low performance, as measured by the Brier and QS, when the Bayesian and ML methods are used to fit the data to the Gumbel distribution. In contrast, these two estimation methods perform relatively well for the AD and KS test statistics (Figures 3 and 4). Further investigations revealed that this low performance is, to a large degree, controlled by the skewness of the original data. The relatively low performance of the ML and Bayesian methods happens when the L-skewness is lower than 0.15, which is slightly lower than the L-skewness of the Gumbel distribution (0.17). This indicates that, for the Gumbel distribution, the ML and Bayesian estimators are more sensitive to low outliers in the dataset than the other estimation methods, and that they should be avoided when the L-skewness of the data is close to zero or negative.

The second research question was whether the answer to (1) depends on local data availability. To answer this question, we plotted all evaluation scores as a function of record length. As expected, for all evaluation scores, the performance improves with increasing record length. The difference in reliability between the distributions increases with record length, indicating that for the shortest record lengths, there is little gain in choosing a three-parameter distribution (Figure 8). The BS is an exception where the three-parameter distributions are better than the two-parameter distributions for all record lengths (Figure 5). With the exception of the method of moments, three-parameter distributions show lower stability than two-parameter distributions, even for the longest record length. There is no clear threshold in record length above which one should use a three-parameter distribution rather than a two-parameter distribution. A threshold at 50 years of record for switching from two- to three-parameter distributions could be justified if we only looked at the AD and QS test statistics. The difference between the GEV and Gumbel distributions is, indeed, small with those criteria. The Gumbel distribution is, however, considerably more stable for any length of record (Figure 8, upper right panel).

The results presented herein might be influenced by several factors that are not directly related to the choice of distribution. For the Bayesian method in particular, the choice of prior distribution might influence our conclusions. For the GEV distribution, values were chosen from the literature. Less information is available for the GL distribution, and the prior for the shape-parameter was set subjectively based on previous studies. For the Pearson-III distribution, we used a non-informative prior. We might therefore expect the performance of the Pearson-III distribution to be lower than for the other two. The results are prior-sensitive, in particular for the shortest record lengths. Providing different priors might change our conclusions. In addition, many of the algorithms used herein require numerical solutions, and the convergence of these algorithms might in some cases be misleading. For the MCMC in particular, we could not monitor the convergence of the more than 390,000 chains that were estimated using our resampling approach. Convergence checks commonly consist of running multiple MCMC chains with varying starting values and checking that all chains converge to the same values. While we have not performed such an analysis, we have run multiple MCMC chains with varying starting values and slight changes in the datasets. Through the stability assessment, we have then assessed whether the different chains yield similar results. As the stability of the Bayesian method is on the order of that of the other estimation approaches (see Figure 7), we conclude that the convergence rate of our chains is sufficient.

The resampling with replacement approach allowed us to compare all stations with sample sizes longer than 30 years, i.e., resampled records of lengths up to 90 years were created from the original record of 30 years. The benefit of using this approach was that more stations could be included in the evaluation. We used 280 stations, of which, only 35 had record lengths of 90 years or more. The drawback of this approach is that stations with short record lengths will get resampled several times. By grouping stations according to their length of record and plotting the group-averaged CV of return levels for each group, we saw that (i) the average CV is lowest for the shortest record lengths, and (ii) the spread in CV is largest for the shortest record lengths. An explanation for the second issue is that the resampling approach used here might be sensitive to outliers in the underlying data, as those might be sampled several times for short records. We identified three stations that may exhibit this behavior, but excluding them from the evaluation showed little influence on the average performance.

Another aspect not tackled in this study is the possible non-stationarity of flooding patterns in Norway. The standard approach for addressing non-stationarity related to climate change is to look for a climate factor that describes the expected change in design flood estimates (Lawrence 2016). The climate factor assumes that the reference design flood is based on a stationarity assumption. The non-stationarity in our case might be linked to (1) the measurement process, (2) changes in climate, and (3) changes in land use. As part of the quality control, we have in particular looked for trends and/or step changes in floods with a focus on the measurement process. To identify non-stationarity in a flood time series is also challenging due to the large noise to signal ratio. Vormoor *et al.* (2016) have shown that rain-dominated catchments have a tendency to more frequent floods but the trend is less clear regarding flood magnitude. They also identified shifts in flood generating processes with a transition from snowmelt floods to rain-dominated floods in many catchments.

Finally, some relationships between catchment properties and the most appropriate distribution were investigated. There were however no clear indications that catchments could be grouped according to their physical properties. This could be due to the very fragmented hydro-meteorological patterns in Norway. A more in-depth study of those relationships was beyond the scope of this paper and will be investigated in subsequent studies.

## CONCLUSIONS

The aim of this study was to evaluate the predictive fit of probability distributions to annual maximum flood data, and in particular to evaluate (1) which combination of distribution and estimation method gives the best fit and (2) whether the answer to (1) depends on record length. These aims were achieved by assessing the sensitivity to record length of the predictive performance of several probability distributions. A bootstrapping approach was used by resampling (with replacement) record lengths of 30 to 90 years (50 resamples for each record length) from the original records and fitting distributions to these subsamples. Subsequently, the fits were evaluated according to several goodness-of-fit measures and to the variability of the predicted flood quantiles.

Based on the results presented herein we conclude the following:

The GEV and GL distribution provided the most reliable results.

The method of linear moments or the Bayesian method are the recommended estimation methods.

The ML method was particularly unstable with three-parameter distributions, even for short return periods. This method should therefore be avoided.

For the Gumbel distribution, the L-moment approach is recommended. The Bayesian approach was sensitive to the skewness of the data.

The method of ordinary moments was consistently the most stable estimation method. This stability results in a light but consistent trade-off on goodness-of-fit against the method of linear moments.

There is no clear threshold in record length above in which one should use a three-parameter distribution rather than a two-parameter distribution.

We focused on developing a reproducible workflow so that the methodology can be reused and improved as more data become available.

The results herein show that the use of the GEV or the GL distribution is challenging since, in particular, the shape parameter is sensitive to the underlying data resulting in more unstable results. Alternative approaches such as using a mixture of two-parameter distributions, should therefore be investigated.

## ACKNOWLEDGEMENTS

This work was jointly funded by the Research Council of Norway and Energy Norway (grant ES519956 FlomQ) and internal research funding at the Norwegian Water Resources and Energy Directorate. The data were extracted from the national hydrological database at the Norwegian Water Resources and Energy Directorate and are available upon request to the main author. The source code for estimating flood frequency distribution is done in the statistical programming language R (http://www.Rproject.org/) and is available on GitHub.

- First received 12 April 2017.
- Accepted in revised form 18 July 2017.

- © 2018 The Authors

This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).

Sign-up for alerts