Comparison of hybrid regression and multivariate regression in the regional flood frequency analysis: A case study in Khorasan Razavi province

Background: Magnitude, rate and frequency of the stochastic and unexpected events are of great significance and importance in hydrology. Nowadays, for economic planning of the projects, the use of analytical methods of unexpected events in hydrology is unavoidable. The aim of this study was to compare hybrid regression and multivariate regression to estimate flood peak discharge in the province of Khorasan Razavi and in the selected water measured stations. Methods: For this purpose, 19 hydrometric stations were selected and analyzed. In the first step, the rate of peak discharge was estimated with different return periods and by selecting the best regional distribution (lognormal distribution type ΠΙ). In the next step, independent and important variables including area, mean annual rainfall, the average height of the watershed and its slope were determined using functional analysis and using SPSS software version 22. Then, two hydrologically homogeneous regions were determined by homogeneity test using cluster analysis, and accordingly, two models were presented for the whole area and also for homogeneous areas. To compare and evaluate the accuracy and efficiency of the estimated models, the rates of discharges were estimated and compared with observational rates using three control watersheds. To compare models, it was used from the average absolute values of the relative error index. Results: It was revealed that the hybrid method was more accurate than the multivariate regression method in the return period of 50 years and provides better results of flood discharges for the area. Homogenous areas had a higher coefficient of determination (R2) and lower relative standard error (RSE) compared to the whole area. It was also revealed that with increase of return period, the rates of R2 decreased but the rates of relative standard error increased. Conclusion: The accuracy of multivariate regression and hybrid methods was the same in the 25-year return period. In the present study, the importance and necessity of homogenous areas compared with the model of the whole area are completely evident.


Introduction
Flood is one of the most important natural disasters in the world that makes main problems for the development and reclamation of many countries. This natural phenomenon has been converted to a continuous phenomenon that can destroy capital and financial facilities of countries. Flood management in a basin will not be successful unless the hydrological behaviors of catchments are predicted (1)(2)(3)(4). Hence, the determination of flood peak discharge with different return periods is essential for conducting flood control plans, designing hydraulic structures, preventing erosion and designing storages in an area (5)(6)(7)(8). For this reason, different methods such as regional flood analysis have been presented. All these methods are made based on the recognition of factors affecting flood. To estimate flood frequency at ungauged drainage basins, flood frequency analysis using regression analysis is performed (9). In regional methods for flood prediction, the estimation of the index flood is a crucial point in order to attain reliable predictions in ungauged basins (10). In recent years, the regression methods have been successfully employed and demonstrated satisfactory results (11)(12)(13). Kjeldsen and Rosbjerg compared different methods for estimating T-year events. Their simulation study revealed that in regional index-flood estimation, the method of probability-weighted moments is preferable to the method of moment estimation with regard to bias and RMSE (14). Kuldkepp in a study on watersheds located in arid areas of Nevada State, presented a new method called hybrid method to estimate peak flood discharge in areas with deficient hydrologic statistics (15). Kjeldsen et al, in a flood frequency analysis in South Africa, suggested that the effective factors in the peak flood discharge like the physiographic and climatic characteristics including longitude, latitude, watershed area, mean annual rainfall, rainfall concentration and soil compaction coefficient (16). Eng et al estimated the peak flow of 50 years using hybrid regression in the Southeast of the United States and concluded that this method has less error compared with geographic methods or methods based on the predictive variables (17). Song and Singh performed flood frequency analysis in Illinois by comparing curves of flood frequency obtained from flood simulation series with those obtained from equations of regional regression. They found that the curves obtained from flood simulation series have a higher accuracy compared to those obtained from the equations of regional regression (18). Eng et al used a hybrid region-of-influence (HRoI) regression method (as a hybrid geographic region-of-influence approach) to estimate the 50-year peak flow (Q50) in the Southeastern United States (17). They concluded that by increasing the accuracy of the geographic proximity of the stations, the accuracy of the model also increases. Marofi showed that Gumbel method is the best-fitted distribution in 10 regions of Hamadan province (19). Shukla et al dealt with the probabilistic estimates of extreme maximum rainfall using the regional generalized extreme value (GEV) distribution and concluded that the GEV model satisfied the selection criteria (20). Haberlandt and Radtke evaluated a hybrid rainfall model for the flood frequency analysis and reported a good performance in capturing average and extreme rainfall characteristics (21). Liu et al in downscaling some data-driven methods to forecast the daily precipitation and daily maximum and minimum temperature series concluded that time-lagged feed forward neural network (TLFFNN) and evolutionary polynomial regression (EPR) have more efficiency than statistical downscaling model (SDSM) (22). Smith and Karr investigated the flood frequency analysis using the Cox regression model and considered the assessment of the relative importance of physical processes such as snowmelt or soil moisture storage on flood frequency at an area and derivation of time-varying flood frequency estimates (23). Speyrer and Ragas examined the impact of flood risk and mandatory flood insurance on property values (24). Several studies performed the popular SPARROW model using a hybrid approach combining conventional regression methods with spatial data based on landscape characteristics and stream properties to predict continuous water quality from point observations (25)(26)(27). Firat compared different models for daily river flow forecasting and concluded that adaptive neuro-fuzzy inference system (ANFIS) has higher accuracy and reliability than other models (28). Soong and Soong compared flood-frequency curves derived from two different methods and concluded that flood-frequency curves derived from regional regression equations was slightly sloppy than those derived from the simulated flood-frequency curves (29). The aim of this study was to determine the physical and climatic factors affecting regional flood analysis, to find the best statistical distribution for determining the maximum flood discharges and also to compare two models of multivariate regression and hybrid method in the regional flood frequency analysis.

Study area
The study area was located in Northeast of the country and from geological view, the sedimentary basin of Hezar Masjed-Kopedagh and Binalood as two separate tectonic, was the main part of the study area. Based on Domartan, the climate of these areas is arid and semi-arid. Figure  1 shows the location of the study stations in Khorasan Razavi.
Selection of the most suitable regional frequency distribution After testing and data reconstruction, the flood frequency analysis included Pearson type Π distribution, Log Pearson type ΠΙ, two-parameter gamma, three-parameter Lognormal, two-parameter normal and Lognormal, were fitted with peak discharge of each station by HYFA software. To select the best statistical distribution, Chisquare test was used, so that its minimum value was considered χ 2 =1 and its maximum value was considered χ 2 =6. Selection of the independent variables and limiting variables In this study, a method of factor analysis was performed for 17 variables in the selected watersheds using SPSS software. Variables include different characteristics of watershed such as area, perimeter, average slope, the slope of the main river, length of the watershed, length of the main river, average height, average rainfall, maximum 24-hour precipitation, drainage density, maximum and minimum height and finally Gravelius, Horton, Miller and Schumm coefficients. As a unit of measurement, each variable differed from another one, therefore, for proper comparison of the variables with each other, all of them were standardized to be the same unit. As the results obtained from factor analysis was complex at first and did not produce an optimal solution, so to maximize the variance of each factors and to simplify the expression of variables of functional structure, functional axis has been rotated using a varimax rotation, one of the conventional methods, to be as an independent factor. Meanwhile, the naming of factors is done on the basis of the rotated functional values. Then, the matrices of functional scores of the stations were extracted using regression estimation. To confine the number of factors, Kaiser-Meyer-Olkin (KMO) measure was used to determine the rate of proportionality of a number of selective factors. Nonessential variables were extrapolated using the anti-image correlation matrix by measurement system analysis (MSA). These variables were diameter elements of the correlation matrix. In the elimination of data, the rate of KMO and percentage of the variance should be considered, because by eliminating one variable, KMO value and percentage of variance increase or decrease.
The coefficient values of about 0.9 were considered very suitable, 0.8, suitable, 0.7, moderate, 0.6, medium, and less than 0.5 were not suitable. After determination of essential variables (12 factors), functional analysis was performed based on these variables.
Identification of homogeneous regional classes The first step in the analysis of flood-prone area is to determine the homogeneous areas that were done using multivariate regression and index flood methods by cluster analysis. For this purpose, in the first step independent variables were determined by applying the process of factor analysis among input variables. In the next step, the homogenous areas were identified by cluster analysis using cumulative classes in SPSS software version 22. In the hybrid method, homogeneous areas were partitioned based on the most important parameters of the watershed (obtained from factor analysis), that affect the flood characteristics.
Hybrid method Hybrid method is on the basis of the station-year method. In this method, all statistics available to the stations were used to overcome the lack of statistics. In the hybrid method, standardization was done by dividing each statistical data by a power of hydrological parameter of the watershed and then statistics of each class was combined with each other. The hybrid method included two sections: In the first section, the area was partitioned into at least three homogenous areas on the basis of the most important parameter of the watershed, and the flood peak statistics in each class were combined with each other. In the second section, statistics of annual flood peak discharge were standardized using an approximate factor, and then at the end of the hybrid process, an approximate factor of the standard was corrected using a technique combined from regression and flood frequency analysis. The process of standardization was done on using one of the parameters of the watershed in each replication. The model used in the hybrid method (equation 1), is like those that are used in different methods of regional flood analysis.
Where Q T is peak discharge in return period of T years, A, B, C are independent parameters of the watershed, B, c, d, symbols in the regression equation and A is the fixed rate of the equation. Partitioning of homogenous areas was done on the basis of the most important parameter of the watershed that affects flood characteristics. So, the area of the watershed, as the most important parameter of the watershed, was used in the first process. Partitioning must be done in a way that in each class, total data would be at least 100 classes and sum of a number of data in different could be near to each other and otherwise, by weighing to class data could be near as far as possible. A maximum number of classes with respect to the hypothesis of at least 100 data in each class were determined using equation 2: Hybrid method is on the basis of the station-year method. In this method, all statistics available to the stations were used to overcome the lack of statistics. In the hybrid method, standardization was done by dividing each statistical data by a power of hydrological parameter of the watershed and then statistics of each class was combined with each other. The hybrid method included two sections: In the first section, the area was partitioned into at least three homogenous areas on the basis of the most important parameter of the watershed, and the flood peak statistics in each class were combined with each other. In the second section, statistics of annual flood peak discharge were standardized using an approximate factor, and then at the end of the hybrid process, an approximate factor of the standard was corrected using a technique combined from regression and flood frequency analysis. The process of standardization was done on using one of the parameters of the watershed in each replication. The model used in the hybrid method (equation 1), is like those that are used in different methods of regional flood analysis.
Where QT is peak discharge in return period of T years, A, B, C are independent parameters of the watershed, B, c, d, symbols in the regression equation and A is the fixed rate of the equation.
Partitioning of homogenous areas was done on the basis of the most important parameter of the watershed that affects flood characteristics. So, the area of the watershed, as the most important parameter of the watershed, was used in the first process. Partitioning must be done in a way that in each class, total data would be at least 100 classes and sum of a number of data in different could be near to each other and otherwise, by weighing to class data could be near as far as possible. A maximum number of classes with respect to the hypothesis of at least 100 data in each class were determined using equation 2: (2) Where J is the maximum number of classes while its minimum value is three classes and Nf is the number of total statistical data in each class. Then, the average weight of the catchment area (first parameter) was determined according to equation 3: Where ̅ is average weight of the area of the watershed in class I, Aijk, area of the watershed in class i and at the station j and i=1, 2, 3…f, number of classes, J=1, 2, 3… g, number of stations in ith class, and K, number of years in the jth station. It should be noted that the number of classes should be determined on the basis of the second and nth physiographic parameters in order and the average weight of the area in classes should be also calculated. Standardization of peak discharge was replicated from equation 4 and it was performed 100 Nf J  Where J is the maximum number of classes while its minimum value is three classes and Nf is the number of total statistical data in each class. Then, the average weight of the catchment area (first parameter) was determined according to equation 3: Hybrid method is on the basis of the station-year method. In this method, all statistics available to the statio were used to overcome the lack of statistics. In the hybrid method, standardization was done by dividing eac statistical data by a power of hydrological parameter of the watershed and then statistics of each class w combined with each other. The hybrid method included two sections: In the first section, the area w partitioned into at least three homogenous areas on the basis of the most important parameter of th watershed, and the flood peak statistics in each class were combined with each other. In the second sectio statistics of annual flood peak discharge were standardized using an approximate factor, and then at the en of the hybrid process, an approximate factor of the standard was corrected using a technique combined fro regression and flood frequency analysis. The process of standardization was done on using one of th parameters of the watershed in each replication. The model used in the hybrid method (equation 1), is lik those that are used in different methods of regional flood analysis.
Where QT is peak discharge in return period of T years, A, B, C are independent parameters of the watershe B, c, d, symbols in the regression equation and A is the fixed rate of the equation.
Partitioning of homogenous areas was done on the basis of the most important parameter of the watershe that affects flood characteristics. So, the area of the watershed, as the most important parameter of th watershed, was used in the first process. Partitioning must be done in a way that in each class, total da would be at least 100 classes and sum of a number of data in different could be near to each other an otherwise, by weighing to class data could be near as far as possible. A maximum number of classes wi respect to the hypothesis of at least 100 data in each class were determined using equation 2: Where J is the maximum number of classes while its minimum value is three classes and Nf is the number total statistical data in each class. Then, the average weight of the catchment area (first parameter) w determined according to equation 3: Where J is the maximum number of classes while its minimum va total statistical data in each class. Then, the average weight of determined according to equation 3: Where ̅ is average weight of the area of the watershed in class I, the station j and i=1, 2, 3…f, number of classes, J=1, 2, 3… g, num of years in the jth station. It should be noted that the number of cl the second and nth physiographic parameters in order and the aver also calculated. Standardization of peak discharge was replicate 100 Nf J  is average weight of the area of the watershed in class I, A ijk , area of the watershed in class i and at the station j and i=1, 2, 3…f, number of classes, J=1, 2, 3… g, number of stations in ith class, and K, number of years in the jth station. It should be noted that the number of classes should be determined on the basis of the second and nth physiographic parameters in order and the average weight of the area in classes should be also calculated.
Standardization of peak discharge was replicated from equation 4 and it was performed through dividing peak discharge by the average weight of the area in each class; in the first replication, b was considered equal to 1.
Then, S ijk of each class was fitted by the curve of intermittent flood or by an empirical equation. For each class, a suitable probability distribution was determined, then, S ijk with different return periods for each class, was obtained using a suitable distribution (Sti). To obtain peak discharge with a return period of T years, the rates of S ti was non-standardized as equation 5: Qti=Sti (Āi) b (5) The average weight of the area was fixed along with the replication but the values of b vary to a constant value. The new exponent in each return period was obtained as equation 6: 6 through dividing peak discharge by the average weight of the area in each class; in the first replication, b was considered equal to 1.

(4)
Then, Sijk of each class was fitted by the curve of intermittent flood or by an empirical equation. For each class, a suitable probability distribution was determined, then, Sijk with different return periods for each class, was obtained using a suitable distribution (Sti). To obtain peak discharge with a return period of T years, the rates of Sti was non-standardized as equation 5: The average weight of the area was fixed along with the replication but the values of b vary to a constant value. The new exponent in each return period was obtained as equation 6: In fact, if a regression equation is plotted between Qti and Ai, bt is the slope of the equation, in other words, it is the coefficient variation of Ai. By the new exponent of calculation, the next stage was replicated so that the rates of Qijk was standardized through dividing Aijk by exponent of bt,, and stages of determination of a suitable distribution and calculation of Sti were replicated to get a newer bt. This replication was continued until the variation of bt reached less than 1% . After application of a parameter, if b was not fixed, it might be indicated that there was no linear relationship between parameter and peak discharge, so it was not included in the model. When the most important parameter of the watershed (area) was included in the model and fixed exponent and final bt were calculated, then, it was time to determine the second parameter of the watershed, for example, the annual rainfall of the watershed. After determination of the second parameter, all stages performed for the first parameter, were also replicated for the selected parameter with a difference that for standardization of discharges of each rainfall class, Qijk of each station was divided by (A b P c ) to obtain b. Exponent of c was also equal to 1 in the first replicate. After standardizing the discharges, the steps for determining the Sti and Ct are repeated. The repetition process continues as long as the power of each new parameter is fixed. The rate of coefficient A was determined based on the correlation.

Method of multivariate regression
In this method, some ratios are presented between discharges in different return periods and physiographic characteristics of the watershed (12). The general relation of multiple regression is as equation 7: In fact, if a regression equation is plotted between Q ti and A i , b t is the slope of the equation, in other words, it is the coefficient variation of A i . By the new exponent of calculation, the next stage was replicated so that the rates of Q ijk was standardized through dividing A ijk by exponent of b t, , and stages of determination of a suitable distribution and calculation of S ti were replicated to get a newer b t . This replication was continued until the variation of b t reached less than 1% . After application of a parameter, if b was not fixed, it might be indicated that there was no linear relationship between parameter and peak discharge, so it was not included in the model. When the most important parameter of the watershed (area) was included in the model and fixed exponent and final b t were calculated, then, it was time to determine the second parameter of the watershed, for example, the annual rainfall of the watershed. After determination of the second parameter, all stages performed for the first parameter, were also replicated for the selected parameter with a difference that for standardization of discharges of each rainfall class, Q ijk of each station was divided by (A b P c ) to obtain b. Exponent of c was also equal to 1 in the first replicate. After standardizing the discharges, the steps for determining the S ti and C t are repeated. The repetition process continues as long as the power of each new parameter is fixed. The rate of coefficient A was determined based on the correlation.

Method of multivariate regression
In this method, some ratios are presented between discharges in different return periods and physiographic characteristics of the watershed (12). The general relation of multiple regression is as equation 7: Estimation of the regional distribution parameters In this method, regression equations were developed between characteristics of watershed and parameters of the probability distribution. Firstly, after frequency analysis, suitable parameters were obtained for each station, then; regression equations were developed if necessary, to estimate the parameters of the probability distribution. In fact, using characteristics of the watershed, distribution parameters were obtained for different areas with deficient statistics, then, by application of obtained distribution parameters, different Q Ts were calculated (1,4).

Functional analysis
In this stage, essential variables were determined based on the value of KMO = 0.721. After determination of the essential variables (14 factors), the functional analysis was performed based on these variables. Table 1 shows the rates of invisible matrix root and variance percentage of the factors. By application of functional analysis, from input variables, four factors including area, mean annual rainfall, the average height of watershed and slope of the main river, were identified as the most important factors. The total variance of these factors was considered 91.18%.

Hybrid method
In the study area, 19 watersheds that had at least 10 statistical years of peak flood discharge, were selected. These 19 watersheds had totally 328 years of discharge statistics. After functional analysis, the factor of area was identified as the most important factor and also the factor of rainfall was identified as the second effective factor on the peak discharge. So, in the first step, the

Estimation of the regional distribution parameters
In this method, regression equations were developed between characteristics of watershed and parameters of the probability distribution. Firstly, after frequency analysis, suitable parameters were obtained for each station, then; regression equations were developed if necessary, to estimate the parameters of the probability distribution. In fact, using characteristics of the watershed, distribution parameters were obtained for different areas with deficient statistics, then, by application of obtained distribution parameters, different QTs were calculated (1,4).

Functional analysis
In this stage, essential variables were determined based on the value of KMO = 0.721. After determination of the essential variables (14 factors), the functional analysis was performed based on these variables. Table 1 shows the rates of invisible matrix root and variance percentage of the factors. area was homogenized based on the basin area, then, the average weight of area for each class was calculated using equation 3. In the process of the hybrid method, In the process of hybrid, the instantaneous flood observations of the studied classes were standardized according to equation 4. Also the amount of b in the first replication was equal to 1. In other words, peak flood discharge of each station was divided by the exponent of its area to obtain the standardized discharges. As in the subsequent replication, the area exponent (b) would change, thus, for regional flood frequency analysis in each class; a longterm standardized compound statistic was obtained.
Then, based on the distribution of the three-parameter Lognormal (the best regional distribution), standard floods with return periods of 2, 5, 10, 25, 50 and 100 years were extracted for each class. After this stage, peak discharge of t years in each class (Q ti ) was determined as equation 5. The second replication of the hybrid method was begun from equation 4. In this stage, instead of b that in the first stage was equal to 1, the rates of Table 2 that have been determined for each return period was used and replication process continued until exponent of b was fixed and if the power (b) is not fixed, the corresponding variable is abandoned and it indicating that there was no linear relationship between the related factor and discharge. In this research, the parameter b in the third replication means that the fourth step was fixed. These rates have been shown in Table 3. After the factor of watershed area, mean annual rainfall was used. In the first step, the number of classes was determined and the stages done for the area, were replicated. As shown in equation 4, the method used for obtaining standard discharges for the first factor (area) and the second one (mean annual rainfall) is different. For subsequent agents, this process is repeated. After determination of regression coefficients of the area in different return periods, the stages for subsequent important factors were replicated to complete the abovementioned relations. For discharges with different return periods, the abovementioned stages for the factor   of mean annual rainfall were replicated and when the regression coefficient of these factors was fixed, the repetition was stopped and coefficients were replaced in the equation. In this study, the rates of c in different return periods were fixed after four steps (Table 4). After the factor rainfall factors, average height and slope of watershed were determined and as the regression coefficients of these factors did not reach a fixed and logical value, they were not included in the model and put aside. After determination of c and b and other factors (average slope of the watershed, the average height of watershed) that could not include in the model, the fixed rate of the factor was determined. To determine regression fixed value (a) in each station, first, the rates of Q t were obtained using the dominant distribution of the area, and simultaneously, with respect to physiographic characteristics included in the model, the rates of (A b P c L d ) t in different return periods for each station were calculated separately. Then, the ratio of Q t to (A b P c L d ) was determined for each station, finally, the median of these rates in different return periods, as regression fixed value (a), was included in the model. These rates are presented in the Table 5.
Multivariate regression All models used for the estimation of peak flood discharge in a return period of 2, 5, 10, 25, 50 and 100 years for the whole area and homogenous areas of watersheds, are presented in Tables 6 and 7, respectively.

Discussion
The first step in the analysis of regional flood is to determine homogenous areas that were done using multivariate regression method by cluster analysis. For homogenizing the areas, firstly, functional analysis was performed on the input variables, and four factors including area, mean annual rainfall, average height of the watershed and slope of the main river, were identified as the most important factors. The total variance of these factors was considered 91.18%.
In the next stage, homogeneity of the watershed was determined by cluster analysis and two homogenous areas were determined. Also, with respect to four main variables, two series of models were estimated: one in the whole area and another one in the homogenous areas. Homogenous areas had a higher coefficient of determination (R 2 ) and less relative standard errors (RSE) compared to the models for the whole area. It was also revealed that with increase of return period, the rates of R 2 decreased but the rates of RSE increased. Figure 3 shows the mean relative error in the whole area and in the homogenous areas in the multivariate regression method. Kjeldsen and Rosbjerg performed a flood frequency analysis in South Africa and suggested that the physiographic and climatic characteristics of the area including the area of the watershed, mean annual rainfall, rainfall concentration, the average slope of the watershed, altitude, edaphic characters and finally soil compaction coefficient, are effective in maximum flood discharges (14). Ouarda et al introduced some effective characteristics of the watersheds in peak discharges like area, length of the canal, the slope of the main canal and the mean annual rainfall (30). Honarbakhsh, in the analysis of regional flood in watersheds of Salt Lake, introduced physiographic characteristics like the area of the watershed, average height and the average slope of the watershed, annual rainfall and length of the watershed as the most important parameters affecting the flood (31). In this study, according to the functional analysis, area, mean annual rainfall, the average height of watershed and slope of the main river, were identified as the most important factors. Figure 3 shows that SE of the model of the whole area was higher than that of the homogenous areas and R 2 was also lower than it. Therefore, it is concluded that models of homogenous areas have higher efficiency compared to those of the whole area.

Conclusion
The aim of this study was to determine the physiographic and climatic characteristics affecting analysis of regional flood, to find the best statistical distribution for determining the maximum discharges and also to compare two models of multivariate and hybrid in the analysis of regional flood. Many studies have shown the importance of hydrologically homogenous areas in higher accuracy and efficiency of the regional analysis models. In his study, the importance and necessity of homogenous areas compared with the whole area were completely evident. Generally, homogenizing in most cases makes effective factors in the flow, to show higher accuracy in the model. The comparison showed that in the hybrid method, by increasing the return period, the rate of error decreased at first and then increased. Hybrid method in the return period of 50 years had a higher accuracy compared to the multivariate regression method and presented better results of flood discharge for the area.  Figure 3. Comparison of the mean relative error in the multivariate regression (whole area) and the multivariate regression (homogenous areas) and hybrid regression.
Methods of multivariate regression and hybrid in a 25year return period have similar accuracy.