Application of imputation methods for missing values of PM 10 and O 3 data: Interpolation, moving average and K-nearest neighbor methods

Background: PIn air quality studies, it is very often to have missing data due to reasons such as machine failure or human error. The approach used in dealing with such missing data can affect the results of the analysis. The main aim of this study was to review the types of missing mechanism, imputation methods, application of some of them in imputation of missing of PM 10 and O 3 in Tabriz, and compare their efficiency. Methods: Methods of mean, EM algorithm, regression, classification and regression tree, predictive mean matching (PMM), interpolation, moving average, and K-nearest neighbor (KNN) were used. PMM was investigated by considering the spatial and temporal dependencies in the model. Missing data were randomly simulated with 10, 20, and 30% missing values. The efficiency of methods was compared using coefficient of determination (R 2 ), mean absolute error (MAE) and root mean square error (RMSE). Results: Based on the results for all indicators, interpolation, moving average, and KNN had the best performance, respectively. PMM did not perform well with and without spatio-temporal information. Conclusion: Given that the nature of pollution data always depends on next and previous information, methods that their computational nature is based on before and after information indicated better performance than others, so in the case of pollutant data, it is recommended to use these methods.


Introduction
Air pollution as one of the most important issues has major environmental risks to humans, animal health, and other living organisms (1,2). Several natural and human factors produce pollutants such as particulate matter (PM), ozone (O 3 ), carbone monoxide, etc. in the atmosphere (2,3). In recent decades, urbanization and development of industrial towns and factories have been the main sources of increasing pollutants' production (2,4,5).
Particulate matter (PM 10 ) is one of the most important particles in the atmosphere due to its physical and chemical structural properties (6). Ozone is a secondary pollutant formed in photochemical reactions with precursors, which is produced by humans (7). According to scientific studies, exposure to PM 10 and O 3 increases the risk of asthma, cardiovascular, pulmonary diseases, and depression (7)(8)(9).
In recent years, many epidemiological studies and systematic analysis have been conducted on pollutants and their relationships with various diseases and premature death (2,10). These studies require detailed data of the pollutant concentrations, so the Air Quality Monitoring Organization always measures the concentration of pollutants in the air. Due to the high volume of information, it is always possible that some parts of observations would not be measured when machines fail, position of monitors change, filters are changed, the level of pollution is reduced from the specified range, and human error occurs (11,12). Therefore, we are always faced with incomplete dataset in air quality data, which in the analysis leads to different conclusions from the results of the complete dataset (13).
There are three major problems in dealing with missing data. First, the loss of information decreases the efficiency and power of the analyses. Second, irregularities in the data structure and the impossibility of using standard software reflect in complexities related to data management and analysis, especially in time series analysis which we need sequential data to make predictions. And thirdly, systematic differences between observed and unobserved data are among the most important problems of missing data, which change the obtained results (12).
There is no conducted study on the imputation of missing values of pollutant concentration in Tabriz. Due to the importance of missing issue and the consequences of arbitrary removal of missing data, which causes bias in the results, the aim of this study was to evaluate and compare the efficiency of imputation methods for missing values of PM 10 and O 3 concentrations in Tabriz in 2017. These methods include single univariate methods of mean, K-nearest neighbor, linear interpolation, moving average (simple, linear, and exponential), EM algorithm, linear regression and univariate multiple classification, and regression tree method, as well as multivariate multiple predictive mean matching (PMM) method by considering the spatial and temporal dimensions.

Data
In this study, the hourly mean concentrations of PM 10 and O 3 recorded in Tabriz air quality monitoring stations in 2017 were used. For both pollutants, no data were recorded from January 1 to February 22, June 22 to July 22, and August 23 to September 22. Therefore, the mentioned periods were excluded, and approximately, data of 8 months of 2017 were analyzed. Concentrations of PM 10 and O 3 were measured by beta attenuation and UVspectrophotometry methods at each station, respectively. After removing the outliers using the Z-score method, only 24 000 cases of hourly PM 10

Identifying the missing mechanism
To deal with the missing problem and have accurate statistical analysis, it is necessary to identify the pattern and mechanism of missing data. First, the pattern of missing must be determined. Then, in accordance with the pattern of missing, we must adopt an appropriate approach to deal with the missing data. To identify the mechanism of missing data, the missing data classification system was used according to the Rubin's theory (14). This system actually describes the relationship between the data and the probability of missing values. To better understand and describe the distribution mechanism of the missing data: Suppose the vector X = (X 1 , X 2 , …, X n ) T represents a random variable of complete data that includes the observed values X o and the missing values X m with the probability density function f θ . The goal is to estimate the unknown parameter vector θ. The missing data indicator M = (M 1 , M 2 , …, M n ) T is a binary variable that identifies the observed or missing state of the variable (if the value x i is observed, M i = 0, and if the value is missing, M i = 1), in fact, the missing data indicator defines the missing pattern. Representing missing data as a variable indicates that a probabilistic distribution manages the value of the missing data indicator. In practice, it is impossible to understand the exact distribution of M. However, the nature of the relationship between the indicator M and the data reveals the mechanism of the missing data as defined by the conditional distribution f (M | X, φ) of M over the complete data X as the vector φ is an unknown parameter that indicates the probability of missing data.
The missing completely at random (MCAR) mechanism requires that the probability of missing data in one variable X to be unrelated to the other measured variables, as well as the X values themselves; in this case, a random sample of complete data can be considered. In cases where there is a univariate time series, time is considered as an implicit variable, the probability of missing an item is independent of the observed time. Given the abovementioned information, the distribution that controls the MCAR mechanism is as follows: A more limited assumption than MCAR is that the missing values depend only on the observed variables, not on the variables that have the missing values, so the mechanism of missing data is called missing at random (MAR). Since there is no variable other than time in univariate time series, it is assumed that the probability of missing data depends on the point in time at which it is observed. The distribution of the MAR mechanism is as follows: Finally, when the probability of missing data in X is dependent on X values that are observed or missed, the data are missing not at random (MNAR). In the case of univariate time series, the probability of missing data may depend, but not necessarily, on the point in time at which it is observed. The distribution of the probability of MNAR mechanism is as follows: The concentration of pollutants is recorded at a specific time and location. In this dataset, time and location are as observed variables. The probability of missing an observation in pollutant concentration variable is independent of other observations but dependent on time and location variables (15)(16)(17). Thus, we can conclude that the missing mechanism of PM 10 and O 3 concentrations would be MAR.
In addition, a number of studies have been conducted on the mechanism of missing that have considered the missing mechanism in air pollution data as MAR. Therefore, according to the available evidences and reviewing the literature (16)(17)(18)(19)(20)(21)(22)(23)(24), the mechanism of missing in PM 10 and O 3 data was considered as MAR.

Imputation methods
In recent decades, various techniques have been introduced to solve the problem of missing data (14). One of the most popular and simplest methods is to delete the missing data, which is done in two ways: Pairwise deletion and listwise deletion. In the pairwise deletion, only the missing observations are deleted, which is the reason why the number of observations for analysis varies from one variable to another. In listwise deletion (also known as case deletion or complete analysis), all observations that have missing data in one or more variables are deleted. If the missing data mechanism is MCAR and the missingness rate is less than 5%, the complete data series can be obtained by deleting the missing values. Otherwise, if the missingness rate is high or the mechanism is MNAR or both of them, by eliminating information, a reduction in power or bias of results will occur (25,26).
On the other side of the deletion methods is the imputation approach, in which an estimate for the missing values is obtained and used. Imputation can be done with different techniques. These techniques can be categorized based on the number of imputed values (single and multiple) generated in the presence or absence of other variables (univariate or multivariate). Imputed values are replaced for each of the missing values (14).
In the single imputation method, the missing values are filled by only one amount and the imputation process is performed only once. The multiple imputation method for maintaining uncertainty in missing data (14) generates several simulated values for each missing value.
In a univariate imputation, the missing values of a variable are estimated as a function of the observed values of the same variable. In multivariate imputation, the missing values in one variable are estimated with other variables that are recorded simultaneously, in which multivariate imputation performance may be better than that of the univariate imputation (27). However, when a number of variables that are simultaneously recorded are missing, it is difficult to access the original data pattern (28).
The univariate single imputation generally works using the mean or median of the measured values, moving the previous observations forward, the next observations backwards, or the average of the before and after observations. The single multivariate imputation also proceeds to use a function of the mean or median of the values measured simultaneously (29)(30)(31).
In this study, 8 imputation methods including univariate single methods such as mean, K-nearest neighbor, linear interpolation, moving average (simple, linear, and exponential), EM algorithm, linear regression, and multiple univariate method such as classification and regression tree, as well as multivariate multiple PMM method were examined to select a better method for air pollution data analysis. R (4.0.2) (packages: mice (3.9.0), imputeTS (3.1), VIM (6.0.0)), SPSS version 25, and Microsoft Excel software were used to perform the imputation methods and analyze the obtained information.
To compare the efficiency of the imputation methods in this study, 5 performance characteristics including coefficient of determination (R 2 ), mean absolute error (MAE), root mean square error (RMSE), index of agreement (d 2 ), and coefficient of efficiency (E 2 ) were used. The values of each of these characteristics were compared using the original and estimated values in the test group to select the best method for estimating the missing values. A brief description of how each of the methods used works, is presented as follows.

Mean
Mean imputation technique is one of the simplest methods for imputing the missing values. In this method, the total missing values in the dataset are filled by the average of the available values (32). Mean imputation method has advantages and disadvantages. One of its advantages is that it is comprehensible and applicable in most statistical softwares. In this case, the sample size also does not decrease due to the fact that all the missing values are placed with the average. One of its disadvantages is that the mean imputation method leads to the bias of multivariate estimates such as correlation or regression coefficients. In general, the values imputed by the mean of the variables have no correlation with the other variables. Thus, the relationship between the variables is skewed to zero, and the standard error of the imputed variables are biased (33).

Moving average (MA)
In this function, the missing values are replaced by the moving average values. In this method, the mean is taken from an equal number of observations on either side of a central value, that is, for the missing value in position i of a time series, observations i-1, i+1, i+1, i+2, and so on (assuming window size of k = 2) are used to calculate the mean.
Since long gaps of missing values may occur and all values next to the central value are also missing, the algorithm has a semi-adaptive window size. When there are less than 2 non-missing values in the full available window, the window size will gradually increase to at least 2 non-missing values. In all other cases, the algorithm returns to the size of the preset window.
In simple moving average (SMA), all observations in the window have the same weight to calculate the average. In linear weighted moving average (LWMA), the weight of observations decreases in the algorithm process. Observations that are exactly next to the central value i, have a weight of 1/2 ; observations one farther away (i+2.i-2) have a weight of 1/3; the next (i+3.i-3) have a weight of 1/4, and so on. The exponential weighted moving average (EWMA) also uses weighting factors that decrease exponentially. Observations that are exactly next to the central value of i have a weight of ; observations one farther away (i+2.i-2) have a weight of ; the next (i+3.i-3) have a weight of , and so on.

Linear interpolation
In linear interpolation method, two data points are connected by a line and the interpolation function as Eq. (4).
So the independent variable, x i (i = 0.1.…), is a known value and the coefficient is unknown. So in Eq. (4), we have: 0 , in this case, f = f 1 have the same distribution (34).

K-Nearest Neighbor
The k-nearest neighbor (KNN) imputation method is the simplest strategy because the endpoints of the gaps (missing parts) are used as estimates for all missing values (28). The KNN method is presented in Eq. (5). (5) where, y is interpolated, x is the time points of interpolation, y 1 and x 1 are the coordinates of the starting points of the gap, y 2 and x 2 are the coordinates of the end points of the gap.

EM algorithm
The EM algorithm can be used when the joint distribution of missing data (X m ) and observed data (X o ) is known (14,35). If for is a probability density function of X = (X o .X m ), the goal of the EM algorithm is to find an estimate of θ that maximizes the accuracy of the observed data. This quantity cannot be explicitly calculated in general cases; the EM algorithm finds the expected MLE value by repeating it in the complete data likelihood maximization. Then, by starting with the initial value of θ (o) , and assuming that θ (t) is an estimate of θ in t, the algorithm is performed in two steps: E-Step: Calculation of the expected value of complete data likelihood according to the conditional distribution of the parameter value of the missing variable of θ (t) .
M-Step: Maximizing the Q function and determining the value of θ (+1t).

Multiple imputation by chained equation
Assume that X is the data matrix (n×p) and X = (X p .X c ) so that X p consists of p 1 columns of X which are almost as observed and X c contains the rest of the columns that are completely observed. X o is a set of elements observed in X, and X m is a set of missing observations in X. For a multiple imputation based on chain equations, the equation specifies a set of conditional distributions P(X i |X -i ), where X i is the i-th column of X p and X -i is the matrix of X whose i-th column is omitted. Imputed values are generated in 4 steps: Step 1: The initial values for the missing values are completed as follows: The matrix Z is defined as X c .
Then, for each i = 1.….p 1 , the missing values in X i are imputed using the conditional posterior distribution on Z, and the full version of X i is added to Z before the value of i is increased.
Step 2: For each i = 1.….p 1 , the missing values in X i are replaced on X -i using the conditional prior distribution.
Step 3: The second step is implemented l times.
Step 4: The first to third steps are repeated until m imputation sets are obtained. The X p columns are adjusted to increase the number of missing values until more information is available in the second part of the first step. Although random convergence can be conventionally investigated using a diagnostic tool such as scale reduction coefficient, satisfactory results are usually obtained using l = 10 (36). In the first and second steps, the basis of prediction is the use of generalized linear models as a criterion. In this study, classification and regression tree, linear regression and PMM were performed based on the multiple imputation by chained equation method.

Classification and regression tree
Classification and regression tree (CART) models seek to approximate the conditional distribution of a response variable on several predictor variables. The CART algorithm divides the space of the predictors so that the subset of the units formed by the partitions have relatively homogeneous results. Partitions are formed by the recursive binary division of the prophets. A set of partitions can be effectively represented by a tree structure with its leaves corresponding to a single subset.
The values in each leaf represent the conditional distribution of the response variable for the units in the data with the predictors that define the leaf separation criteria.
If the parametric models are equal and there is no discontinuity in the separation boundaries (37), the performance of CART method decreases, which is one of the main disadvantages of CART in comparison to parametric models.
Once a tree has grown, it can be trimmed by removing the branches. When trees are used as an analytical tool, it is better to modify them because smaller trees are easier to interpret. The trees are not modified when using multiple imputation methods. Instead, by adjusting the minimum number of observations and minimizing the heterogeneity in the values per leaf, the size of the trees is adjusted to allow for more division. More information on the CART method is presented in previous studies (38,39). Regression imputation is classified into two different types: Stochastic and deterministic regression imputation. Deterministic regression imputation replaces missing values with exact predictions from the regression model. Therefore, the imputed values are too accurate and lead to the overestimation of the correlations between X and Y. To solve this problem, stochastic regression imputation is used instead of deterministic regression imputation. Stochastic regression imputation adds a random error sentence to the predicted value, so it can reproduce the correlation between X and Y more appropriately.

Regression (stochastic vs deterministic)
One of the advantages of the above-mentioned method is that the relationships between X and Y (correlation, regression coefficients, etc.) are preserved because the imputed values are based on regression models. And the disadvantage of this method is that it may lead to impossible values. There are some limitations in this method and variables are often limited to a certain range (e.g., income must always be positive), so that the regression imputation is not able to operate under such restrictions (40,41).

Predictive mean matching
The PMM method is a new method in imputation methodology (42,43). The PMM algorithm can be divided into 6 steps:  )). 5. Then, it randomly selects one of these three items and places it with the corresponding value for the missing value. 6. In the multiple imputation, steps 1 to 5 are repeated several times. Each iteration of steps 1 to 5 creates a new imputed data set. In a multiple imputation, missing data is usually imputed 5 times. In order to choose which of the 5 times is the final imputed values, it is better to average the obtained values and analyze the obtained average as the final imputation values. The advantage of the PMM method is that it only operates based on the values observed for other units, so that the range of imputed values is always between the lowest and highest observed values. In addition, unlike other methods such as regression imputation overestimating the variance of small values X and underestimating the variance of large values X, the PMM method reflects the structure of the observed values well (44)(45)(46)(47).

Evaluation the performance of imputation methods
In this study, the training-testing validation approach was used to evaluate the performance of imputation methods. In this approach, a number of data were randomly deleted from the existing main dataset. Then, the deleted data were replaced by the estimated values obtained from different imputation methods to compare with the original data. To perform the above-mentioned approach, complete data without any missing data were selected from the original dataset. In the next step, from 22 813 and 18 883 complete and available observations for PM 10 and O 3 , respectively, 10, 20, and 30% were randomly selected and deleted (48,49). These deleted data were treated like the missing data and considered as test data. Then, missing values were imputed by various imputation techniques to recover the deleted values. Finally, the imputed values were compared with the observed values.

Performance indicators
To evaluate and compare the methods, the coefficient of determination (R 2 ), mean absolute error (MAE), root mean squared error (RMSE), as well as two measurement criteria without dimension of agreement index (d 2 ) and efficiency coefficient (E 2 ) were used.

Coefficient of determination (R 2 )
The value of R 2 indicates how much of the changes in the imputed data can be described by the observed data or points that are close to the regression line (28), so we have: The value of R 2 is between 0 and 1, with values closer to 1 implying a better fit.

Mean absolute error
The average difference between imputed and observed data is shown by the following equation (28): MAE ranges from 0 to infinity and a perfect fit is obtained when MAE = 0.

Root mean square error
RMSE is one of the most common methods for evaluating numerical prediction (28). Its value is calculated by Eq. (8).

Index of agreement (d 2 )
d 2 is a measure of relative error between imputed and observed data, which is given by Eq. (9) (50).
The value of d 2 is always between 0 and 1, and higher values indicating better agreement.

Coefficient of efficiency (E 2 )
The value of E 2 is calculated as follows (51): E 2 is always between infinity to 1, and higher values indicating better agreement (52). Table 1 shows the descriptive statistics for 10, 20, and 30% missing for PM 10 and O 3 .

Results
By changing the percentage of missing, the average value has changed very little and is always higher than the median value. As shown in Table 1, there is very little variation in percentiles of various missing rate. This is due to the random generation of missing values and the large number of observations in the same range. After imputation with different methods, the efficiency of each method was calculated and compared. Table 2 shows the performance indicators values for various methods with missing of 10, 20, and 30% for both pollutants.
First, the results obtained for PM 10 are discussed in detail. According to the results for 20% missing as the medium rate, values of R 2 , RMSE, and MAE for the linear interpolation method were 0.822, 15.14, and 8.33, respectively. After linear interpolation, the moving average and the nearest neighbor had the best performance and PMM and linear regression showed the worst fit.
PMM method has been introduced as one of the efficient methods in the field of missing data. Thus, spatial and temporal information of the data were entered into the model as an independent variable and examined. First of all, the model was fitted by considering spatial and temporal information of the data (R 2 = 0.007, RMSE = 38.39), temporal information (R 2 = 0.006, RMSE = 38.47), and spatial information (R 2 = 0.009, RMSE = 37.71). Finally, the model was fitted with none of the spatial and temporal information (R 2 = 0.52, RMSE = 37.71). The PMM method without spatial and temporal dependencies in imputation has shown good performance in 20% missing in all validation methods. Although the results for 10% missing were consistent with the results of 20%, in 30% missing, considering the time-dependent variable in imputation, it performed well in most validation methods.
Based on the results, when spatial and temporal information was not entered into the model, PMM method showed better performance, which indicates that even with fitting the model using spatial and temporal information as an independent variable, the PMM method does not have a good ability to impute air pollution data which are time series.
CART method, which is based on regression and decision tree, and the EM algorithm method, which is based on iteration in estimating parameters and convergence, both had better performance than the PMM and regression methods in evaluation and validation.  The moving average method was also evaluated in three models: SMA, LWMA, and EWMA. According to the results, EWMA model (R 2 = 0.808 and RMSE = 15.64) had the best performance, followed by LWMA model, and, SMA model had the weakest performance.
According to the obtained results for PM 10 data, mean imputation (RMSE = 35.73 and d 2 = 0.019) had poor performance, so that it is not a suitable method for nesting missing values. The results of evaluating methods in 10% and 30% missing are the same as 20% missing. In all scenarios, the linear interpolation, moving average, and nearest neighbor methods had the best performance, respectively.
Regarding O 3 , according to the obtained results for all missing rates in the imputation of O 3 missing values, linear interpolation had the lowest RMSE and MAE whereas these values were the highest for PMM and MEAN. It should be noted that the results obtained for O 3 were similar to those obtained for PM 10 , and the linear interpolation, moving average, and nearest neighbor in terms of performance indicators were the best methods for imputing the missing values, respectively. Figure 1 and Figure 2 show the scatter plot of imputed values on the original dataset for each imputation method for PM 10 and O 3 , separately.
The obtained plots for PM 10 and O 3 also show the highly overlap of the original and missing values in the LINT, KNN, LWMA, and EWMA methods compared to the other methods. For PM 10 , PMM plots had the least overlap in all cases and the plots of the EM and LRNP also did not show a significant overlap. In the case of O 3 , the plot of EM had a better overlap, but the results obtained in terms of performance indicators are not acceptable to EM.

Discussion
The aim of this study was to evaluate and compare several imputation methods in the various missing rates (10%, 20%, and 30%) for air quality dataset of Tabriz. Due to the importance of missing issue and the consequences of arbitrary removal of missing data, which causes bias in the results, the mechanism of missingness was assessed and several imputation methods were performed. Moreover, performance indicators for evaluation and validation of the underlying methods along with statistical concepts and models were reported.
In this study, the nature of air pollution data in the PMM method was completely considered. That is, in this method, the spatial information as air quality monitoring stations, and temporal information such as diurnal and hourly recordings of the air pollution data were considered. The results obtained from the PMM method were compared with those from other methods whose efficiency and validity have been determined in other studies (36)(37)(38)(42)(43)(44)(45)(46).
It is essential to notice that every method can not be used to fill the unobserved values for imputation, even if the introduced method is one of the best imputation methods. Each method in each situation works differently depending on the nature of the data. In the present study, several methods were investigated for imputing the missing values of PM 10 and O 3 . The results of both pollutants were similar, so the results obtained in this study can be generalized to other pollutants. Due to the nature of air pollution data in Iran, which is almost similar to the data of Tabriz, the proposed methods in this study with acceptable accuracy can be used to replace the missing values in similar air pollution dataset.
The results showed poor performance for mean imputation method. According to a study by Junninen et al mean imputation always disrupts the intrinsic structure of the data and causes a high bias in correlation, so it is not a good method for imputation, especially if there is a high percentage of missing (28).
The nature of air pollution data is a type of time series and always depends on the previous and next information. Methods as linear interpolation, moving average, and nearest neighbor have computational nature based on before and after information. Thus, in comparison to other methods, they showed better agreement and fit, according to the all performance criteria.
Engels and Diehr compared several regression methods based on predictor variables such as hot deck, mean and median of a column, row, previous rows, before and after rows (30). They concluded that the methods based on previous observations, and before and after methods are better methods to place missing values in longitudinal data.
Since the issue of missing data is one of the main problems in environmental data, many studies have been conducted and published in this regard, but no paper was found to use the PMM imputation method considering spatial and temporal dependencies in air pollution data.
According to the study of Norazian et al (12) who examined the three models of linear, quadratic, and cubic interpolation methods, the linear interpolation method was better than the quadratic and cubic method.
According to a review of the literature on air quality studies, several studies have been conducted on imputation methods. Junninen et al collected air quality data along with meteorological data collected simultaneously at two stations to evaluate several imputation methods, including univariate single imputation (linear, spline, and nearest neighbor), multivariate single imputation (regression-based nest, nearest neighbor), a combination of univariate and multivariate imputation and multiple imputation methods (calculation of average methods used multivariate and hybrid methods) (28).
In another study, Plaia and Bondì (54) used air quality data collected at eight nearby stations to evaluate several methods including univariate single imputation (hourly average, mean before and after), multivariate single imputation (mean of simultaneous values measured at close stations), and multivariate multiple imputation methods (model-based multiple imputation). Troyanskaya et al used the mean, median, EM algorithm, nearest neighbor, sequential nearest neighbor, and singular value decomposition methods to impute the missing values on gene microarray data, so that the EM algorithm, nearest neighbor, and sequential nearest neighbor appeared to perform well (60).

Conclusion
Imputation methods were used to estimate 3 randomly simulated missing data pattern in PM 10 and O 3 . Due to the nature of the air pollution data, which is time series, methods that depended on before and after information showed good performance. By considering spatial and temporal dependent information in imputation, it was found that PMM technique did not perform well at all percentages of missing. Therefore, in order to choose the appropriate imputation method with the best performance, it is very important to pay attention to the type of data examined.

Limitations
The main limitation of this study was the lack of data in summer and winter in all stations.