% Binned Gauss KDE
% -> complexity/operation count
The histogram is a simple non-parametric estimator and was for a long time the most widely used one.
However, its inability to produce a continuous estimate rules it out for many applications where a smooth distribution is assumed.
In contrast, the KDE is often the preferred tool for estimating a density function from discrete data samples because of its ability to produce a continuous estimate and its flexibility.
Given a univariate random sample $X=\{X_1, X_2, \dots, X_n\}$, the kernel estimator $\hat{f}$, which defines the estimate at a point $x$, is given as
\begin{equation}
\label{eq:kde}
\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x - X_i) \text{,}
\end{equation}
where $K_h(t)=K(t/h)/h$ is the normalized kernel \cite[138]{scott2015} and $h\in\R^+$ is an arbitrary smoothing parameter called the bandwidth.
%, and $h=h_n$ is a function of the sample size $n$ with $h\rightarrow0$ as $n\rightarrow\infty$ \cite{rosenblatt1956remarks}.
Any function which satisfies $\int K_h(u) \dop{u} = 1$ is a valid kernel.
In general any kernel can be used; however, the common advice is to choose a symmetric, low-order polynomial kernel.
Thus, several popular kernel functions are used in practice, such as the Uniform, Gaussian, Epanechnikov, or Silverman kernel \cite[152]{scott2015}.

While the kernel estimate inherits all the properties of the kernel, it is usually not crucial if a non-optimal kernel is chosen \cite[151f.]{scott2015}.
As a matter of fact, the quality of the kernel estimate is primarily determined by the smoothing parameter $h$ \cite[145]{scott2015}.
In theory it is possible to calculate an optimal bandwidth $h^*$ with respect to the asymptotic mean integrated squared error (AMISE).
However, in order to do so, the density function to be estimated needs to be known, which is obviously unavailable in practice.
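Concretely, for a second-order kernel the AMISE-optimal bandwidth takes the well-known form
\begin{equation}
h^* = \left( \frac{R(K)}{\mu_2(K)^2 \, R(f'') \, n} \right)^{1/5} \text{,}
\end{equation}
where $R(g)=\int g(t)^2 \dop{t}$ and $\mu_2(K)=\int t^2 K(t) \dop{t}$, which makes the dependence on the curvature $R(f'')$ of the unknown density explicit.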

Any non-optimal bandwidth causes undersmoothing or oversmoothing.
A small $h$ yields an estimator with a large variance and hence leads to undersmoothing.
On the other hand, a large $h$ increases the bias, which leads to oversmoothing \cite[7]{Cybakov2009}.
Clearly, with an adverse choice of the bandwidth, crucial information such as modality might be smoothed out.
All in all, determining a good choice of the bandwidth is not obvious.

This is aggravated by the fact that the structure of the data may vary significantly.
In such a situation it is beneficial to adapt the bandwidth to the neighbourhood of the considered data point.
As a result, a lot of research has gone into developing data-driven bandwidth selection algorithms that obtain an adequate value of $h$ directly from the data.
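A classical example of such a rule is the normal reference rule (often attributed to Silverman), which for the Gaussian kernel suggests
\begin{equation}
h = 1.06 \, \hat{\sigma} \, n^{-1/5} \text{,}
\end{equation}
where $\hat{\sigma}$ is the sample standard deviation; it is derived under the assumption of an approximately Gaussian density and serves only as a rough default otherwise.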
% TODO: for certain reasons the bandwidth is assumed to be given here

As mentioned above, the particular choice of the kernel is only of minor importance, as it affects the overall result in a negligible way.
It is common practice to assume that the data is approximately Gaussian, and therefore the Gaussian kernel is frequently used.
Note that this assumption is different from assuming a concrete distribution family such as a Gaussian or mixture distribution.
In this work we choose the Gaussian kernel in favour of computational efficiency, as our approach is based on the approximation of the Gaussian filter.
The Gaussian kernel is given as
\begin{equation}
\label{eq:gausKern}
K(t) = \frac{1}{\sqrt{2\pi}} \expp{-\frac{t^2}{2}} \text{.}
\end{equation}
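To make the cost of a direct evaluation explicit, the following minimal NumPy sketch (an illustration, not one of the cited implementations) evaluates \eqref{eq:kde} with the Gaussian kernel at $m$ query points, which requires all $n \cdot m$ pairwise kernel evaluations:
\begin{verbatim}
import numpy as np

def gauss_kde(x_query, samples, h):
    # direct univariate Gaussian KDE: one kernel evaluation
    # per (query point, sample) pair, i.e. m*n evaluations
    n = samples.size
    u = (x_query[:, None] - samples[None, :]) / h    # shape (m, n)
    k = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)   # Gaussian kernel K(u)
    return k.sum(axis=1) / (n * h)                   # (1/(n h)) * sum_i K((x - X_i)/h)
\end{verbatim}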

The flexibility of the KDE comes at the expense of computational efficiency, which has led to the development of more efficient computation schemes.
The computation time depends, besides the number of calculated points, on the number of data points $n$.
In general, reducing the size of the sample negatively affects the accuracy of the estimate.
Still, the sample size is a suitable parameter to speed up the computation.
\todo{rewrite}
Silverman \cite{silverman1982algorithm} suggested reducing the number of individual data points by combining adjacent points into data bins.
This approximation is called the binned kernel density estimate (BKDE) and has been extensively analysed \cite{fan1994fast} \cite{wand1994fast} \cite{hall1996accuracy} \cite{holmstrom2000accuracy}.

Usually the data is binned over an equidistant grid.
Due to the equally spaced grid many kernel evaluations coincide and can be reused, which greatly reduces the number of evaluated kernels and naturally leads to a reduced computation time \cite{fan1994fast}.

\todo{introduce bin size variable}
First, the data, i.e.\ the random sample $X$, has to be assigned to a grid.
A binning rule distributes a sample $x$ among the grid points $g_j=j\delta$ for $j\in\Z$, where $\delta>0$ denotes the bin width, and can be represented as a set of functions $\{ w_j(x,\delta), j\in\Z \}$.
For the computation a finite grid on the interval $[a,b]$ containing the data is used, thus the number of grid points is $G=(b-a)/\delta+1$.

Given a binning rule $w_j$, the BKDE $\tilde{f}$ of a density $f$ computed pointwise at the grid point $g_x$ is given as
\begin{equation}
\label{eq:binKde}
\tilde{f}(g_x) = \frac{1}{n} \sum_{j=1}^{G} N_j K_h(g_x - g_j) \text{,}
\end{equation}
where the grid counts $N_j$ are given as
\begin{equation}
\label{eq:gridCnts}
N_j = \sum_{i=1}^{n} w_j(X_i, \delta) \text{.}
\end{equation}
Common choices are the simple binning rule, which assigns each sample entirely to its nearest grid point,
\begin{equation}
w_j(x, \delta) =
\begin{cases}
1 & \text{if } |x - g_j| < \delta/2 \text{,} \\
0 & \text{otherwise,}
\end{cases}
\end{equation}
and the common linear binning rule, which splits each sample linearly between its two neighbouring grid points,
\begin{equation}
w_j(x, \delta) =
\begin{cases}
1 - |x - g_j|/\delta & \text{if } |x - g_j| \le \delta \text{,} \\
0 & \text{otherwise.}
\end{cases}
\end{equation}
An advantage of these often used binning rules is that their effect on the approximation has been extensively investigated and is well understood \cite{wand1994fast} \cite{hall1996accuracy} \cite{holmstrom2000accuracy}.
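As an illustration, the linear binning step could be implemented along the following lines; this is a minimal NumPy sketch (the helper name \texttt{linear\_bin} is ours, and all samples are assumed to lie in $[a,b]$):
\begin{verbatim}
import numpy as np

def linear_bin(samples, a, b, G):
    # grid g_j = a + j*delta with j = 0, ..., G-1 covering [a, b]
    delta = (b - a) / (G - 1)
    pos = (samples - a) / delta                 # fractional grid position of each sample
    j = np.clip(np.floor(pos).astype(int), 0, G - 1)   # left neighbouring grid point
    frac = pos - j                              # distance to the left grid point (in units of delta)
    counts = np.zeros(G)
    np.add.at(counts, j, 1.0 - frac)                      # weight for the left grid point
    np.add.at(counts, np.minimum(j + 1, G - 1), frac)     # weight for the right grid point
    return counts    # grid counts N_j, summing to the sample size n
\end{verbatim}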

\todo{text flow}
While the estimate can be computed efficiently, it is not obvious how large the grid should be.
Because the computation time heavily depends on the grid size, it is desirable to choose a grid as small as possible without losing too much accuracy.
In general, there is no definite answer because the appropriate amount of binning depends on the structure of the unknown density and the sample size.
The roughness of the unknown density directly affects the required grid size.
Coarser grids allow a greater speedup but at the same time might conceal important details of the unknown density \cite{wand1994fast}.

As already stated, the computational savings are achieved by reducing the number of evaluated kernels.
A naive implementation of \eqref{eq:binKde} reduces the number of kernel evaluations to $\landau{G^2}$ \cite{fan1994fast}.
Because of the fixed grid spacing $\delta$ most of the kernel evaluations are the same, as each difference $g_j-g_{j-k}=k\delta$ is independent of $j$ \cite{fan1994fast}.
Therefore, many evaluated kernels can be reused, so that the number of kernel evaluations is reduced to $\landau{G}$ \cite{fan1994fast}.

However, more important for this work is the fact that the BKDE can be seen as a convolution operation.
\todo{sentence}
Once the grid counts $N_j$ in \eqref{eq:gridCnts} and the kernel values are computed, they need to be combined, which is in fact a discrete convolution \cite{wand1994fast}.
This makes it possible to apply a wide range of well-studied techniques from the field of digital signal processing (DSP).
Often an FFT-convolution-based computation scheme is used to efficiently compute the estimate \cite{silverman1982algorithm}\cite[210ff.]{scott2015}.
Using the Gaussian kernel from \eqref{eq:gausKern} in conjunction with the BKDE \eqref{eq:binKde} yields
\begin{equation}
\tilde{f}(g_x)=\frac{1}{nh\sqrt{2\pi}} \sum_{j=1}^{G} N_j \expp{-\frac{(g_x-g_j)^2}{2h^2}} \text{.}
\end{equation}

\todo{rename capital N to capital C and use it in the text below to make this clearer}
As already stated, the above formula is a convolution operation of the data and the kernel.
More precisely, it is a discrete convolution of the finite data grid and the Gaussian function.
In terms of DSP this is analogous to filtering the binned data with a Gaussian filter.
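To illustrate this view, the following minimal sketch (assuming NumPy/SciPy and the hypothetical \texttt{linear\_bin} helper sketched above) computes the BKDE on the whole grid by convolving the grid counts with a sampled, truncated Gaussian kernel via the FFT:
\begin{verbatim}
import numpy as np
from scipy.signal import fftconvolve

def binned_gauss_kde(samples, a, b, G, h):
    n = samples.size
    delta = (b - a) / (G - 1)
    counts = linear_bin(samples, a, b, G)       # grid counts N_j
    # sample the normalized Gaussian kernel K_h on the grid,
    # truncated at +/- 4h (an arbitrary but common cutoff)
    L = int(np.ceil(4.0 * h / delta))
    t = np.arange(-L, L + 1) * delta
    kernel = np.exp(-0.5 * (t / h) ** 2) / (h * np.sqrt(2.0 * np.pi))
    # discrete convolution of grid counts and kernel via FFT
    return fftconvolve(counts, kernel, mode='same') / n
\end{verbatim}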