\section{Kernel Density Estimation}
% KDE by rosenblatt and parzen
% general KDE
% Gauss Kernel
% Formula Gauss KDE
% -> complexity/operation count
% Binned KDE
% Binned Gauss KDE
% -> complexity/operation count

%The histogram is a simple and for a long time the most used non-parametric estimator.
%However, its inability to produce a continuous estimate dismisses it for many applications where a smooth distribution is assumed.
%In contrast,
The kernel density estimator (KDE) is often the preferred tool to estimate a density function from discrete data samples because of its ability to produce a continuous estimate and its flexibility.
%
Let $X=\{X_1, \dots, X_N\}$ be a univariate random sample drawn from a distribution with density function $f$, and let $w_1, \dots, w_N$ be associated weights.
The kernel estimator $\hat{f}$, which estimates $f$ at a point $x$, is given as
\begin{equation}
\label{eq:kde}
\hat{f}(x) = \frac{1}{W} \sum_{i=1}^{N} \frac{w_i}{h} K \left(\frac{x-X_i}{h}\right)
\end{equation}
where $W=\sum_{i=1}^{N}w_i$ and $h\in\R^+$ is an arbitrary smoothing parameter called the bandwidth.
$K$ is a kernel function such that $\int K(u) \dop{u} = 1$.
In general any kernel can be used; however, the usual advice is to choose a symmetric, low-order polynomial kernel.
Several popular kernel functions are used in practice, such as the uniform, Gaussian, Epanechnikov, or Silverman kernel \cite{scott2015}.

While the kernel estimate inherits all the properties of the kernel, it is usually not of crucial importance if a non-optimal kernel is chosen.
In fact, the quality of the kernel estimate is primarily determined by the smoothing parameter $h$ \cite{scott2015}.
%In theory it is possible to calculate an optimal bandwidth $h^*$ with respect to the asymptotic mean integrated squared error.
%However, in order to do so the density function to be estimated needs to be known which is obviously unknown in practice.
%
%Any non-optimal bandwidth causes undersmoothing or oversmoothing.
%An undersmoothing estimator has a large variance and hence a small $h$ leads to undersmoothing.
%On the other hand given a large $h$ the bias increases, which leads to oversmoothing \cite[7]{Cybakov2009}.
%Clearly with an adverse choice of the bandwidth crucial information like modality might get smoothed out.
%All in all it is not obvious to determine a good choice of the bandwidth.
%
%This is aggravated by the fact that the structure of the data may vary significantly.
%Given such a situation it is beneficial to adapt the bandwidth to the neighbourhood of the given data point.
%As a result, a lot of research is put into developing data-driven bandwidth selection algorithms to obtain an adequate value of $h$ directly from the data.
% TODO for various reasons the bandwidth is assumed to be given here
%
%As mentioned above the particular choice of the kernel is only of minor importance as it affects the overall result in a negligible way.
It is common practice to assume that the data is approximately Gaussian, and therefore the Gaussian kernel is frequently used.
%Note that this assumption is different compared to assuming a concrete distribution family like a Gaussian distribution or mixture distribution.
In this work we choose the Gaussian kernel for reasons of computational efficiency, as our approach is based on an approximation of the Gaussian filter.
The Gaussian kernel is given as
\begin{equation}
\label{eq:gausKern}
K_G(u)=\frac{1}{\sqrt{2\pi}} \expp{- \frac{u^2}{2} } \text{.}
\end{equation}

The flexibility of the KDE comes at the expense of computational efficiency, which has led to the development of more efficient computation schemes.
The computation time depends on the number of evaluation points as well as on the number of data points $N$.
In general, reducing the size of the sample negatively affects the accuracy of the estimate.
Still, the sample size is a suitable parameter to speed up the computation.
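For reference, a direct evaluation of \eqref{eq:kde} with the Gaussian kernel \eqref{eq:gausKern} requires one kernel evaluation for every pair of evaluation point and sample, i.e. $\landau{MN}$ operations for $M$ evaluation points.
A minimal NumPy sketch of this direct scheme (the function name and the vectorised layout are our own illustration) is:
\begin{verbatim}
import numpy as np

def gauss_kde(x_eval, samples, weights, h):
    """Direct evaluation of the weighted Gaussian KDE, cf. eq. (kde).

    Cost: one kernel evaluation per (evaluation point, sample) pair,
    i.e. O(M * N) for M evaluation points and N samples.
    """
    x_eval = np.asarray(x_eval, dtype=float)
    samples = np.asarray(samples, dtype=float)
    weights = np.asarray(weights, dtype=float)
    W = weights.sum()
    # u has shape (M, N): scaled distance of every evaluation
    # point to every sample.
    u = (x_eval[:, None] - samples[None, :]) / h
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return (kernel * weights).sum(axis=1) / (W * h)
\end{verbatim}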

The binned KDE (BKDE) approximates the KDE by combining adjacent samples into bins.
Each bin represents the count of the sample set at a given point of an equidistant grid with spacing $\delta$.
A binning rule distributes a sample $x$ among the grid points $g_j=j\delta$, indexed by $j\in\Z$.
% and can be represented as a set of functions $\{ w_j(x,\delta), j\in\Z \}$.
Computation requires a finite grid on the interval $[a,b]$ containing the data, thus the number of grid points is $G=(b-a)/\delta+1$.

Given a binning rule $r_j$, the BKDE $\tilde{f}$ of a density $f$, computed pointwise at the grid point $g_x$, is given as
\begin{equation}
\label{eq:binKde}
\tilde{f}(g_x) = \frac{1}{W} \sum_{j=1}^{G} \frac{C_j}{h} K \left(\frac{g_x-g_j}{h}\right)
\end{equation}
where $G$ is the number of grid points and
\begin{equation}
\label{eq:gridCnts}
C_j=\sum_{i=1}^{N} r_j(X_i,\delta)
\end{equation}
is the count at grid point $g_j$, such that $\sum_{j=1}^{G} C_j = W$ \cite{hall1996accuracy}.

In theory, any function which determines the count at grid points is a valid binning rule.
However, for many applications it is recommended to use the simple binning rule
\begin{align}
\label{eq:simpleBinning}
r_j(X_i,\delta) &=
\begin{cases}
w_i & \text{if } X_i \in ((j-\frac{1}{2})\delta, (j+\frac{1}{2})\delta ] \\
0 & \text{else}
\end{cases}
\end{align}
or the common linear binning rule, which divides the weight of a sample into two fractional weights shared by the nearest grid points
\begin{align}
\label{eq:linearBinning}
r_j(X_i,\delta) &=
\begin{cases}
w_i(1-|\delta^{-1}X_i-j|) & \text{if } |\delta^{-1}X_i-j|\le1 \\
0 & \text{else.}
\end{cases}
\end{align}
An advantage of these rules is that their impact on the approximation error has been extensively investigated and is well understood \cite{hall1996accuracy}.
Both rules can be computed with a fast $\landau{N}$ algorithm, as simple binning is essentially the quotient of an integer division, and the fractional weights of linear binning are given by the remainder of that division.
As linear binning is more precise, it is often preferred over simple binning \cite{fan1994fast}.
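To illustrate the $\landau{N}$ computation, a minimal NumPy sketch of linear binning is given below (the grid origin $a$, the function name, and the edge handling are our own assumptions; rounding the grid position to the nearest integer instead would yield simple binning):
\begin{verbatim}
import numpy as np

def linear_binning(samples, weights, a, delta, G):
    """Distribute weighted samples onto the grid points a + j*delta,
    j = 0, ..., G-1, with the linear binning rule.  Runs in O(N).
    Assumes all samples lie in the grid interval [a, b]."""
    C = np.zeros(G)
    samples = np.asarray(samples, dtype=float)
    weights = np.asarray(weights, dtype=float)
    pos = (samples - a) / delta        # fractional grid position
    j = np.floor(pos).astype(int)      # index of the left grid point
    frac = pos - j                     # remainder = fractional weight
    np.add.at(C, j, weights * (1.0 - frac))                  # left share
    np.add.at(C, np.minimum(j + 1, G - 1), weights * frac)   # right share
    return C
\end{verbatim}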

While linear binning improves the accuracy of the estimate, the choice of the grid size is of greater importance.
The number of grid points $G$ determines the trade-off between the approximation error caused by the binning and the computational speed of the algorithm.
Clearly, a large value of $G$ produces an estimate close to the regular KDE, but requires more evaluations of the kernel compared to a coarser grid.
However, it is unknown what particular $G$ gives the best trade-off for any given sample set.
In general, there is no definite answer because the amount of binning depends on the structure of the unknown density and the sample size \cite{hall1996accuracy}.

A naive implementation of \eqref{eq:binKde} reduces the number of kernel evaluations to $\landau{G^2}$, assuming that $G<N$ \cite{fan1994fast}.
However, due to the fixed grid spacing several kernel evaluations are identical and can be reused.
This reduces the number of kernel evaluations to $\landau{G}$, but the number of additions and multiplications remains $\landau{G^2}$.
Using the FFT to perform the discrete convolution reduces the complexity further to $\landau{G\log{G}}$, which is currently the fastest exact BKDE algorithm.
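Assuming the grid counts $C_j$ are already available, the FFT-based evaluation of \eqref{eq:binKde} with the Gaussian kernel can be sketched as follows (the helper name and the use of SciPy's FFT-based convolution are our own choices):
\begin{verbatim}
import numpy as np
from scipy.signal import fftconvolve

def bkde_fft(C, W, h, delta):
    """Binned Gaussian KDE on the whole grid via FFT convolution,
    O(G log G) instead of O(G^2) for the direct double sum.

    C     -- grid counts C_j (length G), e.g. from linear binning
    W     -- total weight of the sample
    h     -- bandwidth
    delta -- grid spacing
    """
    G = len(C)
    # Gaussian kernel sampled at all grid offsets m = -(G-1), ..., G-1.
    m = np.arange(-(G - 1), G)
    kernel = np.exp(-0.5 * (m * delta / h) ** 2) / np.sqrt(2.0 * np.pi)
    # mode="same" keeps the G values aligned with the grid points.
    return fftconvolve(np.asarray(C, dtype=float), kernel,
                       mode="same") / (W * h)
\end{verbatim}
Combined with the binning sketch above, this yields the complete estimate in $\landau{N + G\log{G}}$ operations.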

The \mbox{FFT-convolution} approach is usually highlighted as the striking computational benefit of the BKDE.
For this work, however, the key is to recognize the discrete convolution structure of \eqref{eq:binKde}, as this allows one to interpret the computation of a density estimate as a signal filtering problem.
This makes it possible to apply a wide range of well-studied techniques from the broad field of digital signal processing (DSP).
Using the Gaussian kernel from \eqref{eq:gausKern} in conjunction with \eqref{eq:binKde} results in the following equation
\begin{equation}
\label{eq:bkdeGaus}
\tilde{f}(g_x)=\frac{1}{Wh\sqrt{2\pi}} \sum_{j=1}^{G} C_j \expp{-\frac{(g_x-g_j)^2}{2h^2}} \text{.}
\end{equation}

The above formula is a convolution operation of the data and the Gaussian kernel.
More precisely, it is a discrete convolution of the finite data grid and the Gaussian function.
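To make the convolution structure explicit, write the evaluation point as $g_x = x\delta$ with integer grid index $x$ (our notation); then
\begin{equation*}
\tilde{f}(g_x) = \frac{1}{Wh} \sum_{j=1}^{G} C_j \, K_G\!\left(\frac{(x-j)\delta}{h}\right) = \frac{1}{Wh} \sum_{j=1}^{G} C_j \, k_{x-j} = \frac{1}{Wh} \, (C \ast k)_x \text{,}
\end{equation*}
where $k_m = K_G(m\delta/h)$ denotes the Gaussian kernel sampled at the grid offsets.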
In terms of DSP this is analogous to filtering the binned data with a Gaussian filter.
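For illustration, this filtering view can be sketched with an off-the-shelf discrete Gaussian filter (the use of SciPy's \texttt{gaussian\_filter1d} is our own choice; the rescaling by $1/(W\delta)$ accounts for the filter's normalised kernel and matches \eqref{eq:bkdeGaus} up to the discretisation of the kernel):
\begin{verbatim}
import numpy as np
from scipy.ndimage import gaussian_filter1d

def bkde_gaussian_filter(C, W, h, delta):
    """Binned Gaussian KDE obtained by filtering the grid counts with a
    discrete Gaussian filter (DSP view of the BKDE).

    gaussian_filter1d uses a kernel normalised to sum to one, hence the
    result is rescaled by 1 / (W * delta)."""
    smoothed = gaussian_filter1d(np.asarray(C, dtype=float),
                                 sigma=h / delta,  # bandwidth in grid units
                                 mode="constant", cval=0.0)
    return smoothed / (W * delta)
\end{verbatim}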
This observation allows us to speed up the computation of the density estimate by using a fast approximation scheme based on iterated box filters.
|