\section{Kernel Density Estimation}
% KDE by rosenblatt and parzen
% general KDE
% Gauss Kernel
% Formula Gauss KDE
% -> complexity/operation count
% Binned KDE
% Binned Gauss KDE
% -> complexity/operation count

%The histogram is a simple and for a long time the most used non-parametric estimator.
%However, its inability to produce a continuous estimate dismisses it for many applications where a smooth distribution is assumed.
%In contrast,
The kernel density estimator (KDE) is often the preferred tool to estimate a density function from discrete data samples because of its ability to produce a continuous estimate and its flexibility.
%
Let $X=\{X_1, \dots, X_N\}$ be a univariate random sample drawn from a distribution with density function $f$, and let $w_1, \dots, w_N$ be associated weights.
The kernel estimator $\hat{f}$, which estimates $f$ at a point $x$, is given as
\begin{equation}
\label{eq:kde}
\hat{f}(x) = \frac{1}{W} \sum_{i=1}^{N} \frac{w_i}{h} K \left(\frac{x-X_i}{h}\right)
\end{equation}
where $W=\sum_{i=1}^{N}w_i$ and $h\in\R^+$ is an arbitrary smoothing parameter called the bandwidth.
$K$ is a kernel function such that $\int K(u) \dop{u} = 1$.
In general any kernel can be used; however, the usual advice is to choose a symmetric, low-order polynomial kernel.
Several popular kernel functions are used in practice, such as the uniform, Gaussian, Epanechnikov, or Silverman kernel \cite{scott2015}.

While the kernel estimate inherits all the properties of the kernel, it is usually not of crucial importance if a non-optimal kernel is chosen.
In fact, the quality of the kernel estimate is primarily determined by the smoothing parameter $h$ \cite{scott2015}.
%In theory it is possible to calculate an optimal bandwidth $h^*$ with respect to the asymptotic mean integrated squared error.
%However, in order to do so the density function to be estimated needs to be known which is obviously unknown in practice.
%
%Any non-optimal bandwidth causes undersmoothing or oversmoothing.
%An undersmoothing estimator has a large variance and hence a small $h$ leads to undersmoothing.
%On the other hand given a large $h$ the bias increases, which leads to oversmoothing \cite[7]{Cybakov2009}.
%Clearly with an adverse choice of the bandwidth crucial information like modality might get smoothed out.
%All in all it is not obvious to determine a good choice of the bandwidth.
%
%This is aggravated by the fact that the structure of the data may vary significantly.
%Given such a situation it is beneficial to adapt the bandwidth to the neighbourhood of the given data point.
%As a result, a lot of research is put into developing data-driven bandwidth selection algorithms to obtain an adequate value of $h$ directly from the data.
% TODO for various reasons the bandwidth is assumed to be given here
%
%As mentioned above the particular choice of the kernel is only of minor importance as it affects the overall result in a negligible way.
It is common practice to assume that the data is approximately Gaussian, and therefore the Gaussian kernel is frequently used.
%Note that this assumption is different compared to assuming a concrete distribution family like a Gaussian distribution or mixture distribution.
In this work we choose the Gaussian kernel for reasons of computational efficiency, as our approach is based on an approximation of the Gaussian filter.
The Gaussian kernel is given as
\begin{equation}
\label{eq:gausKern}
K_G(u)=\frac{1}{\sqrt{2\pi}} \expp{- \frac{u^2}{2} } \text{.}
\end{equation}

The flexibility of the KDE comes at the expense of computational efficiency, which has led to the development of more efficient computation schemes.
The computation time depends on the number of evaluation points as well as on the number of data points $N$.
In general, reducing the size of the sample negatively affects the accuracy of the estimate.
Still, the sample size is a suitable parameter to speed up the computation.
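For reference, a direct evaluation of \eqref{eq:kde} with the Gaussian kernel \eqref{eq:gausKern} requires one kernel evaluation for every pair of evaluation point and sample, i.e. $\landau{MN}$ operations for $M$ evaluation points.
A minimal NumPy sketch of this direct scheme (the function name and the vectorised layout are our own illustration) is:
\begin{verbatim}
import numpy as np

def gauss_kde(x_eval, samples, weights, h):
    """Direct evaluation of the weighted Gaussian KDE, cf. eq. (kde).

    Cost: one kernel evaluation per (evaluation point, sample) pair,
    i.e. O(M * N) for M evaluation points and N samples.
    """
    x_eval = np.asarray(x_eval, dtype=float)
    samples = np.asarray(samples, dtype=float)
    weights = np.asarray(weights, dtype=float)
    W = weights.sum()
    # u has shape (M, N): scaled distance of every evaluation
    # point to every sample.
    u = (x_eval[:, None] - samples[None, :]) / h
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return (kernel * weights).sum(axis=1) / (W * h)
\end{verbatim}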

The binned KDE (BKDE) approximates the KDE by combining adjacent samples into bins.
Each bin represents the count of the sample set at a given point of an equidistant grid with spacing $\delta$.
A binning rule distributes a sample $x$ among the grid points $g_j=j\delta$, indexed by $j\in\Z$.
% and can be represented as a set of functions $\{ w_j(x,\delta), j\in\Z \}$.
Computation requires a finite grid on the interval $[a,b]$ containing the data, thus the number of grid points is $G=(b-a)/\delta+1$.

Given a binning rule $r_j$, the BKDE $\tilde{f}$ of a density $f$, computed pointwise at the grid point $g_x$, is given as
\begin{equation}
\label{eq:binKde}
\tilde{f}(g_x) = \frac{1}{W} \sum_{j=1}^{G} \frac{C_j}{h} K \left(\frac{g_x-g_j}{h}\right)
\end{equation}
where $G$ is the number of grid points and
\begin{equation}
\label{eq:gridCnts}
C_j=\sum_{i=1}^{N} r_j(X_i,\delta)
\end{equation}
is the count at grid point $g_j$, such that $\sum_{j=1}^{G} C_j = W$ \cite{hall1996accuracy}.

In theory, any function which determines the count at grid points is a valid binning rule.
However, for many applications it is recommended to use the simple binning rule
\begin{align}
\label{eq:simpleBinning}
r_j(X_i,\delta) &=
\begin{cases}
w_i & \text{if } X_i \in ((j-\frac{1}{2})\delta, (j+\frac{1}{2})\delta ] \\
0 & \text{else}
\end{cases}
\end{align}
or the common linear binning rule, which divides the weight of a sample into two fractional weights shared by the nearest grid points
\begin{align}
\label{eq:linearBinning}
r_j(X_i,\delta) &=
\begin{cases}
w_i(1-|\delta^{-1}X_i-j|) & \text{if } |\delta^{-1}X_i-j|\le1 \\
0 & \text{else.}
\end{cases}
\end{align}
An advantage of these rules is that their impact on the approximation error has been extensively investigated and is well understood \cite{hall1996accuracy}.
Both rules can be computed with a fast $\landau{N}$ algorithm, as simple binning is essentially the quotient of an integer division, and the fractional weights of linear binning are given by the remainder of that division.
As linear binning is more precise, it is often preferred over simple binning \cite{fan1994fast}.
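To illustrate the $\landau{N}$ computation, a minimal NumPy sketch of linear binning is given below (the grid origin $a$, the function name, and the edge handling are our own assumptions; rounding the grid position to the nearest integer instead would yield simple binning):
\begin{verbatim}
import numpy as np

def linear_binning(samples, weights, a, delta, G):
    """Distribute weighted samples onto the grid points a + j*delta,
    j = 0, ..., G-1, with the linear binning rule.  Runs in O(N).
    Assumes all samples lie in the grid interval [a, b]."""
    C = np.zeros(G)
    samples = np.asarray(samples, dtype=float)
    weights = np.asarray(weights, dtype=float)
    pos = (samples - a) / delta        # fractional grid position
    j = np.floor(pos).astype(int)      # index of the left grid point
    frac = pos - j                     # remainder = fractional weight
    np.add.at(C, j, weights * (1.0 - frac))                  # left share
    np.add.at(C, np.minimum(j + 1, G - 1), weights * frac)   # right share
    return C
\end{verbatim}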

While linear binning improves the accuracy of the estimate, the choice of the grid size is of greater importance.
The number of grid points $G$ determines the trade-off between the approximation error caused by the binning and the computational speed of the algorithm.
Clearly, a large value of $G$ produces an estimate close to the regular KDE, but requires more evaluations of the kernel compared to a coarser grid.
However, it is unknown what particular $G$ gives the best trade-off for any given sample set.
In general, there is no definite answer because the amount of binning depends on the structure of the unknown density and the sample size \cite{hall1996accuracy}.

A naive implementation of \eqref{eq:binKde} reduces the number of kernel evaluations to $\landau{G^2}$, assuming that $G<N$ \cite{fan1994fast}.
However, due to the fixed grid spacing several kernel evaluations are identical and can be reused.
This reduces the number of kernel evaluations to $\landau{G}$, but the number of additions and multiplications remains $\landau{G^2}$.
Using the FFT to perform the discrete convolution reduces the complexity further to $\landau{G\log{G}}$, which is currently the fastest exact BKDE algorithm.
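Assuming the grid counts $C_j$ are already available, the FFT-based evaluation of \eqref{eq:binKde} with the Gaussian kernel can be sketched as follows (the helper name and the use of SciPy's FFT-based convolution are our own choices):
\begin{verbatim}
import numpy as np
from scipy.signal import fftconvolve

def bkde_fft(C, W, h, delta):
    """Binned Gaussian KDE on the whole grid via FFT convolution,
    O(G log G) instead of O(G^2) for the direct double sum.

    C     -- grid counts C_j (length G), e.g. from linear binning
    W     -- total weight of the sample
    h     -- bandwidth
    delta -- grid spacing
    """
    G = len(C)
    # Gaussian kernel sampled at all grid offsets m = -(G-1), ..., G-1.
    m = np.arange(-(G - 1), G)
    kernel = np.exp(-0.5 * (m * delta / h) ** 2) / np.sqrt(2.0 * np.pi)
    # mode="same" keeps the G values aligned with the grid points.
    return fftconvolve(np.asarray(C, dtype=float), kernel,
                       mode="same") / (W * h)
\end{verbatim}
Combined with the binning sketch above, this yields the complete estimate in $\landau{N + G\log{G}}$ operations.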

The \mbox{FFT-convolution} approach is usually highlighted as the striking computational benefit of the BKDE.
For this work, however, the key is to recognize the discrete convolution structure of \eqref{eq:binKde}, as this allows one to interpret the computation of a density estimate as a signal filtering problem.
This makes it possible to apply a wide range of well-studied techniques from the broad field of digital signal processing (DSP).
Using the Gaussian kernel from \eqref{eq:gausKern} in conjunction with \eqref{eq:binKde} results in the following equation
\begin{equation}
\label{eq:bkdeGaus}
\tilde{f}(g_x)=\frac{1}{Wh\sqrt{2\pi}} \sum_{j=1}^{G} C_j \expp{-\frac{(g_x-g_j)^2}{2h^2}} \text{.}
\end{equation}

The above formula is a convolution operation of the data and the Gaussian kernel.
More precisely, it is a discrete convolution of the finite data grid and the Gaussian function.
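To make the convolution structure explicit, write the evaluation point as $g_x = x\delta$ with integer grid index $x$ (our notation); then
\begin{equation*}
\tilde{f}(g_x) = \frac{1}{Wh} \sum_{j=1}^{G} C_j \, K_G\!\left(\frac{(x-j)\delta}{h}\right) = \frac{1}{Wh} \sum_{j=1}^{G} C_j \, k_{x-j} = \frac{1}{Wh} \, (C \ast k)_x \text{,}
\end{equation*}
where $k_m = K_G(m\delta/h)$ denotes the Gaussian kernel sampled at the grid offsets.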
In terms of DSP this is analogous to filtering the binned data with a Gaussian filter.
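For illustration, this filtering view can be sketched with an off-the-shelf discrete Gaussian filter (the use of SciPy's \texttt{gaussian\_filter1d} is our own choice; the rescaling by $1/(W\delta)$ accounts for the filter's normalised kernel and matches \eqref{eq:bkdeGaus} up to the discretisation of the kernel):
\begin{verbatim}
import numpy as np
from scipy.ndimage import gaussian_filter1d

def bkde_gaussian_filter(C, W, h, delta):
    """Binned Gaussian KDE obtained by filtering the grid counts with a
    discrete Gaussian filter (DSP view of the BKDE).

    gaussian_filter1d uses a kernel normalised to sum to one, hence the
    result is rescaled by 1 / (W * delta)."""
    smoothed = gaussian_filter1d(np.asarray(C, dtype=float),
                                 sigma=h / delta,  # bandwidth in grid units
                                 mode="constant", cval=0.0)
    return smoothed / (W * delta)
\end{verbatim}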
This observation allows us to speed up the computation of the density estimate by using a fast approximation scheme based on iterated box filters.
|