2018-02-06 17:14:08 +01:00
parent 00754a00da
commit a7fcf64cfe
2 changed files with 133 additions and 11 deletions

@@ -82,6 +82,7 @@
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{subcaption}
\usepackage{bm} % bold math symbols with \bm
% replacement for the SI package
@@ -105,7 +106,9 @@
\newcommand{\dop} [1]{\ensuremath{ \mathop{\mathrm{d}#1} }}
\newcommand{\R} {\ensuremath{ \mathbf{R} }}
\newcommand{\Z} {\ensuremath{ \mathbf{Z} }}
\newcommand{\expp} [1]{\ensuremath{ \exp \left( #1 \right) }}
\newcommand{\landau}[1]{\ensuremath{ \mathcal{O}\left( #1 \right) }}
\newcommand{\qq} [1]{``#1''}

@@ -10,20 +10,139 @@
The histogram is a simple estimator and was for a long time the most widely used non-parametric estimator.
However, its inability to produce a continuous estimate rules it out for many applications where a smooth distribution is assumed.
In contrast, the KDE is often the preferred tool because of its ability to produce a continuous estimate and because of its flexibility.
Given a multivariate random variable $X=(x_1,\dots ,x_d)$ in $d$ dimensions, a sample of $n$ independently observed realizations is collected in the $n\times d$ matrix $\bm{X}$ defined as \cite[162]{scott2015}
\begin{equation}
\bm{X}=
\begin{pmatrix}
X_1 \\
\vdots \\
X_n \\
\end{pmatrix}
=
\begin{pmatrix}
x_{11} & \dots & x_{1d} \\
\vdots & \ddots & \vdots \\
x_{n1} & \dots & x_{nd}
\end{pmatrix} \text{.}
\end{equation}
The multivariate kernel estimator $\hat{f}$, which defines the estimate pointwise at $\bm{x}=(x_1, \dots, x_d)^T$, is given as \cite[162]{scott2015}
\begin{equation}
\hat{f}(\bm{x}) = \frac{1}{n} \sum_{i=1}^{n} \bm{K_h}(\bm{x} - \bm{X_i})
\end{equation}
where $\bm{h}$ is a smoothing parameter called the bandwidth and $\bm{K_h}$ is a multivariate kernel function.
The bandwidth is given as a vector $\bm{h}=(h_1, \dots, h_d)$, i.e., a different smoothing parameter is used for every dimension.
Valid kernel functions must satisfy $\int \bm{K_h}(\bm{u}) \dop{\bm{u}} = 1$.
A common approach to defining a multivariate kernel function is to apply a univariate kernel $K$ in every dimension and to combine the individual results by a product.
In this way univariate kernel estimators extend directly to multivariate distributions, and the resulting kernel is called a product kernel \cite[162]{scott2015}:
\begin{equation}
\bm{K_h}(\bm{x}-\bm{X_i}) = \frac{1}{h_1 \dots h_d} \prod_{j=1}^{d} K\left( \frac{x_j-x_{ij}}{h_j} \right)
\end{equation}
In practice the multivariate Gaussian kernel is commonly used,
\begin{equation}
K(\bm{u})=\frac{1}{(2\pi)^{d/2}} \expp{-\frac{1}{2} \bm{u}^T \bm{u}} \text{,}
\end{equation}
which factorizes into univariate Gaussian kernels and therefore fits the product form above.
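To make the estimator concrete, the following is a minimal sketch of the product-kernel estimate at a single point, written with NumPy; the function name, the array layout ($n \times d$ sample matrix, bandwidth vector of length $d$) and the choice of the Gaussian kernel are assumptions made purely for this illustration.
\begin{verbatim}
import numpy as np

def product_kernel_kde(X, x, h):
    # X: (n, d) sample matrix, x: (d,) evaluation point, h: (d,) bandwidth vector
    u = (x - X) / h                                 # scaled distances (x_j - x_ij) / h_j
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)    # univariate Gaussian kernel per dimension
    # product over the dimensions, average over the samples, scaled by 1/(h_1 ... h_d)
    return np.mean(np.prod(k, axis=1)) / np.prod(h)
\end{verbatim}
Evaluating the estimate in this direct way costs one pass over all $n$ samples per evaluation point, which is what the binned approximation discussed later in this section avoids.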
In summary, kernel density estimation is a powerful tool to obtain an estimate of an unknown distribution.
It can produce a smooth estimate from a sample without relying on assumptions about the underlying model.
However, a good choice of the bandwidth is non-obvious and has a crucial impact on the quality of the estimate.
In theory it is possible to calculate an optimal bandwidth $h^*$ with respect to the asymptotic mean integrated squared error.
However, doing so requires knowledge of the very density that is to be estimated, which is of course unknown.
Any non-optimal bandwidth causes undersmoothing or oversmoothing.
A small $h$ yields an estimator with large variance and hence leads to undersmoothing.
A large $h$, on the other hand, increases the bias, which leads to oversmoothing \cite[7]{Cybakov2009}.
With an adverse choice of the bandwidth, crucial information such as modality might get smoothed out.
Sophisticated methods are therefore necessary to obtain a good bandwidth from the data.
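As a simple illustration of a data-driven choice, and not a substitute for such methods, the normal reference rule (often called Scott's rule of thumb) assumes an approximately Gaussian density and sets the bandwidth in each dimension from the sample standard deviation $\hat{\sigma}_j$ as
\begin{equation}
h_j = \hat{\sigma}_j \, n^{-1/(d+4)} \text{,} \qquad j=1,\dots,d \text{.}
\end{equation}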
The flexibility of the KDE comes at the expense of computational efficiency, which has led to the development of more efficient computation schemes.
The computation time depends, besides the number of evaluation points, on the number of data points $n$.
In general, reducing the size of the sample negatively affects the accuracy of the estimate.
Still, the sample size is a suitable parameter for speeding up the computation.
To reduce the number of individual data points, adjacent points are combined into a bin.
Usually the data is binned over an equidistant grid.
A kernel estimator which computes the estimate over such a grid is called a binned kernel density estimator (BKDE).
Due to the equally spaced grid, many kernel evaluations are identical and can be reused, which greatly reduces the number of evaluated kernels \cite{fan1994fast}.
This naturally leads to a reduced computation time.
The binned estimator, however, introduces two additional parameters which affect the accuracy of the estimation: the number of grid points and the binning rule.
The choice of the number of grid points is crucial, because it determines the trade-off between the approximation error introduced by the binning and the computational speed of the algorithm \cite{wand1994fast}.
Also, the choice of the binning rule affects the accuracy of the approximation.
There are several possible binning rules \cite{hall1996accuracy}, but the most commonly used are the simple binning rule \eqref{eq:simpleBinning} and the linear binning rule \eqref{eq:linearBinning}.
An advantage of these frequently used binning rules is that their effect on the approximation has been extensively investigated and is well understood \cite{wand1994fast,hall1996accuracy,holmstrom2000accuracy}.
First, the data, i.e. a random sample $X$, has to be assigned to the grid.
A binning rule distributes an observation $x$ among the grid points $g_j=j\delta$ for $j\in\Z$ and can be represented as a set of functions $\{ w_j(x,\delta), j\in\Z \}$.
For the computation a finite grid on an interval $[a,b]$ containing the data is used; the number of grid points is thus $G=(b-a)/\delta+1$.
Equipped with a method to bin the data, the binned kernel estimator $\tilde{f}$ of a density $f$, based on $\bm{X}$, can be computed pointwise at a grid point $g_x$ as
\begin{equation}
\label{eq:binKde}
\tilde{f}(g_x) = \frac{1}{n} \sum_{j=1}^{G} N_j K_h(g_x-g_j)
\end{equation}
where $G$ is the number of grid points and
\begin{equation}
N_j=\sum_{i=1}^{n} w_j(x_i,\delta)
\end{equation}
is the count at grid point $g_j$ \cite{hall1996accuracy}.
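To illustrate how such grid counts can be computed, the following sketch implements linear binning, i.e. every observation is split between its two neighbouring grid points in proportion to its distance from them; the function name and the assumption that all data lie in $[a,b]$ are made up for this example and do not refer to a particular implementation.
\begin{verbatim}
import numpy as np

def linear_bin_counts(x, a, b, G):
    # Grid counts N_j on G equidistant points over [a, b] using linear binning.
    delta = (b - a) / (G - 1)               # grid spacing
    counts = np.zeros(G)
    pos = (x - a) / delta                   # fractional grid position of each observation
    left = np.floor(pos).astype(int)        # index of the left neighbouring grid point
    frac = pos - left                       # distance to the left grid point in grid units
    np.add.at(counts, left, 1.0 - frac)     # weight assigned to the left grid point
    np.add.at(counts, np.minimum(left + 1, G - 1), frac)  # weight for the right grid point
    return counts
\end{verbatim}
Since the two weights of every observation sum to one, the counts $N_j$ sum to $n$ and \eqref{eq:binKde} remains a proper average.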
Once the counts $N_j$ and the kernel values are computed, they need to be combined, which is in fact a discrete convolution \cite{wand1994fast}.
This makes it possible to apply a wide range of well-studied techniques from the field of digital signal processing (DSP).
Often an FFT-based computation scheme is used to compute the estimate efficiently \cite{silverman1982algorithm}\cite[210ff.]{scott2015}.
As already stated, the computational savings are achieved by reducing the number of evaluated kernels.
Even a naive implementation of \eqref{eq:binKde} reduces the number of evaluations to $\landau{G^2}$ \cite{fan1994fast}.
Because of the fixed grid spacing $\delta$, most of the kernel evaluations are identical, since each difference $g_j-g_{j-k}=k\delta$ is independent of $j$ \cite{fan1994fast}.
Therefore, many evaluated kernels can be reused, so that the number of kernel evaluations is reduced to $\landau{G}$ \cite{fan1994fast}.
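The reuse of kernel values and the convolution structure can be made explicit with a short sketch: the kernel is evaluated once at the offsets $k\delta$ and then convolved with the grid counts. The Gaussian kernel, its truncation after a few bandwidths, the plain (non-FFT) convolution and the assumption that the grid is larger than the truncated kernel support are simplifications made for this illustration only.
\begin{verbatim}
import numpy as np

def binned_kde_on_grid(counts, delta, h, radius=4.0):
    # Evaluate the binned estimator at every grid point by convolving the
    # grid counts N_j with the kernel values K_h(k * delta).
    n = counts.sum()                               # sample size
    L = int(np.ceil(radius * h / delta))           # truncate the kernel at |k*delta| <= radius*h
    offsets = np.arange(-L, L + 1) * delta         # grid offsets k * delta
    kernel = np.exp(-0.5 * (offsets / h)**2) / (np.sqrt(2 * np.pi) * h)
    # one discrete convolution yields the estimate at all G grid points
    return np.convolve(counts, kernel, mode="same") / n
\end{verbatim}
For large grids the same convolution can be carried out with an FFT-based scheme, as in the references above.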
While the estimate can be computed efficiently, the question remains how large the grid should be chosen.
Because the computation time heavily depends on the grid size, it is desirable to choose a grid as small as possible without losing too much accuracy.
In general there is no definite answer, because the appropriate amount of binning depends on the structure of the unknown density and on the sample size.
The roughness of the unknown density directly affects the required grid size.
Coarser grids allow a greater speedup but at the same time might conceal important details of the unknown density \cite{wand1994fast}.
%Given $n$ independently observed realizations of the observation set $X=(x_1,\dots,x_n)$, the kernel density estimate $\hat{f}_n$ of the density function $f$ of the underlying distribution is given with
%\begin{equation}
%\label{eq:kde}
%\hat{f}_n = \frac{1}{nh} \sum_{i=1}^{n} K \left( \frac{x-X_i}{h} \right) \text{,} %= \frac{1}{n} \sum_{i=1}^{n} K_h(x-x_i)
%\end{equation}
%where $K$ is the kernel function and $h\in\R^+$ is an arbitrary smoothing parameter called bandwidth.
%While any density function can be used as the kernel function $K$ (such that $\int K(u) \dop{u} = 1$), a variety of popular choices of the kernel function $K$ exits.
%In practice the Gaussian kernel is commonly used:
%\begin{equation}
%\label{eq:gausKern}
%K(u)=\frac{1}{\sqrt{2\pi}} \expp{- \frac{u^2}{2} }
%\end{equation}
%
%\begin{equation}
%\hat{f}_n = \frac{1}{nh\sqrt{2\pi}} \sum_{i=1}^{n} \expp{-\frac{(x-X_i)^2}{2h^2}}
%\end{equation}