Merge branch 'master' of https://git.frank-ebner.de/FHWS/Fusion2018
@@ -82,6 +82,7 @@
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{subcaption}
\usepackage{bm} % bold math symbols with \bm


% replacement for the SI package
@@ -105,7 +106,9 @@

\newcommand{\dop} [1]{\ensuremath{ \mathop{\mathrm{d}#1} }}
\newcommand{\R} {\ensuremath{ \mathbf{R} }}
\newcommand{\Z} {\ensuremath{ \mathbf{Z} }}
\newcommand{\expp} [1]{\ensuremath{ \exp \left( #1 \right) }}
\newcommand{\landau}[1]{\ensuremath{ \mathcal{O}\left( #1 \right) }}

\newcommand{\qq} [1]{``#1''}

@@ -10,20 +10,139 @@

The histogram is a simple estimator and was for a long time the most widely used non-parametric estimator.
However, its inability to produce a continuous estimate rules it out for many applications where a smooth distribution is assumed.
In contrast, the KDE is often the preferred tool because of its ability to produce a continuous estimate and its flexibility.

Given a multivariate random variable $X=(x_1,\dots ,x_d)$ in $d$ dimensions, the sample $\bm{X}$ is an $n\times d$ matrix defined as \cite[162]{scott2015}
\begin{equation}
\bm{X}=
\begin{pmatrix}
X_1 \\
\vdots \\
X_n \\
\end{pmatrix}
=
\begin{pmatrix}
x_{11} & \dots & x_{1d} \\
\vdots & \ddots & \vdots \\
x_{n1} & \dots & x_{nd}
\end{pmatrix} \text{.}
\end{equation}

The multivariate kernel estimator $\hat{f}$, which defines the estimate pointwise at $\bm{x}=(x_1, \dots, x_d)^T$, is given as \cite[162]{scott2015}
\begin{equation}
\hat{f}(\bm{x}) = \frac{1}{n} \sum_{i=1}^{n} \bm{K_h}(\bm{x} - \bm{X_i}) \text{,}
\end{equation}
where $\bm{h}$ is an arbitrary smoothing parameter called the bandwidth and $\bm{K_h}$ is a multivariate kernel function.
The bandwidth is given as a vector $\bm{h}=(h_1, \dots, h_d)$, i.e., for every dimension a different smoothing parameter is used.

Valid kernel functions must satisfy $\int \bm{K_h}(\bm{u}) \dop{\bm{u}} = 1$.
A common approach to defining a multivariate kernel function is to apply a univariate kernel in each dimension.

Univariate kernel estimators can easily be extended to multivariate distributions.
In fact, the univariate kernel is simply applied in every dimension and the individual results are combined by a product.
Therefore, a multivariate kernel estimator obtained by this construction is called a product kernel \cite[162]{scott2015}:
\begin{equation}
\bm{K_h}(\bm{x}-\bm{X_i}) = \frac{1}{h_1 \cdots h_d} \prod_{j=1}^{d} K\left( \frac{x_j-x_{ij}}{h_j} \right) \text{.}
\end{equation}

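As a short sanity check, which follows directly from the definitions above and adds no new assumptions, the product construction inherits its validity from the univariate kernel: substituting $v_j = u_j/h_j$ gives
\begin{equation}
\int \bm{K_h}(\bm{u}) \dop{\bm{u}}
= \prod_{j=1}^{d} \int \frac{1}{h_j} K\left( \frac{u_j}{h_j} \right) \dop{u_j}
= \prod_{j=1}^{d} \int K(v_j) \dop{v_j}
= 1 \text{,}
\end{equation}
provided the univariate kernel $K$ itself integrates to one.
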
A common choice is the multivariate Gaussian kernel
\begin{equation}
K(\bm{u})=\frac{1}{(2\pi)^{d/2}} \expp{-\frac{1}{2} \bm{u}^T \bm{u}} \text{.}
\end{equation}

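For illustration, and as a direct combination of the formulas above rather than an additional definition, inserting a univariate Gaussian kernel into the product-kernel construction, which reproduces exactly the multivariate Gaussian kernel just given, yields the explicit estimator
\begin{equation}
\hat{f}(\bm{x}) = \frac{1}{n} \sum_{i=1}^{n} \prod_{j=1}^{d} \frac{1}{h_j \sqrt{2\pi}} \expp{ -\frac{(x_j - x_{ij})^2}{2 h_j^2} } \text{,}
\end{equation}
which makes explicit that only one-dimensional kernel evaluations are required.
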
In summary, kernel density estimation is a powerful tool for obtaining an estimate of an unknown distribution.
It can obtain a smooth estimate from a sample without relying on assumptions about the underlying model.
Selecting a good bandwidth, however, is non-obvious and has a crucial impact on the quality of the estimate.
In theory it is possible to calculate an optimal bandwidth $h^*$ with respect to the asymptotic mean integrated squared error (AMISE).
However, doing so requires knowledge of the very density that is to be estimated, which is of course unknown.

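For context, the standard univariate AMISE result from the KDE literature is quoted here purely as an illustration; the notation $R(g)=\int g(u)^2 \dop{u}$ and $\mu_2(K)=\int u^2 K(u) \dop{u}$ is introduced only for this example.
The asymptotically optimal bandwidth is
\begin{equation}
h^* = \left( \frac{R(K)}{\mu_2(K)^2 \, R(f'') \, n} \right)^{1/5} \text{,}
\end{equation}
which depends on the unknown density through $R(f'')$; replacing $f$ by a Gaussian reference density with estimated standard deviation $\hat{\sigma}$ leads, for a Gaussian kernel, to the well-known rule of thumb $h \approx 1.06 \, \hat{\sigma} \, n^{-1/5}$.
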
Any non-optimal bandwidth causes undersmoothing or oversmoothing.
A small $h$ yields an estimator with large variance and hence leads to undersmoothing.
On the other hand, a large $h$ increases the bias, which leads to oversmoothing \cite[7]{Cybakov2009}.
Clearly, with an adverse choice of the bandwidth, crucial information such as the modality might get smoothed out.
Sophisticated methods are therefore necessary to obtain a good bandwidth from the data.

The flexibility of the KDE comes at the expense of computational efficiency, which has led to the development of more efficient computation schemes.
The computation time depends, besides the number of points at which the estimate is evaluated, on the number of data points $n$.
In general, reducing the size of the sample negatively affects the accuracy of the estimate.
Still, the sample size is a suitable parameter to speed up the computation.
To reduce the number of individual data points, adjacent points are combined into bins.
Usually the data is binned over an equidistant grid.
A kernel estimator which computes the estimate over such a grid is called a binned kernel density estimator (BKDE).

Due to the equally-spaced grid many kernel evaluations are almost the same, so their computation can be saved, which greatly reduces the number of evaluated kernels \cite{fan1994fast}.
This naturally leads to a reduced computation time.

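For a rough sense of scale (hypothetical figures, purely illustrative): evaluating the plain KDE at $m$ output points requires $\landau{nm}$ kernel evaluations, e.g. about $5\cdot 10^{8}$ for $n=10^{6}$ samples and $m=512$ points, whereas the binned estimator works on the grid counts and, as discussed below, needs at most $\landau{G^2}$ evaluations, i.e. about $2.6\cdot 10^{5}$ for $G=512$.
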
So far, the binned estimator has introduced two additional parameters which affect the accuracy of the estimation.
The choice of the number of grid points is crucial, because it determines the trade-off between the approximation error introduced by the binning and the computational speed of the algorithm \cite{wand1994fast}.
Also, the choice of the binning rule affects the accuracy of the approximation.
There are several possible binning rules \cite{hall1996accuracy}, but the most commonly used are the simple binning rule \eqref{eq:simpleBinning} and the linear binning rule \eqref{eq:linearBinning}.
An advantage of these frequently used binning rules is that their effect on the approximation has been extensively investigated and is well understood \cite{wand1994fast} \cite{hall1996accuracy} \cite{holmstrom2000accuracy}.

First, the data, i.e., a random sample $X$, has to be assigned to a grid.
A binning rule distributes a sample $x$ among the grid points $g_j=j\delta$ for $j\in\Z$ and can be represented as a set of functions $\{ w_j(x,\delta), j\in\Z \}$.
For computation a finite grid is used on the interval $[a,b]$ containing the data, thus the number of grid points is $G=(b-a)/\delta+1$.

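As an illustration of such weight functions (standard textbook forms, stated here only as an example and not necessarily identical to the rules referenced above), simple binning assigns the full weight of $x$ to its nearest grid point, whereas linear binning splits the weight linearly between the two neighbouring grid points:
\begin{equation}
w_j(x,\delta) =
\begin{cases}
1 & \text{if } g_j \text{ is the grid point closest to } x \text{,}\\
0 & \text{otherwise,}
\end{cases}
\qquad
w_j(x,\delta) = \max\left( 0, \; 1 - \frac{|x-g_j|}{\delta} \right) \text{.}
\end{equation}
For example, with $\delta=1$ the sample $x=2.3$ contributes weight $1$ to $g_2$ under simple binning, and weights $0.7$ and $0.3$ to $g_2$ and $g_3$ under linear binning.
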
Equipped with a method to bin the data, the binned kernel estimator $\tilde{f}$ of a density $f$, based on $\bm{X}$, can be computed pointwise at the grid point $g_x$ with
\begin{equation}
\label{eq:binKde}
\tilde{f}(g_x) = \frac{1}{n} \sum_{j=1}^{G} N_j K_h(g_x-g_j)
\end{equation}
where $G$ is the number of grid points and
\begin{equation}
N_j=\sum_{i=1}^{n} w_j(x_i,\delta)
\end{equation}
is the count at grid point $g_j$ \cite{hall1996accuracy}.

Once the counts $N_j$ and the kernel values are computed, they need to be combined, which is, in fact, a discrete convolution \cite{wand1994fast}.
This makes it possible to apply a wide range of well-studied techniques from the DSP field.
Often an FFT-based computation scheme is used to efficiently compute the estimate \cite{silverman1982algorithm}\cite[210ff.]{scott2015}.
As already stated, the computational savings are achieved by reducing the number of evaluated kernels.
A naive implementation of \eqref{eq:binKde} reduces the number of kernel evaluations to $\landau{G^2}$ \cite{fan1994fast}.
Because of the fixed grid spacing $\delta$ most of the kernel evaluations are the same, as each $g_j-g_{j-k}=k\delta$ is independent of $j$ \cite{fan1994fast}.
Therefore, many evaluated kernels can be reused, so that the number of kernel evaluations is reduced to $\landau{G}$ \cite{fan1994fast}.

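To make the convolution structure explicit, which is merely a rewriting of \eqref{eq:binKde} using the fixed grid spacing and adds no further assumptions, abbreviate the kernel weights by $\kappa_k = K_h(k\delta)$, so that
\begin{equation}
\tilde{f}(g_x) = \frac{1}{n} \sum_{j=1}^{G} N_j \, \kappa_{x-j} \text{.}
\end{equation}
The at most $2G-1$ distinct weights $\kappa_k$ can thus be precomputed once and combined with the counts $N_j$ by a discrete convolution, for instance via the FFT.
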
While the estimate can be efficiently computed, it is still unknown how large the grid should be chosen.
Because the computation time heavily depends on the grid size, it is desirable to choose a grid as small as possible without losing too much accuracy.
In general there is no definite answer, because the appropriate amount of binning depends on the structure of the unknown density and on the sample size.
The roughness of the unknown density directly affects the required grid size.
Coarser grids allow a greater speedup but at the same time might conceal important details of the unknown density \cite{wand1994fast}.
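As a purely practical note, and not a recommendation taken from the cited works, FFT-based implementations often favour grid sizes that are powers of two, e.g. $G=512$, since such lengths are particularly convenient for the transform.
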

%Given $n$ independently observed realizations of the observation set $X=(x_1,\dots,x_n)$, the kernel density estimate $\hat{f}_n$ of the density function $f$ of the underlying distribution is given with
%\begin{equation}
%\label{eq:kde}
%\hat{f}_n = \frac{1}{nh} \sum_{i=1}^{n} K \left( \frac{x-X_i}{h} \right) \text{,} %= \frac{1}{n} \sum_{i=1}^{n} K_h(x-x_i)
%\end{equation}
%where $K$ is the kernel function and $h\in\R^+$ is an arbitrary smoothing parameter called bandwidth.
%While any density function can be used as the kernel function $K$ (such that $\int K(u) \dop{u} = 1$), a variety of popular choices of the kernel function $K$ exists.
%In practice the Gaussian kernel is commonly used:
%\begin{equation}
%\label{eq:gausKern}
%K(u)=\frac{1}{\sqrt{2\pi}} \expp{- \frac{u^2}{2} }
%\end{equation}
%
%\begin{equation}
%\hat{f}_n = \frac{1}{nh\sqrt{2\pi}} \sum_{i=1}^{n} \expp{-\frac{(x-X_i)^2}{2h^2}}
%\end{equation}