Improved kde & mvg

This commit is contained in:
MBulli
2018-02-15 23:09:17 +01:00
parent 3b88dc614c
commit 1dff368e82
4 changed files with 116 additions and 106 deletions

View File

@@ -16,11 +16,9 @@ The KDE is often the preferred tool to estimate a density function from discrete
Given a univariate random sample $X=\{X_1, X_2, \dots, X_n\}$, the kernel estimator $\hat{f}$ which defines the estimate at the point $x$ is given as
\begin{equation}
\label{eq:kde}
\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x-X_i)
\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K \left(\frac{x-X_i}{h}\right)
\end{equation}
where $K_h(t)=K(t/h)/h$ is the normalized kernel \cite[138]{scott2015} and $h\in\R^+$ is an arbitrary smoothing parameter called bandwidth.
%, and $h=h_n$ is a function of the sample size $n$ with $h\rightarrow0$ as $n\rightarrow\infty$ \cite{rosenblatt1956remarks}.
Any function which satisfies $\int K_h(u) \dop{u} = 1$ is a valid kernel.
where $K$ is a kernel function such that $\int K(u) \dop{u} = 1$ and $h\in\R^+$ is an arbitrary smoothing parameter called bandwidth \cite[138]{scott2015}.
In general, any kernel can be used; however, the common advice is to choose a symmetric, low-order polynomial kernel.
Thus, several popular kernel functions are used in practice, such as the uniform, Gaussian, Epanechnikov, or Silverman kernel \cite[152]{scott2015}.
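The following minimal sketch (illustrative only; the names \texttt{kde\_gauss}, \texttt{eval\_points} and the chosen bandwidth are arbitrary) evaluates \eqref{eq:kde} directly with the Gaussian kernel:
\begin{verbatim}
import numpy as np

def kde_gauss(eval_points, sample, h):
    # hat{f}(x) = 1/(n*h) * sum_i K((x - X_i)/h) with the Gaussian kernel K
    n = sample.size
    u = (eval_points[:, None] - sample[None, :]) / h   # pairwise (x - X_i)/h
    k = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)     # K(u)
    return k.sum(axis=1) / (n * h)

# usage: estimate the density of 1000 samples on a grid of 200 points
X = np.random.randn(1000)
x = np.linspace(-4.0, 4.0, 200)
f_hat = kde_gauss(x, X, h=0.3)
\end{verbatim}
Each evaluation point requires $n$ kernel evaluations, which motivates the faster computation schemes discussed below.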
@@ -50,72 +48,33 @@ The Gaussian kernel is given as
K_G(u)=\frac{1}{\sqrt{2\pi}} \expp{- \frac{u^2}{2} } \text{.}
\end{equation}
So far only the univariate case was considered.
This is due to the fact that univariate kernel estimators can quite easily be extended to multivariate distributions.
A common approach is to apply a univariate kernel with a possibly different bandwidth in each dimension.
This kind of multivariate kernel is called a product kernel, as the multivariate result is the product of the individual univariate kernels.
Consider a multivariate random variable $X=(x_1,\dots ,x_d)$ in $d$ dimensions.
The sample $\bm{X}$ is a $n\times d$ matrix defined as \cite[162]{scott2015}
\begin{equation}
\bm{X}=
\begin{pmatrix}
X_1 \\
\vdots \\
X_n \\
\end{pmatrix}
=
\begin{pmatrix}
x_{11} & \dots & x_{1d} \\
\vdots & \ddots & \vdots \\
x_{n1} & \dots & x_{nd}
\end{pmatrix} \text{.}
\end{equation}
The multivariate kernel density estimator $\hat{f}$ which defines the estimate pointwise at $\bm{x}=(x_1, \dots, x_d)^T$ is given as \cite[162]{scott2015}
\begin{equation}
\hat{f}(\bm{x}) = \frac{1}{nh_1 \dots h_d} \sum_{i=1}^{n} \left[ \prod_{j=1}^{d} K\left( \frac{x_j-x_{ij}}{h_j} \right) \right] \text{.}
\end{equation}
where the bandwidth is given as a vector $\bm{h}=(h_1, \dots, h_d)$.
%Multivariate Gauss-Kernel
%\begin{equation}
%K(u)=\frac{1}{(2\pi)^{d/2}} \expp{-\frac{1}{2} \bm{x}^T \bm{x}}
%\end{equation}
The flexibility of the KDE comes at the expense of computational efficiency, which has led to the development of more efficient computation schemes.
Besides the number of evaluation points, the computation time depends on the number of data points $n$.
In general, reducing the size of the sample negatively affects the accuracy of the estimate.
Still, the sample size is a suitable parameter to speed up the computation.
\todo{rewrite}
Silverman \cite{silverman1982algorithm} suggested reducing the number of individual data points by combining adjacent points into data bins.
This approximation is called binned kernel density estimate (BKDE) and was extensively analysed \cite{fan1994fast} \cite{wand1994fast} \cite{hall1996accuracy} \cite{holmstrom2000accuracy}.
Usually the data is binned over an equidistant grid.
Due to the equally spaced grid, many kernel evaluations coincide and can be reused, which greatly reduces the number of kernel evaluations and naturally leads to a reduced computation time \cite{fan1994fast}.
\todo{introduce a bin size variable}
At first the data, i.e. a random sample $X$, has to be assigned to a grid.
A binning rule distributes a sample $x$ among the grid points $g_j=j\delta$ for $j\in\Z$ and can be represented as a set of functions $\{ w_j(x,\delta), j\in\Z \}$.
For computation a finite grid is used on the interval $[a,b]$ containing the data, thus the number of grid points is $G=(b-a)/\delta+1$.
Since each individual sample is combined with its adjacent samples into bins, the BKDE approximates the KDE.
Each bin represents the \qq{weight} of the sample set at a given point of an equidistant grid with spacing $\delta$.
A binning rule distributes a sample $x$ among the grid points $g_j=j\delta$ indexed by $j\in\Z$.
% and can be represented as a set of functions $\{ w_j(x,\delta), j\in\Z \}$.
Computation requires a finite grid on the interval $[a,b]$ containing the data; thus, the number of grid points is $G=(b-a)/\delta+1$.
Given a binning rule $w_j$, the BKDE $\tilde{f}$ of a density $f$ computed pointwise at the grid point $g_x$ is given as
\begin{equation}
\label{eq:binKde}
\tilde{f}(g_x) = \frac{1}{n} \sum_{j=1}^{G} N_j K_h(g_x-g_j)
\tilde{f}(g_x) = \frac{1}{nh} \sum_{j=1}^{G} C_j K \left(\frac{g_x-g_j}{h}\right)
\end{equation}
where $G$ is the number of grid points and
\begin{equation}
\label{eq:gridCnts}
N_j=\sum_{i=1}^{n} w_j(x_i,\delta)
C_j=\sum_{i=1}^{n} w_j(x_i,\delta)
\end{equation}
is the count at grid point $g_j$ \cite{hall1996accuracy}.
So far, the binned estimator has introduced two additional parameters which affect the accuracy of the estimation.
The choice of the number of grid points is crucial, because it determines the trade-off between the approximation error introduced by the binning and the computational speed of the algorithm \cite{wand1994fast}.
Also, the choice of the binning rule affects the accuracy of the approximation.
There are several possible binning rules \cite{hall1996accuracy} but the most commonly used binning rules are the simple
\commentByMarkus{Wording: Count vs. Weight}
In theory, any function which assigns weights to grid points is a valid binning rule.
However, for many applications it is recommended to use the simple binning rule
\begin{align}
\label{eq:simpleBinning}
w_j(x,\delta) &=
@@ -124,8 +83,7 @@ There are several possible binning rules \cite{hall1996accuracy} but the most co
0 & \text{else}
\end{cases}
\end{align}
and the common linear binning rule
or the common linear binning rule, which splits a sample into two fractional weights shared by the two nearest grid points
\begin{align}
w_j(x,\delta) &=
\begin{cases}
@@ -133,39 +91,38 @@ and the common linear binning rule
0 & \text{else.}
\end{cases}
\end{align}
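The following minimal sketch (illustrative names only; the grid is anchored at the left interval border $a$, i.e. $g_j=a+j\delta$) computes the linear binning weights by splitting each sample's unit weight between its two neighbouring grid points:
\begin{verbatim}
import numpy as np

def linear_binning(sample, a, b, delta):
    # Grid points g_j = a + j*delta for j = 0, ..., G-1 with G = (b-a)/delta + 1.
    G = int(round((b - a) / delta)) + 1
    C = np.zeros(G)
    t = np.clip((sample - a) / delta, 0.0, G - 1.0)  # position in grid units
    j = np.floor(t).astype(int)                      # index of the left grid point
    r = t - j                                        # remainder of the division
    np.add.at(C, j, 1.0 - r)                         # weight for the left point
    np.add.at(C, np.minimum(j + 1, G - 1), r)        # weight for the right point
    return C                                         # grid counts C_j, sum(C) == n

# usage: bin 10000 samples on a grid with spacing delta = 0.05
X = np.random.randn(10000)
C = linear_binning(X, a=-5.0, b=5.0, delta=0.05)
\end{verbatim}
The quotient of the division by $\delta$ selects the left grid point and the remainder yields the fractional weights.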
An advantage of these often used binning rules is that their effect on the approximation is extensively investigated and well understood \cite{wand1994fast} \cite{hall1996accuracy} \cite{holmstrom2000accuracy}.
An advantage is that their impact on the approximation error has been extensively investigated and is well understood \cite{hall1996accuracy}.
Both methods can be computed with a fast $\landau{N}$ algorithm, as simple binning essentially amounts to the quotient of an integer division, while the fractional weights of linear binning are given by the remainder of that division.
As linear binning is more precise it is often preferred over simple binning \cite{fan1994fast}.
\todo{improve text flow}
While the estimate can be efficiently computed, it is unknown how large the grid should be chosen.
Because the computation time heavily depends on the grid size, it is desirable to choose a grid as small as possible without losing too much accuracy.
In general, there is no definite answer because the amount of binning depends on the structure of the unknown density and the sample size.
The roughness of the unknown density directly affects the grid size.
Coarser grids allow a greater speedup but at the same time might conceal important details of the unknown density \cite{wand1994fast}.
While linear binning improves the accuracy of the estimate, the choice of the grid size is more important.
The number of grid points $G$ determines the trade-off between the approximation error caused by the binning and the computational speed of the algorithm.
Clearly, a large value of $G$ produces an estimate close to the regular KDE, but requires more kernel evaluations compared to a coarser grid.
However, it is unknown which particular value of $G$ gives the best trade-off for a given sample set.
In general, there is no definite answer because the amount of binning depends on the structure of the unknown density and the sample size \cite{hall1996accuracy}.
As already stated, the computational savings are achieved by reducing the number of kernel evaluations.
A naive implementation of \eqref{eq:binKde} reduces the number of evaluations to $\landau{G^2}$ \cite{fan1994fast}.
Because of the fixed grid spacing $\delta$ most of the kernel evaluations are the same, as each $g_j-g_{j-k}=k\delta$ is independent of $j$ \cite{fan1994fast}.
Therefore, many evaluated kernels can be reused, so that the number kernel evaluations are reduced to $\landau{G}$ \cite{fan1994fast}.
Due to the fixed grid spacing, a faster $\landau{G}$ algorithm can be used, because most of the kernel evaluations are identical and can be reused.
%, as each $g_j-g_{j-k}=k\delta$ is independent of $j$ \cite{fan1994fast}.
This is usually highlighted as the striking computational benefit of the BKDE.
However, more important for this work is the fact that the BKDE can be seen as a convolution operation.
\todo{sentence}
Once the grid counts $N_j$ in \eqref{eq:gridCnts} and kernel values are computed they need to be combined, which is, in fact, a discrete convolution \cite{wand1994fast}.
This makes it possible to apply a wide range of well studied techniques from the DSP field.
Often a FFT-convolution based computation scheme is used to efficiently compute the estimate \cite{silverman1982algorithm}\cite[210ff.]{scott2015}.
Using the Gaussian kernel from \eqref{eq:gausKern} in conjunction with the BKDE results in the following equation
However, for this work it is key to recognize the discrete convolution structure of \eqref{eq:binKde}, as this allows one to interpret the computation of a density estimate as a signal filtering problem.
This makes it possible to apply a wide range of well studied techniques from the broad field of digital signal processing (DSP).
Using the Gaussian kernel from \eqref{eq:gausKern} in conjunction with \eqref{eq:binKde} results in the following equation
\begin{equation}
\hat{f}(g_x)=\frac{1}{nh\sqrt{2\pi}} \sum_{i=1}^{G} N_j \expp{-\frac{(x-X_i)^2}{2h^2}} \text{.}
\label{eq:bkdeGaus}
\tilde{f}(g_x)=\frac{1}{nh\sqrt{2\pi}} \sum_{j=1}^{G} C_j \expp{-\frac{(g_x-g_j)^2}{2h^2}} \text{.}
\end{equation}
\todo{Change capital N to capital C and use it in the text below to make it clearer}
As already stated the above formula is a convolution operation of the data and the kernel.
The above formula is a convolution operation of the data and the Gaussian kernel.
More precisely, it is a discrete convolution of the finite data grid and the Gaussian function.
In terms of DSP this is analogous to filtering the binned data with a Gaussian filter.
This finding makes it possible to speed up the computation of the density estimate by using a fast approximation scheme based on iterated moving average filters.
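The convolution view can be sketched as follows (illustrative only; the Gaussian kernel is truncated at $4h$, and the grid counts $C_j$ are assumed to come from a binning rule as above):
\begin{verbatim}
import numpy as np

def bkde_gaussian(C, delta, h, n):
    # Sample the Gaussian kernel at the grid offsets k*delta, truncated at 4h,
    # and convolve it with the grid counts C_j (a plain discrete convolution).
    radius = int(np.ceil(4.0 * h / delta))
    offsets = np.arange(-radius, radius + 1) * delta
    kernel = np.exp(-0.5 * (offsets / h) ** 2) / np.sqrt(2.0 * np.pi)
    # assumes the grid is longer than the truncated kernel
    return np.convolve(C, kernel, mode="same") / (n * h)

# usage: C from the linear binning sketch above, n = number of samples
f_tilde = bkde_gaussian(C, delta=0.05, h=0.3, n=10000)
\end{verbatim}
For large grids the plain convolution could equally be replaced by an FFT-based convolution.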
\commentByToni{Maybe mention again explicitly here, by name, that the Gaussian now approximates the BKDE and that this insight is great and important, because we have represented a complex problem in a totally simple and fast way.}
\commentByMarkus{Is this sufficient?}
%Given $n$ independently observed realizations of the observation set $X=(x_1,\dots,x_n)$, the kernel density estimate $\hat{f}_n$ of the density function $f$ of the underlying distribution is given with

View File

@@ -4,38 +4,27 @@
% Gauss Blur Filter
% Repetitive Box filter to approx Gauss
% Simple multipass, n/m approach, extended box filter
\todo{Describe the normalization factor and sigma vs. h}
The Gaussian filter is a widely used smoothing filter.
It is defined as the convolution of an input signal and the Gaussian function
Digital filters are implemented by convolving the input signal with a filter kernel, i.e. the digital filter's impulse response.
Consequently, the filter kernel of a Gaussian filter is a Gaussian with finite support \cite[120]{dspGuide1997}.
Assuming a finite-support Gaussian filter kernel of size $M$ and an input signal $x$ of length $N$, discrete convolution produces the smoothed output signal
\begin{equation}
\label{eq:gausfx}
g(x) = \frac{1}{\sigma \sqrt{2\pi}} \expp{-\frac{x^2}{2\sigma^2}} \text{,}
\label{eq:gausFilt}
y[n] = \frac{1}{\sigma\sqrt{2\pi}} \sum_{k=0}^{N-1} x[k]\expp{-\frac{(n-k)^2}{2\sigma^2}} \text{,}
\end{equation}
where $\sigma$ is a smoothing parameter called standard deviation.
%In the discrete case the Gaussian filter is easily computed with the sliding window algorithm in time domain.
If the filter kernel is separable, the convolution is also separable i.e. multi-dimensional convolution can be computed as individual one-dimensional convolutions with a one-dimensional kernel.
Because of $e^{x^2+y^2} = e^{x^2}\cdot e^{y^2}$ the Gaussian filter is separable and can be easily applied to multi-dimensional signals. \todo{citation}
Note that \eqref{eq:bkdeGaus} has the same structure as \eqref{eq:gausFilt}, apart from the different symbol used for the smoothing parameter and the different factor in front of the sum.
While in both equations the constant factor of the Gaussian is pulled out of the inner sum, \eqref{eq:bkdeGaus} carries an additional normalization factor $n^{-1}$.
This factor is necessary in order to ensure that the estimate is a valid density function, i.e. that it integrates to one.
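For the kernel estimator \eqref{eq:kde} this is easily verified with the substitution $u=(x-X_i)/h$:
\begin{equation}
\int \hat{f}(x) \dop{x} = \frac{1}{nh} \sum_{i=1}^{n} \int K\left(\frac{x-X_i}{h}\right) \dop{x} = \frac{1}{n} \sum_{i=1}^{n} \int K(u) \dop{u} = 1 \text{.}
\end{equation}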
Such a restriction is superfluous in the context of digital filters, so the normalization factor is omitted.
% TODO similarity between Gaussian and KDE -> fast Gaussian = fast KDE
Computation of a filter using the a naive implementation of the discrete convolution algorithm yields $\landau{NK}$, where $N$ is the length of the input signal and $K$ is the size of the filter kernel.
Note that in the case of the Gaussian filter $K$ depends on $\sigma$.
In order to capture all significant values of the Gaussian function the kernel size $K$ must be adopted to the standard deviation of the Gaussian.
%A popular approach to efficiently compute a filter result is the FFT-convoultion algorithm which is $\landau{N\log(N)}$.
%For large values of $\sigma$ the computation time of the Gaussian filter might be reduced by applying the filter in frequency domain.
%In order to do so, both signals are transformed into frequency domain using the FFT.
%The convoluted time signal is equal to the point-wise multiplication of the signals in frequency domain.
%In case of the Gaussian filter the computation of the Fourier transform of the kernel can be saved, as the Gaussian is a eigenfunction for the Fourier transform \cite{?}.
%While the FFT-convolution algorithm poses an efficient algorithm for large signals, it adds an noticeable overhead for small signals.
%While the above mentions algorithms poses efficient computations schemes to compute an exact filter result, approximative algorithms can further speed up the computation.
\todo{O(NK) is bad and we want O(N), hence the box filter}
A well-known rapid approximation of the Gaussian filter is given by the moving average filter.
Computation of a digital filter using a naive implementation of the discrete convolution algorithm yields $\landau{NM}$, where $N$ is the length of the input signal and $M$ is the size of the filter kernel.
In order to capture all significant values of the Gaussian function the kernel size $M$ must be adapted to the standard deviation of the Gaussian.
A faster $\landau{N}$ algorithm is given by the well-known approximation scheme based on iterated moving average filters.
While reducing the algorithmic complexity, this approximation also reduces the computation time significantly due to the simple computation scheme of the moving average filter.
Following that, an implementation of the iterated moving average filter is one of the fastest ways to obtain an approximate Gaussian-filtered signal.
% TODO with a small error
% TODO and thus the moving average filter can also approximate the BKDE?
\subsection{Moving Average Filter}
The moving average filter is a simple filter which takes an input function $x$ and produces a smoothed output function $y$.
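The following minimal sketch (illustrative names; the box width is chosen by the common variance-matching heuristic that $p$ passes of a box of width $w$ have variance $p(w^2-1)/12$) shows a single moving average pass and its iterated application:
\begin{verbatim}
import numpy as np

def moving_average(x, w):
    # One box filter pass of odd width w via a running sum (O(N) per pass).
    pad = w // 2
    xp = np.pad(x, (pad + 1, pad), mode="edge")
    c = np.cumsum(xp)
    return (c[w:] - c[:-w]) / w

def approx_gaussian_filter(x, sigma, passes=3):
    # Iterated box filtering approximates a Gaussian filter; the box width is
    # chosen so that the given number of passes matches the variance sigma^2.
    w = int(np.sqrt(12.0 * sigma**2 / passes + 1.0))
    if w % 2 == 0:
        w += 1                       # keep the window symmetric
    y = x
    for _ in range(passes):
        y = moving_average(y, w)
    return y

# usage: smooth binned counts; on the grid a bandwidth h corresponds to h/delta bins
# y = approx_gaussian_filter(C, sigma=0.3 / 0.05)
\end{verbatim}
Three passes are a common choice; additional passes bring the result closer to a true Gaussian shape.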

View File

@@ -1,4 +1,47 @@
\section{Usage}
\subsection{Extension to multi-dimensional data}
\todo{Paragraph on 2D - extension to multi-dimensional data}
% KDE:
%So far only the univariate case was considered.
%This is due to the fact, that univariate kernel estimators can quite easily be extended to multivariate distributions.
%A common approach is to apply an univariate kernel with a possibly different bandwidth in each dimension.
%These kind of multivariate kernel is called product kernel as the multivariate kernel result is the product of each individual univariate kernel.
%
%Given a multivariate random variable $X=(x_1,\dots ,x_d)$ in $d$ dimensions.
%The sample $\bm{X}$ is a $n\times d$ matrix defined as \cite[162]{scott2015}
%\begin{equation}
% \bm{X}=
% \begin{pmatrix}
% X_1 \\
% \vdots \\
% X_n \\
% \end{pmatrix}
% =
% \begin{pmatrix}
% x_{11} & \dots & x_{1d} \\
% \vdots & \ddots & \vdots \\
% x_{n1} & \dots & x_{nd}
% \end{pmatrix} \text{.}
%\end{equation}
%
%The multivariate kernel density estimator $\hat{f}$ which defines the estimate pointwise at $\bm{x}=(x_1, \dots, x_d)^T$ is given as \cite[162]{scott2015}
%\begin{equation}
% \hat{f}(\bm{x}) = \frac{1}{nh_1 \dots h_d} \sum_{i=1}^{n} \left[ \prod_{j=1}^{d} K\left( \frac{x_j-x_{ij}}{h_j} \right) \right] \text{.}
%\end{equation}
%where the bandwidth is given as a vector $\bm{h}=(h_1, \dots, h_d)$.
%Multivariate Gauss-Kernel
%\begin{equation}
%K(u)=\frac{1}{(2\pi)^{d/2}} \expp{-\frac{1}{2} \bm{x}^T \bm{x}}
%\end{equation}
% Gaus:
%If the filter kernel is separable, the convolution is also separable i.e. multi-dimensional convolution can be computed as individual one-dimensional convolutions with a one-dimensional kernel.
%Because of $e^{x^2+y^2} = e^{x^2}\cdot e^{y^2}$ the Gaussian filter is separable and can be easily applied to multi-dimensional signals. \todo{quelle}
%how do we use all of this now? what do we need to pay attention to?
% using 2D data as an example
@@ -9,7 +52,8 @@
% - separately in each dimension
% take the maximum of the filter result
\todo{Paragraph on 2D}
The objective of our method is to allow a reliable recovery of the most probable state from a time-sequential Monte Carlo sensor fusion system.
Assuming a sample-based representation, our method allows estimating the density of the unknown distribution of the state space within a narrow time frame.
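A possible end-to-end sketch under these assumptions (simple binning via a two-dimensional histogram instead of linear binning, illustrative names, and the \texttt{approx\_gaussian\_filter} sketch from the filter section) is:
\begin{verbatim}
import numpy as np

def most_probable_state(samples, bins=128, sigma_bins=2.0):
    # Bin the particles on a regular 2-D grid (simple binning via a histogram),
    # smooth the counts separably (one 1-D approximate Gaussian pass per
    # dimension) and return the grid point with the maximum filter response.
    H, xedges, yedges = np.histogram2d(samples[:, 0], samples[:, 1], bins=bins)
    smooth = np.apply_along_axis(approx_gaussian_filter, 0, H, sigma_bins)
    smooth = np.apply_along_axis(approx_gaussian_filter, 1, smooth, sigma_bins)
    i, j = np.unravel_index(np.argmax(smooth), smooth.shape)
    x_mode = 0.5 * (xedges[i] + xedges[i + 1])   # centre of the argmax bin
    y_mode = 0.5 * (yedges[j] + yedges[j + 1])
    return x_mode, y_mode

# usage: samples is an (n, 2) array of particles from the fusion system
state = most_probable_state(np.random.randn(5000, 2))
\end{verbatim}
The two one-dimensional passes exploit the separability of the Gaussian, and the argmax of the filtered grid serves as the estimate of the most probable state.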

View File

@@ -2945,4 +2945,24 @@ year = {2003}
pages={403--433},
year={2013},
publisher={Springer}
}
@article{fan1994fast,
title={Fast implementations of nonparametric curve estimators},
author={Fan, Jianqing and Marron, James S.},
journal={Journal of Computational and Graphical Statistics},
volume={3},
number={1},
pages={35--56},
year={1994},
publisher={Taylor \& Francis Group}
}
@book{dspGuide1997,
title={The Scientist and Engineer's Guide to Digital Signal Processing},
author={Smith, Steven W.},
edition={2},
year={1999},
isbn={978-0-9660176-7-0},
publisher={California Technical Pub}
}