Given a univariate random sample $X=\{X_1, X_2, \dots, X_n\}$, the kernel estimator $\hat{f}$ which defines the estimate at the point $x$ is given as
\begin{equation}
\label{eq:kde}
\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K \left(\frac{x-X_i}{h}\right)
\end{equation}
where $K$ is a kernel function such that $\int K(u) \dop{u} = 1$ and $h\in\R^+$ is an arbitrary smoothing parameter called the bandwidth \cite[138]{scott2015}.
%, and $h=h_n$ is a function of the sample size $n$ with $h\rightarrow0$ as $n\rightarrow\infty$ \cite{rosenblatt1956remarks}.
In general, any function satisfying this normalization is a valid kernel; however, the usual advice is to choose a symmetric, low-order polynomial kernel.
Hence, several kernel functions are popular in practice, such as the uniform, Gaussian, Epanechnikov, or Silverman kernel \cite[152]{scott2015}.
The Gaussian kernel is given as
\begin{equation}
\label{eq:gausKern}
K_G(u)=\frac{1}{\sqrt{2\pi}} \expp{- \frac{u^2}{2} } \text{.}
\end{equation}
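For illustration, the following minimal sketch (a hypothetical NumPy implementation with illustrative data and bandwidth, not taken from the cited literature) evaluates \eqref{eq:kde} directly with the Gaussian kernel \eqref{eq:gausKern} at a set of query points:
\begin{verbatim}
import numpy as np

def kde(x_query, samples, h):
    """Direct evaluation of the univariate KDE with a Gaussian kernel."""
    n = samples.size
    u = (x_query[:, None] - samples[None, :]) / h    # scaled distances
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)   # Gaussian kernel K(u)
    return K.sum(axis=1) / (n * h)

rng = np.random.default_rng(0)
samples = rng.normal(size=500)        # toy sample
grid = np.linspace(-4.0, 4.0, 101)    # query points
f_hat = kde(grid, samples, h=0.3)     # illustrative bandwidth
\end{verbatim}
Such a direct evaluation requires $n$ kernel evaluations per query point, which motivates the faster computation schemes discussed below.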
So far only the univariate case was considered.
Univariate kernel estimators can, however, be extended quite easily to multivariate distributions.
A common approach is to apply a univariate kernel, with a possibly different bandwidth, in each dimension.
This kind of multivariate kernel is called a product kernel, as the multivariate result is the product of the individual univariate kernels.

Consider a multivariate random variable $X=(x_1,\dots ,x_d)$ in $d$ dimensions.
The sample $\bm{X}$ is an $n\times d$ matrix defined as \cite[162]{scott2015}
\begin{equation}
\bm{X}=
\begin{pmatrix}
X_1 \\
\vdots \\
X_n \\
\end{pmatrix}
=
\begin{pmatrix}
x_{11} & \dots & x_{1d} \\
\vdots & \ddots & \vdots \\
x_{n1} & \dots & x_{nd}
\end{pmatrix} \text{.}
\end{equation}
The multivariate kernel density estimator $\hat{f}$, which defines the estimate pointwise at $\bm{x}=(x_1, \dots, x_d)^T$, is given as \cite[162]{scott2015}
\begin{equation}
\hat{f}(\bm{x}) = \frac{1}{nh_1 \dots h_d} \sum_{i=1}^{n} \left[ \prod_{j=1}^{d} K\left( \frac{x_j-x_{ij}}{h_j} \right) \right]
\end{equation}
where the bandwidth is given as a vector $\bm{h}=(h_1, \dots, h_d)$.
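A minimal sketch of the product kernel (again a hypothetical NumPy implementation; shapes and names are illustrative assumptions):
\begin{verbatim}
import numpy as np

def product_kde(x_query, samples, h):
    """Product-kernel estimate at a single point.

    x_query has shape (d,), samples has shape (n, d),
    h has shape (d,) with one bandwidth per dimension.
    """
    n, d = samples.shape
    u = (x_query[None, :] - samples) / h             # shape (n, d)
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)   # Gaussian kernel per dimension
    return K.prod(axis=1).sum() / (n * np.prod(h))   # product over dims, sum over samples
\end{verbatim}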
%Multivariate Gauss-Kernel
%\begin{equation}
%K(u)=\frac{1}{(2\pi)^{d/2}} \expp{-\frac{1}{2} \bm{x}^T \bm{x}}
%\end{equation}
The flexibility of the KDE comes at the expense of computational efficiency, which has led to the development of more efficient computation schemes.
Besides the number of points at which the estimate is evaluated, the computation time depends on the sample size $n$.
In general, reducing the size of the sample negatively affects the accuracy of the estimate.
Still, the sample size is a suitable parameter to speed up the computation.
Silverman \cite{silverman1982algorithm} suggested to reduce the number of individual data points by combining adjacent points into data bins.
This approximation is called the binned kernel density estimate (BKDE) and has been analysed extensively \cite{fan1994fast} \cite{wand1994fast} \cite{hall1996accuracy} \cite{holmstrom2000accuracy}.
Usually the data is binned over an equidistant grid.
Due to the equally-spaced grid many kernel evaluations are almost identical and can be reused, which greatly reduces the number of evaluated kernels and naturally leads to a reduced computation time \cite{fan1994fast}.

Since each sample is combined with its adjacent samples into bins, the BKDE approximates the KDE.
Each bin represents the \qq{weight} of the sample set at a given point of an equidistant grid with spacing $\delta$.
A binning rule distributes a sample $x$ among the grid points $g_j=j\delta$ indexed by $j\in\Z$ and can be represented as a set of weight functions $\{ w_j(x,\delta), j\in\Z \}$.
Computation requires a finite grid on the interval $[a,b]$ containing the data, thus the number of grid points is $G=(b-a)/\delta+1$.
Given a binning rule $w_j$, the BKDE $\tilde{f}$ of a density $f$, computed pointwise at the grid point $g_x$, is given as
\begin{equation}
\label{eq:binKde}
\tilde{f}(g_x) = \frac{1}{nh} \sum_{j=1}^{G} C_j K \left(\frac{g_x-g_j}{h}\right)
\end{equation}
where $G$ is the number of grid points and
\begin{equation}
\label{eq:gridCnts}
C_j=\sum_{i=1}^{n} w_j(x_i,\delta)
\end{equation}
is the (possibly fractional) grid count at grid point $g_j$ \cite{hall1996accuracy}.
The binned estimator introduces two additional parameters which affect the accuracy of the estimate: the binning rule and the grid size.
There are several possible binning rules \cite{hall1996accuracy}; in theory, any function which assigns weights to grid points is a valid binning rule.
However, for many applications it is recommended to use the simple binning rule
\begin{align}
\label{eq:simpleBinning}
w_j(x,\delta) &=
\begin{cases}
1 & \text{if } g_j - \delta/2 \le x < g_j + \delta/2 \\
0 & \text{else}
\end{cases}
\end{align}
or the common linear binning rule, which divides the sample into two fractional weights shared by the nearest grid points,
\begin{align}
w_j(x,\delta) &=
\begin{cases}
1 - \left| x-g_j \right| / \delta & \text{if } \left| x-g_j \right| < \delta \\
0 & \text{else.}
\end{cases}
\end{align}
An advantage of these two binning rules is that their impact on the approximation error is extensively investigated and well understood \cite{hall1996accuracy}.
Both can be computed with a fast $\landau{n}$ algorithm, as simple binning is essentially the quotient of an integer division and the fractional weights of linear binning are given by the remainder of that division.
As linear binning is more precise, it is often preferred over simple binning \cite{fan1994fast}.
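The following minimal NumPy sketch (hypothetical helper names; the grid convention $g_j = a + j\delta$ for $j=0,\dots,G-1$ is an assumption for illustration) accumulates the grid counts $C_j$ of \eqref{eq:gridCnts} for both rules in linear time:
\begin{verbatim}
import numpy as np

def simple_bin(samples, a, b, G):
    """Simple binning: each sample adds weight 1 to its nearest grid point."""
    delta = (b - a) / (G - 1)
    j = np.rint((np.asarray(samples) - a) / delta).astype(int)  # nearest index
    counts = np.zeros(G)
    np.add.at(counts, np.clip(j, 0, G - 1), 1.0)
    return counts

def linear_bin(samples, a, b, G):
    """Linear binning: each sample is split between its two nearest grid points."""
    delta = (b - a) / (G - 1)
    pos = (np.asarray(samples) - a) / delta    # position in grid units
    j = np.floor(pos).astype(int)              # index of the left neighbour
    frac = pos - j                             # remainder -> fractional weight
    counts = np.zeros(G)
    np.add.at(counts, np.clip(j, 0, G - 1), 1.0 - frac)
    np.add.at(counts, np.clip(j + 1, 0, G - 1), frac)
    return counts
\end{verbatim}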
While linear binning improves the accuracy of the estimate, the choice of the grid size is of greater importance.
The number of grid points $G$ determines the trade-off between the approximation error caused by the binning and the computational speed of the algorithm \cite{wand1994fast}.
Clearly, a large value of $G$ produces an estimate close to the regular KDE, but requires more kernel evaluations than a coarser grid.
Coarser grids allow a greater speedup but might at the same time conceal important details of the unknown density.
However, it is unknown which particular $G$ gives the best trade-off for a given sample set.
In general, there is no definite answer because the appropriate amount of binning depends on the structure of the unknown density and the sample size \cite{hall1996accuracy}.
As already stated, the computational savings are achieved by reducing the number of evaluated kernels.
A naive implementation of \eqref{eq:binKde} reduces the number of kernel evaluations to $\landau{G^2}$ \cite{fan1994fast}.
Due to the fixed grid spacing a faster $\landau{G}$ algorithm can be used, because most of the kernel evaluations are identical and can be reused, as each difference $g_j-g_{j-k}=k\delta$ is independent of $j$ \cite{fan1994fast}.
This is usually highlighted as the striking computational benefit of the BKDE.
However, more important for this work is the fact that the BKDE can be seen as a convolution operation.
Once the grid counts $C_j$ in \eqref{eq:gridCnts} and the kernel values are computed, they need to be combined, which is, in fact, a discrete convolution \cite{wand1994fast}.
Recognizing this discrete convolution structure of \eqref{eq:binKde} allows one to interpret the computation of a density estimate as a signal filtering problem.
This makes it possible to apply a wide range of well-studied techniques from the broad field of digital signal processing (DSP).
Often an FFT-convolution based computation scheme is used to efficiently compute the estimate \cite{silverman1982algorithm}\cite[210ff.]{scott2015}.
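As a hypothetical NumPy/SciPy sketch (function and variable names are illustrative, not taken from the cited works), \eqref{eq:binKde} can be evaluated at all grid points at once by convolving the grid counts with the kernel sampled at the grid lags:
\begin{verbatim}
import numpy as np
from scipy.signal import fftconvolve

def bkde_on_grid(counts, delta, h, n):
    """Evaluate the BKDE at every grid point by discrete convolution.

    counts: grid counts C_j, delta: grid spacing, h: bandwidth, n: sample size.
    """
    G = counts.size
    # K((g_x - g_j)/h) depends only on the lag k = x - j, so it is sampled once.
    lags = np.arange(-(G - 1), G) * delta / h
    kernel = np.exp(-0.5 * lags**2) / np.sqrt(2.0 * np.pi)
    full = fftconvolve(counts, kernel)          # full convolution, length 3G - 2
    return full[G - 1:2 * G - 1] / (n * h)      # estimates at g_0, ..., g_{G-1}
\end{verbatim}
The extracted slice corresponds to evaluating \eqref{eq:binKde} at each grid point $g_x$; a direct convolution would give the same result at a higher cost for large $G$.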
Using the Gaussian kernel from \eqref{eq:gausKern} in conjunction with \eqref{eq:binKde} results in the following equation
\begin{equation}
\label{eq:bkdeGaus}
\tilde{f}(g_x)=\frac{1}{nh\sqrt{2\pi}} \sum_{j=1}^{G} C_j \expp{-\frac{(g_x-g_j)^2}{2h^2}} \text{.}
\end{equation}
The above formula is a convolution of the binned data and the Gaussian kernel.
More precisely, it is a discrete convolution of the grid counts $C_j$ on the finite data grid with the Gaussian function.
In terms of DSP, this is analogous to filtering the binned data with a Gaussian filter.
With \eqref{eq:bkdeGaus}, the Gaussian BKDE thus reduces the seemingly complex computation of a density estimate to a single Gaussian filtering step applied to the binned data.
This finding makes it possible to speed up the computation even further by using a fast approximation scheme based on iterated moving average filters.
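As a rough illustration of this idea (a generic sketch under common assumptions, not the specific scheme developed in this work), a Gaussian filter can be approximated by a few passes of a simple moving average, since iterated box filters quickly approach a Gaussian shape:
\begin{verbatim}
import numpy as np

def box_filter(signal, width):
    """Moving average with an odd window length (zero padding at the borders)."""
    kernel = np.ones(width) / width
    return np.convolve(signal, kernel, mode="same")

def approx_gaussian_filter(signal, sigma, passes=3):
    """Approximate a Gaussian filter (std. dev. sigma, in samples) by iterated moving averages.

    The box width is chosen so that the variance of the iterated filter,
    passes * (width**2 - 1) / 12, roughly matches sigma**2.
    """
    width = int(round(np.sqrt(12.0 * sigma**2 / passes + 1.0)))
    width += 1 - width % 2          # force an odd window length
    out = signal
    for _ in range(passes):
        out = box_filter(out, width)
    return out
\end{verbatim}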
%Given $n$ independently observed realizations of the observation set $X=(x_1,\dots,x_n)$, the kernel density estimate $\hat{f}_n$ of the density function $f$ of the underlying distribution is given with