kde & moving avg draft
@@ -109,6 +109,9 @@
\newcommand{\Z} {\ensuremath{ \mathbf{Z} }}
\newcommand{\expp} [1]{\ensuremath{ \exp \left( #1 \right) }}
\newcommand{\landau}[1]{\ensuremath{ \mathcal{O}\left( #1 \right) }}
\newcommand{\wideal} {\ensuremath{ w_{\text{ideal}} }}
\newcommand{\floor} [1]{\ensuremath{ \lfloor #1 \rfloor }}
\newcommand{\etal} [1]{#1~et~al.}

\newcommand{\qq} [1]{``#1''}

@@ -13,6 +13,7 @@ While such methods are computational fast and suitable most of the time, it is n
Time-sequential, non-linear and non-Gaussian state spaces in particular, which depend on a large number of different sensor types, frequently exhibit a multimodal posterior distribution.
As a result, such techniques cannot provide an accurate statement about the most probable state and instead produce misleading or false outcomes.
For example, in a localization scenario where the current posterior is bimodal, a reliable position estimate is more likely to lie at one of the modes rather than somewhere in between.
\commentByMarkus{Maybe also discuss that averaging ends up right in the middle?}
Additionally, in most practical scenarios the sample size, and therefore the resolution, is limited, causing the variance of the sample-based estimate to be high \cite{Verma2003}.

Obviously, computing the full posterior could resolve the issues above, but finding such an analytical solution is intractable, which is the reason for applying a sample representation in the first place.

@@ -12,6 +12,48 @@ The histogram is a simple and for a long time the most used non-parametric estim
However, its inability to produce a continuous estimate rules it out for many applications where a smooth distribution is assumed.
In contrast, KDE is often the preferred tool because of its ability to produce a continuous estimate and its flexibility.

Given a univariate random sample $X=\{X_1, X_2, \dots, X_n\}$, the kernel estimator $\hat{f}$, which defines the estimate at a point $x$, is given as
\begin{equation}
\label{eq:kde}
\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K_h(x-X_i)
\end{equation}
where $K_h(t)=K(t/h)/h$ is the normalized kernel \cite[138]{scott2015} and $h\in\R^+$ is an arbitrary smoothing parameter called the bandwidth.
%, and $h=h_n$ is a function of the sample size $n$ with $h\rightarrow0$ as $n\rightarrow\infty$ \cite{rosenblatt1956remarks}.
Any function which satisfies $\int K_h(u) \dop{u} = 1$ is a valid kernel.
In general any kernel can be used; the common advice, however, is to choose a symmetric, low-order polynomial kernel.
Several popular kernel functions are used in practice, such as the uniform, Gaussian, Epanechnikov, or Silverman kernel \cite[152]{scott2015}.

While the kernel estimate inherits all properties of the kernel, it is usually not crucial if a non-optimal kernel is chosen \cite[151f.]{scott2015}.
As a matter of fact, the quality of the kernel estimate is primarily determined by the smoothing parameter $h$ \cite[145]{scott2015}.
In theory it is possible to calculate an optimal bandwidth $h^*$ with respect to the asymptotic mean integrated squared error.
However, doing so requires knowledge of the very density that is to be estimated, which is obviously unavailable in practice.

Any non-optimal bandwidth causes undersmoothing or oversmoothing.
A small $h$ leads to undersmoothing, i.e. an estimator with a large variance.
A large $h$, on the other hand, increases the bias, which leads to oversmoothing \cite[7]{Cybakov2009}.
Clearly, with an adverse choice of the bandwidth, crucial information such as modality might get smoothed out.
All in all, a good choice of the bandwidth is not obvious.

This is aggravated by the fact that the structure of the data may vary significantly.
In such a situation it is beneficial to adapt the bandwidth to the neighbourhood of the given data point.
As a result, a lot of research has been put into data-driven bandwidth selection algorithms which obtain an adequate value of $h$ directly from the data.
% TODO for reasons the bandwidth is assumed to be given here

As mentioned above, the particular choice of the kernel is of minor importance, as it affects the overall result only in a negligible way.
It is common practice to assume that the data is approximately Gaussian, and therefore the Gaussian kernel is frequently used.
Note that this assumption is different from assuming a concrete distribution family such as a Gaussian or a mixture distribution.
In this work we choose the Gaussian kernel in favour of computational efficiency, as our approach is based on an approximation of the Gaussian filter.
The Gaussian kernel is given as
\begin{equation}
\label{eq:gausKern}
K_G(u)=\frac{1}{\sqrt{2\pi}} \expp{- \frac{u^2}{2} } \text{.}
\end{equation}

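As an illustration, \eqref{eq:kde} with the Gaussian kernel can be evaluated directly in a few lines of NumPy.
The following sketch is purely illustrative; the function and variable names are assumptions of the example and not part of this work.
\begin{verbatim}
import numpy as np

def gaussian_kde(x_eval, samples, h):
    # f_hat(x) = 1/(n*h) * sum_i K((x - X_i)/h), evaluated on a set of points
    x = np.asarray(x_eval, dtype=float)[:, None]    # shape (m, 1)
    X = np.asarray(samples, dtype=float)[None, :]   # shape (1, n)
    u = (x - X) / h                                 # pairwise scaled differences
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)  # Gaussian kernel values
    return K.sum(axis=1) / (X.shape[1] * h)

# usage: density of 200 samples on a regular evaluation grid
# density = gaussian_kde(np.linspace(-4, 4, 101), np.random.normal(size=200), h=0.3)
\end{verbatim}
Note that this direct evaluation requires one kernel evaluation per pair of evaluation point and sample, which is what motivates the binned approximation discussed below.
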
So far only the univariate case has been considered.
This is because univariate kernel estimators can quite easily be extended to multivariate distributions.
A common approach is to apply a univariate kernel with a possibly different bandwidth in each dimension.
This kind of multivariate kernel is called a product kernel, as the multivariate result is the product of the individual univariate kernels.

Consider a multivariate random variable $X=(x_1,\dots ,x_d)$ in $d$ dimensions.
The sample $\bm{X}$ is an $n\times d$ matrix defined as \cite[162]{scott2015}
\begin{equation}
@@ -29,101 +71,91 @@ The sample $\bm{X}$ is a $n\times d$ matrix defined as \cite[162]{scott2015}
\end{pmatrix} \text{.}
\end{equation}

The multivariate kernel density estimator $\hat{f}$, which defines the estimate pointwise at $\bm{x}=(x_1, \dots, x_d)^T$, is given as \cite[162]{scott2015}
\begin{equation}
\hat{f}(\bm{x}) = \frac{1}{nh_1 \dots h_d} \sum_{i=1}^{n} \left[ \prod_{j=1}^{d} K\left( \frac{x_j-x_{ij}}{h_j} \right) \right] \text{,}
\end{equation}
where the bandwidth is given as a vector $\bm{h}=(h_1, \dots, h_d)$, i.e. a possibly different smoothing parameter is used in each dimension.

%Multivariate Gauss-Kernel
%\begin{equation}
%K(u)=\frac{1}{(2\pi)^{d/2}} \expp{-\frac{1}{2} \bm{x}^T \bm{x}}
%\end{equation}

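For illustration, a minimal sketch of the product kernel estimator with a univariate Gaussian kernel in every dimension is given below; the names and array shapes are assumptions of the example only.
\begin{verbatim}
import numpy as np

def product_kernel_kde(x_eval, samples, h):
    # x_eval: (m, d) evaluation points, samples: (n, d) data, h: (d,) bandwidth vector
    x = np.asarray(x_eval, dtype=float)[:, None, :]   # (m, 1, d)
    X = np.asarray(samples, dtype=float)[None, :, :]  # (1, n, d)
    h = np.asarray(h, dtype=float)
    u = (x - X) / h                                   # scaled differences per dimension
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)    # univariate Gaussian kernel per dimension
    n = X.shape[1]
    return K.prod(axis=2).sum(axis=1) / (n * h.prod())  # product over dims, sum over samples
\end{verbatim}
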
The flexibility of the KDE comes at the expense of computational efficiency, which has led to the development of more efficient computation schemes.
The computation time depends, besides the number of evaluation points, on the number of data points $n$.
In general, reducing the size of the sample negatively affects the accuracy of the estimate.
Still, the sample size is a suitable parameter to speed up the computation.
Silverman \cite{silverman1982algorithm} suggested reducing the number of individual data points by combining adjacent points into data bins.
This approximation is called the binned kernel density estimate (BKDE) and has been analysed extensively \cite{fan1994fast} \cite{wand1994fast} \cite{hall1996accuracy} \cite{holmstrom2000accuracy}.

Usually the data is binned over an equidistant grid.
Due to the equally spaced grid many kernel evaluations are identical and can be reused, which greatly reduces the number of evaluated kernels and naturally leads to a reduced computation time \cite{fan1994fast}.

First the data, i.e. a random sample $X$, has to be assigned to a grid.
A binning rule distributes a sample point $x$ among the grid points $g_j=j\delta$ for $j\in\Z$ and can be represented as a set of functions $\{ w_j(x,\delta), j\in\Z \}$.
For the computation a finite grid on an interval $[a,b]$ containing the data is used; thus the number of grid points is $G=(b-a)/\delta+1$.

While the estimate can be computed efficiently, it is not obvious how large the grid should be chosen.
Because the computation time depends heavily on the grid size, it is desirable to choose a grid as small as possible without losing too much accuracy.
In general, there is no definite answer because the amount of binning depends on the structure of the unknown density and the sample size.
The roughness of the unknown density directly affects the required grid size.
Coarser grids allow a greater speedup but at the same time might conceal important details of the unknown density \cite{wand1994fast}.

Given a binning rule $w_j$, the BKDE $\tilde{f}$ of a density $f$, computed pointwise at the grid point $g_x$, is given as
\begin{equation}
\label{eq:binKde}
\tilde{f}(g_x) = \frac{1}{n} \sum_{j=1}^{G} N_j K_h(g_x-g_j)
\end{equation}
where $G$ is the number of grid points and
\begin{equation}
\label{eq:gridCnts}
N_j=\sum_{i=1}^{n} w_j(x_i,\delta)
\end{equation}
is the count at grid point $g_j$ \cite{hall1996accuracy}.

So far the binned estimator has introduced two additional parameters which affect the accuracy of the estimation.
The choice of the number of grid points is crucial, because it determines the trade-off between the approximation error introduced by the binning and the computational speed of the algorithm \cite{wand1994fast}.
Also, the choice of the binning rule affects the accuracy of the approximation.
There are several possible binning rules \cite{hall1996accuracy}, but the most commonly used are the simple binning rule
\begin{align}
\label{eq:simpleBinning}
w_j(x,\delta) &=
\begin{cases}
1 & \text{if } x \in ((j-\frac{1}{2})\delta, (j+\frac{1}{2})\delta ] \\
0 & \text{else}
\end{cases}
\end{align}
and the common linear binning rule
\begin{align}
\label{eq:linearBinning}
w_j(x,\delta) &=
\begin{cases}
1-|\delta^{-1}x-j| & \text{if } |\delta^{-1}x-j|\le1 \\
0 & \text{else.}
\end{cases}
\end{align}
An advantage of these frequently used binning rules is that their effect on the approximation has been investigated extensively and is well understood \cite{wand1994fast} \cite{hall1996accuracy} \cite{holmstrom2000accuracy}.

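As an illustration of the grid counts \eqref{eq:gridCnts}, the following sketch applies the linear binning rule on an equidistant grid covering the data; the helper name, its parameters and the clipping at the boundary are assumptions of the example.
\begin{verbatim}
import numpy as np

def linear_bin_counts(samples, a, b, G):
    # grid g_j = a + j*delta for j = 0..G-1 with spacing delta = (b-a)/(G-1);
    # assumes all samples lie inside [a, b]
    delta = (b - a) / (G - 1)
    p = (np.asarray(samples, dtype=float) - a) / delta  # fractional grid position
    j0 = np.clip(np.floor(p).astype(int), 0, G - 2)     # lower neighbouring grid point
    frac = p - j0                                       # distance to the lower grid point
    N = np.zeros(G)
    np.add.at(N, j0,     1.0 - frac)  # weight 1 - |p - j| goes to the lower point
    np.add.at(N, j0 + 1, frac)        # the remainder goes to the upper point
    return N                          # N_j = sum_i w_j(x_i, delta), sums to n
\end{verbatim}
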
As already stated, the computational savings are achieved by reducing the number of evaluated kernels.
A naive implementation of \eqref{eq:binKde} reduces the number of evaluations to $\landau{G^2}$ \cite{fan1994fast}.
Because of the fixed grid spacing $\delta$ most of the kernel evaluations are the same, as each difference $g_j-g_{j-k}=k\delta$ is independent of $j$ \cite{fan1994fast}.
Therefore, many evaluated kernels can be reused, so that the number of kernel evaluations is reduced to $\landau{G}$ \cite{fan1994fast}.

However, more important for this work is the fact that the BKDE can be seen as a convolution operation.
Once the grid counts $N_j$ in \eqref{eq:gridCnts} and the kernel values are computed, they need to be combined, which is, in fact, a discrete convolution \cite{wand1994fast}.
This makes it possible to apply a wide range of well-studied techniques from the field of digital signal processing (DSP).
Often an FFT-convolution based computation scheme is used to compute the estimate efficiently \cite{silverman1982algorithm}\cite[210ff.]{scott2015}.

Using the Gaussian kernel from \eqref{eq:gausKern} in conjunction with the BKDE results in the following equation
\begin{equation}
\tilde{f}(g_x)=\frac{1}{nh\sqrt{2\pi}} \sum_{j=1}^{G} N_j \expp{-\frac{(g_x-g_j)^2}{2h^2}} \text{.}
\end{equation}

As already stated, the above formula is a convolution of the binned data and the kernel.
More precisely, it is a discrete convolution of the finite data grid and the Gaussian function.
In terms of DSP this is analogous to filtering the binned data with a Gaussian filter.

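To make this filtering view concrete, the following sketch computes the BKDE as a discrete convolution of the grid counts with a sampled Gaussian; the truncation of the kernel at four bandwidths is an assumption of the example, not a prescribed choice.
\begin{verbatim}
import numpy as np

def binned_kde(N, delta, h):
    # N: grid counts N_j, delta: grid spacing, h: bandwidth;
    # assumes the sampled kernel is shorter than the grid
    n = N.sum()                                  # number of samples
    r = int(np.ceil(4.0 * h / delta))            # kernel radius in grid points
    t = np.arange(-r, r + 1) * delta             # kernel support g_x - g_j = k*delta
    K = np.exp(-0.5 * (t / h)**2) / (h * np.sqrt(2.0 * np.pi))
    return np.convolve(N, K, mode="same") / n    # f_tilde(g_x) = 1/n sum_j N_j K_h(g_x - g_j)
\end{verbatim}
For large grids the call to \texttt{np.convolve} can be replaced by an FFT-based convolution.
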
@@ -4,6 +4,38 @@
% Gauss Blur Filter
% Repetitive Box filter to approx Gauss
% Simple multipass, n/m approach, extended box filter

The Gaussian filter is a popular filter which smooths a signal by convolving the input signal with the Gaussian function
\begin{equation}
\label{eq:gausfx}
g(x) = \frac{1}{\sigma \sqrt{2\pi}} \expp{-\frac{x^2}{2\sigma^2}} \text{,}
\end{equation}
where $\sigma$ is a smoothing parameter called the standard deviation.

In the discrete case the Gaussian filter is easily computed with the sliding-window algorithm in the time domain.
A convolution is separable if the filter kernel is separable, i.e. a multidimensional convolution can be computed as successive one-dimensional convolutions with one-dimensional kernels.
Because of $\operatorname{e}^{-(x^2+y^2)} = \operatorname{e}^{-x^2}\cdot\operatorname{e}^{-y^2}$ the Gaussian filter is separable and can easily be applied to multi-dimensional signals.
Separability of a filter can also be used to reduce the number of operations required to compute the filter result.

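To illustrate how separability is exploited, the following sketch blurs a two-dimensional signal with two one-dimensional passes of the same Gaussian kernel; truncating the kernel at three standard deviations is an assumption of the example.
\begin{verbatim}
import numpy as np

def gaussian_filter_2d(img, sigma):
    # separable Gaussian blur: one 1-d convolution per axis instead of a single 2-d convolution
    r = int(np.ceil(3.0 * sigma))
    t = np.arange(-r, r + 1)
    k = np.exp(-0.5 * (t / sigma)**2)
    k /= k.sum()                                 # normalized 1-d Gaussian kernel
    blur = lambda v: np.convolve(v, k, mode="same")
    rows = np.apply_along_axis(blur, 1, img)     # filter along the rows
    return np.apply_along_axis(blur, 0, rows)    # then along the columns
\end{verbatim}
For a $K\times K$ kernel the separable computation needs roughly $2K$ instead of $K^2$ operations per output value.
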
% TODO similarity between Gaussian and KDE -> fast Gaussian = fast KDE

Computing the filter with a naive implementation of the sliding-window algorithm costs $\landau{NK}$ operations, where $N$ is the length of the input signal and $K$ is the size of the filter kernel.
Note that in the case of the Gaussian filter $K$ depends on $\sigma$.
In order to capture all significant values of the Gaussian function the kernel size $K$ must be adapted to the standard deviation of the Gaussian.

Another approach to compute a filter result efficiently is FFT convolution, which is an $\landau{N\log N}$ operation.
For large values of $\sigma$ the computation time of the Gaussian filter might be reduced by applying the filter in the frequency domain.
Both signals are transformed into the frequency domain using the FFT.
The filtered result is obtained by point-wise multiplication of the transformed signals and a subsequent inverse transform.
In the case of the Gaussian filter the forward transform of the kernel can be saved, as the Gaussian is an eigenfunction of the Fourier transform.

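A sketch of the frequency-domain variant is given below; since the Fourier transform of a Gaussian is again a Gaussian, the transfer function is written down analytically instead of transforming a sampled kernel.
The sketch assumes a periodic signal, because FFT filtering performs a circular convolution.
\begin{verbatim}
import numpy as np

def gaussian_filter_fft(signal, sigma):
    # FFT, point-wise multiplication with the Gaussian transfer function, inverse FFT
    x = np.asarray(signal, dtype=float)
    f = np.fft.rfftfreq(x.size)                        # frequencies in cycles per sample
    transfer = np.exp(-2.0 * (np.pi * f * sigma)**2)   # analytic transform of the Gaussian kernel
    return np.fft.irfft(np.fft.rfft(x) * transfer, n=x.size)
\end{verbatim}
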
While the above-mentioned algorithms pose efficient schemes to compute an exact filter result, approximative algorithms can speed up the computation even further.
A well-known rapid approximation of the Gaussian filter is given by the moving average filter.

\subsection{Moving Average Filter}
The moving average filter is a simple filter which takes an input function $x$ and produces a second function $y$.
A single output value is computed by taking the average of a number of input values symmetric around a single point.
The number of values in the average can also be seen as the width $w=2r+1$, where $r$ is the \qq{radius} of the filter.
@@ -13,10 +45,65 @@ The computation of an output value using a moving average filter of radius $r$ i
y[i]=\frac{1}{2r+1} \sum_{j=-r}^{r}x[i+j] \text{.}
\end{equation}

% TODO O(N) complexity

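Regarding the TODO above, \eqref{eq:symMovAvg} can be computed in $\landau{N}$ independently of the radius by means of a cumulative (running) sum; clamping the signal at the borders is an assumption of this sketch.
\begin{verbatim}
import numpy as np

def moving_average(x, r):
    # O(N) box filter of width w = 2r+1 using a cumulative sum
    x = np.asarray(x, dtype=float)
    pad = np.concatenate((np.repeat(x[0], r), x, np.repeat(x[-1], r)))  # clamp the borders
    csum = np.concatenate(([0.0], np.cumsum(pad)))
    return (csum[2 * r + 1:] - csum[:-(2 * r + 1)]) / (2 * r + 1)
\end{verbatim}
Each output value then costs only a constant number of additions and subtractions, regardless of $r$.
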
It is well known that a moving average filter can approximate a Gaussian filter by repeated application.
Equation~\eqref{eq:symMovAvg} is equal to a convolution of the input signal with a rectangular function.
By the central limit theorem, repeated convolution of a rectangular function with itself yields a Gaussian in the limit.
Hence, filtering a signal with the moving average filter in several passes approximately corresponds to a Gaussian filter.
In practice three to five iterations are usually enough to obtain a reasonably close Gaussian approximation.
This allows rapid approximations of the Gaussian filter, as the moving average can be computed with a few additions and multiplications, as opposed to the Gaussian filter, where an exponential function has to be evaluated for every output value.

As given in \eqref{eq:gausfx}, the sole parameter of the Gaussian kernel is the standard deviation $\sigma$.
In contrast, the moving average filter is parametrized by its width $w$.
Therefore, in order to approximate the Gaussian filter for a given $\sigma$, a corresponding value of $w$ must be found.
Given $n$ iterations of moving average filters with identical widths, the ideal width $\wideal$, as suggested by Wells~\cite{wells1986efficient}, is
\begin{equation}
\label{eq:boxidealwidth}
\wideal = \sqrt{\frac{12\sigma^2}{n}+1} \text{.}
\end{equation}

In general $\wideal$ can be any real number, but the moving average filter in \eqref{eq:symMovAvg} is restricted to odd integer widths.
Hence the set of possible approximated standard deviations is limited, because the ideal width has to be rounded to the next valid value.
In order to reduce the rounding error, Kovesi~\cite{kovesi2010fast} proposes to perform the passes with box filters of two different widths
\begin{align}
\label{eq:boxwidthtwo}
\begin{split}
w_l &=
\begin{cases}
\floor{\wideal} & \text{if } \floor{\wideal} \text{ is odd} \\
\floor{\wideal} - 1 & \text{else }
\end{cases} \\
w_u &= w_l + 2 \text{.}
\end{split}
\end{align}

Given $w_l$ and $w_u$, the approximation is done by computing $m$ box filters of width $w_l$ followed by $(n-m)$ box filters of width $w_u$, where $0\le m\le n$.
As all other values are known, $m$ can be computed with
\begin{equation}
\label{eq:boxrepeatm}
m=\frac{12\sigma^2-nw_l^2-4nw_l-3n}{-4w_l-4} \text{,}
\end{equation}
rounded to the nearest integer since it counts filter passes.

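Putting \eqref{eq:boxidealwidth} to \eqref{eq:boxrepeatm} together, an approximate Gaussian filter for a given $\sigma$ could be sketched as follows; it reuses the \texttt{moving\_average} sketch from above, and the default of $n=3$ passes is an assumption of the example.
\begin{verbatim}
import numpy as np

def approx_gaussian_filter(x, sigma, n=3):
    # n passes of box filters whose widths are derived from sigma
    w_ideal = np.sqrt(12.0 * sigma**2 / n + 1.0)   # ideal width w_ideal
    w_l = int(np.floor(w_ideal))
    if w_l % 2 == 0:
        w_l -= 1                                   # lower width must be odd
    w_u = w_l + 2
    m = round((12.0 * sigma**2 - n * w_l**2 - 4 * n * w_l - 3 * n)
              / (-4.0 * w_l - 4.0))                # number of passes with the lower width
    y = np.asarray(x, dtype=float)
    for i in range(n):
        r = (w_l if i < m else w_u) // 2           # radius of the current pass
        y = moving_average(y, r)                   # m passes of w_l, then n-m passes of w_u
    return y
\end{verbatim}
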
The approximated sigma as a function of the integer width has a staircase-shaped graph.
By reducing the rounding error the step size of the function is reduced.
However, the overall shape will not change.
\etal{Gwosdek}~\cite{gwosdek2011theoretical} proposed an approach which allows approximating any real-valued $\sigma$.
This is achieved by

@@ -1,2 +1,10 @@
\section{Usage}
% how do we use all of this now? what do we have to pay attention to?

% Using 2D data as an example
% Build a histogram (== bin the data)
% This requires min/max of the data
% Then filter the histogram with a box filter
% - in parallel if possible (SIMD, GPU)
% - separated, in each dimension individually
% Take the maximum of the filter result
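For two-dimensional data the outline above could be sketched as follows; the grid size, the number of passes, the filter radius and all names are illustrative assumptions rather than fixed choices of this work.
\begin{verbatim}
import numpy as np

def most_probable_state(samples, bins=128, n_passes=3, radius=2):
    # samples: array of shape (n, 2)
    # 1) bin the samples into a histogram (min/max per dimension is handled by histogram2d)
    counts, xedges, yedges = np.histogram2d(samples[:, 0], samples[:, 1], bins=bins)
    # 2) smooth the histogram with repeated box filters, separately in each dimension
    w = 2 * radius + 1
    kernel = np.ones(w) / w
    box = lambda v: np.convolve(v, kernel, mode="same")
    for _ in range(n_passes):
        counts = np.apply_along_axis(box, 0, counts)
        counts = np.apply_along_axis(box, 1, counts)
    # 3) take the maximum of the filter result as the mode estimate
    ix, iy = np.unravel_index(np.argmax(counts), counts.shape)
    return 0.5 * (xedges[ix] + xedges[ix + 1]), 0.5 * (yedges[iy] + yedges[iy + 1])
\end{verbatim}
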
@@ -112,7 +112,15 @@
}%
}%
}
\newcommand{\commentByMarkus}[1]{%
\noindent%
\fcolorbox{black}{red}{%
\parbox[t]{0.45\textwidth}{%
\footnotesize%
{\bf Markus:} #1%
}%
}%
}
\newcommand{\docRSSI}{RSSI}
\newcommand{\docTX}{TX}
\newcommand{\docLogDist}{log-distance}