diff --git a/tex/chapters/abstract.tex b/tex/chapters/abstract.tex
index 5c5c6fe..3f741fb 100644
--- a/tex/chapters/abstract.tex
+++ b/tex/chapters/abstract.tex
@@ -1,7 +1,7 @@
\begin{abstract}
It is common practice to use a sample-based representation to solve problems having a probabilistic interpretation. In many real world scenarios one is then interested in finding a \qq{best estimate} of the underlying problem, e.g. the position of a robot.
-This is often done by means of simple parametric point estimator, providing the sample statistics.
+This is often done by means of simple parametric point estimators, providing the sample statistics.
However, in complex scenarios this frequently results in a poor representation, due to multimodal densities and limited sample sizes.
Recovering the probability density function using a kernel density estimation yields a promising approach to solve the state estimation problem, i.e. finding the \qq{real} most probable state, but comes with high computational costs.
diff --git a/tex/chapters/experiments.tex b/tex/chapters/experiments.tex
index 5eafd9b..187282e 100644
--- a/tex/chapters/experiments.tex
+++ b/tex/chapters/experiments.tex
@@ -4,7 +4,7 @@
We now empirically evaluate the accuracy of our boxKDE method, using the mean integrated squared error (MISE).
-The ground truth is given as $N=1000$ synthetic samples drawn from a bivariate mixture normal density $f$
+The ground truth is given by $N=1000$ synthetic samples drawn from a bivariate mixture normal density $f$
\begin{equation}
\begin{split}
\bm{X} \sim & ~\G{\VecTwo{0}{0}}{0.5\bm{I}} + \G{\VecTwo{3}{0}}{\bm{I}} + \G{\VecTwo{0}{3}}{\bm{I}} \\
@@ -21,7 +21,7 @@ Therefore, the particular choice of the ground truth is only of minor importance
\end{figure}
Evaluated at $50^2$ points, the exact KDE is compared to the BKDE, boxKDE, and extended box filter approximations, which are evaluated on a smaller grid with $30^2$ points.
-The MISE between $f$ and the estimates as a function of $h$ are evaluated, and the resulting plot is given in figure~\ref{fig:errorBandwidth}.
+The MISE between $f$ and the estimates is evaluated as a function of $h$, and the resulting plot is given in fig.~\ref{fig:errorBandwidth}.
A minimum error is obtained at $h=0.35$; for larger values, oversmoothing occurs and the modes gradually fuse together.
Both the BKDE and the extended box filter estimate resemble the error curve of the KDE quite well and remain stable.
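For reference, a minimal sketch of how the integrated squared error between the ground truth and an estimate can be approximated on the evaluation grid is given below; the function and parameter names are illustrative and not part of the described implementation. Averaging this quantity over repeated sample draws yields the MISE.
\begin{verbatim}
#include <cstddef>
#include <vector>

// Riemann-sum approximation of the integrated squared error between a
// ground-truth density and an estimate, both evaluated on the same
// equidistant grid. cellArea is the area of one grid cell.
double integratedSquaredError(const std::vector<double>& truth,
                              const std::vector<double>& estimate,
                              double cellArea) {
    double sum = 0.0;
    for (std::size_t i = 0; i < truth.size(); ++i) {
        const double d = truth[i] - estimate[i];
        sum += d * d;
    }
    return sum * cellArea;
}
\end{verbatim}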
@@ -42,7 +42,7 @@ However, both cases do not give a deeper insight of the error behavior of our me
\begin{figure}[t]
%\includegraphics[width=\textwidth,height=6cm]{gfx/tmpPerformance.png}
\input{gfx/perf.tex}
-\caption{Logarithmic plot of the runtime performance with increasing grid size $G$ and bivariate data. The weighted average estimate (blue) performs fastest followed by the boxKDE (orange) approximation. Both the BKDE (red), and the fastKDE (green) are magnitudes slower, especially for $G<10^4$.}\label{fig:performance}
+\caption{Logarithmic plot of the runtime performance with increasing grid size $G$ and bivariate data. The weighted-average estimate (blue) performs fastest, followed by the boxKDE (orange) approximation. Both the BKDE (red) and the fastKDE (green) are orders of magnitude slower, especially for $G<10^3$.}\label{fig:performance}
\end{figure}
% kde, box filter, exbox as a function of h (figure)
@@ -53,18 +53,18 @@ However, both cases do not give a deeper insight of the error behavior of our me
\subsection{Performance}
In the following, we underpin the promising theoretical linear time complexity of our method with empirical time measurements, comparing it to other methods.
All tests are performed on an Intel Core \mbox{i5-7600K} CPU with a frequency of \SI{4.2}{\giga\hertz} and \SI{16}{\giga\byte} of main memory.
-We compare our C++ implementation of the box filter based KDE approximation based on algorithm~\ref{alg:boxKDE} to the \texttt{ks} R package and the fastKDE Python implementation \cite{oBrien2016fast}.
-The \texttt{ks} packages provides a FFT-based BKDE implementation based on optimized C functions at its core.
-With state estimation problems in mind, we additionally provide a C++ implementation of a weighted average estimator.
-As both methods are not using a grid, an equivalent input sample set was used for the weighted average and the fastKDE.
+We compare our C++ implementation of the boxKDE approximation as given in algorithm~\ref{alg:boxKDE} to the \texttt{ks} R package and the fastKDE Python implementation \cite{oBrien2016fast}.
+The \texttt{ks} package provides an FFT-based BKDE implementation with optimized C functions at its core.
+With state estimation problems in mind, we additionally provide a C++ implementation of a weighted-average estimator.
+As neither method uses a grid, an equivalent input sample set was used for the weighted-average estimator and the fastKDE.

-The results for performance comparison are presented in plot \ref{fig:performance}.
+The results of the performance comparison are presented in fig.~\ref{fig:performance}.
% O(N) clearly visible for box KDE and weighted average
The linear complexity of the boxKDE and the weighted average is clearly visible.
% Especially for small G up to 10^3 the box KDE is faster than R and fastKDE, but the WA is clearly faster than all others
Especially for small $G$ up to $10^3$ the boxKDE is much faster than the BKDE and fastKDE.
% With increasingly large G the gap between box KDE and WA grows.
-Nevertheless, the simple weighted average approach performs the fastest and with increasing $G$ the distance to the boxKDE grows constantly.
+Nevertheless, the simple weighted-average approach performs the fastest, and with increasing $G$ the gap to the boxKDE grows steadily.
However, it is obvious that this comes with major disadvantages, such as being prone to multimodalities, as discussed in section \ref{sec:intro}.
% (This could also be because the binning becomes slower with larger G, which I cannot explain! Maybe cache effects)
@@ -74,7 +74,7 @@ Further looking at fig. \ref{fig:performance}, the runtime performance of the BK
% This comes from the FFT. The input for the FFT must always be rounded up to the next power of two.
This behavior is caused by the underlying FFT algorithm.
% Hence the runtime slows down abruptly whenever the input is padded to a new power of two, otherwise it stays constant.
-The FFT approach requires the input to be always rounded up to a power of two, what then causes a constant runtime behavior within those boundaries and a strong performance deterioration at corresponding manifolds.
+The FFT approach requires the input to always be padded up to the next power of two, which causes a constant runtime within those boundaries and a sharp performance drop whenever a new power of two is reached.
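This step-like behavior can be illustrated by the padding rule alone; the following minimal sketch (an illustrative helper, not part of the \texttt{ks} package) computes the length the FFT input is padded to.
\begin{verbatim}
#include <cstdint>

// Smallest power of two >= n: the length the FFT input is padded to.
// All grid sizes between two consecutive powers of two share the same
// FFT cost, so the runtime jumps only when a power of two is crossed.
std::uint64_t nextPowerOfTwo(std::uint64_t n) {
    std::uint64_t p = 1;
    while (p < n) p <<= 1;
    return p;
}
\end{verbatim}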
% The BKDE graph ends at G=4406^2 because for larger G an out-of-memory error is triggered.
The termination of the BKDE graph at $G=4406^2$ is caused by an out-of-memory error in the \texttt{ks} package for even larger $G$.
@@ -85,10 +85,10 @@ Both discussed Gaussian filter approximations, namely box filter and extended bo
While the average runtime over all values of $G$ for the standard box filter is \SI{0.4092}{\second}, the extended one provides an average of \SI{0.4169}{\second}.
To keep the arrangement of fig.~\ref{fig:performance} clear, we only illustrate the results of the boxKDE with the regular box filter.

-The weighted average has the great advantage of being independent of the dimensionality of the input and effortlessly implemented.
+The weighted-average estimator has the great advantage of being independent of the dimensionality of the input and can be implemented effortlessly.
In contrast, the computational cost of the boxKDE approach increases exponentially with the number of dimensions.
However, due to the linear time complexity and the very simple computation scheme, the overall computation time is still sufficiently fast for many applications and much lower than that of other methods.
-The boxKDE approach presents a reasonable alternative to the weighted average and is easily integrated into existing systems.
+The boxKDE approach presents a reasonable alternative to the weighted-average estimator and is easily integrated into existing systems.
In addition, modern CPUs benefit from the recursive computation scheme of the box filter, as the data exhibits a high degree of spatial locality in memory and the accesses are reliably predictable.
Furthermore, the computation is easily parallelized, as there is no data dependency between the one-dimensional filter passes in algorithm~\ref{alg:boxKDE}.
diff --git a/tex/chapters/introduction.tex b/tex/chapters/introduction.tex
index 181b574..f915131 100644
--- a/tex/chapters/introduction.tex
+++ b/tex/chapters/introduction.tex
@@ -4,8 +4,7 @@
Sensor fusion approaches are often based upon probabilistic descriptions like particle filters, using samples to represent the distribution of a dynamical system.
To update the system recursively in time, probabilistic sensor models process the noisy measurements and a state transition function provides the system's dynamics.
Therefore, a sample or particle is a representation of one possible system state, e.g. the position of a pedestrian within a building.
-In most real world scenarios one is then interested in finding the most probable state within the state space, to provide the \qq{best estimate} of the underlying problem.
-Generally speaking, solving the state estimation problem.
+In most real world scenarios one is then interested in finding the most probable state within the state space, to provide the best estimate of the underlying problem; generally speaking, this means solving the state estimation problem.
In the discrete manner of a sample representation this is often done by providing a single value, also known as a sample statistic, to serve as a \qq{best guess}.
This value is then calculated by means of simple parametric point estimators, e.g. the weighted-average over all samples, the sample with the highest weight, or by assuming other parametric statistics like normal distributions \cite{Fetzer2016OMC}.
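As a concrete illustration of such a point estimator, a minimal sketch of a weighted-average estimate over a particle set follows; the types and names are illustrative and not taken from the cited implementations.
\begin{verbatim}
#include <vector>

struct Particle {
    double x, y;    // state, e.g. a 2-D position
    double weight;  // importance weight
};

// Weighted-average point estimate over all particles.
// A minimal sketch; a real system would also guard against W == 0.
Particle weightedAverage(const std::vector<Particle>& particles) {
    double W = 0.0, x = 0.0, y = 0.0;
    for (const Particle& p : particles) {
        W += p.weight;
        x += p.weight * p.x;
        y += p.weight * p.y;
    }
    return {x / W, y / W, W};
}
\end{verbatim}
For a bimodal posterior this estimate can fall between the modes, which is exactly the drawback discussed next.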
%there have to be other methods... darn it... but fundamentally a weighted average is a point estimator, right? (https://www.statlect.com/fundamentals-of-statistics/point-estimation)
@@ -17,9 +16,9 @@ As a result, those techniques are not able to provide an accurate statement abou
For example, in a localization scenario where a bimodal distribution represents the current posterior, a reliable position estimate is more likely to be at one of the modes, instead of somewhere in-between, as provided by a simple weighted-average estimation.
Additionally, in most practical scenarios the sample size and therefore the resolution are limited, causing the variance of the sample based estimate to be high \cite{Verma2003}.
-It is obvious, that a computation of the full posterior could solve the above, but finding such an analytical solution is an intractable problem, what is the reason for applying a sample representation in the first place.
+It is obvious that a computation of the full posterior could solve the above, but finding such an analytical solution is an intractable problem, which is the reason for applying a sample representation in the first place.
Another promising way is to recover the probability density function from the sample set itself, by using a non-parametric estimator such as the kernel density estimator (KDE).
-With this, it is easy to find the \qq{real} most probable state and thus to avoid the aforementioned drawbacks.
+With this, it is easy to recover the \qq{real} most probable state and thus to avoid the aforementioned drawbacks.
However, non-parametric estimators tend to consume a large amount of computational time, which renders them impractical for real-time scenarios.
Nevertheless, the availability of a fast density estimation method might improve the accuracy of today's sensor fusion systems without sacrificing their real-time capability.
@@ -34,7 +33,7 @@ By the central limit theorem, multiple recursion of a box filter yields an appro
This process converges quite fast to a reasonably close approximation of the ideal Gaussian.
In addition, a box filter can be computed extremely fast by a computer, due to its intrinsic simplicity.
While the idea to use several box filter passes to approximate a Gaussian has been around for a long time, the application to obtain a fast KDE is new.
-Especially in time critical and time sequential sensor fusion scenarios, the here presented approach outperforms other state of the art solutions, due to a fully linear complexity \landau{N} and a negligible overhead, even for small sample sets.
+Especially in time-critical and time-sequential sensor fusion scenarios, the approach presented here outperforms other state-of-the-art solutions, due to its fully linear complexity and negligible overhead, even for small sample sets.
In addition, it requires only a few elementary operations and is highly parallelizable.
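To make the core idea tangible, the following minimal sketch shows a single box filter pass over a one-dimensional signal together with its repeated application; it illustrates the principle only, not the boxKDE procedure of algorithm~\ref{alg:boxKDE}, and it assumes zero values outside the signal.
\begin{verbatim}
#include <cstddef>
#include <vector>

// One pass of a centered box filter of odd width w over a 1-D signal.
// The running sum makes each pass O(n), independent of the width w.
std::vector<double> boxPass(const std::vector<double>& in, long w) {
    const long n = static_cast<long>(in.size());
    const long r = w / 2;
    std::vector<double> out(in.size(), 0.0);
    double sum = 0.0;
    for (long i = 0; i <= r && i < n; ++i) sum += in[i]; // window at i = 0
    for (long i = 0; i < n; ++i) {
        out[i] = sum / w;                        // normalize to unit gain
        if (i + r + 1 < n) sum += in[i + r + 1]; // sample entering the window
        if (i - r >= 0)    sum -= in[i - r];     // sample leaving the window
    }
    return out;
}

// By the central limit theorem, iterating the pass approaches a Gaussian
// filter; a small number of passes already gives a close approximation.
std::vector<double> approxGaussian(std::vector<double> s, long w, int passes) {
    for (int k = 0; k < passes; ++k) s = boxPass(s, w);
    return s;
}
\end{verbatim}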
diff --git a/tex/chapters/kde.tex b/tex/chapters/kde.tex
index 5e49921..00f035b 100644
--- a/tex/chapters/kde.tex
+++ b/tex/chapters/kde.tex
@@ -1,4 +1,4 @@
-\section{Kernel Density Estimation}
+\section{Kernel Density Estimator}
% KDE by rosenblatt and parzen
% general KDE
% Gauss Kernel
@@ -11,17 +11,17 @@
%The histogram is a simple and for a long time the most used non-parametric estimator.
%However, its inability to produce a continuous estimate dismisses it for many applications where a smooth distribution is assumed.
%In contrast,
-The KDE is often the preferred tool to estimate a density function from discrete data samples because of its ability to produce a continuous estimate and its flexibility.
+The KDE is often the preferred tool to estimate a density function from discrete data samples because of its flexibility and ability to produce a continuous estimate.
%
Let $X=\{X_1, \dots, X_N\}$ be a univariate random sample drawn from the density function $f$, and let $w_1, \dots, w_N$ be associated weights.
The kernel estimator $\hat{f}$, which estimates $f$ at the point $x$, is given as
\begin{equation}
\label{eq:kde}
-\hat{f}(x) = \frac{1}{W} \sum_{i=1}^{N} \frac{w_i}{h} K \left(\frac{x-X_i}{h}\right)
+\hat{f}(x) = \frac{1}{W} \sum_{i=1}^{N} \frac{w_i}{h} K \left(\frac{x-X_i}{h}\right) \text{,}
\end{equation}
where $W=\sum_{i=1}^{N}w_i$ and $h\in\R^+$ is an arbitrary smoothing parameter called bandwidth.
$K$ is a kernel function such that $\int K(u) \dop{u} = 1$.
-In general any kernel can be used, however the general advice is to chose a symmetric and low-order polynomial kernel.
+In general, any kernel can be used; however, common advice is to choose a symmetric, low-order polynomial kernel.
Several popular kernel functions are used in practice, such as the Uniform, Gaussian, Epanechnikov, or Silverman kernel \cite{scott2015}.
While the kernel estimate inherits all the properties of the kernel, it is usually not of crucial importance if a non-optimal kernel is chosen.
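For illustration, a direct evaluation of \eqref{eq:kde} with the Gaussian kernel can be sketched as follows; this naive form costs $\landau{NM}$ for $M$ evaluation points and serves only as a reference, not as the proposed method.
\begin{verbatim}
#include <cmath>
#include <cstddef>
#include <vector>

// Naive weighted KDE of eq. (eq:kde) with a Gaussian kernel,
// evaluated at a single point x in O(N).
double kdeAt(double x, const std::vector<double>& samples,
             const std::vector<double>& weights, double h) {
    const double SQRT_2PI = 2.5066282746310002; // sqrt(2*pi)
    double W = 0.0, sum = 0.0;
    for (std::size_t i = 0; i < samples.size(); ++i) {
        const double u = (x - samples[i]) / h;
        sum += weights[i] / h * std::exp(-0.5 * u * u) / SQRT_2PI;
        W += weights[i];
    }
    return sum / W;
}
\end{verbatim}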
@@ -51,25 +51,25 @@
K_G(u)=\frac{1}{\sqrt{2\pi}} \expp{- \frac{u^2}{2} } \text{.}
\end{equation}
The flexibility of the KDE comes at the expense of computational efficiency, which led to the development of more efficient computation schemes.
-The computation time depends, besides the number of calculated points, on the number of data points $N$.
+Besides the number of evaluation points $M$, the computation time depends on the input size, namely the number of data points $N$.
In general, reducing the size of the sample negatively affects the accuracy of the estimate.
-Still, the sample size is a suitable parameter to speedup the computation.
+Still, the sample size is a suitable parameter to speed up the computation.
The BKDE approximates the KDE by combining each sample with its adjacent samples into bins.
-Each bin represents the count of the sample set at a given point of a equidistant grid with spacing $\delta$.
-A binning rule distributes a sample $x$ among the grid points $g_j=j\delta$, indexed by $j\in\Z$.
+Each bin represents the count of the sample set at a given point of an equidistant grid with spacing $\delta$.
+A binning rule distributes a sample among the grid points $g_j=j\delta$, indexed by $j\in\Z$.
% and can be represented as a set of functions $\{ w_j(x,\delta), j\in\Z \}$.
Computation requires a finite grid on the interval $[a,b]$ containing the data, thus the number of grid points is $G=(b-a)/\delta+1$.
Given a binning rule $r_j$, the BKDE $\tilde{f}$ of a density $f$ computed pointwise at the grid point $g_x$ is given as
\begin{equation}
\label{eq:binKde}
-\tilde{f}(g_x) = \frac{1}{W} \sum_{j=1}^{G} \frac{C_j}{h} K \left(\frac{g_x-g_j}{h}\right)
+\tilde{f}(g_x) = \frac{1}{W} \sum_{j=1}^{G} \frac{C_j}{h} K \left(\frac{g_x-g_j}{h}\right) \text{,}
\end{equation}
where $G$ is the number of grid points and
\begin{equation}
\label{eq:gridCnts}
- C_j=\sum_{i=1}^{n} r_j(x_i,\delta)
+ C_j=\sum_{i=1}^{N} r_j(x_i,\delta)
\end{equation}
is the count at grid point $g_j$, such that $\sum_{j=1}^{G} C_j = W$ \cite{hall1996accuracy}.
@@ -83,7 +83,7 @@ However, for many applications it is recommend to use the simple binning rule
0 & \text{else}
\end{cases}
\end{align}
-or the common linear binning rule which divides the sample into two fractional weights shared by the nearest grid points
+or the common linear binning rule, which divides the sample into two fractional weights shared by the nearest grid points
\begin{align}
\label{eq:linearBinning}
r_j(x,\delta) &=
@@ -94,32 +94,32 @@ or the common linear binning rule which divides the sample into two fractional w
\end{align}
An advantage of these rules is that their impact on the approximation error has been extensively investigated and is well understood \cite{hall1996accuracy}.
Both methods can be computed with a fast $\landau{N}$ algorithm, as simple binning is essentially the quotient of an integer division, and the fractional weights of the linear binning are given by the remainder of the division.
-As linear binning is more precise it is often preferred over simple binning \cite{fan1994fast}.
+As linear binning is more precise, it is often preferred over simple binning \cite{fan1994fast}.
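As an illustration, linear binning can be sketched as follows for samples within $[a,b]$; the remainder of the division directly yields the fractional weights (the names are illustrative, not from the described implementation).
\begin{verbatim}
#include <cstddef>
#include <vector>

// Linear binning: each sample splits its weight between the two
// nearest grid points g_j = a + j*delta. Samples are assumed to lie
// in [a, b], so pos is in [0, G-1].
std::vector<double> linearBinning(const std::vector<double>& samples,
                                  const std::vector<double>& weights,
                                  double a, double delta, std::size_t G) {
    std::vector<double> counts(G, 0.0);
    for (std::size_t i = 0; i < samples.size(); ++i) {
        const double pos = (samples[i] - a) / delta;      // fractional index
        const std::size_t j = static_cast<std::size_t>(pos);
        const double frac = pos - static_cast<double>(j); // remainder
        counts[j] += (1.0 - frac) * weights[i];           // lower grid point
        if (j + 1 < G) counts[j + 1] += frac * weights[i]; // upper grid point
    }
    return counts;
}
\end{verbatim}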
-While linear binning improves the accuracy of the estimate the choice of the grid size is of more importance.
+While linear binning improves the accuracy of the estimate, the choice of the grid size is of greater importance.
The number of grid points $G$ determines the trade-off between the approximation error caused by the binning and the computational speed of the algorithm.
-Clearly, a large value of $G$ produces a estimate close to the regular KDE, but requires more evaluations of the kernel compared to a coarser grid.
+Clearly, a large value of $G$ produces an estimate close to the regular KDE, but requires more evaluations of the kernel compared to a coarser grid.
However, it is unknown which particular $G$ gives the best trade-off for any given sample set.
In general, there is no definite answer, because the amount of binning depends on the structure of the unknown density and the sample size \cite{hall1996accuracy}.
A naive implementation of \eqref{eq:binKde} reduces the number of kernel evaluations to $\landau{G^2}$, assuming that $G<N$.
% => Fourier transform
-Kernel density estimation is well known non-parametric estimator, originally described independently by Rosenblatt \cite{rosenblatt1956remarks} and Parzen \cite{parzen1962estimation}.
+The kernel density estimator is a well-known non-parametric estimator, originally described independently by Rosenblatt \cite{rosenblatt1956remarks} and Parzen \cite{parzen1962estimation}.
It has been the subject of extensive research and its theoretical properties are well understood.
A comprehensive reference is given by Scott \cite{scott2015}.
Although classified as non-parametric, the KDE depends on two free parameters, the kernel function and its bandwidth.
@@ -24,7 +24,7 @@ Various methods have been proposed, which can be clustered based on different te
% k-nearest neighbor searching
An obvious way to speed up the computation is to reduce the number of evaluated kernel functions.
-One possible optimization is based on k-nearest neighbour search performed on spatial data structures.
+One possible optimization is based on k-nearest neighbour search, performed on spatial data structures.
These algorithms reduce the number of evaluated kernels by taking the distance between clusters of data points into account \cite{gray2003nonparametric}.
% fast multipole method & Fast Gauss Transform
@@ -38,16 +38,16 @@ They define a Fourier-based filter on the empirical characteristic function of a
The computation time was further reduced by \etal{O'Brien} using a non-uniform fast Fourier transform (FFT) algorithm to efficiently transform the data into Fourier space \cite{oBrien2016fast}.
% binning => FFT
-In general, it is desirable to omit a grid, as the data points do not necessary fall onto equally spaced points.
+In general, it is desirable to omit a grid, as the data points do not necessarily fall onto equally spaced points.
-However, reducing the sample size by distributing the data on a equidistant grid can significantly reduce the computation time, if an approximative KDE is acceptable.
+However, reducing the sample size by distributing the data on an equidistant grid can significantly reduce the computation time, if an approximative KDE is acceptable.
Silverman \cite{silverman1982algorithm} originally suggested to combine adjacent data points into data bins, which results in a discrete convolution structure of the KDE, allowing the estimate to be computed efficiently using an FFT algorithm.
This approximation scheme was later called binned KDE (BKDE) and was extensively studied \cite{fan1994fast} \cite{wand1994fast} \cite{hall1996accuracy}.
-While the FFT algorithm poses an efficient algorithm for large sample sets, it adds an noticeable overhead for smaller ones.
+While the FFT algorithm constitutes an efficient method for large sample sets, it adds a noticeable overhead for smaller ones.
The idea to approximate a Gaussian filter using several box filters was first formulated by Wells \cite{wells1986efficient}.
Kovesi \cite{kovesi2010fast} suggested to use two box filters with different widths to increase accuracy while maintaining the same complexity.
-To eliminate the approximation error completely \etal{Gwosdek} \cite{gwosdek2011theoretical} proposed a new approach called extended box filter.
+To eliminate the approximation error completely, \etal{Gwosdek} \cite{gwosdek2011theoretical} proposed a new approach called the extended box filter.
This work highlights the discrete convolution structure of the BKDE and elaborates its connection to digital signal processing, especially the Gaussian filter.
Accordingly, this results in an equivalence relation between the BKDE and the Gaussian filter.
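To make this equivalence concrete, the following minimal sketch evaluates the BKDE of \eqref{eq:binKde} as a discrete convolution of the bin counts with a sampled Gaussian kernel; it illustrates the relation only and is not the proposed boxKDE algorithm, which replaces the sampled Gaussian with iterated box filter passes.
\begin{verbatim}
#include <cmath>
#include <cstddef>
#include <vector>

// BKDE of eq. (eq:binKde) as a discrete convolution: the bin counts C_j
// are convolved with the Gaussian kernel sampled on the same grid.
// Naive O(G^2) form; iterated box filters turn this into linear time.
std::vector<double> bkdeConvolution(const std::vector<double>& counts,
                                    double delta, double h, double W) {
    const double SQRT_2PI = 2.5066282746310002; // sqrt(2*pi)
    const std::size_t G = counts.size();
    std::vector<double> estimate(G, 0.0);
    for (std::size_t x = 0; x < G; ++x) {
        double sum = 0.0;
        for (std::size_t j = 0; j < G; ++j) {
            const double u = (static_cast<double>(x) -
                              static_cast<double>(j)) * delta / h;
            sum += counts[j] * std::exp(-0.5 * u * u);
        }
        estimate[x] = sum / (W * h * SQRT_2PI);
    }
    return estimate;
}
\end{verbatim}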