2018-02-06 17:14:08 +01:00
parent 00754a00da
commit a7fcf64cfe
2 changed files with 133 additions and 11 deletions

@@ -82,6 +82,7 @@
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{subcaption}
\usepackage{bm} % bold math symbols with \bm
% replacement for the SI package
@@ -105,7 +106,9 @@
\newcommand{\dop} [1]{\ensuremath{ \mathop{\mathrm{d}#1} }}
\newcommand{\R} {\ensuremath{ \mathbf{R} }}
\newcommand{\Z} {\ensuremath{ \mathbf{Z} }}
\newcommand{\expp} [1]{\ensuremath{ \exp \left( #1 \right) }}
\newcommand{\landau}[1]{\ensuremath{ \mathcal{O}\left( #1 \right) }}
\newcommand{\qq} [1]{``#1''}

@@ -10,20 +10,139 @@
The histogram is a simple estimator and was for a long time the most widely used non-parametric estimator.
However, its inability to produce a continuous estimate rules it out for many applications where a smooth distribution is assumed.
In contrast, the KDE is often the preferred tool because of its ability to produce a continuous estimate and because of its flexibility.
Given a multivariate random variable $X=(x_1,\dots ,x_d)$ in $d$ dimensions, a sample of $n$ independently observed realizations is collected in the $n\times d$ matrix $\bm{X}$ defined as \cite[162]{scott2015}
\begin{equation}
\bm{X}=
\begin{pmatrix}
X_1 \\
\vdots \\
X_n \\
\end{pmatrix}
=
\begin{pmatrix}
x_{11} & \dots & x_{1d} \\
\vdots & \ddots & \vdots \\
x_{n1} & \dots & x_{nd}
\end{pmatrix} \text{.}
\end{equation}
The multivariate kernel estimator $\hat{f}$, which defines the estimate pointwise at $\bm{x}=(x_1, \dots, x_d)^T$, is given as \cite[162]{scott2015}
\begin{equation}
\hat{f}(\bm{x}) = \frac{1}{n} \sum_{i=1}^{n} \bm{K_h}(\bm{x} - \bm{X_i})
\end{equation}
where $\bm{h}$ is a smoothing parameter called the bandwidth and $\bm{K_h}$ is a multivariate kernel function.
The bandwidth is given as a vector $\bm{h}=(h_1, \dots, h_d)$, i.e., a different smoothing parameter is used for every dimension.
Valid kernel functions must satisfy $\int \bm{K_h}(\bm{u}) \dop{\bm{u}} = 1$.
A common approach to defining a multivariate kernel function is to apply a univariate kernel $K$ in every dimension and to combine the individual results by a product.
In this way univariate kernel estimators extend directly to multivariate distributions, and the resulting kernel is called a product kernel \cite[162]{scott2015}:
\begin{equation}
\bm{K_h}(\bm{x}-\bm{X_i}) = \frac{1}{h_1 \dots h_d} \prod_{j=1}^{d} K\left( \frac{x_j-x_{ij}}{h_j} \right)
\end{equation}
In practice the multivariate Gaussian kernel is commonly used,
\begin{equation}
K(\bm{u})=\frac{1}{(2\pi)^{d/2}} \expp{-\frac{1}{2} \bm{u}^T \bm{u}} \text{,}
\end{equation}
which factorizes into univariate Gaussian kernels and therefore fits the product form above.
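To make the estimator concrete, the following is a minimal sketch of the product-kernel estimate at a single point, written with NumPy; the function name, the array layout ($n \times d$ sample matrix, bandwidth vector of length $d$) and the choice of the Gaussian kernel are assumptions made purely for this illustration.
\begin{verbatim}
import numpy as np

def product_kernel_kde(X, x, h):
    # X: (n, d) sample matrix, x: (d,) evaluation point, h: (d,) bandwidth vector
    u = (x - X) / h                                 # scaled distances (x_j - x_ij) / h_j
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)    # univariate Gaussian kernel per dimension
    # product over the dimensions, average over the samples, scaled by 1/(h_1 ... h_d)
    return np.mean(np.prod(k, axis=1)) / np.prod(h)
\end{verbatim}
Evaluating the estimate in this direct way costs one pass over all $n$ samples per evaluation point, which is what the binned approximation discussed later in this section avoids.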
In summary, kernel density estimation is a powerful tool to obtain an estimate of an unknown distribution.
It can produce a smooth estimate from a sample without relying on assumptions about the underlying model.
However, a good choice of the bandwidth is non-obvious and has a crucial impact on the quality of the estimate.
In theory it is possible to calculate an optimal bandwidth $h^*$ with respect to the asymptotic mean integrated squared error.
However, doing so requires knowledge of the very density that is to be estimated, which is of course unknown.
Any non-optimal bandwidth causes undersmoothing or oversmoothing.
A small $h$ yields an estimator with large variance and hence leads to undersmoothing.
A large $h$, on the other hand, increases the bias, which leads to oversmoothing \cite[7]{Cybakov2009}.
With an adverse choice of the bandwidth, crucial information such as modality might get smoothed out.
Sophisticated methods are therefore necessary to obtain a good bandwidth from the data.
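As a simple illustration of a data-driven choice, and not a substitute for such methods, the normal reference rule (often called Scott's rule of thumb) assumes an approximately Gaussian density and sets the bandwidth in each dimension from the sample standard deviation $\hat{\sigma}_j$ as
\begin{equation}
h_j = \hat{\sigma}_j \, n^{-1/(d+4)} \text{,} \qquad j=1,\dots,d \text{.}
\end{equation}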
The flexibility of the KDE comes at the expense of computational efficiency, which has led to the development of more efficient computation schemes.
The computation time depends, besides the number of evaluation points, on the number of data points $n$.
In general, reducing the size of the sample negatively affects the accuracy of the estimate.
Still, the sample size is a suitable parameter for speeding up the computation.
To reduce the number of individual data points, adjacent points are combined into a bin.
Usually the data is binned over an equidistant grid.
A kernel estimator which computes the estimate over such a grid is called a binned kernel density estimator (BKDE).
Due to the equally spaced grid, many kernel evaluations are identical and can be reused, which greatly reduces the number of evaluated kernels \cite{fan1994fast}.
This naturally leads to a reduced computation time.
The binned estimator, however, introduces two additional parameters which affect the accuracy of the estimation: the number of grid points and the binning rule.
The choice of the number of grid points is crucial, because it determines the trade-off between the approximation error introduced by the binning and the computational speed of the algorithm \cite{wand1994fast}.
Also, the choice of the binning rule affects the accuracy of the approximation.
There are several possible binning rules \cite{hall1996accuracy}, but the most commonly used are the simple binning rule \eqref{eq:simpleBinning} and the linear binning rule \eqref{eq:linearBinning}.
An advantage of these frequently used binning rules is that their effect on the approximation has been extensively investigated and is well understood \cite{wand1994fast,hall1996accuracy,holmstrom2000accuracy}.
First, the data, i.e. a random sample $X$, has to be assigned to the grid.
A binning rule distributes an observation $x$ among the grid points $g_j=j\delta$ for $j\in\Z$ and can be represented as a set of functions $\{ w_j(x,\delta), j\in\Z \}$.
For the computation a finite grid on an interval $[a,b]$ containing the data is used; the number of grid points is thus $G=(b-a)/\delta+1$.
Equipped with a method to bin the data, the binned kernel estimator $\tilde{f}$ of a density $f$, based on $\bm{X}$, can be computed pointwise at a grid point $g_x$ as
\begin{equation}
\label{eq:binKde}
\tilde{f}(g_x) = \frac{1}{n} \sum_{j=1}^{G} N_j K_h(g_x-g_j)
\end{equation}
where $G$ is the number of grid points and
\begin{equation}
N_j=\sum_{i=1}^{n} w_j(x_i,\delta)
\end{equation}
is the count at grid point $g_j$ \cite{hall1996accuracy}.
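To illustrate how such grid counts can be computed, the following sketch implements linear binning, i.e. every observation is split between its two neighbouring grid points in proportion to its distance from them; the function name and the assumption that all data lie in $[a,b]$ are made up for this example and do not refer to a particular implementation.
\begin{verbatim}
import numpy as np

def linear_bin_counts(x, a, b, G):
    # Grid counts N_j on G equidistant points over [a, b] using linear binning.
    delta = (b - a) / (G - 1)               # grid spacing
    counts = np.zeros(G)
    pos = (x - a) / delta                   # fractional grid position of each observation
    left = np.floor(pos).astype(int)        # index of the left neighbouring grid point
    frac = pos - left                       # distance to the left grid point in grid units
    np.add.at(counts, left, 1.0 - frac)     # weight assigned to the left grid point
    np.add.at(counts, np.minimum(left + 1, G - 1), frac)  # weight for the right grid point
    return counts
\end{verbatim}
Since the two weights of every observation sum to one, the counts $N_j$ sum to $n$ and \eqref{eq:binKde} remains a proper average.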
Once the counts $N_j$ and the kernel values are computed, they need to be combined, which is in fact a discrete convolution \cite{wand1994fast}.
This makes it possible to apply a wide range of well-studied techniques from the field of digital signal processing (DSP).
Often an FFT-based computation scheme is used to compute the estimate efficiently \cite{silverman1982algorithm}\cite[210ff.]{scott2015}.
As already stated, the computational savings are achieved by reducing the number of evaluated kernels.
Even a naive implementation of \eqref{eq:binKde} reduces the number of evaluations to $\landau{G^2}$ \cite{fan1994fast}.
Because of the fixed grid spacing $\delta$, most of the kernel evaluations are identical, since each difference $g_j-g_{j-k}=k\delta$ is independent of $j$ \cite{fan1994fast}.
Therefore, many evaluated kernels can be reused, so that the number of kernel evaluations is reduced to $\landau{G}$ \cite{fan1994fast}.
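The reuse of kernel values and the convolution structure can be made explicit with a short sketch: the kernel is evaluated once at the offsets $k\delta$ and then convolved with the grid counts. The Gaussian kernel, its truncation after a few bandwidths, the plain (non-FFT) convolution and the assumption that the grid is larger than the truncated kernel support are simplifications made for this illustration only.
\begin{verbatim}
import numpy as np

def binned_kde_on_grid(counts, delta, h, radius=4.0):
    # Evaluate the binned estimator at every grid point by convolving the
    # grid counts N_j with the kernel values K_h(k * delta).
    n = counts.sum()                               # sample size
    L = int(np.ceil(radius * h / delta))           # truncate the kernel at |k*delta| <= radius*h
    offsets = np.arange(-L, L + 1) * delta         # grid offsets k * delta
    kernel = np.exp(-0.5 * (offsets / h)**2) / (np.sqrt(2 * np.pi) * h)
    # one discrete convolution yields the estimate at all G grid points
    return np.convolve(counts, kernel, mode="same") / n
\end{verbatim}
For large grids the same convolution can be carried out with an FFT-based scheme, as in the references above.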
While the estimate can be computed efficiently, the question remains how large the grid should be chosen.
Because the computation time heavily depends on the grid size, it is desirable to choose a grid as small as possible without losing too much accuracy.
In general there is no definite answer, because the appropriate amount of binning depends on the structure of the unknown density and on the sample size.
The roughness of the unknown density directly affects the required grid size.
Coarser grids allow a greater speedup but at the same time might conceal important details of the unknown density \cite{wand1994fast}.
%Given $n$ independently observed realizations of the observation set $X=(x_1,\dots,x_n)$, the kernel density estimate $\hat{f}_n$ of the density function $f$ of the underlying distribution is given with
%\begin{equation}
%\label{eq:kde}
%\hat{f}_n = \frac{1}{nh} \sum_{i=1}^{n} K \left( \frac{x-X_i}{h} \right) \text{,} %= \frac{1}{n} \sum_{i=1}^{n} K_h(x-x_i)
%\end{equation}
%where $K$ is the kernel function and $h\in\R^+$ is an arbitrary smoothing parameter called bandwidth.
%While any density function can be used as the kernel function $K$ (such that $\int K(u) \dop{u} = 1$), a variety of popular choices of the kernel function $K$ exits.
%In practice the Gaussian kernel is commonly used:
%\begin{equation}
%\label{eq:gausKern}
%K(u)=\frac{1}{\sqrt{2\pi}} \expp{- \frac{u^2}{2} }
%\end{equation}
%
%\begin{equation}
%\hat{f}_n = \frac{1}{nh\sqrt{2\pi}} \sum_{i=1}^{n} \expp{-\frac{(x-X_i)^2}{2h^2}}
%\end{equation}