From a7fcf64cfe01fc5ab56b9f246380d682d8009cd9 Mon Sep 17 00:00:00 2001
From: Markus Bullmann
Date: Tue, 6 Feb 2018 17:14:08 +0100
Subject: [PATCH] KDE text

---
 tex/bare_conf.tex    |   3 +
 tex/chapters/kde.tex | 141 +++++++++++++++++++++++++++++++++++++++----
 2 files changed, 133 insertions(+), 11 deletions(-)

diff --git a/tex/bare_conf.tex b/tex/bare_conf.tex
index 14da051..7114ac5 100644
--- a/tex/bare_conf.tex
+++ b/tex/bare_conf.tex
@@ -82,6 +82,7 @@
 \usepackage{algorithm}
 \usepackage{algpseudocode}
 \usepackage{subcaption}
+\usepackage{bm} % bold math symbols with \bm

 % replacement for the SI package

@@ -105,7 +106,9 @@
 \newcommand{\dop}   [1]{\ensuremath{ \mathop{\mathrm{d}#1} }}
 \newcommand{\R}     {\ensuremath{ \mathbf{R} }}
+\newcommand{\Z}     {\ensuremath{ \mathbf{Z} }}
 \newcommand{\expp}  [1]{\ensuremath{ \exp \left( #1 \right) }}
+\newcommand{\landau}[1]{\ensuremath{ \mathcal{O}\left( #1 \right) }}

 \newcommand{\qq}    [1]{``#1''}

diff --git a/tex/chapters/kde.tex b/tex/chapters/kde.tex
index b4e988f..c6724a7 100644
--- a/tex/chapters/kde.tex
+++ b/tex/chapters/kde.tex
@@ -10,20 +10,139 @@
 The histogram is a simple and for a long time the most used non-parametric estimator.
 However, its inability to produce a continuous estimate dismisses it for many applications where a smooth distribution is assumed.
-In contrast, the KDE is often the preferred tool because of its ability to produce a continuous estimate and its flexibility.
-Given $n$ independently observed realizations of the observation set $X=(x_1,\dots,x_n)$, the kernel density estimate $\hat{f}_n$ of the density function $f$ of the underlying distribution is given with
+In contrast, KDE is often the preferred tool because of its ability to produce a continuous estimate and its flexibility.
+
+Consider a multivariate random variable $X=(x_1,\dots,x_d)$ in $d$ dimensions.
+The sample $\bm{X}$ of $n$ observations is an $n\times d$ matrix defined as \cite[162]{scott2015}
 \begin{equation}
-\label{eq:kde}
-\hat{f}_n = \frac{1}{nh} \sum_{i=1}^{n} K \left( \frac{x-X_i}{h} \right) \text{,} %= \frac{1}{n} \sum_{i=1}^{n} K_h(x-x_i)
-\end{equation}
-where $K$ is the kernel function and $h\in\R^+$ is an arbitrary smoothing parameter called bandwidth.
-While any density function can be used as the kernel function $K$ (such that $\int K(u) \dop{u} = 1$), a variety of popular choices of the kernel function $K$ exits.
-In practice the Gaussian kernel is commonly used:
-\begin{equation}
-K(u)=\frac{1}{\sqrt{2\pi}} \expp{- \frac{u^2}{2} }
+	\bm{X}=
+	\begin{pmatrix}
+		X_1 \\
+		\vdots \\
+		X_n \\
+	\end{pmatrix}
+	=
+	\begin{pmatrix}
+		x_{11} & \dots  & x_{1d} \\
+		\vdots & \ddots & \vdots \\
+		x_{n1} & \dots  & x_{nd}
+	\end{pmatrix} \text{.}
 \end{equation}
+The multivariate kernel estimator $\hat{f}$, which defines the estimate pointwise at $\bm{x}=(x_1, \dots, x_d)^T$, is given as \cite[162]{scott2015}
 \begin{equation}
-\hat{f}_n = \frac{1}{nh\sqrt{2\pi}} \sum_{i=1}^{n} \expp{-\frac{(x-X_i)^2}{2h^2}}
+\hat{f}(\bm{x}) = \frac{1}{n} \sum_{i=1}^{n} \bm{K_h}(\bm{x} - \bm{X_i})
+\end{equation}
+where $\bm{h}$ is an arbitrary smoothing parameter called the bandwidth and $\bm{K_h}$ is a multivariate kernel function.
+The bandwidth is given as a vector $\bm{h}=(h_1, \dots, h_d)$, i.e. a different smoothing parameter is used for every dimension.
+
+Valid kernel functions must satisfy $\int \bm{K_h}(\bm{u}) \dop{\bm{u}} = 1$.
+A common approach to define a multivariate kernel function is to apply a univariate kernel in each dimension.
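For example, the univariate Gaussian kernel
\begin{equation}
K(u)=\frac{1}{\sqrt{2\pi}} \expp{- \frac{u^2}{2} }
\end{equation}
can serve as such a per-dimension building block.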
+
+Univariate kernel estimators can easily be extended to multivariate distributions.
+In fact, the univariate kernel is simply applied in every dimension and the individual results are combined by a product.
+A multivariate kernel estimator obtained in this way is therefore called a product kernel estimator \cite[162]{scott2015}.
+The product kernel is given by
+\begin{equation}
+\bm{K_h}(\bm{x}-\bm{X_i}) = \frac{1}{h_1 \cdots h_d} \prod_{j=1}^{d} K\left( \frac{x_j-x_{ij}}{h_j} \right) \text{,}
+\end{equation}
+where $K$ is a univariate kernel function and $x_{ij}$ denotes the $j$-th component of the observation $X_i$.
+
+If the univariate Gaussian kernel is used in every dimension, the product of the kernel factors corresponds to the multivariate Gaussian kernel
+\begin{equation}
+K(\bm{u})=\frac{1}{(2\pi)^{d/2}} \expp{-\frac{1}{2} \bm{u}^T \bm{u}} \text{.}
+\end{equation}
+
+In summary, kernel density estimation is a powerful tool to obtain an estimate of an unknown distribution.
+It produces a smooth estimate from a sample without relying on assumptions about the underlying model.
+However, selecting a good bandwidth is not obvious, and the choice has a crucial impact on the quality of the estimate.
+In theory it is possible to calculate an optimal bandwidth $h^*$ with respect to the asymptotic mean integrated squared error.
+In order to do so, however, the density that is to be estimated would need to be known.
+
+Any non-optimal bandwidth causes undersmoothing or oversmoothing.
+A small $h$ leads to undersmoothing, i.e. an estimate with a large variance.
+A large $h$, on the other hand, increases the bias and leads to oversmoothing \cite[7]{Cybakov2009}.
+With an adverse choice of the bandwidth, crucial information such as the modality might be smoothed out.
+Sophisticated methods are therefore necessary to obtain a good bandwidth from the data.
+
+The flexibility of the KDE comes at the expense of computational efficiency, which has led to the development of more efficient computation schemes.
+Besides the number of points at which the estimate is evaluated, the computation time depends on the number of data points $n$.
+In general, reducing the size of the sample negatively affects the accuracy of the estimate.
+Still, the sample size is a suitable parameter to speed up the computation.
+To reduce the number of individual data points, adjacent points are combined into bins.
+Usually the data is binned over an equidistant grid.
+A kernel estimator which computes the estimate over such a grid is called a binned kernel density estimator (BKDE).
+
+Due to the equally spaced grid, many kernel evaluations are identical and can be reused, which greatly reduces the number of evaluated kernels \cite{fan1994fast}.
+This naturally leads to a reduced computation time.
+
+The binned estimator introduces two additional parameters which affect the accuracy of the estimation: the number of grid points and the binning rule.
+The choice of the number of grid points is crucial, because it determines the trade-off between the approximation error introduced by the binning and the computational speed of the algorithm \cite{wand1994fast}.
+The choice of the binning rule also affects the accuracy of the approximation.
+There are several possible binning rules \cite{hall1996accuracy}, but the most commonly used are the simple binning rule \eqref{eq:simpleBinning} and the linear binning rule \eqref{eq:linearBinning}.
+An advantage of these widely used binning rules is that their effect on the approximation has been extensively investigated and is well understood \cite{wand1994fast}\cite{hall1996accuracy}\cite{holmstrom2000accuracy}.
+
+At first the data, i.e. a random sample $\bm{X}$, has to be assigned to a grid.
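As a small worked example (assuming the usual form of these two rules): for a grid spacing of $\delta=0.5$ and an observation $x=1.3$, simple binning places the full weight on the nearest grid point $g=1.5$, whereas linear binning splits the weight between the two neighbouring grid points,
\begin{equation}
w(1.0) = \frac{1.5-1.3}{0.5} = 0.4 \qquad \text{and} \qquad w(1.5) = \frac{1.3-1.0}{0.5} = 0.6 \text{.}
\end{equation}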
+
+A binning rule distributes a sample point $x$ among the grid points $g_j=j\delta$ for $j\in\Z$ and can be represented as a set of functions $\{ w_j(x,\delta), j\in\Z \}$.
+For the computation a finite grid on the interval $[a,b]$ containing the data is used, so the number of grid points is $G=(b-a)/\delta+1$.
+
+Equipped with a method to bin the data, the binned kernel estimator $\tilde{f}$ of a density $f$, based on $\bm{X}$, can be computed pointwise at a grid point $g_x$ with
+\begin{equation}
+\label{eq:binKde}
+\tilde{f}(g_x) = \frac{1}{n} \sum_{j=1}^{G} N_j K_h(g_x-g_j)
+\end{equation}
+where $G$ is the number of grid points and
+\begin{equation}
+	N_j=\sum_{i=1}^{n} w_j(x_i,\delta)
+\end{equation}
+is the count at grid point $g_j$ \cite{hall1996accuracy}.
+
+Once the counts $N_j$ and the kernel values are computed, they need to be combined, which is, in fact, a discrete convolution \cite{wand1994fast}.
+This makes it possible to apply a wide range of well-studied techniques from digital signal processing (DSP).
+Often an FFT-based computation scheme is used to efficiently compute the estimate \cite{silverman1982algorithm}\cite[210ff.]{scott2015}.
+As already stated, the computational savings are achieved by reducing the number of evaluated kernels.
+A naive implementation of \eqref{eq:binKde} reduces the number of evaluations to $\landau{G^2}$ \cite{fan1994fast}.
+Because of the fixed grid spacing $\delta$, most of the kernel evaluations are the same, as each $g_j-g_{j-k}=k\delta$ is independent of $j$ \cite{fan1994fast}.
+Therefore, many evaluated kernels can be reused, so that the number of kernel evaluations is reduced to $\landau{G}$ \cite{fan1994fast}.
+
+While the estimate can be computed efficiently, it remains open how large the grid should be chosen.
+Because the computation time heavily depends on the grid size, it is desirable to choose a grid as small as possible without losing too much accuracy.
+In general there is no definite answer, because the appropriate amount of binning depends on the structure of the unknown density and on the sample size.
+The roughness of the unknown density directly affects the required grid size.
+Coarser grids allow a greater speedup but at the same time might conceal important details of the unknown density \cite{wand1994fast}.
+
+
+%Given $n$ independently observed realizations of the observation set $X=(x_1,\dots,x_n)$, the kernel density estimate $\hat{f}_n$ of the density function $f$ of the underlying distribution is given with
+%\begin{equation}
+%\label{eq:kde}
+%\hat{f}_n = \frac{1}{nh} \sum_{i=1}^{n} K \left( \frac{x-X_i}{h} \right) \text{,} %= \frac{1}{n} \sum_{i=1}^{n} K_h(x-x_i)
+%\end{equation}
+%where $K$ is the kernel function and $h\in\R^+$ is an arbitrary smoothing parameter called bandwidth.
+%While any density function can be used as the kernel function $K$ (such that $\int K(u) \dop{u} = 1$), a variety of popular choices of the kernel function $K$ exits.
+%In practice the Gaussian kernel is commonly used:
+%\begin{equation}
+%\label{eq:gausKern}
+%K(u)=\frac{1}{\sqrt{2\pi}} \expp{- \frac{u^2}{2} }
+%\end{equation}
+%
+%\begin{equation}
+%\hat{f}_n = \frac{1}{nh\sqrt{2\pi}} \sum_{i=1}^{n} \expp{-\frac{(x-X_i)^2}{2h^2}}
+%\end{equation}
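To make the computation scheme above concrete, the following pseudocode sketches a binned kernel density estimator, assuming a (symmetric) Gaussian kernel and the linear binning rule; it is an illustration of the idea, not the exact algorithm of the cited papers.

\begin{algorithm}
\caption{Sketch of a binned kernel density estimator (illustrative)}
\begin{algorithmic}
\Require sample $x_1,\dots,x_n$, bandwidth $h$, grid $g_1,\dots,g_G$ with spacing $\delta$
\State $N_j \gets 0$ for $j=1,\dots,G$
\For{$i=1,\dots,n$} \Comment{linear binning; assumes $x_i \in [g_1, g_G)$}
    \State find $j$ such that $g_j \le x_i < g_{j+1}$
    \State $N_j \gets N_j + (g_{j+1}-x_i)/\delta$
    \State $N_{j+1} \gets N_{j+1} + (x_i-g_j)/\delta$
\EndFor
\State $\kappa_k \gets K_h(k\delta)$ for $k=0,\dots,G-1$ \Comment{only $G$ kernel evaluations}
\For{$j=1,\dots,G$}
    \State $\tilde{f}(g_j) \gets \frac{1}{n} \sum_{k=1}^{G} N_k\, \kappa_{|j-k|}$ \Comment{discrete convolution; FFT-based in practice}
\EndFor
\end{algorithmic}
\end{algorithm}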