) = \mysum_{i\eq 1}^n - P_i \log_2 P_i
\]}
(also called \defn{entropy} of the prior)
%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\heading{Information contd.}
Suppose we have \mat{$p$} positive and \mat{$n$} negative examples at the root\nl
\mat{$\implies$} \mat{$H(\langle p/(p\!+\!n),n/(p\!+\!n)\rangle)$} bits needed to classify a new example\\
E.g., for 12 restaurant examples, \mat{$p\eq n\eq 6$} so we need 1 bit
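The entropy formula above can be checked numerically; a minimal Python sketch (not part of the original slides):

```python
import math

def entropy(probs):
    """Information content H(<P1,...,Pn>) = sum_i -P_i log2 P_i, in bits."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# 12 restaurant examples with p = n = 6: prior <0.5, 0.5>
print(entropy([0.5, 0.5]))  # 1.0 -- one bit needed at the root
```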
An attribute splits the examples \mat{$E$} into subsets \mat{$E_i$}, each of which
(we hope) needs less information to complete the classification
Let \mat{$E_i$} have \mat{$p_i$} positive and \mat{$n_i$} negative examples\al
\mat{$\implies$} \mat{$H(\langle p_i/(p_i\!+\!n_i),n_i/(p_i\!+\!n_i)\rangle)$} bits needed to classify a new example\al
\mat{$\implies$} \emph{expected} number of bits per example over all branches is
\mat{\[
  \mysum_i \ \ \frac{p_i+n_i}{p+n} \ H(\langle p_i/(p_i\!+\!n_i),n_i/(p_i\!+\!n_i)\rangle)
\]}
For \mat{$Patrons?$} this is 0.459 bits; for \mat{$Type$} it is (still) 1 bit
\mat{$\implies$} choose the attribute that minimizes the remaining information needed
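The expected-remaining-information formula above can be sketched in Python; the per-subset counts below are assumptions consistent with the restaurant data (they reproduce the slide's 0.459- and 1-bit figures exactly):

```python
import math

def entropy(probs):
    """H(<P1,...,Pn>) = sum_i -P_i log2 P_i, in bits."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def remainder(splits):
    """Expected bits per example after a split: sum_i (p_i+n_i)/(p+n) * H(E_i).
    splits: (p_i, n_i) positive/negative counts for each subset E_i."""
    total = sum(p + n for p, n in splits)
    return sum((p + n) / total * entropy([p / (p + n), n / (p + n)])
               for p, n in splits)

# Assumed restaurant-data counts:
# Patrons? -> None: 0+/2-, Some: 4+/0-, Full: 2+/4-
print(round(remainder([(0, 2), (4, 0), (2, 4)]), 3))  # 0.459
# Type -> French: 1+/1-, Italian: 1+/1-, Thai: 2+/2-, Burger: 2+/2-
print(round(remainder([(1, 1), (1, 1), (2, 2), (2, 2)]), 3))  # 1.0
```

Picking the attribute with the smallest remainder is the same as maximizing information gain, since the entropy at the parent node is fixed.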
%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\heading{Example contd.}
Decision tree learned from the 12 examples:
\vspace*{0.2in}
\epsfxsize=0.6\textwidth
\fig{\file{figures}{induced-restaurant-tree.ps}}
Substantially simpler than the ``true'' tree---a more complex hypothesis
isn't justified by the small amount of data
%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\heading{Performance measurement}
How do we know that \mat{$h\approx f$}? (Hume's \emph{Problem of Induction})
1) Use theorems of computational/statistical learning theory
2) Try \mat{$h$} on a new \defn{test set} of examples \al
(use \emph{same distribution over example space} as training set)
\defn{Learning curve} = \% correct on test set as a function of training set size
\vspace*{-0.2in}
\epsfxsize=0.6\textwidth
\fig{\file{graphs}{restaurant-dtl-curve.ps}}
% Points to make:
% 1) Restaurant data; graph averaged over 20 trials
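The curve above can be reproduced in outline: train on growing prefixes of a shuffled training set, score each hypothesis on a fixed held-out test set, and average over trials (the graph averages over 20). A Python sketch, with a toy threshold learner standing in for full decision-tree learning and synthetic data in place of the restaurant examples (all names here are illustrative):

```python
import random

def stump_learner(examples):
    """Toy stand-in for a decision-tree learner on one numeric attribute:
    pick the threshold t minimizing training error for h(x) = (x > t)."""
    best_t, best_err = 0.0, float('inf')
    for t, _ in examples:
        err = sum((x > t) != y for x, y in examples)
        if err < best_err:
            best_t, best_err = t, err
    return lambda x, t=best_t: x > t

def learning_curve(data, learner, trials=20, test_frac=0.25):
    """Fraction correct on a held-out test set as a function of
    training-set size, averaged over `trials` random splits."""
    split = int(len(data) * (1 - test_frac))
    curve = {m: 0.0 for m in range(1, split + 1)}
    for _ in range(trials):
        random.shuffle(data)
        train, test = data[:split], data[split:]
        for m in curve:
            h = learner(train[:m])
            correct = sum(h(x) == y for x, y in test)
            curve[m] += correct / len(test) / trials
    return curve

# Synthetic, realizable target: label = (x > 0.5)
random.seed(0)
data = [(x, x > 0.5) for x in (random.random() for _ in range(40))]
curve = learning_curve(data, stump_learner)
print({m: round(a, 2) for m, a in curve.items() if m in (1, 10, 30)})
```

Because the target is realizable for this learner, accuracy climbs toward 1 as the training set grows, giving the characteristic rising curve.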
%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\heading{Performance measurement contd.}
Learning curve depends on\al
-- \defn{realizable} (can express target function) vs. \defn{non-realizable}\nl
non-realizability can be due to missing attributes \nl
or restricted hypothesis class (e.g., thresholded linear function)\al
-- redundant expressiveness (e.g., loads of irrelevant attributes)
\vspace*{0.2in}
\epsfxsize=0.95\textwidth
\fig{\file{figures}{learning-curves.ps}}
%%%%%%%%%%%% Slide %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\heading{Summary}
Learning needed for unknown environments, lazy designers
Learning agent = performance element + learning element
Learning method depends on type of performance element, available\\
feedback, type of component to be improved, and its representation
For supervised learning, the aim is to find a simple hypothesis\\
that is approximately consistent with training examples
Decision tree learning using information gain
Learning performance = prediction accuracy measured on test set
\end{huge}
\end{document}