ML02-NaiveBayes_LogisticRegression


Machine Learning Study

Week 2, 2020.01.28

3. Naive Bayes Classifier

3.1 What is the optimal classifier?

Recall that the goal of machine learning is to “find a function $f$ whose error is within PAC bounds”.

Definition of the Bayes classifier: the aim of the Bayes classifier is function approximation by error minimization,

$$f^* = \underset{f}{\text{argmin}}\;R(f)=\underset{f}{\text{argmin}}\;P(f(X)\neq Y).$$

Note that even the optimal classifier still has some error, $R(f^*)>0$. We make a decision with a decision boundary, and the area under the probability curves of the classes that were not chosen is the Bayes risk. The formula above is written as function approximation; we can rewrite it as a decision rule,

$$f^*(x)=\underset{Y=y}{\text{argmax}}\;P(Y=y\mid X=x),$$

and by using Bayes' theorem we obtain

$$f^*(x)=\underset{Y=y}{\text{argmax}}\;P(X=x\mid Y=y)P(Y=y).$$

However, it is practically impossible to know $P(X=x\mid Y=y)$ exactly. So we now apply a relaxed assumption, that is,

All independent variables $x_1,x_2,\ldots,x_d$ are conditionally independent given $y$.

3.2 Naive Bayes Classifier

Quick note: conditional independence vs. marginal independence. Conditional independence, $P(X_1,X_2\mid Y)=P(X_1\mid Y)P(X_2\mid Y)$, does not imply marginal independence, $P(X_1,X_2)=P(X_1)P(X_2)$, and the converse does not hold either.

By this assumption, we can approximate the model as follows:

$$P(X=x\mid Y=y)\approx\prod_{1\le i\le d}P(X_i=x_i\mid Y=y)$$

What is the purpose of this assumption? It reduces the number of parameters. Before the assumption, our parameter space was $O(k\cdot 2^d)$, where $k$ is the number of classes and $d$ is the number of independent variables (the dimension of the feature space). After the assumption, the parameter space is reduced to $O(k\cdot d)$; a worked count is sketched below. We now want to formulate the Naive Bayes classifier.
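For a concrete sense of the reduction, here is a worked count with illustrative numbers ($k=2$ classes, $d=30$ binary features); the specific numbers are my own example, not from the lecture.

```latex
% Joint model: one probability per configuration of (x_1, ..., x_d), per class
k\,(2^{d}-1) = 2\,(2^{30}-1) \approx 2.1\times 10^{9} \ \text{parameters}

% Naive Bayes: one Bernoulli parameter per feature, per class
k\,d = 2\cdot 30 = 60 \ \text{parameters}
```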


Given:

The Naive Bayes classifier function is

$$f_{NB}(x)=\underset{Y=y}{\text{argmax}}\;P(Y=y)\prod_{1\le i\le d}P(X_i=x_i\mid Y=y)$$


If the following two conditions are met, the Naive Bayes classifier is the best (Bayes optimal) classifier: the conditional independence assumption actually holds, and the prior $P(Y)$ and the per-feature likelihoods $P(X_i\mid Y)$ are estimated exactly.
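To make the formulation concrete, here is a minimal sketch of a categorical Naive Bayes classifier in Python; the class name, the toy data, and the Laplace-style smoothing constant `alpha` are my own illustrative choices, not part of the lecture.

```python
import numpy as np

class CategoricalNaiveBayes:
    """Minimal Naive Bayes for discrete features: argmax_y P(y) * prod_i P(x_i | y)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (an assumption, not from the notes)

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.priors_ = {c: np.mean(y == c) for c in self.classes_}
        # cond_[c][i][v] approximates P(X_i = v | Y = c), estimated with smoothing
        self.cond_ = {}
        for c in self.classes_:
            Xc = X[y == c]
            tables = []
            for i in range(X.shape[1]):
                values = np.unique(X[:, i])
                col = Xc[:, i]
                tables.append({
                    v: (np.sum(col == v) + self.alpha) / (len(col) + self.alpha * len(values))
                    for v in values
                })
            self.cond_[c] = tables
        return self

    def predict(self, X):
        preds = []
        for x in X:
            scores = {}
            for c in self.classes_:
                # sum of logs instead of a product of probabilities, to avoid underflow
                score = np.log(self.priors_[c])
                for i, v in enumerate(x):
                    score += np.log(self.cond_[c][i][v])
                scores[c] = score
            preds.append(max(scores, key=scores.get))
        return np.array(preds)

# toy usage: two binary features, two classes
X = np.array([[1, 0], [1, 1], [0, 0], [0, 1], [1, 0], [0, 0]])
y = np.array([1, 1, 0, 0, 1, 0])
print(CategoricalNaiveBayes().fit(X, y).predict(np.array([[1, 1], [0, 0]])))
```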

3.3 Problems of the Naive Bayes Classifier

Bag Of Words?

4. Logistic Regression

Recall the plot of the decision boundary and the Bayes risk. Because of the range problem (a linear function would violate the probability axioms) and for risk minimization, we want to use a curve rather than a linear function. The curve is required to have the following properties: it is bounded, differentiable, defined for all real input values, and monotonically increasing.

The sigmoid function satisfies all of these properties, and the logistic function is a special case of a sigmoid function.

4.1 Logistic function



The logistic function is $f(x)={1\over 1+e^{-x}}$, and its inverse is the logit function, $\text{logit}(p)=\log{p\over 1-p}$. We start from the logit. The purpose of this transformation is the following: the logit maps a probability in $(0,1)$ onto the whole real line, so it can be equated with a linear function of the features.

Consider a Bernoulli experiment, $P(y\mid x)=\mu(x)^y(1-\mu(x))^{1-y}$, where $\mu(x)$ is the probability of $y=1$ given $x$. Then, set $\mu(x) = {1\over 1+e^{-\theta^Tx}}={e^{\theta^Tx}\over 1+e^{\theta^Tx}}$.
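Writing out the step from the logit to this logistic form (a standard derivation, included here for completeness):

```latex
% Equate the log-odds with a linear function of x and solve for mu(x)
\log\frac{\mu(x)}{1-\mu(x)} = \theta^T x
\;\Longrightarrow\;
\frac{\mu(x)}{1-\mu(x)} = e^{\theta^T x}
\;\Longrightarrow\;
\mu(x) = \frac{e^{\theta^T x}}{1+e^{\theta^T x}} = \frac{1}{1+e^{-\theta^T x}}
```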

Finally, the goal becomes finding the parameter $\theta$.

4.2 Finding the parameter $\theta$

We will use MCLE (Maximum Conditional Likelihood Estimation). That is, $\hat\theta=\underset{\theta}{\text{argmax}}\prod_{1\le i\le N}P(Y_i\mid X_i;\theta)$. From the previous formula, $P(Y_i\mid X_i;\theta)=\mu(X_i)^{Y_i}(1-\mu(X_i))^{1-Y_i}$, and therefore we maximize the conditional log-likelihood $\sum_{1\le i\le N}\log P(Y_i\mid X_i;\theta)$. We now take the partial derivative with respect to $\theta$ to find the maximizer, but we cannot simplify any further: there is no closed-form solution. Therefore, we use an open-form (iterative) solution and approximate $\theta$.
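For reference, a sketch of the derivative step referred to above (the standard logistic regression result; the indexing $X_{i,j}$ for the $j$-th feature of the $i$-th sample is my own convention):

```latex
% Conditional log-likelihood and its partial derivative
\ell(\theta) = \sum_{i=1}^{N}\Big( Y_i\,\theta^T X_i - \log\big(1+e^{\theta^T X_i}\big) \Big),
\qquad
\frac{\partial \ell(\theta)}{\partial \theta_j}
  = \sum_{i=1}^{N} X_{i,j}\big( Y_i - \mu(X_i) \big),
\quad
\mu(X_i) = \frac{1}{1+e^{-\theta^T X_i}}
```

Setting this derivative to zero couples $\theta$ both linearly and through the sigmoid, which is why there is no closed-form solution.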

4.3 Gradient Method

Now we will use a gradient method to approximate the parameter $\theta$. Before talking about the gradient descent/ascent algorithm, let's do a quick recap of Taylor expansion (Taylor series).

Quick Note for Taylor expansion

A Taylor series is a representation of a function as an infinite sum of terms that are calculated from the values of the function’s derivatives at a single (fixed) point.

That is, we can represent a function $f(x)$ as follows:

$$f(x)=\sum_{n=0}^{\infty}{f^{(n)}(a)\over n!}(x-a)^n=f(a)+{f'(a)\over 1!}(x-a)+{f''(a)\over 2!}(x-a)^2+\cdots$$

The function must be infinitely differentiable at the real or complex number $a$ for the Taylor expansion to exist.

In fact, Taylor expansion is a process of generating the Taylor series.
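A small numerical illustration (my own toy example, not from the notes): approximating $e^x$ around $a=0$ with a truncated Taylor series.

```python
import math

def taylor_exp(x, a=0.0, terms=8):
    """Truncated Taylor series of exp around a: sum_n f^(n)(a)/n! * (x - a)^n.
    For f = exp, every derivative at a equals exp(a)."""
    return sum(math.exp(a) / math.factorial(n) * (x - a) ** n for n in range(terms))

x = 1.5
print(math.exp(x))               # exact value: 4.4816...
print(taylor_exp(x, terms=4))    # coarse approximation with 4 terms
print(taylor_exp(x, terms=10))   # much closer with 10 terms
```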

Gradient Descent/Ascent

The gradient descent/ascent algorithm is as follows.


Algorithm



The idea of the algorithm is that we can represent $f(x)$ by its first-order Taylor expansion,

$$f(x)=f(a)+{f'(a)\over 1!}(x-a)+O\big(\|x-a\|^2\big)$$

Quick Note about Big-O Notation

Definition


For $f, g : \N\rightarrow\R^+$, if there exist constants $k$ and $C$ such that $f(n)\le C\cdot g(n)$ for all $n>k$, we say $f(n)=O(g(n))$

c.f.

For $f,g : \N\rightarrow \R^+$, if there exist constants $k$ and $C$ such that $f(n)\ge C\cdot g(n)$ for all $n>k$ , we say $f(n) = \Omega(g(n))$

If $f(n)=O(g(n))$ and $f(n)=\Omega(g(n))$, we say $f(n)=\Theta(g(n))$


Let me give you some examples:

$2n = O(n^2)$, $2n\neq\Omega(n^2)$, $2n\neq\Theta(n^2)$

$2n = O(n)$, $ 2n=\Omega(n)$, $2n=\Theta(n)$

$2n^2 = O(n^2)$, $2n^2=\Omega(n^2)$, $2n^2=\Theta(n^2)$

$2n^2 \neq O(n)$, $2n^2=\Omega(n)$, $2n^2\neq\Theta(n)$

Plus, the nice thing about Big-O notation is that it keeps only the term that “dominates” the speed of growth; this is why it is called “asymptotic notation”. So we now treat $2n$ and $3n$ “equally”, because $2n=O(n)$ and $3n=O(n)$.

What we want to know is the direction of the move! Set $a=x_1$ and $x=x_1+h\mathbf{u}$, where $\mathbf{u}$ is the unit direction vector and $h$ is the step size. Then we can substitute this into the formula above: $f(x_1+h\mathbf{u})=f(x_1)+hf'(x_1)\mathbf{u}+O(h^2)$. Let's assume that $h$ is a small value; then we can drop the last term, that is, $f(x_1+h\mathbf{u})\approx f(x_1)+hf'(x_1)\mathbf{u}$. We now want to find the direction of the move, $\mathbf{u}$. Our desired direction is the one that minimizes the difference between $f(x_1+h\mathbf{u})$ and $f(x_1)$. Therefore, $\mathbf{u}^*=\underset{\mathbf{u}}{\text{argmin}}\big\{f(x_1+h\mathbf{u})-f(x_1)\big\}=\underset{\mathbf{u}}{\text{argmin}}\;hf'(x_1)\mathbf{u}=-{f'(x_1)\over\mid f'(x_1)\mid}$. The reason $\mathbf{u}^*$ becomes $-{f'(x_1)\over\mid f'(x_1)\mid}$ is that the inner product $f'(x_1)\mathbf{u}$ is smallest when the unit vector $\mathbf{u}$ points exactly opposite to the gradient.

So the gradient descent algorithm updates $x_{t+1}\leftarrow x_t+h\mathbf{u}^*=x_t-h{f'(x_t)\over \mid f'(x_t)\mid}$.
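A minimal sketch of this update rule in Python; the example function $f(x)=(x-3)^2$, the step size, and the iteration count are illustrative choices of mine.

```python
def grad_descent(f_prime, x0, h=0.05, steps=200):
    """Gradient descent with the normalized update x <- x - h * f'(x) / |f'(x)|."""
    x = x0
    for _ in range(steps):
        g = f_prime(x)
        if g == 0:                 # already at a stationary point
            break
        x = x - h * g / abs(g)     # fixed step of size h against the gradient direction
    return x

# minimize f(x) = (x - 3)^2, whose derivative is f'(x) = 2 * (x - 3)
print(grad_descent(lambda x: 2 * (x - 3), x0=-5.0))  # ends up close to 3
```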

To approximate the parameter $\theta$ of logistic regression, i.e., to find $\hat{\theta}$, we need to use the gradient ascent algorithm. That is, we simply take $f(\theta)=\sum_{1\le i \le N}\log(P(Y_i\mid X_i;\theta))$ and move in the direction of the positive gradient of $f(\theta_t)$.

4.4 Find $\theta$ with Gradient Ascent Algorithm

Then the approximation algorithm is as follows: repeat $\theta_j^{t+1}\leftarrow\theta_j^t+h\sum_{1\le i\le N}X_{i,j}\big(Y_i-\mu(X_i;\theta^t)\big)$ for every $j$ until convergence.
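A runnable sketch of this gradient ascent in Python; the toy data, learning rate, and iteration count are illustrative assumptions, and a bias term is handled by appending a constant-one column.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, h=0.1, steps=2000):
    """Maximize the conditional log-likelihood of logistic regression by gradient ascent.
    X: (N, d) feature matrix, y: (N,) labels in {0, 1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        mu = sigmoid(X @ theta)              # mu(X_i; theta) for every sample
        grad = X.T @ (y - mu)                # partial derivatives of the log-likelihood
        theta = theta + h * grad / len(y)    # ascend; dividing by N keeps the step stable
    return theta

# toy 1-D example with a bias column: y tends to be 1 when x > 0
X = np.array([[-2.0, 1], [-1.0, 1], [-0.5, 1], [0.5, 1], [1.0, 1], [2.0, 1]])
y = np.array([0, 0, 0, 1, 1, 1])
theta = fit_logistic(X, y)
print(theta)                           # large positive weight on x, small bias
print(sigmoid(X @ theta).round(2))     # predicted probabilities increase with x
```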


Find $\theta$ of Linear Regression with Gradient Descent!

In lieu of using the least squares solution ($\hat{\theta}=(X^TX)^{-1}X^TY$), we can use the gradient descent algorithm to find the solution. Set $f(\theta) =\sum_{1\le i\le N}\big(Y^i-\sum_{1\le j\le d}X^i_j\theta_j\big)^2$; then ${\partial f(\theta^t)\over\partial\theta^t_k}=-2\sum_{1\le i\le N}X^i_k\big(Y^i-\sum_{1\le j\le d}X^i_j\theta^t_j\big)$. Therefore, we can iteratively move $\theta_k^t$ to a lower value of $f(\theta^t)$ by taking the direction of the negative gradient of $f(\theta^t)$ as follows: $\theta_k^{t+1}\leftarrow\theta_k^t-h{\partial f(\theta^t)\over\partial\theta^t_k}$.
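A short sketch comparing the closed-form least squares solution with this gradient descent iteration; the random toy data, the learning rate, and the iteration count are my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=50), np.ones(50)])  # one feature plus a bias column
true_theta = np.array([2.0, -1.0])
Y = X @ true_theta + rng.normal(scale=0.1, size=50)

# closed-form least squares solution: (X^T X)^{-1} X^T Y
theta_closed = np.linalg.solve(X.T @ X, X.T @ Y)

# gradient descent on f(theta) = sum_i (Y^i - sum_j X^i_j theta_j)^2
theta = np.zeros(2)
h = 0.01
for _ in range(5000):
    grad = -2 * X.T @ (Y - X @ theta)   # partial derivatives of the squared-error loss
    theta = theta - h * grad / len(Y)   # step against the gradient (scaled by N for stability)

print(theta_closed)  # roughly [2, -1]
print(theta)         # should closely match the closed-form solution
```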

4.5 The difference between Naive Bayes and Logistic Regression

Naive Bayes to Logistic Regression

Since we covered naive Bayes with categorical independent variables, we now extend it to a continuous feature space to compare it with logistic regression. For this, we assume that the independent variables follow a Gaussian distribution with parameters $\mu,\sigma^2$, that is, $P(X_i=x\mid Y=y_k)={1\over C\,\sigma^i_k}\exp\!\Big(-{1\over 2}\big({x-\mu^i_k\over\sigma^i_k}\big)^2\Big)$, where $k$ indexes the class and $C=\sqrt{2\pi}$. Let $\pi_1$ be the prior of $Y=y$ (i.e. $P(Y=y)=\pi_1$); the Gaussian naive Bayes model then follows by plugging this density and the prior into the naive Bayes formula.

Recall that the naive Bayes assumption is $P(X\mid Y=y) = \prod_{1\le i \le d}P(X_i\mid Y=y)$. With these two equations we can rewrite the posterior $P(Y=y\mid X)$, and we now add one more assumption, namely that the two classes have the same variance, $\sigma_1^i=\sigma_2^i$. Then the quadratic terms cancel, and finally we obtain a posterior of the form $P(Y=y\mid X)={1\over 1+e^{-\theta^TX}}$, which is the logistic function!
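A compressed version of this derivation, written out for reference (standard Gaussian naive Bayes algebra; the notation follows the symbols above, with classes labeled $1$ and $2$):

```latex
% Rewrite the posterior of class 1 in logistic form
P(Y=1\mid X)
  = \frac{\pi_1 P(X\mid Y=1)}{\pi_1 P(X\mid Y=1) + (1-\pi_1)P(X\mid Y=2)}
  = \frac{1}{1+\exp\!\Big(\log\frac{1-\pi_1}{\pi_1}
      + \sum_{i=1}^{d}\log\frac{P(X_i\mid Y=2)}{P(X_i\mid Y=1)}\Big)}

% With Gaussian class-conditionals and equal variances sigma_1^i = sigma_2^i = sigma^i,
% each log-ratio is linear in X_i:
\log\frac{P(X_i\mid Y=2)}{P(X_i\mid Y=1)}
  = \frac{\mu_2^i-\mu_1^i}{(\sigma^i)^2}\,X_i
    + \frac{(\mu_1^i)^2-(\mu_2^i)^2}{2(\sigma^i)^2}

% Hence the exponent is affine in X, and the posterior is a logistic function of X.
```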

Naive Bayes versus Logistic Regression

Naive Bayes Classifier

Finally, Gaussian naive Bayes becomes a model of the posterior through $P(X\mid Y)$ and $P(Y)$ as above. This equation holds under the following assumptions: the naive Bayes (conditional independence) assumption, Gaussian class-conditional distributions $P(X_i\mid Y)$ with equal variance across classes, and a Bernoulli prior $P(Y)$.

And the number of parameters to estimate is $2kd+1$, where $k=\text{number of classes}, d=\text{number of feature variables}$

Logistic Regression

The number of parameters to estimate is $d+1$
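For instance, with illustrative numbers $k=2$ classes and $d=5$ features (my own example):

```latex
% Gaussian naive Bayes: a mean and a variance per feature per class, plus the class prior
2kd + 1 = 2\cdot 2\cdot 5 + 1 = 21 \ \text{parameters}

% Logistic regression: one weight per feature plus a bias term
d + 1 = 5 + 1 = 6 \ \text{parameters}
```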

It seems that logistic regression is more efficient, but we need to examine the trade-off between the two models (in fact, between all possible classifiers!).

Generative - Discriminative Pair

Generative Model

The characteristic of a generative model is that it models the joint probability. That is, a generative model expands the posterior into the following form: $P(Y\mid X)={P(X\mid Y)P(Y)\over P(X)}\propto P(X\mid Y)P(Y)$, and then models $P(X\mid Y)$ and $P(Y)$ respectively (i.e. estimates the parameters of $P(X\mid Y)$ and $P(Y)$).

Ex) Naive Bayes Classifier

Discriminative Model

The characteristic of a discriminative model is that it models the conditional probability. That is, a discriminative model directly estimates the parameters of $P(Y\mid X)$.

Ex) Logistic Regression
