import pandas as pd
import numpy as np
import requests as rq
from scipy.io import arff
from io import StringIO
url_data = rq.get('http://www.apkc.net/data/weka/contact-lenses.arff').text
data = arff.loadarff(StringIO(url_data))
df = pd.DataFrame(data[0], index=pd.Index(np.arange(24) + 1), dtype='object')
# Convert all byte-string columns to regular strings.
# `np.object` was removed in NumPy 1.24; the builtin `object` is equivalent.
string_df = df.select_dtypes([object]).stack().str.decode('UTF-8').unstack()
for col in string_df:
    df[col] = string_df[col]
df
(4 points) Using the Naïve Bayes method on the contact lenses data, classify the data item: pre-presbyopic, hypermetrope, yes, reduced, ? Then check your solution with Weka (the data file is included with Weka).

| Attribute | Class soft (0.22) | Class hard (0.19) | Class none (0.59) |
|---|---|---|---|
| **age** | | | |
| young | 3.0 | 3.0 | 5.0 |
| pre-presbyopic | 3.0 | 2.0 | 6.0 |
| presbyopic | 2.0 | 2.0 | 7.0 |
| [total] | 8.0 | 7.0 | 18.0 |
| **spectacle-prescrip** | | | |
| myope | 3.0 | 4.0 | 8.0 |
| hypermetrope | 4.0 | 2.0 | 9.0 |
| [total] | 7.0 | 6.0 | 17.0 |
| **astigmatism** | | | |
| no | 6.0 | 1.0 | 8.0 |
| yes | 1.0 | 5.0 | 9.0 |
| [total] | 7.0 | 6.0 | 17.0 |
| **tear-prod-rate** | | | |
| reduced | 1.0 | 1.0 | 13.0 |
| normal | 6.0 | 5.0 | 4.0 |
| [total] | 7.0 | 6.0 | 17.0 |
- All results are reported to 3 significant digits.
- A result is rounded up if its 4th significant digit is >= 5.
- Laplace (add-one) smoothing is applied to avoid *zero-frequency* problems.
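The two conventions above can be sketched in Python. This is a minimal illustration rather than part of the original solution: the helper names `laplace` and `round_sig` are hypothetical, and `n_values` is assumed to be the number of distinct values the attribute (or the class) can take.

```python
from decimal import Decimal, ROUND_HALF_UP
from fractions import Fraction


def laplace(count, class_total, n_values):
    """Add-one estimate: (count + 1) / (class_total + n_values)."""
    return Fraction(count + 1, class_total + n_values)


def round_sig(x, digits=3):
    """Round x to `digits` significant digits, rounding halves up."""
    d = Decimal(str(float(x)))
    if d == 0:
        return d
    # Position of the last digit to keep, relative to the decimal point.
    exponent = d.adjusted() - (digits - 1)
    return d.quantize(Decimal(1).scaleb(exponent), rounding=ROUND_HALF_UP)


# astigmatism=yes never occurs with class soft (0 of 5 instances, 2 values),
# yet the smoothed estimate stays non-zero.
p = laplace(0, 5, 2)
print(p, round_sig(p))  # 1/7 0.143
```

`Decimal` with `ROUND_HALF_UP` is used because Python's built-in `round` rounds halves to even, which would not follow the rule stated above.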

$$P(C \mid f) = \frac{P(f \mid C)\,P(C)}{P(f)}$$

C = Class to predict; f = Feature to use

The formula above handles only one feature/attribute, so we need a version that allows multiple attributes. Naïve Bayes takes the joint probability of all the features **given** a certain class, multiplied by the class prior, and divides it by the evidence (the normalization factor). The evidence is the same for every class, so it does not affect which class scores highest; we replace it with $\alpha = \frac{1}{E}$ and set it aside until we need to compute the probability of each class **given** the features.

**Notation:**
The comma (,) in Bayes' rule acts as the AND operator, that is, the intersection of two or more events: $P(A,B) = P(B,A) = P(A \mid B)P(B) = P(B \mid A)P(A)$

$$P(C_{k} \mid f_{1},...,f_{n}) \propto P(f_{1},...,f_{n},C_{k})$$

Now the numerator $P(f_{1},...,f_{n},C_{k})$ (the joint probability of the features and the class) can be expanded using the chain rule into:

$$P(f_{1},...,f_{n},C_{k}) = P(C_{k}) \prod_{i=1}^{n} P(f_{i} \mid f_{i+1},...,f_{n},C_{k})$$

Written out term by term:

$$P(f_{1},...,f_{n},C_{k}) = P(f_{1} \mid f_{2},...,f_{n},C_{k}) ... P(f_{n-1} \mid f_{n},C_{k}) P(f_{n} \mid C_{k}) P(C_{k})$$

Now if we assume the features are conditionally independent given the class, as in $P(A,B \mid C) = P(A \mid C)P(B \mid C)$, we have that:

$$P(f_{i} \mid f_{i+1},...,f_{n},C_{k}) = P(f_{i} \mid C_{k})$$

Hence:

$$P(C_{k} \mid f_{1},...,f_{n}) = \frac{1}{E} \times P(C_{k}) \prod_{i=1}^{n} P(f_{i} \mid C_{k}) = P(C_{k}) \prod_{i=1}^{n} P(f_{i} \mid C_{k}) \times \alpha$$

Where $E$ is the normalising factor computed using the Law of Total Probability, with $\textbf{f} = (f_{1},...,f_{n})$:

$$E = \sum_{k=1}^{K} P(\textbf{f} \mid C_{k}) P(C_{k})$$
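The role of $E$ can be sketched in a few lines of Python: each class contributes one unnormalised score $P(\textbf{f} \mid C_{k}) P(C_{k})$, and dividing by their sum turns the scores into probabilities. The function name `normalize` and the three scores below are hypothetical, chosen only for illustration.

```python
def normalize(scores):
    """Divide each unnormalised score P(f | C_k) P(C_k) by their sum E."""
    evidence = sum(scores.values())
    if evidence == 0:
        raise ValueError("all scores are zero; cannot normalise")
    return {label: score / evidence for label, score in scores.items()}


# Hypothetical unnormalised scores for three classes.
posteriors = normalize({"a": 0.03, "b": 0.01, "c": 0.06})
for label, p in posteriors.items():
    print(f"{label}: {p:.2f}")
```

Because every score is divided by the same number, the ranking of the classes is unchanged; only the scale changes.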

The discussion so far has derived the independent feature model, that is, the naive Bayes probability model. The naive Bayes classifier combines this model with a decision rule. One common rule is to pick the hypothesis that is most probable; this is known as the *maximum a posteriori* or MAP decision rule. The corresponding classifier, a Bayes classifier, is the function that assigns a class label $\hat{y} = C_k$ for some k as follows:

$$\hat{y} = \underset{k \in \{1, \dots ,K\}}{\operatorname{argmax}} \; P(C_k) \prod_{i=1}^{n} P(f_i \mid C_k) \quad\quad (1)$$
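Rule (1) might be implemented as follows. This is a sketch under stated assumptions: the function name `map_class`, the nested-dictionary layout, and the toy probabilities are all hypothetical, and the scores are summed in log space (a common variation that gives the same argmax as the product) on the assumption that every probability is strictly positive, which Laplace smoothing guarantees.

```python
import math


def map_class(priors, likelihoods, features):
    """Return the class that maximises P(C_k) * prod_i P(f_i | C_k)."""
    def log_score(label):
        total = math.log(priors[label])
        for attribute, value in features.items():
            total += math.log(likelihoods[label][attribute][value])
        return total

    return max(priors, key=log_score)


# Toy model with one attribute and two classes (illustrative numbers only).
priors = {"yes": 0.6, "no": 0.4}
likelihoods = {
    "yes": {"outlook": {"sunny": 0.2, "rainy": 0.8}},
    "no": {"outlook": {"sunny": 0.7, "rainy": 0.3}},
}
print(map_class(priors, likelihoods, {"outlook": "sunny"}))  # no
print(map_class(priors, likelihoods, {"outlook": "rainy"}))  # yes
```

The normalising factor is not needed here: it is identical for every class, so it cannot change which class attains the maximum.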

$P(\text{contact-lenses=none} \mid \text{E}) = P(\text{age=pre-presbyopic} \mid \text{contact-lenses=none}) \times P(\text{spectacle-prescrip=hypermetrope} \mid \text{contact-lenses=none}) \times P(\text{astigmatism=yes} \mid \text{contact-lenses=none}) \times P(\text{tear-prod-rate=reduced} \mid \text{contact-lenses=none}) \times P(\text{contact-lenses=none}) \times \alpha$

$P(\text{contact-lenses=none} \mid \text{E}) = \frac{5+1}{15+3} \times \frac{8+1}{15+2} \times \frac{8+1}{15+2} \times \frac{12+1}{15+2} \times \frac{15+1}{24+3} \times \alpha \approx 0.042337\alpha$

$P(\text{contact-lenses=soft} \mid \text{E}) = P(\text{age=pre-presbyopic} \mid \text{contact-lenses=soft}) \times P(\text{spectacle-prescrip=hypermetrope} \mid \text{contact-lenses=soft}) \times P(\text{astigmatism=yes} \mid \text{contact-lenses=soft}) \times P(\text{tear-prod-rate=reduced} \mid \text{contact-lenses=soft}) \times P(\text{contact-lenses=soft}) \times \alpha$

$P(\text{contact-lenses=soft} \mid \text{E}) = \frac{2+1}{5+3} \times \frac{3+1}{5+2} \times \frac{0+1}{5+2} \times \frac{0+1}{5+2} \times \frac{5+1}{24+3} \times \alpha \approx 0.00097182\alpha$

$P(\text{contact-lenses=hard} \mid \text{E}) = P(\text{age=pre-presbyopic} \mid \text{contact-lenses=hard}) \times P(\text{spectacle-prescrip=hypermetrope} \mid \text{contact-lenses=hard}) \times P(\text{astigmatism=yes} \mid \text{contact-lenses=hard}) \times P(\text{tear-prod-rate=reduced} \mid \text{contact-lenses=hard}) \times P(\text{contact-lenses=hard}) \times \alpha$

$P(\text{contact-lenses=hard} \mid \text{E}) = \frac{1+1}{4+3} \times \frac{1+1}{4+2} \times \frac{4+1}{4+2} \times \frac{0+1}{4+2} \times \frac{4+1}{24+3} \times \alpha \approx 0.0024495\alpha$

The three products are shown to five significant digits so that the final three-digit results are not distorted by early rounding.

Now, since the three posteriors must sum to 1 and $\alpha = \frac{1}{P(E)}$:

$\frac{(0.00097182 + 0.0024495 + 0.042337)}{P(E)} = 1.0 \implies P(E) = (0.00097182 + 0.0024495 + 0.042337) \approx 0.045758$

Now we calculate each individual probability and pick the greatest probability according to (1):

$P(\text{contact-lenses=none} \mid \text{E}) = \frac{0.042337}{0.045758} \approx 92.5\%$

$P(\text{contact-lenses=soft} \mid \text{E}) = \frac{0.00097182}{0.045758} \approx 2.12\%$

$P(\text{contact-lenses=hard} \mid \text{E}) = \frac{0.0024495}{0.045758} \approx 5.35\%$

Weka classifies the entry *pre-presbyopic,hypermetrope,yes,reduced* as being of class **none**, with a probability of 92.5% (0.925).

According to this computation, the instance *pre-presbyopic,hypermetrope,yes,reduced* would be classified as belonging to class **none** with a probability of ~93%.
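The hand computation can be cross-checked with exact fractions. The sketch below assumes the smoothed counts and `[total]` rows from the Weka table earlier in this document, and the smoothed priors $\frac{5+1}{24+3}$, $\frac{4+1}{24+3}$ and $\frac{15+1}{24+3}$ used in the formulas; the variable names are illustrative only.

```python
from fractions import Fraction

# (smoothed count, [total]) for the instance's attribute values,
# copied from the Weka table: pre-presbyopic, hypermetrope, yes, reduced.
counts = {
    "soft": {"age": (3, 8), "spectacle-prescrip": (4, 7),
             "astigmatism": (1, 7), "tear-prod-rate": (1, 7)},
    "hard": {"age": (2, 7), "spectacle-prescrip": (2, 6),
             "astigmatism": (5, 6), "tear-prod-rate": (1, 6)},
    "none": {"age": (6, 18), "spectacle-prescrip": (9, 17),
             "astigmatism": (9, 17), "tear-prod-rate": (13, 17)},
}
# Smoothed priors: (class count + 1) / (24 instances + 3 classes).
priors = {"soft": Fraction(6, 27), "hard": Fraction(5, 27),
          "none": Fraction(16, 27)}

# Unnormalised score per class: prior times the conditional probabilities.
scores = {}
for label, per_attribute in counts.items():
    score = priors[label]
    for count, total in per_attribute.values():
        score *= Fraction(count, total)
    scores[label] = score

evidence = sum(scores.values())  # P(E), by the law of total probability
posteriors = {label: float(s / evidence) for label, s in scores.items()}
best = max(posteriors, key=posteriors.get)

for label, p in posteriors.items():
    print(f"{label}: {p:.3f}")
print("predicted:", best)
```

Run as written, this prints 0.021 for soft, 0.054 for hard and 0.925 for none, and predicts **none**, which agrees with Weka's 0.925.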