CSC 578D / Data Mining / Fall 2018 / University of Victoria

Python Notebook explaining Assignment 01 / Problem 02

The dataset for Assignment #1 is the Weka nominal weather dataset (weather.nominal.arff), which is loaded and displayed below.

The Weka datasets can be found on my personal website at www.apkc.net.

Author: Andreas P. Koenzen akoenzen@uvic.ca
Version: 0.1

In [1]:
import pandas as pd
import numpy as np
import requests as rq

from scipy.io import arff
from io import StringIO
In [3]:
url_data = rq.get('http://www.apkc.net/data/weka/weather.nominal.arff').text
data = arff.loadarff(StringIO(url_data))
df = pd.DataFrame(data[0], index=pd.Index(np.arange(14) + 1), dtype='object')

# Convert all data in the columns to strings instead of binary objects.
# Note: the np.object alias was removed in NumPy 1.24; the builtin object is equivalent.
string_df = df.select_dtypes([object]).stack().str.decode('UTF-8').unstack()
for col in string_df:
    df[col] = string_df[col]
df
Out[3]:
outlook temperature humidity windy play
1 sunny hot high FALSE no
2 sunny hot high TRUE no
3 overcast hot high FALSE yes
4 rainy mild high FALSE yes
5 rainy cool normal FALSE yes
6 rainy cool normal TRUE no
7 overcast cool normal TRUE yes
8 sunny mild high FALSE no
9 sunny cool normal FALSE yes
10 rainy mild normal FALSE yes
11 sunny mild normal TRUE yes
12 overcast mild high TRUE yes
13 overcast hot normal FALSE yes
14 rainy mild high TRUE no

Solution to Problem #2 of Assignment #1:

Problem #2 states the following:

(4 points) Construct two rules using PRISM for the weather data. Show the details of your construction. Then, check your solution with Weka (the data file is included with Weka).

The full set of rules for this exercise is the following:

IF (outlook=overcast)                       THEN yes
IF (humidity=normal)  AND (windy=FALSE)     THEN yes
IF (temperature=mild) AND (humidity=normal) THEN yes
IF (outlook=rainy)    AND (windy=FALSE)     THEN yes
IF (outlook=sunny)    AND (humidity=high)   THEN no
IF (outlook=rainy)    AND (windy=TRUE)      THEN no
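
As a quick sanity check, the rule list above can be read as an ordered decision list (the first matching rule wins) and applied to the 14 training instances. The sketch below is a minimal illustration under that assumption. The names `ROWS`, `RULES` and `classify` are introduced here, and the data is typed inline from the table above rather than loaded from the ARFF file.

```python
# Weather data typed inline (outlook, temperature, humidity, windy, play).
ATTRS = ["outlook", "temperature", "humidity", "windy"]
RAW = """sunny hot high FALSE no
sunny hot high TRUE no
overcast hot high FALSE yes
rainy mild high FALSE yes
rainy cool normal FALSE yes
rainy cool normal TRUE no
overcast cool normal TRUE yes
sunny mild high FALSE no
sunny cool normal FALSE yes
rainy mild normal FALSE yes
sunny mild normal TRUE yes
overcast mild high TRUE yes
overcast hot normal FALSE yes
rainy mild high TRUE no"""
ROWS = [dict(zip(ATTRS + ["play"], line.split())) for line in RAW.splitlines()]

# Each rule is (conditions, predicted class), in the order listed above.
RULES = [
    ({"outlook": "overcast"}, "yes"),
    ({"humidity": "normal", "windy": "FALSE"}, "yes"),
    ({"temperature": "mild", "humidity": "normal"}, "yes"),
    ({"outlook": "rainy", "windy": "FALSE"}, "yes"),
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "rainy", "windy": "TRUE"}, "no"),
]

def classify(row, rules=RULES):
    """Return the class of the first rule whose conditions all hold, else None."""
    for conditions, label in rules:
        if all(row[attr] == value for attr, value in conditions.items()):
            return label
    return None

# Instance numbers (1-based, as in the table) that the rule list gets wrong.
errors = [i for i, row in enumerate(ROWS, start=1) if classify(row) != row["play"]]
print("misclassified instances:", errors)
```

On this data every instance is matched by a rule of its own class, so the list of misclassified instances is empty.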

Notes:

  • 3 significant digits are used for all results.
  • results are rounded up if 4th significant digit is >= 5.
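
The rounding convention above differs from Python's built-in `round`, which works with decimal places and rounds halves to the nearest even digit. A minimal sketch of rounding to 3 significant digits with halves rounded up, using the standard `decimal` module, could look like this. The helper name `round_sig` is introduced here for illustration only.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_sig(numerator, denominator, digits=3):
    """Round numerator/denominator to `digits` significant digits, halves up."""
    value = Decimal(numerator) / Decimal(denominator)
    if value == 0:
        return Decimal(0)
    # adjusted() is the exponent of the most significant digit,
    # so this quantum keeps exactly `digits` significant digits.
    quantum = Decimal(1).scaleb(value.adjusted() - digits + 1)
    return value.quantize(quantum, rounding=ROUND_HALF_UP)

print(round_sig(4, 7))  # 0.571
print(round_sig(2, 5))  # 0.400
print(round_sig(3, 3))  # 1.00
```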

Step #1:

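We construct a rule with an empty antecedent for one class (here, no) and list every possible attribute=value test for that class. Each test is scored as p/t: of the t instances that satisfy the test, p belong to class no.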

Current State:

IF (?) THEN no

Possible Tests/Conditions:

outlook=sunny    3/5  => HIGHEST ACCURACY
outlook=overcast 0/4
outlook=rainy    2/5
temperature=hot  2/4
temperature=mild 2/6
temperature=cool 1/4
humidity=high    4/7
humidity=normal  1/7
windy=FALSE      2/8
windy=TRUE       3/6

From this list we select the condition with the highest accuracy, that is, the highest proportion of class no among the instances that satisfy the test. Here that is outlook=sunny: 3/5 = 0.600 is slightly higher than the 4/7 = 0.571 of humidity=high.

$Coverage(\text{outlook=sunny}) = 5$

$Accuracy(\text{play=no | outlook=sunny}) = 3/5 = 0.600$

The accuracy is below 1, so the rule still covers instances of class yes and we refine it further.


Step #2:

We add a new test to the rule to increase the accuracy. Only the 5 instances with outlook=sunny are considered.

Current State:

IF (outlook=sunny) AND (?) THEN no

Possible Tests/Conditions:

outlook=sunny AND temperature=hot  2/2
outlook=sunny AND temperature=mild 1/2
outlook=sunny AND temperature=cool 0/1
outlook=sunny AND humidity=high    3/3  => HIGHEST ACCURACY
outlook=sunny AND humidity=normal  0/2
outlook=sunny AND windy=FALSE      2/3
outlook=sunny AND windy=TRUE       1/2

Again we select the test with the highest accuracy and add it to the rule. Both temperature=hot and humidity=high reach an accuracy of 1, so we break the tie by choosing the test with the greater coverage, humidity=high.

$Coverage(\text{outlook=sunny AND humidity=high}) = 3$

$Accuracy(\text{play=no | outlook=sunny AND humidity=high}) = 3/3 = 1.00$

We have reached an accuracy of 1.00, so we stop here: the rule covers only instances of class no and cannot be refined further.

Rule #1 is:

IF (outlook=sunny) AND (humidity=high) THEN no
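
Steps #1 and #2 amount to a greedy loop: count p/t for every attribute=value test on the instances the rule still covers, add the most accurate test, and repeat until the accuracy is 1 or no attribute is left. The sketch below is one way to write that loop in plain Python. The names `ROWS` and `grow_rule` are introduced here, the data is typed inline from the table above, and breaking accuracy ties by the larger p is an assumption of this sketch.

```python
from fractions import Fraction

ATTRS = ["outlook", "temperature", "humidity", "windy"]
RAW = """sunny hot high FALSE no
sunny hot high TRUE no
overcast hot high FALSE yes
rainy mild high FALSE yes
rainy cool normal FALSE yes
rainy cool normal TRUE no
overcast cool normal TRUE yes
sunny mild high FALSE no
sunny cool normal FALSE yes
rainy mild normal FALSE yes
sunny mild normal TRUE yes
overcast mild high TRUE yes
overcast hot normal FALSE yes
rainy mild high TRUE no"""
ROWS = [dict(zip(ATTRS + ["play"], line.split())) for line in RAW.splitlines()]

def grow_rule(rows, target, class_attr="play"):
    """Greedily add attribute=value tests until only `target` instances are covered."""
    rule, covered = {}, list(rows)
    while len(rule) < len(ATTRS) and any(r[class_attr] != target for r in covered):
        best = None  # (accuracy, p, attribute, value)
        for attr in ATTRS:
            if attr in rule:
                continue  # each attribute is tested at most once per rule
            for value in sorted({r[attr] for r in covered}):
                subset = [r for r in covered if r[attr] == value]
                p = sum(r[class_attr] == target for r in subset)
                key = (Fraction(p, len(subset)), p)  # accuracy first, then p
                if best is None or key > best[:2]:
                    best = key + (attr, value)
        _, _, attr, value = best
        rule[attr] = value
        covered = [r for r in covered if r[attr] == value]
    return rule, covered

rule, covered = grow_rule(ROWS, "no")
print("conditions:", rule)
print("coverage:", len(covered))
```

Exact fractions are used instead of floats so that equal accuracies such as 2/2 and 3/3 compare as true ties.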

Step #3:

We continue building rules for class no until every instance of that class is covered by some rule.

The dataset looks like this after we exclude the records that are covered by rule #1.

In [12]:
new_df = df.loc[(df['humidity'] != 'high') | (df['outlook'] != 'sunny')]
new_df
Out[12]:
outlook temperature humidity windy play
3 overcast hot high FALSE yes
4 rainy mild high FALSE yes
5 rainy cool normal FALSE yes
6 rainy cool normal TRUE no
7 overcast cool normal TRUE yes
9 sunny cool normal FALSE yes
10 rainy mild normal FALSE yes
11 sunny mild normal TRUE yes
12 overcast mild high TRUE yes
13 overcast hot normal FALSE yes
14 rainy mild high TRUE no

Step #4:

We again construct a rule with an empty antecedent for class no and list all possible tests for that class, now counted only over the 11 instances that rule #1 does not cover.

Current State:

IF (?) THEN no

Possible Tests/Conditions:

outlook=sunny    0/2
outlook=overcast 0/4
outlook=rainy    2/5  => HIGHEST ACCURACY
temperature=hot  0/2
temperature=mild 1/5
temperature=cool 1/4
humidity=high    1/4
humidity=normal  1/7
windy=FALSE      0/6
windy=TRUE       2/5  => HIGHEST ACCURACY

From this list we select the condition with the highest accuracy for class no. Two tests, outlook=rainy and windy=TRUE, tie at 2/5 with the same coverage, so either may be chosen; we select outlook=rainy.

$Coverage(\text{outlook=rainy}) = 5$

$Accuracy(\text{play=no | outlook=rainy}) = 2/5 = 0.400$

The accuracy is below 1, so the rule still covers instances of class yes and we refine it further.
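
The counts in the Step #4 list can be recomputed mechanically from the 11 instances that remain after rule #1. The sketch below tallies p/t for class no over every attribute=value pair using only the standard library. `REMAINING` and `candidate_counts` are names introduced here, and the rows are typed inline from the table in Step #3.

```python
from collections import Counter
from fractions import Fraction

ATTRS = ["outlook", "temperature", "humidity", "windy"]
# The 11 instances left after removing those covered by rule #1.
RAW = """overcast hot high FALSE yes
rainy mild high FALSE yes
rainy cool normal FALSE yes
rainy cool normal TRUE no
overcast cool normal TRUE yes
sunny cool normal FALSE yes
rainy mild normal FALSE yes
sunny mild normal TRUE yes
overcast mild high TRUE yes
overcast hot normal FALSE yes
rainy mild high TRUE no"""
REMAINING = [dict(zip(ATTRS + ["play"], line.split())) for line in RAW.splitlines()]

def candidate_counts(rows, target, class_attr="play"):
    """Map (attribute, value) -> (p, t): target-class hits and total coverage."""
    total, hits = Counter(), Counter()
    for row in rows:
        for attr in ATTRS:
            total[attr, row[attr]] += 1
            hits[attr, row[attr]] += row[class_attr] == target
    return {test: (hits[test], total[test]) for test in total}

counts = candidate_counts(REMAINING, "no")
best = max(Fraction(p, t) for p, t in counts.values())
for (attr, value), (p, t) in counts.items():
    marker = "  => HIGHEST ACCURACY" if Fraction(p, t) == best else ""
    print(f"{attr}={value:<9} {p}/{t}{marker}")
```

Only attribute values that still occur in the remaining instances appear in the output, so a test with zero coverage is not listed.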


Step #5:

We add a new test to the rule to increase the accuracy.

Current State:

IF (outlook=rainy) AND (?) THEN no

Possible Tests/Conditions:

outlook=rainy AND temperature=hot  0/0
outlook=rainy AND temperature=mild 1/3
outlook=rainy AND temperature=cool 1/2
outlook=rainy AND humidity=high    1/2
outlook=rainy AND humidity=normal  1/3
outlook=rainy AND windy=FALSE      0/3
outlook=rainy AND windy=TRUE       2/2  => HIGHEST ACCURACY

Again we select the test with the highest accuracy and add it to the rule. In this case we select windy=TRUE.

$Coverage(\text{outlook=rainy AND windy=TRUE}) = 2$

$Accuracy(\text{play=no | outlook=rainy AND windy=TRUE}) = 2/2 = 1.00$

We have reached an accuracy of 1.00, so we stop here: the rule covers only instances of class no and cannot be refined further.

Rule #2 is:

IF (outlook=rainy) AND (windy=TRUE) THEN no
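
The coverage and accuracy of a finished rule can also be checked against the full training set, not only against the instances that were left when the rule was built. The sketch below does this for both rules. `ROWS` and `rule_stats` are names introduced here, and the data is typed inline from the original table.

```python
ATTRS = ["outlook", "temperature", "humidity", "windy"]
RAW = """sunny hot high FALSE no
sunny hot high TRUE no
overcast hot high FALSE yes
rainy mild high FALSE yes
rainy cool normal FALSE yes
rainy cool normal TRUE no
overcast cool normal TRUE yes
sunny mild high FALSE no
sunny cool normal FALSE yes
rainy mild normal FALSE yes
sunny mild normal TRUE yes
overcast mild high TRUE yes
overcast hot normal FALSE yes
rainy mild high TRUE no"""
ROWS = [dict(zip(ATTRS + ["play"], line.split())) for line in RAW.splitlines()]

def rule_stats(rows, conditions, target, class_attr="play"):
    """Return (p, t) for a conjunctive rule: correct hits and total coverage."""
    covered = [r for r in rows if all(r[a] == v for a, v in conditions.items())]
    p = sum(r[class_attr] == target for r in covered)
    return p, len(covered)

rule_1 = {"outlook": "sunny", "humidity": "high"}
rule_2 = {"outlook": "rainy", "windy": "TRUE"}
for name, rule in (("rule #1", rule_1), ("rule #2", rule_2)):
    p, t = rule_stats(ROWS, rule, "no")
    print(f"{name}: coverage = {t}, accuracy = {p}/{t}")
```

Both rules keep an accuracy of 1 on all 14 instances, with coverage 3 and 2 respectively.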

Final solution:

IF (outlook=sunny) AND (humidity=high) THEN no
IF (outlook=rainy) AND (windy=TRUE) THEN no
...

We need to keep creating rules until every instance is covered, and then add a default (catch-all) rule for any instance that no rule in the set matches. After we exclude the records covered by both rules, the dataset looks like this:

In [15]:
final_df = new_df.loc[(new_df['outlook'] != 'rainy') | (new_df['windy'] != 'TRUE')]
final_df
Out[15]:
outlook temperature humidity windy play
3 overcast hot high FALSE yes
4 rainy mild high FALSE yes
5 rainy cool normal FALSE yes
7 overcast cool normal TRUE yes
9 sunny cool normal FALSE yes
10 rainy mild normal FALSE yes
11 sunny mild normal TRUE yes
12 overcast mild high TRUE yes
13 overcast hot normal FALSE yes

Observation:

With the two rules that we created, we have covered all instances of class no: every remaining instance belongs to class yes.
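
The same procedure can be repeated for class yes to complete the rule set. The sketch below runs the covering loop once per class, restarting from all 14 instances each time. It is a minimal illustration, not Weka's implementation: the names `ROWS`, `grow_rule` and `prism` are introduced here, and the tie-breaking (larger p first, then the first test encountered in attribute order) is an assumption of this sketch.

```python
from fractions import Fraction

ATTRS = ["outlook", "temperature", "humidity", "windy"]
RAW = """sunny hot high FALSE no
sunny hot high TRUE no
overcast hot high FALSE yes
rainy mild high FALSE yes
rainy cool normal FALSE yes
rainy cool normal TRUE no
overcast cool normal TRUE yes
sunny mild high FALSE no
sunny cool normal FALSE yes
rainy mild normal FALSE yes
sunny mild normal TRUE yes
overcast mild high TRUE yes
overcast hot normal FALSE yes
rainy mild high TRUE no"""
ROWS = [dict(zip(ATTRS + ["play"], line.split())) for line in RAW.splitlines()]

def grow_rule(rows, target):
    """Add the most accurate test (ties: larger p, then first seen) until perfect."""
    rule, covered = {}, list(rows)
    while len(rule) < len(ATTRS) and any(r["play"] != target for r in covered):
        best = None  # ((accuracy, p), attribute, value)
        for attr in (a for a in ATTRS if a not in rule):
            for value in sorted({r[attr] for r in covered}):
                subset = [r for r in covered if r[attr] == value]
                p = sum(r["play"] == target for r in subset)
                key = (Fraction(p, len(subset)), p)
                if best is None or key > best[0]:
                    best = (key, attr, value)
        _, attr, value = best
        rule[attr] = value
        covered = [r for r in covered if r[attr] == value]
    return rule, covered

def prism(rows, classes):
    """For each class, restart from all rows and cover that class rule by rule."""
    rules = []
    for target in classes:
        remaining = list(rows)
        while any(r["play"] == target for r in remaining):
            rule, covered = grow_rule(remaining, target)
            rules.append((rule, target))
            remaining = [r for r in remaining if r not in covered]
    return rules

rules = prism(ROWS, ["yes", "no"])
for rule, target in rules:
    tests = " AND ".join(f"({a}={v})" for a, v in rule.items())
    print(f"IF {tests} THEN {target}")
```

With these tie-breaking assumptions, the sketch yields the same six rules that are listed at the top of this notebook, including the two rules for class no derived above.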


END