# CSC 578D / Data Mining / Fall 2018 / University of Victoria

## Python Notebook explaining Assignment 01 / Problem 04

### The dataset for Assignment #1 is the following:

The Weka datasets can be found on my personal website at www.apkc.net.

Author: Andreas P. Koenzen akoenzen@uvic.ca

Version: 0.1

In [22]:
import pandas as pd
import requests as rq

from colorama import Back, Style


### Solution to Problem #4 of Assignment #1:

#### Problem #4 states the following:

(10 points) Implement a Naive Bayes classifier for text classification. This classifier will be used to classify fortune cookie messages into two classes: messages that predict what will happen in the future and messages that just contain a wise saying. We will label messages that predict what will happen in the future as class 1 and messages that contain a wise saying as class 0. For example,

• "Never go in against a Sicilian when death is on the line" would be a message in class 0.
• "You will get an A in SENG 474" would be a message in class 1.

You can use any language you wish. There are two sets of data files provided:

1. The training data:
• traindata.txt: This is the training data consisting of fortune cookie messages.
• trainlabels.txt: This file contains the class labels for the training data.
2. The testing data:
• testdata.txt: This is the testing data consisting of fortune cookie messages.
• testlabels.txt: This file contains the class labels for the testing data. These are only used to determine the accuracy of the classifier.

Your results must be stored in a file called results.txt.

1. Run your classifier by training on traindata.txt and trainlabels.txt, then testing on traindata.txt and trainlabels.txt. Report the accuracy in results.txt (along with a comment saying what files you used for the training and testing data). In this situation, you are training and testing on the same data. This is a sanity check: your accuracy should be very high, i.e. > 90%.
2. Run your classifier by training on traindata.txt and trainlabels.txt, then testing on testdata.txt and testlabels.txt. Report the accuracy in results.txt (along with a comment saying what files you used for the training and testing data). We will not be letting you know beforehand what your performance on the test set should be.

Submit your source code and the results.txt file.
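Before diving into the implementation, the decision rule the problem asks for can be sketched compactly. This is a minimal sketch of Laplace-smoothed multinomial Naive Bayes; `nb_score` and the toy messages below are purely illustrative, not part of the assignment files:

```python
from collections import Counter

def nb_score(doc_terms, prior, term_counts, total_terms, vocab_size):
    """prior * product of Laplace-smoothed term likelihoods (posterior up to a constant)."""
    score = prior
    for t in doc_terms:
        score *= (term_counts.get(t, 0) + 1) / (total_terms + vocab_size)
    return score

# toy corpora: class 1 ("future") saw 'you will win'; class 0 ("wise saying") saw 'patience is wisdom'
counts_1 = Counter('you will win'.split())
counts_0 = Counter('patience is wisdom'.split())
vocab = set(counts_1) | set(counts_0)  # 6 unique terms in total

# equal priors of 0.5; classify the two-word message 'you will'
s1 = nb_score(['you', 'will'], 0.5, counts_1, 3, len(vocab))
s0 = nb_score(['you', 'will'], 0.5, counts_0, 3, len(vocab))
assert s1 > s0  # 'you will' is scored as class 1, as expected
```

The classifier picks the class whose score is largest; the implementation below follows the same prior-times-likelihoods shape.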

### Helper Classes:

In [23]:
class NBClassification(object):
    """
    Class to denote a classification result.
    """

    def __init__(self, label: str, value: float = 0.0):
        self.label: str = label
        self.value: float = value

    def __repr__(self):
        return "{0}<{1}>".format(self.label, self.value)

class NBTerm(object):
    """
    Class to denote a term.
    """

    def __init__(self, term: str, likelihood: float = 0.0):
        self.term: str = term.lower().strip()
        self.likelihood: float = likelihood

    def __repr__(self):
        return "{0}<{1}>".format(self.term, self.likelihood)

class NBDocument(object):
    """
    Class to denote a document.
    """

    USE_FILTERED: bool = False
    """
    boolean: Discontinued option to use the stopword-filtered term list.
    """

    def __init__(self, raw_terms: [NBTerm], filtered_terms: [NBTerm]):
        self.raw_terms: [NBTerm] = raw_terms  # stopwords included
        self.filtered_terms: [NBTerm] = filtered_terms  # stopwords removed

    def __repr__(self):
        s = "\t\t\tTerms: {}\n".format(len(self.get_terms()))
        for t in self.get_terms():
            s += "\t\t\t{}\n".format(t)

        return s

    def get_terms(self):
        """
        Retrieves all terms in a document.

        :return: A List containing ALL terms in a document, including duplicates.
        """
        if NBDocument.USE_FILTERED:
            return self.filtered_terms
        else:
            return self.raw_terms

class NBClass(object):
    """
    Class to denote a classification class.
    """

    def __init__(self, label: str):
        self.label: str = label
        self.documents: [NBDocument] = []
        self.prior: float = 0.0
        self.likelihoods: [NBTerm] = []
        self.name: str = ""
        if self.label == '0':
            self.name = 'Wise Saying'
        elif self.label == '1':
            self.name = 'Future'

    def __repr__(self):
        s = "\tClass Label: {}\n".format(self.label)
        s += "\tDocuments: {}\n".format(len(self.documents))
        for d in self.documents:
            s += "\t\t{}\n".format(d)
        s += "\tPrior: {}\n".format(self.prior)
        s += "\tLikelihoods: {}\n".format(len(self.likelihoods))
        for l in self.likelihoods:
            s += "\t\t{}\n".format(l)

        return s

    def add_create_document(self, message: str) -> None:
        """
        Create and add a document to this class.

        :param message: The message/document to parse and add to this class.

        :return: None
        """
        terms = message.split(' ')  # break the document into terms
        raw_terms = [NBTerm(term=t) for t in terms]
        filtered_terms = raw_terms  # legacy, no use
        self.documents.append(NBDocument(raw_terms=raw_terms, filtered_terms=filtered_terms))

    def compute_likelihood(self, lexicon: [str]) -> None:
        """
        Compute the likelihood for ALL terms in this class and then also for the terms
        that are not in this class, assigning to them a zero-frequency score. For this
        we use the lexicon, which contains UNIQUE terms in all classes.

        :param lexicon: A List containing ALL UNIQUE terms in all classes. No duplicates \
        are allowed.

        :return: None
        """
        # this will include ALL terms in the class, INCLUDING repeated terms!!!
        class_terms = [t.term for d in self.documents for t in d.get_terms()]  # ALL TERMS!!!

        # now for each term in the lexicon compute its likelihood and add it to the list of likelihoods
        # likelihood = (occurrences of term + 1) / (all terms in class + lexicon size)
        for t in lexicon:
            # compute the numerator. add 1 to avoid the zero-frequency problem (Laplace smoothing)
            numerator = class_terms.count(t) + 1
            # compute the denominator. add the lexicon size so the smoothed likelihoods stay normalized
            denominator = len(class_terms) + len(lexicon)
            # add to the likelihood list IF not present
            flag = False
            for e in self.likelihoods:
                if e.term == t:
                    flag = True

            if not flag:
                self.likelihoods.append(NBTerm(term=t, likelihood=(numerator / denominator)))

    def get_likelihood(self, term: str) -> float:
        """
        Returns the likelihood for a particular term.

        :param term: The needle.

        :return: The likelihood as a float if the needle is found, None otherwise.
        """
        for e in self.likelihoods:
            if e.term == term:
                return e.likelihood

    def get_class_lexicon(self) -> [str]:
        """
        Returns the lexicon for a particular class.

        :return: A List of strings containing the lexicon for a class. Remember that in the lexicon \
        the terms are UNIQUE.
        """
        lexicon = []
        for d in self.documents:
            for t in d.get_terms():
                if t.term not in lexicon:
                    lexicon.append(t.term)

        return lexicon

    @staticmethod
    def get_class_name(label: str):
        """
        Returns the name of a class.

        :return: A string containing the name of the class.
        """
        if label == '0':
            return 'Wise Saying'
        elif label == '1':
            return 'Future'

        return 'None'

class NBModel(object):
    """
    Class to denote a model.

    Diagram of a model using encapsulation:
    MODEL
    |-- CLASS 1
    |   |-- DOCUMENT 1
    |   |   |-- TERM 1
    |   |   |-- ...
    |   |   |-- TERM N
    |   |
    |   |-- DOCUMENT N
    |
    |-- CLASS N

    The model was built using encapsulation/objects and lists.
    """

    DEBUG = False
    """
    boolean: Enable/Disable debug info.
    """

    def __init__(self):
        self.classes: [NBClass] = []
        self.lexicon: [str] = []  # vocabulary of UNIQUE words in ALL documents

    def __repr__(self):
        s = "Classes: {}\n".format(len(self.classes))
        for c in self.classes:
            s += "{}\n".format(c)
        s += "Lexicon: {}\n".format(len(self.lexicon))
        s += "{}".format(sorted(self.lexicon))

        return s

    def get_class(self, label: str) -> NBClass:
        """
        Return a particular class from the model.

        :param label: The label of the class.

        :return: A class object matching the label. None if no class is found.
        """
        for c in self.classes:
            if c.label == label:
                return c

        return None

    def calculate_and_update_prior(self, label: str) -> None:
        """
        Compute and update the PRIOR probabilities for a particular class.

        :param label: The label of the class.

        :return: None
        """
        N_c = float(len(self.get_class(label=label).documents))  # number of docs in class
        N = 0.0  # number of docs in all classes
        for c in self.classes:
            N += len(c.documents)

        # update prior
        self.get_class(label=label).prior = N_c / N

        # +++ DEBUG
        if NBModel.DEBUG:
            print("PRIOR for class {0} is {1}.".format(label, N_c / N))
            print("N_c: {0}, N: {1}".format(N_c, N))

    def compute_lexicon(self) -> None:
        """
        Create the lexicon for this model.

        :return: None
        """
        # vocabulary should NOT contain duplicates
        for c in self.classes:
            for d in c.documents:
                for t in d.get_terms():
                    if t.term not in self.lexicon:
                        self.lexicon.append(t.term)

    def compute_likelihood(self) -> None:
        """
        Wrapper function to compute likelihoods. Calls the compute_likelihood() function for each class.

        :return: None
        """
        for c in self.classes:
            c.compute_likelihood(lexicon=self.lexicon)
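As a quick hand-check of the Laplace smoothing performed in `NBClass.compute_likelihood` above (numerator `count + 1`, denominator `len(class_terms) + len(lexicon)`), here is a toy example with invented terms:

```python
# one class's corpus: 5 term occurrences in total (with one repeat)
class_terms = ['luck', 'will', 'come', 'luck', 'soon']
# lexicon: the 5 UNIQUE terms seen across ALL classes
lexicon = ['luck', 'will', 'come', 'soon', 'wisdom']

# likelihood = (count + 1) / (len(class_terms) + len(lexicon))
lik = {t: (class_terms.count(t) + 1) / (len(class_terms) + len(lexicon)) for t in lexicon}

assert lik['luck'] == (2 + 1) / 10          # seen twice in this class
assert lik['wisdom'] == (0 + 1) / 10        # never seen here, but still gets non-zero mass
assert abs(sum(lik.values()) - 1.0) < 1e-9  # smoothed likelihoods form a distribution
```

The add-one numerator is what prevents a single unseen term from zeroing out a whole product during classification.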


### Classification class:

In [24]:
class NaiveBayesTextClassifier(object):
    """
    Text classifier using the Naïve Bayes Classifier. This classifier supports only 2 classes, so it's a
    binary classifier.
    """

    DEBUG = False
    """
    boolean: Enable/Disable debug info.
    """
    SHOW_MODEL = False
    """
    boolean: Enable/Disable printing of the model. LOTS OF INFO!
    """
    MAKE_SUBSET_FOR_TRAINING = False
    """
    boolean: Make a subset of the training dataset for testing purposes.
    """
    TRAINING_SUBSET_SIZE = 2
    """
    integer: Size of the training subset.
    """
    MAKE_SUBSET_FOR_TESTING = False
    """
    boolean: Make a subset of the testing dataset for testing purposes.
    """
    TESTING_SUBSET_SIZE = 2
    """
    integer: Size of the testing subset.
    """
    USE_TRAINING_SET_FOR_TESTING = False
    """
    boolean: If the testing should be done using the training dataset. If True expect high accuracy!
    """

    def __init__(self):
        self.model: NBModel = NBModel()

    def train(self, training_set: [str] = [], debug: bool = False) -> NBModel:
        """
        Train a statistical model.

        :param training_set: A List of training sets. Not in use.
        :param debug: Flag to enable debug info.

        :return: The trained model as a NBModel object.
        """
        # parse the training data and labels and convert them into pandas Series
        training_data = rq.get('http://www.apkc.net/data/csc_578d/assignment01/problem04/traindata.txt').text.splitlines()
        if training_data is not None:
            t_data_series = pd.Series(training_data)

        training_labels = rq.get('http://www.apkc.net/data/csc_578d/assignment01/problem04/trainlabels.txt').text.splitlines()
        if training_labels is not None:
            t_labels_series = pd.Series(training_labels)

        # combine both series into a DataFrame
        t_data_matrix = pd.DataFrame({
            'message': t_data_series,
            'label': t_labels_series
        })

        # make a custom subset of the entire training set for debugging purposes
        if NaiveBayesTextClassifier.MAKE_SUBSET_FOR_TRAINING:
            _0_messages = t_data_matrix.loc[t_data_matrix.label == '0', 'message'][0:NaiveBayesTextClassifier.TRAINING_SUBSET_SIZE]
            _0_labels = ['0' for _ in _0_messages]
            _1_messages = t_data_matrix.loc[t_data_matrix.label == '1', 'message'][0:NaiveBayesTextClassifier.TRAINING_SUBSET_SIZE]
            _1_labels = ['1' for _ in _1_messages]
            # replace the DataFrame
            t_data_matrix = pd.DataFrame({
                'message': pd.concat([
                    pd.Series(list(_0_messages)),
                    pd.Series(list(_1_messages))
                ]),
                'label': pd.concat([
                    pd.Series(_0_labels),
                    pd.Series(_1_labels)
                ])
            })

        # +++ DEBUG
        if NaiveBayesTextClassifier.DEBUG:
            print("DataFrame: (Future: Class 1, Wise Saying: Class 0)")
            print(t_data_matrix)

        # construct the model
        # 1. save classes, documents, terms
        for label in t_data_matrix.label.unique():  # this returns an ndarray
            self.model.classes.append(NBClass(label=label))

            # save all messages for each class
            tmp = t_data_matrix.loc[t_data_matrix.label == label, 'message']
            cls = self.model.get_class(label)
            for _, m in tmp.items():
                cls.add_create_document(message=str(m))

        # 2. calculate priors
        for label in t_data_matrix.label.unique():  # this returns an ndarray
            self.model.calculate_and_update_prior(label)

        # 3. compute lexicon
        self.model.compute_lexicon()

        # 4. compute likelihoods
        self.model.compute_likelihood()

        # +++ DEBUG
        if NaiveBayesTextClassifier.SHOW_MODEL:
            print('')
            print('++++++')
            print(self.model)

        return self.model

    def classify(self, model: NBModel, testing_set: [str] = [], debug: bool = False) -> None:
        """
        Classify instances using a statistical model.

        :param model: The statistical model.
        :param testing_set: A List of testing sets. Not in use.
        :param debug: Flag to enable debug info.

        :return: None
        """
        # parse the testing data and labels and convert them into pandas Series
        testing_data = rq.get(
            "http://www.apkc.net/data/csc_578d/assignment01/problem04/{}.txt".format(
                'traindata' if NaiveBayesTextClassifier.USE_TRAINING_SET_FOR_TESTING else 'testdata'
            )
        ).text.splitlines()
        if testing_data is not None:
            t_data_series = pd.Series(testing_data)

        testing_labels = rq.get(
            "http://www.apkc.net/data/csc_578d/assignment01/problem04/{}.txt".format(
                'trainlabels' if NaiveBayesTextClassifier.USE_TRAINING_SET_FOR_TESTING else 'testlabels'
            )
        ).text.splitlines()
        if testing_labels is not None:
            t_labels_series = pd.Series(testing_labels)

        # combine both series into a DataFrame
        t_data_matrix = pd.DataFrame({
            'message': t_data_series,
            'label': t_labels_series
        })

        # make a subset of the entire testing set for debugging purposes
        if NaiveBayesTextClassifier.MAKE_SUBSET_FOR_TESTING:
            _0_messages = t_data_matrix.loc[t_data_matrix.label == '0', 'message'][0:NaiveBayesTextClassifier.TESTING_SUBSET_SIZE]
            _0_labels = ['0' for _ in _0_messages]
            _1_messages = t_data_matrix.loc[t_data_matrix.label == '1', 'message'][0:NaiveBayesTextClassifier.TESTING_SUBSET_SIZE]
            _1_labels = ['1' for _ in _1_messages]
            # replace the DataFrame
            t_data_matrix = pd.DataFrame({
                'message': pd.concat([
                    pd.Series(list(_0_messages)),
                    pd.Series(list(_1_messages))
                ]),
                'label': pd.concat([
                    pd.Series(_0_labels),
                    pd.Series(_1_labels)
                ])
            })

        print(
            Style.BRIGHT +
            """
==--------------------------------------------==
== SigmaProject v0.1                          ==
== https://github.com/k-zen/SigmaProject      ==
== Author: Andreas Koenzen <akoenzen@uvic.ca> ==
== -------------------------------------------==
== Disclaimer: NOT to be used in production.  ==
==             ONLY for educational purposes. ==
== Belongs to the project SigmaProject autho- ==
== red by me.                                 ==
==--------------------------------------------==
""" +
            Style.RESET_ALL
        )

        # compute the odds for each class
        correct_instances = 0
        for _, r in t_data_matrix.iterrows():
            document = str(r['message'])
            vocabulary = document.split(' ')
            label = r['label']

            # compute probability for each class
            argmax = []
            for c in model.classes:
                factors: str = ""
                v = c.prior
                factors += "{} *".format(v)
                for t in vocabulary:
                    likelihood = c.get_likelihood(term=t)
                    if likelihood is not None:
                        v *= likelihood
                        factors += " {} *".format(likelihood)

                if len(vocabulary) == 0:
                    v = 0

                argmax.append(NBClassification(label=c.label, value=v))

                if NaiveBayesTextClassifier.DEBUG:
                    print("Class {2} => {0} = {1}".format(factors.strip('*'), v, c.label))

            # compute accuracy
            max_label = max(argmax, key=lambda e: e.value).label
            result = Style.BRIGHT + Back.RED + 'INCORRECT' + Style.RESET_ALL
            if max_label == label:
                correct_instances += 1
                result = Style.BRIGHT + Back.GREEN + 'CORRECT' + Style.RESET_ALL

            txt = ''
            txt += "- {} ".format(document)
            txt += Style.BRIGHT + " [{0}:{1}] ".format(NBClass.get_class_name(max_label), max_label) + Style.RESET_ALL
            txt += " {} ".format(result)
            print(txt)

        print(Style.BRIGHT + "=======" + Style.RESET_ALL)
        print(Style.BRIGHT + "RESULT:" + Style.RESET_ALL)
        print(Style.BRIGHT + "> Classifier Accuracy: \"{0}%\"".format((correct_instances / t_data_matrix.shape[0]) * 100) + Style.RESET_ALL)
        print(Style.BRIGHT + "=======" + Style.RESET_ALL)
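One caveat about classify() above: it multiplies many probabilities smaller than 1, which can underflow to 0.0 for very long documents. A common variant, not used in this notebook, scores in log space instead; here is a minimal sketch under the same smoothed-likelihood assumptions (`log_score` is an illustrative helper, not part of the classifier):

```python
import math

def log_score(prior, likelihoods):
    """Sum of logs instead of a product of probabilities: same argmax, no underflow."""
    return math.log(prior) + sum(math.log(p) for p in likelihoods)

# a (hypothetically) long document: 1000 term likelihoods of 0.01 each
probs = [0.01] * 1000
product = 0.5
for p in probs:
    product *= p  # 0.5 * 1e-2000 underflows a 64-bit float to 0.0

assert product == 0.0                 # the plain product is no longer usable
assert log_score(0.5, probs) < -4000  # the log score stays finite and comparable
```

For the short fortune-cookie messages here the plain product is fine, which is presumably why the simpler form was kept.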


### Run the classifier:

The classifier is trained using the traindata.txt and trainlabels.txt provided with the assignment, and testing is done with testdata.txt and testlabels.txt. Both datasets are fetched from http://www.apkc.net/data/csc_578d/assignment01/problem04/.

### Notes:

1. Training and testing on the same set yields an accuracy of ~96.5%; testing on the held-out testing set yields only ~80.19%.
In [26]:
classifier = NaiveBayesTextClassifier()
classifier.classify(model=classifier.train())


==--------------------------------------------==
== SigmaProject v0.1                          ==
== https://github.com/k-zen/SigmaProject      ==
== Author: Andreas Koenzen <akoenzen@uvic.ca> ==
== -------------------------------------------==
== Disclaimer: NOT to be used in production.  ==
==             ONLY for educational purposes. ==
== Belongs to the project SigmaProject autho- ==
== red by me.                                 ==
==--------------------------------------------==

- your nurturing instincts will expand to include many people  [Future:1]  CORRECT
- an interesting musical opportunity is in your near future  [Future:1]  CORRECT
- the best way to get rid of an enemy is to make him a friend  [Wise Saying:0]  CORRECT
- pleasures await you by the seashore  [Future:1]  CORRECT
- you have a keen sense of humor and bring out the best in others  [Wise Saying:0]  CORRECT
- your dearest wish will come true within the month  [Future:1]  CORRECT
- you will soon be part of a team working cooperatively for success  [Future:1]  CORRECT
- you are going to have a very comfortable retirement  [Future:1]  CORRECT
- family is more valuable than money  but you will have both  [Wise Saying:0]  INCORRECT
- this month you will do well to open yourself to new ideas  [Future:1]  CORRECT
- appreciate the caring people who surround you  flowers are good  [Wise Saying:0]  CORRECT
- a four-wheeled adventure will soon bring you happiness  [Future:1]  CORRECT
- the near future holds a gift of contentment  [Future:1]  CORRECT
- a small lucky package is on its way to you soon  [Future:1]  CORRECT
- listen these next few days to your friends to get answers you seek  [Future:1]  CORRECT
- you will do well at making money and holding on to it  [Future:1]  CORRECT
- your talents will be recognized and rewarded  [Future:1]  CORRECT
- now is the best time for you to be spontaneous serendipity  [Wise Saying:0]  CORRECT
- a chance meeting with a stranger may soon change your life  [Future:1]  CORRECT
- you will be transforming a situation in your life now with a positive attitude  [Future:1]  CORRECT
- your hard work is about to pay off congratulations  [Future:1]  CORRECT
- an admirer is too shy to greet you  [Future:1]  INCORRECT
- a pleasant surprise is in store for you tonight  [Future:1]  CORRECT
- investigate new possibilities with friends now is the time  [Wise Saying:0]  CORRECT
- this year your highest priority will be your family  [Future:1]  CORRECT
- use your abilities at this time to stay focused on your goal you will succeed  [Future:1]  CORRECT
- a bold and dashing adventure is in your future within the year  [Future:1]  CORRECT
- what doesn't destroy you makes you stronger  [Future:1]  INCORRECT
- you are a dreamer and your thinking is inspirational  [Wise Saying:0]  CORRECT
- focus on your long-term goal good things will soon happen  [Future:1]  CORRECT
- time heals all wounds keep your chin up  [Wise Saying:0]  CORRECT
- time makes one wise ask advice from someone older than you  [Wise Saying:0]  CORRECT
- new financial resources will soon become available to you  [Future:1]  CORRECT
- the rainbow's treasures will soon belong to you  [Future:1]  CORRECT
- you are cheerful and well-liked  [Future:1]  INCORRECT
- you will soon be receiving sound spoken advice listen  [Future:1]  CORRECT
- you will be proud in manner but tolerant and generous  [Future:1]  CORRECT
- you'll accomplish more later if you have a little fun this weekend  [Future:1]  CORRECT
- you will be showered with good luck before your next birthday  [Future:1]  CORRECT
- you have an unusually magnetic personality  [Wise Saying:0]  CORRECT
- patience is the answer to success  [Wise Saying:0]  CORRECT
- the person you are thinking of is also thinking of you  [Wise Saying:0]  CORRECT
- listen these next few days to your friends to get answers you seek  [Future:1]  CORRECT
- three months from this date your lucky star will be shining  [Future:1]  CORRECT
- friendship is the key to finding the answer you're looking for  [Wise Saying:0]  CORRECT
- where there is no love nothing is possible  [Wise Saying:0]  CORRECT
- rely on long time friends to give you advice  [Wise Saying:0]  INCORRECT
- you will soon be reunited with an old friend  [Future:1]  CORRECT
- opportunity awaits you next monday  [Future:1]  CORRECT
- original ideas allow you to meet talented people  [Future:1]  INCORRECT
- you will be fortunate in the opportunities presented to you  [Future:1]  CORRECT
- wisdom is acquired by experience not just by age  [Future:1]  INCORRECT
- a small vocabulary doesn't necessarily indicate a small mind  [Future:1]  INCORRECT
- now is the time to call loved ones at a distance share your news  [Future:1]  INCORRECT
- you will have much to be thankful for in the coming year  [Future:1]  CORRECT
- no obstacles will stand in the way of your success this month  [Future:1]  CORRECT
- you would do well in the field of computer technology  [Future:1]  CORRECT
- family ties will be reestablished in the near future  [Future:1]  CORRECT
- be content with your lot one cannot be first in everything  [Wise Saying:0]  CORRECT
- you will win success in whatever you attempt  [Future:1]  CORRECT
- you have the strength to overcome obstacles on your way to success  [Future:1]  INCORRECT
- pleasures await you by the seashore  [Future:1]  CORRECT
- the only way to have a friend is to be one  [Wise Saying:0]  CORRECT
- opportunity awaits you on next tuesday  [Future:1]  CORRECT
- you are able to juggle many tasks  [Future:1]  INCORRECT
- you will have many friends when you need them  [Future:1]  CORRECT
- this year your highest priority will be your family  [Future:1]  CORRECT
- an unexpected payment is coming your way  [Future:1]  CORRECT
- your talents will prove to be especially useful this week  [Future:1]  CORRECT
- a man who dares to waste an hour of time hasn't discovered the value of life  [Wise Saying:0]  CORRECT
- your lucky number for this week is nine  [Future:1]  CORRECT
- you will attend an unusual party and meet someone important  [Future:1]  CORRECT
- you add an aesthetic quality to everything you do  [Future:1]  INCORRECT
- you have a charming way with words  write a letter this week  [Future:1]  INCORRECT
- you have sound business sense  [Future:1]  INCORRECT
- flowers would brighten the day of your close friend tomorrow  [Wise Saying:0]  INCORRECT
- you will find your solution where you least expect it  [Future:1]  CORRECT
- a long lost relative will soon come along to your benefit  [Future:1]  CORRECT
- your respect for others will be your ticket to success  [Future:1]  CORRECT
- you should do well at making money and holding on to it  [Future:1]  CORRECT
- the star of riches is shining on you this month  [Wise Saying:0]  INCORRECT
- you should enhance your feminine side at this time  [Future:1]  INCORRECT
- you will make many changes before settling satisfactorily  [Future:1]  CORRECT
- your dreams will bring you into a profitable venture  [Future:1]  CORRECT
- you have an active mind and a keen imagination apply your ideas  [Future:1]  INCORRECT
- you shouldn't overspend at the moment frugality is important  [Future:1]  CORRECT
- an unexpected event will soon make your life more exciting  [Future:1]  CORRECT
- the time is right for you to make new friends  [Future:1]  CORRECT
- you have an ambitious nature and your reward is coming soon  [Future:1]  CORRECT
- you will be involved in many humanitarian projects  [Future:1]  CORRECT
- a friend or partner will be giving you needed information listen  [Future:1]  CORRECT
- your troubles will cease and fortune will smile upon you  [Future:1]  CORRECT
- a handful of patience is worth more than a bushel of brains  [Wise Saying:0]  CORRECT
- words must be weighed and not counted  [Wise Saying:0]  CORRECT
- don't give up the best is yet to come  [Future:1]  CORRECT
- you deserve to have a good time after a hard day's work or school  [Future:1]  INCORRECT
- behind an able man there are always other able men  [Wise Saying:0]  CORRECT
- the project you have in mind will soon gain momentum  [Future:1]  CORRECT
- sometimes the wisest person is dressed in the rudest clothing  [Wise Saying:0]  CORRECT
=======
RESULT:
> Classifier Accuracy: "80.19801980198021%"
=======