Thursday, October 20, 2011

Sentiment analysis using Naive Bayes Algorithm

Experimented with simple Naive Bayes for sentiment classification.

The Naive Bayes code is available here (chapter6/docclass.py), and the training data is available here.

Changed the getwords() function in docclass.py:

- to remove special characters such as single quotes, commas, full stops, and exclamation marks from the text,
- to split on whitespace instead of non-word characters, because the non-word split discarded emoticons, and
- to skip words found in the NLTK stopwords corpus.

[sourcecode language="python"]

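import re
import nltk

# Build the stopword set once; needs the corpus from nltk.download('stopwords')
STOPWORDS = set(nltk.corpus.stopwords.words('english'))

def getwords(doc):
    # Strip full stops, commas, exclamation marks and single quotes
    doc = re.sub(r"\.+|,+|!+|'", '', doc)
    # Split on whitespace (not on non-word characters) so emoticons survive
    words = [s.lower() for s in doc.split()]
    # Drop stopwords
    words = [w for w in words if w not in STOPWORDS]
    # Return the unique set of words only
    return dict([(w, 1) for w in words])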
[/sourcecode]
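
To illustrate why the split was changed, here is a small sketch comparing the two splitting strategies. The sample sentence is made up for illustration, and the non-word split shown here is only an approximation of what the original getwords() did.

```python
import re

text = "great phone :) but battery died :("

# Splitting on non-word characters throws away punctuation-only tokens,
# so both emoticons disappear.
nonword_tokens = [t for t in re.split(r'\W+', text) if t]

# Splitting on whitespace keeps each emoticon as its own token.
space_tokens = text.split()

print(nonword_tokens)
print(space_tokens)
```

With the whitespace split, ':)' and ':(' reach the classifier as features, which is useful for sentiment.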

For the training data, converted the ';;'-separated data file to a tab-separated ('\t') file, because csv.reader() only accepts a single-character delimiter.
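
A minimal sketch of that conversion, assuming no field contains ';;' itself. The helper name double_semicolon_to_tsv and the two sample rows are hypothetical, made up only to show the idea; the real file's contents are not reproduced here.

```python
import csv
import io

# csv.reader rejects a two-character delimiter outright.
try:
    csv.reader([], delimiter=';;')
except TypeError as err:
    print(err)

def double_semicolon_to_tsv(lines, out):
    """Rewrite ';;'-separated lines as tab-separated rows on `out`."""
    writer = csv.writer(out, delimiter='\t', lineterminator='\n')
    for line in lines:
        writer.writerow(line.rstrip('\r\n').split(';;'))

# Hypothetical sample rows: label first, text in the sixth column.
sample = [
    "4;;1;;Mon May 11;;query;;user1;;loving my new phone :)\n",
    "0;;2;;Mon May 11;;query;;user2;;worst service ever\n",
]

buf = io.StringIO()
double_semicolon_to_tsv(sample, buf)

# The result can now be read back with a one-character delimiter.
rows = list(csv.reader(io.StringIO(buf.getvalue()), delimiter='\t'))
print(rows[0][0], rows[0][5])
```

For a real file, the same function can be called with an open input file as `lines` and an output file opened with newline='' as `out`.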

Changed the sampletrain function to train the classifier on the training data file "testdata.manual.2009.05.25".

[sourcecode language="python"]
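import csv

def sampletrain(cl):
    # Python 3: open in text mode with newline='' for the csv module
    with open('pos 1', newline='') as f:
        cnt = 0
        for row in csv.reader(f, delimiter='\t'):
            # csv yields strings, so compare against '0', not the integer 0
            sent = 'bad' if row[0] == '0' else 'pos'
            # The text is in the sixth column
            cl.train(row[5], sent)
            cnt += 1
    # Number of rows trained on
    print(cnt)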
[/sourcecode]
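
To show what the train() calls feed into, here is a self-contained sketch of a tiny Naive Bayes classifier with the same 'pos' / 'bad' labels. This is not the docclass.py implementation: the class TinyNaiveBayes, the simple_features function, the add-one smoothing, and the four training sentences are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def simple_features(text):
    # Hypothetical feature extractor: unique lowercased whitespace tokens
    return set(text.lower().split())

class TinyNaiveBayes:
    """Illustrative stand-in only; not the classifier from docclass.py."""

    def __init__(self, getfeatures):
        self.getfeatures = getfeatures
        # feature_counts[cat][feature] = number of documents in cat with feature
        self.feature_counts = defaultdict(lambda: defaultdict(int))
        # category_counts[cat] = number of documents seen in cat
        self.category_counts = defaultdict(int)

    def train(self, item, cat):
        for f in self.getfeatures(item):
            self.feature_counts[cat][f] += 1
        self.category_counts[cat] += 1

    def classify(self, item):
        total = sum(self.category_counts.values())
        best, best_score = None, float('-inf')
        for cat, n in self.category_counts.items():
            # log P(cat) + sum of log P(feature | cat)
            score = math.log(n / total)
            for f in self.getfeatures(item):
                # Add-one smoothing so unseen words do not zero the product
                p = (self.feature_counts[cat].get(f, 0) + 1) / (n + 2)
                score += math.log(p)
            if score > best_score:
                best, best_score = cat, score
        return best

cl = TinyNaiveBayes(simple_features)
# Made-up training sentences, not taken from the data file
cl.train("love this phone :)", "pos")
cl.train("great battery :)", "pos")
cl.train("hate this phone :(", "bad")
cl.train("terrible battery :(", "bad")

print(cl.classify("love it :)"))
print(cl.classify("terrible :("))
```

Working in log space avoids multiplying many small probabilities together, which matters once the training set is larger than this toy example.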
