Experimented with simple Naive Bayes for sentiment classification.
The Naive Bayes code is available here: chapter6/docclass.py, and the training data is available here.
Changed the getwords() function in docclass.py:
- to remove special characters such as single quotes, commas, full stops and exclamation marks from the text,
- to split on whitespace instead of on non-word characters, because the non-word split discarded emoticons, and
- to filter out words found in the NLTK stopwords corpus.
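[sourcecode language="python"]
import re
import nltk

# Build the stopword set once instead of reloading the corpus for every word
STOPWORDS = set(nltk.corpus.stopwords.words('english'))

def getwords(doc):
    # Strip full stops, commas, exclamation marks and single quotes
    doc = re.sub(r"\.+|,+|!+|'", '', doc)
    # Split on whitespace so that emoticons such as :) survive as tokens
    splitter = re.compile(r'\s+')
    words = [s.lower().strip() for s in splitter.split(doc)]
    # Drop empty strings and English stopwords
    words = [w for w in words if w and w not in STOPWORDS]
    print(words)  # debug: show the extracted tokens
    # Return the unique set of words only
    return dict([(w, 1) for w in words])
[/sourcecode]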
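To illustrate why the splitter matters, here is a minimal sketch comparing the two approaches on a made-up sentence. The pattern r'\W+' stands in for the non-word split and is an assumption, as are the helper names and the sample text; the stopword filter is left out so the example runs without NLTK.

```python
import re

def split_nonword(text):
    # Split on runs of non-word characters; punctuation-only tokens such as ":)" vanish
    return [t for t in re.split(r'\W+', text.lower()) if t]

def split_whitespace(text):
    # Split on runs of whitespace; emoticons survive as separate tokens
    return [t for t in re.split(r'\s+', text.lower().strip()) if t]

sample = "Loving the new phone :) but battery is meh :("  # hypothetical tweet
print(split_nonword(sample))
print(split_whitespace(sample))
```

With the whitespace split, ":)" and ":(" stay in the token list and can act as features for the classifier.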
For the training data, converted the ';;'-separated data file to a tab-separated ('\t') file, because csv.reader()
accepts only a single-character delimiter.
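The conversion could be done along these lines. This is only a sketch: convert_to_tsv is a hypothetical helper, and the two sample rows (with an assumed layout of polarity, id, date, query, user, text) are invented so the example runs without the real data file.

```python
import csv
import io

def convert_to_tsv(src, dst, sep=';;'):
    """Copy lines from src to dst, turning sep-separated fields into tab-separated ones."""
    writer = csv.writer(dst, delimiter='\t', lineterminator='\n')
    for line in src:
        line = line.rstrip('\r\n')
        if line:
            writer.writerow(line.split(sep))

# Hypothetical rows in the assumed layout: polarity;;id;;date;;query;;user;;text
raw = io.StringIO(
    "4;;1;;Mon May 11 2009;;kindle2;;user_a;;Loving my new kindle :)\n"
    "0;;2;;Mon May 11 2009;;aig;;user_b;;So annoyed with this\n"
)
tsv = io.StringIO()
convert_to_tsv(raw, tsv)

# Read the result back with a single-character delimiter
tsv.seek(0)
rows = list(csv.reader(tsv, delimiter='\t'))
print(rows[0][0], rows[0][5])
```

If the text column can itself contain ';;', line.split(sep, 5) would keep it in one field, assuming six columns.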
Changed the sampletrain function to train the classifier on the training data file "testdata.manual.2009.05.25".
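[sourcecode language="python"]
import csv

def sampletrain(cl):
    # 'rb' is what Python 2's csv module expects; on Python 3 use open('pos 1', newline='')
    with open('pos 1', 'rb') as f:
        read = csv.reader(f, delimiter='\t')
        cnt = 0
        for row in read:
            # csv.reader yields strings, so compare with '0' rather than the integer 0
            if row[0] == '0':
                sent = 'bad'
            else:
                sent = 'pos'
            data = row[5]
            cl.train(data, sent)
            cnt = cnt + 1
    # Number of rows used for training
    print(cnt)
[/sourcecode]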
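One detail worth checking in a loop like this: csv.reader returns every field as a string, so the polarity column has to be compared with '0' rather than the integer 0, otherwise every row ends up labeled 'pos'. A small sketch with an in-memory file and a stand-in classifier; RecordingClassifier, train_from_rows and the sample rows are hypothetical and not part of docclass.py.

```python
import csv
import io

class RecordingClassifier(object):
    """Hypothetical stand-in that only records what it is trained on."""
    def __init__(self):
        self.seen = []

    def train(self, item, cat):
        self.seen.append((item, cat))

def train_from_rows(cl, stream):
    count = 0
    for row in csv.reader(stream, delimiter='\t'):
        # Fields are strings: '0' == 0 is False, so compare against '0'
        sent = 'bad' if row[0] == '0' else 'pos'
        cl.train(row[5], sent)
        count += 1
    return count

# Invented rows: polarity, id, date, query, user, text
sample = io.StringIO(
    "0\t1\tdate\tquery\tuser_a\tworst service ever\n"
    "4\t2\tdate\tquery\tuser_b\tgreat phone :)\n"
)
cl = RecordingClassifier()
print(train_from_rows(cl, sample))
print(cl.seen)
```

Under this mapping any polarity value other than '0' is labeled 'pos', so if the file contains other values they would need their own label or be skipped.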