Lazzy Scientist: September 2011

Thursday, September 22, 2011

Funny Incident @Work

Today my manager asked me to send him my profile as a one-page PPT. After I sent the document, he replied: 'You haven't included your number of years of experience in it. Update it & send it again.'
But I had included that in the document. So, to clarify, I called him.

Me: Hi <manager>, it's regarding the profile. I have mentioned my years of experience in the document.
Manager: Where is it? I don't see it.
Me: It's in the first sentence of the document itself.
Manager: You mentioned it in letters. Ah.. you should have mentioned it in digits.
Me: Okay. Anything else?
Manager: No, nothing else. Just update it, change it to a digit and resend me the document.

He wanted me to replace the word with a single digit and resend him the document. Amazing!!
Such incidents really provoke laughter and provide amusement at work.

Monday, September 5, 2011

Extracting movie title from torrent file name using Regular Expression

Movie files downloaded from torrent sites have file names that contain format types (like dvdrip, dvdscr, xvid, etc.), the year, comments, user names and, of course, the movie name. We want to extract the movie name from these file names.

	Input File names	Output Achieved
	countdown.to.zero.2010.xvid-submerge.avi	countdown to zero
	DrJn.2010.BRRip_mediafiremoviez.com.mkv	drjn
	Nim's.Island[2008]DvDrip-aXXo.avi	nim's island
	Invictus.DVDSCR.xViD-xSCR.CD1.avi	invictus
	Invictus.DVDSCR.xViD-xSCR.CD2.avi	invictus
	20000 Leagues Under The Sea.avi
	Across The Universe.MoZinRaT CD1.avi	across the universe mozinrat
	Adoration 2008 DvdRip ExtraScene RG.avi	adoration
	Amelie(English Dubbed).avi	amelie
	America.2009.STV.DVDRip.XviD-ViSiON.avi	america
	VTS_02_1.avi	vts_02_1
	VTS_02_2.avi	vts_02_2
	Antibodies.2005.GERMAN.DVDRip.XviD.AC3.CD1-AFO.avi	antibodies
	arranged.xvid-reserved.avi	arranged
	badder.santa.dvdrip.xvid-deity.avi	badder santa
	Balls of Fury[2007]DvDrip[Eng]-FXG.avi	balls of fury
	Bruno (2009) DVDRip-MAXSPEED www.torentz.3xforum.ro.avi	bruno
	Defiance DvDSCR[2009] ( 10rating ).avi	defiance
	Down With Love (cute romantic comedy).avi	down with love
	Einstein.And.Eddington.2008.DVDRip.XviD.avi	einstein and eddington
	ENEMY_OF_THE_STATE..DVDrip(vice).avi	enemy_of_the_state

The approach is based on two observations.

Most file names contain a format tag such as 'dvdrip', 'xvid', 'brrip' or 'dvdscr', or another marker such as 'CD1', '(<year>)' or '[<year>]', and nothing after the first of these markers contains any useful data.

Sometimes extra information is added to the file name inside brackets, as in
Defiance DvDSCR[2009] ( 10rating ).avi
Down With Love (cute romantic comedy).avi
so we can also ignore the opening bracket and everything after it, since movie names don't have brackets in them and nothing useful follows the bracket.

'(.*?)(dvdrip|xvid| cd[0-9]|dvdscr|brrip|divx|[\{$\[]?[0-9]{4}).*'

This regular expression matches file names that contain dvdrip, brrip, xvid (any number of such values can be listed) or a year. Because .*? is lazy, the first group stops at the first of these patterns that appears. We then extract the first captured group, \1.
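To see what the lazy .*? buys us, here is a small sketch that runs the pattern above next to a greedy variant. The greedy pattern and the helper name before_marker are made up for this comparison and are not part of the original script.

```python
import re

# Marker pattern copied from the post, written as a raw string.
LAZY = r'(.*?)(dvdrip|xvid| cd[0-9]|dvdscr|brrip|divx|[\{$\[]?[0-9]{4}).*'
# Hypothetical greedy variant, cut down to two markers, for comparison only.
GREEDY = r'(.*)(dvdrip|xvid).*'

def before_marker(pattern, name):
    """Return group 1 of the first match, or the name unchanged if none."""
    match = re.search(pattern, name)
    return match.group(1) if match else name

name = "badder santa dvdrip xvid-deity"
print(repr(before_marker(LAZY, name)))    # 'badder santa '
print(repr(before_marker(GREEDY, name)))  # 'badder santa dvdrip '
print(repr(before_marker(LAZY, "amelie(english dubbed)")))  # no marker, unchanged
```

The lazy group stops before the first marker, while the greedy one runs on to the last marker it can find. The captured group keeps its trailing space.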

Second, to remove the part within brackets, we use the regular expression

'(.*?)\(.*$(.*)'
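A quick sketch of this bracket pattern on its own, assuming the name has already been lower-cased. The helper name drop_bracketed_comment is hypothetical.

```python
import re

def drop_bracketed_comment(name):
    """Return the text before the first '(' or the name unchanged if none."""
    match = re.search(r'(.*?)\(.*$(.*)', name)
    return match.group(1) if match else name

for name in ["down with love (cute romantic comedy)",
             "amelie(english dubbed)",
             "invictus"]:
    print(repr(drop_bracketed_comment(name)))
```

For these inputs the result is the same as name.split('(', 1)[0], which may be easier to read if the regular expression is not needed for anything else.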

The following code snippet extracts the movie names (to an extent) from the file names.

import re

fr = open('filenameslist.txt', 'r')
fw = open('movienames.txt', 'w')
for line in fr:
    text = line.strip()
    # Drop any leading Windows-style path and the video file extension.
    text1 = re.search(r'([^\\]+)\.(avi|mkv|mpeg|mpg|mov|mp4)$', text)
    if text1:
        text = text1.group(1)
    text = text.replace('.', ' ').lower()
    # Keep only the part before the first format tag or year.
    text2 = re.search(r'(.*?)(dvdrip|xvid| cd[0-9]|dvdscr|brrip|divx|[\{$\[]?[0-9]{4}).*', text)
    if text2:
        text = text2.group(1)
    # Drop an opening bracket and everything after it.
    text3 = re.search(r'(.*?)\(.*$(.*)', text)
    if text3:
        text = text3.group(1)
    # print text
    fw.write(text + '\n')
fr.close()
fw.close()

The output can be improved further, for example by replacing characters such as underscores with spaces, or by treating four digits as a year only when the next character is a non-word character.
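A rough sketch of those two improvements, written for Python 3. The function name clean_title and the two lookarounds are my own additions; the check that no digit comes before the year goes slightly beyond the idea above, and is there so that '20000' is not cut at '0000'.

```python
import re

# Marker pattern from the post, with two hypothetical lookarounds added so that
# four digits only count as a year when no digit precedes them and no word
# character follows them.
REFINED = re.compile(
    r'(.*?)(dvdrip|xvid| cd[0-9]|dvdscr|brrip|divx'
    r'|[\{$\[]?(?<![0-9])[0-9]{4}(?!\w)).*'
)

def clean_title(stem):
    """Turn a file name without its extension into a lower-case title guess."""
    text = stem.replace('.', ' ').replace('_', ' ').lower()
    match = REFINED.search(text)
    if match:
        text = match.group(1)
    text = text.split('(', 1)[0]   # drop a bracketed comment, if any
    return ' '.join(text.split())  # collapse runs of spaces and trim

for stem in ["20000 Leagues Under The Sea",
             "ENEMY_OF_THE_STATE..DVDrip(vice)",
             "Adoration 2008 DvdRip ExtraScene RG"]:
    print(clean_title(stem))
```

A title that is itself a four-digit number would still be cut to nothing, so this is only a step further, not a complete fix.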

Sunday, September 4, 2011

Extracting IMDB data using Python

Sample code for extracting IMDB data using the Python BeautifulSoup package.

Requirement: extract all the feature movie names and their ratings from the IMDB database for a particular year.

The query parameters were identified using the IMDB advanced search function. The start, count and year parameters are used for querying. The URL is queried for 100 records at a time, since more than that is not allowed. After extracting the movie names and ratings for 100 records, the URL is queried for the next 100 records, and so on.
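The paging can be sketched as a generator that only builds the URLs and makes no request. It is written for Python 3, the function name page_urls is hypothetical, and it assumes the first record is number 1. The base URL and parameters are the ones this post uses, and the site may well have changed since.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.imdb.com/search/title"  # search URL used in this post
PAGE_SIZE = 100  # the post queries 100 records at a time

def page_urls(year, total, start=1):
    """Yield one search URL per page of PAGE_SIZE records (no request is made)."""
    while start <= total:
        params = [
            ("languages", "en"),
            ("title_type", "feature"),
            ("sort", "num_votes,desc"),
            ("count", PAGE_SIZE),
            ("start", start),
            ("year", year),
        ]
        # safe="," keeps the comma in the sort value readable
        yield BASE_URL + "?" + urlencode(params, safe=",")
        start += PAGE_SIZE

for url in page_urls("2010", total=250):
    print(url)
```

With a total of 250 records this yields three URLs, with start set to 1, 101 and 201.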

Two files, 'imdb_conf' and 'ratings', are written by the code. The 'imdb_conf' file keeps track of the record number last read, and the 'ratings' file stores the movie name, rating and year.
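As a small sketch of how the last line of 'imdb_conf' can be read back, assuming the layout the code in this post writes: a header line, then one tab-separated 'record number, year' line per batch. The helper name parse_last_record is hypothetical and the sample lines are made up.

```python
def parse_last_record(lines):
    """Return (last_record, year) from the final line, or None if there is none."""
    if len(lines) < 2:  # empty file or header only
        return None
    fields = lines[-1].rstrip("\n").split("\t")
    if len(fields) != 2 or not fields[0].lstrip("-").isdigit():
        return None
    return int(fields[0]), fields[1]

# Made-up file contents in the assumed layout.
sample = ["LastreadLine\tYear\n", "100\t2010\n", "200\t2010\n"]
print(parse_last_record(sample))
```

A record number of -1 is what the script writes when a year is finished, together with the next year to fetch.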

Web scraping the IMDB website for the required data.

[sourcecode language="python"]
# Python 2 code (urllib2, BeautifulSoup 3)
from BeautifulSoup import BeautifulSoup
import HTMLParser
import os
import re
import urllib2

def get_start_pos_yr(fimdb_config):
    #for starting after last fetched record
    #last line contains the last record fetched
    nlines = fimdb_config.readlines()
    startfrom = -1
    year = None
    if len(nlines) > 1:
        list_num = re.search('[^\t]+', nlines[-1])
        if list_num:
            startfrom = int(list_num.group())+1
            year = re.search('\t[0-9]+', nlines[-1]).group().strip()
    return startfrom, year

def get_soup(url):
    #get soup object for the url
    try:
        page = urllib2.urlopen(url)
    except urllib2.URLError, e:
        print 'Failed to fetch ' + url
        raise e

    try:
        soup = BeautifulSoup(page)
    except HTMLParser.HTMLParseError, e:
        print 'Failed to parse ' + url
        raise e
    return soup

def get_ntotal(soup):
    #fetch total number of records present for particular query
    total_count = 1
    for div in soup.findAll('div', {'id':'left'}):
        #print div.contents[0]
        match = re.search('[ ]+[0-9,]+', div.contents[0])
        if match:
            total_count = match.group().replace(',', '').strip()
            #print "total"+total_count
    return total_count

def set_rating(soup, fwimdb_config, frating, year, startfrom):
    cond = True
    count_rec = 0
    total_res = get_ntotal(soup)
    year = str(year)
    #total_res=100
    while cond:
        for tr in soup.findAll('tr', {'class':re.compile('(odd|even)[ a-zA-Z]*')}): #each row
            for td in tr.findAll('td', {'class':'title'}):
                for link in td.findAll('a', {'href':re.compile('/title/tt[^/]+/$')}):
                    movie_name = link.contents[0] #title name
                for rating in td.findAll('div', {'class':'rating rating-list'}):
                    count_rec = count_rec+1
                    if rating.has_key('title'):
                        #print "hurray"
                        rt = re.search('[0-9]+[^(]+', rating['title']) #rating
                        if rt:
                            frating.write(movie_name+"\t"+rt.group().strip()+"\t"+year+"\n")
                        else:
                            frating.write(movie_name+"\t--\t"+year+"\n")
                    else:
                        frating.write(movie_name+"\t--\t"+year+"\n")
                    #print movie_name+"\t"+rt.group()
        #remember how many records have been read so far
        fwimdb_config.write(str(count_rec)+"\t"+year+"\n")
        if startfrom == 0:
            startfrom = 101 #second run
        else:
            startfrom = startfrom + 100

        if startfrom >= int(total_res):
            cond = False
            #this year is done; the next run starts on the previous year
            fwimdb_config.write("-1"+"\t"+str(int(year)-1)+"\n")
        print str(startfrom)+" "+str(total_res)
        soup = get_soup("http://www.imdb.com/search/title?languages=en&title_type=feature&count=100&sort=num_votes,desc&start="+str(startfrom)+"&year="+year)

def main():

    #"r+" requires the imdb_conf file to exist already
    fwimdb_conf = open("imdb_conf", "r+")
    frating = open("ratings", "a") #ratings
    fwimdb_conf.write("LastreadLine\tYear\n")
    startfrom, year = get_start_pos_yr(fwimdb_conf)

    if startfrom == -1:
        startfrom = 0
    if year == None:
        year = "2010"
    print startfrom

    soup = get_soup("http://www.imdb.com/search/title?languages=en&title_type=feature&sort=num_votes,desc&count=100&start="+str(startfrom)+"&year="+year)
    set_rating(soup, fwimdb_conf, frating, year, startfrom)

    frating.close()
    fwimdb_conf.close()

if __name__ == '__main__':
    main()
[/sourcecode]

For demonstration purposes only. If you plan to use IMDB data beyond personal use, you should contact the IMDB Licensing department.