Thursday, September 22, 2011

Funny Incident @Work

Today, my manager asked me to send him my profile as a one-page PPT. After I sent him the document, he replied, 'You haven't included number of years of experience in it. Update it & send again.'
But I had included that in the document. So, to clarify, I called him.


Me: Hi <manager>, it's regarding the profile. I have mentioned my years of experience in the document.
Manager: Where is it? I don't see it.
Me: It's in the first sentence of the document itself.
Manager: You mentioned it in letters. Ah.. you should have mentioned it in digits.
Me: Okay. Anything else?
Manager: No, nothing else. Just update it, change it to a digit and resend me the document.

He wanted me to replace the word with a single digit and resend him the document. Amazing!!
Such incidents really provoke laughter and provide amusement at work.

Monday, September 5, 2011

Extracting movie title from torrent file name using Regular Expression

Movie files downloaded from torrent sites have file names that contain format types (like dvdrip, dvdscr, xvid etc.), the year, comments, user names and, of course, the movie name. We want to extract the movie name from these file names.



Input file names                                          Output achieved

countdown.to.zero.2010.xvid-submerge.avi                  countdown to zero
DrJn.2010.BRRip_mediafiremoviez.com.mkv                   drjn
Nim's.Island[2008]DvDrip-aXXo.avi                         nim's island
Invictus.DVDSCR.xViD-xSCR.CD1.avi                         invictus
Invictus.DVDSCR.xViD-xSCR.CD2.avi                         invictus
20000 Leagues Under The Sea.avi                           (empty)
Across The Universe.MoZinRaT CD1.avi                      across the universe mozinrat
Adoration 2008 DvdRip ExtraScene RG.avi                   adoration
Amelie(English Dubbed).avi                                amelie
America.2009.STV.DVDRip.XviD-ViSiON.avi                   america
VTS_02_1.avi                                              vts_02_1
VTS_02_2.avi                                              vts_02_2
Antibodies.2005.GERMAN.DVDRip.XviD.AC3.CD1-AFO.avi        antibodies
arranged.xvid-reserved.avi                                arranged
badder.santa.dvdrip.xvid-deity.avi                        badder santa
Balls of Fury[2007]DvDrip[Eng]-FXG.avi                    balls of fury
Bruno (2009) DVDRip-MAXSPEED www.torentz.3xforum.ro.avi   bruno
Defiance DvDSCR[2009] ( 10rating ).avi                    defiance
Down With Love (cute romantic comedy).avi                 down with love
Einstein.And.Eddington.2008.DVDRip.XviD.avi               einstein and eddington
ENEMY_OF_THE_STATE..DVDrip(vice).avi                      enemy_of_the_state

The approach is based on two observations:

  • Most file names contain a format tag such as 'dvdrip', 'xvid', 'brrip' or 'dvdscr', or another marker such as 'CD1', '(<year>)' or '[<year>]'. Everything after the first of these markers contains no useful data.

  • Sometimes extra information is added to the file name inside brackets, as in
    Defiance DvDSCR[2009] ( 10rating ).avi
    Down With Love (cute romantic comedy).avi
    Movie names generally don't have brackets in them, and nothing useful follows the brackets, so we can also drop the bracketed part and everything after it.

The first observation leads to this regular expression:

'(.*?)(dvdrip|xvid| cd[0-9]|dvdscr|brrip|divx|[\{\(\[]?[0-9]{4}).*'

This regular expression matches file names that contain dvdrip, brrip, xvid (any number of such markers can be added to the alternation) or a four-digit year, optionally preceded by an opening bracket. Because the leading group uses the lazy quantifier .*?, the match stops at the first marker that appears in the name. We then extract the first capture group, \1.
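To see why the lazy quantifier matters, here is a small sketch that runs the same pattern twice on one lowercased name, once as written and once with a greedy first group. The greedy variant and the variable names are only for comparison and are not part of the original approach.

```python
import re

# Pattern from the post, written as a raw string.
LAZY = re.compile(
    r'(.*?)(dvdrip|xvid| cd[0-9]|dvdscr|brrip|divx|[\{\(\[]?[0-9]{4}).*')
# Same pattern with a greedy first group, for comparison only.
GREEDY = re.compile(
    r'(.*)(dvdrip|xvid| cd[0-9]|dvdscr|brrip|divx|[\{\(\[]?[0-9]{4}).*')

name = 'america 2009 stv dvdrip xvid-vision'

lazy_title = LAZY.search(name).group(1)
greedy_title = GREEDY.search(name).group(1)

print(repr(lazy_title))    # 'america '
print(repr(greedy_title))  # 'america 2009 stv dvdrip '
```

The lazy group stops at the year, the earliest marker, while the greedy group runs on to the last marker in the name. Both results keep a trailing space, which can be stripped afterwards.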

Secondly, to remove the part within brackets we use the regular expression

'(.*?)\(.*\)(.*)'

and again keep only the first capture group.

The following code snippet gets the movie names (to an extent) from the file names.





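import re

# Read one file name per line and write one cleaned title per line.
with open('filenameslist.txt', 'r') as fr, open('movienames.txt', 'w') as fw:
    for line in fr:
        text = line.strip()
        # Drop any backslash-separated directory prefix and the video extension.
        text1 = re.search(r'([^\\]+)\.(avi|mkv|mpeg|mpg|mov|mp4)$', text)
        if text1:
            text = text1.group(1)
        text = text.replace('.', ' ').lower()
        # Keep only what precedes the first format marker or year.
        text2 = re.search(r'(.*?)(dvdrip|xvid| cd[0-9]|dvdscr|brrip|divx|[\{\(\[]?[0-9]{4}).*', text)
        if text2:
            text = text2.group(1)
        # Drop a bracketed comment and everything after it.
        text3 = re.search(r'(.*?)\(.*\)(.*)', text)
        if text3:
            text = text3.group(1)
        # Strip the space left behind by the cut before writing.
        fw.write(text.strip() + '\n')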




The output can be improved further. For example, we can replace characters like underscores with spaces, and we can treat four digits as a year only when the next character is a non-word character.
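A minimal sketch of those two improvements follows. The helper name clean_title is hypothetical, and the use of \b word boundaries around the year is one possible reading of "next character is a non-word character", not the only one.

```python
import re

# Markers from the post. A year must now be exactly four digits that are
# not part of a longer word or number (checked with \b on both sides).
CUT = re.compile(
    r'(.*?)(dvdrip|xvid| cd[0-9]|dvdscr|brrip|divx|[\{\(\[]?\b[0-9]{4}\b)')
# Bracket pattern from the post; only the first group is used.
BRACKETS = re.compile(r'(.*?)\(.*\)')


def clean_title(stem):
    """Hypothetical helper: stem is a file name without its extension."""
    # Underscores are replaced first so that \b can see the word edges.
    text = stem.replace('.', ' ').replace('_', ' ').lower()
    match = CUT.search(text)
    if match:
        text = match.group(1)
    match = BRACKETS.search(text)
    if match:
        text = match.group(1)
    # Collapse repeated spaces and trim the ends.
    return ' '.join(text.split())


print(clean_title('20000 Leagues Under The Sea'))
print(clean_title('ENEMY_OF_THE_STATE..DVDrip(vice)'))
```

With these changes, '20000 Leagues Under The Sea' is no longer cut at its first four digits. A title that is itself a four-digit number would still be treated as a year, so this remains a heuristic.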

Sunday, September 4, 2011

Extracting IMDB data using Python

Sample code for extracting IMDB data using the Python BeautifulSoup package.

Requirement: to extract all the feature movie names and their ratings from the IMDB database for a particular year.

The parameters used in the query were identified using the IMDB advanced search function. The start, count and year parameters are used for querying in this case. The URL is queried for 100 records at a time, since more than that is not allowed. After the movie names and ratings for 100 records are extracted, the URL is queried for the next 100 records, and so on.
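The paging scheme can be sketched as below, in Python 3. The snippet only builds URL strings and does not contact the site. The helper names page_starts and search_url are hypothetical, the query parameters are the ones used in the listing in this post, and the offsets assume that start is 1-based (the listing itself sends 0 for the first page and 101 for the second).

```python
from urllib.parse import urlencode

BASE_URL = 'http://www.imdb.com/search/title'
PAGE_SIZE = 100  # the post reports that larger pages are not allowed


def page_starts(total, page_size=PAGE_SIZE):
    """Return the start offset of each page (assumed to be 1-based)."""
    return range(1, total + 1, page_size)


def search_url(year, start, count=PAGE_SIZE):
    """Build the search URL for one page of results."""
    params = [
        ('languages', 'en'),
        ('title_type', 'feature'),
        ('sort', 'num_votes,desc'),
        ('count', count),
        ('start', start),
        ('year', year),
    ]
    # safe=',' keeps the comma in the sort value instead of encoding it.
    return BASE_URL + '?' + urlencode(params, safe=',')


# For a query with 250 results, three pages are needed.
urls = [search_url(2010, start) for start in page_starts(250)]
for url in urls:
    print(url)
```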

The code writes two files, 'imdb_conf' and 'ratings'. The 'imdb_conf' file keeps track of the record number last read, and the 'ratings' file stores the movie name, rating and year.
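As a rough illustration of how such a checkpoint file lets a run resume, here is a Python 3 sketch. The function name resume_point and the default year are assumptions; the line layout (a header line, then one tab-separated "record number, year" line per record) follows what the listing in this post writes.

```python
def resume_point(lines, default_year='2010'):
    """Return (start, year) from the lines of a checkpoint file.

    The first line is assumed to be a header. Every other line is assumed
    to hold a record number and a year separated by a tab.
    """
    records = [line.strip().split('\t') for line in lines[1:] if line.strip()]
    if not records:
        # Nothing fetched yet: start from the beginning.
        return 0, default_year
    last_record, year = records[-1]
    # Continue with the record after the last one written.
    return int(last_record) + 1, year


checkpoint = ['LastreadLine\tYear\n', '1\t2010\n', '2\t2010\n']
print(resume_point(checkpoint))
```

A final line of "-1" followed by a year, which the listing writes once a year is finished, gives a start of 0 for that year.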

Web scraping the IMDB website for the required data:

[sourcecode language="python"]
# Python 2 code (urllib2, BeautifulSoup 3)
from BeautifulSoup import BeautifulSoup
import HTMLParser
import re
import urllib2


def get_start_pos_yr(fimdb_config):
    # for starting after last fetched record
    # last line contains the last record fetched
    nlines = fimdb_config.readlines()
    startfrom = -1
    year = None
    if len(nlines) > 1:
        list_num = re.search('[^\t]+', nlines[-1])
        if list_num:
            startfrom = int(list_num.group()) + 1
            year = re.search('\t[0-9]+', nlines[-1]).group().strip()
    return startfrom, year


def get_soup(url):
    # get soup object for the url
    try:
        page = urllib2.urlopen(url)
    except urllib2.URLError, e:
        print 'Failed to fetch ' + url
        raise e

    try:
        soup = BeautifulSoup(page)
    except HTMLParser.HTMLParseError, e:
        print 'Failed to parse ' + url
        raise e
    return soup


def get_ntotal(soup):
    # fetch total number of records present for particular query
    total_count = 1
    for div in soup.findAll('div', {'id': 'left'}):
        match = re.search('[ ]+[0-9,]+', div.contents[0])
        if match:
            # e.g. ' 1,234' becomes '1234'
            total_count = match.group().replace(',', '').strip()
    return total_count


def set_rating(soup, fwimdb_config, frating, year, startfrom):
    cond = True
    count_rec = 0
    total_res = get_ntotal(soup)
    year = str(year)
    while cond:
        for tr in soup.findAll('tr', {'class': re.compile('(odd|even)[ a-zA-Z]*')}):  # each row
            for td in tr.findAll('td', {'class': 'title'}):
                for link in td.findAll('a', {'href': re.compile('/title/tt[^/]+/$')}):
                    movie_name = link.contents[0]  # title name
                for rating in td.findAll('div', {'class': 'rating rating-list'}):
                    count_rec = count_rec + 1
                    if rating.has_key('title'):
                        rt = re.search('[0-9]+[^(]+', rating['title'])  # rating
                        if rt:
                            frating.write(movie_name + "\t" + rt.group().strip() + "\t" + year + "\n")
                        else:
                            frating.write(movie_name + "\t--\t" + year + "\n")
                    else:
                        frating.write(movie_name + "\t--\t" + year + "\n")
                    # checkpoint: record number last written
                    fwimdb_config.write(str(count_rec) + "\t" + year + "\n")
        if startfrom == 0:
            startfrom = 101  # second run
        else:
            startfrom = startfrom + 100

        if startfrom >= int(total_res):
            cond = False
            # mark this year as done and point to the previous year
            fwimdb_config.write("-1" + "\t" + str(int(year) - 1) + "\n")
        print str(startfrom) + " " + str(total_res)
        soup = get_soup("http://www.imdb.com/search/title?languages=en&title_type=feature&count=100&sort=num_votes,desc&start=" + str(startfrom) + "&year=" + year)


def main():
    # 'r+' does not create the file, so imdb_conf must already exist
    fwimdb_conf = open("imdb_conf", "r+")
    frating = open("ratings", "a")  # ratings
    fwimdb_conf.write("LastreadLine\tYear\n")
    startfrom, year = get_start_pos_yr(fwimdb_conf)

    if startfrom == -1:
        startfrom = 0
    if year is None:
        year = "2010"
    print startfrom

    soup = get_soup("http://www.imdb.com/search/title?languages=en&title_type=feature&sort=num_votes,desc&count=100&start=" + str(startfrom) + "&year=" + year)
    set_rating(soup, fwimdb_conf, frating, year, startfrom)

    frating.close()
    fwimdb_conf.close()


if __name__ == '__main__':
    main()
[/sourcecode]

For demonstration purposes only. If you plan to use IMDB data beyond personal usage, you should contact IMDB Licensing department.