Thursday, September 22, 2011
Funny Incident @Work
Today my manager asked me to send him my profile as a one-page PPT. After I sent him the document, he replied, 'You haven't included number of years of experience in it. Update it & send again.'
But I had included it in the document. So, to clarify, I called him.
Me: Hi <manager>, it's regarding the profile. I have mentioned the years of experience in the document.
Manager: Where is it? I don't see it.
Me: It's in the first sentence of the document itself.
Manager: You mentioned it in letters. Ah.. you should have mentioned it in digits.
Me: Okay. Anything else?
Manager: No, nothing else. Just update it, change it to a digit and resend me the document.
He wanted me to replace the word with a single digit and resend him the document. Amazing!!
Such incidents really provoke laughter and provide amusement at work.
Monday, September 5, 2011
Extracting movie title from torrent file name using Regular Expression
Movie files downloaded from torrent sites have file names that contain format types (like dvdrip, dvdscr, xvid etc.), the year, comments, user names and, of course, the movie name. We want to extract the movie name from these file names.
Input file name | Output achieved
---|---
countdown.to.zero.2010.xvid-submerge.avi | countdown to zero
DrJn.2010.BRRip_mediafiremoviez.com.mkv | drjn
Nim's.Island[2008]DvDrip-aXXo.avi | nim's island
Invictus.DVDSCR.xViD-xSCR.CD1.avi | invictus
Invictus.DVDSCR.xViD-xSCR.CD2.avi | invictus
20000 Leagues Under The Sea.avi | (empty)
Across The Universe.MoZinRaT CD1.avi | across the universe mozinrat
Adoration 2008 DvdRip ExtraScene RG.avi | adoration
Amelie(English Dubbed).avi | amelie
America.2009.STV.DVDRip.XviD-ViSiON.avi | america
VTS_02_1.avi | vts_02_1
VTS_02_2.avi | vts_02_2
Antibodies.2005.GERMAN.DVDRip.XviD.AC3.CD1-AFO.avi | antibodies
arranged.xvid-reserved.avi | arranged
badder.santa.dvdrip.xvid-deity.avi | badder santa
Balls of Fury[2007]DvDrip[Eng]-FXG.avi | balls of fury
Bruno (2009) DVDRip-MAXSPEED www.torentz.3xforum.ro.avi | bruno
Defiance DvDSCR[2009] ( 10rating ).avi | defiance
Down With Love (cute romantic comedy).avi | down with love
Einstein.And.Eddington.2008.DVDRip.XviD.avi | einstein and eddington
ENEMY_OF_THE_STATE..DVDrip(vice).avi | enemy_of_the_state
The approach is based on two observations:
- Most file names contain a format tag like 'dvdrip', 'xvid', 'brrip', 'dvdscr' or another marker like 'CD1', '(<year>)', '[<year>]', and everything after any of these markers does not contain any useful data.
- Sometimes extra information is added to the file name inside brackets, like
Defiance DvDSCR[2009] ( 10rating ).avi
Down With Love (cute romantic comedy).avi
so we can also ignore the bracketed part and everything after it, since movie names usually don't have brackets in them and nothing useful follows the brackets.
The first observation leads to this regular expression:
'(.*?)(dvdrip|xvid| cd[0-9]|dvdscr|brrip|divx|[\{\(\[]?[0-9]{4}).*'
It matches file names that contain dvdrip, brrip, xvid (any number of alternatives can be listed here) or a four-digit year. Because the leading .*? is lazy, the match stops at whichever of these patterns appears first. We then extract the first captured group, \1.
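To see the lazy match at work, here is a minimal sketch that applies the same pattern to a few sample names (already lower-cased, with dots turned into spaces). The helper name `strip_release_tags` is made up for this illustration; the pattern is the one from the post, only written as a raw string.

```python
import re

# Pattern from the post, as a raw string so backslashes reach the regex engine unchanged.
TAG_OR_YEAR = re.compile(
    r'(.*?)(dvdrip|xvid| cd[0-9]|dvdscr|brrip|divx|[\{\(\[]?[0-9]{4}).*')

def strip_release_tags(name):
    """Return the text before the first format tag or four-digit run (hypothetical helper)."""
    match = TAG_OR_YEAR.search(name)
    return match.group(1) if match else name

for sample in ['countdown to zero 2010 xvid-submerge',
               'badder santa dvdrip xvid-deity',
               '20000 leagues under the sea',
               'amelie(english dubbed)']:
    print(repr(strip_release_tags(sample)))
# 'countdown to zero '
# 'badder santa '
# ''
# 'amelie(english dubbed)'
```

The third sample shows the weak spot: the first four digits of "20000" are taken for a year, so nothing is left of the title. The last one has no tag or year at all, so it passes through unchanged.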
Secondly, to remove the part within brackets, we use the regular expression
'(.*?)\(.*\)(.*)'
and again keep only the first captured group.
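A small sketch of this second step, under the same assumptions as before (lower-cased input, hypothetical helper name `drop_parenthesised`):

```python
import re

# Pattern from the post: anything, then a (...) group, then anything.
BRACKETED = re.compile(r'(.*?)\(.*\)(.*)')

def drop_parenthesised(name):
    """Return the text before the first '(' that has a matching ')' later (hypothetical helper)."""
    match = BRACKETED.search(name)
    return match.group(1) if match else name

print(repr(drop_parenthesised('amelie(english dubbed)')))                  # 'amelie'
print(repr(drop_parenthesised('down with love (cute romantic comedy)')))  # 'down with love '
print(repr(drop_parenthesised('invictus')))                               # 'invictus'
```

Note that this pattern only looks at round brackets. Square and curly brackets are handled only when they open a year, through the first regular expression.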
The following code snippet extracts the movie names (to an extent) from the file names.
import re

fr = open('filenameslist.txt', 'r')
fw = open('movienames.txt', 'w')
for line in fr:
    text = line.strip()
    # keep the base name (after the last backslash) without the video extension
    text1 = re.search(r'([^\\]+)\.(avi|mkv|mpeg|mpg|mov|mp4)$', text)
    if text1:
        text = text1.group(1)
    text = text.replace('.', ' ').lower()
    # keep everything before the first format tag or four-digit year
    text2 = re.search(r'(.*?)(dvdrip|xvid| cd[0-9]|dvdscr|brrip|divx|[\{\(\[]?[0-9]{4}).*', text)
    if text2:
        text = text2.group(1)
    # drop a bracketed comment and everything after it
    text3 = re.search(r'(.*?)\(.*\)(.*)', text)
    if text3:
        text = text3.group(1)
    # print text
    fw.write(text + '\n')
fr.close()
fw.close()
The output can be improved further with a few more observations: for example, we can replace characters like underscores with spaces, and we can treat four digits as a year only when the next character is a non-word character.
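One possible way to try both ideas is sketched below. It is not the code used for the table: the function name `clean_title` is invented here, and the extra check that the year is not preceded by another digit is my own addition, since the next-character test alone would still cut "20000" after its first digit.

```python
import re

# Year must be exactly four digits: no digit before it, no word character after it.
TAG_OR_YEAR = re.compile(
    r'(.*?)(dvdrip|xvid| cd[0-9]|dvdscr|brrip|divx'
    r'|[\{\(\[]?(?<![0-9])[0-9]{4}(?!\w)).*')
BRACKETED = re.compile(r'(.*?)\(.*\)(.*)')

def clean_title(stem):
    """Guess a movie title from a file name without its extension (hypothetical helper)."""
    # Treat runs of dots and underscores as a single space.
    text = re.sub(r'[._]+', ' ', stem).lower()
    match = TAG_OR_YEAR.search(text)
    if match:
        text = match.group(1)
    match = BRACKETED.search(text)
    if match:
        text = match.group(1)
    return text.strip()

print(clean_title('ENEMY_OF_THE_STATE..DVDrip(vice)'))  # enemy of the state
print(clean_title('20000 Leagues Under The Sea'))       # 20000 leagues under the sea
print(clean_title("Nim's.Island[2008]DvDrip-aXXo"))     # nim's island
```

Names such as VTS_02_1 still come out as "vts 02 1", so this only improves the cases where the file name actually contains the title.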
Sunday, September 4, 2011
Extracting IMDB data using Python
Sample code for extracting IMDB data using the Python BeautifulSoup package.
Requirement: extract all feature movie names and their ratings from the IMDB database for a particular year.
The parameters used in the query were identified using the IMDB advanced search function. The start, count and year parameters were used for querying in this case. The URL is queried for 100 records at a time, since more than that is not allowed. After the movie names and ratings for 100 records are extracted, the URL is queried for the next 100 records, and so on.
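The paging described above can be sketched without touching the network. The snippet below is written for Python 3 and only builds the URLs; the helper name `page_urls` is invented, the parameter names are the ones used in this post, and I am assuming that start is 1-based. The search page may no longer accept these parameters.

```python
from urllib.parse import urlencode

BASE_URL = 'http://www.imdb.com/search/title'  # URL used in this post
PAGE_SIZE = 100                                # largest count the post says is allowed

def page_urls(year, total, first=1):
    """Yield one search URL per block of PAGE_SIZE records (hypothetical helper)."""
    for start in range(first, total + 1, PAGE_SIZE):
        query = urlencode({
            'languages': 'en',
            'title_type': 'feature',
            'sort': 'num_votes,desc',
            'count': PAGE_SIZE,
            'start': start,
            'year': year,
        })
        yield BASE_URL + '?' + query

for url in page_urls(2010, 250):
    print(url)
```

For 250 results this yields three URLs with start=1, start=101 and start=201. urlencode escapes the comma in the sort value as %2C, which is equivalent to the literal comma.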
The code works with two files, 'imdb_conf' and 'ratings'. The 'imdb_conf' file keeps track of the record number last read, and the 'ratings' file stores the movie name, rating and year.
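As an illustration of how such a checkpoint file can be read back, here is a rough Python 3 sketch. It assumes the layout the script writes (a header line, then one tab-separated "record number, year" line per fetched page); the function name `read_checkpoint` is hypothetical.

```python
def read_checkpoint(lines):
    """Return (next_start, year) from checkpoint lines, or (0, None) if no page was recorded."""
    records = [line.rstrip('\n').split('\t') for line in lines if line.strip()]
    if len(records) < 2:            # only the header (or nothing at all) is present
        return 0, None
    last_read, year = records[-1][:2]
    return int(last_read) + 1, year

print(read_checkpoint(['LastreadLine\tYear\n']))                                  # (0, None)
print(read_checkpoint(['LastreadLine\tYear\n', '100\t2010\n', '200\t2010\n']))   # (201, '2010')
print(read_checkpoint(['LastreadLine\tYear\n', '200\t2010\n', '-1\t2009\n']))    # (0, '2009')
```

The last call shows the "-1" marker that the script writes when a year is finished: adding one gives a start of 0 for the previous year.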
Web scraping the IMDB website for the required data.
[sourcecode language="python"]
from BeautifulSoup import BeautifulSoup
import HTMLParser
import os
import re
import urllib2

def get_start_pos_yr(fimdb_config):
    #for starting after last fetched record
    #last line contains the last record fetched
    nlines = fimdb_config.readlines()
    startfrom = -1
    year = None
    if len(nlines) > 1:
        list_num = re.search('[^\t]+', nlines[-1])
        if list_num:
            startfrom = int(list_num.group()) + 1
            year = re.search('\t[0-9]+', nlines[-1]).group().strip()
    return startfrom, year

def get_soup(url):
    #get soup object for the url
    try:
        page = urllib2.urlopen(url)
    except urllib2.URLError, e:
        print 'Failed to fetch ' + url
        raise e
    try:
        soup = BeautifulSoup(page)
    except HTMLParser.HTMLParseError, e:
        print 'Failed to parse ' + url
        raise e
    return soup

def get_ntotal(soup):
    #fetch total number of records present for particular query
    total_count = 1
    for div in soup.findAll('div', {'id': 'left'}):
        #print div.contents[0]
        total_count = re.search('[ ]+[0-9,]+', div.contents[0])
        if total_count:
            total_count = total_count.group().replace(',', '').strip()
            #print "total"+total_count
    return total_count

def set_rating(soup, fwimdb_config, frating, year, startfrom):
    cond = True
    count_rec = 0
    total_res = get_ntotal(soup)
    year = str(year)
    #total_res=100
    while cond:
        for tr in soup.findAll('tr', {'class': re.compile('(odd|even)[ a-zA-Z]*')}): #each row
            for td in tr.findAll('td', {'class': 'title'}):
                for link in td.findAll('a', {'href': re.compile('/title/tt[^/]+/$')}):
                    movie_name = link.contents[0] #title name
                for rating in td.findAll('div', {'class': 'rating rating-list'}):
                    count_rec = count_rec + 1
                    if rating.has_key('title'):
                        #print "hurray"
                        rt = re.search('[0-9]+[^(]+', rating['title']) #rating
                        if rt:
                            frating.write(movie_name + "\t" + rt.group().strip() + "\t" + year + "\n")
                        else:
                            frating.write(movie_name + "\t--\t" + year + "\n")
                    else:
                        frating.write(movie_name + "\t--\t" + year + "\n")
                    #print movie_name+"\t"+rt.group()
        #checkpoint: number of records read so far for this year
        fwimdb_config.write(str(count_rec) + "\t" + year + "\n")
        if startfrom == 0:
            startfrom = 101 #second run
        else:
            startfrom = startfrom + 100
        if startfrom >= int(total_res):
            cond = False
            #mark this year as finished; the next run starts with the previous year
            fwimdb_config.write("-1" + "\t" + str(int(year) - 1) + "\n")
        print str(startfrom) + " " + str(total_res)
        soup = get_soup("http://www.imdb.com/search/title?languages=en&title_type=feature&count=100&sort=num_votes,desc&start=" + str(startfrom) + "&year=" + year)

def main():
    fwimdb_conf = open("imdb_conf", "r+")
    frating = open("ratings", "a") #ratings
    fwimdb_conf.write("LastreadLine\tYear\n")
    #rewind so the header and any earlier records are read back
    fwimdb_conf.seek(0)
    startfrom, year = get_start_pos_yr(fwimdb_conf)
    if startfrom == -1:
        startfrom = 0
    if year is None:
        year = "2010"
    print startfrom
    soup = get_soup("http://www.imdb.com/search/title?languages=en&title_type=feature&sort=num_votes,desc&count=100&start=" + str(startfrom) + "&year=" + year)
    set_rating(soup, fwimdb_conf, frating, year, startfrom)
    frating.close()
    fwimdb_conf.close()

if __name__ == '__main__':
    main()
[/sourcecode]
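The two regular expressions that pull the rating and the total record count out of the page text can be tried on their own. The sketch below is Python 3, and the sample strings are hypothetical stand-ins for what the page might contain, not captured IMDB output; the helper names are invented as well.

```python
import re

def parse_rating(title_attr):
    """Return the rating text before the '(' of the vote count, or '--' if there is none."""
    match = re.search(r'[0-9]+[^(]+', title_attr)
    return match.group().strip() if match else '--'

def parse_total(text):
    """Return the first space-preceded number in the text as an int, or None."""
    match = re.search(r'[ ]+[0-9,]+', text)
    return int(match.group().replace(',', '').strip()) if match else None

# Hypothetical sample strings, used only to exercise the patterns.
print(parse_rating('Users rated this 8.8/10 (1,234 votes)'))  # 8.8/10
print(parse_rating('Awaiting enough votes'))                  # --
print(parse_total('1-100 of 12,345 titles.'))                 # 12345
```

Both patterns depend heavily on the exact wording of the page, so they are likely to need adjusting whenever the markup changes.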
For demonstration purposes only. If you plan to use IMDB data beyond personal use, you should contact the IMDB Licensing department.