Input file name | Output achieved
---|---
countdown.to.zero.2010.xvid-submerge.avi | countdown to zero
DrJn.2010.BRRip_mediafiremoviez.com.mkv | drjn
Nim's.Island[2008]DvDrip-aXXo.avi | nim's island
Invictus.DVDSCR.xViD-xSCR.CD1.avi | invictus
Invictus.DVDSCR.xViD-xSCR.CD2.avi | invictus
20000 Leagues Under The Sea.avi |
Across The Universe.MoZinRaT CD1.avi | across the universe mozinrat
Adoration 2008 DvdRip ExtraScene RG.avi | adoration
Amelie(English Dubbed).avi | amelie
America.2009.STV.DVDRip.XviD-ViSiON.avi | america
VTS_02_1.avi | vts_02_1
VTS_02_2.avi | vts_02_2
Antibodies.2005.GERMAN.DVDRip.XviD.AC3.CD1-AFO.avi | antibodies
arranged.xvid-reserved.avi | arranged
badder.santa.dvdrip.xvid-deity.avi | badder santa
Balls of Fury[2007]DvDrip[Eng]-FXG.avi | balls of fury
Bruno (2009) DVDRip-MAXSPEED www.torentz.3xforum.ro.avi | bruno
Defiance DvDSCR[2009] ( 10rating ).avi | defiance
Down With Love (cute romantic comedy).avi | down with love
Einstein.And.Eddington.2008.DVDRip.XviD.avi | einstein and eddington
ENEMY_OF_THE_STATE..DVDrip(vice).avi | enemy_of_the_state
The approach is based on the following observations:
- Most file names contain a format tag such as 'dvdrip', 'xvid', 'brrip' or 'dvdscr', or a marker such as 'CD1', '(<year>)' or '[<year>]', and everything after the first of these tokens does not contain any useful data.
- Sometimes extra information is added to the file name inside brackets, for example:
Defiance DvDSCR[2009] ( 10rating ).avi
Down With Love (cute romantic comedy).avi
So we can also ignore the bracketed part and everything after it, on the assumption that movie names do not contain brackets and that nothing useful follows them.
'(.*?)(dvdrip|xvid| cd[0-9]|dvdscr|brrip|divx|[\{\(\[]?[0-9]{4}).*'
This regular expression matches file names that contain dvdrip, brrip, xvid (any number of such keywords can be added to the alternation) or a four-digit year. Because the first group uses the lazy quantifier .*?, it stops at the first position where any one of the patterns appears. We then extract the first captured group, \1.
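To see the lazy match in isolation, here is a minimal sketch that applies the same pattern to a few names that have already been lowercased and had their dots replaced by spaces. The names `NOISE` and `strip_noise` are hypothetical helpers introduced only for this illustration.

```python
import re

# Same pattern as in the post, written as a raw string.
NOISE = re.compile(
    r'(.*?)(dvdrip|xvid| cd[0-9]|dvdscr|brrip|divx|[\{\(\[]?[0-9]{4}).*'
)

def strip_noise(name):
    """Return the text before the first noise token, or the name unchanged."""
    match = NOISE.search(name)
    return match.group(1) if match else name

# repr() makes trailing spaces and empty results visible.
print(repr(strip_noise("nim's island[2008]dvdrip-axxo")))
print(repr(strip_noise("badder santa dvdrip xvid-deity")))
print(repr(strip_noise("20000 leagues under the sea")))
```

The second call keeps a trailing space, and the third returns an empty string because the first four digits of "20000" are taken as a year. That is presumably why the "20000 Leagues Under The Sea" row in the table has no output.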
Secondly, to remove the part within brackets, we use the regular expression
'(.*?)\(.*\)(.*)' and again keep the first captured group.
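A small sketch of this second step on its own, assuming the input has already been lowercased. `BRACKETED` and `drop_parenthesised` are hypothetical names. Note that this pattern only looks for round parentheses; square brackets are handled only when they wrap a year in the first expression.

```python
import re

# Bracket-removal pattern from the post, written as a raw string.
BRACKETED = re.compile(r'(.*?)\(.*\)(.*)')

def drop_parenthesised(name):
    """Return the text before the first '(...)' remark, or the name unchanged."""
    match = BRACKETED.search(name)
    return match.group(1) if match else name

for sample in ("amelie(english dubbed)",
               "down with love (cute romantic comedy)",
               "arranged"):
    # strip() removes the space left in front of the removed remark.
    print(drop_parenthesised(sample).strip())
```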
The following code snippet extracts the movie names (to an extent) from the file names.
import re

# Read one file name per line and write one cleaned movie name per line.
with open('filenameslist.txt', 'r') as fr, open('movienames.txt', 'w') as fw:
    for line in fr:
        text = line.strip()
        # Keep the base name: drop any leading Windows path and the video extension.
        text1 = re.search(r'([^\\]+)\.(avi|mkv|mpeg|mpg|mov|mp4)$', text)
        if text1:
            text = text1.group(1)
            text = text.replace('.', ' ').lower()
            # Cut at the first format tag, CD marker or four-digit year.
            text2 = re.search(r'(.*?)(dvdrip|xvid| cd[0-9]|dvdscr|brrip|divx|[\{\(\[]?[0-9]{4}).*', text)
            if text2:
                text = text2.group(1)
            # Cut at the first parenthesised remark.
            text3 = re.search(r'(.*?)\(.*\)(.*)', text)
            if text3:
                text = text3.group(1)
            # print(text)
            fw.write(text + '\n')
The output can be improved further, for example by replacing characters such as underscores with spaces, or by treating four digits as a year only when the next character is a non-word character.
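The improvements suggested above could be sketched roughly as follows. This is an illustration under stated assumptions, not a tested replacement for the script: `clean_title` and the constant names are hypothetical, the input is assumed to be a bare file name without a directory path, and the extension is matched case-insensitively. The lookbehind `(?<!\d)` is an addition beyond what the post suggests; without it, the digits "0000" inside "20000" would still be read as a year.

```python
import re

EXTENSION = re.compile(r'\.(avi|mkv|mpeg|mpg|mov|mp4)$', re.IGNORECASE)
# Year: optional opening bracket, then exactly four digits that are neither
# preceded by another digit nor followed by a word character.
NOISE = re.compile(
    r'(.*?)(dvdrip|xvid| cd[0-9]|dvdscr|brrip|divx'
    r'|[\{\(\[]?(?<!\d)[0-9]{4}(?!\w)).*'
)
BRACKETED = re.compile(r'(.*?)\(.*\)(.*)')

def clean_title(filename):
    """Best-effort movie title from a bare file name (no directory part)."""
    text = EXTENSION.sub('', filename.strip())
    # Treat dots and underscores as word separators.
    text = text.replace('.', ' ').replace('_', ' ').lower()
    for pattern in (NOISE, BRACKETED):
        match = pattern.search(text)
        if match:
            text = match.group(1)
    # Collapse repeated spaces and trim the ends.
    return ' '.join(text.split())

for name in ('20000 Leagues Under The Sea.avi',
             'ENEMY_OF_THE_STATE..DVDrip(vice).avi',
             'DrJn.2010.BRRip_mediafiremoviez.com.mkv'):
    print(clean_title(name))
```

With these changes the "20000 Leagues" name is kept whole and the underscores in "ENEMY_OF_THE_STATE" become spaces. Names such as VTS_02_1 still pass through as-is apart from the separator change, since nothing in them looks like a tag or a year.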
Can you make the same for PHP please. :(
Tried using it, but it takes everything out except the title and year, so the string 'Balls of Fury[2007]DvDrip[Eng]-FXG' became Balls of Fury[2007]. How can I get it to get rid of the year part?
@wolfman122
Try using
import re

text = 'Balls of Fury[2007]DvDrip[Eng]-FXG'.strip()
# Cut at the first format tag, CD marker or four-digit year.
text2 = re.search(r'(.*?)(dvdrip|xvid| cd[0-9]|dvdscr|brrip|divx|[\{\(\[]?[0-9]{4}).*', text)
if text2:
    text = text2.group(1)
# Cut at the first parenthesised remark.
text3 = re.search(r'(.*?)\(.*\)(.*)', text)
if text3:
    text = text3.group(1)
print(text)
It will work. Actually, the code in the blog post also expects a file extension, so when you don't include one it fails at the first if clause itself.
Thanks for the very helpful module to enhance my Movies Rater https://github.com/montaro/movies-rater
The new version, with your module, is far more accurate: my application can now get ratings for almost any movie name in any format, despite the frequent noise in file names.
Thanks again!
Thanks Montaro! I am glad to hear that.