Monday, September 5, 2011

Extracting movie title from torrent file name using Regular Expression

Movie files downloaded from torrent sites have file names that contain format types (such as dvdrip, dvdscr, xvid), the year, comments, user names and, of course, the movie name. We want to extract the movie name from these file names.



Input File Names -> Output Achieved

countdown.to.zero.2010.xvid-submerge.avi -> countdown to zero
DrJn.2010.BRRip_mediafiremoviez.com.mkv -> drjn
Nim's.Island[2008]DvDrip-aXXo.avi -> nim's island
Invictus.DVDSCR.xViD-xSCR.CD1.avi -> invictus
Invictus.DVDSCR.xViD-xSCR.CD2.avi -> invictus
20000 Leagues Under The Sea.avi -> (empty)
Across The Universe.MoZinRaT CD1.avi -> across the universe mozinrat
Adoration 2008 DvdRip ExtraScene RG.avi -> adoration
Amelie(English Dubbed).avi -> amelie
America.2009.STV.DVDRip.XviD-ViSiON.avi -> america
VTS_02_1.avi -> vts_02_1
VTS_02_2.avi -> vts_02_2
Antibodies.2005.GERMAN.DVDRip.XviD.AC3.CD1-AFO.avi -> antibodies
arranged.xvid-reserved.avi -> arranged
badder.santa.dvdrip.xvid-deity.avi -> badder santa
Balls of Fury[2007]DvDrip[Eng]-FXG.avi -> balls of fury
Bruno (2009) DVDRip-MAXSPEED www.torentz.3xforum.ro.avi -> bruno
Defiance DvDSCR[2009] ( 10rating ).avi -> defiance
Down With Love (cute romantic comedy).avi -> down with love
Einstein.And.Eddington.2008.DVDRip.XviD.avi -> einstein and eddington
ENEMY_OF_THE_STATE..DVDrip(vice).avi -> enemy_of_the_state

The approach is based on two observations:

  • Most file names contain a format tag such as 'dvdrip', 'xvid', 'brrip' or 'dvdscr', or a marker such as 'CD1', '(<year>)' or '[<year>]', and nothing after the first of these words carries useful data.


  • Sometimes extra information is added to the file name inside brackets, as in
    Defiance DvDSCR[2009] ( 10rating ).avi
    Down With Love (cute romantic comedy).avi
    so we can also drop everything from the opening bracket onwards: movie names do not contain brackets, and nothing useful follows them.

'(.*?)(dvdrip|xvid| cd[0-9]|dvdscr|brrip|divx|[\{\(\[]?[0-9]{4}).*'

This regular expression matches file names that contain dvdrip, brrip, xvid (any number of such keywords can be listed) or a year. Because the leading group uses the lazy quantifier .*?, the match stops at whichever of these patterns appears first. We then extract the first captured group, \1.
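To see the lazy match at work, here is a minimal sketch that applies the pattern to a few of the sample names. The helper name `strip_noise` is made up for this illustration. It assumes the dots have already been replaced with spaces, lower-cases the name first (the pattern only lists lower-case keywords), and trims the trailing space that the captured group usually ends with.

```python
import re

# Same pattern as above, written as a raw string and compiled once.
NOISE = re.compile(r'(.*?)(dvdrip|xvid| cd[0-9]|dvdscr|brrip|divx|[\{\(\[]?[0-9]{4}).*')

def strip_noise(name):
    """Hypothetical helper: keep the text before the first keyword or year."""
    text = name.lower()
    match = NOISE.search(text)
    if match:
        text = match.group(1)
    return text.strip()

print(strip_noise('badder santa dvdrip xvid-deity'))  # badder santa
print(strip_noise("Nim's Island[2008]DvDrip-aXXo"))   # nim's island
```

Both names contain more than one keyword, but only the first one matters: the lazy group stops growing as soon as any alternative matches.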

Secondly, to remove the part within round brackets, we use the regular expression

'(.*?)\(.*\)(.*)'

and again keep only the first captured group.
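A small sketch of this second step, using two of the sample names with the extension already removed. The function name `drop_parenthesised` is hypothetical, and the trailing space left before the bracket is trimmed here for readability. Note that the pattern only looks for round brackets; square brackets around a year are handled by the first expression.

```python
import re

# Same bracket pattern as above, as a raw string.
BRACKETS = re.compile(r'(.*?)\(.*\)(.*)')

def drop_parenthesised(name):
    """Hypothetical helper: keep only the text before a (...) comment."""
    match = BRACKETS.search(name)
    if match:
        return match.group(1).strip()
    return name

print(drop_parenthesised('Down With Love (cute romantic comedy)'))  # Down With Love
print(drop_parenthesised('Amelie(English Dubbed)'))                 # Amelie
```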

The following code snippet extracts the movie names (to an extent) from the file names.





import re

# Patterns are raw strings so the backslashes reach the regex engine unchanged.
with open('filenameslist.txt', 'r') as fr, open('movienames.txt', 'w') as fw:
    for line in fr:
        text = line.strip()
        # Drop any leading path and the file extension.
        text1 = re.search(r'([^\\]+)\.(avi|mkv|mpeg|mpg|mov|mp4)$', text)
        if text1:
            text = text1.group(1)
        text = text.replace('.', ' ').lower()
        # Keep only what precedes the first format keyword or year.
        text2 = re.search(r'(.*?)(dvdrip|xvid| cd[0-9]|dvdscr|brrip|divx|[\{\(\[]?[0-9]{4}).*', text)
        if text2:
            text = text2.group(1)
        # Keep only what precedes a bracketed comment.
        text3 = re.search(r'(.*?)\(.*\)(.*)', text)
        if text3:
            text = text3.group(1)
        # print(text)
        fw.write(text + '\n')




The output can be improved further. For example, we can replace characters such as underscores with spaces, and we can treat four digits as a year only when the next character is a non-word character.
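One way these two improvements might look is sketched below. The helper `clean_title` and its argument are hypothetical, not part of the script above, and it expects a name whose extension has already been removed. Besides the "next character is a non-word character" check, the sketch also adds a check that the four digits are not preceded by another digit, which is an extra assumption: without it, "20000" would still match from its second digit.

```python
import re

# Year alternative tightened: exactly four digits, not preceded by a digit
# and not followed by a word character.
NOISE = re.compile(
    r'(.*?)(dvdrip|xvid| cd[0-9]|dvdscr|brrip|divx'
    r'|[\{\(\[]?(?<![0-9])[0-9]{4}(?!\w)).*'
)

def clean_title(stem):
    """Hypothetical helper: stem is a file name without its extension."""
    # Treat dots and underscores alike as word separators.
    text = re.sub(r'[._]', ' ', stem).lower()
    match = NOISE.search(text)
    if match:
        text = match.group(1)
    # Collapse runs of spaces and trim the ends.
    return ' '.join(text.split())

print(clean_title('ENEMY_OF_THE_STATE..DVDrip(vice)'))  # enemy of the state
print(clean_title('20000 Leagues Under The Sea'))       # 20000 leagues under the sea
```

A title that is itself a four-digit number would still be cut to an empty string, so this remains a heuristic.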

5 comments:

  1. Can you make the same for PHP please. :(

  2. I tried using it, but it takes everything out except the title and year, so the string 'Balls of Fury[2007]DvDrip[Eng]-FXG' became Balls of Fury[2007]. How can I get rid of the year part?

  3. @wolfman122
    Try using

    import re
    text = 'Balls of Fury[2007]DvDrip[Eng]-FXG'.strip()
    text2 = re.search(r'(.*?)(dvdrip|xvid| cd[0-9]|dvdscr|brrip|divx|[\{\(\[]?[0-9]{4}).*', text)
    if text2:
        text = text2.group(1)
    text3 = re.search(r'(.*?)\(.*\)(.*)', text)
    if text3:
        text = text3.group(1)
    print(text)

    It will work. The code in the blog post also expects a file extension, so when you don't include one it fails at the first if clause itself.

  4. Thanks for the very helpful module, which I used to enhance my Movies Rater https://github.com/montaro/movies-rater
    The new version that uses your module is far more accurate: my application can now get ratings for almost any movie name, whatever the format and despite the frequent noise in file names.

    Thanks again!

  5. Thanks Montaro! I am glad to hear that.
