Lazzy Scientist: Extracting movie title from torrent file name using Regular Expression

Monday, September 5, 2011

Extracting movie title from torrent file name using Regular Expression

Movie files downloaded from torrent sites have file names that contain format types (such as dvdrip, dvdscr, xvid), the year, comments, user names and, of course, the movie name. We want to extract the movie name from these file names.

	Input File names	Output Achieved
	countdown.to.zero.2010.xvid-submerge.avi	countdown to zero
	DrJn.2010.BRRip_mediafiremoviez.com.mkv	drjn
	Nim's.Island[2008]DvDrip-aXXo.avi	nim's island
	Invictus.DVDSCR.xViD-xSCR.CD1.avi	invictus
	Invictus.DVDSCR.xViD-xSCR.CD2.avi	invictus
	20000 Leagues Under The Sea.avi
	Across The Universe.MoZinRaT CD1.avi	across the universe mozinrat
	Adoration 2008 DvdRip ExtraScene RG.avi	adoration
	Amelie(English Dubbed).avi	amelie
	America.2009.STV.DVDRip.XviD-ViSiON.avi	america
	VTS_02_1.avi	vts_02_1
	VTS_02_2.avi	vts_02_2
	Antibodies.2005.GERMAN.DVDRip.XviD.AC3.CD1-AFO.avi	antibodies
	arranged.xvid-reserved.avi	arranged
	badder.santa.dvdrip.xvid-deity.avi	badder santa
	Balls of Fury[2007]DvDrip[Eng]-FXG.avi	balls of fury
	Bruno (2009) DVDRip-MAXSPEED www.torentz.3xforum.ro.avi	bruno
	Defiance DvDSCR[2009] ( 10rating ).avi	defiance
	Down With Love (cute romantic comedy).avi	down with love
	Einstein.And.Eddington.2008.DVDRip.XviD.avi	einstein and eddington
	ENEMY_OF_THE_STATE..DVDrip(vice).avi	enemy_of_the_state

The approach is based on two observations.

Most file names contain a format tag such as 'dvdrip', 'xvid', 'brrip' or 'dvdscr', or another marker such as 'CD1', '(<year>)' or '[<year>]'. Nothing after the first of these markers is useful.

Sometimes extra information is added to the file name inside brackets, as in
Defiance DvDSCR[2009] ( 10rating ).avi
Down With Love (cute romantic comedy).avi
Movie names do not normally contain brackets, and nothing useful follows them, so we can also drop everything from the opening bracket onward.

'(.*?)(dvdrip|xvid| cd[0-9]|dvdscr|brrip|divx|[\{$\[]?[0-9]{4}).*'

This regular expression matches file names that contain dvdrip, brrip, xvid (any number of such values can be listed) or a four-digit year. Because (.*?) is lazy, the match stops at the first of these patterns that appears. We then take the first captured group, \1, as the title.
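As a minimal sketch of the lazy match described above, the snippet below applies the pattern to a few names from the table. The helper name cut_at_marker is hypothetical, and the input is assumed to be already lowercased with dots replaced by spaces, as the full script does.

```python
import re

# Pattern copied from the post. (.*?) is lazy, so group 1 ends at the
# first format marker or four-digit run found in the name.
MARKER = re.compile(
    r'(.*?)(dvdrip|xvid| cd[0-9]|dvdscr|brrip|divx|[\{$\[]?[0-9]{4}).*'
)

def cut_at_marker(name):
    """Return the text before the first marker (hypothetical helper)."""
    match = MARKER.search(name)
    # Names without any marker are returned unchanged.
    return match.group(1) if match else name

for name in ('countdown to zero 2010 xvid-submerge',
             "nim's island[2008]dvdrip-axxo",
             'amelie'):
    print(repr(cut_at_marker(name)))
```

When a space separates the title from the marker, that space stays at the end of group 1, so a final strip() may be wanted.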

Second, to remove a bracketed part and everything after it, we use the regular expression

'(.*?)\(.*$(.*)'
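A small sketch of this second step, again on lowercased input; cut_at_paren is a hypothetical helper name, not something from the original script.

```python
import re

# Pattern copied from the post: a lazy prefix, then "(" and everything
# up to the end of the string.
PAREN = re.compile(r'(.*?)\(.*$(.*)')

def cut_at_paren(name):
    """Return the text before the first "(" (hypothetical helper)."""
    match = PAREN.search(name)
    # Names without "(" are returned unchanged.
    return match.group(1) if match else name

for name in ('down with love (cute romantic comedy)',
             'amelie(english dubbed)',
             'invictus'):
    print(repr(cut_at_paren(name)))
```

Because .*$ already consumes the rest of the string, the trailing (.*) group is always empty; only group 1 is used.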

The following code snippet extracts the movie names (to an extent) from the file names.

import re

# Read one file name per line and write one guessed title per line.
with open('filenameslist.txt', 'r') as fr, open('movienames.txt', 'w') as fw:
    for line in fr:
        text = line.strip()
        # Drop any Windows-style path prefix and the video extension.
        text1 = re.search(r'([^\\]+)\.(avi|mkv|mpeg|mpg|mov|mp4)$', text)
        if text1:
            text = text1.group(1)
        text = text.replace('.', ' ').lower()
        # Keep only what precedes the first format marker or four-digit year.
        text2 = re.search(r'(.*?)(dvdrip|xvid| cd[0-9]|dvdscr|brrip|divx|[\{$\[]?[0-9]{4}).*', text)
        if text2:
            text = text2.group(1)
        # Drop a bracketed comment and everything after it.
        text3 = re.search(r'(.*?)\(.*$(.*)', text)
        if text3:
            text = text3.group(1)
        # print(text)
        fw.write(text + '\n')

The output can be improved further, for example by replacing characters such as underscores with spaces, or by treating four digits as a year only when the next character is a non-word character.
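One possible way to sketch these two improvements is shown below. The function name clean_title and the exact year rule (four digits, not preceded by a digit and not followed by a word character) are assumptions made for illustration, not part of the original script.

```python
import re

# Assumed refinement of the post's marker pattern: the year must be exactly
# four digits, not preceded by a digit and not followed by a word character.
MARKER = re.compile(
    r'(.*?)(dvdrip|xvid| cd[0-9]|dvdscr|brrip|divx'
    r'|[\{$\[]?(?<![0-9])[0-9]{4}(?!\w)).*'
)

def clean_title(stem):
    """Guess a title from a file name without its extension (hypothetical)."""
    # Runs of dots and underscores become a single space.
    text = re.sub(r'[._]+', ' ', stem).lower()
    match = MARKER.search(text)
    if match:
        text = match.group(1)
    # Drop a bracketed comment and everything after it.
    text = text.split('(', 1)[0]
    return text.strip()

for stem in ('ENEMY_OF_THE_STATE..DVDrip(vice)',
             '20000 Leagues Under The Sea',
             'Balls of Fury[2007]DvDrip[Eng]-FXG'):
    print(clean_title(stem))
```

With the stricter year rule, the "2000" inside "20000" is no longer taken as a year, so that title is kept instead of being cut to an empty string.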

7 comments:

Divyendu Singh, December 28, 2012 at 3:18 AM
Can you make the same for PHP please. :(
wolffman122, June 28, 2013 at 7:04 PM
I tried using it, but it strips everything except the title and year, so the string 'Balls of Fury[2007]DvDrip[Eng]-FXG' became Balls of Fury[2007]. How can I get it to remove the year part as well?
crazycloud, June 29, 2013 at 1:05 AM
@wolffman122
Try using

import re
text = 'Balls of Fury[2007]DvDrip[Eng]-FXG'.strip()
text2 = re.search(r'(.*?)(dvdrip|xvid| cd[0-9]|dvdscr|brrip|divx|[\{$\[]?[0-9]{4}).*', text)
if text2:
    text = text2.group(1)
text3 = re.search(r'(.*?)\(.*$(.*)', text)
if text3:
    text = text3.group(1)
print(text)

It will work. The code in the blog post expects a file extension as well, so when you don't include one it fails at the first if clause.
Montaro, December 11, 2014 at 10:42 PM
Thanks for the very helpful module, which I used to enhance my Movies Rater https://github.com/montaro/movies-rater
The new version, with your module, is much more accurate: my application can now get ratings for almost any movie name in any format, despite the frequent noise in file names.

Thanks again!
crazycloud, December 23, 2014 at 7:13 AM
Thanks, Montaro! I am glad to hear that.
خدمات منزلية ("Home Services"), May 17, 2020 at 4:29 PM
Drain-clearing company in Al-Ahsa

Drain-clearing company in Al-Qassim

Roof insulation company in Al-Qassim

Roof insulation company in Abha

Roof insulation company in Al-Ahsa
Geron20425, December 16, 2021 at 4:38 AM
I came onto your blog while focusing just slightly submits. Nice strategy for next, I will be bookmarking at once seize your complete rises... stream torrent videos