if its too loud, turn it down

Thursday, June 25, 2009

Record WNYC's "The Next Big Thing" streams to MP3

Remember that show "The Next Big Thing" on WNYC? Good show. Ran from 1999 to early 2006. I miss it. Fortunately, they have most of the shows archived on the WNYC site. Unfortunately, they are in RealAudio. When I tried to listen to the RealAudio stream on my Mac with the crappy RealAudio player (are they even still in business?? If so, why?), I got a random collection of clicks and blips.

But I soon found out that mplayer will play them. So I wrote up a little script that will rip the RealAudio stream to MP3, which is better anyway because then I can listen on my iPod. The quality isn't great - I think it's a 32 kbps stream, which is roughly AM radio quality. It's listenable. The script will run on any Unix-based system (like a Mac) with the following prerequisites:
  • mplayer
  • wget
  • lame
  • sox
  • Perl 5.8.8+
  • Perl Module: MP3::Tag
Note: you may need mplayer configured with RealAudio codecs as per this article (http://www.macosxhints.com/article.php?story=20050130184054216), but I tried it on a Linux system that didn't have them, so it should work without.
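
If you want to confirm everything is in place before kicking off a long rip, a quick check along these lines should do it. This is only a sketch: the helper names find_in_path and missing_prereqs are made up for illustration and are not part of nextbigthing.pl, and it only searches the directories on your PATH, so adjust it if your tools live somewhere else.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper (not part of nextbigthing.pl): return the full path of
# the first executable named $tool found on PATH, or undef if there is none.
sub find_in_path {
    my ($tool) = @_;
    foreach my $dir (File::Spec->path()) {
        my $candidate = File::Spec->catfile($dir, $tool);
        return $candidate if -f $candidate && -x $candidate;
    }
    return;
}

# Returns the list of prerequisites that could not be found.
sub missing_prereqs {
    my @missing = grep { !defined find_in_path($_) } qw(mplayer wget lame sox);
    # MP3::Tag is a Perl module, so try to load it instead of searching PATH
    push @missing, 'MP3::Tag' unless eval { require MP3::Tag; 1 };
    return @missing;
}

# Only report when run as a script
unless (caller) {
    my @missing = missing_prereqs();
    print @missing ? "missing: @missing\n" : "all prerequisites found\n";
}
```

If a tool turns up somewhere other than /usr/bin, point the matching config variable at the top of the script at that path.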

One caveat is that although mplayer has the capability to dump RealAudio streams very quickly (by adding a large value, like 5000000, to the -bandwidth switch), I found that for some reason a lot of audio in these streams was skipped. Not good. So I left out the bandwidth switch and let it dump the stream in real time. Takes a lot longer, but you get all the audio.
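
To make the trade-off concrete, here is a small sketch that builds either form of the dump command. The helper name dump_command and the example stream and file names are hypothetical; the mplayer switches are the same ones the script uses.

```perl
use strict;
use warnings;

# Hypothetical helper: build the mplayer command that dumps a stream to $raw.
# Pass a $bandwidth (e.g. 5000000) for the fast dump that can skip audio;
# leave it out for the slow real-time dump.
sub dump_command {
    my ($mplayer, $url, $raw, $bandwidth) = @_;
    my @parts = ($mplayer);
    push @parts, '-bandwidth', $bandwidth if defined $bandwidth;
    push @parts, '-noframedrop', '-dumpfile', $raw, '-dumpstream', "'$url'";
    return join(' ', @parts);
}

# Example stream and output names are made up for illustration
print dump_command('/usr/bin/mplayer', 'rtsp://raudio.wnyc.org/nbt/example.ra', '/tmp/example.ra'), "\n";
```

Leaving the bandwidth argument optional makes it easy to try the fast dump on one stream and compare the result before committing to a whole month of shows.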

The code is below. I call it "nextbigthing.pl":

#!/usr/bin/perl
# 2009-06-21
# created to take the URL of one of WNYC's "The Next Big Thing" show archive pages as input (first argument on cmd line)
# ex: $ ./nextbigthing.pl http://www.wnyc.org/shows/tnbt/episodes/2002/05
# and create tagged MP3s of each show on the page as output
# requires:
# * mplayer with links to realaudio codecs (see here: http://www.macosxhints.com/article.php?story=20050130184054216)
# * wget
# * lame
# * sox
# * Perl module MP3::Tag

use strict;
use POSIX qw(strftime);
use MP3::Tag;

# config these
my $wget = "/usr/bin/wget";
my $mplayer = "/usr/bin/mplayer";
my $lame = "/usr/bin/lame";
my $sox = "/usr/bin/sox";
my $tmpfolder = "/home/user/tmp/tnbt/"; # create this if it doesn't exist
my $outfolder = "/home/user/Music/the_next_big_thing/"; # create this if it doesn't exist

my @rmfiles = ();

# get the url off the command line
if (!$ARGV[0]) {
 print "no URL found.  put the url in single quotes as the first argument after calling this program on the command line\n";
 exit 1;
}

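# wget the page
my $outfile = $tmpfolder . "tmpout.txt";
push(@rmfiles,$outfile);
my $wout = `$wget -U 'Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20041001' --timeout=30 -t 2 -q --output-document=${outfile} '$ARGV[0]'`;
# wget can leave an empty output file behind when the download fails, so check for a non-empty file
if (not(-s $outfile)) {
 myExit("nothing found. check url and try again",\@rmfiles);
}

# put file into array
open(FILE,"<",$outfile) or myExit("can't open $outfile: $!",\@rmfiles);
my @lines = <FILE>;
close(FILE);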

# search the array for keywords and place our show info into a hash
my %SHOWS = ();
my $counter=0;
my ($showdate,$episode_name,$episode_date,$episode_description);
foreach my $line (@lines) {
 # there's a really specific format we're looking for here.  if they change their template, this won't work at all
 # we're looking for show info here
 if ($line =~ /episodenamesmall/) {
  $lines[$counter] =~ /^.*?episodes\/(\d\d\d\d)\/(\d\d)\/(\d\d)">$/;
  $showdate = "${1}-${2}-${3}";
  $lines[$counter+1] =~ /^(.*?)</;
  $episode_name = $1;
  $lines[$counter+2] =~ /^<.*?>(.*?)<.*?>$/;
  $episode_date = $1;
  $lines[$counter+3] =~ /^<p>(.*?)<\/p>$/;
  $episode_description = $1;
  $SHOWS{$showdate}{'showdate'} = $showdate;
  $SHOWS{$showdate}{'episode_name'} = $episode_name;
  $SHOWS{$showdate}{'episode_date'} = $episode_date;
  $SHOWS{$showdate}{'episode_desc'} = $episode_description;
 } # end search for 'episodenamesmall'

 # we're looking for the url to the realmedia here
 if ($line =~ /<a class="listen"/) {
  my $rafiles;
  my @ramfiles = ();
  $lines[$counter+2] =~ /^\s+href="\/stream\/ram\.py\?(.*?)" class.*?$/;
  $rafiles = $1;
  $rafiles =~ s/file[\d]{0,2}=//g;
  my @tmpfiles = split(/&/,$rafiles);
  foreach my $tmpfile (@tmpfiles) {
   my $fullurl = 'rtsp://raudio.wnyc.org/' . $tmpfile;
   push(@ramfiles,$fullurl);
  }
  @{ $SHOWS{$showdate}{'files'} } = @ramfiles;
 } # end search for 'listen'

 $counter++; # keep the index in step with $line
} # end foreach loop thru html file

# loop thru the hash, grab each show, and encode
for my $show ( sort keys %SHOWS ) {
 my @wavfiles = ();
 print "Episode $SHOWS{$show}{'episode_name'} - $SHOWS{$show}{'episode_date'}\n";
 # our filenames
 my $wavfilename = ${tmpfolder} . replaceSpace(stripChars($SHOWS{$show}{'episode_name'})) . '_' . $SHOWS{$show}{'showdate'} . '.wav';
 my $mp3 = $wavfilename;
 $mp3 =~ s/^\Q${tmpfolder}\E/${outfolder}/;
 $mp3 =~ s/\.wav$/.mp3/;
 # if an mp3 already exists in the dest. dir, skip
 if (-e $mp3) {
  print "\nFilename $mp3 already exists, skipping...\n\n";
  next;
 }
 foreach my $file (@{ $SHOWS{$show}{'files'} }) {
  #print "File: $file\n";
  $file =~ /^rtsp:\/\/raudio\.wnyc\.org\/nbt\/(.*?)$/;
  my $ra_name = $1;
  # dump the raw .ra file
  my $raw = "${tmpfolder}${ra_name}";
  # removed the bandwidth switch because although it speeds up the dumping, it seems to skip a lot of audio.
  # so as of now it's just dumping in real-time
  #my $get_cmd = "$mplayer -bandwidth 5000000 -noframedrop -dumpfile $raw -dumpstream '$file'";
  my $get_cmd = "$mplayer -noframedrop -dumpfile $raw -dumpstream '$file'";
  print "MPLAYER DUMP COMMAND: $get_cmd\n";
  my $get_out = `$get_cmd`;
  push(@rmfiles,$raw); # raw dump gets cleaned up later
  # convert files to wav
  my $filename = "${tmpfolder}${ra_name}.wav";
  my $mplayer_cmd = "$mplayer $raw -vc dummy -vo null -af volume=0,channels=2 -ao pcm:waveheader:file=$filename";
  print "MPLAYER WAVE COMMAND: $mplayer_cmd\n";
  my $mp_out = `$mplayer_cmd`;
  if (-e $filename) {
   push(@wavfiles,$filename);
   push(@rmfiles,$filename);
  } else {
   myExit("no wavfile was created from $mplayer_cmd",\@rmfiles);
  }

 } # end foreach thru wavfiles
 # cat wavfiles together with sox
 my $wavfilejoin = join(' ',@wavfiles);
 # loop over just this show's wav files, in stream order, rather than globbing the whole tmp folder
 # (newer versions of sox may want "-e signed-integer -b 16" in place of "-sw")
 my $soxcmd = "(for wavfile in $wavfilejoin; do $sox \"\$wavfile\" -t .raw -r 44100 -sw -c 2 -; done) | $sox -t .raw -r 44100 -sw -c 2 - -t .wav $wavfilename";
 print "SOX CONCAT COMMAND: $soxcmd\n";
 my $soxout = `$soxcmd`;
 push(@rmfiles,$wavfilename); # joined wav gets cleaned up after encoding

 # encode the wav file
 if (-e $wavfilename) {
  my $title = "The Next Big Thing: \"" . stripChars($SHOWS{$show}{'episode_name'}) . "\" (" . $SHOWS{$show}{'episode_date'} . ")";
  my $author = "Dean Olsher";
  my $composer = "WNYC";
  my $album = "The Next Big Thing";
  $SHOWS{$show}{'showdate'} =~ /^(\d\d\d\d)-(\d\d)-(\d\d)$/;
  my $year = $1;
  my $info_url = $ARGV[0];
  my $genre = "Talk";
  # caused some problems adding the ID3v2 stuff on cmd line so we'll just have MP3::Tag do it
  #my $id3 = " --add-id3v2 --ignore-tag-errors --tt \"$title\" --ta \"$author $composer\" --tl \"$album\" --ty \"$year\" --tc \"$info_url\" --tg \"$genre\"";
  #my $encode = "$lame -V0 -h -b 128 --quiet --vbr-new$id3 $wavfilename $mp3";
  my $encode = "$lame -V0 -h -b 128 --quiet --vbr-new $wavfilename $mp3";
  print "LAME ENCODE COMMAND: $encode\n";
  my $enc_out = `$encode`;
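  # add proper ID3 tags and clean up
  if (-e $mp3) {
   my $comments = stripChars($SHOWS{$show}{'episode_desc'});
   my $encoder = "LAME"; # assumed value for the TSSE (encoder) frame
   id3v2tag($mp3,$title,$year,$author,$album,$info_url,$comments,$genre,$encoder);
   myExit("SUCCESS! Encoding complete for $title",\@rmfiles,1);
  } else {
   myExit("the mp3 doesn't exist so I can't tag it",\@rmfiles);
  } # end if for mp3 exists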
 } else {
  myExit("our resampled wav file doesn't exist so I can't encode an mp3",\@rmfiles);
 } # end if for wav exists
 print "\n\n";
} # end foreach thru shows

print "Done\n";

##################### SUBS ###########################

sub myExit {
 # removes temp files, prints a message, and exits unless told not to
 my ($message,$rmfiles_ref,$donotexit) = @_;
 my @rms = @$rmfiles_ref;
 foreach my $remove (@rms) {
  if (-e $remove) {
   print "REMOVING $remove\n";
   unlink($remove);
  }
 }
 print "$message\n";
 if (!$donotexit) {
  exit;
 }
} # end sub myExit

sub stripChars {

 my($text) = @_;
 $text =~ s/\n/ /g; # strip newlines
 $text =~ s/\t/ /g; # strip tabs
 $text =~ s/\r/ /g; # strip carriage returns
 $text =~ s/"/'/g; # strip quotes and replace with single quotes
 $text =~ s/\s+/ /g; # strip repeating spaces and replace with one
 return ($text);

} # end sub stripchars

sub replaceSpace {

 my($text) = shift;
 $text =~ s/[^\w\s]//g; # drop anything that isn't a word character or whitespace
 $text =~ s/^\s+//;
 $text =~ s/\s+$//;
 $text =~ s/\s/_/g; # spaces become underscores
 return ($text);

} # end sub replacespace

sub id3v2tag {
 # adds id3v2 tag to mp3s

 my ($file,$title,$year,$author,$album,$info_url,$comments,$genre,$encoder) = @_;
 my $mp3 = MP3::Tag->new($file);
 # read any existing tags, then make sure there is an ID3v2 tag to add frames to
 $mp3->get_tags;
 $mp3->new_tag("ID3v2") unless exists $mp3->{ID3v2};
 $mp3->{ID3v2}->add_frame("TALB", "$album");
 $mp3->{ID3v2}->add_frame("TIT2", "$title");
 $mp3->{ID3v2}->add_frame("TPE1", "$author");
 $mp3->{ID3v2}->add_frame("TCON", "$genre");
 $mp3->{ID3v2}->add_frame("TSSE", "$encoder");
 $mp3->{ID3v2}->add_frame("TYER", "$year");
 $mp3->{ID3v2}->add_frame("COMM", "ENG", "", "$comments $info_url");
 $mp3->{ID3v2}->add_frame("TIT3", "$comments $info_url");
 $mp3->{ID3v2}->add_frame("TRSN", "$comments $info_url");
 $mp3->{ID3v2}->add_frame("TXXX", "$comments $info_url");
 $mp3->{ID3v2}->add_frame("WORS", "$comments $info_url");
 $mp3->{ID3v2}->add_frame("WXXX", "$comments $info_url");
 # nothing is saved until the tag is written back to the file
 $mp3->{ID3v2}->write_tag;
 $mp3->close;

} # end sub id3v2tag

The neat thing about this is it will retrieve the show info from the archive web page and add it to the ID3 info, so the MP3s are named and dated properly. It's called with the URL of each archive page like this:
./nextbigthing.pl http://www.wnyc.org/shows/tnbt/episodes/2002/05
I created a wrapper script that sends a whole bunch of the URLs to the script like this:

/home/user/bin/nextbigthing.pl http://www.wnyc.org/shows/tnbt/episodes/2002/06;
/home/user/bin/nextbigthing.pl http://www.wnyc.org/shows/tnbt/episodes/2002/07;
/home/user/bin/nextbigthing.pl http://www.wnyc.org/shows/tnbt/episodes/2002/08;
/home/user/bin/nextbigthing.pl http://www.wnyc.org/shows/tnbt/episodes/2002/09;
/home/user/bin/nextbigthing.pl http://www.wnyc.org/shows/tnbt/episodes/2002/10;
/home/user/bin/nextbigthing.pl http://www.wnyc.org/shows/tnbt/episodes/2002/11;
/home/user/bin/nextbigthing.pl http://www.wnyc.org/shows/tnbt/episodes/2002/12;
/home/user/bin/nextbigthing.pl http://www.wnyc.org/shows/tnbt/episodes/2003/01;
/home/user/bin/nextbigthing.pl http://www.wnyc.org/shows/tnbt/episodes/2003/02;
Then just let it run until all the streams are recorded!
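
If typing out a line per month gets tedious, the list can be generated instead. The following is a rough sketch, not the wrapper I actually used: the helper name archive_urls is made up, and it assumes every month in the range has an archive page at the same URL pattern as the examples above.

```perl
use strict;
use warnings;

# Hypothetical helper: list the monthly archive URLs from one year/month
# through another, inclusive.
sub archive_urls {
    my ($from_year, $from_month, $to_year, $to_month) = @_;
    my $base = 'http://www.wnyc.org/shows/tnbt/episodes';
    my @urls;
    my ($y, $m) = ($from_year, $from_month);
    while ($y < $to_year || ($y == $to_year && $m <= $to_month)) {
        push @urls, sprintf('%s/%04d/%02d', $base, $y, $m);
        if (++$m > 12) { $m = 1; $y++; }
    }
    return @urls;
}

# Print one command per month; pipe the output to sh to actually run them.
my $script = '/home/user/bin/nextbigthing.pl';
print "$script '$_'\n" for archive_urls(2002, 6, 2003, 2);
```

Printing the commands rather than running them directly lets you eyeball the list first, since each month can take hours to dump in real time.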
