Bioinformatics, My Technical Learnings

Monday, June 18, 2012

Automata: A coursera.org course on the theory of computation

Just recently I finished my 4th Coursera online class, titled Automata. Taught by Jeff Ullman, this course covered the basic theory of computation and automata theory and ran for 6 weeks. Ullman used his lecture notes from the equivalent course taught at Stanford, and 6 weeks felt like a very short period of time to fully internalize the concepts behind automata. Still, I think this is the most informative and rigorous online course I have taken to date.

The course consisted of video lectures, homework quizzes, and a single final exam. Their were also some extra credit problems and programming assignments, which were of lower priority for me. The per week video material was about 2 hrs straight through, and would usually take me around 10-15 hours taking notes and pausing to reflect on the material. Unlike other online courses, the video material was presented in larger chunks, usually 20 to 30 minutes long, and I have to say I like this format better. Ullman was very active on the discussion forums, leaving no question unanswered, and should be recognized for his outstanding dedication to helping students learn the material. If he teaches another course I will consider taking it for this reason alone.

The content covered included regular expressions, deterministic finite automata, non-deterministic finite automata, context free grammars, push-down automata, Turing Machines, and complexity classes P and NP. A major theme was properties and equivalences among and between these types of languages. Perhaps the most fascinating material was that of Turing Machines and computability, specifically how we prove the undecidability of the halting problem. Also, the complexity classes P and NP were well explained, and a few reductions of common problems like knapsack were given. If I spend more time learning about this topic, I will surely study Arora and Barak's Computational Complexity: A Modern Approach, which would not be possible without first learning the topics presented in Ullman's course.

Many proofs along the way where given, and I have developed a much better sense at the theoretical limits of computation along with sharpening my math and logic skills. Understanding whether a given problem is in P, is essential in knowing if an optimal algorithm can be written in polynomial time without making many assumptions about the nature of the problem instance. Any algorithmic development I undertake will have to take this into consideration. Further, I have learned the tools to understand many topics in computer science at a deeper level than possible before the course, and am looking forward to learning about topics built upon automata thoery, like approximation algorithms and the limits of parallel computation.

Wednesday, April 25, 2012

Design and Analysis of Algorithms I: a coursera.org course with Tim Roughgarden

Looking for an algorithms for a little over a year, I was very thrilled to find out coursera.org was going to offer an algorithms course, albeit part I of II. My objective in taking the class was to gain the skills needed for the mathematical analysis of algorithms, exposure to complex problems and computational solutions, and increase my programming skill and capability. Though not a master in any of these areas, I feel the course has definitely improved my skills, and am eager to sign up for Design and Analysis of Algorithms II.

The course consisted of 5 programming assignments, 5 problems sets, a final, and some optional theory problems. The material was presented in 5 week long blocks of about 2 hours worth of video material. The programming assignments ranged for pretty difficult(calculate strongly connected components on a graph of over 800k nodes) to relatively easy, (find if two elements of an array sum to a given number). The course discussion boards and theory problems really helped solidify my understand of the material. Additional "Theory" problems were also included, and were of the same difficulty as test problems given to Stanford students taking the course. These additional problems required much thought, and solidified the concepts presented in lecture.

The content of the class consisted of basic mathematical analysis methods(big-Oh notation), the divide and conquer paradigm, master method, randomized algorithms, a handful of data structures, and graph algorithms. Analysis of each presented algorithm that included a proof of its correctness and runtime complexity. Each algorithm presented was explained in a very intuitive way, much to the credit of our instructor Tim Roughgarden.

For this course, I purchased Cormen's algorithm text book, which turned out to be a very good investment. Compared to other textbooks, like Klienberg and Tardos, Cormen was more in depth, with greatly detailed analysis of algorithms and the mathematical rigor I need to convince myself of an argument. If there is one take home skill I got from the course, its ability to sit down with Cormen, and internalize a concept. A couple of the theory problems were drawn from Cormen, and I am glad to have the some help learning how to solve problems that once seem intimidated.

In conclusion, The course did an excellent job explaining the concepts required for algorithmic analysis, and given me the skills to continue my learning by using sources such as Cormen's book. Often used in proofs in the class, and in Cormen for that matter, are generating functions. This is a topic I need to look further in to if I want to generate proofs for my own nascent algorithms. I am eagerly awaiting the next installment of the class, and look forward to taking Coursera's Automata with Jeffrey Ullman in the meantime.

Thursday, March 8, 2012

Learning Clojure and the art of functional programming...

This post describes an overview of my experience with the Clojure programming language, why I learned it, how I learned it, what it's like, and where am I going with it.

Over the past few months, I have been programming a bit in Clojure. A direct descendant of McCarthy's eternal Lisp, Clojure is a functional language that runs on the JVM, fully interlopes with Java, and has excellent concurrency support with its Software Transactional Memory. Clojure trades all the boilerplate code of brackets and semicolons common to a language like C or Java, for function calls, parens, and macros. The level of expressiveness in code is truly helpful when hashing out a complex idea or algorithm. For example, I implemented a Burrows-Wheeler transformation https://gist.github.com/2004382 in under 40 lines.

I became comfortable with Clojure by taking a problem solving approach to learn the language, with the help of the website 4Clojure.org. 4Clojure presents a programming challenge, for example implementing library functions like 'map', that requires the user to write a function to satisfies a set of unit tests. By working the problems in increasing order of difficulty, the concepts of functional programming, IMO the hardest part to learning lisp, came naturally. For a period I became active on freenode's #clojure IRC channel, where I meet some fellows who I now share Clojure advice and code with via email chain and google group, http://groups.google.com/group/clojure-hackers?hl=en As for Clojure literature, I find The Joy of Clojure most helpful for technical questions and best practices. For theory, Functional Programming Practice and Theory by MacLennan really helped me grasp the concepts of on a more abstract, mathematical rigorous level. If you want to understand the logic and correctness of FP, this book is it! Finally, Pearls of Functional Algorithm Design contains a wealth of algorithms collected from mostly academic journals and other sources implemented in Haskell and easily converted to Clojure.

Clojure, as a young language, (2007) lacks the development tools and packages of a more mature production language although using Java classes and libraries solves the problem of a small code base. The source code I have read is well documented(clojuredocs.org), and between IRC and stackoverflow.com there are enough participants to field most questions. As for an editor, I have found VimClojure very suitable for syntax highlighting, making it easy to write functions and sample data to a file that is then loaded into an adjacent REPL and evaluated. For the community, I imagine Clojure now to be a lot like Ruby before Rails came out, the people involved are highly motivated, and language shows great promise. Fortunately for me, Clojure already does everything I need, even without a killer app.

In conclusion, Clojure has opened my eyes to a completely new way of thinking about a program: as the art of manipulating program state though time, transforming the logical identity of the input into that of the desired output (more here, http://clojure.org/state ) using the most powerful tool and rudimentary tool, the function. The practice of learning a new language in a different paradigm has helped me challenge the conventions and assumptions of procedural coding, opening the door for me to learn R programming on a much deeper level. When it comes down to choosing the next language for a future project, Clojure's ability to bring functional programming to the JVM/Java is very enticing, offering a very easy way to express my ideas as programs in a way no other language can.

Tuesday, January 24, 2012

The Nuts and Bolts of Proofs by Antonella Cupillari

In my quest to better understand bioinformatics, I have found myself on the topic of algorithms. After thumbing through Cormen's classic text, and a couple other resources online, I decided that a better grasp of mathematical proofs would give me the tools needed to tackle algorithms. I picked up The Nuts and Bolts of Proofs by Antonella Cupillari, and worked my way through in about a month. The topics covered were: Direct proof, proof by contrapositive, equivalence theorems, use of counterexamples, mathematical induction, uniqueness/existence theorems, and equality of sets/numbers.
Cupillari does an excellent job explaining the how to construct proofs, and leaves the reader with plenty of exercises after each chapter, as well as an hefty section of exercises without solutions and a collection proofs. I felt he did a good job covering the concepts of proofs, using no math beyond high school algebra to place the emphasis on constructing arguments using mathematical logic. The allowed me to start thinking in proofs sooner, without being bogged down by mathematical theory. Having completed the chapter exercises, and with no other training on mathematical proofs, I feel very comfortable constructing proofs which make up a large portion of the exercises in many algorithm text books. If you are interested in learning proofs and aren't afraid to put a little work in, I would definitely recommend this book!

Stanford's Machine Learning Course, ml-class.org

After two months of octave and online videos, I finished Stanford's online course on machine learning. I am very pleased with the course, and wish it could continue indefinitely. The course consisted of online videos, content quizzes, and finally programming assignments in octave. Topics covered include linear regression, neural networks, support vector machines, anomaly detection, and were presented in a way consistent with practical applications of these algorithms. The programming exercises were taught using Octave, GNU's version of Matlab, and stressed vectorization for efficient computations. This enabling the students to implement and train a computational expensive neural network, and train it with backpropagation for optical character recognition.
I learned a lot about linear algebra in the course, and developed an appreciation for its broad uses in numerical problems. Learning more about linear algebra is a priority, and I plan to take a course on it during graduate school or sooner but for right now a handle on matrix multiplication and 3d coordinates will have to do. Beyond basic linear algebra, the course was not very math heavy, requiring no proofs or formalism. That being said, some of the vectorization required for the homework assignments was a little tricky, but nothing beyond a little pen and paper work to figure out.
If you are looking to learn more about machine learning in general, or would like to learn more about a specific algorithm covered in the course, I would recommend going to ml-class.org and watching some of the videos. I am glad I took this course, and look forward to using the knowledge I gained in my research.

Monday, December 12, 2011

Introduction To Databases: a db-class.org course review.

I just finished the final for Stanford's free online database course, and am satisfied with my experience. First, the course was open to everybody and ran from Oct. 14 to Dec 12. It consisted of video lectures, quizzes, two online tests(final and a midterm) and exercises, which were executed in an online sandbox. The topics covered could be broken down into two groups, use of existing databases, and database design. With my background, I was mostly interested in use of existing databases, so I am going to talk about that.

SQL was covered very well, greatly expanding my knowledge and teaching me some relational algebra theory to understand its mathematical basis. The online exercises in SQL were most helpful, and I felt pretty challenged by a few of the exercises. They course admin's at Stanford were even able to implement a DSL for relational algebra exercises, which I thought was pretty cool. Going forward, I feel pretty confident using SQL to extract information from a database, and the relational algebra way of thinking has served me well manipulating data.frames in R.

XML was also covered, and was a major reason for me taking the course. In my research I deal with many different types of file formats, and would ideally like to see them go to XML for the sake of consistency. We covered XPath, XQuery and I have already utilized those in my work, while XSLT was a little bit more confusing. My go to XPath application is now Kernow for sandboxing, and perl's XML::Simple and XML::Xpath for scripting. Unfortunately, some of the XML files I tried to parse were very large( > 500MB) and I ditched these tools to use the tried and true method of regular expression parsing in perl. I suppose some more looking would help me find an solution for large XML files, and am currently open to suggestions.

I was interested in taking the database course to increase my computational skills, particularly SQL and XML, and through the rest of the course feel like I have gained a broad, although relatively shallow, understanding of different aspect of database design, use, and theory. I look forward to implementing an SQLite database on my own data someday, and encourage anyone interested in databases to enroll in this course if it is offered again.

Tuesday, November 29, 2011

perl Programming: a script to run background Java programs

One task I often find myself doing is running a java program on each member of a dataset. If the java program is configured to run in the background, you will initiated enough java calls to crash your computer. I hacked together a solution using perl that waits for each call to java to finish before initiating the next.

 
#!/usr/bin/perl
use strict;
my (@fileA,@out);
#put java command into string
my $command =  "./java.sh";
#simulate multipul calls with foreach loop
foreach my $wait (1..20){#represents the iterations of inputs run via a java program
  #open a handle for the command, piping into $java
  open my $java, '-|',  $command or die "Could not run java ... - $!";
  #while handle is still be written to, print .
  print "running command ${wait}";

  while (<$java>) {
       print ".";
  #push the output of the command into file array
  @fileA = <$java>;
   }
   push(@out, @fileA);
  #close the pipe when the command is done
  close($java);
}

#print to some output file
my $file = "out.txt";
#open it, print to it the @out array, close it, print \n 
open my $f, '>', $file;
print $f @out;
close $f;
print "\n";

#java.sh
#############################################
#!/bin/bash
#java -classpath /home/user/pathto/jarFile.jar ducks.main &
############################################