How To Write A Spelling Corrector In Ruby

November 5, 2009

If you haven't seen it before, Peter Norvig has a spelling corrector that is written in just 21 lines of Python code (not counting blank lines and the import). He also lists a few other implementations in other languages, include one in Ruby. The Ruby one was listed as 34 lines. I was surprised that it was that many lines more in Ruby, so I wanted to give it a try. I didn't look at the Ruby solution first and here's what I came up with:

require 'set'

def words(text)
  text.downcase.scan /[a-z]+/ 
end

def train(features)
  features.inject(Hash.new(1)){|model, f| model[f] += 1; model }
end

NWORDS = train(words(File.open('big.txt').read))

ALPHABET = 'abcdefghijklmnopqrstuvwxyz'.split //

def edits1(word)
  s = (0..word.size).map{|i| [word[0,i], word[i,word.size]]}
  deletes    = s.map{|a,b| !b.empty? ? a + b[1..-1] : nil}.compact
  transposes = s.map{|a,b| b.size > 1 ? a + b[1].chr + b[0].chr + b[2..-1] : nil}.compact
  replaces   = s.map{|a,b| !b.empty? ? ALPHABET.map{|c| a + c + b[1..-1]} : nil}.flatten.compact
  inserts    = s.map{|a,b| ALPHABET.map{|c| a + c + b}}.flatten
  Set.new(deletes + transposes + replaces + inserts)
end

def known_edits2(word)
  s = edits1(word).map do |e1| 
    edits1(e1).map{|e2| NWORDS.include?(e2) ? e2 : nil}.compact
  end.flatten
  s.empty? ? nil : Set.new(s)
end

def known(words)
  s = Set.new(words.find{|w| NWORDS.include?(w)})
  s.empty? ? nil : s
end

def correct(word)
  candidates = known([word]) || known(edits1(word)) || known_edits2(word) || [word]
  candidates.max{|a,b| NWORDS[a] <=> NWORDS[b] }
end

To run this, download the data file, put the code in a file called spelling.rb, then start up IRB:

$ wget http://norvig.com/big.txt
$ irb -r spelling -f --simple-prompt
>> correct "speling"
=> "spelling"
>> correct "korrecter"
=> "corrected"

This one weighs in at 30 lines. I tried to do this as close to the Python implementation as possible. I also tried to use idiomatic Ruby. You could shave the number of lines down below 21, but it wouldn't meet any reasonable Ruby style guidelines. I'm still probably cheating a little as a few of those lines are approaching 100 characters, but it's at least reasonable, in my opinion. Here are some things I noticed that caused the Ruby version to be longer or less clear:

  1. end vs. significant indentation

    6 of the lines are just an end statement. Python uses indentation to end methods, so the end statements aren't needed in Python. This adds more lines, but has trade offs. I actually really like the idea of significant indentation, it's one of the reasons that I'm such a big fan of Haml. But it falls down in certain places. For example, Ruby has Embedded Ruby, which looks similar to JSP or PHP, but it's actually trivial to implement the basic cases. It truly is embedded Ruby, because the code between the <% %> and <%= %> tags is just ruby code. You commonly see things like this:

    <% if logged_in? %>
      Welcome, <%= current_user.login %>
    <% else %>
      Howdy Stranger!
    <% end >
    

    You can't do this in Python because of the lack of the end statement. This is why I'm surprised Haml was invented in Ruby and not Python. In Python, it fits with the language and is actually necessary, whereas in Ruby, significant indentation isn't part of the language and ERB is actually pretty good. There is a Python port of HAML, but I'm not sure how well that works or how widely it is used in the Python community.

  2. List Comprehensions vs. Blocks

    Python and Ruby, compared to C, Java, etc., have very powerful, concise syntax for iterating through and transforming collections, but they are very different. Python uses list comprehensions and Ruby uses blocks. As you can see above, there are certain cases where list comprehensions are very compact. List comprehensions can have a guard, which is a little cleaner than the Ruby equivalent where you have to return nil if the guard doesn't match and then compact that result. Also, when you want to iterate over two lists, you can do so with mutliple for statements, whereas in Ruby you do nested blocks and then flatten the result.

  3. Collection Slicing

    Python has a little cleaner, more consistent syntax for breaking up collections into sub collections. It also treats strings as a collection of characters more consistently.

  4. Truthiness

    In Python, many things are considered false. I'm not sure what the entire list is, but it seems to include empty collections (and therefore empty strings) as empty. Since Ruby only treats nil and false as false, we have to return nil instead of an empty set from the known and known_edits2 methods, so that the series of statements in the first line of the correct method will work correctly.

In summary, in this case, the Python code is shorter and clearer, but it's pretty close. I'm sure there are other code examples where the Ruby code would be a little shorter, but I do think in general, Ruby and Python are going to be pretty close in terms of code clarity and number of lines of code.

Posted in Technology | Tags Python, Ruby | 0 Comments

Infinite Recursion

September 2, 2009

A few days ago I posted an article on Tail Call Optimization. One really quick way to determine if any language/VM/interpreter performs Tail Call Optimization (TCO) is to write a function that calls itself, in other words, creating an infinitely recursive function. If the function just runs forever and doesn't return, the interpreter is doing TCO, otherwise you will get some sort of stack overflow error. So I decided to test a variety of languages to see what happens when you write an infinitely recursive function. First up is ruby:

def forever
  forever
end

forever

It was no surprise to me that running this results in this error:

run.rb:2:in `forever': stack level too deep (SystemStackError)
    from run.rb:2:in `forever'
    from run.rb:5

Ruby doesn't have TCO, I knew that. Next up, Python:

def forever(): forever()

forever()

This has a similar result, but the stack track is a little more revealing:

Traceback (most recent call last):
  File "run.py", line 3, in <module>
    forever()
  File "run.py", line 1, in forever
    def forever(): forever()
  File "run.py", line 1, in forever
    def forever(): forever()
...
  File "run.py", line 1, in forever
    def forever(): forever()
  File "run.py", line 1, in forever
    def forever(): forever()
RuntimeError: maximum recursion depth exceeded

This shows pretty clearly what is going on, the stack frames are pilling up and eventually it gets to the point where the Python interpreter says enough is enough. Interestingly enough, this mailing list thread shows that it's completely feasible to add TCO to Python and Guido just doesn't want to add it.

JavaScript is no surprise either, but we can write our infinitely recursive function with an interesting little twist:

(function(){ arguments.callee() })()

Yes that's right, it's an anonymous recursive function! Running this with SpiderMonkey results in InternalError: too much recursion. So how about Java:

class Run {
  public static void main(String[] args) {
    Run run = new Run();
    run.forever();
  }

  public void forever() {
    forever();
  }
} 

The stack trace for this one looks a bit like the Python one, we see a stack frame for each iteration:

Exception in thread "main" java.lang.StackOverflowError
    at Run.forever(Run.java:8)
    at Run.forever(Run.java:8)
    at Run.forever(Run.java:8)
    at Run.forever(Run.java:8)
...

So that means no TCO on the JVM. So Scala doesn't have TCO, right?

def forever : Unit = forever

forever

Wrong. This one runs forever. Scala does TCO with some bytecode tricks which Nick Wiedenbrueck explains really well in this blog post.

Clojure is a functional Lisp saddled with the problem of no-TCO on the JVM, but it gets around it in a slightly different way than Scala. If you write a tail recursive function like this:

(defn forever [] (forever))

You will get a java.lang.StackOverflowError just as you do in Java. Instead, Clojure provides a language level feature to do recursion:

(defn forever [] (recur))

Calling recur in the function makes it call the function again. recur also checks that it is in the tail recursive position, which ends up being an interesting feature because you have to explicitly say "I expect this to do TCO". In other languages that do it transparently, you might think TCO is happening, but it might not be and you won't find out until you get a stack overflow error. Also, this construct allows us to have recursive anonymous functions:

((fn [] (recur)))

Erlang, being a function language, has TCO as well:

-module(run).

-compile(export_all).

forever() ->
  forever().

Now one thing all of these functional languages that have TCO; Scala, Clojure and Erlang, all have in common is that while the infinitely recursive function runs, it essentially does nothing, but it will peg your CPU utilization to nearly 100%. This next language, Haskell, being the mother of all functional langauges, of course has TCO. Here's the function:

forever = forever

Yes, that's a function. And if you call it with forever, it just sits there and runs forever. But here's the crazy thing, it uses no CPU. My guess is that it has to do with lazy evaluation, but I'm not sure. Any ideas?

Posted in Technology | Tags FunctionalProgramming, Python, Ruby, Java, Javascript, Haskell, Clojure, TailCallOptimization, Scala, TCO, Erlang | 19 Comments

Keyword Arguments and the Case for Literal Syntax for Hashes

March 8, 2009

I've studied various programming languages and one feature I've grown to love is keyword arguments. Here's Paul Graham's take on keyword arguments:

Rtml even depended heavily on keyword parameters, which up to that time I had always considered one of the more dubious features of Common Lisp. Because of the way Web-based software gets released, you have to design the software so that it's easy to change. And Rtml itself had to be easy to change, just like any other part of the software. Most of the operators in Rtml were designed to take keyword parameters, and what a help that turned out to be. If I wanted to add another dimension to the behavior of one of the operators, I could just add a new keyword parameter, and everyone's existing templates would continue to work. A few of the Rtml operators didn't take keyword parameters, because I didn't think I'd ever need to change them, and almost every one I ended up kicking myself about later. If I could go back and start over from scratch, one of the things I'd change would be that I'd make every Rtml operator take keyword parameters.

In order to have keyword arguments in your language, what you only really need is Hash as a first class data type with a literal syntax. This is a feature that all modern programming languages should have. One reason why is it makes the feature of keyword arguments trivial to implement and use. If your language has literal syntax for Hashes, you don't really need much as in the terms of syntax of the language to have support for keyword arguments. Python has a little extra built-in support for keyword arguments. In Ruby, hashes are used as keyword arguments to methods pretty easily. To clarify, when I say Hash, I'm talking about the data structure that is referred to as a Dictionary in Python, NSDictionary in Objective-C, a Hash in Ruby, or a Map in Java. Objective-C does not have a literal syntax for Hashes, neither does Java, and I'll show why not having a literal syntax for Hashes leads to not having keyword arguments.

Objective-C does not have keyword arguments, but what it does have is named arguments. I'm not sure if keyword arguments and named arguments are the official correct terms, but I like those terms and I'm going to define specifically what I mean by each. Named arguments in Objective-C mean that each and every argument to a method call must have a name. In a language like Java which doesn't have named arguments, you could see something like this:

Date iWeeksFromNow = now.add(0, 0, (i*7), 0, 0, 0);

It's hard to tell what this method does, because you have to know what the position of each argument means. In Objective-C, it might look like this:

NSCalendarDate *iWeeksFromNow = [now dateByAddingYears:0 
                                                months:0
                                                  days:(i*7)
                                                 hours:0
                                               minutes:0
                                               seconds:0];

This call is self-documenting, because you can now easily determine what each parameter means. Although this is an improvement, it still has some flaws. First, I still must know the correct order of the arguments. For example, this won't work:

NSCalendarDate *iWeeksFromNow = [now dateByAddingSeconds:0 
                                                 minutes:0
                                                   hours:0
                                                    days:(i*7)
                                                  months:0
                                                   years:0];

More egregiously, you can't omit the arguments for which you would like to supply no value or have the default value used. This is why we see all these parameters with a value of 0 being passed in. You could fix this API in Objective-C by defining separate method to add each unit, so something like this would work:

NSCalendarDate *iWeeksFromNow = [now dateByAddingDays:(i*7)];

But what if you want to specify values for two of the arguments? You end up with a huge explosion of methods in the API. This also leads to another big problem, which is what if you want add another value that could be passed in? For example, let's say we wanted to by able to pass in weeks, which is exactly what we are trying to do in this example. We could modify the API and call the code like this:

NSCalendarDate *iWeeksFromNow = [now dateByAddingYears:0 
                                                months:0
                                                  days:0
                                                 weeks:i
                                                 hours:0
                                               minutes:0
                                               seconds:0];

But that means we have to go change all the code that doesn't have weeks in the method call to include it. Depending on how much code you have calling the API, this could be a nightmare. In this case, where we are talking about a core method of Cocoa, so I would suspect that Apple is unlikely to ever change this method for that reason. This is why APIs developed in a language without keyword arguments sometimes stagnate over time, for fear of breaking backward compatibility. If this were a keyword argument method, support for accepting weeks as an argument could easily be added without breaking any existing code.

Keyword arguments, as typically implemented in Ruby or Python, are optional and unordered. A method like this one in a Ruby API might look like this:

iWeeksFromNowNextYear = now.add(:weeks => i, :years => 1)

This is a very clean way to design an API. It can easily be extended in the future when more arguments need to be added. I don't need to pass in values for parameters I don't care about. It is clean and easy to use as the caller of the API. Techincally, this exact kind of thing is possible in Java and Objective-C. The NSCalendateDate could have a method called add that takes an NSDictionary, which might make calling code look like this:

NSCalendarDate *iWeeksFromNowNextYear = [now dateByAdding: 
  [NSDictionary dictionaryWithObjectsAndKeys:
    [[NSNumber alloc] initWithInt:i], @"weeks",
    [[NSNumber alloc] initWithInt:1], @"years",
    nil]];

But as you can see, there is too much syntactical noise for this kind of thing to every become idiomatic, which goes back to my original point, which is that having a literal syntax for Hashes is a huge win for the syntax of a language.

Posted in Technology | Tags ObjectiveC, Python, Ruby | 2 Comments

Calling Methods During Class Definition

April 17, 2008

One of the features I like most about Ruby is the ability to execute code during the definition of a class. Here's a simple example:

class Foo
  puts "hi"
end

A more useful example is that you can call a method and self is set to the class you are defining.

class Foo
  def self.whoami
    puts "You are #{self}"
  end
end
class Bar < Foo
  whoami
end

This prints You are Bar. This is a feature that isn't common to most languages, for example, you can't do it in PHP, Python(I don't think, Pythonistas jump in there if I'm wrong), Java or even Groovy. The reason why this kind of method is so helpful is this how you can write code that writes code. This is the closest thing Ruby has to Lisp macros. You see this used in Ruby in with the attr_accessor method and in Rails with many methods, belongs_to and has_many being the most obvious examples.

This allows you to define the metadata about a class in the most DSL-like syntax. It would be really great to see this kind of thing in Groovy.

Posted in Technology | Tags Python, Ruby, Java, PHP | 0 Comments

The Edge of Innovation

April 11, 2008

If you are a programmer that deals with web applications and you keep up with the latest trends, then there is no doubt you will at least have heard of Ruby on Rails. You might be at the level where you have read about Ruby on Rails and played with it a bit, but you really haven't immersed yourself the Ruby/Rails way. Maybe you know how to build web applications with Java, .Net or PHP, and you think "Rails is nice, but there's nothing Rails can do that you can't do in X".

If you are in this camp, it may be because you aren't willing to adapt your ways to Rails. To really understand the benefits of Rails, you have to not only learn Rails, but learn the best practices followed by good Rails programmers, like Skinny Controller, Fat Model, REST and Behavior Driven Development. You don't see the true benefit of Ruby until you start to fully embrace concepts like these. The only way to learn concepts like these is to read blogs, listen to podcasts, talk to other Ruby programmers at work, user groups and conferences and most importantly, you have to actually write code that reflects what you have learned. You must put aside your preconceived notions about what is right and what is wrong and surrender to the flow. You must unlearn what you have learned.

So if you haven't experienced this transformation first hand yet, and someone asks you "What is the biggest advantage to Ruby on Rails", I would be willing to bet your answer would be productivity. This is the way I think many people involved with technology, who don't fully grasp the Rails Way, perceive Rails. They believe, with some cynicism, that because of the dynamic nature of Rails, you can develop applications faster. They'll also probably say the downside is that Rails can't scale.

But anyway, I'm hear to say that productivity is the most over-rated benefit of Rails. The real advantage to Rails is that it is written in Ruby, which is a very powerful language that will change the way you think about programming. It's funny, I've thought to myself a number of times about how interesting it is that Ruby fundamentally changes the way you think about programming and that "Thinking in Ruby" would probably be a great book. But ironically, Bruce Eckel, the author of the "Thinking in..." line of programming books, seems to be happy with Python and not willing to give Ruby a chance. Who knows, that article is a few years old now, so maybe he's changed his attitude towards Ruby since then. I know mine has.

It's hard to quantify the advantage that "Thinking in Ruby" brings. The simplest way I can state it is that you will look at problems differently and come up with better solutions, solutions you may not have thought of if you were programming in other languages. The way I support this claim is by looking at some of the web application development innovations that have come out of the Ruby community.

The first is Rails itself. Rails has been copied, ported, attempted to be ported or talked about being ported to almost every other mainstream language you could think of, including Groovy, PHP, JavaScript, Perl, Java and .Net. This phenomenon is unique to Rails, I can't think of any other web application framework that can say that. If not the whole framework itself, parts of it such as convention over configuration, migrations and embracing REST have influenced the way web application development is done in almost every language.

Another example is HAML. HAML is a truly new and different take of the problem of generating HTML from a combination of dynamic code and HTML. It is a new idea and it has been ported to PHP, Python, and .Net. Whenever you have a framework or library that is being ported to other languages, it shows that the framework being ported contains new and good ideas about programming. In other words, it is a contribution to the paradigm of web development and a clear sign that the original language that the framework was implemented is at the edge of innovation.

Another example is Behavior Driven Development. This example is even more interesting because the idea originally started in Java with the JBehave framework. Even though the idea for behavior driven development started with Java, the idea didn't really take off until it was implemented in RSpec. They are fairly similar in terms of syntax. Here's an example from the JBehave website:

public class CheeseBehaviour extends UsingJMock {
    public void shouldRequireTheUseOfMocks() {
        // given
        Mock cheese = new Mock(Cheese.class);
        Sheep sheep = new Sheep((Cheese)cheese.proxy());

        //expect
        cheese.expects(once()).method("squelch").withAnyArguments();

        // when
        sheep.eatCheese();

        // then
        verify();
    }
}

and here it is converted to RSpec:

describe Sheep do
  before do
    @cheese = mock(Cheese)
    @sheep = mock(Sheep, :cheese => @cheese)
  end
  it "should squelch when it eats cheese" do
    @cheese.should_receive(:squelch)
    @sheep.eat_cheese
  end
end

For whatever reason, JBehave really never took hold in the Java community, but RSpec has in the Ruby community. RSpec has been ported to .Net, PHP and Groovy. All of those projects describe their code as a port of RSpec, not JBehave. Again it is Ruby influencing the wider web application development community.

Post World War II, the center of the art world was New York City and it was there that the modern art movemement was born. New York was where innovation in the art world was happening. In that time period if you wanted to be a serious artist, you had to go to New York to experience the movement first hand. Today, I believe the Ruby community is leading the way in innovative techniques for web application development. There is certainly innovation happening in other languages like Python, Smalltalk and Erlang as well, but I don't think any one other language/community is doing as much as Ruby. As far as I can tell, languages like Java, .Net and PHP are doing nothing to innovate web application development. They are simply lagging behind, playing catch up and trying to figure out how implement new features pioneered in the Ruby community and others as closely as possible, given the limitations of the language. So if you are a web developer, I suggest you ask yourself this question. Are the languages and frameworks you are working with leading others to come up with new ideas? Are the languages and frameworks that you are working helping you come up with new ideas? If not, embrace Ruby and someday you will discover an elegant solution to a problem, one that you may not have without Ruby.

A language that doesn't affect the way you think about programming is not worth knowing. -- Alan Perlis

Posted in Technology | Tags Javascript, Java, Ruby, .Net, PHP, RSpec, Python, Rails, HAML | 0 Comments

  Older Articles >>