How To Write A Spelling Corrector In Ruby

November 5, 2009

If you haven't seen it before, Peter Norvig has a spelling corrector that is written in just 21 lines of Python code (not counting blank lines and the import). He also lists a few other implementations in other languages, include one in Ruby. The Ruby one was listed as 34 lines. I was surprised that it was that many lines more in Ruby, so I wanted to give it a try. I didn't look at the Ruby solution first and here's what I came up with:

require 'set'

def words(text)
  text.downcase.scan /[a-z]+/ 
end

def train(features)
  features.inject(Hash.new(1)){|model, f| model[f] += 1; model }
end

NWORDS = train(words(File.open('big.txt').read))

ALPHABET = 'abcdefghijklmnopqrstuvwxyz'.split //

def edits1(word)
  s = (0..word.size).map{|i| [word[0,i], word[i,word.size]]}
  deletes    = s.map{|a,b| !b.empty? ? a + b[1..-1] : nil}.compact
  transposes = s.map{|a,b| b.size > 1 ? a + b[1].chr + b[0].chr + b[2..-1] : nil}.compact
  replaces   = s.map{|a,b| !b.empty? ? ALPHABET.map{|c| a + c + b[1..-1]} : nil}.flatten.compact
  inserts    = s.map{|a,b| ALPHABET.map{|c| a + c + b}}.flatten
  Set.new(deletes + transposes + replaces + inserts)
end

def known_edits2(word)
  s = edits1(word).map do |e1| 
    edits1(e1).map{|e2| NWORDS.include?(e2) ? e2 : nil}.compact
  end.flatten
  s.empty? ? nil : Set.new(s)
end

def known(words)
  s = Set.new(words.find{|w| NWORDS.include?(w)})
  s.empty? ? nil : s
end

def correct(word)
  candidates = known([word]) || known(edits1(word)) || known_edits2(word) || [word]
  candidates.max{|a,b| NWORDS[a] <=> NWORDS[b] }
end

To run this, download the data file, put the code in a file called spelling.rb, then start up IRB:

$ wget http://norvig.com/big.txt
$ irb -r spelling -f --simple-prompt
>> correct "speling"
=> "spelling"
>> correct "korrecter"
=> "corrected"

This one weighs in at 30 lines. I tried to do this as close to the Python implementation as possible. I also tried to use idiomatic Ruby. You could shave the number of lines down below 21, but it wouldn't meet any reasonable Ruby style guidelines. I'm still probably cheating a little as a few of those lines are approaching 100 characters, but it's at least reasonable, in my opinion. Here are some things I noticed that caused the Ruby version to be longer or less clear:

  1. end vs. significant indentation

    6 of the lines are just an end statement. Python uses indentation to end methods, so the end statements aren't needed in Python. This adds more lines, but has trade offs. I actually really like the idea of significant indentation, it's one of the reasons that I'm such a big fan of Haml. But it falls down in certain places. For example, Ruby has Embedded Ruby, which looks similar to JSP or PHP, but it's actually trivial to implement the basic cases. It truly is embedded Ruby, because the code between the <% %> and <%= %> tags is just ruby code. You commonly see things like this:

    <% if logged_in? %>
      Welcome, <%= current_user.login %>
    <% else %>
      Howdy Stranger!
    <% end >
    

    You can't do this in Python because of the lack of the end statement. This is why I'm surprised Haml was invented in Ruby and not Python. In Python, it fits with the language and is actually necessary, whereas in Ruby, significant indentation isn't part of the language and ERB is actually pretty good. There is a Python port of HAML, but I'm not sure how well that works or how widely it is used in the Python community.

  2. List Comprehensions vs. Blocks

    Python and Ruby, compared to C, Java, etc., have very powerful, concise syntax for iterating through and transforming collections, but they are very different. Python uses list comprehensions and Ruby uses blocks. As you can see above, there are certain cases where list comprehensions are very compact. List comprehensions can have a guard, which is a little cleaner than the Ruby equivalent where you have to return nil if the guard doesn't match and then compact that result. Also, when you want to iterate over two lists, you can do so with mutliple for statements, whereas in Ruby you do nested blocks and then flatten the result.

  3. Collection Slicing

    Python has a little cleaner, more consistent syntax for breaking up collections into sub collections. It also treats strings as a collection of characters more consistently.

  4. Truthiness

    In Python, many things are considered false. I'm not sure what the entire list is, but it seems to include empty collections (and therefore empty strings) as empty. Since Ruby only treats nil and false as false, we have to return nil instead of an empty set from the known and known_edits2 methods, so that the series of statements in the first line of the correct method will work correctly.

In summary, in this case, the Python code is shorter and clearer, but it's pretty close. I'm sure there are other code examples where the Ruby code would be a little shorter, but I do think in general, Ruby and Python are going to be pretty close in terms of code clarity and number of lines of code.

Posted in Technology | Tags Python, Ruby

Comments Disabled