Class Cass::Parser

  1. cass/lib/cass/parser.rb
Parent: Object

Parses a string (e.g., text read from a file) into sentences, using either the Stanford Natural Language Parser (if installed) or a barebones parser that splits text at line breaks and periods. Generally speaking, you shouldn’t rely on the Parser class to parse and sanitize your input texts for you. This class implements only barebones functionality, and there’s no guarantee the resulting text will look the way you want it to. You are strongly encouraged to process all texts yourself beforehand and to use this functionality only as a last resort.
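As an illustration of the pre-processing recommended above, here is a minimal sketch that rewrites raw text so each sentence sits on its own line. The helper name `one_sentence_per_line` is hypothetical and not part of Cass, and the break rule (sentence-ending punctuation followed by whitespace and a capital letter) is an assumed heuristic that will still mis-split abbreviations such as "Dr. Smith".

```ruby
# Hypothetical helper, not part of Cass: a naive way to put one sentence
# on each line before handing the text to any parser.
def one_sentence_per_line(text)
  text
    .gsub(/\s+/, ' ')                      # collapse all whitespace, including newlines
    .strip                                 # drop leading and trailing spaces
    .gsub(/(?<=[.!?])\s+(?=[A-Z])/, "\n")  # break after . ! ? when a capital letter follows
end

raw = "First sentence.  Second one!\nThird?"
puts one_sentence_per_line(raw)
```

Text prepared this way already has the line structure that the basic parser splits on, so its output is easier to predict.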

Methods

public class

  1. parse

Public class methods

parse (text, opts={})

Parses a string into sentences. If the Stanford Parser and associated Ruby gem are installed (stanfordparser.rubyforge.org/), they will be called to do the job. If not, or if opts['parser_basic'] is set to true, only basic parsing will be performed: text will be split into sentences at newlines and periods, or at matches of the pattern given in opts['parser_regex'] when that key is present. Note that basic parsing is suboptimal and may generate problems for some documents.

# File cass/lib/cass/parser.rb, line 20
    def self.parse(text, opts={})
      # Try to load Stanford Parser wrapper
      begin
        require 'stanfordparser'
      rescue LoadError
        puts "Error: stanfordparser gem couldn't load. Using barebones parsing mode instead. If you'd like to use" +
            " the Stanford Parser, make sure all required components are installed (see http://stanfordparser.rubyforge.org/). You'll need to make sure the java library is installed, as well as the treebank and jrb gems."
        spfail = true
      end
    
      if spfail or opts['parser_basic'] == true
        puts "Using a basic parser to split text into sentences. Note that this is intended as a last resort only; you are strongly encouraged to process all input texts yourself and make sure that lines are broken up the way you want them to be (with each line on a new line of text in the file). If you use this parser, we make no guarantees about the quality of the output."
        # Split on the caller-supplied pattern, or by default on runs of CR, LF, and periods
        rx = opts.key?('parser_regex') ? opts['parser_regex'] : "[\r\n\.]+"
        text.split(/#{rx}/)
      else
        puts "Using the Stanford Parser to parse the text. Note that this could take a long time for large files!" if (defined?(VERBOSE) and VERBOSE)
        parser = StanfordParser::DocumentPreprocessor.new
        parser.getSentencesFromString(text)
      end
    end
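
The basic-parsing branch in the source above can be tried in isolation without installing Cass or the Stanford Parser. The sketch below copies its two lines into a stand-alone method; the name `basic_split` is hypothetical, and the custom pattern in the second call is only an example value for the 'parser_regex' option.

```ruby
# Hypothetical stand-alone copy of the basic-parsing branch (not the Cass API itself).
def basic_split(text, opts = {})
  # Default pattern: one or more carriage returns, newlines, or periods
  rx = opts.key?('parser_regex') ? opts['parser_regex'] : "[\r\n\.]+"
  text.split(/#{rx}/)
end

# Default pattern: abbreviations are split and leading spaces are kept.
p basic_split("Dr. Smith arrived. He sat down.\nThe end.")
# => ["Dr", " Smith arrived", " He sat down", "The end"]

# Example custom pattern that also splits on ! and ? and consumes trailing whitespace.
p basic_split("Wait! Is it done? Yes.", { 'parser_regex' => "[.!?]+\\s*" })
# => ["Wait", "Is it done", "Yes"]
```

The first result shows why the documentation calls this mode a last resort: "Dr." is treated as a sentence boundary, and fragments keep their leading whitespace unless the pattern consumes it.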