AltLaw Structure

A search engine is generally divided into three parts:

  • A crawler (also called a "spider" or "robot") to find and download new data;
  • An indexer to turn that data into a searchable index; and
  • A front-end interface to allow users to search the index.

Crawlers

(These are part of the "crawlers" component in Trac.)

These scripts visit the web sites of the courts and download all the cases available on those sites. On some court web sites it is possible to "scrape" metadata such as titles and dates; the scripts store this information in simple XML files, one per downloaded case. See CourtProvidedMetadata for details.
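
As a rough illustration of the one-XML-file-per-case layout, the Ruby sketch below writes scraped metadata next to a downloaded case. The element names (case, title, date), the file naming scheme, and the helpers metadata_xml and write_metadata are all assumptions made for this example; the actual format is described in CourtProvidedMetadata.

```ruby
require "tmpdir"

# The five characters that must be escaped in XML text.
XML_ESCAPES = {
  "&" => "&amp;", "<" => "&lt;", ">" => "&gt;",
  '"' => "&quot;", "'" => "&apos;"
}.freeze

def xml_escape(text)
  text.to_s.gsub(/[&<>"']/, XML_ESCAPES)
end

# Build a small XML document from a hash of scraped fields.
# Element names are hypothetical, not the real AltLaw schema.
def metadata_xml(fields)
  lines = ['<?xml version="1.0" encoding="UTF-8"?>', "<case>"]
  fields.each do |name, value|
    lines << "  <#{name}>#{xml_escape(value)}</#{name}>"
  end
  lines << "</case>"
  lines.join("\n") + "\n"
end

# Write one metadata file per downloaded case, named after the case ID.
def write_metadata(dir, case_id, fields)
  path = File.join(dir, "#{case_id}.xml")
  File.write(path, metadata_xml(fields))
  path
end

# Example run in a throwaway directory.
Dir.mktmpdir do |dir|
  path = write_metadata(dir, "07-1234", { "title" => "Smith & Jones v. State", "date" => "2008-03-03" })
  puts File.read(path)
end
```

Keeping the metadata in a separate file per case means a crawler only has to record what the court site provides; any cleanup can wait for a later stage.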

We welcome submissions of crawlers for additional courts.

Indexer

(This is the "indexer" component in Trac.)

The AltLaw index is powered by Solr, an open-source search server developed under the Apache Software Foundation. Solr is built on the open-source Lucene search library.

A collection of Ruby classes called Preindexers prepares documents for the index by:

  1. extracting metadata and converting it to a canonical form
  2. extracting plain text from the original case file (usually PDF) for indexing
  3. generating HTML to display the case

Everything -- metadata, text, and HTML -- is stored in the Solr index; there is no separate database.

Front End

(This is the "websearch" component in Trac.)

The web site is powered by a Ruby on Rails application, which communicates with the Solr server to search and retrieve documents from the index.
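
To give a sense of what a search request to Solr looks like, the sketch below builds a URL for Solr's standard /select handler using the q, rows, and wt parameters. The helper name solr_select_url, the base URL (a default local install), the field name in the sample query, and the JSON response format are assumptions; the actual Rails application may talk to Solr through a client library instead, and newer Solr versions also include a core or collection name in the path.

```ruby
require "uri"

# Build a query URL for Solr's /select handler.
# base is an assumed default; point it at the real Solr server.
def solr_select_url(query, rows: 10, base: "http://localhost:8983/solr")
  params = { "q" => query, "rows" => rows, "wt" => "json" }
  "#{base}/select?#{URI.encode_www_form(params)}"
end

# "title" is a hypothetical field name used only for this example.
puts solr_select_url("title:miranda")
```

Because the index holds the metadata, text, and HTML for each case, a single request of this kind can return everything needed to render a result.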

More Information