
Full SINO Documentation

Sino - Yet another search engine for the Web

Andrew Mowbray

Australasian Legal Information Institute

Overview
Building a Sino Index
Invoking Sino
The Sino Search Language

Overview

Sino is a free text retrieval engine intended for use with httpd and other embedded applications. Why yet another text retrieval engine for the Web, you might ask? Well, to be honest, it all started as a bit of a joke. Geoff (that is, Geoff King - the AustLII manager) was all impressed by Glimpse's small concordance sizes. Peter (that is, Peter van Dijk - AustLII's principal consultant) was busy destroying CPU time with some beautiful (his words, not mine) hypertext mark-up scripts. More to annoy them than anything else, I thought I'd write something which went totally against the grain (mind you, Peter has not always been exactly renowned for his "green" disk space conservation policies). Enter Sino - short for Size is no object - a free text retrieval system built to favour retrieval time over index size. It worked like a charm - the first search we threw at it on the AustLII production machine (something like "a*") managed to produce a temporary spill file of almost a quarter of a gigabyte. This was bad, but at least it was fast ...

The main things I have tried to achieve in building Sino are as follows:

annoy Peter (and to a lesser, but still significant extent - Geoff). They still feel that I could be doing something more productive
write something that anyone could use for free to air services like AustLII
provide a much more respectable search language and interface than was available on any of the existing public domain products (particularly from an Australian lawyers' perspective)
produce something that is fast (no real magic needed here, just a conventional inverted file approach with a few smarts borrowed from my old free text system - Airs)
don't get too hung up about index sizes (the AustLII indexes are running at 30% of text size, which to my mind is more than acceptable)
try to keep indexing times within sensible limits (AustLII's Sparc 20 is taking about an hour to index 60,000+ files containing 250+M)
keep it portable so that it will at least run under Windows and on the Mac as well as under Unix
try not to produce 1/4G spill files again!

What started as a fairly light-hearted project has developed into a serious system. Sino is now quite stable and is running as the production search engine on AustLII.

Building a Sino Index

To create a Sino index, just run sinomake from the root of the directory you wish to index. You can specify include and exclude patterns via the files .sino_include and .sino_exclude respectively. Each should consist of a list of regular expressions matching the files which you want or don't want to be indexed. If you don't specify anything, the default is to include all .htm, .html and .txt files and to exclude anything starting with a dot. You will probably also want to create a .sino_common file listing all common (non-indexed) words. If you feel like tweaking this, you can display frequently occurring words with sinoshow -fsize, where size is the minimum number of word occurrences to display.

Sinomake will produce the files .sino_words, .sino_hits and .sino_docs in the current directory.
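
The include and exclude rules can be pictured with a small Python sketch. This is not sinomake's code: the two default regular expressions are my own guess at patterns that reproduce the documented defaults, and the assumptions that a pattern may match anywhere in the path and that an exclude always beats an include are mine too.

```python
import re

# Hypothetical patterns that mirror the documented defaults. The patterns
# actually built into sinomake are not shown in this document.
DEFAULT_INCLUDE = [r"\.(htm|html|txt)$"]
DEFAULT_EXCLUDE = [r"(^|/)\."]


def select_files(paths, include=DEFAULT_INCLUDE, exclude=DEFAULT_EXCLUDE):
    """Return the paths matched by some include pattern and by no exclude pattern."""
    chosen = []
    for path in paths:
        included = any(re.search(pattern, path) for pattern in include)
        excluded = any(re.search(pattern, path) for pattern in exclude)
        # Assumption: an exclude match always wins over an include match.
        if included and not excluded:
            chosen.append(path)
    return chosen


paths = [
    "index.html",
    "cases/1995/12.htm",
    "notes.txt",
    ".sino_words",
    "logo.gif",
    ".cache/old.html",
]
print(select_files(paths))
```

With these assumed patterns, the three ordinary .html, .htm and .txt files are kept, while the image, the .sino file and everything under the dot directory are skipped.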

Invoking Sino

The current usage for sino (the search engine) is:

sino [ -n ] [ directory-name ] [ directory-mask ]

If you just want to test the results of sinomake, then set the default directory to the one containing the .sino files and type sino. This will give you a (ridiculously simple) search interface.

If you are calling Sino from something else, you will probably want to call it in non-interactive mode:

sino -n directory-name [directory-mask]

This will look for the .sino files in directory-name, read a search as one line on standard input and spit out the search results as file-name SPACE title NEWLINE on standard output (where file-name is the file name to be displayed, rooted at directory-name, and title is the HTML title of the document) - simple but elegant. The directory-mask argument may be used to restrict search results to particular directories. When sino is displaying results, this is matched against the start of found file names. Only matching documents will be displayed.
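
As a rough sketch of calling Sino from another program, the Python below wraps the non-interactive mode described above. It assumes that a sino executable is on the PATH and that file names contain no spaces (so the first space on each line separates the file name from the title). The function names run_search and parse_results are made up for this example.

```python
import subprocess


def parse_results(output):
    """Split the output of `sino -n` into (file_name, title) pairs."""
    results = []
    for line in output.splitlines():
        if not line.strip():
            continue
        # Assumption: the file name ends at the first space and the rest
        # of the line is the HTML title.
        file_name, _, title = line.partition(" ")
        results.append((file_name, title))
    return results


def run_search(query, directory, mask=None):
    """Send one search line to `sino -n` and return the parsed results."""
    command = ["sino", "-n", directory]
    if mask is not None:
        command.append(mask)  # optional directory-mask argument
    completed = subprocess.run(
        command, input=query + "\n", capture_output=True, text=True
    )
    return parse_results(completed.stdout)
```

A call such as run_search("smith near brown", "/some/indexed/directory") would then return a list of pairs ready to be turned into links; the directory shown here is only a placeholder.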

The Sino Search Language

The Sino search language is rather cosmopolitan. If you have used one of the popular on-line legal database systems (or even if you haven't) you probably do not need to learn anything new. Most Lexis, Status, Info-One, (and for the non-lawyers, even C and agrep) style searches are recognised. This section is intended for people who need to understand exactly what Sino can parse. If you know "zot" about free text searching, see the next section. Otherwise, if you do not have a deep-seated interest in Sino, you might want to quickly browse the relevant part of the Search Language Emulations section.

Sino Search Basics

When you do a Sino search, you are fundamentally searching for documents which contain some words or phrases. If you can come up with a phrase which you think is distinctive enough, just type it in and hit the return key! If you need to find documents containing more than a single word or phrase, things get a little (but not a lot) more complicated.

If you want more than one phrase or word to appear in the retrieved documents, put an and between them. For example, to find documents containing the phrase "moral rights" as well as the word "copyright", you would type: "moral rights and copyright" (less the quotes of course).

If, on the other hand, you want to find one term and/or another one, put an or between them. For example, to find stuff which contains the words "treaty", "convention" or "international agreement" you would search for: "treaty or convention or international agreement". If you wanted to, you could even put these two searches together - as in: "treaty or convention or international agreement and moral rights and copyright".

If you want to find two words or phrases which appear close to each other (for example, the parties to a case), you can use the near connector. If you wanted to find cases where Smith sued (or was sued by) Brown, you might type: "smith near brown".

The rest of this document is a fairly detailed description of how Sino searches documents. If you're new to free text searching, you might want to go away and have a play at this point, and come back when you have some questions.

Words and Phrases

Now, let's get technical ... The basic unit of a Sino search is the word. A word is any continuous sequence of alphanumeric characters. Words are case insensitive. All words are searchable other than a relatively small list of common words which is specified for each database. The list of non-searchable words is typically quite small (less than 100 words) and is generally limited to words of little informational content (such as "the", "is", "but" and so forth). Words may be combined into phrases without the need for any special connectors (eg. "pervert the course of justice").
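
To make the definition of a word concrete, here is a minimal Python sketch of splitting text into words and recording where each one appears. It is not Sino's indexing code. The common word list is only illustrative, "alphanumeric" is taken to mean ASCII letters and digits, and common words are assumed to use up a word position even though they are not stored.

```python
import re
from collections import defaultdict

# Illustrative only: the real list is specified for each database.
COMMON_WORDS = {"the", "is", "but", "of"}


def index_words(text, common=COMMON_WORDS):
    """Map each searchable word (lower-cased) to its word positions."""
    index = defaultdict(list)
    # A word is any continuous run of alphanumeric characters.
    for position, match in enumerate(re.finditer(r"[A-Za-z0-9]+", text)):
        word = match.group().lower()  # words are case insensitive
        if word not in common:
            index[word].append(position)
    return dict(index)


print(index_words("Pervert the course of Justice"))
```

Here "the" and "of" are left out of the index, but "course" and "justice" still sit at positions 2 and 4.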

Sino automatically expands searches to match regular English plurals (that is, a search for "treaty" will also match "treaties" and a search for "contract" will match "contracts"). The search parser allows for Unix shell style pattern matching, including the ability to forward truncate (particularly handy for Norwegian!). The following wild cards are recognised:

*
matches any string (including null)

?
matches any single character

[ ... ]
matches any one of the enclosed characters. A pair of characters separated by a '-' matches a range of characters (eg [a-c] will match 'a', 'b' or 'c'). If the first character is a '^' or a '!', characters not enclosed are matched (eg [^a-c] will match anything except 'a', 'b' or 'c').

The pattern must match an entire word. To search for words containing substrings, use "*substring*". The left square bracket symbol is also used for boolean grouping. Where you wish to start a word with a [ ... ], you need to put the whole word in quotes (eg "[ab]*ing").

As far as is consistent, Sino also supports regular expressions. It will, for example, treat the sequence ".*" as "*", ignore '^' and '$' characters and will even deal with agrep's '#' character. The main limitation is that sequences such as "[0-9]*" will not work.

Care should be taken when applying pattern matching to ensure that patterns are not ridiculously wild. The Sino search engine has to combine all of the occurrence information for each matched word with a boolean OR. Patterns such as "*" or even "a*" will lead to rather slow search times!
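
The wild cards behave much like the ones in Python's standard fnmatch module, so the sketch below uses it to show how a pattern might be expanded against a list of indexed words. This is an approximation rather than Sino's matcher: it ignores the plural expansion and the regular expression forms, and the function names are invented for the example.

```python
from fnmatch import fnmatchcase


def word_matches(pattern, word):
    """True if the whole word matches the shell-style pattern."""
    # fnmatch only negates a character class with '!', so rewrite the
    # '^' form first. Words are compared in lower case.
    pattern = pattern.lower().replace("[^", "[!")
    return fnmatchcase(word.lower(), pattern)


def expand(pattern, vocabulary):
    """Return every word in the vocabulary that the pattern matches."""
    return sorted(word for word in vocabulary if word_matches(pattern, word))


vocabulary = ["treat", "treaty", "treaties", "retreat", "cat", "rat"]
print(expand("treat*", vocabulary))
print(expand("*treat*", vocabulary))
print(expand("[^a-c]at", vocabulary))
```

Every word returned by expand has to be ORed together at search time, which is why a pattern like "a*" against a large vocabulary gets expensive.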

Boolean and Proximity Operators

Words and phrases may be connected together with boolean and proximity operators to form more complex searches. The operators are borrowed from a number of existing free text retrieval systems. They may be used in any combination and regardless of their heritage.

Boolean AND

The boolean AND operator allows you to identify documents which contain two (or more) words or phrases. It may be written as: "and", '+', '&', "&&" or ';'. Some typical searches are:

copyright and material form
18 and crimes act 1900
defamation and journalist and newspaper

Where the keyword "and" is used to indicate a boolean AND it has low precedence (like on Lexis) - it is only evaluated after both of its arguments have been fully evaluated. Where it is written in any of the other forms, it has a (more traditionally) higher precedence than a boolean OR. The rationale for this is that OR is usually used for synonyms which ought to group tightly and so giving AND a lower precedence is usually more convenient for free text searching and is less likely to lead neophyte searchers into difficulties.
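
The effect of the low precedence keyword "and" can be sketched in a few lines of Python. This toy function handles only the keywords "and" and "or" (no symbols, brackets or other operators) and is not Sino's parser. It simply shows the grouping described above: the terms joined by "or" are collected first, and the "and" is applied between the resulting groups.

```python
def group_keywords(search):
    """Group a search that uses only the keywords 'and' and 'or'.

    Returns a list of OR groups that are to be ANDed together.
    """
    and_groups, or_terms, phrase = [], [], []
    # A trailing "and" acts as a sentinel that closes the last group.
    for word in search.lower().split() + ["and"]:
        if word in ("and", "or"):
            or_terms.append(" ".join(phrase))
            phrase = []
            if word == "and":
                and_groups.append(or_terms)
                or_terms = []
        else:
            phrase.append(word)
    return and_groups


print(group_keywords(
    "treaty or convention or international agreement and moral rights and copyright"
))
```

The result has three groups: the three synonyms, then "moral rights", then "copyright". A document must satisfy every group, and any one term within a group will do.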

Boolean OR

The boolean OR operator is used to find documents containing either or both of two terms and is typically used to find synonymous words and phrases. It is written in Sino as: "or", ',', '|' or "||". Examples include:

section or s
husband or wife or spouse
proprietary limited or p l or pty ltd

Boolean NOT

The NOT operator allows you to find documents which contain one thing but not another. It may be written as: "not", '-', or '^'. In practice, this operator is seldom used, but to illustrate:

trust not family
trade practice act not 51

Proximity Operators

Proximity operators are used to find documents where two or more terms appear near each other. Sino indexes documents in terms of where words appear. Consequently, all proximity operators are expressed in terms of word positions. The simplest form of this class of operators is "near" (as used on Info One). This operator requires that words or phrases appear within 50 words of each other. For example:

smith near brown
31 near bail act 1900

Although convenient, this operator is obviously a little on the restrictive side. For more flexible proximity searching, you have the choice of Lexis or Status style operators. These take the following forms:

/n/
words and phrases must appear within n words of each other (STATUS)

/m, n/
words must appear within m to n words of each other (STATUS)

w/n
words or phrases must occur within n words of each other (Lexis)

pre/n
first word must precede second by less than n words (Lexis)

For example:

smith w/10 brown
smith /10/ brown
smith /-10,10/ brown [ All find the word 'smith' within 10 words of 'brown']
smith pre/10 brown
smith /1,10/ brown [ Both find 'smith' followed by 'brown' up to 10 words later ]
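
Since the proximity operators work on word positions, their behaviour can be sketched with simple position arithmetic. The Python below is an illustration only. It handles single words rather than phrases, and it treats the limits as inclusive (a distance of exactly n words counts as a match), which is an assumption on my part.

```python
import re


def positions(text, word):
    """Word positions at which `word` occurs in `text` (case insensitive)."""
    words = re.findall(r"[A-Za-z0-9]+", text.lower())
    return [i for i, w in enumerate(words) if w == word.lower()]


def within(text, first, second, low, high):
    """Like /m, n/: `second` lies low..high words after `first`."""
    return any(
        low <= b - a <= high
        for a in positions(text, first)
        for b in positions(text, second)
    )


def w(text, first, second, n):
    """Like w/n and /n/: either order, at most n words apart."""
    return within(text, first, second, -n, n)


def pre(text, first, second, n):
    """Like pre/n and /1, n/: `first` comes before `second`."""
    return within(text, first, second, 1, n)


text = "Smith sued Brown for damages"
print(w(text, "smith", "brown", 10), pre(text, "smith", "brown", 10))
print(w(text, "brown", "smith", 10), pre(text, "brown", "smith", 10))
```

The last call shows the difference between the two families: "brown" is within 10 words of "smith" in either direction, but it does not come before it.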

Named Sections (Segments)

Named section (segment) searching takes one of the following forms:

section(searchterms)
phrase @ section

Standard named sections are title (the html title of a document) and text (everything).

Keyed Fields (Dates etc)

Date searches take the following forms:

[#]date = date
[#]date < date
[#]date > date
[#]date >< date

Any sensible (English style) date is OK.

Precedence

Normally searches are evaluated from left to right. This is subject to the following order of operator precedence (highest to lowest):

word
( terms) phrase
w/n pre/n w/seg /n/ /m, n/ @ name ( terms )
or & &&
and not ^ || | , ;

You can use parentheses to alter this. Round, square and curly brackets are all recognised. If you need to make any special symbols literal, these should be enclosed in quotes (double, single or back quotes).

Search Language Emulations

The following tables list available elements from the emulated search languages:

Info-One

Info One is a commercial Australian provider of CD-ROM based and on-line services covering (primarily) State case law. Their CD and on-line products both use the same search language. Sino supports the following Info One style operators:

and
boolean AND (words/phrases must appear in same document)

or
boolean OR (either or both words/phrases must appear)

not
boolean NOT (the first word/phrase must appear but the second word must not)

near
words and phrases must appear within 50 words of each other

@
word or phrase must appear in specified section

[ ]
square brackets may be used to group operators

"term"
double quotes may be used to escape the special meaning of and, not etc

#key
info-one style date searches are supported

In general, the implementation is fairly faithful to the original. The fact that Sino indexes words rather than characters means that the near operator has a slightly different meaning. Another slight difference is that or has higher precedence than and (a common error for many neophytes anyway). As some punctuation characters have special meaning to other search languages, it is important not to include such characters in searches.

Lexis

Lexis/Nexis is the world's largest on-line legal database. The search language has been adopted by several other commercial products, including the Innerview software as used by the Australian CD-ROM producer DiskROM. Sino supports the following Lexis constructs:

and
boolean AND (words/phrases must appear in same document)

or
boolean OR (either or both words/phrases must appear)

and not
boolean NOT (the first word must appear but the second word must not)

w/seg
words and phrases must appear within the same section (segment)

w/n
words or phrases must occur within n words of each other

pre/n
first word must precede second by less than n words

section(terms)
word or phrase must appear in specified section

()
round brackets may be used to group operators

"term"
double quotes may be used to escape the special meaning of and, not etc

key
lexis style date searches are supported

Sino counts common words (noise words) as occupying word positions for search purposes. This will give subtly different results from Lexis for searches such as "sale goods" (which will not match "sale of goods"). There is currently no support for the Lexis operators not w/n and not w/seg.

Status

Status was one of the first free text retrieval systems to be developed (in the early 1970s!). It was used by the short-lived Eurolex service and is still in use in Australia by the Commonwealth Attorney General's service Scale. Sino allows the following operators:

+
boolean AND (words/phrases must appear in same document)

,
boolean OR (either or both words/phrases must appear)

-
boolean NOT (the first word/phrase must appear but the second word must not)

/n/
words and phrases must appear within n words of each other

/m, n/
words must appear within m to n words of each other (m and n may be negative)

@
word or phrase must appear in specified section

()
round brackets may be used to group operators

#key
status style date searches are supported

Sino does not index paragraphs and so the // (within paragraph) operator is not available. The meaning of /n/ is more general (but more useful) than is the case for Status. Otherwise, the implementation is fairly close to the original.

C and agrep

For users who come from a computing science background, the following C-like and agrep-like operators are also supported:

& or &&
C-like boolean AND (words/phrases must appear in same document)

| or ||
C-like boolean OR (either or both words/phrases must appear)

;
agrep-like boolean AND (words/phrases must appear in same document)

,
agrep-like boolean OR (either or both words/phrases must appear)

^
boolean NOT (the first word/phrase must appear but the second word must not)

() or {} or []
square, round or curly brackets may be used to group operators

The implementation of C and agrep style searches is pretty half hearted and is only really intended for casual use.

URL: http://www.commonlii.org/commonlii/help/sino.html
