<li><a href="#grammar">Vank Query Language Grammar</a></li>
<li><a href="#prox">Terms / Proximity</a></li>
<li><a href="#belief">Combining Beliefs</a></li>
<li><a href="#filter">Filter Operators</a></li>
<li><a href="#numeric">Numeric / Date Field Operators</a></li>
<li><a href="#prior">Document Priors</a></li>
<li><a href="#applications">Applications</a></li>
</ol>
<!-- end page table of contents -->
<hr/>
<!-- begin main -->
<p><span class="notetext">Note: many thanks to <a href="http://ciir.cs.umass.edu/~metzler/" target="_blank">Don Metzler</a> for this information.</span></p>
<h3 id="intro">1. Introduction</h3>
<p>
The Vank query language, based on the InQuery query language, was designed to be robust. It can handle both
simple keyword queries and extremely complex queries. Such a query language sets Vank apart from many other
available search engines. It allows complex phrase matching, synonyms, weighted expressions, Boolean filtering,
numeric (and dated) fields, and the extensive use of document structure (fields), among others.
</p>
<p>
Although Vank handles unstructured documents, many of the query language features make use of structured (tagged)
documents. Consider the following document: <br/>
<pre>
&lt;html&gt;
&lt;head&gt;
&lt;title&gt;Department Descriptions&lt;/title&gt;
&lt;/head&gt;
&lt;body&gt;
The following list describes ...
&lt;h1&gt;Agriculture&lt;/h1&gt; ...
&lt;h1&gt;Chemistry&lt;/h1&gt; ...
&lt;h1&gt;Computer Science&lt;/h1&gt; ...
&lt;h1&gt;Electrical Engineering&lt;/h1&gt; ...
&lt;/body&gt;
&lt;/html&gt;
</pre>
<br/>
In Vank, a <i><b>document</b></i> is viewed as a sequence of text that may contain arbitrary tags. In the example
above, the document consists of text marked up with HTML tags.
</p>
<p>
For each tag type <tt>T</tt> within a document (e.g. <tt>title</tt>, <tt>body</tt>, <tt>h1</tt>, etc.), we define the <i><b>context</b></i> of <tt>T</tt> to be all of the text and tags that appear within tags of type <tt>T</tt>. In the example above, all of the text and tags appearing between the <tt>&lt;body&gt;</tt> and <tt>&lt;/body&gt;</tt> tags define the body context. A single context is generated for each unique tag name. Therefore, a context defines a subdocument. Note that because of nested tags, certain word occurrences may appear in many contexts. Contexts may also be nested. For example, within the <tt>&lt;body&gt;</tt> context there is a nested <tt>&lt;h1&gt;</tt> context made up of all of the text and tags that appear within the body context and within <tt>&lt;h1&gt;</tt> and <tt>&lt;/h1&gt;</tt> tags. Here are the <tt>title</tt>, <tt>h1</tt>, and <tt>body</tt> contexts for the example document:
</p>
<p>
<tt>title</tt> context:
<pre>
&lt;title&gt;Department Descriptions&lt;/title&gt;
</pre>
</p>
<p>
<tt>h1</tt> context:
<pre>
&lt;h1&gt;Agriculture&lt;/h1&gt;
&lt;h1&gt;Chemistry&lt;/h1&gt; ...
&lt;h1&gt;Computer Science&lt;/h1&gt; ...
&lt;h1&gt;Electrical Engineering&lt;/h1&gt; ...
</pre>
</p>
<p>
<tt>body</tt> context:
<pre>
&lt;body&gt;
The following list describes ...
&lt;h1&gt;Agriculture&lt;/h1&gt; ...
&lt;h1&gt;Chemistry&lt;/h1&gt; ...
&lt;h1&gt;Computer Science&lt;/h1&gt; ...
&lt;h1&gt;Electrical Engineering&lt;/h1&gt; ...
&lt;/body&gt;
</pre>
</p>
<p>
Finally, each context is made up of one or more <i><b>extents</b></i>. An extent is a sequence of text that appears within a single begin/end tag pair of the same type as the context. For the example above, in the <tt>&lt;h1&gt;</tt> context, there are extents "<tt>&lt;h1&gt;</tt>agriculture<tt>&lt;/h1&gt;</tt>", "<tt>&lt;h1&gt;</tt>chemistry<tt>&lt;/h1&gt;</tt>", etc. Both the title and body contexts contain only a single extent because there is only a single pair of <tt>&lt;title&gt;</tt> ... <tt>&lt;/title&gt;</tt> and <tt>&lt;body&gt;</tt> ... <tt>&lt;/body&gt;</tt> tags, respectively. The number of extents for a given tag type <tt>T</tt> is determined by the number of sequences of the form <tt>&lt;T&gt;</tt> text <tt>&lt;/T&gt;</tt> that occur within the document.
</p>
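<p>
The relationship between contexts and extents can be illustrated with a short Python sketch. It is not part of Vank: the <tt>ExtentCollector</tt> class and <tt>extents</tt> function are hypothetical helpers built on Python's standard <tt>html.parser</tt> module, and they return only the text of each extent (inner tags are dropped for brevity).
</p>

```python
from html.parser import HTMLParser


class ExtentCollector(HTMLParser):
    """Collect the text of every extent of one tag type (illustrative only)."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0       # number of open tags of the target type around us
        self.extents = []    # finished extents, in document order
        self._buffer = []    # text pieces of the extent currently being read

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            if self.depth == 1:
                self._buffer = []

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Normalise whitespace so each extent is a single line of text.
                self.extents.append(" ".join("".join(self._buffer).split()))

    def handle_data(self, data):
        if self.depth > 0:
            self._buffer.append(data)


def extents(document, tag):
    """Return the text of each extent of type `tag` found in `document`."""
    collector = ExtentCollector(tag)
    collector.feed(document)
    collector.close()
    return collector.extents


DOC = """<html><head><title>Department Descriptions</title></head>
<body>The following list describes ...
<h1>Agriculture</h1> ...
<h1>Chemistry</h1> ...
<h1>Computer Science</h1> ...
<h1>Electrical Engineering</h1> ...
</body></html>"""

h1_extents = extents(DOC, "h1")        # four extents, one per heading
title_extents = extents(DOC, "title")  # a single extent
body_extents = extents(DOC, "body")    # a single extent containing the h1 text
```

<p>
Taken together, the extents returned for one tag type correspond to that tag's context.
</p>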
<h3 id="grammar">2. Vank Query Language Grammar</h3>
<p>
NOTE: If you are unsure which belief operator to use, it is always safest to default to
the #combine or #weight operator. These operators are often the best choice for
combining evidence. NEVER use #wsum or #wand unless you really know what you're doing!
</p>
<h4>Extent / Passage retrieval:</h4>
<p>
<ul>
<li>
#beliefop[field]( query ) -- evaluates #beliefop( query ) for all extents
of type "field" in the document and returns a score for each. The language
model used to evaluate the query is formed from the text of the extent.
</li>
<li>
#beliefop[passageWIDTH:INC]( query ) -- evaluates #beliefop( query ) for every
fixed length passage of length WIDTH terms. The passage window is slid over the text
in increments of INC terms. The language model used to evaluate the query is formed
from the text within the current passage.
</li>
</ul>
</p>
<p>
<i>Example:</i>
<ul>
<li>
#combine[sentence]( #1(napoleon died in #any:DATE ) ) -- returns a scored
list of sentence extents that match the given query.
</li>
<li>
#combine[passage100:50]( #1(napoleon died in #any:DATE ) ) -- returns a scored
list of passages (of length 100) that match the given query.
</li>
</ul>
</p>
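<p>
To make the WIDTH and INC parameters concrete, the following Python sketch enumerates the passage windows that a specification such as <tt>passage100:50</tt> would slide over a document. The <tt>passages</tt> function is a hypothetical helper, and the handling of the final window (truncated at the end of the text rather than dropped or padded) is an assumption, not documented Vank behavior.
</p>

```python
def passages(terms, width, inc):
    """Return (start, end) term offsets of fixed-length passages.

    Windows are `width` terms long and start every `inc` terms.
    Assumption: the last window is truncated at the end of the text.
    """
    if width <= 0 or inc <= 0:
        raise ValueError("width and inc must be positive")
    spans = []
    start = 0
    while start < len(terms):
        end = min(start + width, len(terms))
        spans.append((start, end))
        if end == len(terms):
            break  # the window already reaches the end of the document
        start += inc
    return spans


# A 250-term document scanned with passage100:50.
document_terms = ["t%d" % i for i in range(250)]
windows = passages(document_terms, 100, 50)
```

<p>
With INC smaller than WIDTH, consecutive passages overlap, so a match near a window boundary still falls entirely inside at least one passage.
</p>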
<h3 id="filter">5. Filter Operators</h3>
<p>
Filter operators restrict scoring to a subset of the collection: only documents that pass the filter
are scored and ranked.
</p>
<h4>Filter operators:</h4>
<p>
<ul>
<li>#filreq -- filter require</li>
<li>#filrej -- filter reject</li>
</ul>
</p>
<p>
<i>Examples:</i>
<ul>
<li>
#filreq( sheep #combine(dolly cloning) ) -- only consider those documents matching the query "sheep" and rank
them according to the query #combine(dolly cloning)
</li>
<li>
#filrej( parton #combine(dolly cloning) ) -- only consider those documents NOT matching the query "parton" and rank them according to the query #combine(dolly cloning)
</li>
</ul><br/>
NOTE: The first argument must always be a term/proximity expression.
</p>
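<p>
The two-step behavior of the filter operators (select documents with the first argument, rank them with the second) can be sketched in Python as follows. The functions <tt>filreq</tt>, <tt>filrej</tt>, <tt>has_term</tt>, and <tt>toy_score</tt> are hypothetical; in particular, <tt>toy_score</tt> simply counts query terms and is only a stand-in for Vank's belief scoring.
</p>

```python
def filreq(docs, matches, score):
    """Keep documents for which matches(doc) is true, ranked by score(doc)."""
    kept = [doc for doc in docs if matches(doc)]
    return sorted(kept, key=score, reverse=True)


def filrej(docs, matches, score):
    """Keep documents for which matches(doc) is false, ranked by score(doc)."""
    kept = [doc for doc in docs if not matches(doc)]
    return sorted(kept, key=score, reverse=True)


def has_term(term):
    """Build a filter that matches documents containing `term`."""
    return lambda doc: term in doc[1].split()


def toy_score(doc):
    """Stand-in for #combine(dolly cloning): count occurrences of the terms."""
    words = doc[1].split()
    return words.count("dolly") + words.count("cloning")


# Each document is an (id, text) pair.
DOCS = [
    ("d1", "dolly the sheep cloning experiment"),
    ("d2", "dolly parton concert"),
    ("d3", "sheep farming"),
    ("d4", "cloning dolly cloning"),
]

required = [doc_id for doc_id, _ in filreq(DOCS, has_term("sheep"), toy_score)]
rejected = [doc_id for doc_id, _ in filrej(DOCS, has_term("parton"), toy_score)]
```

<p>
Note that the filter only decides which documents are eligible; it contributes nothing to their scores.
</p>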
<h3 id="numeric">6. Numeric / Date Field Operators</h3>
<p>
Numeric and date field operators provide a number of facilities for matching different criteria. These operators
are very useful when used in combination with the filter operators.
</p>
<h4>General numeric operators:</h4>
<p>
<ul>
<li>#less( F N ) -- matches numeric field extents of type F if value &lt; N</li>
<li>#greater( F N ) -- matches numeric field extents of type F if value &gt; N</li>
<li>#between( F N_low N_high ) -- matches numeric field extents of type F if N_low ≤ value ≤ N_high</li>
<li>#equals( F N ) -- matches numeric field extents of type F if value == N</li>
</ul>
</p>
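<p>
The comparisons performed by the general numeric operators can be written out as plain Python predicates. This is only a sketch of the matching rules listed above: the function names mirror the operators, and the list of numbers stands in for the values of the numeric field extents of one field in one document.
</p>

```python
def less(values, n):
    """Values matched by #less( F N ): strictly below n."""
    return [v for v in values if v < n]


def greater(values, n):
    """Values matched by #greater( F N ): strictly above n."""
    return [v for v in values if v > n]


def between(values, low, high):
    """Values matched by #between( F N_low N_high ): both ends inclusive."""
    return [v for v in values if low <= v <= high]


def equals(values, n):
    """Values matched by #equals( F N ): exactly n."""
    return [v for v in values if v == n]


# Hypothetical values of a numeric field in a single document.
field_values = [3, 10, 12]

below_ten = less(field_values, 10)
above_ten = greater(field_values, 10)
ten_to_twelve = between(field_values, 10, 12)
exactly_ten = equals(field_values, 10)
```

<p>
The boundary value 10 shows the difference between the strict comparisons of #less and #greater and the inclusive range of #between.
</p>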
<h4>Date operators:</h4>
<p>
<ul>
<li>#date:after( D ) -- matches numeric "date" extents if date is after D</li>
<li>#date:before( D ) -- matches numeric "date" extents if date is before D</li>
<li>#date:between( D_low, D_high ) -- matches numeric "date" extents if D_low ≤ date ≤ D_high</li>
</ul>
</p>
<p>
<i>Acceptable date formats:</i>
<ul>
<li>11 january 2004</li>
<li>11-JAN-04</li>
<li>11-JAN-2004</li>
<li>January 11 2004</li>
<li>01/11/04 (MM/DD/YY)</li>
<li>01/11/2004 (MM/DD/YYYY)</li>
</ul>
</p>
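<p>
When generating queries programmatically, it can help to check that a date string is in one of the formats listed above. The Python sketch below does this with the standard <tt>datetime</tt> module. The <tt>parse_date</tt> helper and the <tt>DATE_FORMATS</tt> list are hypothetical, month names are assumed to be English, and two-digit years follow Python's own convention (00 to 68 map to 2000 to 2068), which may differ from Vank's.
</p>

```python
from datetime import date, datetime

# One strptime pattern per accepted format; four-digit years are tried first.
DATE_FORMATS = [
    "%d %B %Y",   # 11 january 2004
    "%d-%b-%Y",   # 11-JAN-2004
    "%d-%b-%y",   # 11-JAN-04
    "%B %d %Y",   # January 11 2004
    "%m/%d/%Y",   # 01/11/2004 (MM/DD/YYYY)
    "%m/%d/%y",   # 01/11/04 (MM/DD/YY)
]


def parse_date(text):
    """Return a datetime.date for `text`, or raise ValueError if no format fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue  # try the next pattern
    raise ValueError("unrecognised date format: %r" % text)


EXAMPLES = [
    "11 january 2004",
    "11-JAN-04",
    "11-JAN-2004",
    "January 11 2004",
    "01/11/04",
    "01/11/2004",
]
parsed = [parse_date(example) for example in EXAMPLES]
```

<p>
All six example strings denote the same day, 11 January 2004.
</p>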
<p>
<i>Examples:</i>
<ul>
<li>
#filreq(#less(READINGLEVEL 10) george washington) -- if each document in a collection contained a
numeric tag that specified the reading level of the document, then this query will only retrieve
documents that have a reading level below grade 10 and documents will be ranked according to the
query "george washington".
</li>
<li>
#combine( european history #date:between( 01/01/1800, 01/01/1900 ) ) -- such a query may be
constructed to find information about 19th-century European history, as this query will find
pages that discuss "european history" and contain 19th-century dates.
</li>
</ul>
</p>
<p>
NOTE: The general numeric operators only work on indexed numeric fields, whereas the date
operators are only applicable to a specially indexed numeric field named "date". See the
indexing documentation for more on numeric fields.
</p>
<h3 id="prior">7. Document Priors</h3>
<p>
Document priors allow you to impose a "prior probability" over the documents in a collection.
</p>
<h4>Prior</h4>
<p>
<ul>
<li>#prior( NAME ) -- creates the document prior specified by the name given</li>
</ul>
</p>
<p>
<i>Example:</i>
<ul>
<li>
#combine(#prior(RECENT) global warming) -- we might create a prior named RECENT to be used to
give greater weight to documents that were published more recently.
</li>
</ul>
</p>
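<p>
The effect of a prior on ranking can be sketched in Python. The sketch assumes that #combine averages the log of its arguments' beliefs (a geometric mean) and that #prior contributes the document's prior probability as one more belief; neither assumption is stated in this section, and all belief values below are made up for illustration.
</p>

```python
import math


def combine(beliefs):
    """Average log belief (assumed #combine behavior, for illustration only)."""
    return sum(math.log(b) for b in beliefs) / len(beliefs)


# Two hypothetical documents that match "global" and "warming" equally well
# but differ in the value assigned to them by a RECENT prior.
term_beliefs = [0.5, 0.5]
older_doc_score = combine([0.2] + term_beliefs)   # low prior
recent_doc_score = combine([0.8] + term_beliefs)  # high prior
```

<p>
Under these assumptions the more recent document receives the higher score even though the query terms match both documents equally.
</p>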
<h3 id="applications">8. Applications</h3>
<p>
Here we list suggested uses of the language for several common information retrieval tasks.
</p>
<h4>Ad Hoc Retrieval (Query Likelihood)</h4>
<p>
Ad hoc retrieval is the standard information retrieval task of finding documents that are topically
relevant to a given information need (query). One common probabilistic approach to ad hoc retrieval
is the query likelihood retrieval paradigm from language modeling. It is very simple to construct a
Vank query that ranks documents the same as query likelihood. For the query "literacy rates africa",
we construct the following Vank query:
</p>
<p>
<pre>
#combine( literacy rates africa )
</pre>
<br/>
This returns a ranked list that is exactly equivalent to the query likelihood ranking (under the given