The highlight module requires that you have the text of the indexed document available. You can keep the text in a stored field, or if the original text is available in a file, database column, etc, just reload it on the fly. Note that you might need to process the text to remove e.g. HTML tags, wiki markup, etc.
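As a rough illustration of the clean-up step mentioned above, the following sketch strips HTML tags using only the Python standard library. The names strip_html and _TextExtractor are hypothetical helpers invented for this example; they are not part of Whoosh, and real markup (scripts, style blocks, wiki syntax) may need more careful handling.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the character data of an HTML document, dropping the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for runs of text between tags (entities already decoded)
        self.parts.append(data)


def strip_html(markup):
    """Return the text content of ``markup`` with all tags removed.

    Note: this keeps the contents of <script> and <style> elements too;
    filter those separately if your documents contain them.
    """
    extractor = _TextExtractor()
    extractor.feed(markup)
    extractor.close()  # flush any buffered text
    return "".join(extractor.parts)
```

The cleaned string is what you would pass on for highlighting, so that character offsets refer to the text the user actually sees.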
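The highlight module works on a pipeline:
1. Run the text through an analyzer to turn it into a token stream [1].
2. Break the token stream into fragments using a fragmenter.
3. Score each fragment based on the matched query terms it contains.
4. Format the highest-scoring fragments for display.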
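To give a feel for such a pipeline (analyze, fragment, score, format), here is a toy version in plain Python. It is only a sketch under simplifying assumptions: toy_highlight is a hypothetical function, the "analyzer" is a lowercase word regex, fragments are sentences, the score is the number of matched terms, and formatting is upper-casing. Whoosh's own components are more capable than this.

```python
import re


def toy_highlight(text, terms, top=3):
    """Return up to ``top`` sentence fragments with matched terms upper-cased."""
    terms = {t.lower() for t in terms}

    # 1. Analyze: turn the text into (word, startchar, endchar) tokens
    tokens = [(m.group(0).lower(), m.start(), m.end())
              for m in re.finditer(r"\w+", text)]

    # 2. Fragment: one (startchar, endchar) span per sentence
    spans = [(m.start(), m.end())
             for m in re.finditer(r"[^.!?]+[.!?]*", text)]

    # 3. Score: count the matched terms inside each fragment
    scored = []
    for start, end in spans:
        matches = [(s, e) for word, s, e in tokens
                   if start <= s and e <= end and word in terms]
        if matches:
            scored.append((len(matches), start, end, matches))

    # Keep the best-scoring fragments, then restore document order
    best = sorted(scored, key=lambda f: -f[0])[:top]
    best.sort(key=lambda f: f[1])

    # 4. Format: rebuild each fragment with the matches upper-cased
    output = []
    for _score, start, end, matches in best:
        pieces, pos = [], start
        for s, e in matches:
            pieces.append(text[pos:s])
            pieces.append(text[s:e].upper())
            pos = e
        pieces.append(text[pos:end])
        output.append("".join(pieces).strip())
    return output
```

Each numbered comment corresponds to one pipeline stage, which is also where the real module lets you plug in a different analyzer, fragmenter, scorer, or formatter.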
Footnotes
[1] Some search systems, such as Lucene, can use term vectors to highlight text without retokenizing it. In my tests I found that using a Position/Character term vector didn’t give any speed improvement in Whoosh over retokenizing the text. This probably needs further investigation.
The whoosh.searching.Hit objects you get from a whoosh.searching.Results object have a highlights() method which returns highlighted snippets from the document. The only required argument is the name of the field to highlight:
results = searcher.search(myquery)
for hit in results:
    print(hit["title"])
    print(hit.highlights("content"))
This assumes the "content" field is marked stored in the schema so it is available in the stored fields for the document. If you don’t store the contents of the field you want to highlight in the index, but have access to it another way (for example, reading from a file or a database), you can supply the text as an optional second argument:
results = searcher.search(myquery)
for hit in results:
    print(hit["title"])
    # Instead of storing the contents in the index, I stored a file path
    # so I could retrieve the contents from the original file
    path = hit["path"]
    with open(path) as f:
        text = f.read()
    print(hit.highlights("content", text))
You can customize the creation of the snippets by setting the fragmenter and/or formatter attributes on the Results object, or by passing the fragmenter and/or formatter keyword arguments to highlights(). Set the Results.fragmenter attribute to a whoosh.highlight.Fragmenter object (see “Fragmenters” below) and/or the Results.formatter attribute to a whoosh.highlight.Formatter object (see “Formatters” below).
For example, to return larger fragments and highlight them by converting to upper-case instead of with HTML tags:
from whoosh import highlight
r = searcher.search(myquery)
r.fragmenter = highlight.ContextFragmenter(surround=40)
r.formatter = highlight.UppercaseFormatter()
for hit in r:
    print(hit["title"])
    print(hit.highlights("content"))
Using the keyword argument(s) is useful when you want to alternate highlighting styles in the same results:
r = searcher.search(myquery)
# Use this fragmenter for titles, just returns the entire field as a single
# fragment
tf = highlight.WholeFragmenter()
# Use this fragmenter for content
cf = highlight.SentenceFragmenter()
# Use the same formatter for both
r.formatter = highlight.HtmlFormatter(tagname="span")
for hit in r:
    # Print the title with matched terms highlighted
    print(hit.highlights("title", fragmenter=tf))
    # Print the content snippet
    print(hit.highlights("content", fragmenter=cf))
You can use the top keyword argument to control the number of fragments returned in each snippet:
# Show a maximum of 5 fragments from the document
print(hit.highlights("content", top=5))
You can control the order of the fragments in the snippet with the order keyword argument. The value of the argument should be a sorting function for fragments. The whoosh.highlight module contains several sorting functions such as whoosh.highlight.SCORE(), whoosh.highlight.FIRST(), whoosh.highlight.LONGER(), whoosh.highlight.SHORTER(). The default is highlight.FIRST, which is usually best.
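The effect of different orderings can be illustrated without Whoosh. The sketch below assumes a sorting function behaves like a sort key that receives a fragment and returns a value to sort by. The Frag tuple and the by_* functions are hypothetical stand-ins written for this example; they are not the definitions used inside whoosh.highlight.

```python
from collections import namedtuple

# Hypothetical stand-in for a fragment: character span plus a score
Frag = namedtuple("Frag", "startchar endchar score")


def by_first(frag):
    """Earliest fragment in the document first."""
    return frag.startchar


def by_score(frag):
    """Highest-scoring fragment first."""
    return -frag.score


def by_longer(frag):
    """Longest fragment first."""
    return -(frag.endchar - frag.startchar)


def by_shorter(frag):
    """Shortest fragment first."""
    return frag.endchar - frag.startchar


frags = [Frag(120, 180, 2), Frag(0, 40, 1), Frag(50, 70, 3)]


def starts(order):
    """Return the start offsets of ``frags`` sorted with the given key."""
    return [f.startchar for f in sorted(frags, key=order)]
```

With these sample fragments, each key produces a different ordering, which is why document order (the default) tends to read most naturally in a snippet while score order puts the densest match first.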
The high-level interface is the highlight function:
excerpts = highlight(text, terms, analyzer, fragmenter, formatter, top=3,
                     scorer=BasicFragmentScorer, minscore=1, order=FIRST)
from whoosh import fields, highlight, qparser
from whoosh.filedb.filestore import RamStorage

# Set up the index
# ----------------
st = RamStorage()
schema = fields.Schema(id=fields.ID(stored=True),
                       title=fields.TEXT(stored=True))
ix = st.create_index(schema)
w = ix.writer()
w.add_document(id=u"1", title=u"The man who wasn't there")
w.add_document(id=u"2", title=u"The dog who barked at midnight")
w.add_document(id=u"3", title=u"The invisible man")
w.add_document(id=u"4", title=u"The girl with the dragon tattoo")
w.add_document(id=u"5", title=u"The woman who disappeared")
w.commit()
# Perform a search
# ----------------
s = ix.searcher()
# Parse the user query
parser = qparser.QueryParser("title", schema=ix.schema)
q = parser.parse(u"man")
# Extract the terms the user used in the field we're interested in
# THIS IS HOW YOU GET THE TERMS ARGUMENT TO highlight()
terms = [text for fieldname, text in q.all_terms()
         if fieldname == "title"]
# Get the search results
r = s.search(q)
assert len(r) == 2
# Use the same analyzer as the field uses. To be sure, you can
# do schema[fieldname].format.analyzer. Be careful not to do this
# on non-text field types such as DATETIME.
analyzer = schema["title"].format.analyzer
# Since we want to highlight the full title, not extract fragments,
# we'll use WholeFragmenter. See the docs for the highlight module
# for which fragmenters are available.
fragmenter = highlight.WholeFragmenter()
# This object controls what the highlighted output looks like.
# See the docs for its arguments.
formatter = highlight.HtmlFormatter()
for d in r:
    # The text argument to highlight is the stored text of the title
    text = d["title"]
    print(highlight.highlight(text, terms, analyzer,
                              fragmenter, formatter))
A fragmenter controls the policy of how to extract excerpts from the original text. It is a callable that takes the original text, the set of terms to match, and the token stream, and returns a sequence of Fragment objects.
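The available fragmenters include:
- WholeFragmenter: returns the entire text as a single fragment, which is useful for short fields such as titles.
- SentenceFragmenter: breaks the text into fragments at sentence-ending punctuation.
- ContextFragmenter: builds fragments from the matched terms plus the surrounding text; the surround argument controls how much context is included.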
See the whoosh.highlight docs for more information.
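To make the fragmenter contract described above more concrete (a callable that takes the text, the terms, and the tokens, and returns fragments), here is a minimal context-style sketch in plain Python. Token, Fragment, simple_tokens, and context_fragmenter are hypothetical stand-ins defined for this example only, with attribute names chosen to resemble the ones this document mentions. The code assumes the tokens arrive in document order.

```python
import re
from collections import namedtuple

# Hypothetical stand-ins for the library's token and fragment objects
Token = namedtuple("Token", "text startchar endchar")
Fragment = namedtuple("Fragment", "startchar endchar matches")


def simple_tokens(text):
    """Toy analyzer: lowercase words with their character offsets."""
    return [Token(m.group(0).lower(), m.start(), m.end())
            for m in re.finditer(r"\w+", text)]


def context_fragmenter(text, terms, tokens, surround=20):
    """Build one fragment around each matched token, merging overlaps."""
    fragments = []
    for token in tokens:
        if token.text not in terms:
            continue
        start = max(0, token.startchar - surround)
        end = min(len(text), token.endchar + surround)
        if fragments and start <= fragments[-1].endchar:
            # Overlaps the previous fragment: extend it instead
            prev = fragments[-1]
            fragments[-1] = Fragment(prev.startchar,
                                     max(prev.endchar, end),
                                     prev.matches + [token])
        else:
            fragments.append(Fragment(start, end, [token]))
    return fragments
```

A larger surround value pulls in more context and makes neighbouring matches more likely to merge into a single fragment.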
A formatter controls how the highest-scoring fragments are turned into formatted text for display to the user. It can return anything (e.g. plain text, HTML, a Genshi event stream, a SAX event generator, anything useful to the calling system).
Whoosh currently includes only two formatters, because I wrote this module for myself and that’s all I needed at the time. Unless you happen to be using Genshi also, you’ll probably need to implement your own formatter. I’ll try to add more useful formatters in the future.
See the whoosh.highlight docs for more information.
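A Formatter subclass needs a __call__ method. It is called with the following arguments:
formatter(text, fragments)
Here text is the original text, and fragments is a sequence of Fragment objects representing the highest-scoring fragments.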
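The Fragment object is a simple object whose attributes contain basic information about the fragment, including Fragment.startchar and Fragment.endchar (the character offsets of the fragment in the original text) and Fragment.matches (the Token objects for the matched terms in the fragment).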
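The basic work you need to do in the formatter is:
- Take the text of the original document and extract the excerpt between Fragment.startchar and Fragment.endchar.
- For each matched Token object in the fragment (Fragment.matches), highlight the part of the excerpt between Token.startchar and Token.endchar. (Remember that the character indices refer to the original text, so you need to adjust them for the excerpt.)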
The tricky part is that if you’re adding text (e.g. inserting HTML tags into the output), you have to be careful about keeping the character indices straight.
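One way to sidestep the index bookkeeping is to build the output from slices rather than inserting markers into a string in place. The sketch below shows this with a hypothetical BracketFormatter that wraps matches in square brackets. Token and Fragment here are stand-in tuples defined for the example, assumed to expose the startchar, endchar, and matches attributes discussed above; this is not a drop-in Whoosh class.

```python
from collections import namedtuple

# Hypothetical stand-ins for the library's token and fragment objects
Token = namedtuple("Token", "text startchar endchar")
Fragment = namedtuple("Fragment", "startchar endchar matches")


class BracketFormatter:
    """Wrap each matched term in [brackets] and join fragments with '...'."""

    def __init__(self, between="..."):
        self.between = between

    def __call__(self, text, fragments):
        return self.between.join(self._format_fragment(text, fragment)
                                 for fragment in fragments)

    def _format_fragment(self, text, fragment):
        # Extract the excerpt for this fragment from the original text
        excerpt = text[fragment.startchar:fragment.endchar]
        pieces = []
        pos = 0  # current position within the excerpt
        for token in fragment.matches:
            # Token offsets refer to the original text; shift them so
            # they are relative to the start of the excerpt
            start = token.startchar - fragment.startchar
            end = token.endchar - fragment.startchar
            pieces.append(excerpt[pos:start])
            pieces.append("[" + excerpt[start:end] + "]")
            pos = end
        pieces.append(excerpt[pos:])
        return "".join(pieces)
```

Because the excerpt itself is never modified, the offsets of later tokens stay valid no matter how much markup is added around earlier ones.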