reading module

This module contains classes that allow reading from an index.

Classes

class whoosh.reading.IndexReader

Do not instantiate this object directly. Instead use Index.reader().

all_doc_ids()

Returns an iterator of all (undeleted) document IDs in the reader.

all_stored_fields()

Yields the stored fields for all documents.

all_terms()

Yields (fieldname, text) tuples for every term in the index.

close()

Closes the open files associated with this reader.

define_facets(name, doclists, save=False)

Tells the reader to remember a set of facets under the given name.

Parameters:
  • name – the name to use for the set of facets.
  • doclists – a dictionary mapping facet names to lists of document IDs.
  • save – whether to save caches (if any) to some form of permanent storage (i.e. disk) if possible. This keyword may be used or ignored in the backend.

doc_count()

Returns the total number of UNDELETED documents in this reader.

doc_count_all()

Returns the total number of documents, DELETED OR UNDELETED, in this reader.
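
The relationship between doc_count(), doc_count_all(), is_deleted() and all_doc_ids() can be illustrated with a toy model. This is only a sketch: ToyReader is a hypothetical stand-in, not a Whoosh class, and it assumes document numbers run from 0 to total - 1.

```python
class ToyReader:
    """Hypothetical stand-in for a reader; not part of Whoosh."""

    def __init__(self, total, deleted):
        self._total = total            # every document number ever assigned
        self._deleted = set(deleted)   # document numbers marked deleted

    def doc_count_all(self):
        # Deleted or undeleted, every document is counted.
        return self._total

    def doc_count(self):
        # Only undeleted documents are counted.
        return self._total - len(self._deleted)

    def has_deletions(self):
        return bool(self._deleted)

    def is_deleted(self, docnum):
        return docnum in self._deleted

    def all_doc_ids(self):
        # Assumption: document numbers are 0 .. total - 1.
        return (n for n in range(self._total) if n not in self._deleted)


toy = ToyReader(total=5, deleted=[1, 3])
counts = (toy.doc_count(), toy.doc_count_all())
live_ids = list(toy.all_doc_ids())
```

In this model the two counts differ exactly by the number of documents marked deleted.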

doc_field_length(docnum, fieldname, default=0)

Returns the number of terms in the given field in the given document. This is used by some scoring algorithms.

doc_field_lengths(docnum)

Returns an iterator of (fieldname, length) pairs for the given document. This is used internally.

doc_frequency(fieldname, text)

Returns how many documents the given term appears in.

expand_prefix(fieldname, prefix)

Yields terms in the given field that start with the given prefix.
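
As a rough illustration of prefix expansion, the sketch below walks a sorted list of terms. It assumes the terms of a field are available in sorted order, which this page does not state; toy_expand_prefix is a hypothetical helper, not the Whoosh implementation.

```python
from bisect import bisect_left


def toy_expand_prefix(sorted_terms, prefix):
    """Yield terms from a sorted list that start with `prefix` (hypothetical helper)."""
    # Every term that starts with the prefix sorts at or after the prefix itself,
    # and all such terms are contiguous in sorted order.
    start = bisect_left(sorted_terms, prefix)
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        yield term


terms = ["apple", "render", "rendered", "rendering", "rent", "zebra"]
expanded = list(toy_expand_prefix(terms, "rend"))
```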

field_length(fieldname)

Returns the total number of terms in the given field. This is used by some scoring algorithms.

first_id(fieldname, text)

Returns the first ID in the posting list for the given term. This may be optimized in certain backends.

frequency(fieldname, text)

Returns the total number of instances of the given term in the collection.
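
The difference between doc_frequency() and frequency() can be shown on a toy posting list. The data and helper names below are hypothetical and only illustrate the two definitions given above.

```python
# Hypothetical postings for one term: document number -> occurrences in that document.
toy_postings = {0: 3, 4: 1, 7: 2}


def toy_doc_frequency(postings):
    # Number of documents that contain the term at least once.
    return len(postings)


def toy_frequency(postings):
    # Total number of occurrences of the term across all documents.
    return sum(postings.values())


doc_freq = toy_doc_frequency(toy_postings)
total_freq = toy_frequency(toy_postings)
```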

generation()

Returns the generation of the index being read, or -1 if the backend is not versioned.

group_docs_by(fieldname, docnums, groups, counts=False, offset=0)

Returns a dictionary mapping field values to items with that value in the given field(s).

Parameters:
  • fieldname – either the name of a field, or a tuple of field names to specify a multi-level sort.
  • docnums – a sequence of document numbers to group.
  • counts – if True, return a dictionary of doc counts, instead of a dictionary of lists of docnums.
  • offset – a number to add to the docnums returned.
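
The grouping described above might look like the following for a single field. This is a sketch under assumptions: field values are taken from a plain dictionary, the groups argument (not described on this page) is left out, and toy_group_docs_by is a hypothetical function.

```python
from collections import defaultdict


def toy_group_docs_by(field_values, docnums, counts=False, offset=0):
    """Group document numbers by field value (hypothetical, single field only)."""
    grouped_docs = defaultdict(list)
    for docnum in docnums:
        # offset is added to each document number that is returned.
        grouped_docs[field_values[docnum]].append(docnum + offset)
    if counts:
        # Return document counts instead of lists of document numbers.
        return {value: len(nums) for value, nums in grouped_docs.items()}
    return dict(grouped_docs)


tags = {0: "fruit", 1: "animal", 2: "fruit", 3: "animal", 4: "fruit"}
grouped = toy_group_docs_by(tags, [0, 1, 2, 4])
counted = toy_group_docs_by(tags, [0, 1, 2, 4], counts=True)
shifted = toy_group_docs_by(tags, [0, 1], offset=100)
```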

has_deletions()

Returns True if the underlying index/segment has deleted documents.

has_vector(docnum, fieldname)

Returns True if the given document has a term vector for the given field.

is_deleted(docnum)

Returns True if the given document number is marked deleted.

iter_field(fieldname, prefix='')

Yields (text, doc_freq, index_freq) tuples for all terms in the given field.

iter_from(fieldname, text)

Yields (field_num, text, doc_freq, index_freq) tuples for all terms in the reader, starting at the given term.

iter_prefix(fieldname, prefix)

Yields (field_num, text, doc_freq, index_freq) tuples for all terms in the given field with a certain prefix.

key_docs_by(fieldname, docnums, limit, reverse=False, offset=0)

Returns a sequence of (sorting_key, docnum) pairs for the document numbers in docnums.

If limit is None, this method associates every document number with a sorting key but does not sort them. If limit is not None, this method returns a sorted list of at most limit pairs.

This method is useful for sorting and faceting documents in different readers, by associating the sort key with the document number.

Parameters:
  • fieldname – either the name of a field, or a tuple of field names to specify a multi-level sort.
  • docnums – a sequence of document numbers to key.
  • limit – if not None, only keys the first/last N documents.
  • reverse – if True, reverses the sort direction (when limit is not None).
  • offset – a number to add to the docnums returned.
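
The behaviour of limit, reverse and offset can be sketched as follows. The function toy_key_docs_by and its dictionary of field values are hypothetical; the real method works on an index field, and its internals are not described here.

```python
import heapq


def toy_key_docs_by(field_values, docnums, limit, reverse=False, offset=0):
    """Pair each document number with a sort key (hypothetical sketch)."""
    pairs = [(field_values[n], n + offset) for n in docnums]
    if limit is None:
        # Keyed, but left in the order the document numbers were given.
        return pairs
    if reverse:
        return heapq.nlargest(limit, pairs)
    return heapq.nsmallest(limit, pairs)


prices = {0: 30, 1: 10, 2: 20, 3: 40}
unsorted_pairs = toy_key_docs_by(prices, [0, 1, 2, 3], limit=None)
cheapest = toy_key_docs_by(prices, [0, 1, 2, 3], limit=2)
priciest = toy_key_docs_by(prices, [0, 1, 2, 3], limit=2, reverse=True)

# Pairs from a second hypothetical reader, whose documents start at number 4,
# can be combined with the first because the keys travel with the numbers.
other_prices = {0: 15, 1: 5}
merged = sorted(cheapest + toy_key_docs_by(other_prices, [0, 1], limit=2, offset=4))
```

Because each pair carries its own key, results from separate readers can be merged without looking the field values up again.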

leaf_readers()

Returns a list of (IndexReader, docbase) pairs for the child readers of this reader if it is a composite reader, or None if this reader is atomic.
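
One way to picture the (IndexReader, docbase) pairs is as offsets into a shared numbering. The sketch below assumes that docbase is the composite document number of a leaf's first document; that interpretation, and the helper toy_locate, are assumptions rather than something this page confirms.

```python
from bisect import bisect_right


def toy_locate(docbases, docnum):
    """Map a composite document number to (leaf index, local document number)."""
    # Assumption: docbases is sorted and docbases[i] is the composite number
    # of the first document in leaf i.
    leaf = bisect_right(docbases, docnum) - 1
    return leaf, docnum - docbases[leaf]


# Hypothetical composite of three leaves holding 4, 3 and 5 documents.
docbases = [0, 4, 7]
located = [toy_locate(docbases, n) for n in (0, 3, 4, 6, 7, 11)]
```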

lexicon(fieldname)

Yields all terms in the given field.

max_field_length(fieldname, default=0)

Returns the maximum length of the field across all documents.

most_distinctive_terms(fieldname, number=5, prefix=None)

Returns the top ‘number’ terms with the highest tf*idf scores as a list of (score, text) tuples.
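
A tf*idf ranking of this kind can be sketched as below. The exact weighting Whoosh uses is not given on this page; the sketch assumes tf is the term's total frequency and idf is log(N / doc_frequency), which is only one common form. The function and its input are hypothetical.

```python
import math


def toy_most_distinctive(term_stats, doc_count, number=5):
    """Rank terms by tf*idf (hypothetical sketch).

    term_stats maps text -> (frequency, doc_frequency).
    """
    scored = []
    for text, (freq, doc_freq) in term_stats.items():
        # Assumed idf form; terms found in every document score zero.
        idf = math.log(doc_count / doc_freq)
        scored.append((freq * idf, text))
    scored.sort(reverse=True)
    return scored[:number]


stats = {"the": (500, 100), "render": (40, 10), "whoosh": (12, 2)}
top = toy_most_distinctive(stats, doc_count=100, number=2)
```

Under this weighting a very frequent term that appears in every document, such as "the" in the toy data, ranks below rarer terms.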

most_frequent_terms(fieldname, number=5, prefix='')

Returns the top ‘number’ most frequent terms in the given field as a list of (frequency, text) tuples.

postings(fieldname, text, scorer=None)

Returns a Matcher for the postings of the given term.

>>> pr = reader.postings("content", "render")
>>> pr.skip_to(10)
>>> pr.id
12

Parameters:
  • fieldname – the field name or field number of the term.
  • text – the text of the term.

Return type:

whoosh.matching.Matcher
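
The skip_to() call in the example above moves the matcher forward to the first posting at or after the requested document number. A toy cursor over a sorted list shows the idea; ToyMatcher is a hypothetical class and does not reproduce the whoosh.matching.Matcher interface.

```python
from bisect import bisect_left


class ToyMatcher:
    """Hypothetical cursor over a sorted list of document numbers."""

    def __init__(self, docnums):
        self._docnums = sorted(docnums)
        self._i = 0

    @property
    def id(self):
        # Document number of the current posting.
        return self._docnums[self._i]

    def skip_to(self, docnum):
        # Move forward (never backward) to the first posting whose ID is >= docnum.
        self._i = max(self._i, bisect_left(self._docnums, docnum))


pr = ToyMatcher([2, 5, 12, 30])
pr.skip_to(10)
skipped_id = pr.id
```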

set_caching_policy(*args, **kwargs)

Sets the field caching policy for this reader.

sort_docs_by(fieldname, docnums, reverse=False)

Returns a version of docnums sorted by the value of a field or a set of fields in each document.

Parameters:
  • fieldname – either the name of a field, or a tuple of field names to specify a multi-level sort.
  • docnums – a sequence of document numbers to sort.
  • reverse – if True, reverses the sort direction.
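
A multi-level sort over a tuple of field names can be sketched with an ordinary key function. The stored values and toy_sort_docs_by below are hypothetical; they only illustrate how a tuple of field names yields a compound sort key.

```python
def toy_sort_docs_by(stored, fieldname, docnums, reverse=False):
    """Sort document numbers by one field or a tuple of fields (hypothetical)."""
    names = (fieldname,) if isinstance(fieldname, str) else tuple(fieldname)

    def key(docnum):
        # A tuple key compares the first field, then the second, and so on.
        return tuple(stored[docnum][name] for name in names)

    return sorted(docnums, key=key, reverse=reverse)


docs = {
    0: {"author": "lee", "year": 2009},
    1: {"author": "kim", "year": 2010},
    2: {"author": "lee", "year": 2007},
}
by_author_year = toy_sort_docs_by(docs, ("author", "year"), [0, 1, 2])
by_year_desc = toy_sort_docs_by(docs, "year", [0, 1, 2], reverse=True)
```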

stored_fields(docnum)

Returns the stored fields for the given document number.

Parameters:
  • numerickeys – use field numbers as the dictionary keys instead of field names.

supports_caches()

Returns True if this reader supports the field cache protocol.

vector(docnum, fieldname)

Returns a Matcher object for the given term vector.

>>> docnum = searcher.document_number(path=u'/a/b/c')
>>> v = searcher.vector(docnum, "content")
>>> v.all_as("frequency")
[(u"apple", 3), (u"bear", 2), (u"cab", 2)]

Parameters:
  • docnum – the document number of the document for which you want the term vector.
  • fieldname – the field name or field number of the field for which you want the term vector.

Return type:

whoosh.matching.Matcher

vector_as(astype, docnum, fieldname)

Returns an iterator of (termtext, value) pairs for the terms in the given term vector. This is a convenient shortcut to calling vector() and using the Matcher object when all you want are the terms and/or values.

>>> docnum = searcher.document_number(path=u'/a/b/c')
>>> searcher.vector_as("frequency", docnum, "content")
[(u"apple", 3), (u"bear", 2), (u"cab", 2)]

Parameters:
  • docnum – the document number of the document for which you want the term vector.
  • fieldname – the field name or field number of the field for which you want the term vector.
  • astype – a string containing the name of the format you want the term vector’s data in, for example “weights”.

class whoosh.reading.MultiReader(readers, generation=-1)

Do not instantiate this object directly. Instead use Index.reader().

Exceptions

exception whoosh.reading.TermNotFound
