Adding an expansion source
The current code assumes throughout that the only source of expansion terms is a Gensim word2vec model.
Adding a new source of terms will probably require the following steps:
New expand module
Either create a new expand-newsource module based on expand.py, or refactor the existing code into abstract classes and extend them for each new expansion source.
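The second option could be sketched as follows. This is a minimal illustration, not code from expand.py: the names ExpansionSource, DictExpansionSource and similar_terms() are hypothetical, and the dictionary-backed source only stands in for whatever new source is being added.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class ExpansionSource(ABC):
    """Hypothetical base class: anything that can suggest terms for a word."""

    @abstractmethod
    def similar_terms(self, word: str, n: int = 2) -> List[str]:
        """Return up to n expansion terms for word (empty list if unknown)."""


class DictExpansionSource(ExpansionSource):
    """Toy source backed by a plain dict, e.g. a hand-made thesaurus."""

    def __init__(self, table: Dict[str, List[str]]) -> None:
        self._table = table

    def similar_terms(self, word: str, n: int = 2) -> List[str]:
        # Unknown words yield an empty list instead of raising.
        return self._table.get(word, [])[:n]


source = DictExpansionSource({"car": ["automobile", "vehicle", "truck"]})
print(source.similar_terms("car"))      # ['automobile', 'vehicle']
print(source.similar_terms("unknown"))  # []
```

Under this design, a word2vec-backed subclass would presumably wrap model.wv.most_similar() and keep only the words from the (word, score) pairs it returns.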
New ModelRegistry
expand.py defines a ModelRegistry, which assumes word2vec models.
The ModelRegistry's purpose is to return a model, given a user identifier. Whatever is returned from ModelRegistry.get_model(<UID>) will be passed to Query.expand() and should provide a means to retrieve expansion terms.
In its current state, Query.expand() simply delegates to the Gensim model's most_similar() method:
    def expand(self, model: gensim.models.Word2Vec, n: int = 2):
        # [...]
        for word in query_words:
            try:
                similar = model.wv.most_similar(word, topn=n)
            except:
                # [...]
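A registry that is not tied to word2vec might look like the sketch below. SourceRegistry and register() are hypothetical names; only get_model() mirrors the existing method. Falling back to a default source for unknown users is an assumption, since the behaviour of the real ModelRegistry in that case is not described here.

```python
from typing import Dict, Generic, Optional, TypeVar

S = TypeVar("S")  # type of the expansion source; not restricted to word2vec


class SourceRegistry(Generic[S]):
    """Hypothetical registry mapping user identifiers to expansion sources."""

    def __init__(self, default: Optional[S] = None) -> None:
        self._sources: Dict[str, S] = {}
        self._default = default

    def register(self, uid: str, source: S) -> None:
        self._sources[uid] = source

    def get_model(self, uid: str) -> Optional[S]:
        # Same method name as ModelRegistry.get_model(), so callers need not change.
        # Assumption: unknown users fall back to the default source.
        return self._sources.get(uid, self._default)


# Plain dicts stand in for real expansion sources in this example.
registry: SourceRegistry[dict] = SourceRegistry(default={})
registry.register("user-1", {"car": ["automobile"]})
print(registry.get_model("user-1"))  # {'car': ['automobile']}
print(registry.get_model("user-2"))  # {}
```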
New Query
The Query class is responsible for managing queries. It captures the state a query is in (the query itself, the article ID, expanded or not), and provides a single method, expand(), to assemble the query terms.
expand() should consult whatever source is being used for additional terms. It can do so either for the full query string or for each word in it; the current approach is word by word.
After calling expand(), the new state should be stored: whether expansion was performed, as well as the expanded query.
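The state handling described above can be illustrated with a small dataclass. This is a sketch under assumptions, not the existing Query class: the field names, the lookup callable, and the choice to join terms with spaces are all made up for the example, and "expanded" is set only when at least one term was actually added.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Query:
    """Sketch of a query that records its own expansion state."""

    query: str
    article_id: Optional[str] = None
    expanded: bool = False
    expanded_query: Optional[str] = None

    def expand(self, lookup: Callable[[str, int], List[str]], n: int = 2) -> str:
        words = self.query.split()
        terms: List[str] = []
        for word in words:  # per-word expansion
            terms.append(word)
            for term in lookup(word, n):
                if term not in terms:  # skip suggestions already present
                    terms.append(term)
        # Store the new state: the expanded query and whether anything was added.
        self.expanded_query = " ".join(terms)
        self.expanded = len(terms) > len(words)
        return self.expanded_query


def toy_lookup(word: str, n: int) -> List[str]:
    """Hypothetical source: a fixed table instead of a trained model."""
    table = {"fast": ["quick", "rapid"], "car": ["automobile"]}
    return table.get(word, [])[:n]


q = Query("fast car", article_id="42")
print(q.expand(toy_lookup))  # fast quick rapid car automobile
print(q.expanded)            # True
```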
New Searcher
The current implementation of Searcher depends on the ModelRegistry, which may have been replaced by a different registry after following the steps above.
The Searcher takes a search spec, an instance of the Search wrapper, and the registry.
When Searcher.run() is invoked, the registry is consulted based on the search spec, and an actual search is submitted to the underlying engine.
The only obstacle to reusing the Searcher class for a new expansion source is its tight coupling to the ModelRegistry. Everything else should be reusable.
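One way to loosen that coupling is to have the Searcher depend only on the shape of the registry rather than on ModelRegistry itself. The sketch below uses typing.Protocol (Python 3.8+) for this. Everything in it is an assumption made for illustration: the spec is modelled as a dict with "uid" and "query" keys, the engine as a plain callable, and the source returned by the registry as a function from a word to a list of terms.

```python
from typing import Any, Callable, Dict, List, Protocol


class Registry(Protocol):
    """Anything with get_model(uid) qualifies; no word2vec dependency."""

    def get_model(self, uid: str) -> Any: ...


class Searcher:
    def __init__(self, spec: Dict[str, str],
                 engine: Callable[[str], List[str]],
                 registry: Registry) -> None:
        self.spec = spec
        self.engine = engine
        self.registry = registry

    def run(self) -> List[str]:
        # Consult the registry based on the spec, then submit the search.
        source = self.registry.get_model(self.spec["uid"])
        words = self.spec["query"].split()
        extra = [term for word in words for term in source(word)]
        return self.engine(" ".join(words + extra))


class StaticRegistry:
    """Toy registry that returns the same lookup function for every user."""

    def __init__(self, table: Dict[str, List[str]]) -> None:
        self._table = table

    def get_model(self, uid: str) -> Callable[[str], List[str]]:
        return lambda word: self._table.get(word, [])


def fake_engine(query: str) -> List[str]:
    # Stands in for the Search wrapper; it only echoes what it was asked.
    return [f"hit for: {query}"]


searcher = Searcher({"uid": "user-1", "query": "fast car"},
                    fake_engine,
                    StaticRegistry({"car": ["automobile"]}))
print(searcher.run())  # ['hit for: fast car automobile']
```

Because StaticRegistry never inherits from Registry, any existing or new registry with a compatible get_model() could be passed in without changing the Searcher.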