
Automated Testing of Lucene-Backed Search
The Problem
Suppose you build a website about the history of fasteners. It will be full of solid information about bolts, nails, nuts, pins, clips, rivets, rods, screws, and washers. It will have all the bells and whistles, and all content will be searchable and indexed with Lucene. Happy day, it’s a grand success. It is a chronicle of the bitter struggle between stove and carriage bolts. The internecine strife between Robertson and Phillips screw advocates rages across its pages. (A truce has been called while they unite against Torx, but it won’t last.)
But how do you know the search functions actually work? Writing a test that looks up “nail” and checks that at least one result is returned tells us the backing store is reachable, but not much more than that. The search results page shows the number of results, but how do you know that number is correct? The task is further complicated by stemming: a search for “nail” must also return documents containing “nailing”, “nailed”, and “nails”. Add ever-changing sets of boost and bury rules, and there is no practical way to keep test assertions aligned with the content.
A Solution
The solution we implemented to get around this problem goes like this:
- Run the Lucene indexer locally on the current content set.
- Read terms and their document counts directly from the Lucene index into a CSV file.
- Read the terms from the CSV during the test.
- Assert that the expected count matches what is returned by the application.
This strategy hinges on being able to run the classes that parse content into Lucene documents on their own, outside the main application. At the very least, you should be able to do this by providing some mock context classes. It is important to recognize that the indexer itself is assumed to be working correctly.
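What that standalone run looks like depends entirely on how the indexer is packaged, but in outline it is just an IndexWriter fed by the application's parsing classes. A minimal sketch, in which ContentParser and the XML file layout are hypothetical stand-ins for the real parsing classes and content format:

// Rough sketch only: ContentParser and the "xml" content layout are
// hypothetical stand-ins for the application's own parsing classes.
// Imports and error handling are omitted, as in the extractor below.
public class LocalIndexRunner {

    public static void main(String[] argv) throws Exception {
        File contentDir = new File(argv[0]);  // current content set
        File indexDir = new File(argv[1]);    // where the local index is written

        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_45, analyzer);
        IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), cfg);

        // Feed every content file through the same classes the application
        // uses to build Lucene documents, backed by mock context classes.
        for (File f : FileUtils.listFiles(contentDir, new String[] { "xml" }, true)) {
            Document doc = ContentParser.toLuceneDocument(f);
            writer.addDocument(doc);
        }
        writer.commit();
        writer.close();
    }
}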
Term Extractor
The term extractor works by opening the Lucene index, getting the terms for each document, and keeping a total of how many documents contain the given term. It also uses the spell check API to differentiate between stemmed and unstemmed terms. Short terms are discarded to save time and space.
See below for code.
The output looks like this and is easy to feed into a TestNG data provider:
parsitan      1   u
parsol        1   u
partial      34   u
particle     11   o
particular   36   o
particularly 82   o
partly        4   o
partner      12   u
The second column is the number of documents containing the term, and the third column shows whether the term is the unstemmed version or the original one. Depending on the search implementation in the application under test, you may want to use only the original values for searching; the unstemmed ones are useful for exercising “did you mean” queries.
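As a sketch of the consuming side, the data provider below reads that file and drives one test method per term. SearchClient is a hypothetical stand-in for whatever actually queries the application (an HTTP client, Selenium, and so on); the parsing matches the tab-separated columns written by the extractor below.

// Sketch of a TestNG consumer for all-terms.csv.
// SearchClient is hypothetical; everything else uses standard TestNG and JDK APIs.
public class SearchCountTest {

    @DataProvider(name = "terms")
    public Object[][] terms() throws IOException {
        List<Object[]> rows = new ArrayList<>();
        BufferedReader in = new BufferedReader(new FileReader("all-terms.csv"));
        String line;
        while ((line = in.readLine()) != null) {
            String[] cols = line.split("\t");
            // Keep only rows flagged as original terms ('o') for regular searches;
            // see the column description above.
            if (cols[2].charAt(0) == 'o') {
                rows.add(new Object[] { cols[0], Integer.parseInt(cols[1]) });
            }
        }
        in.close();
        return rows.toArray(new Object[rows.size()][]);
    }

    @Test(dataProvider = "terms")
    public void searchCountMatchesIndex(String term, int expectedCount) {
        // SearchClient stands in for whatever drives the application's search page.
        int actual = SearchClient.resultCountFor(term);
        Assert.assertEquals(actual, expectedCount, "Result count for '" + term + "'");
    }
}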
To be fair, another way to do this type of testing would be to load Lucene with reams of dummy documents. These documents would remain outside the normal publication channel, and their contents would be fully known, making it theoretically possible to infer what the correct search result ought to be.
This approach has two problems: segregating these documents from normal users, and keeping pace with search tuning. The former can be handled with magic parameters in URLs and does not impose much overhead on the application. The latter is next to impossible to keep current when any kind of search tuning is in effect.
// Housekeeping and checking omitted.
public class TermExtractor {

    public static final String[] TARGET_FIELDS = { "title", "content" };
    public static final String LANG_FIELD = "lang";
    public static final String TYPE_FIELD = "type";
    public static final char ORIGINAL_TERM = 'o';
    public static final char UNSTEMMED_TERM = 'u';
    // Don't save info for terms less than MIN_TERM_LENGTH long.
    public static final int MIN_TERM_LENGTH = 5;

    private SpellChecker spellChecker;

    public static final void main(String argv[]) throws Exception {
        String indexPath = argv[0];
        File indexDir = new File(indexPath);
        TermExtractor extractor = new TermExtractor();
        extractor.extract(indexDir);
    }

    public void extract(File indexDir) throws Exception {
        Map<String, List<String>> docIdByTerm = new HashMap<>();
        Set<String> unstemmedTerms = new HashSet<>();

        FSDirectory d = FSDirectory.open(indexDir);
        DirectoryReader reader = DirectoryReader.open(d);
        IndexSearcher searcher = new IndexSearcher(reader);
        StringBuilder msgBuff = new StringBuilder();

        // Set up dictionary for the given language
        File tmpDictDir = buildSpellChecker(reader, "en");

        // Build query for the document type and language
        // common to all the docs we want to search later
        Term typeTerm = new Term(TYPE_FIELD, "CHAPTER");
        Term langTerm = new Term(LANG_FIELD, "en");
        BooleanQuery bq = new BooleanQuery();
        bq.add(new TermQuery(typeTerm), BooleanClause.Occur.MUST);
        bq.add(new TermQuery(langTerm), BooleanClause.Occur.MUST);

        TotalHitCountCollector collector = new TotalHitCountCollector();
        searcher.search(bq, collector);
        System.out.println("Documents found: " + collector.getTotalHits());

        TopDocs td = searcher.search(bq, Math.max(1, collector.getTotalHits()));
        ScoreDoc[] scores = td.scoreDocs;
        int total = 0;

        System.out.print("DOC-ID\t");
        for (String tmp : TARGET_FIELDS) {
            System.out.print(tmp);
            System.out.print('\t');
        }
        System.out.println();

        for (int i = 0; i < scores.length; i++) {
            ScoreDoc score = scores[i];
            Document doc = searcher.doc(score.doc);
            Fields docFields = reader.getTermVectors(score.doc);

            // For feedback during processing
            msgBuff.setLength(0);
            msgBuff.append(score.doc);
            msgBuff.append('\t');

            for (String fieldName : TARGET_FIELDS) {
                int subTotal = buildTermMap(docFields, docIdByTerm, unstemmedTerms, doc, fieldName);
                total += subTotal;
                msgBuff.append(subTotal);
                msgBuff.append('\t');
            }
            System.out.println(msgBuff);
        }

        System.out.println("Found " + total + " terms.");
        write(docIdByTerm, unstemmedTerms);
        docIdByTerm.clear();
        unstemmedTerms.clear();
    }

    private File buildSpellChecker(IndexReader reader, String targetLang) throws IOException {
        LuceneDictionary dict = new LuceneDictionary(reader, "dictionary_" + targetLang);
        Analyzer dictAnalyzer = new StandardAnalyzer(Version.LUCENE_45);
        IndexWriterConfig idxCfg = new IndexWriterConfig(Version.LUCENE_45, dictAnalyzer);
        File tmpDir = new File(FileUtils.getTempDirectory(),
                "dict-" + System.currentTimeMillis() + "-" + targetLang);
        tmpDir.deleteOnExit();
        FSDirectory dictIdx = FSDirectory.open(tmpDir);
        this.spellChecker = new SpellChecker(dictIdx);
        this.spellChecker.indexDictionary(dict, idxCfg, false);
        System.out.println("Temp dictionary index in: " + tmpDir.getAbsolutePath());
        return tmpDir;
    }

    private static void write(Map<String, List<String>> uuidByTerm, Set<String> unstemmedTerms)
            throws IOException {
        File outFile = new File("all-terms.csv");
        System.out.println("Writing " + outFile.getAbsolutePath());
        Iterator<String> keyItr = uuidByTerm.keySet().iterator();
        BufferedWriter writer = new BufferedWriter(new FileWriter(outFile));
        while (keyItr.hasNext()) {
            String key = keyItr.next();
            List<String> uuids = uuidByTerm.get(key);
            writer.write(key);
            writer.write('\t');
            writer.write(Integer.toString(uuids.size()));
            writer.write('\t');
            if (unstemmedTerms.contains(key)) {
                writer.write(ORIGINAL_TERM);
            } else {
                writer.write(UNSTEMMED_TERM);
            }
            writer.write('\n');
            writer.flush();
        }
        writer.close();
    }

    private int buildTermMap(Fields ff, Map<String, List<String>> docIdByTerm,
            Set<String> unstemmedTerms, Document doc, String field) throws Exception {
        int docHitTotal = 0;
        if (ff == null) {
            return 0;
        }
        Terms tt = ff.terms(field);
        if (tt == null) {
            return 0;
        }
        TermsEnum termsEnum = tt.iterator(TermsEnum.EMPTY);
        BytesRef br = null;

        // Get the identifier for the document.
        // This is a field, NOT the Lucene doc number
        String docId = doc.get("UUID");

        while ((br = termsEnum.next()) != null) {
            String term = br.utf8ToString();
            // Discard really short terms
            if (term == null || term.length() < MIN_TERM_LENGTH) {
                continue;
            }
            // Has this term been unstemmed?
            // Check if it is a suggestion and
            // keep a list of terms that have been modified
            if (!this.spellChecker.exist(term)) {
                String[] suggestions = this.spellChecker.suggestSimilar(term, 1);
                if (suggestions != null && suggestions.length >= 1) {
                    String unstemmed = suggestions[0];
                    term = unstemmed;
                    // Keep track of modified terms
                    unstemmedTerms.add(term);
                }
            }
            List<String> docIdList = docIdByTerm.get(term);
            if (docIdList == null) {
                docIdList = new ArrayList<>();
                docIdByTerm.put(term, docIdList);
            }
            if (!docIdList.contains(docId)) {
                docIdList.add(docId);
                Collections.sort(docIdList);
                docHitTotal++;
            }
        }
        return docHitTotal;
    }
}
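To produce the CSV, point the extractor at the locally built index, for example java TermExtractor /path/to/local-index. It prints per-document term counts as it runs and writes all-terms.csv to the working directory, ready for the data provider shown earlier.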