{"id":544,"date":"2015-08-21T13:58:52","date_gmt":"2015-08-21T17:58:52","guid":{"rendered":"http:\/\/pmcgovern.ca\/wp\/?p=544"},"modified":"2021-12-12T12:37:12","modified_gmt":"2021-12-12T17:37:12","slug":"automated-testing-of-lucene-backed-search","status":"publish","type":"post","link":"https:\/\/pmcgovern.ca\/wp\/?p=544","title":{"rendered":"Automated Testing of Lucene-Backed Search"},"content":{"rendered":"<p><strong>The Problem<\/strong><\/p>\n<p>Suppose you build a website about the history of fasteners. It will be full of solid information about bolts, nails, nuts, pins, clips, rivets, rods, screws, and washers. It will have all the bells and whistles, and all content will be searchable and indexed with Lucene. Happy day, it&#8217;s a grand success. It is a chronicle of the bitter struggle between stove and carriage bolts. The internecine strife between Robertson and Phillips screw advocates rages across its pages. (A truce has been called while they unite against Torx, but it won&#8217;t last.)<\/p>\n<p>But how do you know the search functions actually work? Writing a test to look up &#8220;nail&#8221; and check that at least one result is returned tells us the backing store is reachable, but not much more than that. The search results page will contain the number of results, but how do you know it is correct? The task is further complicated by stemming: a search for &#8220;nail&#8221; must also return documents containing &#8220;nailing&#8221;, &#8220;nailed&#8221;, and &#8220;nails&#8221;. 
Add to this an ever-changing set of boost and bury rules, and there is no practical way to keep test assertions aligned with content.<\/p>\n<p><strong>A Solution<\/strong><\/p>\n<p>The solution we implemented to get around this problem goes like this:<\/p>\n<ol>\n<li>Run the Lucene indexer locally on the current content set.<\/li>\n<li>Read terms and their counts <strong>directly<\/strong> from the Lucene index to CSV.<\/li>\n<li>Read terms from the CSV during the test.<\/li>\n<li>Assert the expected count matches what is returned by the application.<\/li>\n<\/ol>\n<p>This strategy hinges on being able to run the classes that parse content into Lucene documents on their own, outside the main application. At the very least, it should be possible to do this by providing some mock context classes. It is important to recognize that the indexer is assumed to be working correctly.<\/p>\n<p><strong>Term Extractor<\/strong><\/p>\n<p>The term extractor works by opening the Lucene index, getting the terms for each document, and keeping a total of how many documents contain the given term. It also uses the spell check API to differentiate between stemmed and unstemmed terms. Short terms are discarded to save time and space.<\/p>\n<p>See below for code.<\/p>\n<p>The output looks like this and is easy to feed into a TestNG data provider:<\/p>\n<pre>parsitan\t1\tu\nparsol\t1\tu\npartial\t34\tu\nparticle\t11\to\nparticular\t36\to\nparticularly\t82\to\npartly\t4\to\npartner\t12\tu\n<\/pre>\n<p>The second column is the number of documents containing the term, and the third column shows whether the term is the unstemmed version or the original one. Depending on the search implementation in the application under test, you may want to use only original values for searching. 
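Unstemmed ones are useful for searching &#8220;did you mean&#8221; queries.<\/p>\n<p>To be fair, another way to do this type of testing would be to load Lucene with reams of dummy documents. These documents would remain outside the normal publication channel and their contents would be fully known, making it theoretically possible to infer what the correct search result ought to be.<\/p>\n<p>This alternative has two problems: segregating these documents from normal users, and keeping pace with search tuning. The former can be done with magic parameters in URLs and does not impose much overhead on the application. The latter is next to impossible to keep current when any kind of search tuning is in effect.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"java\">\/\/ Housekeeping and checking omitted.\n\/\/ Written against the Lucene 4.x API (Version.LUCENE_45) and Commons IO.\nimport java.io.BufferedWriter;\nimport java.io.File;\nimport java.io.FileWriter;\nimport java.io.IOException;\nimport java.util.ArrayList;\nimport java.util.Collections;\nimport java.util.HashMap;\nimport java.util.HashSet;\nimport java.util.Iterator;\nimport java.util.List;\nimport java.util.Map;\nimport java.util.Set;\n\nimport org.apache.commons.io.FileUtils;\nimport org.apache.lucene.analysis.Analyzer;\nimport org.apache.lucene.analysis.standard.StandardAnalyzer;\nimport org.apache.lucene.document.Document;\nimport org.apache.lucene.index.DirectoryReader;\nimport org.apache.lucene.index.Fields;\nimport org.apache.lucene.index.IndexReader;\nimport org.apache.lucene.index.IndexWriterConfig;\nimport org.apache.lucene.index.Term;\nimport org.apache.lucene.index.Terms;\nimport org.apache.lucene.index.TermsEnum;\nimport org.apache.lucene.search.BooleanClause;\nimport org.apache.lucene.search.BooleanQuery;\nimport org.apache.lucene.search.IndexSearcher;\nimport org.apache.lucene.search.ScoreDoc;\nimport org.apache.lucene.search.TermQuery;\nimport org.apache.lucene.search.TopDocs;\nimport org.apache.lucene.search.TotalHitCountCollector;\nimport org.apache.lucene.search.spell.LuceneDictionary;\nimport org.apache.lucene.search.spell.SpellChecker;\nimport org.apache.lucene.store.FSDirectory;\nimport org.apache.lucene.util.BytesRef;\nimport org.apache.lucene.util.Version;\n\npublic class TermExtractor {\n\npublic static final String[] TARGET_FIELDS = { \"title\", \"content\" };\npublic static final String LANG_FIELD = \"lang\";\npublic static final String TYPE_FIELD = \"type\";\n\npublic static final char ORIGINAL_TERM = 'o';\npublic static final char UNSTEMMED_TERM = 'u';\n\n\/\/ Don't save info for terms less than MIN_TERM_LENGTH long.\npublic static final int MIN_TERM_LENGTH = 5;\n\nprivate SpellChecker spellChecker;\n\npublic static final void main(String argv[]) throws Exception {\n\nString  indexPath = argv[ 0 ];\n\nFile indexDir = new File(indexPath);\n\nTermExtractor extractor = new TermExtractor();\nextractor.extract(indexDir);\n}\n\npublic void extract(File indexDir) throws Exception {\n\n\/\/ Term -&gt; identifiers of the documents that contain it\nMap&lt;String, List&lt;String&gt;&gt; docIdByTerm = new HashMap&lt;&gt;();\n\n\/\/ Terms that were replaced by a spell checker suggestion\nSet&lt;String&gt; unstemmedTerms = new HashSet&lt;&gt;();\n\nFSDirectory d = FSDirectory.open(indexDir);\n\nDirectoryReader reader = DirectoryReader.open(d);\n\nIndexSearcher searcher = new IndexSearcher(reader);\n\nStringBuilder msgBuff = new StringBuilder();\n\n\/\/ Set up dictionary for given language\nFile tmpDictDir = 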
buildSpellChecker(reader, \"en\");\n\n\/\/ Build query for document type and language\n\/\/ common to all the docs we want to search later\nTerm typeTerm = new Term(TYPE_FIELD, \"CHAPTER\");\nTerm langTerm = new Term(LANG_FIELD, \"en\");\n\nBooleanQuery bq = new BooleanQuery();\nbq.add(new TermQuery(typeTerm), BooleanClause.Occur.MUST);\nbq.add(new TermQuery(langTerm), BooleanClause.Occur.MUST);\n\nTotalHitCountCollector collector = new TotalHitCountCollector();\nsearcher.search(bq, collector);\n\nSystem.out.println( \"Documents found:\" + collector.getTotalHits());\n\nTopDocs td = searcher.search(bq, Math.max(1, collector.getTotalHits()));\n\nScoreDoc[] scores = td.scoreDocs;\nint total = 0;\n\nSystem.out.print(  \"DOC-ID\\t\" );\nfor( String tmp : TARGET_FIELDS ) {\nSystem.out.print( tmp );\nSystem.out.print( '\\t' );\n}\nSystem.out.println();\n\nfor (int i = 0; i &lt; scores.length; i++) {\n\n      ScoreDoc score = scores[ i ];\n\n      Document doc = searcher.doc(score.doc);\n\n      Fields docFields = reader.getTermVectors(score.doc);\n\n      \/\/ For feedback during processing\n      msgBuff.setLength(0);    \n      msgBuff.append( score.doc );\n      msgBuff.append('\\t');\n\n      for (String fieldName : TARGET_FIELDS) {\n\n        int subTotal = buildTermMap(docFields, docIdByTerm, unstemmedTerms, doc, fieldName );\n        total += subTotal;\n\n        msgBuff.append(subTotal);\n        msgBuff.append('\\t');\n      }\n\n      System.out.println(msgBuff);\n    }\n\n    System.out.println(\"Found \" + total + \" terms.\");\n\n    write(docIdByTerm, unstemmedTerms);\n\n    docIdByTerm.clear();\n    unstemmedTerms.clear();\n\n  }\n\n  private File buildSpellChecker(IndexReader reader, String targetLang) throws IOException {\n\n    LuceneDictionary dict = new LuceneDictionary(reader, \"dictionary_\" + targetLang);\n    Analyzer dictAnalyzer = new StandardAnalyzer(Version.LUCENE_45);\n    IndexWriterConfig idxCfg = new IndexWriterConfig(Version.LUCENE_45, 
dictAnalyzer);\n\n    File tmpDir = new File(FileUtils.getTempDirectory(), \"dict-\" + System.currentTimeMillis() + \"-\" + targetLang);\n\n    tmpDir.deleteOnExit();\n\n    FSDirectory dictIdx = FSDirectory.open(tmpDir);\n\n    this.spellChecker = new SpellChecker(dictIdx);\n    this.spellChecker.indexDictionary(dict, idxCfg, false);\n\n    System.out.println(\"Temp dictionary index in: \" + tmpDir.getAbsolutePath());\n\n    return tmpDir;\n  }\n\n  private static void write(Map&lt;String, List&lt;String&gt;&gt; uuidByTerm, Set&lt;String&gt; unstemmedTerms) throws IOException {\n\nFile outFile = new File(\"all-terms.csv\");\n\nSystem.out.println(\"Writing \" + outFile.getAbsolutePath());\n\nIterator&lt;String&gt; keyItr = uuidByTerm.keySet().iterator();\nBufferedWriter writer = new BufferedWriter(new FileWriter(outFile));\n\nwhile (keyItr.hasNext()) {\nString key = keyItr.next();\nList&lt;String&gt; uuids = uuidByTerm.get(key);\nwriter.write(key);\nwriter.write('\\t');\nwriter.write(Integer.toString(uuids.size()));\nwriter.write('\\t');\n\nif (unstemmedTerms.contains(key)) {\nwriter.write(ORIGINAL_TERM);\n} else {\nwriter.write(UNSTEMMED_TERM);\n}\n\nwriter.write('\\n');\nwriter.flush();\n}\nwriter.close();\n}\n\nprivate int buildTermMap(Fields ff, Map&lt;String, List&lt;String&gt;&gt; docIdByTerm, Set&lt;String&gt; unstemmedTerms, Document doc, String field) throws Exception {\nint docHitTotal = 0;\n\nif (ff == null) {\nreturn 0;\n}\nTerms tt = ff.terms(field);\n\nif (tt == null) {\nreturn 0;\n}\n\nTermsEnum termsEnum = tt.iterator(TermsEnum.EMPTY);\n\nBytesRef br = null;\n\n\/\/ Get the identifier for the document.\n\/\/ This is a field, NOT the Lucene doc number\nString docId = doc.get( \"UUID\" );\n\nwhile ((br = termsEnum.next()) != null) {\n\nString term = br.utf8ToString();\n\n\/\/ Discard really short terms\nif (term == null || term.length() &lt; MIN_TERM_LENGTH) {\n        continue;\n      }\n\n      \/\/ Has this term been 
unstemmed?\n      \/\/ Check if it is a suggestion and\n      \/\/ keep a list of terms that have been modified\n      if (!this.spellChecker.exist(term)) {\n\n        String[] suggestions = this.spellChecker.suggestSimilar(term, 1);\n\n        if (suggestions != null &amp;&amp; suggestions.length &gt;= 1) {\nString unstemmed = suggestions[ 0 ];\nterm = unstemmed;\n\n\/\/ Keep track of modified terms\nunstemmedTerms.add(term);\n}\n}\n\nList&lt;String&gt; docIdList = docIdByTerm.get( term );\n\nif (docIdList == null) {\ndocIdList = new ArrayList&lt;&gt;();\ndocIdByTerm.put(term, docIdList);\n}\n\nif (!docIdList.contains(docId)) {\ndocIdList.add(docId);\nCollections.sort(docIdList);\ndocHitTotal++;\n}\n}\n\nreturn docHitTotal;\n}\n\n}\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>The Problem Suppose you build a website about the history of fasteners. It will be full of solid information about bolts, nails, nuts, pins, clips, rivets, rods, screws, and washers. It will have all the bells and whistles, and 
all&#8230;<\/p>\n","protected":false},"author":1,"featured_media":547,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[11],"tags":[6,7],"class_list":["post-544","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-programming","tag-selenium","tag-testng"],"_links":{"self":[{"href":"https:\/\/pmcgovern.ca\/wp\/index.php?rest_route=\/wp\/v2\/posts\/544","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/pmcgovern.ca\/wp\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/pmcgovern.ca\/wp\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/pmcgovern.ca\/wp\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/pmcgovern.ca\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=544"}],"version-history":[{"count":25,"href":"https:\/\/pmcgovern.ca\/wp\/index.php?rest_route=\/wp\/v2\/posts\/544\/revisions"}],"predecessor-version":[{"id":879,"href":"https:\/\/pmcgovern.ca\/wp\/index.php?rest_route=\/wp\/v2\/posts\/544\/revisions\/879"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/pmcgovern.ca\/wp\/index.php?rest_route=\/wp\/v2\/media\/547"}],"wp:attachment":[{"href":"https:\/\/pmcgovern.ca\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=544"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/pmcgovern.ca\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=544"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/pmcgovern.ca\/wp\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=544"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}