Dec
11
Introduction to SOLR “Enterprise Search”
Filed Under Lucene
SOLR bills itself as an open source enterprise search engine. I would not go so far as to call it “enterprise,” but I certainly believe SOLR is a nice value-added wrapper around the already powerful Lucene package. It may become “enterprise class,” but it has a way to go. That being said, it’s certainly good to aim high!
Raw Lucene search engines have a few issues that SOLR addresses very nicely, including:
- Platform Lock-in (Java)
- Indexing requires custom Java coding
- Existing document update issues
- Index Replication
- IndexSearcher warming (fake automated queries to prepopulate the cache)
SOLR runs as a Java web application. The nightly version of SOLR that I downloaded came bundled with Jetty, which makes it very easy to run, test, and even deploy into a light-duty production role.
Instead of being a Java library that you use directly from your Java code, SOLR works as a web application: you POST documents to it via HTTP to index them, and you query it with HTTP GET requests. This HTTP interaction means your application does not need to be written in Java. It can be in any language that can post and request data via HTTP.
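As a rough sketch of that HTTP interaction, the Java below builds an XML add message and a query URL, and includes a helper that POSTs a body with HttpURLConnection. The base URL http://localhost:8983/solr, the /update and /select paths, and the field names id and title are assumptions modeled on a default example install, so check them against your own setup before relying on them.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

class SolrHttpSketch {
    // Assumed base URL of a local example install; adjust for your server.
    static final String BASE = "http://localhost:8983/solr";

    // Escape the characters that are special inside XML text.
    static String xmlEscape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Build an <add> message for one document with the hypothetical fields "id" and "title".
    static String buildAddXml(String id, String title) {
        return "<add><doc>"
            + "<field name=\"id\">" + xmlEscape(id) + "</field>"
            + "<field name=\"title\">" + xmlEscape(title) + "</field>"
            + "</doc></add>";
    }

    // Build the GET URL for a query; the query text is URL-encoded.
    static String buildSelectUrl(String query) {
        try {
            return BASE + "/select?q=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is not available", e);
        }
    }

    // POST an XML body to the given URL and return the HTTP status code.
    static int postXml(String url, String xml) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        OutputStream out = conn.getOutputStream();
        try {
            out.write(xml.getBytes("UTF-8"));
        } finally {
            out.close();
        }
        return conn.getResponseCode();
    }

    public static void main(String[] args) {
        // Only prints what would be sent; no server is contacted here.
        System.out.println(buildAddXml("1", "Moonlight Sonata"));
        System.out.println(buildSelectUrl("moonlight sonata"));
    }
}
```

Against a live server you would call postXml(BASE + "/update", buildAddXml(...)) and then fetch the select URL. In my understanding, newly added documents only become searchable after a commit message is posted to the same update URL, so treat that as a step to verify on your install.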
When it comes to setting up your index, you do not have to code your field information into Java. It is set up in an XML file instead. This file, among other things, includes the list of document fields and the document primary key needed to maintain your index.
If you are familiar with Lucene, my mention of the primary key may have piqued your interest. For those not familiar with Lucene, it does not have any notion of primary keys or document updates. To update a document in Lucene, you first need to locate and delete the previous version of the document through a rather indirect process, then add the new version. When provided with a primary key, SOLR will handle that process automatically for you.
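To make the difference concrete, here is a toy in-memory model of the two behaviors. It is not the Lucene or SOLR API: the class KeyedIndexSketch and its methods are made up for illustration, and the "index" is just a list of field maps. A plain add happily stores duplicates, while a keyed update deletes every older copy before adding the new one.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Toy stand-in for an index: a list of field maps. This is NOT the Lucene API.
class KeyedIndexSketch {
    private final List<Map<String, String>> docs = new ArrayList<Map<String, String>>();
    private final String keyField;

    KeyedIndexSketch(String keyField) {
        this.keyField = keyField;
    }

    // Plain add: nothing stops two documents from sharing the same key value.
    void add(Map<String, String> doc) {
        docs.add(doc);
    }

    // Keyed update: delete every document with the same key, then add the new one.
    void update(Map<String, String> doc) {
        String key = doc.get(keyField);
        for (Iterator<Map<String, String>> it = docs.iterator(); it.hasNext();) {
            if (key != null && key.equals(it.next().get(keyField))) {
                it.remove();
            }
        }
        docs.add(doc);
    }

    int size() {
        return docs.size();
    }

    // Convenience builder for a document with hypothetical "id" and "title" fields.
    static Map<String, String> doc(String id, String title) {
        Map<String, String> m = new HashMap<String, String>();
        m.put("id", id);
        m.put("title", title);
        return m;
    }

    public static void main(String[] args) {
        KeyedIndexSketch index = new KeyedIndexSketch("id");
        index.add(doc("1", "First draft"));
        index.add(doc("1", "Second draft"));     // duplicate: two copies of id 1
        System.out.println(index.size());        // prints 2
        index.update(doc("1", "Final version")); // removes both copies, adds one
        System.out.println(index.size());        // prints 1
    }
}
```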
The SOLR example that is included with the nightly build includes a fairly simple script and sample documents to show how indexing works. Run the example to index the samples, then run searches through the admin interface. To use the search results in your own application, query the same URL that shows up in the admin interface and parse the XML response.
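Parsing the response can be done with the XML parser that ships with Java. The sketch below pulls one named field out of every returned document. The sample response string is hand-written for illustration, and its layout (doc elements containing str elements with a name attribute) is my assumption about the format, so compare it with what your own server actually returns.

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class SolrResponseSketch {
    // Illustrative response body; a real response carries more fields.
    static final String SAMPLE =
        "<response><result name=\"response\" numFound=\"2\" start=\"0\">"
        + "<doc><str name=\"id\">1</str><str name=\"title\">Moonlight Sonata</str></doc>"
        + "<doc><str name=\"id\">2</str><str name=\"title\">Clair de Lune</str></doc>"
        + "</result></response>";

    // Return the text of the named <str> field from every <doc> element.
    static List<String> fieldValues(String xml, String fieldName) {
        try {
            Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
            List<String> values = new ArrayList<String>();
            NodeList docs = dom.getElementsByTagName("doc");
            for (int i = 0; i < docs.getLength(); i++) {
                NodeList strs = ((Element) docs.item(i)).getElementsByTagName("str");
                for (int j = 0; j < strs.getLength(); j++) {
                    Element str = (Element) strs.item(j);
                    if (fieldName.equals(str.getAttribute("name"))) {
                        values.add(str.getTextContent());
                    }
                }
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse response XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fieldValues(SAMPLE, "title")); // prints [Moonlight Sonata, Clair de Lune]
    }
}
```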
As an initial experiment, I have used SOLR to index product information on my sheet music site. It’s used for the sheet music search engine as well as the related products query. My search engine implementation is really only “quick hack” quality (I have not implemented next/previous page links yet), but the related products part is more polished.
Dec
11
High Performance Lucene Indexing
Filed Under Lucene
In some instances, indexing into RAM rather than directly to disk can give a large increase in indexing performance. Here’s one way to do it. You may need to increase the JVM memory parameters with the arguments -Xms128M -Xmx256M, of course modifying the sizes to fit your needs. Tweaking foldCount will affect how much memory is required, because it sets how large the RAMDirectory is allowed to grow in terms of the number of Lucene Documents it can hold. Each time foldCount is reached, and again when indexing is complete, the index is flushed to disk.
Lucene Example Code: RAM to Disk
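// Assumes rs (a JDBC ResultSet), analyzer, and indexDir are defined elsewhere.
int foldCount = 500000; // documents to hold in RAM before folding to disk
int indexSize = 0;      // documents currently in the RAM index
int count = 0;          // total documents indexed
try {
    RAMDirectory ramDir = new RAMDirectory();
    IndexWriter ramWriter = new IndexWriter(ramDir, analyzer, true);
    IndexWriter writer = new IndexWriter(indexDir, analyzer, true);
    writer.mergeFactor = 100000;
    while (rs.next()) { // rs.next() throws SQLException, which also needs handling
        Document doc = new Document();
        // populate doc from the current row here
        ramWriter.addDocument(doc);
        count++;
        indexSize++;
        if (indexSize == foldCount) {
            foldToDisk(ramDir, ramWriter, writer);
            // create = true starts a fresh, empty RAM index
            ramWriter = new IndexWriter(ramDir, analyzer, true);
            indexSize = 0;
        }
    }
    // fold whatever is left in RAM once indexing is complete
    foldToDisk(ramDir, ramWriter, writer);
    writer.optimize();
    writer.close();
} catch (IOException e) {
    e.printStackTrace();
}

public static void foldToDisk(RAMDirectory ramDir, IndexWriter ramWriter, IndexWriter writer) throws IOException {
    ramWriter.close();
    Directory dirA[] = new Directory[1];
    dirA[0] = ramDir;
    System.out.print(".");
    // mergeDirs is a helper that is not shown in this post
    mergeDirs(writer, dirA);
    System.out.println(".");
}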
Dec
11
Multi-Field Lucene Example
Filed Under Lucene
Storing and searching more than one field is very easy to do in Lucene. This can make your Lucene search engine much more powerful!
Tip: If you’re not already familiar with how to index and search single field documents, this is intended to build on our Simple Lucene Example.
Lucene Example Code: Multi-Field Documents
Create a Lucene document with more than one field
String content = "This is the example text I want to have Lucene index";
Document doc = new Document();
doc.add(Field.Keyword("keyword","Java"));
doc.add(Field.Text("title","My Document Title"));
doc.add(Field.Text("content",content));
You would then add the document to the index like normal.
Create a Lucene MultiFieldQueryParser
String fields[] = {"keyword","title","content"};
String queryString = "Java";
Query query = null; // declared outside the try block so it can be used for the search
try {
    query = MultiFieldQueryParser.parse(queryString, fields, new StandardAnalyzer());
} catch (ParseException e) {
    System.out.println("Lucene ParseException: " + e.getMessage());
    e.printStackTrace();
}
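Conceptually, parsing a single term against several fields behaves roughly like searching for that term in each field and accepting a match in any of them. The small sketch below only illustrates that idea by writing out the equivalent field-prefixed query string. The class and method names are made up, it handles a single term only, and it is not how MultiFieldQueryParser is implemented.

```java
class MultiFieldExpansionSketch {
    // Prefix the term with each field name: "field1:term field2:term ...".
    static String expand(String term, String[] fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(fields[i]).append(':').append(term);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] fields = {"keyword", "title", "content"};
        System.out.println(expand("Java", fields)); // prints keyword:Java title:Java content:Java
    }
}
```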
Read the additional fields from the returned hits
int hitCount = hits.length();
for(int i=0; (i < hitCount && i < 10); i++){
Document doc = hits.doc(i);
System.out.println(doc.get("keyword") + ", " + doc.get("title") + ", " + doc.get("content"));
}
That’s it!
That’s all you need to do to step up from single-field to multi-field Lucene documents and searching.
Dec
11
Simple Lucene Example
Filed Under Lucene
Lucene is a great core for a Java search engine. Here is simple Lucene example code to index single-field data, along with a very basic search function. Together they make a simple Java search engine. In this example code, each block catches its own exceptions so you can see what is thrown. In a real-world Lucene implementation, you may handle this differently.
Lucene Example Code: Steps to Index the data
- Create a new Lucene index using an IndexWriter
- Create a Lucene Document
- Add the Lucene document to the index
- Optimize and close the index
Create a new Lucene index using an IndexWriter
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

String indexPath = "/path/to/whereYou/wantThe/IndexStored";
IndexWriter writer = null;
try {
    // Make a Lucene writer; the third argument (true) creates a new index
    writer = new IndexWriter(indexPath, new StandardAnalyzer(), true);
} catch (IOException e) {
    System.out.println("IOException opening Lucene IndexWriter: " + e.getMessage());
}
Create a Lucene document
String content = "This is the example text I want to have Lucene index";
Document doc = new Document();
doc.add(Field.Text("content", content));
Add the document to the index
try {
    writer.addDocument(doc);
} catch (IOException e) {
    System.out.println("IOException adding Lucene Document: " + e.getMessage());
}
Optimize and close the IndexWriter
try {
    writer.optimize();
    writer.close();
} catch (IOException e) {
    System.out.println("IOException closing Lucene IndexWriter: " + e.getMessage());
}
Lucene Example Code: Steps to Search the Lucene Index
Open a Lucene IndexSearcher
IndexSearcher indexSearcher = new IndexSearcher(indexPath);
If you are using the Lucene search engine from a web page, you should store and reuse the same IndexSearcher for each query. The Lucene IndexSearcher caches information to make queries after the first one faster. Reusing the IndexSearcher also goes easy on the Java garbage collector, which improves performance and memory utilization. Not reusing the IndexSearcher is a common mistake and a cause of frustration for many first-time Lucene users. For use on the web, here is some simple JSP code to store the IndexSearcher in an application attribute and reuse it for future page loads.
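// Look for a searcher stored by an earlier page load
IndexSearcher indexSearcher = (IndexSearcher) application.getAttribute("searcher");
if (indexSearcher == null) {
    // First request: open the searcher once and keep it for later requests
    indexSearcher = new IndexSearcher(indexPath);
    application.setAttribute("searcher", indexSearcher);
}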
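One thing the JSP snippet does not guard against is two first requests arriving at the same moment and each opening its own searcher. Outside of JSP, the same store-and-reuse idea can be sketched as a small synchronized holder. The class name SharedInstanceSketch is made up, and the factory here returns a plain string as a stand-in for the expensive step of opening the index.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

class SharedInstanceSketch<T> {
    private final Callable<T> factory;
    private T instance;

    SharedInstanceSketch(Callable<T> factory) {
        this.factory = factory;
    }

    // synchronized so two simultaneous first requests cannot both run the factory
    synchronized T get() {
        if (instance == null) {
            try {
                instance = factory.call();
            } catch (Exception e) {
                throw new IllegalStateException("Could not create shared instance", e);
            }
        }
        return instance;
    }

    public static void main(String[] args) {
        final AtomicInteger opened = new AtomicInteger();
        SharedInstanceSketch<String> holder = new SharedInstanceSketch<String>(new Callable<String>() {
            public String call() {
                opened.incrementAndGet(); // stands in for the expensive "open the index" step
                return "searcher";
            }
        });
        holder.get();
        holder.get();
        System.out.println(opened.get()); // prints 1
    }
}
```

With Lucene on the classpath, the factory would presumably wrap the new IndexSearcher(indexPath) call from the snippet above.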
Construct a Lucene Query
String queryString = "example";
Query query = null; // declared outside the try block so the search step can use it
try {
    query = QueryParser.parse(queryString, "content", new StandardAnalyzer());
} catch (ParseException e) {
    System.out.println("Lucene ParseException: " + e.getMessage());
    e.printStackTrace();
}
Have Lucene perform the Search
Hits hits = null;
try {
    // assign to the variable declared above rather than redeclaring it
    hits = indexSearcher.search(query);
} catch (IOException e) {
    System.out.println("Lucene Searching Exception: " + e.getMessage());
}
Display the top Lucene Hits
int hitCount = hits.length();
for (int i = 0; (i < hitCount && i < 10); i++) {
    Document doc = hits.doc(i);
    System.out.println(doc.get("content"));
}
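The loop above only ever shows the first ten hits. If you want next/previous page links, the loop bounds can be computed per page instead. This is a minimal sketch of that arithmetic with made-up helper names and zero-based page numbers. It does not touch Lucene at all, so the hit count is just an int.

```java
class HitPagingSketch {
    // Index of the first hit on a zero-based page, clamped to the hit count.
    static int pageStart(int page, int pageSize, int hitCount) {
        return Math.min(page * pageSize, hitCount);
    }

    // One past the index of the last hit on the page.
    static int pageEnd(int page, int pageSize, int hitCount) {
        return Math.min(pageStart(page, pageSize, hitCount) + pageSize, hitCount);
    }

    // True when there are hits beyond this page, so a "next" link makes sense.
    static boolean hasNext(int page, int pageSize, int hitCount) {
        return pageEnd(page, pageSize, hitCount) < hitCount;
    }

    public static void main(String[] args) {
        int hitCount = 25;
        int pageSize = 10;
        // Last page of 25 hits at 10 per page
        System.out.println(pageStart(2, pageSize, hitCount) + ".." + pageEnd(2, pageSize, hitCount)); // prints 20..25
        System.out.println(hasNext(2, pageSize, hitCount)); // prints false
    }
}
```

The display loop would then run from pageStart up to, but not including, pageEnd.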
That’s it!
Those are the bits needed to create a simple, one-field Lucene search engine in Java. In terms of the try and catch blocks and variables, you’d probably implement things in a more combined manner, but the samples on this page are designed, at least at some level, to exist in isolation from each other.
Want a more powerful Search Engine?
I also have a Multi-Field Search Engine Example if you want to get a little bit more powerful.