A couple months back, I decided to take a calculated risk with my most popular page; I cut off my head. It’s worked out for the better, and is an interesting case study. Let me explain…

A page on one of my sites was ranking a very solid #1 for a single head keyword. That’s great, right? Maybe.

Let’s say my site is about Widgets. This one head keyword, let’s call it WidgMaster, was bringing in a decent amount of traffic.

While the page was certainly related to the WidgMaster variant of Widgets, it was also very relevant to learning about widgets in general. As evidence, the page currently ranks on page 3 of Google for the much more broad “widgets” term.

In terms of traffic, the page was successful, but unnecessarily pigeon-holed into a very small niche.

It’s worth noting that this page is linked to more heavily than any of the other internal pages on this site, having been on the front page of Slashdot once, and Digg twice. It has more “link juice” than any other page on the site, and pulls in three times as much traffic as the homepage.

I wanted to see if I could make the page pull in a more general audience. I was looking for more “Widget” focused long tail traffic, as described very well recently in “Deep Links, Longtail Keywords, and Why you Should Love Them Both.”

I performed a simple and minor, but crucial strategic change. I figured the chance of success was 50/50, and there was certainly potential for traffic loss.

I reordered the words in the page title.

Learn about WidgMaster Installation and Configuration - Widgets

Became…

Learn about Widget Installation and Configuration - WidgMaster

Almost immediately, I lost the #1 spot that I had held for WidgMaster, which had been solid #1 for over a year. The page now ranks around #6 on Google, high enough to still get a trickle of traffic, but less than 10% of what was previously coming in for that keyword.

Interestingly, and in support of my initial hypothesis, overall traffic to the page stayed about the same.

I was hoping for an increase, but the resulting wash isn’t all bad.

The page now ranks much better for more general long tail “Widget” phrases. The resulting visitors are less often WidgMaster-focused “one hit wonders”, and much more interested in sticking around my site after their initial landing.

This strategy is not right for all situations and pages, and it is much more usable for “short head” terms than “tall head” terms. Still, it’s certainly a method that you should consider experimenting with on a small scale, and having in your SEO Bag o’ Tricks.

SOLR bills itself as an open source enterprise search engine.  I would not go as far as to call it “enterprise,” but certainly believe SOLR is a nice value added wrapper around the already powerful Lucene package.  It may become “enterprise class,” but it has a ways to go.  That being said, it’s certainly good to aim high!

Raw Lucene search engines have a few issues that SOLR addresses very nicely, including:

  • Platform Lock-in (Java)
  • Indexing requires custom Java coding
  • Existing document update issues
  • Index Replication
  • IndexSearcher warming (fake automated queries to prepopulate the cache)

SOLR runs as a Java web application.  The nightly version of SOLR that I downloaded came bundled with Jetty, which makes it very easy to run, test, and even deploy into a light duty production role.

Instead of being a Java library that you call directly from your Java code, SOLR works as a web application: you index documents by sending them via HTTP POST, and you query with HTTP GET requests.  This HTTP interaction means your application does not need to be written in Java.  It can be in any language that can post and request data via HTTP.
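To make the HTTP interaction concrete, here is a minimal sketch using only the Java standard library (Java 10 or later). It builds the XML body for an add request and the URL for a query. The base URL, the field names "id" and "title", and the class and method names are my own assumptions for illustration; your field names come from your own schema, and your host and port may differ.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class SolrHttpSketch {

    // Escape characters that are special inside XML text and attribute values.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Build an <add> message containing one document with the given fields.
    static String buildAddXml(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> f : fields.entrySet()) {
            xml.append("<field name=\"").append(escapeXml(f.getKey())).append("\">")
               .append(escapeXml(f.getValue()))
               .append("</field>");
        }
        return xml.append("</doc></add>").toString();
    }

    // Build the GET URL for a query; URLEncoder turns spaces into '+'.
    static String buildSelectUrl(String baseUrl, String query) {
        return baseUrl + "/select?q=" + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    // POST an XML message to the update handler and return the HTTP status code.
    static int postXml(String baseUrl, String xml) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(baseUrl + "/update").openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode();
    }

    public static void main(String[] args) {
        // Assumed base URL and field names; adjust to your own install and schema.
        String base = "http://localhost:8983/solr";
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "sheet-001");
        doc.put("title", "Etudes & Exercises");
        System.out.println(buildAddXml(doc));
        System.out.println(buildSelectUrl(base, "title:etudes"));
        // postXml(base, buildAddXml(doc)) would send the document; it is not
        // called here because it needs a running server.
    }
}
```

Keep in mind that added documents only become searchable after a commit message is posted to the same update URL, so a real indexing run would finish with postXml(base, "<commit/>").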

When it comes to setting up your index, you do not have to hard-code your field information in Java.  It is set up in an XML file.  This file, among other things, includes the list of document fields and document primary keys needed to maintain your index.

If you are familiar with Lucene, my mention of the primary key may have piqued your interest.  For those not familiar with Lucene, it does not have any notion of primary keys or document updates.  To update a document in Lucene, you first need to locate and delete the previous version of the document through a rather indirect process.  When provided with a primary key, SOLR will handle that process automatically for you.
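The idea behind a keyed update can be sketched without any search library at all. The toy class below is only an in-memory model, not Lucene or SOLR code, and the class name, method names, and the "id" key field are all made up for illustration. It contrasts a plain append, where a second version of a document piles up next to the first, with a delete-then-add update driven by a primary key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class KeyedIndexSketch {
    private final String keyField;
    private final List<Map<String, String>> docs = new ArrayList<>();

    KeyedIndexSketch(String keyField) {
        this.keyField = keyField;
    }

    // Plain append with no notion of a key: old versions are never removed.
    void add(Map<String, String> doc) {
        docs.add(doc);
    }

    // Keyed update: delete every document sharing the key value, then add the new one.
    void addOrReplace(Map<String, String> doc) {
        String key = doc.get(keyField);
        docs.removeIf(d -> key != null && key.equals(d.get(keyField)));
        docs.add(doc);
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        KeyedIndexSketch index = new KeyedIndexSketch("id");
        index.add(Map.of("id", "1", "title", "first version"));
        index.add(Map.of("id", "1", "title", "second version"));
        System.out.println("after two plain adds: " + index.size());
        index.addOrReplace(Map.of("id", "1", "title", "third version"));
        System.out.println("after keyed update: " + index.size());
    }
}
```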

The SOLR example that is included with the nightly build includes a fairly simple script and sample documents to show how indexing works.  Run the examples to index the samples, then you can run searches through the admin interface.  To use the search results in your own application, you query the same URL that shows up in the admin interface, and parse the XML response.

As an initial experiment, I have used SOLR to index product information on my sheetmusic site.  It’s used for the sheetmusic search engine as well as the related products query.  My search engine implementation is only really “quick hack” quality (I have not implemented next/previous page links yet), but the related products part is more polished.

In some instances, indexing into RAM rather than directly to disk can give a large indexing performance increase. Here’s one way to do it. You may need to increase the JVM memory parameters with the arguments -Xms128M -Xmx256M, modifying the sizes to fit your needs. Tweaking the foldCount size will affect how much memory is required, because it sets how large the RAMDirectory is allowed to grow in terms of the number of Lucene Documents it can hold. Each time the foldCount is reached, and again when indexing is complete, the index will be flushed to disk.

Lucene Example Code: RAM to Disk


int foldCount = 500000;
int indexSize = 0;
int count = 0;
try {
   RAMDirectory ramDir    = new RAMDirectory();
   IndexWriter  ramWriter = new IndexWriter(ramDir, analyzer, true);

   IndexWriter writer = new IndexWriter(indexDir,analyzer,true );
   writer.mergeFactor = 100000;

   while(rs.next()){
      Document doc = new Document();
      ramWriter.addDocument(doc);
      count++;
      indexSize++;
      if(indexSize == foldCount){
         foldToDisk(ramDir, ramWriter, writer);
         ramWriter = new IndexWriter(ramDir, analyzer, true);
         indexSize = 0;
      }
   }

   foldToDisk(ramDir, ramWriter, writer);
   writer.optimize();
   writer.close();
} catch (IOException e) {
   e.printStackTrace();
}


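// Close the RAM writer and merge its contents into the on-disk index.
public static void foldToDisk(RAMDirectory ramDir,
                              IndexWriter ramWriter,
                              IndexWriter writer) throws IOException {
   ramWriter.close();
   Directory[] dirs = new Directory[] { ramDir };
   System.out.print(".");
   // mergeDirs is a helper that is not shown here; IndexWriter.addIndexes(Directory[])
   // is the Lucene call that merges other directories into a writer's index.
   mergeDirs(writer, dirs);
   System.out.println(".");
}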

Storing and searching more than one field is very easy to do in Lucene, and it can make your Lucene search engine much more powerful!

Tip: If you’re not already familiar with how to index and search single field documents, this is intended to build on our Simple Lucene Example.

Lucene Example Code: Multi-Field Documents

Create a Lucene document with more than one field

   String content = "This is the example text I want to have Lucene index";
   Document doc = new Document();
   doc.add(Field.Keyword("keyword","Java"));
   doc.add(Field.Text("title","My Document Title"));
   doc.add(Field.Text("content",content));

You would then add the document to the index like normal.

Create a Lucene MultiFieldQueryParser

   String fields[] = {"keyword","title","content"};
   String queryString = "Java";
   // Declared outside the try block so the query can be used for the search afterwards
   Query query = null;
   try {
      query = MultiFieldQueryParser.parse(queryString, fields, new StandardAnalyzer());
   } catch (ParseException e) {
      System.out.println("Lucene ParseException: " + e.getMessage());
      e.printStackTrace();
   }

Read the additional fields from the returned hits

   int hitCount = hits.length();
   for(int i=0; (i < hitCount && i < 10); i++){
      Document doc = hits.doc(i);
      System.out.println(doc.get("keyword") + ", " + doc.get("title") + ", " + doc.get("content"));
   }

That’s it!

That’s all you need to do to step up from single-field to multi-field Lucene documents and searching.

Lucene is a great core for a Java search engine. Here is some simple Lucene example code to index single field data, along with a very basic search function. Together they make a simple Java search engine. In this example code, each block catches its own exceptions so you can see what is thrown. In a real world Lucene implementation, you may handle this differently.

Lucene Example Code: Steps to Index the data

    Create a new Lucene index using an IndexWriter
    Create a Lucene Document
    Add the Lucene document to the index
    Optimize and close the index

Create a new Lucene index using an IndexWriter

   String indexPath = "/path/to/whereYou/wantThe/IndexStored";
   IndexWriter writer = null;
   try {
      // Make a Lucene writer and create a new Lucene index with arg3 = true
      writer = new IndexWriter(indexPath, new StandardAnalyzer(), true);
   } catch (IOException e) {
      System.out.println("IOException opening Lucene IndexWriter: " + e.getMessage());
   }

Create a Lucene document

   String content = "This is the example text I want to have Lucene index";
   Document doc = new Document();
   doc.add(Field.Text("content",content));

Add the document to the index

   try {
      writer.addDocument(doc);
   } catch (IOException e) {
      System.out.println("IOException adding Lucene Document: " + e.getMessage());
   }

Optimize and close the IndexWriter

   try {
      writer.optimize();
      writer.close();
   } catch (IOException e) {
      System.out.println("IOException closing Lucene IndexWriter: " + e.getMessage());
   }

Lucene Example Code: Steps to Search the Lucene Index

Open a Lucene IndexSearcher

   IndexSearcher indexSearcher = new IndexSearcher(indexPath);

If you are using the Lucene search engine from a web page, you should store and reuse the same IndexSearcher for each query. The Lucene IndexSearcher caches information to make queries after the first one faster. Reusing the Lucene IndexSearcher also takes it easy on the Java garbage collector, improving performance and memory utilization. Not reusing the IndexSearcher is a common mistake and a cause of frustration for many first time Lucene users. For use on the web, here is some simple JSP code to store the IndexSearcher in an application attribute and reuse it for future page loads.

   indexSearcher = (IndexSearcher) application.getAttribute("searcher");
   if(indexSearcher == null){
      indexSearcher = new IndexSearcher(indexPath);
      application.setAttribute("searcher",indexSearcher);
   }
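Outside of JSP, the same create-once-and-reuse idea can be sketched with a small holder class. This is only an illustration using the Java standard library: SharedResource is a made-up name, and a plain Object stands in for the IndexSearcher so the example runs without Lucene on the classpath. With a real searcher, the factory would also need to deal with the IOException that opening an index can throw.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class SharedResource<T> {
    private final Supplier<T> factory;
    private T instance;

    SharedResource(Supplier<T> factory) {
        this.factory = factory;
    }

    // Synchronized so two simultaneous first requests cannot both build an instance.
    synchronized T get() {
        if (instance == null) {
            instance = factory.get();
        }
        return instance;
    }

    public static void main(String[] args) {
        AtomicInteger opens = new AtomicInteger();
        // The factory stands in for the expensive "open the index" step.
        SharedResource<Object> searcher = new SharedResource<>(() -> {
            opens.incrementAndGet();
            return new Object();
        });
        Object first = searcher.get();
        Object second = searcher.get();
        System.out.println("same instance: " + (first == second));
        System.out.println("times opened: " + opens.get());
    }
}
```

The synchronized method is the simplest way to avoid opening the index twice when two page loads arrive at the same moment, which the check-then-set in the JSP snippet does not guard against.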

Construct a Lucene Query

   String queryString = "example";
   // Declared outside the try block so the query can be used for the search afterwards
   Query query = null;
   try {
      query = QueryParser.parse(queryString, "content", new StandardAnalyzer());
   } catch (ParseException e) {
      System.out.println("Lucene ParseException: " + e.getMessage());
      e.printStackTrace();
   }

Have Lucene perform the Search

   Hits hits = null;
   try {
      hits = indexSearcher.search(query);
   } catch (IOException e) {
      System.out.println("Lucene Searching Exception: " + e.getMessage());
   }

Display the top Lucene Hits

   int hitCount = hits.length();
   for(int i=0; (i < hitCount && i < 10); i++){
      Document doc = hits.doc(i);
      System.out.println(doc.get("content"));
   }

That’s it!

Those are the bits needed to create a simple, one field, Lucene search engine in Java. In terms of the try and catch blocks and variables, you’d probably implement things in a more combined manner, but the samples on this page are designed, at least at some level, to exist in isolation from each other.

Want a more powerful Search Engine?

I also have a Multi-Field Search Engine Example if you want to get a little bit more powerful.

When performing on site optimization, it’s important to keep your eye on the ball.

Here’s a hint; SEO is not really about code and content optimization.

Seriously, many of us can’t see the forest for the trees. Stop looking at the trees, and start enjoying the beauty of the forest.

What do I mean by that? Most of the SEO discussion and articles about on-page optimization are looking at it backwards.

It’s not about performing the optimizations that the search engines are looking for. You know the drill…

  • Optimized and Unique Title Tag
  • Proper use of header, strong, list tags, etc.
  • Link to related content using keyword rich anchor text
  • Build content silos around a tightly focused niche
  • etc. etc.

While that may work, you’re going through the motions but missing the entire point.

Here’s the point you may be missing.

It’s NOT about performing optimizations.

Search engines don’t want to return highly optimized pages. They really don’t.

It IS about making quality sites.

Search engines want to return high quality pages that are relevant to search terms.

So, on a page by page basis, how do you make a high quality site?

  • Optimized and Unique Title Tag
  • Proper use of header, strong, list tags, etc.
  • Link to related content using keyword rich anchor text
  • Build content silos around a tightly focused niche
  • etc. etc.

Yes. That’s the same list you already saw.

So what’s different? Your perspective, the process, and the end result.

If you look at it from the pure SEO side, it’s entirely possible to “check-off” all the items on your list without really increasing, and possibly even reducing, the quality of the page for actual users.

If you look at it from the page quality side, but with an eye on SEO guided principles, you end up with a highly optimized quality page, because that was your goal. The more I look at things from this perspective, the more I see how SEO and page quality go hand-in-hand.

Google and (some of) the other search engines know this. They don’t pick their on-page weighting factors at random. They choose and weight them because they are generally indicators of higher quality pages for their search results.

Once your mental paradigm makes the shift, you will find yourself thinking less and less about optimizing pages, and more and more about how to actually make your pages better for users.

It only makes logical sense that building high quality pages is a better long term strategy than just trying to optimize content for the sake of optimizing content.

Stop giving search engine spiders optimized pages. Take the next step and start giving them the high quality pages they are really looking for in the first place.

Microsoft has finally officially commented on the fake traffic they’ve been sending to websites recently.

Recently I commented about how Microsoft Live Search has been stuffing our log files with bot traffic pretending to be human.

Basically they say, “we’ve been doing bad stuff for the last EIGHT months that screws up your metrics and disregards internet respect standards, but it’s OK, because we are going to stop soon.”

If anything, their statement strengthens my mistrust.

At least now I know how far back I can’t trust any traffic from MSN. They offer no solutions for how to scrub our log files (which I’ve already spent time trying to do myself.)

Maybe the real problem is one of scale. Google and Yahoo probably do this too, but it’s different. Google and Yahoo (to a lesser extent) send real traffic volumes. Any stuff like this that Google and Yahoo do ends up being noise that is not noticed.

With MSN…these queries are more than just noise; they can end up being the bulk of the MSN traffic coming in. In one case, the bogus queries continually use a keyword phrase that I am trying to optimize for and measure results around.

The only reason I even noticed this problem is I saw a large spike in traffic to that phrase, but could not figure out any ranking change driving the increase. The bogus Microsoft traffic was enough to cut through the Yahoo and Google traffic for that phrase, and lead me down the path to discovering this problem.

At least Microsoft has commented. They have not apologized or offered any solutions. In my mind they’ve only gone as far as saying they are playing dirty pool (but plan to stop soon!) The damage is done. I’ve said my piece. What are your thoughts on the matter?