Showing better highlighted search result fragments with Elasticsearch

Elasticsearch has a pretty awesome highlighting feature, but it comes with a major deficiency. When it truncates your document/string, it gives you no indication that it has done so.

Take a look at this screen shot

As you can see the text (bolded behind the dropdown of results) is truncated in the results in the dropdown itself, but there’s no indication that is what has happened.

Doesn’t seem like a big deal, but for the perfectionists and craftsmen out there, this has to make you itch, right? How is someone to know that there is more to that fragment of text than what they’re seeing?

Well, here’s some Ruby code to the rescue. Throw it in a helper and call it in your view or wherever.

  def ellipses_for_highlights(params_highlight, params_original)
    # highlighted fragments from ES come back with a trailing space, so strip it along with the tags
    stripped_highlighted_item = strip_tags(params_highlight).rstrip
    # if the original doesn't start with the highlighted text, the front has been clipped
    # (start_with? compares literally, so characters like "(" or "?" in the text can't break the match)
    front_ellipsis = !params_original.start_with?(stripped_highlighted_item)
    # if the last 10 characters of the highlighted text don't match the original, same deal
    # (last_string_chars isn't defined in this snippet; it's assumed to return the last n characters of a string)
    back_ellipsis = last_string_chars(stripped_highlighted_item, 10) != last_string_chars(params_original, 10)

    highlighted_item = front_ellipsis ? "... " + params_highlight : params_highlight
    back_ellipsis ? highlighted_item + " ..." : highlighted_item
  end

Link to github gist here

To use this, just pass in the highlighted string from Elasticsearch and the original string for comparison, so something like this:

    ellipses_for_highlights(item.highlight.name.first, item.name)

and you’ll get something like this

It only adds an ellipsis to the front or the back of the string if Elasticsearch actually truncated at that spot, and it adds one on both ends if Elasticsearch truncated both. Better, right?
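If you want to poke at the logic outside of Rails, here is a rough plain-Ruby sketch of the same idea. The names strip_tags_plain and ellipses_for_highlights_plain are made up for this example, the tag stripping is a crude regex rather than Rails' strip_tags, and the last_string_chars defined here is only my guess at what that helper does.

```ruby
# Crude stand-in for Rails' strip_tags: drops anything that looks like an HTML tag.
def strip_tags_plain(html)
  html.gsub(/<[^>]*>/, "")
end

# Guess at the helper used above: the last n characters, or the whole string if it is shorter.
def last_string_chars(str, n)
  str.length <= n ? str : str[-n, n]
end

def ellipses_for_highlights_plain(highlight, original)
  stripped = strip_tags_plain(highlight).rstrip
  # Clipped at the front if the original does not begin with the fragment.
  front = !original.start_with?(stripped)
  # Clipped at the back if the two strings end differently.
  back = last_string_chars(stripped, 10) != last_string_chars(original, 10)
  result = front ? "... " + highlight : highlight
  back ? result + " ..." : result
end

original = "The quick brown fox jumps over the lazy dog"
puts ellipses_for_highlights_plain("The <em>quick</em> brown fox", original)
puts ellipses_for_highlights_plain("fox <em>jumps</em> over the lazy dog", original)
puts ellipses_for_highlights_plain("quick <em>brown</em> fox jumps", original)
puts ellipses_for_highlights_plain(original, original)
```

The four calls cover a fragment clipped at the back, at the front, on both ends, and not at all.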

Couple of things to note.
– This will only work cleanly if you have :term_vector set to “with_positions_offsets” in your mapping. This enables Elasticsearch to break the fragment on word boundaries instead of truncating in the middle of a word. If you have it turned off (i.e. you’re just using the plain highlighter), you’ll get something that looks more like this (notice how the truncation is happening in the middle of words)

– Also keep in mind that because of the behavior explained above, when you use term vectors in your highlighting the fragment_size will not exactly match the number you specify. That makes sense (it has to break on a word, which can have any number of characters in it), but it’s not described anywhere.
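For reference, here is roughly what the two notes above look like as request bodies, built as plain Ruby hashes and dumped to JSON. This is only a sketch: the type name "item", the field "name", the query and the fragment_size of 50 are placeholders, and the "string" field type reflects the Elasticsearch versions around when this was written.

```ruby
require "json"

# Mapping with term vectors switched on for the highlighted field.
# "item" and "name" are placeholder type/field names.
mapping = {
  "item" => {
    "properties" => {
      "name" => {
        "type" => "string",
        "term_vector" => "with_positions_offsets"
      }
    }
  }
}

# Search body asking for one highlighted fragment of about 50 characters.
# With term vectors on, the fragment breaks on words, so the real length
# will only be close to fragment_size, not equal to it.
search_body = {
  "query" => { "query_string" => { "default_field" => "name", "query" => "quick" } },
  "highlight" => {
    "fields" => {
      "name" => { "fragment_size" => 50, "number_of_fragments" => 1 }
    }
  }
}

puts JSON.pretty_generate(mapping)
puts JSON.pretty_generate(search_body)
```

How you send these depends on your client; the hashes are the same either way.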

Adding Autocomplete using elasticsearch

A commonly-requested feature in search applications is autocomplete or search suggestions. The basic idea is to give users instant feedback as they’re typing. Implementations of this feature can vary — sometimes the suggestions are based on popular queries (e.g., Google’s Autocomplete), other times they may be previews of results (e.g., Google Instant). The suggestions may be relatively simple, or they can be extremely complex, taking into account things like the user’s search history, generally popular queries, top results, trending topics, spelling mistakes, etc. Building the latter can consume the resources of a company the size of Google, but it’s relatively easy to add simple results-based autocomplete to an existing elasticsearch search application.

Read More … 

How I became a scientist (quote)

My mother made me a scientist without ever intending to.
Every other Jewish mother in Brooklyn would ask her child after school, “So? Did you learn anything today?”
But not my mother …
“Izzy,” she would say, “did you ask a good question today?”
That difference – asking good questions – made me become a scientist.
— Isidor Isaac Rabi, Nobel laureate

Found this gem on the 37 Signals blog

If you’ve ever had ANYTHING to Do with Rails. ever. Please read this now.

What The Rails Security Issue Means For Your Startup!

There are many developers who are not presently active on a Ruby on Rails project who nonetheless have a vulnerable Rails application running on localhost:3000.  If they do, eventually, their local machine will be compromised. (Any page on the Internet which serves Javascript can, currently, root your Macbook if it is running an out-of-date Rails on it. No, it does not matter that the Internet can’t connect to your localhost:3000, because your browser can, and your browser will follow the attacker’s instructions to do so. It will probably be possible to eventually do this with an IMG tag, which means any webpage that can contain a user-supplied cat photo could ALSO contain a user-supplied remote code execution.)

tracking down *exactly* where a Ruby object method is defined

Ever spent way longer than you would have liked trying to find out exactly where a particular Ruby object method is defined, especially in something like Rails where a method could have been included from a plugin, gem, helper, or otherwise metaprogrammed in?

Well with Ruby 1.9.3 … you can now do this

Post.first.method(:published?).source_location

and get this back

=> ["/Users/xxx/.rvm/gems/ruby-1.9.3-p362/gems/state_machine-1.1.2/lib/state_machine/machine.rb", 752]

Blew my mind, and I’ve been writing Ruby for almost 6 years now.
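Here is a small self-contained sketch you can paste into a file to see it work without a Rails app. Publishable and Post are made-up names for the example, and the nil result for core methods is what you get on standard (MRI) Ruby.

```ruby
# Made-up module and class, just to have methods coming from different places.
module Publishable
  def published?
    true
  end
end

class Post
  include Publishable

  # A metaprogrammed method, similar in spirit to what gems generate for you.
  define_method(:draft?) { false }
end

post = Post.new

file, line = post.method(:published?).source_location
puts "published? is defined at #{file}:#{line}"

file, line = post.method(:draft?).source_location
puts "draft? is defined at #{file}:#{line}"

# owner tells you which class or module the method came from.
puts post.method(:published?).owner

# Methods implemented in C (most of core Ruby) have no Ruby source, so this is nil on MRI.
p "abc".method(:upcase).source_location
```

Pairing source_location with owner is handy when a method was mixed in: one tells you the file and line, the other tells you the module.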

How to try out puma with Apache right now!

I came across puma reading Mike Perham’s blog and was instantly intrigued. It’s a threaded server that runs a single copy of your app, whereas Passenger spins up two or more copies of your app as processes forked from a parent process and distributes requests to each one in turn to keep them all busy.

The thing that jumped out at me was the promise of memory savings by going from 5-6 processes in memory to 1. I run a 768MB VPS with Linode. With Passenger I was using 500-600MB of RAM because of the distinct Ruby processes that Passenger forks to handle requests to your server. (Each process was about 80MB and I was running 5 or 6 of them.)

There is no Apache documentation for proxying to puma, but after looking at this example by the Phusion guys about how to proxy Apache to Passenger Standalone, I figured out a nice little step-by-step way to quickly try out puma to see if you like it or not.

This assumes you’re already running Phusion Passenger with Apache in production

  1. gem install puma on your server, don’t add it to your gemfile
  2. Now mosey on over here and get Apache Proxy installed,
    /etc/apache2/mods-available/proxy.conf will probably already be there for you so all you’ll probably have to do is
    a2enmod proxy
    a2enmod proxy_http
    /etc/init.d/apache2 restart
  3. Alright, now go find your apache.conf or httpd.conf file and comment out all the Passenger related stuff, things like …
    #LoadModule passenger_module /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/passenger-3.0.7/ext/apache2/mod_passenger.so
    #PassengerRoot /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/passenger-3.0.7
    #PassengerRuby /opt/ruby-enterprise-1.8.7-2010.02/bin/ruby
    … and other Passenger configuration directives (PassengerMinInstances for example)

  4. Now go find the file where you defined your virtual host, for me it was in
    /etc/apache2/sites-available/default
  5. Make it look like this (comment out all your other stuff for now, you can add other cool crap once you get it working)
    <VirtualHost *:80>
    ServerName www.yourapp.com
    DocumentRoot /path_to_your_app/public
    PassengerEnabled off
    # the trailing slashes on the next two lines are VERY important
    ProxyPass / http://127.0.0.1:9292/
    ProxyPassReverse / http://127.0.0.1:9292/
    </VirtualHost>
  6. now go to your app root and run
    puma -e production
  7. restart apache and navigate to your app and you should see it load right up
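If the page doesn't come up, it helps to know whether the problem is puma or the Apache proxy in front of it. A tiny check like this, run on the server, asks puma directly on the port used in the ProxyPass lines above. It is only a sketch: http_status is a made-up helper name and the 2 second timeout is arbitrary.

```ruby
require "net/http"
require "uri"

# Returns the HTTP status code as a String (e.g. "200"),
# or nil if nothing is answering on that host and port.
def http_status(url, timeout = 2)
  uri = URI(url)
  # The nil proxy argument means: talk to the port directly, ignoring any http_proxy setting.
  http = Net::HTTP.new(uri.host, uri.port, nil)
  http.open_timeout = timeout
  http.read_timeout = timeout
  http.start { |conn| conn.get(uri.path.empty? ? "/" : uri.path).code }
rescue SystemCallError, SocketError, Timeout::Error
  nil
end

status = http_status("http://127.0.0.1:9292/")
puts status ? "puma answered with #{status}" : "nothing is listening on 9292"
```

If puma answers here but the site still fails through Apache, the proxy config is the place to look.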

Couple of things to note

  • you will probably want to add a line to your production.rb that is simply
    config.threadsafe!
    It basically eager loads your app (vs autoloading the sections it needs) to help avoid problems with threading in Rails. You can learn more about config.threadsafe! in this very detailed post
  • I went from using close to 600MB of RAM to just 350MB and it was blazing fast!!!
    Then I moved over to using puma with nginx and Ruby 2.0.0 (post coming up) and it was even faster!
  • If you think you want to keep puma around then I encourage you to install the puma prerelease (currently 2.0.0.b4) by running
    gem install puma --prerelease
    once you do that, then you can run puma as a daemon by doing
    puma -d -e production
    otherwise you’ll have to have a terminal window open running it the way you run webrick/thin in dev
  • New Relic reporting won’t work out of the box unless you use the prerelease version
That’s it!
I hope you like puma as much as I do.

How to run multiple elasticsearch nodes on one machine

By default elasticsearch runs assuming a one machine, one node setup (You specify node data in elasticsearch/config/elasticsearch.yml), so what happens if you want to run multiple nodes on one box, say, you want to play with multiple nodes on your dev machine?

The easy answer is that you could create multiple elasticsearch.yml files (elasticsearch.0.yml, elasticsearch.1.yml etc etc) and then start each instance from the command line referencing the new config files.

For example
/usr/local/bin/elasticsearch -fD es.config=/usr/local/Cellar/elasticsearch/0.xx.x/config/elasticsearch.0.yml

/usr/local/bin/elasticsearch -fD es.config=/usr/local/Cellar/elasticsearch/0.xx.x/config/elasticsearch.1.yml

That should get you most of the way there (the new node comes up on port 9201), but if you have any problems and need an alternative read this detailed response on Stackoverflow 
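If you'd rather generate the per-node config files than copy them by hand, something like this works as a starting point. It is a sketch under a few assumptions: the cluster name "dev-cluster" and the node-N naming are invented, and setting http.port explicitly is optional since a second node picks the next free port on its own.

```ruby
require "yaml"

# Settings for one node. All nodes share the cluster name so they find each other.
def node_config(index, cluster_name = "dev-cluster")
  {
    "cluster.name" => cluster_name,
    "node.name" => "node-#{index}",
    "http.port" => 9200 + index
  }
end

# Writes elasticsearch.0.yml, elasticsearch.1.yml, ... into dir and returns the paths.
def write_node_configs(dir, count)
  (0...count).map do |i|
    path = File.join(dir, "elasticsearch.#{i}.yml")
    File.write(path, node_config(i).to_yaml)
    path
  end
end

puts node_config(1).to_yaml
```

Point write_node_configs at your config directory, then start each node with its own es.config as shown above.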

 

Getting Started with Elasticsearch

I’ve been doing a lot of Elasticsearch work at my full-time job and I’m liking it very much (actually in San Francisco for an Elasticsearch conference right now). That being said … I started reading this great article by Jon Tai about how to use Elasticsearch as a supplement to your database to get quicker results for unstructured/complex queries. Then I started to look at the rest of his blog posts about Elasticsearch and quickly realized that if you’re trying to get up to speed with Elasticsearch, there isn’t clearer, more easily digestible writing on the web about the basics of Lucene and Elasticsearch.

Trust me, I know. I’ve been screwing with ES for the last six months or so, and the knowledge I have is pieced together from numerous Google searches, Stackoverflow questions, random one-off blog posts about Elasticsearch and Tire, and videos from the Elasticsearch site.

So once you actually get ES setup on your dev machine, go get yourself a good cup of whatever and then snuggle up with the following (in this order).
Testing Lucene Analyzers with elasticsearch
Lucene Scoring and elasticsearch’s _all Field
Then watch this 40 minute video by Elasticsearch creator, Shay Banon, that explains the way Elasticsearch is designed and how to use it to your advantage
Big Data, Search and Analytics (I’ve watched this 3 times since last August and I pick up something new each time)

Cheers.

Be ready for your close-up

“To every man there comes in his lifetime that special moment when he is figuratively tapped on the shoulder and offered a chance to do a very special thing, unique to him and fitted to his talents. What a tragedy if that moment finds him unprepared or unqualified for that which would be his finest hour.”

— Winston Churchill

Timestamps for tumblr posts in your dashboard

Ever been scrolling through your tumblr, for what seemed like hours, but didn’t want to stop because you’d lose your place without getting to where you stopped the last time?

I love tumblr, but this bugged me so much that I hacked together a Tumblr Timestamps Chrome extension that tells you exactly when a tumblr post was published (slots it in the lower left hand corner of every post). This way you can keep track of where you start or leave off, and (hopefully) better manage how much time you spend on tumblr.

enjoy!

imagination > knowledge

Remember …

“Imagination is more important than knowledge. For knowledge is limited to all we now know and understand, while imagination embraces the entire world, and all there ever will be to know and understand.”
— Albert Einstein

ux gripes: Customer Service phone menus that only allow you to navigate with your voice

A lot of times, I’ll get on the phone to check my balance or do something routine while I’m in my office (which I share with 2 other people), but the customer service menu navigation is ONLY voice activated. Since I don’t want to disturb my co-workers, I either have to stop what I’m doing and leave the room to find a quiet place to yell instructions at my phone or just remember to do it later (which I never do). Apart from the potentially poor user experience (slower/inconvenient way to get through a menu you’re already familiar with), it simply is a massive pain in the ass sometimes.

What’s so frustrating is that this can easily be fixed by giving the user the option of hitting a button to revert to the number pad for navigation. But then again, if a company has a voice navigated customer service menu, they probably don’t really give two ____s about what’s convenient for you.

How To Be A Super-Achiever

Interview after interview with some of the world’s most successful people—actress Laura Linney, Zappos CEO Tony Hsieh, crossword mastermind Will Shortz—they began seeing patterns emerge. No matter how diverse their goals or crafts, these super-achievers shared many of the same habits. How can you follow in their footsteps?

Read the article here or watch this short and sweet summary (recommended) …

feature I’d love to see: Add What’s New/Changelog to “About Google Chrome” page

This one is really nerdy, but it would be cool if in the “About Google Chrome” page of the browser, a list of what’s new or what’s changed showed up right after the version number.

This could be restricted to users on the dev/beta channel. I figure if you’re on dev/beta, you care about that kind of information.

Even cooler would be a list of all the versions that you previously used, that folded out on click, so you could see your own specific upgrade path (along with their changelogs)

iOS Development Tips I Would Want If I Was Starting Out Today

Making iOS apps is getting easier and easier with each new release of Xcode. However, all the new features and approaches mean there are more options to choose from, plus outdated books and old documentation.

Back in my day it was so much harder – that’s true in many respects, but a much higher level of quality and features is expected now. The bar keeps rising, and that’s a very good thing.

If I was starting out with iOS development today these are the things I would hope somebody would tell me.
Read More