Archive for the ‘Search’ category

SOLR search adoption – the power of sane defaults?

March 27th, 2008

Tonight I met someone from a (largish) local company and learned they’re migrating their search functionality to SOLR.  This is the second largish company in the area I know that’s migrating to SOLR.  I’m not naming names only because I’m not sure they’d want me to do so.  Suffice it to say these are names fairly well known in the marketing and communications industries.

I’m not surprised at all by the adoption, as SOLR makes it pretty easy to get started using the power of Lucene without requiring you to do a lot of setup or administration up front.  These ‘sane defaults’, as I believe Erik Hatcher put it to me, are what give projects like SOLR a competitive advantage against even commercial offerings.  Whether technology is good or bad is often secondary to whether it’s easy to get it to a testing stage.

If you’re using SOLR, what was the deciding factor?  Ease of setup?  Flexibility?  Compatibility with existing Lucene data?

If you’re not using SOLR for your data search needs, what are you using?  Raw Lucene?  Xapian?  Sphinx?  A commercial product?  If so, which one?

P.S.  If you’re not sure how to go about implementing search for your site and have some questions, email me – mgkimsal@gmail.com.

FriendFeed prediction – clustered feed data

March 18th, 2008

Robert Scoble just switched his home page from TechMeme to FriendFeed.

“So what?” is likely what you’re thinking. Yeah, big deal, right? Well, TechMeme had a clustering algorithm which would group together news articles of related content, and give you a good idea of the ‘hot topics’ of the day. It did this in a completely automated way.

I predict that FriendFeed (or another social network aggregator) will introduce topic clustering, based on the keywords and topics of people you follow. Clusty.com has done topic clustering for years, though it’s not something that is of great use to ‘general’ searching (at least, not in many cases). Carrot2, an open source clustering engine, also provides this sort of functionality.

I took a first stab at clustering my feed data with carrot2. I’m not sure I had enough data to draw useful conclusions yet – it might need a larger body of a group of people’s tweets (for example) which I just didn’t have at the time.

For people who follow thousands of users, it would obviously be useful to have a ‘big picture’ view of the hottest topics being twittered/blogged/etc about. But take it one step beyond that. Being able to look at *other people’s* topic clusters would give you an instant view as to whether they have people worth following.
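To make the ‘big picture’ idea concrete, here is a minimal Python sketch of topic clustering over short posts. It is only an illustration: it is not how TechMeme, Clusty or Carrot2 work, and the stopword list, the Jaccard similarity measure, the greedy single-pass grouping and the 0.2 threshold are all assumptions picked to keep the example small.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "and", "with", "for", "about", "this", "that", "from"}


def keywords(text):
    """Lowercase the text and keep words longer than two characters."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {w for w in words if len(w) > 2 and w not in STOPWORDS}


def jaccard(a, b):
    """Overlap between two keyword sets, from 0.0 (none) to 1.0 (identical)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_posts(posts, threshold=0.2):
    """Greedy single pass: put each post in the most similar existing
    cluster, or start a new cluster if nothing is similar enough."""
    clusters = []  # each cluster: {"terms": set of keywords, "posts": list}
    for post in posts:
        terms = keywords(post)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = jaccard(terms, cluster["terms"])
            if sim >= threshold and sim > best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"terms": set(terms), "posts": [post]})
        else:
            best["terms"] |= terms
            best["posts"].append(post)
    # Biggest clusters first: these are the 'hot topics'.
    return sorted(clusters, key=lambda c: len(c["posts"]), reverse=True)


if __name__ == "__main__":
    feed = [
        "Solr search is easy to set up",
        "Setting up Solr search with PHP",
        "Black vodka at the MySQL party",
    ]
    for cluster in cluster_posts(feed):
        print(len(cluster["posts"]), sorted(cluster["terms"]))
```

Run against one person’s feed this gives a rough topic profile for that person; run against everyone they follow it gives the ‘hot topics’ view. A real system would need stemming, a proper stopword list and far more data than a handful of posts.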

When I look at twitter, I can look at other people’s followers. Great concept, but it doesn’t tell me anything about the topics those people tend to twitter about, so I’m never sure if it’s worth following them. Nor do I get any notion of how those people are related. Marrying facebook or plaxo data against twitter feeds would be useful, no? Or just letting me add my own relationship metadata in to twitter itself.

Getting a high level view of people’s topiclusters would be incredibly useful. “Topiclusters” – yeah, I just made up that word and yeah, it’s lame. “Topsters”? “Substers?” (subject clusters?).

Possible book project – open source search

March 9th, 2008

I’ve had a book project in the back of my mind for a bit.  Though there are never enough hours in the day to get everything done I need to, I have one book I’m wrapping up in the next few days, and am seriously considering committing myself to this next one.  No publisher lined up yet or anything of that nature, but if I can’t find one I’d self publish through lulu.com.

The title obviously gives it away – I’m looking at doing a book on open source search products.  I was thinking of doing an entire book on SOLR last fall, but honestly I’m not sure there’s enough about SOLR to write an entire book – at least not without repeating a lot of the information already out there in tutorials and whatnot.  And I’m not sure that another in-depth technical book is necessary on something that’s moving so fast.  The idea was to give a moderately-deep (but not overly deep, if you can make that distinction) look at setting up and using a group of open source search projects out there.

  • Lucene is the leader in this space, without question.  It’s been around for quite a while, and keeps getting better with each release.  However, Lucene itself is very low-level, and Java only.  Many implementations of Lucene have sprung up, such as Lucene.Net and Lucy, as well as tools which build on Lucene like SOLR and Nutch.
  • PostgreSQL has full text search capabilities which I plan to explore in more detail.
  • MySQL has had a degree of full text search capabilities for years, and the Sphinx project has emerged over the last couple of years to provide even more functionality and speed.  I believe Sphinx is essentially standalone but can be coupled with MySQL or PostgreSQL – again, that’s research fodder for the book.

Are there other open source search projects that you’d be interested in seeing covered in a book?  Is this a topic you see any demand or interest in?  Whenever I see a gap in the book market, I always wonder if it’s because there’s no interest, or just that no one has filled the gap yet.  Usually something appears to fill that gap a few months after I notice it, but I’ve yet to see this gap filled after almost a year of thinking about it.

Feedback?

Speech to take off as a search interface? No way…

February 22nd, 2008

Bill Gates is predicting that we’ll interact with computers via speech and touchscreens more in the future.

In five years, Microsoft expects more Internet searches to be done through speech than through typing on a keyboard, Gates told about 1,200 students and faculty members Thursday at Carnegie Mellon University.

I just can’t see this happening without a major redesign of work areas.  Far too many people are in shared offices where constant speech is a distraction.  Also, think of how we search for data now – we find data, look at it, decide this isn’t what we needed, and search again, perhaps for only a slight variation of what we originally searched for.  Now imagine *hearing* your co-worker do this.  Constantly.  All day long.

Imagine being in a library and wanting or needing to search for an unpopular topic.  Now imagine doing it with speech.

Keyboards and mice aren’t 100% perfect, but they are *quiet*, and don’t distract others around you nearly as much as someone talking.  Keyboards and mice also offer a level of privacy that speech just never will.

Now, if Gates was predicting *thought* interaction – just think about stuff and the computer would help you search for it – I’d agree 100% that it’s good and we should be striving towards that.  But speech?  I’m perhaps just a stick in the mud, but I don’t see it taking off, except for niche vertical markets (customer service kiosks, perhaps, or medical systems for disabled individuals).

What do you think?  Am I missing some killer use cases?

Former colleague mentioned @ developerworks – PHP/SOLR

January 22nd, 2008

Former colleague Donovan Jimenez had his PHP/SOLR client plugged as the “most robust” PHP client for SOLR at IBM’s developerworks site.  Not much else to plug here, but if you’re interested in doing SOLR with PHP, his client does the job admirably.  I’m using it in my matchorclash.com site right now too.  Grab Donovan’s client here.

Tagging evolved

January 10th, 2008

I was having an interesting conversation with Joe Brinkman from the DotNetNuke project this evening, and he got to talking about the ‘social networking’ focus in the next DNN release.  I had a small brainwave and suggested something to him, but the implications might be larger than I originally considered.

He mentioned that they’d be looking at providing the capability to ‘tag’ every piece of content in the system, instead of just a few item types which can be tagged currently in DNN.  Their focus will be on the business/enterprise aspect of tagging and the social features in DNN, and given this I suggested that the tags have timestamps associated with them.  When doing a search through tags (which itself often isn’t done – it’s just a blind SQL SELECT triggered from a REST URL), giving tags with older dates less weight in the final results will likely make sense.  You could even implement a cutoff.  If a document was tagged ‘vacationpolicy’ 6 years ago, it’s very likely it’s not the vacation policy you’re looking for today.
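A minimal Python sketch of the timestamp idea, not anything DNN actually implements: each tag’s weight halves every `half_life_days`, and tags older than an optional `cutoff_days` count for nothing. The function names, the exponential decay and the default values are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def tag_weight(tagged_at, now, half_life_days=365.0, cutoff_days=None):
    """Weight of a single tag: 1.0 when brand new, halving every half-life,
    and 0.0 once the tag is older than the optional cutoff."""
    age_days = max((now - tagged_at).total_seconds() / 86400.0, 0.0)
    if cutoff_days is not None and age_days > cutoff_days:
        return 0.0
    return 0.5 ** (age_days / half_life_days)


def score_document(tag_times, now, half_life_days=365.0, cutoff_days=None):
    """Sum the decayed weights of every time the document got the tag."""
    return sum(
        tag_weight(t, now, half_life_days, cutoff_days) for t in tag_times
    )


if __name__ == "__main__":
    now = datetime(2008, 1, 10)
    # A tag applied exactly one half-life ago weighs 0.5.
    print(tag_weight(now - timedelta(days=365), now))
    # With a five-year cutoff, a 'vacationpolicy' tag from 2002 drops out.
    print(tag_weight(datetime(2002, 1, 10), now, cutoff_days=5 * 365))
```

The half-life is the knob to tune: a short one suits fast-moving content, a long one suits reference material that stays valid for years.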

I realize that tagging isn’t the only way to categorize data, but it’s another piece of data which will need to be considered when searching.  Using that extra metadata about the tag should be a factor in search results.  Storing ‘who’ tagged something would also be useful for influencing search results, as my ‘friends’ (people in my department, or people with my interests, or whatever) who tag something as ‘foo’ should result in things they tagged as ‘foo’ being rated higher than items tagged ‘foo’ by people I don’t know, or actively dislike for some reason.
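Weighting by ‘who’ tagged something could look roughly like the Python sketch below. The three weights (2.0 for friends, 1.0 for strangers, 0.5 for people I’d rather ignore), the `(doc_id, tag, tagger)` tuple layout and the function names are hypothetical, chosen only to show the shape of the idea.

```python
def tagger_weight(tagger, friends, disliked,
                  friend_weight=2.0, stranger_weight=1.0,
                  disliked_weight=0.5):
    """How much one person's tag counts for the person searching."""
    if tagger in disliked:
        return disliked_weight
    if tagger in friends:
        return friend_weight
    return stranger_weight


def rank_by_tag(taggings, tag, friends, disliked=frozenset()):
    """taggings is an iterable of (doc_id, tag, tagger) tuples.
    Returns (doc_id, score) pairs, highest score first."""
    scores = {}
    for doc_id, applied_tag, tagger in taggings:
        if applied_tag != tag:
            continue
        weight = tagger_weight(tagger, friends, disliked)
        scores[doc_id] = scores.get(doc_id, 0.0) + weight
    # Sort by score descending, then doc_id so ties are deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    taggings = [
        ("doc-a", "foo", "alice"),
        ("doc-b", "foo", "stranger"),
        ("doc-b", "foo", "mallory"),
        ("doc-c", "bar", "alice"),
    ]
    print(rank_by_tag(taggings, "foo",
                      friends={"alice"}, disliked={"mallory"}))
```

Here one friend’s tag on doc-a outweighs two tags on doc-b from a stranger and a disliked user, which is the behaviour described above.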

I have to imagine that sites like Flickr, which have built a huge collection of tag data, have much of this tag meta information on hand, and could easily use it to influence results.  Introducing new behaviour on public sites for something which is already expected behaviour might not be on the cards anytime soon, but I have to imagine these sorts of filters and weighting structures will make their way in to tag search algorithms (if such things even exist right now – I bet they don’t yet).

What do you think?

Amazing firefox plugin – useful for researchers

September 19th, 2007

I just stumbled on Zotero, a fantastic firefox plugin for archiving, annotating and searching stuff you find on the web.  There’s very little I can say about it that they don’t say better on their site.  I’ve only been using it today, but it’s simply amazing.  I’ve been looking for something like this for a long time.  I’ve tried using the google notebook, but it’s always so slow to load up during its connections to the server.  I like that the google notebook lets me share stuff live, but I’m ending up not using that aspect much.  I think Zotero is going to be much more useful in my everyday browsing and clipping.  The exporting in RDF is just icing on the cake.

OSCON live recap (and solr BOF tonight)

July 26th, 2007

So, I hit a couple more sessions last night.  The ‘high performance web pages’ talk from Steve @ Yahoo wasn’t open – SRO apparently.  Instead, I caught the end of “Profiling PHP apps” (Reilly).  I missed the beginning, but was hoping to get a bit of something out of it.  I did – a reference to wincachegrind (also referenced multiple times in comments on another post here).

I did end up meeting up with Ben Ramsey and George Schlossnagle.  I’ve talked on and off for years with George, and met his brother Theo, but we’d never met in person before.  Very nice guy.  Ben was good to see as well, and recently became a dad.  I was surprised that he was able to make it at all!  I gave both George and Ben copies of my ‘work in progress’ book about the PHP job market, ideally getting some feedback from each of them with criticisms and pointers about how to make it better.

I checked out the ‘scrum war stories’ ‘birds of a feather’ session last night, run by Eric Pugh.  It was interesting to hear other people’s experiences with scrum and agile methodologies.  There was another person (Bill West) from Raleigh, who’s also been to the Agile Artisans group run by Jared Richardson (who’s also here speaking today – if you are doing Rails and want to tune your apps, check out Jared at 1:45 today – he’s a good speaker and great guy).  We then checked out the MySQL session with “pizza and beer”.  I had pizza and some home made “black vodka” personally made and poured by Monty himself!  It was *strong*!  I think I took 4 years off my esophagus last night with those few ounces.

I’m watching Bill Hilf’s keynote right now, and will be attending a lot of web dev sessions today, and am going to try to corner some people today for a webdevradio podcast interview.

Lastly, I’m hosting a SOLR “hands on” BOF tonight at 8:30.  There were a number of people who had some SOLR questions after yesterday’s presentation which I didn’t get a chance to respond to.  Hopefully a few people will stop by tonight and we’ll go over any more questions or demos people want to see, or others will stop by and share their success stories (please drop by!).

If you’re at OSCON and want to get togethe

SOLR presentation

July 25th, 2007

I ended up running over just a bit in my presentation, and didn’t quite get through all my slides (missed the last 3).  For anyone that wanted to see how it ends, download the files from http://www.webdevradio.com/solr_oscon.tgz.  The only thing I didn’t demonstrate in detail was the PHP/SOLR search code, which is running at http://www.pfblogs.com/v2/ right now.

I’m currently at the MySQL Internals talk with Monty and Jay (Pipes), but it’s a bit over my head.  I guess I wasn’t expecting this many internals at an ‘internals’ talk – Monty is getting in to the speed performance of different iterator classes and whatnot – a bit too low-level for me.  Very interesting nonetheless, just not something I can use in my day to day work.

Hosted wordpress search service

June 24th, 2007

Going through SOLR while putting together my presentation, I’ve started thinking again about the hosted SOLR service I was considering some time ago.  I was thinking last night that a hosted blog search – wordpress, to start with – would be a great service, and pretty easy to set up.  WordPress “search” functionality is something that seems to be moderately high on the ‘wanted’ list.

My primary concern is how to offer this and perhaps make a bit of money off of it.  Should the service be locked down by IP?  Or user/pass/key?  The akismet service sounds like a good model, except that this would probably require even more horsepower than akismet, because akismet is only run once per comment posted, while search might be run multiple times per visitor to a blog.  Perhaps it’s free for the first 200 entries or something like that.  Many small blogs only have a few entries and comments, so offering it free to them would get some traction.  Some payment per month for larger accounts might make sense.  Or perhaps the results page could be hosted, and offer ads on that?  That’d probably upset people, although hosted Google search probably does that already.