(tl;dr - I've chosen pyes because it has batteries included).
First: Why do I need a client and what do I need it for?
Elasticsearch is a web service. All you need is the ability to make HTTP calls.
In the simplest case, with one server and fairly straightforward queries,
anything that can make GET and POST requests (like requests, which really should be in the Python standard library)
will work just fine. What I need, however, is far from the simple case.
First of all, when I'm accessing an ES cluster with several nodes,
I need to deal with occasional failures. At the very least the client should let me
specify a connection timeout and the number of retries.
Some clients implement connection pooling, load balancing and failover, but since a dedicated
load balancer is much better at handling all of those, I don't care about client support for that
(this is also the reason for using HTTP instead of Thrift).
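To make the "timeout and retries" requirement concrete, here is a minimal sketch using only the standard library. The function name es_search, its parameters and the injectable opener argument are my own illustration, not part of any ES client; it assumes the cluster sits behind a single load-balanced URL.

```python
import json
import time
import urllib.error
import urllib.request


def es_search(url, query, timeout=2.0, retries=3, backoff=0.0,
              opener=urllib.request.urlopen):
    """POST a query dict to a _search URL, retrying on connection problems.

    `opener` is injectable (hypothetical convenience) so the retry logic
    can be exercised without a running cluster.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    last_error = None
    for attempt in range(1 + retries):
        try:
            with opener(request, timeout=timeout) as response:
                return json.loads(response.read().decode("utf-8"))
        except urllib.error.HTTPError:
            # The server answered (for example with a parse error),
            # so sending the same request again will not help.
            raise
        except (urllib.error.URLError, OSError) as exc:
            # Connection refused, timeout, reset: worth another attempt.
            last_error = exc
            if attempt < retries:
                time.sleep(backoff * (attempt + 1))
    raise last_error
```

A call would look like es_search("http://localhost:9200/index/_search", {"query": {"match_all": {}}}, timeout=1.5), where the host and index name are placeholders.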
Second: while simple ES queries are easy to write by hand, this is what I'm frequently dealing with:
{ "sort": [ { "follows.date_added": { "order": "desc", "nested_filter": { "terms": { "follows.owner_id": [ 1 ] } } } }, { "entries.usd_price": { "order": "asc", "nested_filter": { "bool": { "must": [ { "bool": { "must_not": [ { "term": { "entries.disallow_countries": "US" } } ], "must": [ { "terms": { "entries.allow_countries": [ "*", "US" ] } } ] } }, { "terms": { "stock_status": [ 3 ] } } ] } } } } ], "from": 0, "facets": { "color_not_analyzed": { "facet_filter": { "bool": { "must": [ { "terms": { "gender_not_analyzed": [ "Men" ] } } ] } }, "terms": { "field": "color_not_analyzed", "size": 50 } }, "subcategory_not_analyzed": { "facet_filter": { "bool": { "must": [ { "terms": { "gender_not_analyzed": [ "Men" ] } } ] } }, "terms": { "field": "subcategory_not_analyzed", "size": 50 } }, "category_not_analyzed": { "facet_filter": { "bool": { "must": [ { "terms": { "gender_not_analyzed": [ "Men" ] } } ] } }, "terms": { "field": "category_not_analyzed", "size": 50 } }, "retailer_slug": { "facet_filter": { "bool": { "must": [ { "terms": { "gender_not_analyzed": [ "Men" ] } } ] } }, "terms": { "field": "retailer_slug", "size": 50 } }, "gender_not_analyzed": { "terms": { "field": "gender_not_analyzed", "size": 50 } }, "product_type_not_analyzed": { "facet_filter": { "bool": { "must": [ { "terms": { "gender_not_analyzed": [ "Men" ] } } ] } }, "terms": { "field": "product_type_not_analyzed", "size": 50 } } }, "filter": { "bool": { "must": [ { "terms": { "gender_not_analyzed": [ "Men" ] } } ] } }, "query": { "filtered": { "filter": { "bool": { "must": [ { "nested": { "filter": { "bool": { "must": [ { "bool": { "must_not": [ { "term": { "entries.disallow_countries": "US" } } ], "must": [ { "terms": { "entries.allow_countries": [ "*", "US" ] } } ] } }, { "terms": { "stock_status": [ 3 ] } } ] } }, "path": "entries" } }, { "nested": { "filter": { "terms": { "follows.owner_id": [ 1 ] } }, "path": "follows" } } ] } }, "query": { "match_all": { } } } }, "size": 10 }
(and this is not the most complicated query I'm running, far from it). Such complex queries cause a few problems that require support from the client:
- you have to keep up with the quickly evolving ES syntax. If you are using a deprecated or obsolete feature, the client should warn you.
- you don't want to spend hours chasing a typo, staring at an ES "parsing error near..." response. Queries should be generated.
- you need to be able to modify queries easily to use ES efficiently. The client should provide a high-level interface for that.
- but you also need to get everything out of ES, so the client should support every available feature and syntax option.
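As an illustration of "queries should be generated", the repetitive facets part of the query above can be produced by a few lines of plain Python. This is only a sketch: the helper name build_facets is made up, and the rule it applies (each facet is filtered by every active filter except the one on its own field) is my reading of the query shown above, where the gender facet is the only one without a facet_filter.

```python
def build_facets(fields, active_filters, size=50):
    """Build one terms facet per field.

    `active_filters` maps a field name to the list of selected values.
    Each facet gets a facet_filter made of all active filters except
    the one on its own field, so that facet keeps showing every option.
    """
    facets = {}
    for field in fields:
        facet = {"terms": {"field": field, "size": size}}
        others = [
            {"terms": {name: values}}
            for name, values in active_filters.items()
            if name != field
        ]
        if others:
            facet["facet_filter"] = {"bool": {"must": others}}
        facets[field] = facet
    return facets


facets = build_facets(
    ["color_not_analyzed", "retailer_slug", "gender_not_analyzed"],
    {"gender_not_analyzed": ["Men"]},
)
```

Adding a new facet or a new active filter is then a one-word change instead of another hand-copied block of JSON.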
Besides that, I have the standard expectations I have of every library:
- keep up with ES development
- fix bugs and release often
- provide good documentation
What I don't need:
- as mentioned: any advanced connection management
- integration with any framework. While useful at the beginning, it gets in the way later
and can become a limitation. In my case the ES index is largely independent of my database models.
Considering those requirements, what were my options?
First, let's have a brief look at the ES libraries that I don't even consider usable:
pyelasticsearch
ESClient
rawes
All of them (and many others you can find on PyPI) provide not much more than a thin wrapper over HTTP requests. While they are useful, and for most people simply good enough, they really are not an option for me.
Here is the list of clients that I was looking at:
elasticutils
This one was really promising, as it allows you to write this:
In [1]: elasticutils.S().filter(foo__gte=4, baz__startswith='bar').order_by('-baz').facet('foo')
Out[1]: <s {'filter': {'and': [{'range': {'foo': {'gte': 4}}}, {'prefix': {'baz': 'bar'}}]}, 'sort': [{'baz': 'desc'}], 'facets': {'foo': {'terms': {'field': 'foo'}}}}>
which is absolutely amazing compared with raw ES syntax. If you are choosing an ES library now, you should definitely consider it.
Unfortunately, when I was looking at it, it relied on a pyelasticsearch that wasn't compatible with the recent ES version, making it completely useless.
I hope this has been fixed, but I have moved on since then, so I don't know for sure. My only other objection would be the lack of support for nested documents.
Other than that, it really makes using ES a pleasure.
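For readers who have not seen this style before, here is a rough sketch of how keyword lookups such as foo__gte=4 can be turned into ES filters. It is not elasticutils code: the function name filters_from_lookups and the small set of supported suffixes are assumptions chosen to reproduce the filter part of the output shown above.

```python
RANGE_OPS = {"gt", "gte", "lt", "lte"}


def filters_from_lookups(**lookups):
    """Translate Django-style keyword lookups into an ES 'and' filter."""
    filters = []
    for key, value in lookups.items():
        # Split "foo__gte" into the field name and the operator suffix.
        field, _, op = key.partition("__")
        if not op:
            filters.append({"term": {field: value}})
        elif op in RANGE_OPS:
            filters.append({"range": {field: {op: value}}})
        elif op == "startswith":
            filters.append({"prefix": {field: value}})
        else:
            raise ValueError("unsupported lookup: %s" % key)
    return {"and": filters}


es_filter = filters_from_lookups(foo__gte=4, baz__startswith="bar")
```

The keyword arguments keep their order (Python 3.7+), so the generated filter list follows the order of the call.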
elasticfun
haystack
Both provide similar queryset-ish syntax, although they support a much smaller subset of ES features. Likely good enough for many people, but not for me.
Haystack supports many search engines, so you can't expect its integration with ES to be as good as a dedicated client's.
And the winner is ... pyes:
Pyes provides:
- support for nearly every ES feature, via an object-oriented interface. If anything is missing (it happened a few times),
it's really easy to add.
- queryset, for convenience:
In [1]: queryset.QuerySet(index='index', type='type').filter(foo=3, bar__startswith='joe').order_by('bar').facet('baz')._build_search().serialize()
Out[1]: {'facets': {'baz': {'terms': {'field': 'baz', 'size': 10}}}, 'from': 0, 'query': {'filtered': {'filter': {'and': [{'term': {'bar.startswith': 'joe'}}, {'term': {'foo': 3}}]}, 'query': {'match_all': {}}}}, 'sort': [{'bar': 'asc'}]}
Unfortunately the queryset itself does not support nested documents, but all other pyes classes do.
- a simple way of dealing with complex queries. Basically, pyes provides a Python class
for every part of an ES query, like filters, facets or queries. This gives you query generation (each class has a serialize method that generates the relevant part of the ES syntax),
and yet lets you go as low-level as needed to tweak anything you want. This OO-based approach makes pyes (and anything that uses it)
very easy to inspect and debug, which is something I frequently do. You have to deal with the whole complexity of ES, of course, but that is exactly what I often need to do.
- good (but not perfect) support for recent ES versions. While there were a few details I had to fix or enhance, at least it was never completely broken (pointing a finger at elasticutils here).
- support for specifying connection timeouts and retries. Actually it does much more. I don't need it, but it's good to have a choice.
- a straightforward translation to ES syntax, which makes it easy to understand if you know ES syntax (otherwise it makes it very, very hard to understand anything)
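The "one class per query part, each with a serialize method" idea is easy to show in miniature. The classes below (Terms, Bool, Nested) are hypothetical stand-ins written for this post, not the actual pyes classes or their signatures; they only illustrate why such objects are easy to compose, inspect and debug.

```python
class Terms:
    """Matches documents whose field contains any of the given values."""

    def __init__(self, field, values):
        self.field = field
        self.values = list(values)

    def serialize(self):
        return {"terms": {self.field: self.values}}


class Bool:
    """Combines other filters with must / must_not clauses."""

    def __init__(self, must=(), must_not=()):
        self.must = list(must)
        self.must_not = list(must_not)

    def serialize(self):
        body = {}
        if self.must:
            body["must"] = [f.serialize() for f in self.must]
        if self.must_not:
            body["must_not"] = [f.serialize() for f in self.must_not]
        return {"bool": body}


class Nested:
    """Applies a filter to nested documents stored under `path`."""

    def __init__(self, path, inner):
        self.path = path
        self.inner = inner

    def serialize(self):
        return {"nested": {"path": self.path, "filter": self.inner.serialize()}}


# Rebuild one fragment of the big query shown earlier from small objects.
follows = Nested("follows", Terms("follows.owner_id", [1]))
combined = Bool(must=[follows, Terms("stock_status", [3])])
query_part = combined.serialize()
```

Each object can be serialized on its own, so a suspicious fragment can be printed and checked in isolation before it ends up buried in a multi-kilobyte request.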
There are cons as well:
- while it is actively maintained, official releases are rare. Use master. This is the biggest drawback.
- if you know nothing about full-text search engines, this may not be the best choice for you. It lets you dive as deep into ES as needed, but there is little automation. In that case, haystack might be the best choice.
- following the standard set by ES itself, the documentation sucks. You can easily do a hello-world query, but then there are a lot of undocumented methods that accept **kwargs. The source is easy to read, though.
But they don't outweigh the pros, and for my needs there really was no other choice.
(if any of the pyes authors are reading this, here is my wishlist: provide official releases, gather and publish a list of unsupported ES features, and keep up the good work you are doing)