Search engine selection: Elasticsearch vs Solr

Little scum · Posted on 12/11/2018 1:42:36 PM

Search engine selection research document

Introduction to Elasticsearch

Elasticsearch is a real-time distributed search and analytics engine. It helps you process large-scale data faster than ever before.

It can be used for full-text search, structured search, and analytics, and of course, you can combine all three.

Elasticsearch is a search engine built on Apache Lucene™, a full-text search library that is arguably the most advanced, efficient, and full-featured open-source search engine framework available today.

But Lucene is just a library. To take full advantage of it, you need to write Java and integrate Lucene into your own program. Lucene is complicated, and understanding how it works takes a lot of learning.

Elasticsearch uses Lucene as its internal engine, but when using it for full-text search, you only need to use a unified API without understanding the complex Lucene operating principles behind it.

Elasticsearch is more than a simpler front end to Lucene, though. Beyond full-text search, it also provides:

Distributed real-time document storage, with every field indexed and searchable.
Distributed search engine with real-time analytics.
It can scale to hundreds of servers to handle petabytes of structured or unstructured data.

With all of these features packaged into a single server, you can talk to Elasticsearch through its RESTful API using a client library or any programming language you prefer.

Getting started with Elasticsearch is simple. It ships with sensible defaults, so beginners do not have to deal with complex theory up front.

It works right after installation, and you can be productive with very little learning effort.
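As a minimal sketch of what talking to the RESTful API can look like from Python, the snippet below builds (but does not send) a full-text search request using only the standard library. The index name "articles" and the field "title" are hypothetical; localhost:9200 is the default HTTP address of a local node, and the commented-out lines assume such a node is running.

```python
import json
import urllib.request


def build_search_request(base_url, index, field, text):
    """Build (but do not send) a full-text search request for Elasticsearch."""
    # A `match` query runs an analyzed full-text search against one field.
    body = {"query": {"match": {field: text}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical index and field names, chosen only for illustration.
request = build_search_request(
    "http://localhost:9200", "articles", "title", "search engine"
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To actually send it (requires a running node):
# with urllib.request.urlopen(request) as response:
#     hits = json.load(response)["hits"]["hits"]
```

Because the request is plain HTTP with a JSON body, the same call can be made from curl or any other language with an HTTP client.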

As you learn more, you can also take advantage of more advanced features of Elasticsearch, and the entire engine can be configured flexibly. You can customize your own Elasticsearch according to your own needs.

Use Cases:

Wikipedia uses Elasticsearch to do full-text searches and highlight keywords, as well as search suggestions such as search-as-you-type and did-you-mean.
The Guardian uses Elasticsearch to process visitor logs so that editors can be informed of public reactions to different articles in real time.
Stack Overflow combines full-text search with geolocation queries and uses more-like-this to find related questions.
GitHub uses Elasticsearch to retrieve more than 130 billion lines of code.
Every day, Goldman Sachs uses it to index 5TB of data, and many investment banks use it to analyze stock market movements.

But Elasticsearch is not just for large enterprises; it has also helped many startups, such as DataDog and Klout, expand their capabilities.

Pros and cons of Elasticsearch:

Pros

Elasticsearch is distributed out of the box, with no additional components required. Replication happens in real time, an approach known as "push replication".
Elasticsearch fully supports Apache Lucene's near-real-time search.
Handling multitenancy requires no special configuration, while Solr requires more advanced settings.
Elasticsearch uses the concept of a gateway, which makes backups easier.
Each node forms an equal network structure, and when some nodes fail, other nodes are automatically assigned to work in their place.
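The failover behavior described in the last point can be pictured with a toy model. This is only an illustrative sketch, not Elasticsearch's actual allocation algorithm: the node names, the shard table layout, and the rule "first node in the list is the primary" are all assumptions made for the example.

```python
def fail_node(shard_copies, failed_node):
    """Return a new shard table with `failed_node` removed.

    `shard_copies` maps a shard id to the ordered list of nodes holding a
    copy; the first node in each list is treated as the primary.
    """
    survivors = {}
    for shard, nodes in shard_copies.items():
        remaining = [node for node in nodes if node != failed_node]
        if not remaining:
            raise RuntimeError(f"shard {shard} lost all copies")
        # remaining[0] is the (possibly newly promoted) primary.
        survivors[shard] = remaining
    return survivors


# Hypothetical three-node cluster, each shard stored on two nodes.
cluster = {
    0: ["node-a", "node-b"],
    1: ["node-b", "node-c"],
    2: ["node-c", "node-a"],
}
after_failure = fail_node(cluster, "node-b")
print(after_failure)
```

In this model, shard 1 loses its primary on node-b and the copy on node-c takes over, which is the idea behind other nodes working in place of a failed one.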

Cons

Only one main developer (no longer the case: the Elasticsearch GitHub organization has grown well beyond that and has quite active maintainers)
Not automatic enough (no longer applies now that the Index Warmup API exists)

About Solr

Solr (pronounced "solar") is an open-source enterprise search platform from the Apache Lucene project. Its main features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich-document handling (e.g., Word, PDF). Solr is highly scalable and provides distributed search and index replication. Solr is the most popular enterprise-grade search engine, and Solr 4 also added NoSQL support.

Solr is a standalone full-text search server written in Java that runs in a servlet container such as Apache Tomcat or Jetty. Solr uses the Lucene Java search library as its core for full-text indexing and search, and exposes REST-like HTTP/XML and JSON APIs. Solr's powerful external configuration makes it easy to adapt to many types of applications without writing Java code. Solr also has a plugin architecture to support more advanced customization.

Since the Apache Lucene and Apache Solr projects merged in 2010, both have been developed by the same Apache Software Foundation team. When referring to the technology or products, Lucene/Solr and Solr/Lucene mean the same thing.
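As a rough illustration of the REST-like HTTP API, the sketch below builds a query URL for Solr's standard /select handler and asks for a JSON response. The core name "articles" and the field "title" are hypothetical; port 8983 is Solr's default, and nothing is sent over the network here.

```python
from urllib.parse import urlencode


def build_solr_select_url(base_url, core, query, rows=10):
    """Build a URL for Solr's standard /select search handler."""
    # q is the query string, rows limits the result count,
    # and wt selects the response format.
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Hypothetical core and field names, chosen only for illustration.
url = build_solr_select_url("http://localhost:8983", "articles", "title:lucene")
print(url)
```

The resulting URL can be opened with any HTTP client, which is what makes Solr usable from applications without Java coding.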

Pros and cons of Solr:

Pros

Solr has a larger and more mature community of users, developers, and contributors.
Supports indexing content in many formats, including HTML, PDF, Microsoft Office formats, and plain-text formats such as JSON, XML, and CSV.
Solr is relatively mature and stable.
When no indexing is happening alongside searching, Solr is faster.

Cons

While an index is being built, search efficiency drops, so performance for real-time indexing and search is poor.

Elasticsearch vs Solr

Solr is faster when simply searching for existing data.

When indexing in real time, Solr suffers from IO blocking and query performance degrades; here Elasticsearch has a clear advantage.

As the amount of data increases, Solr's search efficiency drops, while Elasticsearch's does not change significantly.

In summary, Solr's architecture is not suitable for real-time search applications.

Real-world production testing

The figure below shows a 50x increase in average query speed after switching from Solr to Elasticsearch.

Summary of Elasticsearch vs Solr comparison

Both are easy to install;
Solr leverages Zookeeper for distributed management, while Elasticsearch itself has distributed orchestration management;
Solr supports more data formats, while Elasticsearch only supports JSON;
Solr officially provides more features, while Elasticsearch itself focuses on core functionality and leaves most advanced features to third-party plugins;
Solr outperforms Elasticsearch in traditional search applications, but is significantly less efficient than Elasticsearch when handling real-time search applications.
Solr is a powerful solution for traditional search applications, but Elasticsearch is better suited for emerging real-time search applications.
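The JSON-only point in the summary above means that data in other formats has to be converted before Elasticsearch can index it. A minimal sketch, assuming a CSV source and the newline-delimited JSON body used by the _bulk API; the index name "articles" and the sample rows are made up for illustration, and all CSV values stay strings in this simple version.

```python
import csv
import io
import json


def csv_to_bulk_body(csv_text, index):
    """Convert CSV text into a newline-delimited JSON body for the _bulk API."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Each document takes two lines: an action line, then its source.
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(row))
    # The bulk body has to end with a newline.
    return "\n".join(lines) + "\n"


# Made-up sample data, only to show the shape of the output.
sample = "id,title\n1,Intro to Lucene\n2,Scaling search\n"
print(csv_to_bulk_body(sample, "articles"), end="")
```

The resulting text would be sent as the body of a POST to the _bulk endpoint; with Solr, by contrast, the CSV could be submitted as-is.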

Other Lucene-based open source search engine solutions

1: Use Lucene directly

Note: Lucene is a Java search library. It is not a complete solution on its own and requires additional development effort.

Advantages: A mature solution with many successful deployments. An Apache top-level project that continues to advance rapidly. A large, active development community. Because it is just a class library, it leaves plenty of room for customization and optimization: with simple customization it meets most common needs, and with optimization it can support search at the 1 billion+ scale.

Cons: Requires additional development effort. Scaling, distribution, reliability, and so on all have to be implemented yourself. It is not real-time: there is a delay between indexing and searching, and the scalability of the current "Lucene Near Real Time search" approach still needs improvement.
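The delay between indexing and searching can be shown with a toy model. This is only a sketch of the general idea, not Lucene's implementation: the class name and its refresh() step are hypothetical stand-ins for the point at which newly indexed documents become visible to searches.

```python
class NearRealTimeIndex:
    """Toy index: added documents become searchable only after refresh()."""

    def __init__(self):
        self._buffer = []      # indexed, but not yet visible to search
        self._searchable = []  # visible to search

    def add(self, doc):
        self._buffer.append(doc)

    def refresh(self):
        # Make everything buffered so far visible, then empty the buffer.
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, word):
        return [doc for doc in self._searchable if word in doc.lower().split()]


index = NearRealTimeIndex()
index.add("Lucene is a search library")
print(index.search("lucene"))  # [] because nothing has been refreshed yet
index.refresh()
print(index.search("lucene"))  # ['Lucene is a search library']
```

The shorter the interval between refreshes, the closer the system gets to real-time search, at the cost of doing the refresh work more often.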


2: Katta

Note: A Lucene-based search solution that is distributed, scalable, fault-tolerant, and near-real-time.

Pros: Works out of the box and can be combined with Hadoop for distribution. It has built-in scaling and fault-tolerance mechanisms.

Disadvantages: It is only a search solution, so you still need to implement the indexing part yourself. Its search functionality covers only the most basic needs. There are fewer success stories, and the project is somewhat less mature. Because it has to support distribution, it is difficult to customize for some complex query requirements.


3: Hadoop contrib/index

Note: A distributed indexing solution that works in Map/Reduce mode and can be used together with Katta.

Advantages: Distributed indexing and scalability.

Disadvantages: It covers only indexing and does not implement search. It works in batch mode, with poor support for real-time search.
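To make the Map/Reduce indexing idea concrete, here is a minimal single-process sketch. It is not the Hadoop contrib/index code: the function names and sample documents are made up, and a real job would run the map and reduce phases across many machines.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per distinct term in a document."""
    return [(term, doc_id) for term in sorted(set(text.lower().split()))]


def reduce_phase(pairs):
    """Group the emitted pairs by term into sorted posting lists."""
    postings = defaultdict(set)
    for term, doc_id in pairs:
        postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Made-up documents, only to show the data flow.
docs = {1: "distributed search", 2: "distributed indexing"}
pairs = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
inverted_index = reduce_phase(pairs)
print(inverted_index["distributed"])  # [1, 2]
```

Since the index only exists once the whole job has finished, this style of indexing is naturally batch-oriented, which matches the weakness noted above.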


4: LinkedIn's open-source solution

Description: A family of Lucene-based solutions, including Zoie for near-real-time search, Bobo for faceted search, Decomposer for machine learning algorithms, Krati for summary storage, Sensei for database-style schema wrapping, and more.

Advantages: A proven solution that is distributed and scalable, with a rich feature set.

Cons: Too closely tied to LinkedIn, with poor customizability.


5: Lucandra

Note: Based on Lucene, with the index stored in the Cassandra database.

Pros: See the advantages of Cassandra.

Cons: See the disadvantages of Cassandra. In addition, this is only a demo and has not been extensively validated.


6: HBasene

Note: Based on Lucene, with the index stored in the HBase database.

Benefits: See the advantages of HBase.

Disadvantages: See the disadvantages of HBase. In addition, the implementation stores Lucene terms as rows and the posting list for each term as columns. As the posting list for a single term grows, query speed suffers badly.
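The row-and-column layout described above can be modeled with a dictionary of dictionaries. This is a toy sketch and not the HBase API: the helper names, the stored frequency value, and the sample postings are assumptions made for illustration.

```python
def add_posting(table, term, doc_id, frequency):
    """Store one posting as a column (doc_id) in the row keyed by the term."""
    table.setdefault(term, {})[doc_id] = frequency


def docs_matching_all(table, terms):
    """AND query: every column of every queried term's row has to be read."""
    rows = [set(table.get(term, {})) for term in terms]
    return sorted(set.intersection(*rows)) if rows else []


table = {}
for doc_id in range(1, 6):
    # A common term appears in every document, so its row keeps getting wider.
    add_posting(table, "search", doc_id, 1)
add_posting(table, "hbase", 3, 2)

print(len(table["search"]))                           # 5
print(docs_matching_all(table, ["search", "hbase"]))  # [3]
```

The row for a frequent term gains one column per matching document, so reading it costs more as the collection grows, which is the slowdown the text describes.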


7: Xunsearch

Note: Xunsearch uses a layered design consisting of a back-end service and a front-end development kit, with clearly separated layers that do not overlap. The back end is a daemon written in C/C++, while the front end uses PHP, the most popular scripting language, which is convenient for web search projects. For details, see Architecture Design.

