Demystifying Elasticsearch

Is your app’s searching capability not scaling to meet added load?

Do your searches slow as your transactional data volume grows?

Do you have complex, hard-to-maintain stored procedures that function as your main search components?

If you answered yes to any of these questions, you may have spent some time looking at search technologies such as Elasticsearch, Solr, Lucene, and SQL Server Full-Text Search. These can be intimidating because of the perceived difficulty of deployment and configuration, so perhaps you just set up SQL Server Full-Text Search and went about your business. While this may have met your immediate needs, familiar roadblocks often rear their ugly heads again:

  • SQL Server deployments are difficult to scale out
  • There is very limited ability to configure indexes

I’m here to tell you that configuring a scalable, redundant Elasticsearch environment is not that difficult; it mostly takes a little planning up front. In this article, I will go over the high-level concepts of Elasticsearch and how they fit together.

Elasticsearch Cluster

An Elasticsearch environment is known as a cluster. A cluster is made up of one or more nodes (servers), which store your data. This architecture is highly scalable out of the box: you can add nodes to your cluster as your app and data volume grow.

Elasticsearch Documents

Documents are the root of everything in Elasticsearch: they represent your data! Instead of being governed by a rigid structure of columns and rows like a database table, your data is stored in its native structure as JSON. Each document is one instance of your data, for example a single user:

{
  "Email": "james@example.com",
  "Active": true,
  "CreatedDate": "2016-01-20T00:00:00Z",
  "Roles": [
    "User",
    "Admin"
  ]
}
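
As a rough sketch of how a document like this reaches Elasticsearch, the snippet below builds (but does not send) the HTTP request that would store it. It assumes a node listening on localhost:9200, a hypothetical index named "users", and a recent version (7.x or later) where documents are addressed as /<index>/_doc/<id>; none of these come from the article itself.

```python
import json
from urllib.request import Request

# Assumption: a local node on the default HTTP port.
BASE_URL = "http://localhost:9200"


def build_index_request(index, doc_id, document):
    """Build (but do not send) a PUT request that stores one JSON document."""
    body = json.dumps(document).encode("utf-8")
    return Request(
        url=f"{BASE_URL}/{index}/_doc/{doc_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# The same user document as above, written as a Python dict.
user = {
    "Email": "james@example.com",
    "Active": True,
    "CreatedDate": "2016-01-20T00:00:00Z",
    "Roles": ["User", "Admin"],
}

# "users" and the ID 1 are hypothetical names chosen for illustration.
request = build_index_request("users", 1, user)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would actually send it; that step is left out so the sketch runs without a cluster.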

Elasticsearch Indexes

To understand how Elasticsearch scales out, we first have to talk about indexes. An Elasticsearch index is where your documents are stored. Think of it as something similar to a table in your database, though the analogy is a loose one. Documents of the same kind go in the same index, and fundamentally different documents get a separate index, just like having multiple tables. Because a single index can end up holding massive amounts of data, a common practice is to partition it into several indexes along a dimension such as date. For example, if you are storing logging data, you could store each day’s logs in a separate index and name your indexes something like this:

logs-2016-05-01
logs-2016-05-02
logs-2016-05-03

Elasticsearch’s APIs can search across multiple indexes in a single request, giving you a powerful and seamless search experience.
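
To make the naming scheme concrete, here is a minimal sketch that generates daily index names and the search path that targets several of them at once. The helper names are made up for illustration; the comma-separated list and the wildcard are both forms the search URL accepts.

```python
from datetime import date, timedelta


def daily_index_names(prefix, start, days):
    """Return one index name per day, e.g. logs-2016-05-01."""
    return [
        f"{prefix}-{(start + timedelta(days=i)).isoformat()}"
        for i in range(days)
    ]


def search_path(index_names):
    """Join index names (or wildcard patterns) into a single search path."""
    return "/" + ",".join(index_names) + "/_search"


names = daily_index_names("logs", date(2016, 5, 1), 3)
print(names)
print(search_path(names))               # three explicit daily indexes
print(search_path(["logs-2016-05-*"]))  # wildcard covering the whole month
```

The wildcard form is usually the more convenient one, since new daily indexes are picked up without changing the query.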

When you create an index, you can also configure how that index will function. Mapping your documents is optional but highly recommended: it lets you configure how each field of your document is stored and searched. A few examples are:

  • Defining each field type in your document (int, date, string, etc.)
  • Defining how fields are searched (full text, phrases, tokens, etc.)
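
A mapping for the sample user document might look something like the sketch below. The type choices are assumptions, as is the "Notes" field, which is hypothetical and added only to show a full-text field next to exact-match ones. The layout assumes Elasticsearch 7.x or later; older versions nest the properties under a type name and use "string" where this uses "keyword" and "text".

```python
import json

# Request body you would send when creating the index (assumed 7.x+ layout).
user_mapping = {
    "mappings": {
        "properties": {
            "Email": {"type": "keyword"},     # matched as one exact value
            "Active": {"type": "boolean"},
            "CreatedDate": {"type": "date"},
            "Roles": {"type": "keyword"},     # arrays need no special type
            "Notes": {"type": "text"},        # hypothetical full-text field
        }
    }
}

print(json.dumps(user_mapping, indent=2))
```

The split between "keyword" and "text" is the piece that controls how a field is searched: the former is compared as a whole value, the latter is analyzed into tokens for full-text queries.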

Index Shards & Replication

There are two very important concepts you need to be aware of when creating a new index: shards and replication.

Even after partitioning your indexes, the amount of data stored in one index can become too big for one node to handle by itself. This is where shards come into play. When you shard your index, you break it up into smaller, more manageable pieces that can be spread across multiple nodes. This is important because the combination of multiple Elasticsearch nodes and shards is what gives you the ability to scale out. Additionally, searching becomes distributed across the nodes of the cluster. By default, Elasticsearch versions before 7.0 create five primary shards for each index (7.0 and later create one), but this number can be configured based on the number of nodes and the amount of data you plan on storing.
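
One way to picture how documents end up on a particular shard is a hash of a routing key (by default the document ID) taken modulo the number of primary shards. The sketch below is a simplification of that idea, not Elasticsearch's actual implementation: zlib.crc32 stands in for the hash function Elasticsearch really uses, and the document IDs are made up.

```python
import zlib


def pick_shard(routing_key, number_of_primary_shards):
    """Map a routing key to a shard number in [0, number_of_primary_shards)."""
    # Assumption: crc32 is only a stand-in for Elasticsearch's own hash.
    hashed = zlib.crc32(routing_key.encode("utf-8"))
    return hashed % number_of_primary_shards


# Hypothetical document IDs spread over five primary shards.
for doc_id in ["user-1", "user-2", "user-3", "user-4"]:
    print(doc_id, "-> shard", pick_shard(doc_id, 5))
```

Because the result depends on the shard count, changing that count would send existing IDs to different shards, which is one reason the number of primary shards is chosen when the index is created.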

Because it is so easy to add multiple nodes to your cluster, you will naturally want to use the built-in redundancy that Elasticsearch offers. Replica shards are copies of your index’s primary shards, and Elasticsearch places each copy on a different node from its primary. That means that if one node fails, you lose no data and the cluster still holds a full copy of every shard, which provides high availability. By default, each shard has one replica, but this can be configured based on your requirements.
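
Both knobs are set when the index is created. The sketch below builds a settings body using the real "number_of_shards" and "number_of_replicas" setting names and works out how many shard copies the cluster has to place in total; the helper functions themselves are hypothetical.

```python
import json


def index_settings(primary_shards, replicas):
    """Build the settings portion of a create-index request body."""
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas,
        }
    }


def total_shard_copies(primary_shards, replicas):
    """Each primary shard exists once, plus once per configured replica."""
    return primary_shards * (1 + replicas)


settings = index_settings(5, 1)
print(json.dumps(settings))
print(total_shard_copies(5, 1))  # 5 primaries + 5 replicas = 10
```

With a single node, the replicas have nowhere to go and stay unassigned, so the redundancy only takes effect once the cluster has at least two nodes.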

Summary

Hopefully these high-level concepts of Elasticsearch are a little clearer now and can guide you as you explore building out an Elasticsearch cluster to serve your searching needs. I’ll be back soon to talk about how you can utilize Elasticsearch to dig deep into your data. Happy searching!

 

 
