What is Elasticsearch?
Elasticsearch is a highly scalable, distributed, open-source full-text search and analytics engine. It lets you store, index and search large volumes of data in near real time. Internally, it uses Apache Lucene for indexing and storing data. Below are a few use cases for it.
- Product search for an e-commerce website
- Collecting application logs and transaction data to analyze them for trends and anomalies
- Indexing instance metrics (health, stats), running analytics on them and creating alerts for instance health at regular intervals
- Analytics and business-intelligence applications
Elasticsearch basic concepts
We will be using a few terms when talking about Elasticsearch. Let's look at its basic building blocks.
Near real-time
Elasticsearch is near real-time. This means there is a slight latency (normally about one second) between the time a document is indexed and the time it becomes available for searching.
Cluster
It is a collection of one or more nodes (servers) that together hold all of your data and let you index and search that data across the whole cluster.
Node
It is a single server that is part of your cluster. It can store data and participate in indexing, searching and overall cluster management. A node can come in four different flavours: master, HTTP, data and coordinating/client.
Index
An index is a collection of documents with similar characteristics. It is identified by a name (all lowercase), and that name is used to refer to the index when performing indexing, search, update and delete operations against its documents.
Document
It is a single unit of information that can be indexed.
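In practice a document is expressed as JSON. Here is a minimal Python sketch of what indexing one could look like. The index name products, the field names and the helper index_request are made up for illustration, and the /<index>/_doc/<id> path assumes the 6.x REST convention. The helper only builds the request, it does not talk to a cluster.

```python
import json


def index_request(index, doc_id, document):
    """Build the HTTP method, path and JSON body for indexing one document.

    Hypothetical helper for illustration only; nothing is sent anywhere.
    """
    # Index names must be all lowercase.
    if index != index.lower():
        raise ValueError("index name must be lowercase: %r" % index)
    path = "/{}/_doc/{}".format(index, doc_id)
    body = json.dumps(document, sort_keys=True)
    return "PUT", path, body


# A made-up product document for an e-commerce search index.
method, path, body = index_request(
    "products", "1", {"name": "Blue kettle", "price": 24.5, "in_stock": True}
)
print(method, path)
print(body)
```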
Shards and Replicas
A single index can store billions of documents, which can take up terabytes of storage. That much data may exceed what a single server can store, or may be too much for one server to search efficiently. To solve this problem, Elasticsearch subdivides your index into multiple units called shards.
Replication matters primarily for high availability in case a node or shard fails, and it also lets you scale out your search throughput. By default, Elasticsearch creates each index with 5 shards and 1 replica, which can be configured when the index is created.
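To see how documents end up spread across shards, here is a rough Python sketch. Elasticsearch picks the shard as hash(routing) % number_of_primary_shards, where the routing value defaults to the document id. The real hash is Murmur3; zlib.crc32 below is only a stand-in, and both function names are made up for illustration.

```python
import zlib


def pick_shard(routing_key, num_primary_shards=5):
    """Map a routing key (by default the document id) to a primary shard.

    crc32 is a stand-in hash for illustration; Elasticsearch uses Murmur3.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


def total_shard_copies(num_primary_shards=5, num_replicas=1):
    """Count every shard copy in the index: primaries plus their replicas."""
    return num_primary_shards * (1 + num_replicas)


# Spread a few made-up document ids over the default 5 primary shards.
for doc_id in ["order-1", "order-2", "order-3"]:
    print(doc_id, "-> shard", pick_shard(doc_id))

print("shard copies with defaults:", total_shard_copies())
```

This formula is also why the number of primary shards is fixed once the index exists: changing it would send existing documents to different shards. Replicas do not take part in routing, so their count can still be adjusted later.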
Installing Elasticsearch
Elasticsearch requires Java to run. At the time of writing, Elasticsearch 6.2.x and later require at least Java 8.
Installing Java 8
# Install OpenJDK
sudo apt-get install openjdk-8-jdk
# Or install Oracle JDK
sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get update
sudo apt-get -y install oracle-java8-installer
Installing Elasticsearch with tar file
curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.2.4.tar.gz
tar -xvf elasticsearch-6.2.4.tar.gz
Installing Elasticsearch with package manager
# Import the Elasticsearch public GPG key into apt
wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
# Create the Elasticsearch source list
echo "deb http://packages.elastic.co/elasticsearch/6.x/debian stable main" | sudo tee -a /etc/apt/sources.list.d/elasticsearch-6.x.list
sudo apt-get update
sudo apt-get -y install elasticsearch
Configuring Elasticsearch cluster
Configuration file location if you have downloaded the tar file
vi /[YOUR_TAR_LOCATION]/config/elasticsearch.yml
Configuration file location if you used the package manager to install Elasticsearch
vi /etc/elasticsearch/elasticsearch.yml
Cluster Name
Use a descriptive name for the cluster. Elasticsearch nodes use this name to form and join the cluster.
cluster.name: lineofcode-prod
Node name
This uniquely identifies the node in the cluster.
node.name: ${HOSTNAME}
Custom attributes to node
Add a rack attribute to logically group the nodes that sit in the same data center or on the same physical machine.
node.attr.rack: us-east-1
Network host
The node binds to this hostname or IP address and advertises it to the other nodes in the cluster.
network.host: [_VPN_HOST_, _local_]
To find and join a cluster, a node needs to know the hostnames or IP addresses of at least a few other nodes. This can be set with the discovery.zen.ping.unicast.hosts property.
You can configure the port on which Elasticsearch is accessible over HTTP with the http.port property.
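To see how these settings fit together in one elasticsearch.yml, here is a small Python sketch that renders them as flat key: value lines. The helper render_elasticsearch_yml is hypothetical, the two host IPs are placeholders you would replace with your own nodes, network.host is narrowed to _local_ for a single-machine example, and 9200 is the default HTTP port.

```python
import json


def render_elasticsearch_yml(settings):
    """Render a flat settings dict as elasticsearch.yml text.

    Hypothetical helper for illustration. Lists are written in
    flow style (["a", "b"]), which is valid YAML.
    """
    lines = []
    for key, value in settings.items():
        if isinstance(value, (list, tuple)):
            rendered = json.dumps(list(value))
        else:
            rendered = str(value)
        lines.append("{}: {}".format(key, rendered))
    return "\n".join(lines) + "\n"


settings = {
    "cluster.name": "lineofcode-prod",
    "node.name": "${HOSTNAME}",
    "node.attr.rack": "us-east-1",
    "network.host": ["_local_"],
    # Placeholder addresses of other nodes in the cluster (assumption).
    "discovery.zen.ping.unicast.hosts": ["10.0.0.11", "10.0.0.12"],
    "http.port": 9200,
}

print(render_elasticsearch_yml(settings))
```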
Configuring JVM options (Optional for local/test)
You need to tweak the JVM options to match your hardware configuration. It is advisable to allocate half of the server's total memory to the Elasticsearch heap; the rest is left for Lucene (through the operating system's file system cache) and Elasticsearch threads.
For example, if your server has 8 GB of RAM, set the heap options to:
-Xms4g -Xmx4g
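The half-the-RAM rule is easy to put into a few lines of Python. This is only a sketch: heap_flags is a made-up helper, and the 31 GB cap is an assumption based on the common guidance to keep the heap below roughly 32 GB so the JVM can keep using compressed object pointers.

```python
def heap_flags(total_ram_gb, cap_gb=31):
    """Return -Xms/-Xmx flags sized to half the RAM, up to cap_gb.

    Hypothetical helper; the cap is an assumed safety margin, not an
    official constant.
    """
    if total_ram_gb < 2:
        raise ValueError("need at least 2 GB of RAM for a 1 GB heap")
    heap_gb = min(total_ram_gb // 2, cap_gb)
    # Min and max heap are set to the same value so the heap never resizes.
    return "-Xms{0}g -Xmx{0}g".format(heap_gb)


print(heap_flags(8))
print(heap_flags(128))
```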
Also, to avoid the performance hit of swapping, let Elasticsearch lock its memory with the bootstrap.memory_lock: true property.
Elasticsearch uses the concurrent mark-and-sweep (CMS) garbage collector, and you can change it to G1GC with the following configuration.
-XX:-UseParNewGC
-XX:-UseConcMarkSweepGC
-XX:+UseCondCardMark
-XX:MaxGCPauseMillis=200
-XX:+UseG1GC
-XX:GCPauseIntervalMillis=1000
-XX:InitiatingHeapOccupancyPercent=35
Starting Elasticsearch
sudo service elasticsearch restart
TADA! Elasticsearch is up and running on your local machine.
For a production-grade setup, I recommend reading the following articles.
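If you want to double-check that the node really answers, a quick sketch like the one below can help. It assumes the node listens on http://localhost:9200 (the default when http.port is left unchanged); the function names are made up for illustration. The root endpoint replies with a small JSON document that includes the cluster name and version.

```python
import json
import urllib.request


def summarize_root(body):
    """Extract the cluster name and version from the root endpoint's JSON."""
    info = json.loads(body)
    return info["cluster_name"], info["version"]["number"]


def fetch_root(url="http://localhost:9200"):
    """GET the root endpoint; raises an OSError subclass if unreachable."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    try:
        cluster, version = summarize_root(fetch_root())
        print("Elasticsearch %s is up, cluster: %s" % (version, cluster))
    except (OSError, ValueError, KeyError) as exc:
        # Covers connection errors as well as unexpected response bodies.
        print("Node not reachable or unexpected response:", exc)
```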