Apache Solr is a powerful open-source search platform. In this article, we discuss how to set up a Solr cluster (SolrCloud) on Hadoop so that the Solr index data is stored in HDFS and made available for search.
Terminology
For more information on Solr terminology, please refer to https://wiki.apache.org/solr/SolrTerminology
Pre-requisites
Architecture
The illustration below depicts how the various elements are distributed across the machines in the cluster. For our purposes, each machine is configured to work as both a Hadoop data node and a SolrCloud node. However, these roles can be placed on separate machines altogether, depending on the business need.
During this demonstration, we shall create a Solr collection called ‘Coforge-solr-collection’ which contains 3 shards with a replication factor of 2. That gives 3 × 2 = 6 shard replicas in total; as we have three machines in our SolrCloud cluster, each machine will host a maximum of 2 shard replicas.
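The arithmetic behind that per-machine limit can be sketched in a few lines of shell. The function name `cores_per_node` is a hypothetical helper for illustration, not part of Solr, and it assumes the replicas are spread as evenly as possible across the machines.

```shell
# Hypothetical helper: smallest per-node replica limit that still fits
# NUM_SHARDS x REPLICATION_FACTOR replicas on NUM_NODES machines,
# assuming the replicas are spread as evenly as possible.
cores_per_node() {
  num_shards=$1
  replication_factor=$2
  num_nodes=$3
  total=$(( num_shards * replication_factor ))
  # Ceiling division: round up when the replicas do not divide evenly.
  echo $(( (total + num_nodes - 1) / num_nodes ))
}

# 3 shards x 2 replicas on 3 machines
cores_per_node 3 2 3
```

For the cluster in this article the result is 2, which is the value later passed as maxShardsPerNode when the collection is created.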
Steps
Upload configuration to Zookeeper
Note: As this article covers only the setup of a SolrCloud cluster, only solrconfig.xml is considered. However, when uploading configuration information to Zookeeper, schema.xml also has to be modified and uploaded as per the schema requirements. Modifications to schema.xml are not covered in this blog.
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://name_node:8020/solr_location</str>
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
  <int name="solr.hdfs.blockcache.slab.count">1</int>
  <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
  <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
  <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
  <bool name="solr.hdfs.blockcache.write.enabled">true</bool>
  <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
  <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
  <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
</directoryFactory>
<lockType>hdfs</lockType>
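As a rough sizing aid for the block cache settings above, the sketch below estimates how much memory the cache will claim. It assumes the 8 KB block size that the Solr documentation describes for the HDFS block cache (so 16384 blocks give one 128 MB slab); `blockcache_mb` is a hypothetical helper name. Because direct memory allocation is enabled, this memory is taken off-heap, so the JVM's -XX:MaxDirectMemorySize setting may need to be at least this large.

```shell
# Hypothetical helper: estimated block cache size in MB.
blockcache_mb() {
  slab_count=$1
  blocks_per_bank=$2
  block_kb=8   # assumed block size in KB
  echo $(( slab_count * blocks_per_bank * block_kb / 1024 ))
}

# Values from the solrconfig.xml fragment above: 1 slab of 16384 blocks.
blockcache_mb 1 16384
```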
unzip SOLR_HOME/dist/solr-4.10.2.war -d TEMP
cp TEMP/WEB-INF/lib/* SOLR_LIB/
cp SOLR_HOME/example/lib/ext/* SOLR_LIB/
Note: The complete conf folder that ships with the example server in the Solr installation is uploaded. However, only solrconfig.xml has been modified for our purposes.
java -classpath ".:SOLR_LIB/*" org.apache.solr.cloud.ZkCLI -cmd upconfig -zkhost slave1:2181,slave2:2181,slave3:2181 -confdir SOLR_HOME/example/solr/collection1/conf -confname Coforge_solr_conf
java -classpath ".:SOLR_LIB/*" org.apache.solr.cloud.ZkCLI -cmd linkconfig -collection Coforge-solr-collection -confname Coforge_solr_conf -zkhost slave1:2181,slave2:2181,slave3:2181
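A stray space inside the Zookeeper connection string splits it into separate command-line arguments, so it can help to build the string programmatically. This is a minimal sketch, assuming the three Zookeeper hosts from this article all listen on port 2181; `zk_connect_string` is a hypothetical helper, not part of Solr.

```shell
# Hypothetical helper: join hosts into "host:port,host:port,..." with no spaces.
zk_connect_string() {
  port=$1
  shift
  out=""
  for host in "$@"; do
    # Prefix a comma only when something is already in the list.
    out="${out:+$out,}$host:$port"
  done
  printf '%s\n' "$out"
}

ZK_HOST=$(zk_connect_string 2181 slave1 slave2 slave3)
echo "$ZK_HOST"
```

The resulting value can be passed to -zkhost in the commands above and to -DzkHost in the Tomcat settings.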
Tomcat and solr.xml changes
The following changes and installations are to be made on ALL the Solr node machines. For simplicity, the changes below are described from the perspective of Slave 1. These steps are to be repeated for each machine in the Solr cluster.
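#!/bin/sh
# Replace YOUR_JAVA_HOME with the path to the JDK installation.
JAVA_HOME=YOUR_JAVA_HOME
# JVM mode, heap and PermGen sizes, garbage collector.
JAVA_OPTS="$JAVA_OPTS -server"
JAVA_OPTS="$JAVA_OPTS -Xms128m -Xmx2048m"
JAVA_OPTS="$JAVA_OPTS -XX:PermSize=64m -XX:MaxPermSize=128m -XX:+UseG1GC"
# Solr settings for this node; the zkHost list must not contain spaces.
SOLR_OPTS="-Dsolr.solr.home=SOLR_CORE_HOME -Dhost=slave1 -Dport=8080 \
-DhostContext=solr -DzkClientTimeout=20000 -DzkHost=slave1:2181,slave2:2181,slave3:2181"
JAVA_OPTS="$JAVA_OPTS $SOLR_OPTS"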
<?xml version="1.0" encoding="UTF-8" ?>
<solr>
  <solrcloud>
    <str name="host">${host:}</str>
    <int name="hostPort">${port:}</int>
    <str name="hostContext">${hostContext:}</str>
    <int name="zkClientTimeout">${zkClientTimeout:}</int>
    <bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
  </solrcloud>
  <shardHandlerFactory name="shardHandlerFactory"
                       class="HttpShardHandlerFactory">
    <int name="socketTimeout">${socketTimeout:0}</int>
    <int name="connTimeout">${connTimeout:0}</int>
  </shardHandlerFactory>
</solr>
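The ${name:default} placeholders in solr.xml are resolved from JVM system properties, which is why the -D options in SOLR_OPTS matter. The sketch below uses a hypothetical `get_prop` helper to show which value each placeholder would pick up from a given options string. It assumes that property values contain no spaces and is only an illustration, not how Solr itself parses its options.

```shell
# Hypothetical helper: print the value of -DNAME=VALUE from an options string.
get_prop() {
  name=$1
  opts=$2
  # The options string is deliberately left unquoted to split it on whitespace.
  for tok in $opts; do
    case "$tok" in
      "-D$name="*) printf '%s\n' "${tok#*=}" ;;
    esac
  done
}

SOLR_OPTS="-Dhost=slave1 -Dport=8080 -DhostContext=solr -DzkClientTimeout=20000"
for prop in host port hostContext zkClientTimeout; do
  echo "$prop=$(get_prop "$prop" "$SOLR_OPTS")"
done
```

A property that is not set on the command line falls back to the default after the colon, for example `true` for genericCoreNodeNames.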
SolrCloud – Collection and Shards
Now that the configuration changes have been made and Tomcat has been installed, let us create the collection and its shards on Solr by calling the appropriate REST APIs.
curl 'http://slave1:8080/solr/admin/collections?action=CREATE&name=Coforge-solr-collection&numShards=3&replicationFactor=2&maxShardsPerNode=2'
Shard 1 – Slave 1 & Slave 2
Shard 2 – Slave 2 & Slave 3
Shard 3 – Slave 1 & Slave 3
curl 'http://slave1:8080/solr/admin/cores?action=CREATE&name=shard1-replica-1&collection=Coforge-solr-collection&shard=shard1'
curl 'http://slave2:8080/solr/admin/cores?action=CREATE&name=shard1-replica-2&collection=Coforge-solr-collection&shard=shard1'
curl 'http://slave2:8080/solr/admin/cores?action=CREATE&name=shard2-replica-1&collection=Coforge-solr-collection&shard=shard2'
curl 'http://slave3:8080/solr/admin/cores?action=CREATE&name=shard2-replica-2&collection=Coforge-solr-collection&shard=shard2'
curl 'http://slave3:8080/solr/admin/cores?action=CREATE&name=shard3-replica-1&collection=Coforge-solr-collection&shard=shard3'
curl 'http://slave1:8080/solr/admin/cores?action=CREATE&name=shard3-replica-2&collection=Coforge-solr-collection&shard=shard3'
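The six calls above follow a regular pattern, so they can also be generated from the placement table. This is a minimal sketch, assuming the same host names, port 8080 and core naming used in this article; `print_core_create_commands` is a hypothetical helper that only prints the curl commands so they can be reviewed before being run.

```shell
# Hypothetical helper: print one CoreAdmin CREATE call per shard replica.
print_core_create_commands() {
  collection=$1
  # Each entry is SHARD_NUMBER:FIRST_HOST:SECOND_HOST.
  for entry in 1:slave1:slave2 2:slave2:slave3 3:slave3:slave1; do
    shard=${entry%%:*}
    hosts=${entry#*:}
    replica=1
    for host in ${hosts%%:*} ${hosts#*:}; do
      printf "curl 'http://%s:8080/solr/admin/cores?action=CREATE&name=shard%s-replica-%s&collection=%s&shard=shard%s'\n" \
        "$host" "$shard" "$replica" "$collection" "$shard"
      replica=$(( replica + 1 ))
    done
  done
}

print_core_create_commands Coforge-solr-collection
```

Piping the output to sh would execute the generated commands.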
Once all the commands have run, SolrCloud should be set up. To check that everything is set up correctly, open the URL http://slave1:8080/solr and click on Cloud in the side navigation bar. The SolrCloud cluster view is displayed on the right, showing all the shards and their corresponding replicas.
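The same check can be scripted against the Collections API. The sketch below assumes the CLUSTERSTATUS action (available in Solr 4.8 and later) and assumes that each replica entry in its JSON response carries a "core" field. `count_replicas` is a hypothetical helper, and the sample JSON is made up for illustration rather than captured from a real cluster.

```shell
# Hypothetical helper: count replica entries in CLUSTERSTATUS JSON read from stdin
# by counting occurrences of the "core" key.
count_replicas() {
  grep -o '"core":' | wc -l | tr -d ' '
}

# Against a live cluster (not run here):
#   curl -s 'http://slave1:8080/solr/admin/collections?action=CLUSTERSTATUS&wt=json' | count_replicas

# Made-up sample with two replica entries:
sample='{"replicas":{"core_node1":{"state":"active","core":"shard1-replica-1"},"core_node2":{"state":"active","core":"shard1-replica-2"}}}'
printf '%s\n' "$sample" | count_replicas
```

For the collection created in this article, a healthy cluster would be expected to report 6 replicas.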