Real-Time Searching of Big Data with Solr and Hadoop
Rod Cope, CTO & Founder, OpenLogic, Inc.
Agenda
Introduction
The Problem
The Solution
Lessons Learned
Public Clouds & Big Data
Final Thoughts
Q&A
OpenLogic, Inc.
Introduction
Rod Cope
CTO & Founder of OpenLogic
25 years of software development experience
IBM Global Services, Anthem, General Electric
OpenLogic
Open Source Provisioning, Support, and Governance Solutions
Certified library w/SLA support on 500+ Open Source packages
http://olex.openlogic.com
The Problem
Big Data
All the world's Open Source Software
Metadata, code, indexes
Individual tables contain many terabytes
Relational databases aren't scale-free
Growing every day
Need real-time random access to all data
Long-running and complex analysis jobs
The Solution
Hadoop, HBase, and Solr
Hadoop: distributed file system
HBase: NoSQL data store, column-oriented
Solr: search server based on Lucene
All are scalable, flexible, fast, well-supported, and used in production environments
Solution Architecture
[Architecture diagram. Components shown: Web Browser, Scanner Client, Nginx & Unicorn, Ruby on Rails, Resque Workers, Stargate, MySQL, Solr, HBase. Two links are labeled "Live replication".]
How do I find my data if the primary key won't cut it? Solr to the rescue
Very fast, highly scalable search server with built-in sharding and replication, based on Lucene
Dynamic schema, powerful query language, faceted search
Accessible via simple REST-like web API w/XML, JSON, Ruby, and other data formats
Solr
Sharding
Query any server; it executes the same query against all other servers in the group
Returns aggregated result to original caller
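To make the sharding behavior concrete, here is a minimal sketch of how a caller might build a distributed query using Solr's `shards` request parameter, which lists the servers the receiving node should fan the query out to. The second host name (`solr2`), the `content` field, and the class and method names are illustrative assumptions, not OpenLogic's actual setup.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;

final class ShardedQueryUrl {
    // Builds a distributed-search URL. The server at entryPoint runs the query
    // against every shard named in the "shards" parameter and merges the results.
    static String build(String entryPoint, List<String> shards, String query) {
        String shardList = String.join(",", shards);
        return entryPoint + "/select"
                + "?q=" + URLEncoder.encode(query, StandardCharsets.UTF_8)
                + "&shards=" + URLEncoder.encode(shardList, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical shard addresses; shards are given as host:port/path, without a scheme.
        List<String> shards = List.of("solr1:8080/solr/master", "solr2:8080/solr/master");
        System.out.println(build("http://solr1:8080/solr/master", shards, "content:hadoop"));
    }
}
```

Any server in the group could serve as the entry point, which is what lets a load balancer sit in front of the whole farm.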
OpenLogic
Solr farm, sharded, cross-replicated, fronted with HAProxy
Load balanced writes across masters, reads across masters and slaves
Be careful not to over-commit
Billions of lines of code in HBase, all indexed in Solr for real-time search in multiple ways
Over 20 Solr fields indexed per source file
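The routing policy above (writes go only to masters, reads go to masters and slaves) can be sketched in a few lines. In the deck this job is done by HAProxy, not application code; the class below is only a hypothetical illustration of the policy, and round-robin is an assumed balancing strategy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class SolrRouter {
    private final List<String> masters;
    private final List<String> readers; // masters followed by slaves
    private final AtomicInteger writeCursor = new AtomicInteger();
    private final AtomicInteger readCursor = new AtomicInteger();

    SolrRouter(List<String> masters, List<String> slaves) {
        this.masters = List.copyOf(masters);
        List<String> all = new ArrayList<>(masters);
        all.addAll(slaves);
        this.readers = List.copyOf(all);
    }

    // Writes rotate across masters only.
    String nextForWrite() { return pick(masters, writeCursor); }

    // Reads rotate across masters and slaves.
    String nextForRead() { return pick(readers, readCursor); }

    private static String pick(List<String> pool, AtomicInteger cursor) {
        // floorMod keeps the index valid even after the counter overflows.
        return pool.get(Math.floorMod(cursor.getAndIncrement(), pool.size()));
    }

    public static void main(String[] args) {
        // Hypothetical server names.
        SolrRouter router = new SolrRouter(List.of("master-a", "master-b"), List.of("slave-a"));
        System.out.println("write -> " + router.nextForWrite());
        System.out.println("read  -> " + router.nextForRead());
    }
}
```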
[Diagram labels: Machine 1; Masters: Solr Core A; Slaves: Solr Core Z, Solr Core A, Solr Core B, Solr Core Y]
Write Flow
[Diagram labels: HAProxy, HAProxy; Machine 1; Masters: Solr Core A; Slaves: Solr Core Z, Solr Core A, Solr Core B, Solr Core Y]
Read Flow
[Diagram labels: HAProxy, HAProxy; Machine 1; Masters: Solr Core A; Slaves: Solr Core Z, Solr Core A, Solr Core B, Solr Core Y]
Delete Flow
[Diagram labels: HAProxy, HAProxy; Machine 1; Masters: Solr Core A; Slaves: Solr Core Z, Solr Core A, Solr Core B, Solr Core Y]
When you're done with the massive initial load/import, turn it back down for search performance
Minimize number of queries
Start with something like 5
Example:
curl 'http://solr1:8080/solr/master/update?optimize=true&maxSegments=5'
This can take a few minutes, so you might need to adjust various timeouts
Note that a small merge factor will hurt indexing performance if you need to do massive loads on a frequent basis or continuous indexing
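Because an optimize call can run for minutes, a client that triggers it may need a request timeout well above its default. The sketch below builds the same optimize request with Java's standard `java.net.http` client (Java 11+) and an explicit timeout. The ten-minute value and the class name are assumptions for illustration; server-side and proxy timeouts may need adjusting as well.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

final class OptimizeRequest {
    // Builds a GET request that asks Solr to optimize the index down to
    // maxSegments segments, failing if no response arrives within the timeout.
    static HttpRequest build(String coreUrl, int maxSegments, Duration timeout) {
        URI uri = URI.create(coreUrl + "/update?optimize=true&maxSegments=" + maxSegments);
        return HttpRequest.newBuilder(uri).timeout(timeout).GET().build();
    }

    public static void main(String[] args) {
        // Assumed timeout of 10 minutes; tune to the size of the index.
        HttpRequest request = build("http://solr1:8080/solr/master", 5, Duration.ofMinutes(10));
        System.out.println(request.method() + " " + request.uri());
        // To actually send it:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```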
At OpenLogic, we spread raw source data across many machines and hard drives via NFS
Be very careful with NFS configuration; it can hang machines
}

public List filterLongerThan( List list, int length ) {
    List result = new ArrayList();
    Iterator iter = list.iterator();
    while ( iter.hasNext() ) {
        String item = (String) iter.next();
        if ( item.length() <= length ) {
            result.add( item );
        }
    }
    return result;
}
Groovy
list = ["Rod", "Neeta", "Eric", "Missy"]
shorts = list.findAll { name -> name.size() <= 4 }
println shorts.size()
shorts.each { name -> println name }

-> 2
-> Rod Eric
Cutting Edge
Hadoop
SPOF around Namenode, append functionality
HBase
Backup, replication, and indexing solutions in flux
Solr
Several competing solutions around cloud-like scalability and fault-tolerance, including ZooKeeper and Hadoop integration
No clear winner, none quite ready for production
Configuration is Key
Many moving parts
It's easy to let typos slip through
Consider automated configuration via Chef, Puppet, or similar
Commodity Hardware
Commodity hardware != 3 year old desktop
Dual quad-core, 32GB RAM, 4+ disks
Don't bother with RAID on Hadoop data disks
Be wary of non-enterprise drives
Hadoop datanode gets remaining drives
Redundant enterprise switches
Dual- and quad-gigabit NICs
Operating System
Kernel panics, zombie processes, dropped packets
Software Servers
Hadoop datanodes, HBase regionservers, Stargate servers, Solr servers
Final Thoughts
You can host your own big data
Tools are available today that didn't exist a few years ago
Fast to prototype; production readiness takes time
Expect to invest in training and support
Q&A