CPython Embedded in Solr - Search Solution For Python Lovers With The Speed of Native Java

MontySolr: Embedding CPython in Solr

Roman Chyla, CERN roman.chyla@cern.ch, May 26, 2011

Thursday, May 26, 2011

Why should I care?


- Our challenge is to connect Python and Java
- Without compromises
- We created the MontySolr extension
  - Robust, tested (will be used by our system)
  - But works for any Python application (e.g. Django)
  - And for any C/C++ app that Python understands!
  - Open source (GPL v2)

- Try it out!
  - https://github.com/romanchyla/montysolr


Outline
- Context
- The Challenge
- Key components
  - Available technologies
  - Our approach
  - Problems solved
- Evaluation
- Wrap-up


CERN
- European Organization for Nuclear Research
  - Switzerland, Geneva
- The largest laboratory for High Energy Physics
- Home to the Large Hadron Collider
- 40-50K HEP scientists worldwide

SPIRES
- Stanford Linear Accelerator Center - SLAC
- High-Energy Physics Literature Database
- Started December 1991
  - The first web site outside Europe/CERN
  - The first database on the web

Invenio
- Integrated digital library software behind INSPIRE
- Used by very large institutional repositories
  - http://repositories.webometrics.info/toprep_inst.asp
- Customizable virtual collections
- Flexible management of metadata
  - 3 000 authors per article
- Powerful search engine
  - Incl. citation map analysis
- Written in Python (since 2001)
  - 290 000 lines of code

The Challenge
- HEP scientific community
  - Searches are metadata-oriented
- However, fulltexts are changing the situation
- And we want to provide even better service
  - Bigger volumes of data
  - NLP processing
  - Semantic search

The Challenge
Query: supersymmetry AND author:ellis

[Diagram: Invenio; fulltext:supersymmetry; IDs: 1;2;3;9.... (1-6M IDs)]

1. only IDs, no score = no ranking
2. score merging difficult (if available)
3. push IDs? (e.g. faceting)

What is the best solution?


- We love Python...
- ...and our applications are written in Python...
- But what if Solr is the master search engine?
- Merge results inside Solr?
  - Typical size: 1-10 mil. IDs
  - Expected latency: 1-2 s.
- What we want to achieve:
  - Fast transfer of hits from Invenio to Solr
  - Leverage the power of both (no compromises)
  - Developer-friendly integration, simplicity
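
The merge step asked about above can be sketched roughly as follows. This is only an illustration in plain Python, assuming each engine returns a sorted list of integer record IDs without scores. The function name `intersect_sorted` and the sample IDs are hypothetical and are not part of Invenio or Solr.

```python
def intersect_sorted(left, right):
    """Return IDs present in both sorted lists (boolean AND of two hit lists)."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            result.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical hit lists, one per sub-query: IDs only, no scores.
fulltext_hits = [1, 2, 3, 9, 12, 40]   # e.g. fulltext:supersymmetry
author_hits = [2, 9, 10, 40, 41]       # e.g. author:ellis

merged = intersect_sorted(fulltext_hits, author_hits)
print(merged)
```

Because the lists carry no scores, the merged hits can only come back in ID order, which is the ranking problem noted earlier.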

To embed Solr (in Java app)


- Your app simulates Java web container?
  - use EmbeddedSolrServer
- It knows nothing about Java servlets?
  - use DirectSolrConnection class
- Maybe we are too lazy?
  - Embed the web container (in my case Jetty)
  - Seemed strange (webserver inside webserver)
  - ... but it worked well

To use Solr in non-Java app


- Solr is already usable via HTTP requests, but we need something else here...
- Remote objects/calls?
  - Pyro, execnet, CORBA, SOAP...
  - or simply pipes?
- Access Python from Java?
  - Jython
  - JEPP
- Access Java from Python?
  - JPype
  - JCC
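
For comparison, the plain HTTP route mentioned in the first bullet looks roughly like this. The sketch only builds the request URL for Solr's select handler and sends nothing. `q`, `fl`, `rows` and `wt` are standard Solr request parameters, while the base URL, the helper name `build_select_url` and the field name `id` are assumptions made for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_select_url(base_url, query, fields="id", rows=10):
    """Build a URL for Solr's select handler (no request is sent)."""
    params = {"q": query, "fl": fields, "rows": rows, "wt": "json"}
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical local Solr instance; adjust to the real deployment.
url = build_select_url("http://localhost:8983/solr", "fulltext:supersymmetry")
print(url)

# Round-trip check: the query string decodes back to the original query.
decoded = parse_qs(urlsplit(url).query)
print(decoded["q"][0])
```

With millions of IDs per response, serializing and parsing such a reply over HTTP is presumably the cost that the in-process options listed above try to avoid.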

Jython?
- Implementation of Python in 100% Java
- Both Java and Python code
- Truly multithreaded
- C modules will not work
  - but see http://bit.ly/iTRYbb
- Slower than CPython

JEPP - Java Embedded Python


- Python code runs inside Python interpreter
- Embeds CPython interpreter via Java Native Interface (JNI) in Java
- http://jepp.sourceforge.net/
  - recently updated (27-Jan)
  - but JCC is more active

JCC
- Embeds JVM in Python
- C++ code generator
- C++ object interface wraps a Java library
- C++ wrappers conform to Python's C type system
- result: complete Python extension module

To use Solr in non-Java app


[Comparison table: Jython, JCC, JEPP (?), ... Criteria listed: Python CModules, Speed, No code changes, Access from Python, Access from Java. Cell values not recoverable.]

The first try

[Diagram: Invenio; Solr; JCC]

Devil is in details...


GIL - Global Interpreter Lock

Unfortunately, a Python webapp is not like Java...

We can have 200 threads, but only 4 will run at a time...

Fortunately, a solution exists

- JCC can embed Python inside Java
  - Special thanks to Andi Vajda! (JCC creator)
- We write empty classes in Java ...
- ... and implement them in Python

[Diagram: Python w/ Java inside; Java w/ Python inside]

The second try


[Diagram: Solr w/ Invenio (backend); Invenio frontend; XML; JCC]

Implementing the bridge


- Special Java class
- With method pythonExtension()
- Native method pythonDecRef()
  - JCC provides its implementation
- And a number of other native methods
  - These will be implemented using Python
- Like writing JNI Java/C code but without compilation...


MontySolr extension
- JCC has great potential, but also added complexity...
- So the MontySolr project was born
  - Modules must be built in shared mode
  - JCC dynamic library loaded and started from the main thread
  - Simple mechanism of the Python bridge and message
  - Configurable handlers on the Python side
  - Secured dereferencing of the native objects
  - Threading on the Java side
  - Multiprocessing on the Python side
  - Easy ant targets (compilation) ...

Hello World - Java part


    import org.apache.jcc.PythonVM;

    public class MontySolrBridge extends BasicBridge implements PythonBridge {

        // Handle to the Python object backing this instance
        private long pythonObject;

        public void pythonExtension(long pythonObject) {
            this.pythonObject = pythonObject;
        }

        public long pythonExtension() {
            return this.pythonObject;
        }

        public void finalize() throws Throwable {
            pythonDecRef();
        }

        // Implementation provided by JCC
        public native void pythonDecRef();

        public void sendMessage(PythonMessage message) {
            PythonVM vm = PythonVM.get();
            vm.acquireThreadState();
            try {
                receive_message(message);
            } finally {
                // Release the thread state even if the Python side fails
                vm.releaseThreadState();
            }
        }

        // Implemented on the Python side
        public native void receive_message(PythonMessage message);
    }

Hello World - Python part


    from montysolr import MontySolrBridge

    class SimpleBridge(MontySolrBridge):

        def __init__(self):
            super(SimpleBridge, self).__init__()

        def receive_message(self, message):
            query = message.getParam('query')
            message.setResults('Hello world!')
            print('Python received from Java: %s' % query)

Example - running MontySolr


- Java side
  - JRE (32/64 bit)
  - Standard Solr/Lucene jars
  - JCC dynamic library
- Python side
  - Python interpreter (32/64 bit)
  - 4 Python modules (jcc, solr, lucene, montysolr)
- In the main thread
  - First we load JCC
  - Then start Python interpreter ...
  - ... load Python handlers

Solr as search service


[Diagram: Solr w/ Invenio (backend); Invenio frontend; XML; JCC]

Example

[Diagram: refersto:author:ellis; Solr; MyCustom Handler]

Example - Solr custom handler


    PythonMessage msg = MontySolrVM.INSTANCE
        .createMessage("perform_search")
        .setSender("Invenio")
        .setParam("query", "refersto:author:ellis");

    // Hand the message over to the Python side
    MontySolrVM.INSTANCE.sendMessage(msg);

    Object result = msg.getResults();
    if (result != null) {
        int[] hits = (int[]) result;
    }

Example - JNI connection

[Diagram: refersto:author:ellis; Solr; MyCustom Handler; Python Bridge; Invenio wrappers]

Example - Python side


    # search time - called from Java
    def perform_search(message):
        query = message.getParam('query')
        hits = call_real_search(query)
        # cast Python list into Java array
        message.setResults(JArray_ints(hits))

    # handler is made visible at startup
    SolrpieTarget('Invenio:perform_search', perform_search)
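
The registration and dispatch pattern on this slide can be mimicked in pure Python, without a JVM, to see how a message might find its handler. Everything here is a hypothetical stand-in: `FakeMessage`, `register_target`, `dispatch` and `call_real_search` are illustrative names rather than MontySolr's actual classes, the 'sender:receiver' key format is inferred from the slide, and the returned IDs are made up.

```python
class FakeMessage(object):
    """Stand-in for the Java PythonMessage object (hypothetical)."""

    def __init__(self, sender, receiver, **params):
        self.sender = sender
        self.receiver = receiver
        self._params = dict(params)
        self._results = None

    def getParam(self, name):
        return self._params.get(name)

    def setResults(self, results):
        self._results = results

    def getResults(self):
        return self._results


# Maps 'sender:receiver' keys to Python callables.
_targets = {}


def register_target(key, handler):
    """Make a handler visible under a 'sender:receiver' key."""
    _targets[key] = handler


def dispatch(message):
    """Look up the handler for a message and call it."""
    key = "%s:%s" % (message.sender, message.receiver)
    handler = _targets.get(key)
    if handler is None:
        raise KeyError("no handler registered for %r" % key)
    handler(message)


def call_real_search(query):
    # Placeholder for the real search call; returns made-up IDs.
    return [1, 2, 3, 9]


def perform_search(message):
    query = message.getParam("query")
    message.setResults(call_real_search(query))


register_target("Invenio:perform_search", perform_search)

msg = FakeMessage("Invenio", "perform_search", query="refersto:author:ellis")
dispatch(msg)
print(msg.getResults())
```

In the real setup the message object is created on the Java side and the results are cast to a Java array before being handed back. The sketch skips both steps.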

Example

[Diagram: refersto:author:ellis; Solr; MyCustom Handler; Python Bridge; Invenio wrappers; Invenio (x4)]

Example - Java side again


    PythonMessage msg = MontySolrVM.INSTANCE
        .createMessage("perform_search")
        .setSender("Invenio")
        .setParam("query", "refersto:author:ellis");

    // Hand the message over to the Python side
    MontySolrVM.INSTANCE.sendMessage(msg);

    Object result = msg.getResults();
    if (result != null) {
        int[] hits = (int[]) result;
    }

Solr as search service


[Diagram: Solr w/ Invenio (backend); Apache webserver; XML; Invenio; Invenio; JCC]

[Figure: Memory and garbage collection]

[Figure: Comparing speed and load...]

[Figure: The effect of cache]

Robust?
- Extensive siege tests show very good performance and stability under high load
  - 100-200 users, complex searches
  - 50 concurrent users, citation analysis
  - JCC incurs small overhead
- We detected no memory leaks
  - The same as dbpedia.org
- But watch out for errors in C
  - An error in C module brings down the whole JVM
  - (errors in pure Python module can be handled)
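
The last point can be illustrated with a small wrapper that catches exceptions raised by a pure-Python handler and records them in the message instead of letting them propagate. This is a sketch under assumptions: `safe_handler`, `DictMessage` and the `error` parameter are hypothetical names, not MontySolr's API, and a crash inside a C extension could not be caught this way.

```python
import functools
import traceback


class DictMessage(object):
    """Minimal hypothetical message: a bag of parameters plus results."""

    def __init__(self, **params):
        self.params = dict(params)
        self.results = None


def safe_handler(func):
    """Wrap a handler so Python exceptions are recorded, not propagated."""
    @functools.wraps(func)
    def wrapper(message):
        try:
            func(message)
        except Exception as exc:
            message.results = None
            message.params["error"] = "%s: %s" % (type(exc).__name__, exc)
            message.params["traceback"] = traceback.format_exc()
    return wrapper


@safe_handler
def broken_search(message):
    # Simulates a bug in pure-Python handler code.
    raise ValueError("cannot parse query %r" % message.params["query"])


msg = DictMessage(query="refersto:author:ellis")
broken_search(msg)
print(msg.params["error"])
```

The caller can then inspect the `error` entry and decide how to report the failure, while the process keeps running.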


Easy to develop/maintain?
- Added complexity
  - Java in the toolbox
  - Need to compile C++ extensions
  - Python/OS version dependencies
- For this we get
  - Easy integration with Invenio
  - The best of two applications
  - A lot of features for free
  - And we can control Solr from Python!

Wrap-up
- Our challenge was to connect two different languages/systems
- And we wanted to get the best of the two...
  - So we had to plug Python into Solr
  - And now our Solr knows citation analysis!
- We created the MontySolr extension
  - Robust, tested (will be used by INSPIRE)
  - Works for any Python application (e.g. Django)
  - And for any C/C++ app that Python understands!
  - Free software license
- Try it out! Help us make it better!
  - https://github.com/romanchyla/montysolr

Questions?
- MontySolr
  - https://github.com/romanchyla/montysolr
- Roman Chyla
  - Fellow, CERN Scientific Information Service
  - roman.chyla@cern.ch
  - @rchyla
  - https://svnweb.cern.ch/trac/rcarepo

Additional information


Links
- Invenio platform
  - http://invenio-software.org/
- INSPIRE Digital library
  - http://inspirebeta.net/
- Diagrams of JCC and JEPP
  - Andreas Schreiber: Mixing Java and Python
  - http://www.slideshare.net/onyame/mixing-python-and-java
- On Jython C Extension API
  - http://stackoverflow.com/questions/3097466/using-numpy-and-cpython-with-jython
- Demo of a running service:
  - http://insdev01.cern.ch

#1 - How to embed Solr (standard)


- solr.client.solrj.embedded.EmbeddedSolrServer


#2 - How to embed Solr (simplified)

- solr.servlet.DirectSolrConnection
- like previous, but simpler
- all the queries are sent as strings, everything is just a string
- very flexible and probably suitable for quick integration

#3 - Example of a Solr custom handler


#4 - Example Python handler
