We have an old system which uses DataImportHandler to import data from PostgreSQL. The way we use it is:
We configure an XML file with the selects that define what gets indexed into Solr from our database. Take a look at this: http://wiki.apache.org/solr/DataImportHandler#Full_Import_Example We then call URLs to perform the import: http://localhost:8983/solr/db/dataimport?command=full-import performs a full import, and http://localhost:8983/solr/db/dataimport?command=delta-import imports only what has changed since the last import.
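To make it concrete, our configs look roughly like the wiki example linked above. This is only a sketch: the data source, table, and column names here are hypothetical, and a real delta setup usually also needs a deltaImportQuery.

```xml
<dataConfig>
  <dataSource driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost/ourdb"
              user="solr" password="..."/>
  <document>
    <entity name="user"
            query="SELECT id, name, address FROM users"
            deltaQuery="SELECT id FROM users
                        WHERE updated_at &gt; '${dataimporter.last_index_time}'">
      <field column="id" name="id"/>
      <field column="name" name="name"/>
      <field column="address" name="address"/>
      <!-- nested entity: each user's cities embedded in the user document -->
      <entity name="city"
              query="SELECT name FROM cities WHERE user_id='${user.id}'">
        <field column="name" name="cityNames"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```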
I am not saying we need exactly this solution; I just told you about DataImportHandler because we are used to it. However, your idea of listeners sounds like something that would help with realtime indexing, so that could be even better. We would just need some documentation or an explanation of how to do it.
The most important thing is: we need to select what gets indexed by Solr and how. For instance, I might have two column families (CFs): User and City.
I might want to import just one document to Solr, a User document, where each user goes together with his/her associated cities. I might want to import the address column, but not the birthDate column.
I might want to import two documents to Solr, a User document and a City document, each one indexed independently.
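The two mappings above can be sketched as follows. This is a toy illustration in Python, not our actual code; the field and function names are hypothetical:

```python
def user_as_single_doc(user, cities):
    """Denormalized mapping: one Solr document per user, with the
    associated city names embedded and birthDate deliberately left out."""
    return {
        "id": "user-%s" % user["id"],
        "name": user["name"],
        "address": user["address"],  # imported
        # user["birthDate"] is intentionally not mapped
        "cityNames": [c["name"] for c in cities],
    }

def user_and_city_docs(user, cities):
    """Independent mapping: one User document plus one City document each,
    distinguished by a 'type' field in the same index."""
    docs = [{"id": "user-%s" % user["id"], "type": "user", "name": user["name"]}]
    docs += [{"id": "city-%s" % c["id"], "type": "city", "name": c["name"]}
             for c in cities]
    return docs
```

The point is that the mapping (which columns, which nesting) has to be under our control, the way the DIH config lets us control it today.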
Imagine a very big and complex data model.
I have the following entities:
User (has id, name, birthDate, List…)
All of that goes to Solr, but there is only one document: User.
A second scenario would be to have City as a separate document in Solr too.
What if I receive a new user interest? I will add one more interest to the UserInterests CF; that would be the change in Cassandra.
In Solr, however, I would need to reindex the entire user, since Solr, AFAIK, doesn't allow you to reindex only part of a document: you can either delete the indexed document or replace it, but you can't update it in place.
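So on every sub-entity change we would rebuild and resubmit the whole user document. A rough sketch of that, assuming the stock Solr behavior that adding a document with an existing uniqueKey replaces the old one (the function names here are hypothetical; note that Solr 4.0+ did later add atomic updates, but classically this wholesale replace is the only option):

```python
import json

def full_user_replacement(user, interests):
    """Rebuild the complete user document after any sub-entity change.
    Posting this to /solr/db/update with the same uniqueKey ('id')
    replaces the previously indexed document wholesale."""
    return {
        "id": "user-%s" % user["id"],
        "name": user["name"],
        "interests": interests,  # the full, current list - not a delta
    }

def update_payload(docs):
    """Body for POST .../update?commit=true, Content-Type application/json."""
    return json.dumps(docs)
```

So even though Cassandra only saw one new row in UserInterests, Solr receives the entire user again.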
In the example below, the savingEntity object would receive a UserInterest entity, wouldn't it? But I want to reindex the User.
Remember that realtime indexing is good, but having the possibility to do it in batches is also desirable. Suppose the following scenario, indexing in real time:
I add a new interest to the user.
I add a new like to the user.
I add a new request to the user.
Indexing the entire user on every event could be problematic in some cases, so sometimes it might be better to perform a delta index every hour, for instance. End users still get near-real-time data, but the processing load on the server decreases a lot.
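The batching idea above can be sketched with a simple dirty-set that a scheduler flushes periodically. This is just an illustration of the intent, assuming a hypothetical BatchReindexer; each user is then reindexed at most once per interval, no matter how many events touched them:

```python
import threading

class BatchReindexer:
    """Collect ids of users touched by events and reindex each one at
    most once per flush interval, instead of once per event."""

    def __init__(self):
        self._dirty = set()
        self._lock = threading.Lock()

    def mark_dirty(self, user_id):
        # Called from every event handler (new interest, like, request...).
        with self._lock:
            self._dirty.add(user_id)

    def flush(self):
        # Called by a scheduler, e.g. once per hour. Returns the ids whose
        # full documents should be rebuilt and posted to Solr.
        with self._lock:
            batch, self._dirty = self._dirty, set()
        return batch
```

With this, three events on the same user in one hour cost one reindex instead of three.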