Yarep Search

Yarep comes with a search feature. To turn it on, configure an index location, i.e. a directory where the index will be stored.

By default, Yarep uses Lucene as its search and indexing implementation. If the standard implementation does not fit your needs, you can provide your own search by implementing the following interfaces:

  • org.wyona.yarep.core.search.Indexer (used for indexing)
  • org.wyona.yarep.core.search.Searcher (used for searching)
  • org.wyona.yarep.core.search.Metadata (used to pass additional information to the Indexer)

For reference, have a look at the standard implementation in org.wyona.yarep.impl.search.lucene.

 

Configuration

Element Name (namespace) | Child-Element | Attribute | Explanation
search-index (http://www.wyona.org/yarep/search/2.0) | index-location, repo-auto-index-fulltext, repo-auto-index-properties, lucene (implementation specific) | indexer-class, searcher-class | Root configuration element for search. The attribute indexer-class configures a specific implementation of org.wyona.yarep.core.search.Indexer. If not set, it falls back to the standard implementation (org.wyona.yarep.impl.search.lucene.LuceneIndexer). The attribute searcher-class configures a specific implementation of org.wyona.yarep.core.search.Searcher. If not set, it falls back to the standard implementation (org.wyona.yarep.impl.search.lucene.LuceneSearcher).
index-location | - | file | Path to the directory where the index will be stored, either absolute or relative to this config file. Make sure your application has read/write access to it.
repo-auto-index-fulltext | - | boolean | true or false; falls back to true if not set. Indicates whether indexing should be done when close() is called on an InputStream. If turned off, you can index a node by using Indexer.index(Node node).
repo-auto-index-properties | - | boolean | true or false; falls back to true if not set. Indicates whether indexing should be done when a property is saved. If turned off, you can index a property by using Indexer.index(Node node, Property property).
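
The "absolute or relative to this config file" rule for index-location can be illustrated with a small sketch. This is not Yarep's actual code: the class IndexLocationResolver, its resolve method and the path /realm/config/vfs-data-repository.xml are hypothetical names chosen for illustration, assuming that a relative value is resolved against the directory containing the config file.

```java
import java.io.File;

/** Hypothetical helper (not part of Yarep): resolves a configured path against the config file. */
final class IndexLocationResolver {

    /**
     * Returns the configured location unchanged if it is absolute,
     * otherwise resolves it against the directory that contains the config file.
     */
    static File resolve(File configFile, String location) {
        File candidate = new File(location);
        if (candidate.isAbsolute()) {
            return candidate;
        }
        // Assumption: "relative to this config file" means relative to its parent directory.
        return new File(configFile.getAbsoluteFile().getParentFile(), location);
    }

    public static void main(String[] args) {
        // Hypothetical config file location, for illustration only.
        File config = new File("/realm/config/vfs-data-repository.xml");
        System.out.println(resolve(config, "index"));
    }
}
```

Under this assumption, file="index" ends up in a directory named index next to the repository definition file, which is the directory that needs read/write access.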

Standard implementation (lucene) configuration

Element Name (namespace) | Child-Element | Attribute | Explanation
lucene | local-tika-config, fulltext-analyzer, property-analyzer, write-lock-timeout | - | Root configuration element for the standard search/index implementation (Lucene).
local-tika-config | - | file | Path to a Tika configuration file, either absolute or relative to this config file.
fulltext-analyzer | - | class | A class which extends org.apache.lucene.analysis.Analyzer. If empty, org.apache.lucene.analysis.standard.StandardAnalyzer is used.
property-analyzer | - | class | A class which extends org.apache.lucene.analysis.Analyzer. If empty, org.apache.lucene.analysis.WhitespaceAnalyzer is used.
write-lock-timeout | - | ms | Maximum time in milliseconds to wait for a write lock on the index. (TODO: This is not implemented yet.)

Configuration Examples

The configuration is done in the data repository definition file (e.g. .../yanel/src/realms/from-scratch-realm-template/config/vfs-data-repository.xml).

Minimal Configuration

  <s:search-index xmlns:s="http://www.wyona.org/yarep/search/2.0">
    <index-location file="index"/>
  </s:search-index>

Full configuration

  <s:search-index xmlns:s="http://www.wyona.org/yarep/search/2.0"
      indexer-class="org.wyona.yarep.impl.search.lucene.LuceneIndexer"
      searcher-class="org.wyona.yarep.impl.search.lucene.LuceneSearcher">
    <index-location file="index"/>
    <repo-auto-index-fulltext boolean="true"/>
    <repo-auto-index-properties boolean="true"/>
    <lucene>
      <local-tika-config file="tika-config.xml"/>
      <fulltext-analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
      <property-analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
      <write-lock-timeout ms="3000"/>
    </lucene>
  </s:search-index>

Search resource configuration

Searching for properties must be turned on for each property in the resource configuration of the search resource (e.g. .../yanel/src/realms/from-scratch-realm-template/res-configs/en/search.html.yanel-rc):

<yanel:resource-config xmlns:yanel="http://www.wyona.org/yanel/rti/1.0">
  <yanel:rti name="search" namespace="http://www.wyona.org/yanel/resource/1.0"/>

  <yanel:property name="property-name" value="yarep_checkoutUserID"/>

</yanel:resource-config>

Implementation details of indexing

Yanel uses Tika (currently version 0.4) to parse the documents for indexing. The document is first parsed by Tika, then passed to Lucene for indexing. To configure Tika, use a Tika configuration file (see tika-core/src/main/resources/org/apache/tika/tika-config.xml in the Tika 0.4 source code package for an example).

The fulltext index is written by the class org.wyona.yarep.impl.search.lucene.LuceneIndexer, which is called when the InputStream is closed.

The actual sequence of indexing properties for a (virtual) file is:

  • YourResource sets the properties in Yarep:
    org.wyona.yarep.core.Node.setProperty(String, String)
  • Yarep saves these properties in
    org.wyona.yarep.impl.repo.vfs.VirtualFileSystemNode.saveProperties()
  • by calling
    org.wyona.yarep.impl.repo.vfs.VirtualFileSystemOutputStream.close()

For this reason, it is very important that all OutputStreams are closed, even though the compiler will not warn you about a stream that is left open.
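
A minimal sketch of a write path that always closes the stream is shown below. It does not use the Yarep API: RecordingOutputStream is a hypothetical stand-in for the stream you would obtain from a repository node, and writeAndClose is an illustrative helper name. The point is only the try/finally pattern, which guarantees that close() runs even if write() throws.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

final class CloseStreamSketch {

    /** Hypothetical stand-in for a repository stream: records whether close() was called. */
    static final class RecordingOutputStream extends ByteArrayOutputStream {
        boolean closed = false;

        @Override
        public void close() throws IOException {
            // In Yarep, closing the stream is the step that the saving/indexing depends on.
            closed = true;
            super.close();
        }
    }

    /** Writes the content and guarantees that close() runs, even if write() throws. */
    static void writeAndClose(OutputStream out, String content) throws IOException {
        try {
            out.write(content.getBytes(StandardCharsets.UTF_8));
        } finally {
            out.close();
        }
    }

    public static void main(String[] args) throws IOException {
        RecordingOutputStream out = new RecordingOutputStream();
        writeAndClose(out, "Hello Yarep");
        System.out.println("closed: " + out.closed); // prints "closed: true"
    }
}
```

On Java 7 or later, a try-with-resources statement achieves the same effect with less code.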

Using the indexing/search features

If you use the default configuration of Yanel, only the fulltext of your content documents will be indexed. If you want properties to be indexed and searchable, you must:

  • Turn on property search in the search resource config (see above)
  • Index those properties when saving the content document by using
    node.setProperty("property-name", "property-value");

With the current implementation, it is not possible to search in fulltext mode and properties simultaneously, but it is possible to configure different searches via different resource-configs, e.g. one for each.

Custom parser

You can write your own (Tika) parser. The simplest way is to copy an existing parser (e.g. org.apache.tika.parser.xml.DcXMLParser), modify it according to your needs, and configure Tika to use your custom parser (in tika-config.xml). Also see the Tika documentation.

Caveats: With the current Yarep implementation, only the metadata fields "org.apache.tika.metadata.Metadata.TITLE", "org.apache.tika.metadata.Metadata.KEYWORDS" and "org.apache.tika.metadata.Metadata.DESCRIPTION" will be indexed, and they will be indexed as fulltext! Also be aware that if one of these metadata fields occurs multiple times, only the first instance will be indexed. If you have, for example, several keywords to index, you must put them into a single KEYWORDS field as a whitespace-separated list of words.
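
As an illustration of the last point, a custom parser could collapse its keywords into one string before storing them in the single KEYWORDS field. The following is only a sketch: KeywordField and join are hypothetical names, and the code deliberately does not call any Tika or Yarep API.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: builds the value for a single KEYWORDS metadata field. */
final class KeywordField {

    /** Joins the keywords into one whitespace-separated string, skipping blank entries. */
    static String join(List<String> keywords) {
        StringBuilder sb = new StringBuilder();
        for (String keyword : keywords) {
            String trimmed = keyword.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore empty keywords so no double spaces appear
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(trimmed);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(join(Arrays.asList("search", " lucene ", "", "tika")));
        // prints "search lucene tika"
    }
}
```

Note that a keyword which itself contains whitespace cannot be told apart from two separate keywords once it is stored in such a list.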


