Wednesday, May 27, 2009

Compass - Lucene - Hibernate - PDF / Document (2)

In a previous post, I discussed what is needed to bring up the combination of Compass, Lucene, Spring and Hibernate. Now I want to plug in functionality for indexing and searching the BLOB contents of PDFs and documents. In the Compass bean settings in the Spring context, we had this configuration:
<prop key="compass.converter.blobConverter.registerClass">java.sql.Blob</prop>
<prop key="compass.converter.blobConverter.type">BlobConverter</prop>
This is required, first, to tell Compass that we are registering a converter for the type java.sql.Blob and, second, to provide the fully-qualified converter class. Here, we are indexing PDF files with Compass on a JdbcDirectory. I chose PDFBox for stripping text out of the PDF content, so the converter would look like this:
public boolean marshall(Resource resource, Blob root, Mapping mapping, MarshallingContext context) throws ConversionException {
        if (root == null) {
            return false;
        }
        try {

            byte[] bytes = root.getBytes(1, (int) root.length());
            parser = new PDFParser(new ByteArrayInputStream(bytes));
            parser.parse();
            COSDocument document = parser.getDocument();
            logger.warn("Parsed an instance BLOB with PDF content type: " + document);
            String pdfText = stripper.getText(new PDDocument(document));
            logger.warn("Extracted text from PDF: " + pdfText);

            Property p = context.getResourceFactory().createProperty(mapping.getPath().getPath(), pdfText.getBytes(), Store.YES);
            logger.warn("Created a Compass property on PDF text: " + p);

            resource.addProperty(p);

            return true;

        } catch (IOException e) {
            throw new ConversionException("Failed to initialize a PDF Parser: ", e);
        } catch (SQLException e) {
            throw new ConversionException("Failed to read the BLOB content: ", e);
        }
    }
Remember that this may not be complete: to retrieve something from the index, you may need another piece of data, such as the ID of the model or entity that this PDF content belongs to. If you'd like to index and search other document formats in this way, you may want to use Apache POI. To wrap it up: with Compass's JdbcDirectory support, its Hibernate integration and the Spring configuration, you get a high-level API to index and search content that is stored in a database.
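One detail in the converter above deserves a second look: root.getBytes(1, (int) root.length()) casts the BLOB length to int, which silently truncates for very large BLOBs. Below is a minimal sketch of reading the BLOB through its stream instead. The class name BlobBytes and the method readAll are hypothetical helpers for illustration, not part of Compass or PDFBox, and the size limit is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.sql.Blob;
import java.sql.SQLException;

// Hypothetical helper (not part of Compass or PDFBox): reads a BLOB fully into memory.
final class BlobBytes {
    private BlobBytes() {}

    static byte[] readAll(Blob blob) throws SQLException, IOException {
        if (blob == null) {
            return new byte[0];
        }
        // Assumption: anything that cannot fit in a Java array is rejected up front
        // instead of being truncated by an (int) cast.
        if (blob.length() > Integer.MAX_VALUE - 8) {
            throw new IOException("BLOB too large to buffer in memory: " + blob.length());
        }
        try (InputStream in = blob.getBinaryStream();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            // Copy the stream chunk by chunk until it is exhausted.
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }
}
```

The resulting byte array can then be wrapped in a ByteArrayInputStream and handed to the parser exactly as in marshall.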

Thursday, May 21, 2009

IBM ICU Persian (Farsi) / Arabic Shaping Bug Fix

Courtesy of Nima Honarmand and Seyyed Jamal Pishvayi, there is now a patch that fixes issue 6169 of IBM's ICU project. The issue concerns Persian/Arabic shaping support for the FB50 block. I hope this contribution helps resolve the issue faster. The patch source: ArabicShaping.java

Tuesday, May 19, 2009

Compass - Lucene - Spring - Hibernate (1)

Compass is a framework on top of Apache Lucene that delivers robust and useful services. It also provides plug-in points for frameworks such as Spring and Hibernate. Although it has a reference manual, putting it all together is somewhat complicated and cumbersome. This will probably be a multi-part how-to on the subject. Consider this scenario: we are developing a web application, and I will not delve into the rest of its stack. The main focus here is an entity called "Story". So, we'd have an HBM:
<hibernate-mapping package="ir.asta.wise.core.datamanagement.textsearch.sample.story">
    <class name="StoryEntity" table="LUC_STORY">
        <id name="id" column="STORY_ID" type="java.lang.String">
            <generator class="uuid"></generator>
        </id>
        <property name="name" type="java.lang.String" column="NAME" not-null="false"></property>
        <property name="content" type="java.lang.String" column="CONTENT" not-null="false" length="4096"></property>
        <property name="blobContent" type="java.sql.Blob" column="BLOB_CONTENT"></property>
    </class>
</hibernate-mapping>
The first step is to tell Compass to index Story: we create a file named Story.cpm.xml; the Core Mapping for Story:
<compass-core-mapping package="ir.asta.wise.core.datamanagement.textsearch.sample.story">
    <class name="StoryEntity" alias="StoryEntity">
        <id name="id" />
        <property name="content">
            <meta-data>${wiseCompass.story}</meta-data>
        </property>
        <property name="blobContent" converter="blobConverter">
            <meta-data>${wiseCompass.story}</meta-data>
        </property>
    </class>
</compass-core-mapping>
Alongside this mapping, we need to introduce Story to Compass through a meta-data descriptor, call it compass.cmd.xml, that gives the big picture of all indexables:
<compass-core-meta-data>
    <meta-data-group id="wiseCompass" displayName="WiSE Core Lucene/Compass">
        <description>WiSE Core Lucene/Compass Core Meta Data</description>
        <uri>http://wise/compass</uri>
        <meta-data id="story" displayName="Story Metadata">
            <description>Story Entity Compass Metadata</description>
            <uri>http://wise/compass/story</uri>
            <name>story</name>
        </meta-data>
        <meta-data id="file" displayName="File Entity Content Metadata">
            <description>File Entity Content Metadata</description>
            <uri>http://wise/compass/file</uri>
            <name>file</name>
        </meta-data>
    </meta-data-group>
</compass-core-meta-data>
Now, let's get to Spring configuration. First, we need to create a bean to be the Compass object:
    <bean id="compass" class="org.compass.spring.LocalCompassBean">
        <property name="dataSource" ref="dataSource" />
        <property name="transactionManager" ref="transactionManager" />
        <property name="resourceLocations">
            <list>
                <value>classpath:config/lucene/compass/*.xml</value>
            </list>
        </property>
        <property name="compassSettings">
            <props>
                <prop key="compass.name">compass</prop>
                <prop key="compass.engine.connection">jdbc://</prop>
                <prop key="compass.converter.blobConverter.registerClass">java.sql.Blob</prop>
                <prop key="compass.converter.blobConverter.type">BlobConverter</prop>
            </props>
        </property>
    </bean>
Some comments on the 'compass' bean configuration:
  • transactionManager is a reference to the Spring bean that acts as the actual transaction manager, the same one that is used for Hibernate.
  • resourceLocations tells Compass where all the *.cpm.xml and *.cmd.xml files are.
  • Pure Lucene does not implement the JDBC Directory concept, so through jdbc:// we tell Compass to use its own JdbcDirectory implementation; that is why dataSource is injected.
  • In this example, we aim to index and search BLOB types (such as PDFs or documents), so we need to configure Compass with our BLOB converter. I'll discuss this further in the second part of the tutorial.
The next steps fall into two parts: saving the index and searching it. For both, we need a Compass session that is bound to the Compass bean with all the Hibernate bindings. To get one, we define two more beans that add the Hibernate collaboration to Compass:
    <bean id="hibernateGpsDevice" class="org.compass.gps.device.hibernate.HibernateGpsDevice">
        <property name="name">
            <value>Hibernate-GPS-Device</value>
        </property>
        <property name="sessionFactory" ref="sessionFactory" />
        <property name="nativeExtractor">
            <bean class="org.compass.spring.device.hibernate.SpringNativeHibernateExtractor" />
        </property>
    </bean>
And:
    <bean id="hibernateGps" class="org.compass.gps.impl.SingleCompassGps" init-method="start" destroy-method="stop">
        <property name="compass" ref="compass" />
        <property name="gpsDevices">
            <list>
                <ref bean="hibernateGpsDevice" />
            </list>
        </property>
    </bean>
Now we can use the hibernateGps bean to reach the Compass session API. To save an index, we assume that a StoryEntity has already been saved and we now want to index it:
    @Transactional(readOnly = false)
    private void saveStoryIndex(StoryEntity s) {
        CompassIndexSession session = hibernateGps.getIndexCompass().openIndexSession();
        try {
            session.save(s);
            session.commit();
            logger.warn("Index saved.");
        } finally {
            // Always release the session, even if indexing fails.
            session.close();
        }
    }
And, to search:
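    @Transactional(readOnly = false)
    private void searchSomeStory() {
        logger.warn("Searching....");
        CompassSearchSession session = hibernateGps.getIndexCompass().openSearchSession();
        try {
            CompassHits hits = session.find("sample");
            logger.warn("Hits: " + hits.getLength());
            // Guard against an empty result before reading the first hit.
            if (hits.getLength() > 0) {
                logger.warn("First result: " + hits.hit(0).data());
            }
        } finally {
            // Always release the session, even if the search fails.
            session.close();
        }
    }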
This is a brief overview of what the integration needs. In the next part, I'll discuss how indexing and converting PDFs and documents can be handled. Hope this helps.

Monday, May 18, 2009

Acegi - CAS - Service Ticket - OC4J 10.1.3.*?! - Ticket lost

Thanks to Oracle OC4J (Standalone/Embedded), which from time to time reminds me that we can still break the rules in software collaboration. The problem begins when you have a CAS - Acegi integrated SSO solution on one application server, and on another machine an Oracle OC4J Standalone 10.1.3.* server hosting the client applications. When a client application goes to the CAS server and the SSO sign-in completes, the browser should be returned to the client application, targeting the CasProcessingFilter:
https://oc4japphost:8443/myapp/j_acegi_security_check?ticket=[CAS SERVICE TICKET]
Here comes our hero OC4J, which apparently takes offence when some referrer reaches one of its hosted applications with a query string and a request parameter. So, very logically(!!!), the OC4J container just truncates the query string, and this is how the CAS ticket gets lost in the middle of nowhere. Thanks to Seyyed Jamal, a great friend, the idea is to have CAS pass the ticket using clean URLs instead of query strings, such as:
https://oc4japphost:8443/myapp/j_acegi_security_check/ticket/[CAS SERVICE TICKET]
This way the OC4J container is unaware of what's going on. To implement the solution:
  1. CAS login-webflow.xml should be edited for the external redirection after a successful sign-in.
  2. Acegi's CasProcessingFilter needs to be edited to look up the CAS ticket based on clean URLs.
CAS: login-webflow.xml
The end-state should be edited so that the value of its view becomes:

<end-state id="redirect" view="externalRedirect:${externalContext.requestParameterMap['service']}${requestScope.ticket== null ? '' : '/ticket/' + requestScope.ticket}"></end-state>
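To make the view expression above easier to check, here is a minimal plain-Java sketch of the same rule: append /ticket/ plus the ticket to the service URL, or leave the URL untouched when there is no ticket. The class CleanTicketUrl and its method are hypothetical names for illustration only; the sketch assumes the service URL carries no query string or trailing slash.

```java
// Hypothetical illustration of the redirect rule used in the end-state view.
final class CleanTicketUrl {
    private CleanTicketUrl() {}

    // Mirrors: service + (ticket == null ? '' : '/ticket/' + ticket)
    static String redirectUrl(String serviceUrl, String ticket) {
        if (ticket == null) {
            return serviceUrl;
        }
        return serviceUrl + "/ticket/" + ticket;
    }

    public static void main(String[] args) {
        // "ST-1-abc" is a made-up ticket value.
        System.out.println(redirectUrl(
                "https://oc4japphost:8443/myapp/j_acegi_security_check", "ST-1-abc"));
    }
}
```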
Acegi: CasProcessingFilter
public Authentication attemptAuthentication(HttpServletRequest request) throws AuthenticationException {
        String username = CAS_STATEFUL_IDENTIFIER;
        String password = extractTicket(request);

        logger.warn("[ CUSTOMIZED OC4J CAS Processing Filter ] Found CAS ticket: " + password);

        if (password == null) {
            password = "";
        }
        UsernamePasswordAuthenticationToken authRequest = new UsernamePasswordAuthenticationToken(username, password);

        authRequest.setDetails(authenticationDetailsSource.buildDetails(request));

        Authentication authenticationResult = this.getAuthenticationManager().authenticate(authRequest);

        if (authenticationResult != null) {
           logger.warn("[ CUSTOMIZED OC4J CAS Processing Filter ] CAS authentication completed. Its success will be decided afterwards.");
        }
        return authenticationResult;
    }
And, now:
    protected String extractTicket(HttpServletRequest request) {
        String ticket = request.getParameter("ticket");
        if (StringUtils.hasText(ticket)) {
            logger.warn("Service TICKET found on query string: " + ticket);
            return ticket;
        }
        String uri = request.getRequestURI();
        if (uri.indexOf("/ticket/") > 0) {
            ticket = uri.substring(uri.lastIndexOf('/') + 1);
            if (StringUtils.hasText(ticket)) {
                logger.warn("Service TICKET found on clean URL: " + ticket);
                return ticket;
            }
        }
        logger.error("No SERVICE TICKET FOUND on request: " + uri);
        return null;
    }
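The clean-URL branch of extractTicket can be tried outside a servlet container by working on the URI string alone. The following is a rough sketch under assumptions, not the filter's actual code: the class TicketPath is a hypothetical name, it takes everything after the last /ticket/ marker, and it additionally strips a ;jsessionid=... style path parameter, which some containers append to the request URI.

```java
// Hypothetical, container-free variant of the clean-URL ticket lookup.
final class TicketPath {
    private static final String MARKER = "/ticket/";

    private TicketPath() {}

    // Returns the ticket found after the last "/ticket/" segment, or null if absent.
    static String ticketFromUri(String uri) {
        if (uri == null) {
            return null;
        }
        int start = uri.lastIndexOf(MARKER);
        if (start < 0) {
            return null;
        }
        String ticket = uri.substring(start + MARKER.length());
        // Assumption: drop path parameters such as ";jsessionid=..." if present.
        int semicolon = ticket.indexOf(';');
        if (semicolon >= 0) {
            ticket = ticket.substring(0, semicolon);
        }
        return ticket.trim().isEmpty() ? null : ticket;
    }
}
```

Keeping this logic in a small static method makes it easy to unit test the edge cases (no marker, empty ticket) before wiring it into the filter.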
Then, remember to edit your Acegi bean of type org.acegisecurity.util.FilterChainProxy and its property filterInvocationDefinitionSource so that the entry for j_acegi_security_check becomes:
/j_acegi_security_check*/**=httpSessionContextIntegrationFilter,casProcessingFilter
With this solution in place, OC4J seems to work against a separate CAS server, and its applications get signed in through it. This issue has also been discussed in the Spring Forums: http://forum.springsource.org/showthread.php?t=38897 Hope this helps.