Wednesday, May 27, 2009

Compass - Lucene - Hibernate - PDF / Document (2)

In a previous post, I discussed what is needed to bring up the combination of Compass, Lucene, Spring, and Hibernate. Now I want to plug in some functionality for indexing and searching the BLOB contents of PDFs and other documents. In the Compass bean settings in the Spring context, we had this configuration:
<prop key="compass.converter.blobConverter.registerClass">java.sql.Blob</prop>
<prop key="compass.converter.blobConverter.type">BlobConverter</prop>
These two properties do two things: the first tells Compass that we are registering a converter for the type java.sql.Blob, and the second names the converter class, which should be given by its fully qualified name (BlobConverter here, shown without a package prefix). In this setup, we are indexing PDF files with Compass on top of a JdbcDirectory. I chose PDFBox for stripping text out of the PDF content, so the converter would look like this:
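public boolean marshall(Resource resource, Blob root, Mapping mapping, MarshallingContext context) throws ConversionException {
        if (root == null) {
            // Nothing to index when the BLOB column is empty.
            return false;
        }
        PDDocument pdDocument = null;
        try {
            // Blob positions are 1-based; this reads the whole BLOB into memory.
            byte[] bytes = root.getBytes(1, (int) root.length());
            // Parser and stripper are kept local because they hold per-document state.
            PDFParser parser = new PDFParser(new ByteArrayInputStream(bytes));
            parser.parse();
            COSDocument document = parser.getDocument();
            logger.debug("Parsed an instance BLOB with PDF content type: " + document);
            pdDocument = new PDDocument(document);
            PDFTextStripper stripper = new PDFTextStripper();
            String pdfText = stripper.getText(pdDocument);
            logger.debug("Extracted text from PDF: " + pdfText);

            // Note: getBytes() encodes with the platform default charset.
            Property p = context.getResourceFactory().createProperty(mapping.getPath().getPath(), pdfText.getBytes(), Store.YES);
            logger.debug("Created a Compass property on PDF text: " + p);

            resource.addProperty(p);

            return true;

        } catch (IOException e) {
            throw new ConversionException("Failed to parse PDF content: ", e);
        } catch (SQLException e) {
            throw new ConversionException("Failed to read BLOB content: ", e);
        } finally {
            // Release the resources held by the parsed document.
            if (pdDocument != null) {
                try { pdDocument.close(); } catch (IOException ignored) { }
            }
        }
    }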
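The BLOB-reading step of the converter can be tried on its own. What follows is a minimal sketch using only the JDK: BlobBytes and toBytes are hypothetical names, not part of Compass or PDFBox, and SerialBlob stands in for a BLOB that Hibernate would load from the database. It makes explicit that Blob positions start at 1 and that casting length() to int only works for BLOBs that fit in a byte array.

```java
import java.nio.charset.StandardCharsets;
import java.sql.Blob;
import java.sql.SQLException;
import javax.sql.rowset.serial.SerialBlob;

// Hypothetical helper, not part of Compass or PDFBox.
final class BlobBytes {

    private BlobBytes() {
    }

    // Reads the whole BLOB into memory; returns an empty array for null or empty BLOBs.
    static byte[] toBytes(Blob blob) throws SQLException {
        if (blob == null) {
            return new byte[0];
        }
        long length = blob.length();
        if (length > Integer.MAX_VALUE) {
            // getBytes takes an int length, so larger BLOBs would need streaming instead.
            throw new SQLException("BLOB too large to read into a byte array: " + length);
        }
        if (length == 0) {
            return new byte[0];
        }
        // Blob positions are 1-based, so the first byte is at position 1.
        return blob.getBytes(1, (int) length);
    }

    public static void main(String[] args) throws SQLException {
        // SerialBlob is an in-memory Blob from the JDK, used here in place of a database BLOB.
        Blob blob = new SerialBlob("%PDF-1.4 sample".getBytes(StandardCharsets.UTF_8));
        byte[] bytes = toBytes(blob);
        System.out.println(bytes.length + " bytes read");
    }
}
```

For very large PDFs, reading through blob.getBinaryStream() would avoid holding the whole file in memory at once; the sketch keeps the byte-array approach only because that is what the converter above does.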
Keep in mind that this may not be complete: to retrieve something useful from the index, you may need another piece of data, such as the ID of the model or entity that this PDF content belongs to. If you'd like to index and search other document formats, such as Microsoft Office files, in the same way, you may want to use Apache POI for the text extraction. To wrap it up: by combining Compass's JdbcDirectory support with its Hibernate integration and configuring everything with Spring, you get a high-level API for indexing and searching content that is stored in a database.
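The point about carrying an ID along with the extracted text can be illustrated without any search library. The sketch below is plain Java, not Compass or Lucene API: PdfTextIndex, add, and search are hypothetical names, and the substring match is only a stand-in for real analysis and querying. It shows just one thing: each piece of indexed text has to be keyed by the ID of its owning entity so that a hit can be traced back to a database row.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for an index; not Compass or Lucene API.
final class PdfTextIndex {

    // Extracted text keyed by the ID of the owning entity, in insertion order.
    private final Map<Long, String> textById = new LinkedHashMap<>();

    void add(long entityId, String extractedText) {
        textById.put(entityId, extractedText.toLowerCase(Locale.ROOT));
    }

    // Returns IDs of entities whose text contains the term (case-insensitive substring match).
    List<Long> search(String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<Long> hits = new ArrayList<>();
        for (Map.Entry<Long, String> entry : textById.entrySet()) {
            if (entry.getValue().contains(needle)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        PdfTextIndex index = new PdfTextIndex();
        index.add(1L, "Quarterly report on Compass and Lucene");
        index.add(2L, "Meeting notes");
        // The result is a list of entity IDs, which could then be loaded through Hibernate.
        System.out.println(index.search("lucene"));
    }
}
```

Without the ID, a match on the text alone would tell you that some PDF contains the term but not which entity to load.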
