Java DB

Apache Derby

Derby Full Text Seach Index

Derby Getting Started
Derby Reference Manual
Derby Developer's Guide
Derby Performance Tuning
Derby Server and Admin Guide
Derby Tools and Utilities
Derby Getting Started
-Introduction to Derby
-Installing and configuring Derby
-Self-study tutorial for users new to Derby
-Quick start guide for experienced JDBC users
-Documentation conventions
-Derby libraries and scripts: complete reference

Java DB / Derby Full Text Search Indexer

In databases, full text search refers to a technique for searching a computer-stored  database to find a relevant record that contains given words. In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user.

There are various ways to index these database records so that retrieval is fast at the cost of insertion time.

Apache Derby (aka Java DB) does not have a full-text search engine at this time. There is a JIRA issue “DERBY-472 - Full Text Indexing / Full Text Search” to integrate Lucene with Derby database. It is not implemented yet.

A Full Text Search Alternative

JavaDb.net is proud to present you an alternative solution to index your documents for full text search queries.

We have developed Full Text Indexer using existing Derby capabilities: triggers, table functions, Java stored procedures.

You can download ftindex.jar version 0.1 for free. It is written in Java using the Derby (Java DB) extension capabilities. It is a tiny jar that provides the following capabilities:

Create a full text index on one or columns of a table. Details:

  • There can only be 1 index on 1 table.
  • The index may contain more than 1 column but less than 10 columns.
  • Each indexed column’s size should be less than Derby’s VARCHAR length limit of 32672 characters. This is because the function that indexes a document takes the document as a VARCHAR parameter. Since the function is invoked inside a trigger, LOB (CLOB) types are not supported to be indexed.
  • User may specify additional columns to be included in the index table, so that they can be selected without using a join.

Please provide feedback to javadb@jdbcurl.com to improve full text search indexer Derby extension.

How to Install Full Text Search Indexer for Derby

You need to install the ftindexer.jar file to your Derby database, but having it on your application classpath (in embedded Derby mode) is good enough. To install into database, you can use:

CALL SQLJ.install_jar('C:\ftindex.jar', 'FTINDEX', 0)

Next, you need to create the following Derby function whose implementation is inside the ftindexer.jar:

    CREATE PROCEDURE CREATE_FTEXT_INDEX(
      TABLE_NAME VARCHAR(100),
      PK_COLUMNS VARCHAR(2048),
      INDEX_COLUMNS VARCHAR(2048),
      INCLUDE_COLUMNS VARCHAR(2048),
      MIN_WORD_LEN INTEGER,
      MAX_WORD_LEN INTEGER)
      PARAMETER STYLE JAVA
      LANGUAGE JAVA
      MODIFIES SQL DATA
      EXTERNAL NAME
    'net.javadb.index.FTIndexer.createFTIndex'

We will explain how to use above function to create full text search indices later. You may also need the following function to drop your indices:

    CREATE PROCEDURE DROP_FTEXT_INDEX(TABLE_NAME VARCHAR(100))
      PARAMETER STYLE JAVA
      LANGUAGE JAVA
      MODIFIES SQL DATA
      EXTERNAL NAME
    'net.javadb.index.FTIndexer.dropFTIndex'

Derby Command to Create Full Text Search Index

CALL CREATE_FTEXT_INDEX('APP.TEST', 'ID', 'COL1', NULL, 4, 30)

APP.TEST is the TABLE_NAME parameter. It is the table that you want to create an index for, prefixed by the schema name.

ID is the Primary Key column of your table. You can leave it as NULL if you like and CREATE_FTEXT_INDEX function will find the primary key or the unique index to use. If you have a composite key, you can specify all of them separated by commas. The reason this is an input is for the cases where there are multiple unique keys defined for the table, but you don’t like the one CREATE_FTEXT_INDEX function is picking automatically and you want another candidate key column(s) to be used.

COL1 is the column that contains the documents that you want to index. You can list up to 9 columns separated by commas.

NULL is the comma separated list of additional columns to include to the index table. These columns are the non-PK, non-document columns that you want to denormalize from your actual table to the index table so that some search queries may perform better (possibly with the help of additional indexes).

4 is the minimum length for a word to be indexed.

30 is the maximum word length to be indexed.

Database Objects Created for the Full Text Index

Let us take define an example and assume that you have the following table:

CREATE TABLE APP.TEST(
  ID INTEGER NOT NULL,
  AUTHOR VARCHAR(40),
  DOCUM VARCHAR(2048),
  PRIMARY KEY (ID)
)

ID is the primary key. DOCUM is the text column you wish to create a full text search index for so that you can quickly find records that contain a certain word.

When you execute:

CALL CREATE_FTEXT_INDEX('APP.TEST', 'ID', 'DOCUM', NULL, 4, 30)

Full Text Indexer creates:

1. TEST__ table that maps each word to the ID:

    CREATE TABLE APP.TEST__(
      WORD__ VARCHAR(30),
      ID INTEGER NOT NULL,
      FOREIGN KEY(ID) REFERENCES APP.TEST(ID)
    )
     

2. A unique index on above table:

    CREATE UNIQUE  INDEX APP.IDX_TEST__ ON APP.TEST__(WORD__,ID);

3. INSERT TRIGGER on table TEST:

    CREATE TRIGGER FTI_TEST
     AFTER  INSERT  ON APP.TEST
     REFERENCING  NEW AS N
     FOR EACH ROW  MODE DB2SQL
    INSERT INTO "APP"."TEST__"(WORD__,ID)
    (SELECT WORD,N.ID FROM TABLE("APP"."F_TEST__"(N."DOCUM")) AS T)

4. UPDATE TRIGGER on table TEST (to delete old indexed values):

    CREATE TRIGGER FTD_TEST
     AFTER  UPDATE  ON APP.TEST
     REFERENCING  OLD AS O
     FOR EACH ROW  MODE DB2SQL
    DELETE FROM "APP"."TEST__" WHERE ID=O.ID

5. UPDATE TRIGGER on table TEST (to populate new index values):

    CREATE TRIGGER FTU_TEST
     AFTER  UPDATE  ON APP.TEST
     REFERENCING  NEW AS N
     FOR EACH ROW  MODE DB2SQL
    INSERT INTO "APP"."TEST__"(WORD__,ID)
    (SELECT WORD,N.ID FROM TABLE("APP"."F_TEST__"(N."DOCUM")) AS T)

To drop these artifacts, you can use:

APP.DROP_FTEXT_INDEX('APP.TEST')

Using Full Text Search Index Table

Once your index is created, you can write a query like this:

    SELECT ID, AUTHOR, DOCUM
    FROM TEST T, TEST__ I
    WHERE T.ID=I.ID AND I.WORD__=?

Due to the composite index on TEST__(WORD, ID), given a word, we are able to locate the corresponding ID quickly and return results faster than a LIKE query on TEST.DOCUM column.

Performance of Full Text Search Index

In most cases, this relational table solution for full text indexing would be slower than using an integrated Lucene solution. However, this is available today with minimal effort. Also there is a good side to it. Since the entire index is a relational table (Unlike Lucene), additional columns can be added to the index table and composite database indices can be created on them to speed up special queries. How? Let’s give an example:

Consider the earlier example and assume that there is a DOCYEAR SMALLINT column in the APP.TEST table. If your application selects documents within a year, or a year range, then DOCYEAR__ column can be included in the full text index table TEST__ and become part of a composite index that will span (DOCYEAR__, WORD__, ID) and given a year and word, the ID record will be located faster if providing the year upfront limits the search space considerably.

FUTURE TO DO TO IMPROVE FULL TEXT SEARCH INDEXER

1. Provide indexer function to scan a table that is already populated at the time index was created. This may be invoked at the time index is created.

2. Provide FREQ INTEGER column to store frequency of each word within a document. This can be used as a measure of relevance.

3. Provide LEN SMALLINT column to store the length of each word. In some cases, limiting search by length may help performance.

4. Provide XPATH VARCHAR(?) column to index words in XML documents along with their XPath locations.

    XML Element Xpath example is  /department/description
    XML Attribute XPath example is /department/@id

 

javadb@jdbcurl.com