Lucene Index can be created using following DDL
CREATE INDEX [IF NOT EXISTS] index_name
ON TABLE main_table (index_columns)
AS 'lucene'
[PROPERTIES ('key'='value')]
index_columns is the list of string columns on which lucene creates indexes.
Index can be dropped using following DDL:
DROP INDEX [IF EXISTS] index_name
ON [TABLE] main_table
To show all Indexes created, use:
SHOW INDEXES
ON [TABLE] main_table
It will show all Indexes created on the main table.
NOTE: Keywords given inside
[]
is optional.
Lucene is a high performance, full featured text search engine. Lucene is integrated to carbon as an index and managed along with main tables by CarbonData. User can create lucene index to improve query performance on string columns which has content of more length. So, user can search tokenized word or pattern of it using lucene query on text content.
For instance, main table called index_test which is defined as:
CREATE TABLE index_test (
name string,
age int,
city string,
country string)
STORED AS carbondata
User can create Lucene index using the Create Index DDL:
CREATE INDEX dm
ON TABLE index_test (name,country)
AS 'lucene'
Properties
When loading data to main table, lucene index files will be generated for all the index_columns(String Columns) given in CREATE statement which contains information about the data location of index_columns. These index files will be written inside a folder named with index name inside each segment folder.
A system level configuration carbon.lucene.compression.mode
can be added for best compression of
lucene index files. The default value is speed, where the index writing speed will be more. If the
value is compression, the index file size will be compressed.
As a technique for query acceleration, Lucene indexes cannot be queried directly. Queries are to be made on the main table. When a query with TEXT_MATCH('name:c10') or TEXT_MATCH_WITH_LIMIT('name:n10',10)[the second parameter represents the number of result to be returned, if user does not specify this value, all results will be returned without any limit] is fired, two jobs will be launched. The first job writes the temporary files in folder created at table level which contains lucene's search results and these files will be read in second job to give faster results. These temporary files will be cleared once the query finishes.
User can verify whether a query can leverage Lucene index or not by executing the EXPLAIN
command, which will show the transformed logical plan, and thus user can check whether TEXT_MATCH()
filter is applied on query or not.
Note:
The filter columns in TEXT_MATCH or TEXT_MATCH_WITH_LIMIT must be always in lowercase and filter conditions like 'AND','OR' must be in upper case.
Ex:
select * from index_test where TEXT_MATCH('name:*10 AND name:*n*')
Query supports only one TEXT_MATCH udf for filter condition and not multiple udfs.
The following query is supported:
select * from index_test where TEXT_MATCH('name:*10 AND name:*n*')
The following query is not supported:
select * from index_test where TEXT_MATCH('name:*10) AND TEXT_MATCH(name:*n*')
Below like
queries can be converted to text_match queries as following:
select * from index_test where name='n10'
select * from index_test where name like 'n1%'
select * from index_test where name like '%10'
select * from index_test where name like '%n%'
select * from index_test where name like '%10' and name not like '%n%'
Lucene TEXT_MATCH Queries:
select * from index_test where TEXT_MATCH('name:n10')
select * from index_test where TEXT_MATCH('name:n1*')
select * from index_test where TEXT_MATCH('name:*10')
select * from index_test where TEXT_MATCH('name:*n*')
select * from index_test where TEXT_MATCH('name:*10 -name:*n*')
Note: For lucene queries and syntax, refer to lucene-syntax
Once there is a lucene index created on the main table, following command on the main table is not supported:
UPDATE/DELETE
.ALTER TABLE DROP COLUMN
, ALTER TABLE CHANGE DATATYPE
,
ALTER TABLE RENAME
.Note: Adding a new column is supported, and for dropping columns and change datatype command, CarbonData will check whether it will impact the lucene index, if not, the operation is allowed, otherwise operation will be rejected by throwing exception.
ALTER TABLE ADD/DROP PARTITION
.However, there is still way to support these operations on main table, in current CarbonData release, user can do as following:
DROP INDEX
command.CREATE INDEX
command.
Basically, user can manually trigger the operation by refreshing the index.