Start spark-sql in terminal and run the following queries,
CREATE TABLE maintable(a int, b string, c string) stored as carbondata; insert into maintable select 1, 'ab', 'cd'; CREATE index index1 on table maintable(c) AS 'carbondata'; SELECT a from maintable where c = 'cd'; // NOTE: run explain query and check if query hits the SI table from the plan EXPLAIN SELECT a from maintable where c = 'cd';
Secondary index tables are created as indexes and managed as child tables internally by Carbondata. Users can create a secondary index based on the column position in the main table(Recommended for right columns) and the queries should have filter on that column to improve the filter query performance.
Data refresh to the secondary index is always automatic. Once SI table is created, Carbondata's
CarbonOptimizer with the help of
CarbonSITransformationRule, transforms the query plan to hit the
SI table based on the filter condition or set of filter conditions present in the query.
So the first level of pruning will be done on the SI table as it stores blocklets and main table/parent
table pruning will be based on the SI output, which helps in giving the faster query results with
Secondary Index table can be created with the below syntax
CREATE INDEX [IF NOT EXISTS] index_name ON TABLE maintable(index_column) AS 'carbondata' [PROPERTIES('table_blocksize'='1')]
NOTE: Keywords given inside
For instance, main table called sales which is defined as
CREATE TABLE sales ( order_time timestamp, user_id string, sex string, country string, quantity int, price bigint) STORED AS carbondata
User can create SI table using the Create Index DDL
CREATE INDEX index_sales ON TABLE sales(user_id) AS 'carbondata' PROPERTIES('table_blocksize'='1')
When a user executes a filter query, during the query planning phase, CarbonData with the help of
CarbonSITransformationRule, checks if there are any index tables present on the filter column of
query. If there are any, then the filter query plan will be transformed in such a way that execution will
first hit the corresponding SI table and give input to the main table for further pruning.
For the main table sales and SI table index_sales created above, following queries
SELECT country, sex from sales where user_id = 'xxx' SELECT country, sex from sales where user_id = 'xxx' and country = 'INDIA'
will be transformed by CarbonData's
CarbonSITransformationRule to query against SI table
index_sales first which will be input to the main table sales
case1: When the SI table is created and the main table does not have any data. In this case every consecutive load to the main table, will load data to the SI table once the main table data load is finished.
case2: When the SI table is created and the main table already contains some data, then SI creation will also load data to the SI table with the same number of segments as the main table. Thereafter, consecutive load to the main table will also load data to the SI table.
Direct query can be made on SI tables to check the data present in position reference columns. When a filter query is fired, and if the filter column is a secondary index column, then plan is transformed accordingly to hit the SI table first to make better pruning with the main table and in turn helps for faster query results.
Users can verify whether a query can leverage the SI table or not by executing the
command, which will show the transformed logical plan, and thus users can check whether the SI table
Running Compaction command (
ALTER TABLE COMPACT)[COMPACTION TYPE-> MINOR/MAJOR] on main table will
automatically delete all the old segments of SI and creates a new segment with same name as main
table compacted segment and loads data to it.
Where there are so many small files present in the SI table, then we can use the REFRESH INDEX command to compact the files within an SI segment to avoid many small files.
REFRESH INDEX sales_index ON TABLE sales
This command merges data files in each segment of the SI table.
REFRESH INDEX sales_index ON TABLE sales WHERE SEGMENT.ID IN(1)
This command merges data files within a specified segment of the SI table.
When Secondary indexes are created on a table(s), data fetching happens from secondary indexes created on the main tables for better performance. But sometimes, data fetching from the secondary index might degrade query performance in cases where the data is sparse and most of the blocklets need to be scanned. So to avoid such secondary indexes, we use NI as a function on filters within WHERE clause.
SELECT country, sex from sales where NI(user_id = 'xxx')
The above query ignores column
user_id from the secondary index and fetches data from the main table.
This command is used to get information about all the secondary indexes on a table.
SHOW INDEXES ON [TABLE] [db_name.]table_name
This command is used to drop an existing secondary index on a table
DROP INDEX [IF EXISTS] index_name ON [TABLE] [db_name.]table_name
This command registers the secondary index with the main table in case of compatibility scenarios where we have old stores.
REGISTER INDEX TABLE index_name ON [TABLE] [db_name.]table_name