SEGMENT MANAGEMENT

Each load into CarbonData is written into a separate folder called Segment.Segments is a powerful concept which helps to maintain consistency of data and easy transaction management.CarbonData provides DML (Data Manipulation Language) commands to maintain the segments.

SHOW SEGMENT

This command is used to list the segments of CarbonData table.

SHOW [HISTORY] SEGMENTS
[FOR TABLE | ON] [db_name.]table_name
[AS (select query from table_name_segments)]

By default, SHOW SEGMENT command will return following fields:

  • Segment ID

  • Segment Status

  • Load Start Time

  • Load Time Taken

  • Partition

  • Data Size

  • Index Size

    Example: Show visible segments

    SHOW SEGMENTS ON CarbonDatabase.CarbonTable
    

    Show all segments, include invisible segments

    SHOW HISTORY SEGMENTS ON CarbonDatabase.CarbonTable
    

    When more detail of the segment is required, user can issue SHOW SEGMENT by query.

    The query should be against table name with '_segments' appended and select from following fields:

  • id: String, the id of the segment

  • status: String, status of the segment

  • loadStartTime: String, loading start time

  • loadEndTime: String, loading end time

  • timeTakenMs: Long, time spent in loading of the segment in milliseconds

  • partitions: String array, partition key and values

  • dataSize: Long, data size in bytes

  • indexSize: Long, index size in bytes

  • mergedToId: String, the target segment that this segment has been compacted

  • format: String, data format of the segment

  • path: String, in case of external segment this will be the path of the segment, otherwise it is null

  • segmentFileName: String, name of the segment file

    Example:

    SHOW SEGMENTS ON CarbonTable AS 
    SELECT * FROM CarbonTable_segments
    
    SHOW SEGMENTS ON CarbonTable AS
    SELECT id, dataSize FROM CarbonTable_segments 
    WHERE status='Success' 
    ORDER BY dataSize
    
    SHOW SEGMENTS ON CarbonTable AS
    SELECT avg(timeTakenMs) FROM CarbonTable_segments  
    

DELETE SEGMENT BY ID

This command is used to delete segment by using the segment ID. Each segment has a unique segment ID associated with it. Using this segment ID, you can remove the segment.

The following command will get the segmentID.

SHOW SEGMENTS FOR TABLE [db_name.]table_name LIMIT number_of_segments

After you retrieve the segment ID of the segment that you want to delete, execute the following command to delete the selected segment.

DELETE FROM TABLE [db_name.]table_name WHERE SEGMENT.ID IN (segment_id1, segments_id2, ...)

Example:

DELETE FROM TABLE CarbonDatabase.CarbonTable WHERE SEGMENT.ID IN (0)
DELETE FROM TABLE CarbonDatabase.CarbonTable WHERE SEGMENT.ID IN (0,5,8)

DELETE SEGMENT BY DATE

This command will allow to delete the CarbonData segment(s) from the store based on the date provided by the user in the DML command. The segment created before the particular date will be removed from the specific stores.

DELETE FROM TABLE [db_name.]table_name WHERE SEGMENT.STARTTIME BEFORE DATE_VALUE

Example:

DELETE FROM TABLE CarbonDatabase.CarbonTable WHERE SEGMENT.STARTTIME BEFORE '2017-06-01 12:05:06' 

QUERY DATA WITH SPECIFIED SEGMENTS

This command is used to read data from specified segments during CarbonScan.

Get the Segment ID:

SHOW SEGMENTS FOR TABLE [db_name.]table_name LIMIT number_of_segments

Set the segment IDs for table

SET carbon.input.segments.<database_name>.<table_name> = <list of segment IDs>

NOTE: carbon.input.segments: Specifies the segment IDs to be queried. This property allows you to query specified segments of the specified table. The CarbonScan will read data from specified segments only.

If user wants to query with segments reading in multi-threading mode, then CarbonSession.threadSet can be used instead of SET query.

CarbonSession.threadSet ("carbon.input.segments.<database_name>.<table_name>","<list of segment IDs>");

Reset the segment IDs

SET carbon.input.segments.<database_name>.<table_name> = *;

If user wants to query with segments reading in multi-threading mode, then CarbonSession.threadSet can be used instead of SET query.

CarbonSession.threadSet ("carbon.input.segments.<database_name>.<table_name>","*");

Examples:

  • Example to show the list of segment IDs, segment status, and other required details and then specify the list of segments to be read.
SHOW SEGMENTS FOR carbontable1;

SET carbon.input.segments.db.carbontable1 = 1,3,9;
  • Example to query with segments reading in multi-threading mode:
CarbonSession.threadSet ("carbon.input.segments.db.carbontable_Multi_Thread","1,3");
  • Example for threadset in multi-thread environment (following shows how it is used in Scala code):
def main(args: Array[String]) {
Future {          
  CarbonSession.threadSet ("carbon.input.segments.db.carbontable_Multi_Thread","1")
  spark.sql("select count(empno) from carbon.input.segments.db.carbontable_Multi_Thread").show();
   }
 }