Each load into CarbonData is written into a separate folder called Segment.Segments is a powerful concept which helps to maintain consistency of data and easy transaction management.CarbonData provides DML (Data Manipulation Language) commands to maintain the segments.
This command is used to list the segments of CarbonData table.
SHOW [HISTORY] SEGMENTS
[FOR TABLE | ON] [db_name.]table_name
[AS (select query from table_name_segments)]
By default, SHOW SEGMENT command will return following fields:
Segment ID
Segment Status
Load Start Time
Load Time Taken
Partition
Data Size
Index Size
Example: Show visible segments
SHOW SEGMENTS ON CarbonDatabase.CarbonTable
Show all segments, include invisible segments
SHOW HISTORY SEGMENTS ON CarbonDatabase.CarbonTable
When more detail of the segment is required, user can issue SHOW SEGMENT by query.
The query should be against table name with '_segments' appended and select from following fields:
id: String, the id of the segment
status: String, status of the segment
loadStartTime: String, loading start time
loadEndTime: String, loading end time
timeTakenMs: Long, time spent in loading of the segment in milliseconds
partitions: String array, partition key and values
dataSize: Long, data size in bytes
indexSize: Long, index size in bytes
mergedToId: String, the target segment that this segment has been compacted
format: String, data format of the segment
path: String, in case of external segment this will be the path of the segment, otherwise it is null
segmentFileName: String, name of the segment file
Example:
SHOW SEGMENTS ON CarbonTable AS
SELECT * FROM CarbonTable_segments
SHOW SEGMENTS ON CarbonTable AS
SELECT id, dataSize FROM CarbonTable_segments
WHERE status='Success'
ORDER BY dataSize
SHOW SEGMENTS ON CarbonTable AS
SELECT avg(timeTakenMs) FROM CarbonTable_segments
This command is used to delete segment by using the segment ID. Each segment has a unique segment ID associated with it. Using this segment ID, you can remove the segment.
The following command will get the segmentID.
SHOW SEGMENTS FOR TABLE [db_name.]table_name LIMIT number_of_segments
After you retrieve the segment ID of the segment that you want to delete, execute the following command to delete the selected segment.
DELETE FROM TABLE [db_name.]table_name WHERE SEGMENT.ID IN (segment_id1, segments_id2, ...)
Example:
DELETE FROM TABLE CarbonDatabase.CarbonTable WHERE SEGMENT.ID IN (0)
DELETE FROM TABLE CarbonDatabase.CarbonTable WHERE SEGMENT.ID IN (0,5,8)
This command will allow to delete the CarbonData segment(s) from the store based on the date provided by the user in the DML command. The segment created before the particular date will be removed from the specific stores.
DELETE FROM TABLE [db_name.]table_name WHERE SEGMENT.STARTTIME BEFORE DATE_VALUE
Example:
DELETE FROM TABLE CarbonDatabase.CarbonTable WHERE SEGMENT.STARTTIME BEFORE '2017-06-01 12:05:06'
This command is used to read data from specified segments during CarbonScan.
Get the Segment ID:
SHOW SEGMENTS FOR TABLE [db_name.]table_name LIMIT number_of_segments
Set the segment IDs for table
SET carbon.input.segments.<database_name>.<table_name> = <list of segment IDs>
NOTE: carbon.input.segments: Specifies the segment IDs to be queried. This property allows you to query specified segments of the specified table. The CarbonScan will read data from specified segments only.
If user wants to query with segments reading in multi-threading mode, then CarbonSession.threadSet can be used instead of SET query.
CarbonSession.threadSet ("carbon.input.segments.<database_name>.<table_name>","<list of segment IDs>");
Reset the segment IDs
SET carbon.input.segments.<database_name>.<table_name> = *;
If user wants to query with segments reading in multi-threading mode, then CarbonSession.threadSet can be used instead of SET query.
CarbonSession.threadSet ("carbon.input.segments.<database_name>.<table_name>","*");
Examples:
SHOW SEGMENTS FOR carbontable1;
SET carbon.input.segments.db.carbontable1 = 1,3,9;
CarbonSession.threadSet ("carbon.input.segments.db.carbontable_Multi_Thread","1,3");
def main(args: Array[String]) {
Future {
CarbonSession.threadSet ("carbon.input.segments.db.carbontable_Multi_Thread","1")
spark.sql("select count(empno) from carbon.input.segments.db.carbontable_Multi_Thread").show();
}
}