Records that fail to load into CarbonData because of data type incompatibility, or that are empty or have an incompatible format, are classified as Bad Records.
The bad records are stored at the location set in carbon.badRecords.location in the carbon.properties file; if the property is not set explicitly, a default location is used.
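For example, a minimal carbon.properties entry; the directory below is a hypothetical path chosen for illustration:

carbon.badRecords.location=/tmp/carbon/badrecords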
While loading data we can specify the approach used to handle Bad Records. In order to analyse the cause of the Bad Records, the parameter BAD_RECORDS_LOGGER_ENABLE must be set to the value TRUE. There are multiple approaches to handle Bad Records, which can be specified through the parameter BAD_RECORDS_ACTION.
To ignore the Bad Records and prevent them from getting stored in the raw CSV, we need to set the following in the query: 'BAD_RECORDS_ACTION'='IGNORE'
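Putting the two parameters together, a load command could look like the following sketch; the table name and CSV path are hypothetical:

LOAD DATA INPATH 'hdfs://localhost:9000/data/sample.csv'
INTO TABLE carbon_table
OPTIONS('BAD_RECORDS_LOGGER_ENABLE'='TRUE', 'BAD_RECORDS_ACTION'='IGNORE');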
The store location specified while creating the carbon session is used by CarbonData to store metadata such as the schema, dictionary files, dictionary metadata and sort indexes.
Try creating the carbon session with the storepath specified in the following manner:

val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession(<store_path>)

Example:

val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("hdfs://localhost:9000/carbon/store")
Apache CarbonData acquires a lock on files to prevent concurrent operations from modifying the same files. The lock can be of the following types depending on the storage location; for HDFS we specify it to be of type HDFSLOCK, and by default it is set to type LOCALLOCK. The property carbon.lock.type specifies the type of lock to be acquired during concurrent operations on a table. This property can be set to the following values:

LOCALLOCK: the lock is created as a file on the local file system. Use this lock when only one Spark driver (thrift server) runs on the machine and no other CarbonData Spark application is launched concurrently.
HDFSLOCK: the lock is created as a file on HDFS. Use this lock when multiple CarbonData Spark applications are launched concurrently and the store is on HDFS.
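For example, to use HDFS-based locking, add the following entry to carbon.properties:

carbon.lock.type=HDFSLOCK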
In order to build the CarbonData project it is necessary to specify the Spark profile. The Spark profile sets the Spark version, which you need to specify when using Maven to build the project.
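A sketch of such a Maven invocation; the profile name and Spark version below are illustrative and must match the profiles defined in the project's pom.xml:

mvn clean package -DskipTests -Pspark-2.2 -Dspark.version=2.2.1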
Carbon supports the insert operation; you can refer to the syntax mentioned in DML Operations on CarbonData. First, create a source table in spark-sql and load data into this created table.
CREATE TABLE source_table(
id String,
name String,
city String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
SELECT * FROM source_table;

id   name   city
1    jack   beijing
2    erlu   hangzhou
3    davi   shenzhen
Scenario 1:

Suppose the column order in the carbon table differs from that of the source table. Querying with "SELECT * FROM carbon_table" will return the columns in the source table's order, rather than in the carbon table's column order as expected.
CREATE TABLE IF NOT EXISTS carbon_table(
id String,
city String,
name String)
STORED BY 'carbondata';
INSERT INTO TABLE carbon_table SELECT * FROM source_table;
SELECT * FROM carbon_table;

id   city   name
1    jack   beijing
2    erlu   hangzhou
3    davi   shenzhen
As the result shows, the second column of the carbon table is city, but what it contains are name values such as jack. The same phenomenon occurs when inserting data into a Hive table.
If you want to insert data into the corresponding columns of the carbon table, you have to specify the same column order in the insert statement:
INSERT INTO TABLE carbon_table SELECT id, city, name FROM source_table;
Scenario 2:

The insert operation will fail when the number of columns in the carbon table differs from the number of columns specified in the select statement. The following insert operation will fail:
INSERT INTO TABLE carbon_table SELECT id, city FROM source_table;
Scenario 3:

When the column type in the carbon table differs from the type of the column specified in the select statement, the insert operation will still succeed, but you may get NULL values in the result, because NULL is substituted whenever the type conversion fails.
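As a sketch, suppose a hypothetical table carbon_table2 whose id column is Int; selecting a String column into it fails the conversion, so id is loaded as NULL:

CREATE TABLE IF NOT EXISTS carbon_table2(
id Int,
name String)
STORED BY 'carbondata';

INSERT INTO TABLE carbon_table2 SELECT name, name FROM source_table;

Here the first projected column, name, cannot be converted to Int, so every id value in carbon_table2 becomes NULL.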
Following are the aggregate queries that will not fetch data from the aggregate table (the examples assume a second, pre-existing table pop1 with a ctry column):

Query with a subquery predicate:

create table gdp21(cntry smallint, gdp double, y_year date) stored by 'carbondata';
create datamap ag1 on table gdp21 using 'preaggregate' as select cntry, sum(gdp) from gdp21 group by cntry;
select ctry from pop1 where ctry in (select cntry from gdp21 group by cntry);

Aggregate function combined with an 'in' filter:

create table gdp21(cntry smallint, gdp double, y_year date) stored by 'carbondata';
create datamap ag1 on table gdp21 using 'preaggregate' as select cntry, sum(gdp) from gdp21 group by cntry;
select cntry, sum(gdp) from gdp21 where cntry in (select ctry from pop1) group by cntry;

Aggregate function joined with an equality filter:

create table gdp21(cntry smallint, gdp double, y_year date) stored by 'carbondata';
create datamap ag1 on table gdp21 using 'preaggregate' as select cntry, sum(gdp) from gdp21 group by cntry;
select cntry, sum(gdp) from gdp21, pop1 where cntry = ctry group by cntry;
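For contrast, a query whose shape matches the datamap definition, such as the following sketch, is the kind that can be answered from the aggregate table:

select cntry, sum(gdp) from gdp21 group by cntry;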
A Spark executor shows a task as failed only after the maximum number of retry attempts. However, when the data being loaded contains bad records and BAD_RECORDS_ACTION (carbon.bad.records.action) is set to "FAIL", the task attempts the load only once and sends a failure signal to the driver instead of throwing an exception to trigger a retry, as there is no point in retrying once a bad record is found and BAD_RECORDS_ACTION is set to fail. Hence the Spark executor displays this single attempt as successful even though the command has actually failed to execute. Check the task attempts or the executor logs to observe the failure reason.
The SDK writer is an independent entity, hence it can generate carbondata files from a non-cluster machine that is in a different time zone. But when those files are read at the cluster, the cluster time zone is always applied; hence the values of timestamp and date datatype fields are not the original values. If you want to control the time zone of the data while writing, set the cluster's time zone in the SDK writer by calling the API shown below.
Example: if the cluster time zone is Asia/Shanghai:

TimeZone.setDefault(TimeZone.getTimeZone("Asia/Shanghai"))