Object storage is the recommended storage solution in the cloud, as it can store large data files without restrictions on data size, and the data can be accessed from anywhere at any time. S3 APIs are widely used for accessing object stores and can be used to store or retrieve data on Amazon cloud, Huawei Cloud (OBS), or any other object store conforming to the S3 API. CarbonData supports any object storage that conforms to the Amazon S3 API and relies on the Hadoop-provided S3 filesystem APIs to access object stores.
To store CarbonData files in an object store, the `carbon.storelocation` property has to be configured with the object store path in the `carbon.properties` file.

For example:

```
carbon.storelocation=s3a://mybucket/carbonstore
```
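The same property can also be set programmatically before any CarbonData table is created; a minimal Scala sketch, assuming the `CarbonProperties` utility from `org.apache.carbondata.core.util` is available on the classpath:

```scala
import org.apache.carbondata.core.util.CarbonProperties

// Equivalent to the carbon.properties entry above; must run before the
// CarbonData session and tables are initialized for the store location to apply.
CarbonProperties.getInstance()
  .addProperty("carbon.storelocation", "s3a://mybucket/carbonstore")
```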
If the existing store location cannot be changed, or only specific tables need to be stored in the cloud object store, this can be done by specifying the `LOCATION` option in the CREATE TABLE DDL command.

For example:

```sql
CREATE TABLE IF NOT EXISTS db1.table1(col1 string, col2 int) STORED AS carbondata LOCATION 's3a://mybucket/carbonstore'
```
For more details on CREATE TABLE, refer to the DDL documentation of CarbonData.
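The same DDL can be run through a Spark session; the sketch below uses the table and S3 path from the example above, with hypothetical sample rows added only to verify that the table is writable and readable. It assumes `spark` is a SparkSession with CarbonData support and that S3 credentials are already configured (see the authentication section below):

```scala
// Create the table at the S3 location, as in the DDL example above.
spark.sql(
  """CREATE TABLE IF NOT EXISTS db1.table1(col1 string, col2 int)
    |STORED AS carbondata
    |LOCATION 's3a://mybucket/carbonstore'""".stripMargin)

// Hypothetical sample rows, just to verify the round trip to the object store.
spark.sql("INSERT INTO db1.table1 VALUES ('a', 1), ('b', 2)")
spark.sql("SELECT * FROM db1.table1").show()
```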
Authentication properties have to be configured to store the CarbonData files to an S3 location. They can be set in any of the following ways:
1. Set the authentication properties in core-site.xml; refer to the Hadoop authentication documentation. A sketch of such a core-site.xml is shown after this list.
2. Set the authentication properties in spark-defaults.conf.

   Example:

   ```
   spark.hadoop.fs.s3a.secret.key=123
   spark.hadoop.fs.s3a.access.key=456
   ```
3. Pass the authentication properties as configuration with spark-submit.

   Example:

   ```
   ./bin/spark-submit \
   --master yarn \
   --conf spark.hadoop.fs.s3a.secret.key=123 \
   --conf spark.hadoop.fs.s3a.access.key=456 \
   --class=xxx
   ```
4. Set the authentication properties on the Hadoop configuration object of the SparkContext.

   Example:

   ```scala
   sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "123")
   sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "456")
   ```
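For the first option above, a minimal core-site.xml sketch using the standard Hadoop S3A property names and the same placeholder credentials as in the other examples:

```xml
<configuration>
  <!-- Placeholder credentials; replace with real keys or a credential provider. -->
  <property>
    <name>fs.s3a.access.key</name>
    <value>456</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>123</value>
  </property>
</configuration>
```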