This tutorial provides a quick introduction to using CarbonData.
Create a sample.csv file using the following commands. The CSV file is required for loading data into CarbonData.
cd carbondata
cat > sample.csv << EOF
id,name,city,age
1,david,shenzhen,31
2,eason,shenzhen,27
3,jarry,wuhan,35
EOF
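Before loading, it can help to confirm that every line of the file has the same number of fields as the header. The following is a minimal shell sketch and not part of CarbonData. The helper name csv_fields_consistent is hypothetical, and it splits naively on commas, so it assumes no quoted field contains a comma.

```shell
# Hypothetical helper: succeed only if every line of the CSV has the same
# number of comma-separated fields as the header line (line 1).
csv_fields_consistent() {
  awk -F, 'NR == 1 { want = NF }          # remember the header field count
           NF != want { bad = 1 }         # flag any line that differs
           END { if (NR == 0 || bad) exit 1 }' "$1"
}

# Example use in the directory that holds sample.csv:
#   csv_fields_consistent sample.csv && echo "sample.csv looks consistent"
```

An empty file is treated as inconsistent, since there is no header to compare against.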
Apache Spark Shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively. See the Apache Spark documentation for more details on the Spark shell.
Start Spark shell by running the following command in the Spark directory:
./bin/spark-shell --jars <carbondata assembly jar path>
NOTE: The assembly jar is available after building CarbonData and can be copied from ./assembly/target/scala-2.1x/carbondata_xxx.jar
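Because the exact jar name depends on the build, one option is to search the build output instead of typing the name. This is a minimal sketch, assuming the jar sits somewhere under assembly/target in the CarbonData checkout and its file name starts with carbondata. The helper name find_assembly_jar is hypothetical.

```shell
# Hypothetical helper: print the first jar under <carbondata dir>/assembly/target
# whose file name starts with "carbondata". Assumes the build has already run.
find_assembly_jar() {
  target_dir="$1/assembly/target"
  # Fail early if the build output directory does not exist.
  [ -d "$target_dir" ] || return 1
  # Sort so the choice is deterministic when several jars match.
  find "$target_dir" -type f -name 'carbondata*.jar' | sort | head -n 1
}

# Example use from the Spark directory (the checkout path is a placeholder):
#   jar="$(find_assembly_jar /path/to/carbondata)"
#   ./bin/spark-shell --jars "$jar"
```

If nothing matches, the helper prints nothing, so it is worth checking that the variable is non-empty before starting the shell.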
In this shell, SparkSession is readily available as spark and Spark context is readily available as sc.
To create a CarbonSession, configure it explicitly as follows:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._
val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("<hdfs store path>")
NOTE: By default, the metastore location points to ../carbon.metastore. To use a different metastore location, pass it to CarbonSession as a second argument: SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("<hdfs store path>", "<local metastore path>")
scala> carbon.sql("""CREATE TABLE
         IF NOT EXISTS test_table(
         id string,
         name string,
         city string,
         age Int)
         STORED BY 'carbondata'""")
scala> carbon.sql("""LOAD DATA INPATH '/path/to/sample.csv'
         INTO TABLE test_table""")
NOTE: Replace /path/to/sample.csv in the script above with the actual path to sample.csv.
If you encounter a "tablestatus.lock" issue, refer to troubleshooting.
scala> carbon.sql("SELECT * FROM test_table").show()
scala> carbon.sql("""SELECT city, avg(age), sum(age)
         FROM test_table
         GROUP BY city""").show()
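To sanity-check the aggregation query, the same per-city average and sum of age can be computed directly from the CSV, outside Spark. This is a minimal shell sketch under the assumption that the file has the id,name,city,age layout shown earlier. The helper name city_stats is hypothetical, and the data is recreated in a temporary file so the block runs on its own.

```shell
# Recreate the sample data in a temporary file so the sketch is self-contained.
csv="$(mktemp)"
cat > "$csv" << 'EOF'
id,name,city,age
1,david,shenzhen,31
2,eason,shenzhen,27
3,jarry,wuhan,35
EOF

# Hypothetical helper: print city,avg(age),sum(age) for each city,
# skipping the header line. Column 3 is city, column 4 is age.
city_stats() {
  awk -F, 'NR > 1 { sum[$3] += $4; n[$3]++ }
           END { for (c in sum) printf "%s,%.1f,%d\n", c, sum[c] / n[c], sum[c] }' "$1" | sort
}

city_stats "$csv"
# prints:
#   shenzhen,29.0,58
#   wuhan,35.0,35

rm -f "$csv"
```

The values for each city should match what the GROUP BY query returns, although Spark formats its output as a table and does not guarantee row order.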