Thursday, August 1, 2024

Analyzing CSV files using SQL with Squirrel SQL client and Apache Calcite JDBC driver

 

Intro

I had written another article to work with CSV files as SQL, using the csvjdbc driver, but it does not yet support using joins, unions, etc, which is actually quite important. So in this article, i am using an alternative JDBC driver that does support working with multiple tables  : Apache Calcite CSV example.

Apache calcite is a generic framework which provides a way to work with different data sources like files using SQL. It defines interfaces like Schema, Table, Colum etc. that we can implement to enable a resource to be used in SQL. It also has the ability to execute SQL statements using these customized implementations. It also provides some implementations, e.g. one to read CSV files as relational data, and provides a JDBC driver for this purpose. This CSV JDBC driver is the one we will use in this article. It also provides a command line utility to execute SQL , called sqlline.

The JDBC driver

The way this driver works is that instead of the database name, we have to point to a JSON file that defines the database model. This file will in turn specify a directory that holds the csv files. And all the csv files under that directory will be considered as tables in that model. The CSV files need to have a header for column names.

The SQL client

Now, we could use the driver directly from a java program, perhaps passing it the sql from the command line, but why do that when a sql client written in java is available ? That's where Squirrel SQL comes in. It has been my favorite since last many years.
Btw, i have read some good stuff about DBeaver too, which we might cover later.

Installation

You need to download the jar installer from https://squirrel-sql.sourceforge.io/#installation, and then run it on the command line with java -jar <jarname-you-downloaded>. Run it as administrator, else it may throw errors about not being able to write to the installation directory. After successful install, it will create a shortcut and you can use that to launch the Squirrel SQL client.

Configuring the client

Firstly, we need to register our calcite csv jdbc driver with squirrel. There is a Drivers tab on the left side. Click on that, and the plus icon to add a new driver. The sample JDBC url looks like jdbc:calcite:model=path/to/model.json. The driver class name is org.apache.calcite.jdbc.Driver. 

We also need to specify here the JDBC driver jar itself, and the other jars that it depends on. One issue we face here is that there are a large number of dependencies. How do we collect them all ? Well its easy : Head over to the maven repo for the csv example. First download the csv example jar itself. Then download the pom.xml in the same folder. 
Now run 
mvn dependency:copy-dependencies -DoutputDirectory=.
This will download all the jars needed in the same folder. Add all these jars to the Driver's dependencies from the UI. It should say that the driver has been installed successfully, and you can see it with a green tick in the list of drivers.

See the screenshot below :





Now we need to create a data source, which in our case will point to the model json, which in turn will point to the folder containing csv files. So click now on the Aliases tab, then on the plus icon to create a new alias. Substitute the your path to model.json in the url. No user/password is needed. See the screenshot below :






But whats in the model.json ?

{
  version: '1.0',
  defaultSchema: 'Trading',
  schemas: [
    {
      name: 'Trading',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.csv.CsvSchemaFactory',
      operand: {
        directory: 'C:\\projects\\trading\\db'
      }
    }
  ]
}

As you see, its a schema created using the CsvSchemaFactory, that points to the directory holding our Csv files.

Now you can connect to the data-source by double clicking the alias. Once connected, the objects tab will display the csv file names as table names, and the header names as column names. In the SQL tab, you can execute the SQL queries. Check out what table names and column names are displayed and use exactly those. You may need quotes around those with spaces, e.g. "Industry Name".

And yes, we can write SQLs on multiple tables using joins, subqueries, unions etc. But its not all happy sailing. All columns for instance, are defined as type VARCHAR, so we always need to do type conversions. Also empty string are not NULLs and need to be handled separately.
Can we change the model.json to redefine the proper column types ? This needs to be researched. A workaround is to define views with checking and conversion functions on columns to avoid the above issues.


Happy SQLing !





No comments:

Post a Comment