Sunday, July 28, 2024

Analyzing CSV files with SQL using Squirrel SQL Client and csvjdbc driver

Note

I just found that the csvjdbc driver does not yet support using joins, unions, etc, which is actually quite important, so if you want that, check this article that uses Apache Calcite.

Hola. So being an SQL fan, i somehow find it much easier to work on Excel files in SQL, rather than Excel itself. SQL is so beautiful, so easy, yet so powerful !

The JDBC driver

Firstly, we need a JDBC driver that can read CSV files. I am using the one from https://github.com/simoc/csvjdbc. Its read-only, and that's okay for me, since i only wish to analyze the files. The jar can be downloaded from maven : https://mvnrepository.com/artifact/net.sourceforge.csvjdbc/csvjdbc/1.0.42

The way this driver works is that instead of the database name, we need to specify a directory that holds the csv files. And all the csv files under that directory will be considered as tables in that database. The CSV files need to have a header for column names. The separator can be mentioned in the jdbc url.

The SQL client

Now, we could use the driver directly from a java program, perhaps passing it the sql from the command line, but why do that when a sql client written in java is available ? That's where Squirrel SQL comes in. It has been my favorite since last many years.

Installation

You need to download the jar installer from https://squirrel-sql.sourceforge.io/#installation, and then run it on the command line with java -jar <jarname-you-downloaded>. Run it as administrator, else it may throw errors about not being able to write to the installation directory. After successful install, it will create a shortcut and you can use that to launch the Squirrel SQL client.

Configuring the client

Firstly, we need to register our csvjdbc driver with squirrel. There is a Drivers tab on the left side. Click on that, and the plus icon to add a new driver. The sample JDBC url looks like jdbc:relique:csv:<foldername>?separator=,&fileExtension=.csv. We also need to specify here the JDBC driver jar itself, and the other jars that it depends on. See the screenshot below :




It should say that the driver has been installed successfully, and you can see it with a green tick in the list of drivers.

Now we need to create point to a data source, which in our case will be a folder containing csv files. So click now on the Aliases tab, then on the plus icon to create a new alias. Substitute your folder-name in the url. No user/password is needed. See the screenshot below :




Now you can connect to the data-source by double clicking the alias. Once connected, the objects tab will display the csv file names as table names, and the header names as column names. In the SQL tab, you can execute the SQL queries. Check out what table names and column names are displayed and use exactly those. You may need quotes around those with spaces, e.g. "Industry Name".




Happy SQLing !