Thursday, August 1, 2024

Analyzing CSV files using SQL with Squirrel SQL client and Apache Calcite JDBC driver

 

Intro

I had written another article about working with CSV files using SQL, via the csvjdbc driver, but that driver does not yet support joins, unions, etc., which is actually quite important. So in this article, I am using an alternative JDBC driver that does support working with multiple tables: the Apache Calcite CSV example.

Apache Calcite is a generic framework that provides a way to work with different data sources, such as files, using SQL. It defines interfaces like Schema, Table, Column etc. that we can implement to enable a resource to be used in SQL, and it can execute SQL statements against these customized implementations. It also ships some ready-made implementations, e.g. one to read CSV files as relational data, along with a JDBC driver for this purpose. This CSV JDBC driver is the one we will use in this article. Calcite also provides a command line utility to execute SQL, called sqlline.

The JDBC driver

The way this driver works is that instead of a database name, we point it to a JSON file that defines the database model. This file in turn specifies a directory that holds the CSV files, and all the CSV files under that directory are treated as tables in that model. The CSV files need to have a header row for column names.
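For illustration, here is a minimal Java sketch of using the driver directly, assuming the model.json shown later in this article; the model path and the "trades" table name are hypothetical placeholders for your own setup:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CalciteCsvDemo {
    public static void main(String[] args) throws Exception {
        // Point the URL at your model.json (this path is hypothetical)
        String url = "jdbc:calcite:model=C:/projects/trading/model.json";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             // "trades" stands for a trades.csv file in the model's directory
             ResultSet rs = stmt.executeQuery("SELECT * FROM \"trades\"")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}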

The SQL client

Now, we could use the driver directly from a Java program, perhaps passing it the SQL from the command line, but why do that when an SQL client written in Java is available? That's where Squirrel SQL comes in. It has been my favorite for many years.
By the way, I have read some good things about DBeaver too, which we might cover later.

Installation

You need to download the jar installer from https://squirrel-sql.sourceforge.io/#installation, and then run it on the command line with java -jar <jarname-you-downloaded>. Run it as administrator, else it may throw errors about not being able to write to the installation directory. After a successful install, it will create a shortcut that you can use to launch the Squirrel SQL client.

Configuring the client

Firstly, we need to register our Calcite CSV JDBC driver with Squirrel. There is a Drivers tab on the left side. Click on that, and then the plus icon to add a new driver. The sample JDBC URL looks like jdbc:calcite:model=path/to/model.json. The driver class name is org.apache.calcite.jdbc.Driver.

We also need to specify here the JDBC driver jar itself, and the other jars that it depends on. One issue we face here is that there are a large number of dependencies. How do we collect them all? Well, it's easy: head over to the Maven repo for the CSV example. First download the csv example jar itself, then download the pom.xml from the same folder.
Now, in the folder where you placed the pom.xml, run
mvn dependency:copy-dependencies -DoutputDirectory=.
This will download all the needed jars into the same folder. Add all these jars to the driver's dependencies from the UI. It should say that the driver has been installed successfully, and you will see it with a green tick in the list of drivers.

See the screenshot below:

[Screenshot: the Calcite driver registered in the Drivers tab]

Now we need to create a data source, which in our case will point to the model.json, which in turn points to the folder containing the CSV files. So now click on the Aliases tab, then on the plus icon to create a new alias. Substitute your path to model.json in the URL. No user/password is needed. See the screenshot below:

[Screenshot: alias configuration with the Calcite JDBC URL]
But what's in the model.json?

{
  version: '1.0',
  defaultSchema: 'Trading',
  schemas: [
    {
      name: 'Trading',
      type: 'custom',
      factory: 'org.apache.calcite.adapter.csv.CsvSchemaFactory',
      operand: {
        directory: 'C:\\projects\\trading\\db'
      }
    }
  ]
}

As you can see, it's a schema created using the CsvSchemaFactory, pointing to the directory that holds our CSV files.

Now you can connect to the data source by double clicking the alias. Once connected, the Objects tab will display the CSV file names as table names, and the header names as column names. In the SQL tab, you can execute SQL queries. Note exactly which table and column names are displayed, and use those in your queries. You may need quotes around names with spaces, e.g. "Industry Name".

And yes, we can write SQL on multiple tables using joins, subqueries, unions etc. But it's not all smooth sailing. All columns, for instance, are defined as type VARCHAR, so we always need to do type conversions. Also, empty strings are not NULLs and need to be handled separately.
Can we change the model.json to redefine the proper column types? This needs to be researched. A workaround is to define views with checking and conversion functions on columns to avoid the above issues.
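For example, a join with the type conversion and empty-string handling described above might look like this; the "trades" and "instruments" tables and their columns are hypothetical stand-ins for your own CSV files:

SELECT t."Symbol",
       i."Industry Name",
       -- treat empty strings as NULL, then convert the VARCHAR to a number
       CAST(NULLIF(t."Price", '') AS DECIMAL(10, 2)) AS price
FROM "trades" t
JOIN "instruments" i ON t."Symbol" = i."Symbol"
WHERE NULLIF(t."Price", '') IS NOT NULL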


Happy SQLing!





Sunday, July 28, 2024

Analyzing CSV files with SQL using Squirrel SQL Client and csvjdbc driver

Note

I just found that the csvjdbc driver does not yet support joins, unions, etc., which is actually quite important. If you want that, check this article that uses Apache Calcite.

Hola. So, being an SQL fan, I somehow find it much easier to work on Excel files with SQL, rather than in Excel itself. SQL is so beautiful, so easy, yet so powerful!

The JDBC driver

Firstly, we need a JDBC driver that can read CSV files. I am using the one from https://github.com/simoc/csvjdbc. It's read-only, and that's okay for me, since I only wish to analyze the files. The jar can be downloaded from Maven: https://mvnrepository.com/artifact/net.sourceforge.csvjdbc/csvjdbc/1.0.42

The way this driver works is that instead of a database name, we specify a directory that holds the CSV files, and all the CSV files under that directory are treated as tables in that database. The CSV files need to have a header row for column names. The separator can be specified in the JDBC URL.

The SQL client

Now, we could use the driver directly from a Java program, perhaps passing it the SQL from the command line, but why do that when an SQL client written in Java is available? That's where Squirrel SQL comes in. It has been my favorite for many years.

Installation

You need to download the jar installer from https://squirrel-sql.sourceforge.io/#installation, and then run it on the command line with java -jar <jarname-you-downloaded>. Run it as administrator, else it may throw errors about not being able to write to the installation directory. After a successful install, it will create a shortcut that you can use to launch the Squirrel SQL client.

Configuring the client

Firstly, we need to register our csvjdbc driver with Squirrel. There is a Drivers tab on the left side. Click on that, and then the plus icon to add a new driver. The sample JDBC URL looks like jdbc:relique:csv:<foldername>?separator=,&fileExtension=.csv. We also need to specify here the JDBC driver jar itself, and the other jars that it depends on. See the screenshot below:

[Screenshot: csvjdbc driver registration]
It should say that the driver has been installed successfully, and you can see it with a green tick in the list of drivers.

Now we need to create a data source, which in our case will point to a folder containing CSV files. So now click on the Aliases tab, then on the plus icon to create a new alias. Substitute your folder name in the URL. No user/password is needed. See the screenshot below:

[Screenshot: alias pointing at the CSV folder]
Now you can connect to the data source by double clicking the alias. Once connected, the Objects tab will display the CSV file names as table names, and the header names as column names. In the SQL tab, you can execute SQL queries. Note exactly which table and column names are displayed, and use those in your queries. You may need quotes around names with spaces, e.g. "Industry Name".
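For example, a simple single-table query (remember, no joins with this driver) might look like the following; the companies table is a hypothetical stand-in for a companies.csv file in your folder:

SELECT "Industry Name", COUNT(*) AS company_count
FROM companies
GROUP BY "Industry Name"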




Happy SQLing!





Wednesday, November 1, 2023

IoT with the Raspberry Pi Pico

A few months ago, I got a requirement for a POC: to create a system to monitor a UPS. The inverter would provide its status, e.g. "Mains On", "Inverter On", "Battery Voltage", "Mains High", "Mains Low" etc. over an RS232 serial port. We had to read it and send it over the net to our website, where the data would be stored and monitored. The readings were to be taken every minute.

The Microcontroller

Reading UART serial data is not difficult for even a very basic microcontroller. Even without a UART port, it is simple to read it in software, as I did here. However, the requirement to post the data in real time to a website called for network connectivity, and hence a more powerful solution.

This is where the Raspberry Pi Pico comes in.
  • It's a powerful microcontroller board, with many IOs including UART ports.
  • It has a Pico W version, which comes with WiFi connectivity.
  • It's cheap (the base version is around Rs 350).
  • It can be programmed with MicroPython, which is much easier to program in than C, and the Thonny IDE is very user friendly. Kudos to the MicroPython and Thonny devs!
While I thought I was done with the choice of microcontroller, a problem arose. There was no WiFi connection available onsite, only an Ethernet one, and the Pico does not have a version with an Ethernet port. So I had to do some research again.

There were options where I could connect an additional Ethernet board to the Raspberry Pico, but then, I wanted it to be inbuilt, to keep it simple.

Fortunately, Wiznet provides Ethernet versions: modified versions of the Pico board that come with an Ethernet port.
Now where would I get one in India?
Digikey seems to be a great site for electronics, with a huge catalogue, low prices and volume discounts. However, I just wanted one item to start with, and the shipping charges were too high. Hence I opted for Hubtronics, which was costlier for the board, but overall cheaper due to the lower shipping charges.

Development

The Pico I had ordered is a development board, so it can be programmed easily. It is connected to your computer using a micro-USB cable. It has a switch that can be used to put it in two different modes:
  1. As a USB storage device. This mode is used only once at the start, to copy the MicroPython image to the Pico's internal storage. This is necessary to support the MicroPython programming environment.
  2. As a USB device that interacts with the Thonny IDE to develop and test the MicroPython programs.
Here's a tutorial on how to go about it.

Some Gotchas

Some problems that I encountered

  • Voltage compatibility - The serial port signals were 0-5V, whereas the Pico's IO, like the Pi's, is 0-3.3V. There are various solutions for this, from transistors to dedicated chips, but I did it with a resistive voltage divider, since the RS232 baud rate used was low.
  • Male and female pinout differences - This may not be immediately clear, but is obvious when you think of it. Male and female pinouts are different, since Rx on one has to go to Tx on the other and so on. So look up the pinout as per the type of socket.
  • UART parameter matching - When you initialize the Pico's UART, make sure that its parameters like baud rate, parity and start/stop bits match those of the port you are interfacing with (see the sketch after this list).
  • Reads and timeouts - If you read multiple bytes at a time and the read times out, you may get less data than you asked for.
  • Sleeps - Long sleeps in between reads can make you miss data that is being sent in a stream.
  • MicroPython differences - Since MicroPython is not the full Python, some commonly used classes are missing, e.g. not all data structures are available.
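Here is a minimal MicroPython sketch illustrating the UART gotchas above; it assumes UART0 on pins GP0/GP1 at 9600 8N1, hypothetical values that must be matched to the port you are reading:

from machine import UART, Pin
import time

# Parameters (baud rate, parity, stop bits) must match the port being read
uart = UART(0, baudrate=9600, bits=8, parity=None, stop=1,
            tx=Pin(0), rx=Pin(1), timeout=1000)  # read timeout in ms

while True:
    if uart.any():                 # bytes waiting in the receive buffer
        data = uart.read(64)       # may return fewer bytes than asked if it times out
        if data:
            print(data)            # replace with parsing and posting to the website
    time.sleep_ms(50)              # keep sleeps short, or streamed data may be missed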



Tuesday, August 22, 2023

Git for Svners

Git, the currently popular versioning system, is a pain to understand for people who come from CVS/Subversion and are used to having just a central repo and a local working directory.
Since Git is a distributed repository system, each developer has his own repository, which keeps track of local changes and can sync with one or multiple remote repositories. Hence the concepts of Git are rather different from SVN's, and the same terms, like checkout, mean different things in SVN and Git. Being an SVNer earlier, I too found Git frustrating to start with.

Atlassian has a good tutorial that explains these concepts :
https://www.atlassian.com/git/tutorials/learn-git-with-bitbucket-cloud

Some important concepts :

  • The local repository is a full fledged repository, and independently holds a history of its own branches, commits etc. One could create and work with a local repository without ever connecting to a remote one, say for some private work that is not shared.
  • We usually start by cloning a remote repository to a local one. See clone. This copies the remote repository to the local machine, and also adds a reference to the remote repository, usually as "origin", to the local repo's list of remotes (see the remote command). Thus, when syncing with the original repository, we can refer to it as "origin".
  • The checkout command does NOT check out files from the remote. Instead it checks out the specified branch to the working directory, i.e. that branch now becomes the current one.
  • Git has a staging area where the changes to be committed are kept. We specify which files are to be staged using the add command. Without adding, the commit command will have nothing to work on.
  • When you push your changes to the remote, you are syncing changes committed in your local repo to the remote one. So you must have committed them to your local repo first. This may not be obvious to SVNers, who would expect the locally updated files to be pushed automatically! (A minimal command sequence follows this list.)
  • TODO : See the branch command
  • Unlike SVN, we do not need separate working folders for branches. We can switch to another branch in the same working folder, using the checkout command. This means that uncommitted changes can be lost, unless we stash them to a temp location.
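Putting the above together, a minimal command sequence looks like this (the repository URL, branch and file names are hypothetical):

git clone https://example.com/project.git   # copy the remote repo locally; it is registered as "origin"
cd project
git checkout -b my-feature                  # create and switch to a branch, in the same working folder
git add notes.txt                           # stage the change
git commit -m "Add notes"                   # commit to the LOCAL repository
git push origin my-feature                  # only now do the commits reach the remote
git stash                                   # park uncommitted changes before switching branches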

Maven for ANTers

Coming from a build tool like Ant, it can be pretty frustrating for developers to understand Maven. It seems to be doing too many unspecified things, too rigidly. This article tries to explain Maven from that perspective.

So here's a quick summary :

  1. Maven executes goals, just like Ant has targets.
  2. In addition, Maven has phases, each phase being a list of goals. We can tell Maven to execute one or more phases, and then each phase will execute the goals grouped under it. Phases themselves have an order, so that if we execute a given phase, the phases coming before it in the predefined order will be executed first (see the example after this list).
  3. The default phases are validate, compile, test, package, verify, install, deploy.
  4. Each Maven execution run happens in a lifecycle, which has phases under it. The default lifecycles are default, clean and site.
  5. We can create and define our own custom goals. These are packaged in a plugin.
  6. Maven also allows us to specify and manage the dependencies of the project.
  7. Maven resources have a groupId, artifactId and version.
  8. Maven provides for a repository to fetch/store artifacts. The central Maven repository is the default. Also, Maven keeps a local repository on the system where it runs, to avoid fetching from the remote repository each time.
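For example, because of the phase ordering in the default lifecycle:

mvn package    # implicitly runs validate, compile and test first, then package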

 

See also https://www.baeldung.com/maven-goals-phases


Read this link first for a basic understanding:
https://maven.apache.org/guides/introduction/introduction-to-the-lifecycle.html

In Ant, we specify tasks and their dependencies. We can execute a specific task, and only that task, along with its specified dependent tasks, will be executed. But this also means that tasks like clean, compile, generate, copy, package, install, deploy etc. have to be specified in detail for all projects, along with config like source, staging and target dirs. Also, external dependencies like jars have to be managed, along with their versions. These may be common amongst multiple projects. Should we check these into source control, or have a separate common location? What if we want to add a common functionality to all builds? How do we name and version output artifacts?

Maven tries to answer these questions. It tries to provide a standard way to execute projects, by promoting convention over configuration:

  • A standardized project directory structure, e.g. src/main/java, src/main/resources, src/test etc.
  • Naming conventions for artifacts using groupId, artifactId and version
  • A dependency configuration, and a repository mechanism to store and access needed dependencies
  • An out-of-the-box implementation of the standard build lifecycle, so that a project can be built with minimum configuration
  • An inheritance mechanism, so that a build may be shared amongst multiple projects, each overriding only the parts it needs
  • Profiles, to have different builds for different cases, e.g. we might just need compile in dev mode, while the full jar with dependencies may have to be built in production mode

So if we have a simple Java project with no dependencies, then a minimal pom with just the groupId, artifactId and version will be able to build the project from clean to deploy.
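Such a minimal pom.xml could look like this (com.example and demo-app are hypothetical coordinates):

<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <!-- the three coordinates that identify the artifact -->
  <groupId>com.example</groupId>
  <artifactId>demo-app</artifactId>
  <version>1.0.0</version>
</project>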

Maven works with lifecycles, phases and goals, and has default implementations of these. We can also create our own lifecycles; a lifecycle is a list of phases, in order. Then there are goals, which execute in a particular phase and do the actual work, e.g. in the default lifecycle, the install phase has a goal install:install.

Goals are usually implemented using Java classes called plugins. When creating a plugin, we can specify in which phase it should execute, though it's not mandatory. The phase in which a plugin executes can also be specified via the build configuration, and this will override the default.

What can we execute with the Maven command?

A mixture of phases and goals. For whatever phases are specified, all the phases before them in the lifecycle will be implicitly executed.
However, that's not the case with goals: only the specified goals will be executed.

e.g. we can execute just the install goal using mvn install:install, install:install being the name of the goal in the install phase. Earlier phases will not be run. However, this can cause problems, as in this case: the goal looks in the execution context for the name of the output jar to install, which is missing. It can be remedied by also running the jar:jar goal before it, so that the jar name becomes available.
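For example:

mvn jar:jar install:install    # jar:jar packages the jar first, so install:install knows what to install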

What does the maven build file consist of ?


  1. The groupId, artifactId and version of the project being built; this is the minimal info needed.
  2. The list of dependencies needed by the project.
  3. The list of dependency repositories, if using any other than the standard Maven one.
  4. The list of repositories, if any, to deploy/publish the final artifact (usually a jar, war etc.) to.
  5. The build section, to use and configure non-standard Maven plugins, specifying in which phase and with what params they are to run, and similarly to customise/change the execution of the standard Maven plugins.
  6. A profiles section.




Saturday, April 9, 2022

What to watch on Prime Video

TODO

Movies

English

Youth 

The Unbearable Weight of Massive Talent

Dune

Kimi

The tender bar

Notting Hill

Many Tom Cruise movies, including the MI series

Many James Bond movies

Pride and Prejudice

Sense and Sensibility

The Matrix

The Jurassic movies

The Harry Potter movies

The Lord of the rings

The tomorrow war

The Personal History of David Copperfield

The Big Wedding

Wild mountain thyme

Before I fall

2067

Pan's Labyrinth

Another round

Motherless Brooklyn

Mothers day

The kids are all right

Minari

Bliss

Gravity

Bottle Shock

Uncle Frank

A street cat named Bob

Aeronauts

Mr Holmes

Children of the bride

Gone girl

Wild oats

Mud

The secret : Dare to dream

Indian

Sharmaji Namkeen

Sherni

Missing

Bonus

Dev bhoomi

Series

Howards End
Tales from the Loop - Sci Fi
The mentalist - Detective thriller
Scorpion - Nerds rule
Person of Interest - AI and crime
Jack Ryan - CIA thriller
Jack Reacher - Private Investigator thriller
Wheel of time - High Fantasy
Picard - star trek
The good doctor
Doctor House
Sherlock Holmes
Seinfeld