Monday, August 26, 2019

Useful online sites/services

SQL Fiddles

Sometimes we want to test SQL against a particular database without having it installed.
There are sites that provide a testing environment for various databases, e.g. the ones below.
Note that some sites may consider whatever you submit to be under a Creative Commons license.

https://dbfiddle.uk/ (this one has MariaDB too)
http://www.sqlfiddle.com/#!3/e48975/1

Monday, August 12, 2019

Editing tips

Select Code Block

An awesome facility in jEdit that allows us to select the text between matching braces. Not only curly, but square and round brackets are also supported. The text can be cut or copied as needed. I found this useful when editing large JSON data. Most of the free online editors do not allow cutting the text of a collapsed node, and this was what I wanted. I was able to achieve it using the above facility in jEdit, though it does not show a collapsed view. It's under the Edit -> Source menu. We can also just navigate to the matching brace.

Thursday, August 1, 2019

Starting with Sails.js

Having decided to explore Node.js, I wanted to start with something like Rails that would be able to generate an MVC app from the database. While I haven't worked with Rails, I did work with the Rails-inspired CakePHP, and liked it. Sails.js is the equivalent in the Node.js world.

On a side note, while Node is noted for its performance in handling I/O-bound tasks, there are async frameworks in the Python world too, like gevent, asyncio and uvloop, which can do the same, and there are some comparisons: https://magic.io/blog/uvloop-blazing-fast-python-networking/
Development-wise, I think I would prefer Python over JavaScript.
Anyway, on to Sails:

So I installed Node and Sails 1.1 and fired up the app with sails lift. It's nice to have some example models and a working setup like Sails provides, so we can get started quickly.

Though Sails does not generate CRUD from the database, there is a sails-inverse-model module that does basic CRUD generation. I haven't tested it end to end though.

I made some changes to the User model, and added another model.
Then I wanted to point the default file-based database to my local MySQL. Accordingly, I toddled over to config/datastores.js and made the necessary changes. Restarted Sails and got a nasty shock! There was a big stack trace about some auto-migration, and an error about being unable to insert data. But I hadn't set up any migration. It turns out Sails has a facility to apply the latest model changes to the database, to keep the database in sync with the models. This is called migration, and it is set to 'alter' by default in config/models.js. This 'alter' mode drops and recreates tables instead of altering them! I feel this is very dangerous behaviour, and it should have been turned off by default.
So first thing, set migrate to 'safe', or risk losing your data!


The next problem I faced was that my changes to the User model were not being reflected, and I was getting an error on account of that. I thought it might be some sort of caching, but neither a restart nor deleting the temp folder helped. Neither could I find any such issue reported on the internet. Finally, looking at the files, I saw that there was a User.js~ backup file created by jEdit, and this old copy of the User model was getting picked up. Apparently the filter used to load model files is too lenient; it will even accept a *.xml extension as a model.
To see where the files were being loaded, I introduced a syntax error in the model file. Sure enough, Sails complained about the error, and I saw from the stack trace that the loading was being done in node_modules/sails/lib/hooks/moduleloader/index.js:304:18. Adding [^~]$ at the end of the filename regex fixed the loading-backup-files issue, at least.

Another issue was having to restart Sails after every change, which is really irritating. I saw options like forever, nodemon and sails-hook-autoreload. I tried the autoreload hook, but it seems to trigger the auto-migration again, so I removed it. I tried forever; one issue is that its stop does not work. Also, it is difficult to see the logs, and if there are errors, it crashes. Still to check out.

Trying to get the hang of the file layout. I did not like the actions approach, where each action lives in a separate file rather than all in one controller, but the controller-with-methods approach is also supported. I tried using the ajax-form from parasails as in the examples; however, my form was not being rendered. I read about the client asset JS file, how we need to add one, register the page there, and write the form validations in it. What about server-side validations? Would these need to be repeated there too? There should be a way to share them. And the default layout is updated to load all of these client files at once! Why not include just the required one in the particular view instead?
Model objects are global, though there seems to be a setting to control this. I like to know what I am importing.

Overall, Sails seems to be doing too much implicitly, and if something is not working, finding and changing the behaviour takes a lot of time. I also wonder how all this affects performance. The Waterline ORM has its limitations, like not fetching associations' data. Also, what about facilities like the default APIs? One would need to know how to turn them off. I think I would prefer to go with something minimal like Express, rather than Sails.

On the positive side, Sails comes with a responsive UI and ready-to-go facilities.
It also comes with an auto-generator: just by defining a model, we can expose the model's API as JSON.

Tuesday, July 30, 2019

Single Page Applications, Server vs Client-side, EJS, React, Vue, Angular

For a few years now, single-page applications have been popular, along with the advent of micro-services. Why have they become popular, and what are the advantages of using them?

Traditionally, typical web applications followed the Model-View-Controller pattern, with all of these written on the server side. So a Java-based app would have the views in some Java-related technology like JSP or JSF, and the routing logic in the controllers would be in the same technology. Similarly, in a PHP app, it would all be in PHP.

Probably with the advent of AI/ML/analytics, languages like Python became the choice for those kinds of applications, due to the extensive libraries/frameworks available.
So if I wanted to move my Java application to Python, I would need to rewrite not just the models, but the controllers and views too, in Python.
So people started thinking: could we not separate the UI part, and make it independent of the backend? Enter the SPAs.

Single-page applications are so called because they keep all the UI and routing logic in one bundle (page) loaded once in the browser (it might be split if it is too large); subsequently, no UI has to be loaded from the server. Usually the routing, i.e. the controller logic, is also in the same bundle and not on the server side. Other than loading the UI quickly, without page refreshes apparent to the user, another advantage is re-usability. The UI is detached from the server-side code, and communicates with the server side through HTTP APIs for the business logic. As long as the APIs produce the same output, the server-side tech stack can be changed without affecting the UI.

Some of the challenges for SPAs are:

  1. Search engine optimization: UI pages are not really URLs on the server side, but just one JavaScript resource containing all the UI/routing content, and hence are not available in the traditional way to search engines. One solution may be to render the pages that need to be indexed on the server side, and the others on the client.
  2. Initial load time: can be high for SPAs, since all the UI pages are loaded in one go in the browser.
  3. The server-side APIs exposed for the use of the SPA also become available to everyone, increasing the chance of hacking or unauthorized use. Also, since the controller and view logic is JavaScript running in the browser and can be tampered with, one needs to be careful to validate the flow on the server side as well.
Angular, React and Vue are some frameworks used to build SPAs. SPAs render the UI on the client side, i.e. in the browser.

EJS, Pug and Handlebars are some server-side templating engines for JavaScript, like JSP, or JSP with EL, JSTL etc. They generate the UI on the server side, so apps using them behave like the traditional web applications.

Responsive means one that responds to its environment, changing behaviour depending on where it is loaded: e.g. the menus and page layout get re-arranged depending on whether the page is loaded on a mobile, a tablet, or a desktop/laptop, so that it looks good on each. See https://www.w3schools.com/html/html_responsive.asp
While a responsive app can be used via a browser on both mobiles and computers, it will not be able to use the full native functionality of the platform, e.g. the camera, GPS or other system devices.

What are native mobile apps ?
Native means written for a particular platform using its tools/SDK, like Swift for iOS, Java for Android etc. There are also frameworks like React Native, which will generate a native app from code written in web-like technologies, i.e. HTML and JS.
A hybrid app is a web app that can access some, but not all, native functionality. It uses a sort of bridge component that can invoke native functionality.

Bootstrap is a free and open-source CSS framework directed at responsive, mobile-first front-end web development. It contains CSS- and JavaScript-based design templates for typography, forms, buttons, navigation and other interface components.

What is <script type="text/template">some html</script>

It is ignored by the browser for rendering. It is used by client-side templating frameworks like Vue, React etc. to define an HTML template into which values can be substituted to create HTML on the fly, e.g. for adding a new row to a table. Nothing new here; it has long been possible to add user-defined XML (not just under the script tag) into a page and use it this way.

Tuesday, July 23, 2019

Router Port Forwarding

Sometimes (at the risk of exposing your computer to the big bad world), you want to allow access from the internet to an application running on your P.C.

Your P.C. is usually going to be behind a router, so the internet knows only the address of the router and not that of your P.C. Your router keeps your internal network separate from the internet. The router has two I.P.s: an internal one to communicate with the internal network, and an external one to communicate with the internet.

You should also have a firewall set up on your router that does not allow any unsolicited incoming traffic to your internal network.

But suppose you do want to expose an application on your P.C. to the internet. How do you go about it?
The internet knows only the router's external I.P. So obviously, that has to be used.

We need to configure what is called "Port Forwarding" on the router. It allows us to redirect an incoming request that arrives at the router on a particular port to the same or another port on some machine in the internal network.
So, for example, I might set up forwarding such that a request to http://router-external-ip:8080 is redirected to my P.C. at 192.168.1.200:80, thus making the application running at 192.168.1.200:80 available to the internet.

Note that the firewall rules may have to be changed to allow the incoming traffic, preferably restricted to the external address making the request, and the port being used.

There are sites online that will allow you to test whether the port forwarding is working or not, e.g. https://www.yougetsignal.com/tools/open-ports/
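If you'd rather test from the command line, below is a minimal sketch using Python's socket module. The I.P. and port are placeholders for your own values; run it from a machine outside your network:

import socket

# placeholders - substitute your router's external I.P. and the forwarded port
ROUTER_EXTERNAL_IP = '203.0.113.10'
PORT = 8080

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(5)
try:
    sock.connect((ROUTER_EXTERNAL_IP, PORT))
    print('open: the router is forwarding the port')
except OSError as err:   # timeout, connection refused etc.
    print('closed or filtered:', err)
finally:
    sock.close()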

If you are getting connection-refused errors, check that the firewall on the P.C. allows the incoming call, as well as the firewall on the internet host that is making the call. E.g. if it's hosted, the hosting provider may have opened default ports like 80, but blocked others like 8080 for outgoing calls.

Tuesday, May 21, 2019

Eye-friendly dark styling for the browser

It is easier on the eyes to have a darker color than white as the background of the websites we browse.

I can think of 2 possible approaches to achieve this :

  1. Setting a web proxy in the browser settings, and manipulating the HTML through the proxy. The problem is HTTPS traffic, which is encrypted and cannot be read/changed in between. There are proxy servers like https://www.charlesproxy.com/ which get around this by generating their own certificates.
  2. Using a JS browser plugin/extension, which allows us to execute JS / inject CSS on the fly after the page loads. Stylish was one such extension, which allowed custom styles. However, it had issues with snooping on your data, so I decided not to go with it. Stylus seems to be an alternative. But doing it via JavaScript is more powerful. I had written one such extension for Chrome: https://chrome.google.com/webstore/detail/onload-scripts/gddicpebdonjdnkhonkkfkibnjpaclok
So I started off with injecting CSS using JavaScript to change the background color to gray. However, there are sites like Facebook which seem to check that their divs are not being changed. I found that this is usually true for divs containing images. So I thought maybe I could exclude such divs from the styling, and it seems to work. Below is the JavaScript used. Disclaimer: I have copied some of it from a Stack Overflow answer.

UPDATE: The MutationObserver approach seems to take too much time on FB.
The event-listener approach seems to work well.

    document.addEventListener("DOMNodeInserted", function(e) {
        //console.log( "DNI" + e);
        var anode = e.target; // the node just inserted into the DOM
        var tag = anode.tagName && anode.tagName.toLowerCase();
        if (tag == 'div' || tag == 'span' || tag == 'td') {
            setbgcolor(anode);
        }
    },
    false);

-----------

function addCss(rule) {
  let css = document.createElement('style');
  css.type = 'text/css';
  if (css.styleSheet) css.styleSheet.cssText = rule; // Support for IE
  else css.appendChild(document.createTextNode(rule)); // Support for the rest
  document.getElementsByTagName("head")[0].appendChild(css);
}
var bgcolor = 'lightgray';

function setbgcolor( elem){
    // Don't change if first child elem is image. For sites like FB
    if(  ! elem.firstElementChild || (elem.firstElementChild && elem.firstElementChild.tagName.toLowerCase() != 'img') ){
        elem.style.backgroundColor = bgcolor;
        //console.log( "ONLS:" + elem.outerHTML);
    }

}

// divs with images
var allDivs = document.getElementsByTagName("div");
for( var i=0; i< allDivs.length; i++){
    var currDiv = allDivs[i];
    setbgcolor(currDiv);
}
// CSS rules
let rule  = 'body {background-color: ' + bgcolor + '} ';
    //rule += 'div {background-color: ' + bgcolor + '} ';
    rule += 'pre {background-color:' + bgcolor + '} ';
    rule += 'td {background-color:' + bgcolor + '} ';

addCss(rule);


// Select the node that will be observed for mutations
var targetNode = document.getElementsByTagName('body')[0];

// Options for the observer (which mutations to observe)
var config = { attributes: true, childList: true, subtree: true };

// Callback function to execute when mutations are observed
var callback = function(mutationsList, observer) {
    for(var mutation of mutationsList) {
        if (mutation.type == 'childList') {
            for (var i = 0; i < mutation.addedNodes.length; i++) {
                var anode = mutation.addedNodes[i];
                //console.log( "ONLSMO" +  anode.tagName + anode.id );
                if( anode.tagName && (anode.tagName.toLowerCase() == 'div' || anode.tagName.toLowerCase() == 'span' || anode.tagName.toLowerCase() == 'td')){
                    setbgcolor(anode);
                }
            }
        }
    }
};

// Create an observer instance linked to the callback function
var observer = new MutationObserver(callback);

// Start observing the target node for configured mutations
observer.observe(targetNode, config);

// Later, you can stop observing
//observer.disconnect();

Monday, May 20, 2019

Relationships in sqlalchemy

SQLAlchemy is a powerful and flexible framework in Python for interacting with relational databases.
This post covers only a small portion of working with relationships.

What are the advantages of relationships ?

They allow us to query the data of related objects along with the main object.
This can be done via a join or via separate queries, either eagerly, or lazily (when the related object is accessed).
They also make it easy to insert/delete/update data in related tables, especially in the case of one-to-many/many-to-many relationships.


Sample Entities

Consider the entities defined below :

----

from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship, sessionmaker

Base = declarative_base()

class User(Base):
    __tablename__ = 'users'

    id = Column(Integer, primary_key=True)
    name = Column(String(50))

    addresses = relationship("Address", back_populates="user")

class Address(Base):
    __tablename__ = 'addresses'

    id = Column(Integer, primary_key=True)
    city = Column(String(50))
    street = Column(String(50))

    user_id = Column(Integer, ForeignKey('users.id'))
    user = relationship("User", back_populates="addresses")

----
Here, a user can have many addresses, reflected by the addresses relationship. An address on the other hand, belongs to a single user, reflected by the user relationship.

Sample Data

Consider the following data :

ed_user = User(name='Edward')
ed_user.addresses = [ Address(city='Pune'), Address(city='Mumbai')]
bob_user = User(name='Bob')
bob_user.addresses = [ Address(city='Pune'),Address(city='Delhi')]

session.add(ed_user)
session.add(bob_user)

Creating the tables

It's possible to create the tables needed for the entities using metadata.create_all():

engine = create_engine('sqlite:///:memory:') # Memory engine
Session = sessionmaker(bind=engine)
session = Session()

User.metadata.create_all(engine) # Create the tables

Logging of SQLs can be enabled with:
import logging
logging.basicConfig()
logging.getLogger('sqlalchemy.engine').setLevel(logging.INFO)

Test Scenarios

All Users, with all addresses

qry_users = session.query(User).all()

print( "All users", [ (qu.name, [add.city for add in qu.addresses]) for qu in qry_users])

This is quite straightforward. We did not explicitly query for addresses. The users will be queried, and since the default relationship loading is lazy, a query for the addresses of each user is fired when its address details are accessed. Since a separate query is fired per user, the lazy option is not performant if there are many rows of the main object and we need the addresses of each (the classic N+1 queries problem).
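If we know upfront that the addresses will be needed, the relationship can instead be loaded eagerly in the same round trip. A minimal sketch using the joinedload option (selectinload is another choice):

from sqlalchemy.orm import joinedload

# one query with a LEFT OUTER JOIN loads users and addresses together,
# instead of one extra query per user
qry_users = session.query(User).options(joinedload(User.addresses)).all()
print( "All users", [ (qu.name, [add.city for add in qu.addresses]) for qu in qry_users])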

Users with a Mumbai address,  Mumbai addresses only

Why do we specify the Mumbai condition twice? The first part is to filter users, fetching only those with a Mumbai address. The second part is to filter the addresses loaded for each user, restricting them to Mumbai addresses. This can be a bit confusing at first. To filter just the main object, we could do:

qry_users = session.query(User).filter(User.addresses.any(city='Mumbai')).all()

However, this filters users, not addresses, so we will still get each matching user's non-Mumbai addresses too.

In this case, since both are to be filtered, an inner join will suffice.

qry_users = session.query(User).join(User.addresses).options(contains_eager(User.addresses)).filter(Address.city=='Mumbai').all()

We joined via User.addresses; this way, we do not have to repeat the join condition, as it is picked up from the relationship.
What is the need for the contains_eager (imported from sqlalchemy.orm)? It tells SQLAlchemy that the related addresses have already been loaded by this query, so the relationship query need not fire again. Without it, not only would the addresses query fire again (poor performance), but it would fetch all addresses, which we do not want.

Let's try to query all users again. What's this? Edward's addresses show only Mumbai! This is a result of caching: the Edward user object was last populated with only the Mumbai address, and was cached in the session along with its related objects. A session.rollback(), session.expire_all() or session.expire(obj) can be used to clear this cache and make SQLAlchemy fetch the latest data from the db. It would be a good idea to put one of these before each test scenario, to get the expected results.
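A minimal illustration:

session.expire_all()  # mark all cached instances stale; the next access re-queries
qry_users = session.query(User).all()
print( "All users", [ (qu.name, [add.city for add in qu.addresses]) for qu in qry_users])
# Edward's addresses now show Pune and Mumbai again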

All Users,  Mumbai addresses only

Note that in this case, we are not filtering users, only addresses. So if a user does not have a Mumbai address, she should still be listed, albeit with an empty addresses collection, i.e. an outer join. This is usually what we want for related objects. This seems a straightforward case, and maybe something like filterrelated(obj, condition) should have been available. But it's not; we again have to perform a join, an outer one this time.

----
addresses = User.metadata.tables['addresses'] # reference to a table object
sel = addresses.select().where(Address.city=='Mumbai')
qry_users = session.query(User).add_entity(Address).outerjoin(('addresses',  sel)).options(contains_eager(User.addresses)).all()

----
We have used a slightly different form here, with a select, since I wanted to avoid duplicating the join condition with addresses. SQLAlchemy has many such options. Again, note the contains_eager, to avoid querying for the related addresses again.

----
print( "All users, Mumbai addresses only", [ (qu[0].name, [add.city for add in qu[0].addresses]) for qu in qry_users])
----
Note that with multiple entities selected in the join, each result row is not a single entity, but multiple entities wrapped in a result object. Also, unlike with a single entity, the results will contain duplicates, as in an SQL join. If we choose specific columns instead of the entire entity, the result will wrap the columns without any entity. This is undesirable: changing the query changes the way in which the results are accessed.

** Actually, with contains_eager, one would expect to get only the User entity, with the addresses as related entities. There are some inconsistencies, or at least difficult-to-understand usages, here. Dropping the add_entity above leads to a single User entity in the output:



qry_users = session.query(User).outerjoin(('addresses',  sel)).options(contains_eager(User.addresses)).all()


All Users with Delhi address,  all addresses

Here, we want to filter users by their addresses, but fetch all addresses of the filtered users. This scenario shows how filtering and fetching related objects are separate things.

qry_users = session.query(User).filter(User.addresses.any(city='Delhi')).all()

A common mistake here might be to try User.addresses.city. Try printing type(User.addresses): it's an InstrumentedAttribute, not a list of Address, so it has no city member, and trying to access one throws "AttributeError: Neither 'InstrumentedAttribute' object nor 'Comparator' object associated with User.addresses has an attribute 'city'". However, the results of the query execution are entity instances, so qu.addresses will be a list of addresses, as we have already seen above. It's important to understand the difference between the entity class and an entity instance.
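A quick way to see the difference:

print(type(User.addresses))   # InstrumentedAttribute: a class-level query construct
ed = session.query(User).filter_by(name='Edward').first()
print(type(ed.addresses))     # InstrumentedList: behaves like a list of Address
print(ed.addresses[0].city)   # instance attribute access works as expected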





Sunday, March 17, 2019

A rant against annotations

This is an old issue, ever since annotations came up. I usually dislike most uses of annotations, except for those that genuinely belong to the source code, like asserts or deprecation markers. The reason is that they break reusability by pulling into a class information about how the class is to be used by other classes, many times tying the class to a framework.

E.g. using Spring/Hibernate annotations in a class ties that class to that framework. What if I later want to change from Hibernate to something else? I will have to change the source code.

Consider the following case of a service class CustomerService using a customerDao. The DAO is annotated with @Component("customerDao"), and injected into the service with @Resource(name="customerDao").
Let's say this DAO uses Hibernate. Later on, we decide to change the DAO to use JdbcTemplate instead of Hibernate, but I want to keep the Hibernate implementation as well, and maybe use both of them in different places. So I now have a new DAO, say customerDaoSpTpl. But the service code now needs to be changed in order to use this new customerDaoSpTpl. So did we really achieve DI, if we needed to change the service class using the DAO?

The problem is due to the annotations: we are putting stuff in a class that does not really belong in a class definition, but is about the wiring or interaction of classes.
There are also issues when old classes from a jar are to be replaced with new ones: the annotation processing can pick up the annotations of both. (There probably is some exclude facility for the scanning.) The names used in annotated classes from a jar can't be changed either.

Ideally the wiring should be kept separate from the class definition.
Spring does support this, through XML as well as Java-based configuration classes, but everyone seems to put the annotations into the source code, because that's easier to use upfront.

One disadvantage of having a single separate source for the wiring may be that it changes frequently and always needs merging when checking in.

Saturday, March 9, 2019

Distributing the database

You want to design your application to scale, so you need to choose a database solution that will scale.
And it would be nice to have the querying and integrity features of an RDBMS.

Assuming that we want to shard data, and not just have a replica, some questions arise:
  • How easy would it be to add a new node? Should sharding be automatic, or per some partitioning rules? If per rules, what happens when we add/remove nodes? The data will need to be redistributed.
  • If a table is distributed across nodes, what happens to primary key ids? How to prevent duplicates across nodes? Id ranges? (One classic trick is sketched after this list.)
  • One major use of ids as primary keys is in foreign key constraints. But will a distributed table support foreign keys? Probably not.
  • Will ACID transactions be available?
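On the duplicate-ids question, one classic trick (MySQL exposes it as auto_increment_increment/auto_increment_offset) is to give every node the same stride and a distinct offset. A toy sketch in Python:

NUM_NODES = 4  # the stride, shared by all nodes

def make_id_generator(node_offset):
    counter = 0
    def next_id():
        nonlocal counter
        counter += 1
        return counter * NUM_NODES + node_offset  # sequences never overlap
    return next_id

node1_ids = make_id_generator(1)
node2_ids = make_id_generator(2)
print([node1_ids() for _ in range(3)])  # [5, 9, 13]
print([node2_ids() for _ in range(3)])  # [6, 10, 14]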

Found some open-source solutions to scale RDBMS :
  • CockroachDB (https://www.cockroachlabs.com/): A key-value store that supports the wire protocol of PostgreSQL, SQL, ACID transactions etc. So to SQL drivers it behaves as if it were a distributed PostgreSQL, but internally it is not.
  • Citus (https://www.citusdata.com): A PostgreSQL extension that allows us to scale PostgreSQL. The underlying db is indeed PostgreSQL. However, not all PostgreSQL features can be supported in distributed mode.
  • Postgres-XL (https://www.postgres-xl.org/overview/)

Sell Software online

Where can software developers sell their products online ?
There are many facilitators that help setup your own site from templates like opencart, with payment-processing baked in.
But what if I don't want the hassle of setting up a site ?
Are there online stores where one can  register and start selling ?

Payloadz : https://store.payloadz.com/
$19.95 per month, plus 2.9% + $0.29 per transaction, billed monthly. Has a store. No reviews?

ClickBank: https://accounts.clickbank.com
This affiliate marketplace is for all things digital. Though ClickBank has its own store, it focuses mainly on using affiliates to generate sales.
$50 activation, no monthly fee, $2.50 per batch payment transferred to your account. You can decide what to pay affiliates. Has a store, no reviews.

EBay:
Don't see much custom software sold here. Commissions hefty.

Amazon:
Don't see much custom software sold here. Commissions hefty.

E-junkie: https://www.e-junkie.com/
eBay affiliation? Seems to have lots of e-books.
$5-40 per month, based on number of products, storage space and remote files.
Features like pay-what-you-want, pdf protection, affiliates.
Has store. No reviews.


PayPro: https://payproglobal.com/
Provides many features, including software protection.
4.9% + $1, up to 7.9% (minimum $1) of sales. Has a store with reviews. Download trial? Adds GST too. Does not need the buyer to log in, but then how do you have verified reviews? Does not support custom tags for searching, though it does allow full-text search. Does not show the price on the list page. Has poor reviews itself: users unable to log in, etc.


Monday, February 18, 2019

Python

Static typing

https://docs.python.org/3/library/typing.html
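A tiny sketch of type hints; checkers like mypy flag mismatches statically, while the interpreter ignores the annotations at runtime (the function here is just an illustrative example):

from typing import List, Optional

def first_city(cities: List[str], default: Optional[str] = None) -> Optional[str]:
    # mypy would flag a call like first_city(123) at check time; plain Python won't
    return cities[0] if cities else default

print(first_city(['Pune', 'Mumbai']))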

Global Interpreter lock (GIL)

The GIL prevents true parallelism for Python threads on multi-core processors, by allowing only one thread to execute Python bytecode at a time.
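A rough way to observe this: the same CPU-bound work takes about as long with 4 threads as with 1 (the GIL serializes them), while 4 processes actually run in parallel. A minimal sketch:

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def burn(n):
    # pure-Python CPU-bound loop; never releases the GIL
    while n:
        n -= 1

if __name__ == '__main__':
    for pool_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        start = time.time()
        with pool_cls(max_workers=4) as pool:
            list(pool.map(burn, [10_000_000] * 4))
        print(pool_cls.__name__, round(time.time() - start, 2), 'sec')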


Synchronization

Web concepts

www sub domain


Ever wondered about the "www" in URLs? Usually we can give it a miss, so what is its use? Let's say our site is milunsagle.in.
1. If we have subdomains like ftp.milunsagle.in, music.milunsagle.in etc., and wish to keep cookies between the main domain and the subdomains separate, www acts like a subdomain placeholder for the main domain. Otherwise, cookies for all of them would be mixed.
2. In the DNS entries, which link the domain name to the server, if we have a www, we can point it to a host name rather than a static I.P., and the host name can then point to multiple I.P.s for redundancy (an illustrative example follows below).
Thanks to https://www.sitepoint.com/domain-www-or-no-www/
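For illustration, the records might look something like this (hypothetical host names and I.P.s; note that the bare domain, the zone apex, must be an A record, while www can be a CNAME alias):

milunsagle.in.        A       203.0.113.10            ; apex needs a static I.P.
www.milunsagle.in.    CNAME   web.myhost.example.     ; www can alias a host name
ftp.milunsagle.in.    CNAME   files.myhost.example.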

Software concepts


Concurrency

 

 Normalization in short :


1NF: Ensure atomic values (not a collection) in a column, and values of the same data type in a column.

2NF: Remove partial dependencies, i.e. a column not depending on all the columns forming the primary key, but only on some of them. E.g. with a primary key of (order_id, product_id), a product_name column depends only on product_id.

3NF: Remove transitive dependencies, i.e. a column not depending directly on the primary key, but on another column that depends on the primary key. E.g. a dept_name column depending on dept_id, which in turn depends on the employee id.


Spark

 
 

Microservices

 
Introduction to microservices by Martin Fowler
https://martinfowler.com/microservices/
 

Image conversion tools

Imagemagick

 
ImageMagick is a great free tool for working with images. Below is a simple command to concatenate/combine images:

# -append joins vertically, +append horizontally
convert a.jpg b.jpg -append c.jpg

https://www.imagemagick.org/script/index.php

Raspberry Pi experiments

Connecting a TV


Wow, at last able to get the RPi's video and audio working on my TV. It was not working even after the config.txt changes to turn off HDMI. Did the following do it? Or was it just a pin connection issue? Dunno, but happy!

/opt/vc/bin/tvservice -o
/opt/vc/bin/tvservice -c "PAL 4:3"

UPDATE : 13-Nov-17
==================
I enabled the desktop mode using raspi-config and rebooted.
Did not need to execute the above tvservice commands.
Most importantly, I had to use the RED cable for video, even after using
the TRRS converter for the AV cable.

------

To connect to the pi over ssh with a GUI (X forwarding), we do:
ssh -X pi@MyPiHostIP
then
pi@raspberrypi:~$ /etc/X11/Xsession
This will start the X session on your desktop.

-------

When directly connected to the pi, it usually boots into command-prompt mode, and we can start a GUI session using the "startx" command.


Accessing the intranet 

Was unable to connect to my Pi from my Linux PC, both being on WLAN using a TP-Link router. It was a router config issue: under Wireless -> Wireless Advanced, we have to turn off "Enable Client Isolation", which is on by default.

Pi as a dumb(?) terminal

Using the Raspberry Pi to connect to a VNC server on another machine. This way, one can share a single machine between multiple people. Of course, the Pi can itself be used as a standalone computer, and the latest model 3, which is 64-bit, costs only around 3k. So if you are looking for a cheap computer, especially for kids, the Pi + keyboard, mouse + monitor may be the way to go, and its GPIO pins can do many other interesting things.


Keyboard


If keys are not working properly on the Raspberry Pi, especially the special keys, try changing the layout:

sudo vi /etc/default/keyboard
change the XKBLAYOUT property to us:
XKBLAYOUT="us"

Linux often used commands


Editing


# diff
vimdiff gives a graphical feel

# Edit files with windows line endings
Pass the -b option to vi, or, once vi is loaded, type :e ++ff=unix.

# change the typing language, e.g. to hindi from English
alt + caps


Shell

# to make the statements/exports in a.sh available in current process
source a.sh

# search recursively for pattern
grep -r --include "*.jsp" pattern

# Sort files by reverse timestamp
find . -name "*.jsp" -printf '%T@ %t %p\n' | sort -k 1 -n -r

# line count
wc -l


FileSystem

# disk usage
df
du

Software management

# to invoke mint's package manager
sudo mintinstall

#to update flash plugin for FF
apt-get install flashplugin-installer


Services

# manage startup of services
initctl list/start/stop/restart <service>

#view/manage services
systemctl


# Remove a service from startup. Here apache
sudo update-rc.d apache2 remove

# Add a service apache2 back into the autostart
sudo update-rc.d apache2 defaults


# Enable autostart
sudo update-rc.d apache2 enable

# Disable autostart. Difference from remove is that the entry is kept.
update-rc.d apache2 disable 

#Can also be done with systemctl, depending on the version
#or the service command :
service apache2 start

Apache start/stop

sudo apachectl start/stop

# apache conf location
/etc/apache2/apache2.conf



Mail

postfix start/stop

d a-z or d * to delete mail messages


Processes

# view ports, in this case for amqp service
nmap -p 1-65535 localhost | grep mq

# view processes
ps -ef
netstat


Pdf

# pdftk remove password from pdf. may throw error, its outdated
pdftk input_pw <pass> in.pdf output out.pdf

# or use qpdf if pdftk gives error
qpdf --password=<your-password> --decrypt /path/to/secured.pdf out.pdf

Images

# image magic join images +for horz, - for vert
convert a.jpg b.jpg -append c.jpg

# image magic pdf to image
convert -density 100 -colorspace rgb test.pdf -scale 200x200 test.jpg

Md5


# check md5 sum of file. Tr to convert case
md5sum spark-2.2.1-bin-hadoop2.7.tgz | tr '[:lower:]' '[:upper:]' | grep C0081F6076070F0A6C6A607C71AC7E95


System Settings

Swap

There is a swappiness setting on Ubuntu, which might make the system use swap even when main memory is available. A lower value will prevent this:
cat /proc/sys/vm/swappiness
sudo sysctl vm.swappiness=10

# turn swap off, then on again, to reset any already-used swap
sudo swapoff -av
sudo swapon -av

Remote

# shutdown/restart a remote Windows PC over Remmina
shutdown /s /t 0
shutdown /r /t 0

teamviewer --daemon stop

Tools utilities

ffmpeg

Cutting a section of an mp3 file on Linux with ffmpeg. To copy a 30-second section starting at 15.5 seconds, use:
ffmpeg -t 30 -ss 00:00:15.500 -i inputfile.mp3 -acodec copy outputfile.mp3




Remote connection


Connected to my standby Windows 7 PC from my Linux Mint desktop, using Remmina. Yo! Linux rocks!

http://www.digitalcitizen.life/connecting-windows-remote-desktop-ubuntu


Binary editors


xxd is a useful tool to convert to/from hex format. It can be used to feed binary data to text/newline-based programs like sed/cut. E.g. to remove the first X bytes of each X+Y byte record in binary data, do:

xxd -c X+Y -ps | cut -c 2X+1- | xxd -r -p

(-c X+Y makes each record one line of hex, each byte being 2 hex chars; cut then drops the first 2X chars of each line, and xxd -r -p converts back to binary.)

Another out-of-the-box tool that can be used for binary data is : bbe - binary block editor

Also a java tool  : https://sourceforge.net/projects/bistreameditor/


REST tools


RESTED is a Postman-like extension for Firefox, to test REST APIs: https://addons.mozilla.org/en-US/firefox/addon/rested/


Browser Styles

This global dark style from userstyles is great! Easier on the eyes. Use it with the Stylish plugin for Firefox:
https://userstyles.org/styles/31267/global-dark-style-changes-everything-to-dark
UPDATE: The Stylish plugin is said to snoop on you, collecting data. There are alternative plugins like Stylus.
Another option is to use a plugin that allows you to execute JS on load, and inject your style sheets via that, e.g. the Onload Scripts extension for Chrome: https://chrome.google.com/webstore/detail/onload-scripts/gddicpebdonjdnkhonkkfkibnjpaclok



Java Streams

Streams in Java are somewhat like iterators, in that they provide an element-at-a-time interface.

However, they can also be processed in parallel.

And operations like filter, map, reduce and collect can be applied.

Below is an example of how the code can be really succinct.
We read a text file and group by the first column, producing counts per group, in a few lines:

// needs: java.nio.file.Files, java.nio.file.Paths, java.util.Map, java.util.stream.Collectors
Map<String, Long> mresult = Files.readAllLines(Paths.get("/home/test/testaccess.log")).stream()
        .map(line -> line.substring(0, line.indexOf(" ")))   // first column of each line
        .collect(Collectors.groupingBy(line -> line, Collectors.counting()));
System.out.println(mresult);

Friday, January 11, 2019

Hadoop quick look

Comparison with other systems :

RDBMS: structured data; better for writes/queries of specific rows; can't scale linearly, as horizontal scaling is not easy, maybe due to ACID constraints.

MapReduce: better for batch processing of entire contents; can handle unstructured data; can scale linearly. Can't work iteratively on changed data, i.e. it starts from scratch every time, but Spark can do this.

HDD seek times are not growing as fast as transfer speeds, so reading a large data set via many seeks performs worse than a full scan, similar to indexes in an RDBMS.

SANs provide block-based network access to storage for servers. Earlier cloud computing used Message Passing Interfaces and SANs to distribute tasks to nodes, but when reading large amounts of data for processing, the SAN becomes a bandwidth bottleneck.

Hadoop shines here, as it co-locates data and processing on the nodes. This feature, known as data locality, is at the heart of data processing in Hadoop and is the reason for its good performance. Also, Hadoop manages the execution of the MapReduce jobs, leaving only the business logic to the programmer. By contrast, MPI programs have to explicitly manage their own communication, checkpointing and recovery, which gives more control to the programmer but makes the programs more difficult to write.

HDFS architecture:

HDFS has a master or name-node, and multiple slave or data-nodes. The data-nodes store the actual data as blocks.
The name-node holds the HDFS file-system tree and the metadata for the files and directories in it; this includes the list of data-nodes for each entry.
Usually, the name-node's writes are also copied to another NFS-mounted location as a backup. Also, a secondary name-node helps in merging
edit-log entries into the main entries for the name-node. Since the file-system entries can be too huge for a single name-node to handle,
a Federation facility provides multiple name-nodes, each managing a portion of the namespace, e.g. /user. A High Availability configuration
is also available, which allows a pair of name-nodes in active/standby mode.
There are command-line tools as well as APIs to access HDFS.

YARN architecture

YARN comprises a single cluster-level process called the resource manager, which allocates the cluster's resources,
and node managers running on the individual nodes, which launch and monitor containers on their node.
A container executes an application-specific process with a constrained set of resources (memory, CPU etc).
A client contacts the resource manager, sending it the binaries to run an application-master process.
Hadoop has its own application master to run MapReduce jobs, Spark has its own to run DAGs, etc.
YARN does not itself provide any means of communication between client, master and process; that is done by the application.
YARN can use different types of schedulers: FIFO, Capacity (which has buckets per job type), and Fair (which allocates resources
evenly between jobs).

Map Reduce Jobs

The Map step processes the input data across the cluster. The Reduce step collects the map output to create the results.
The Map step, implemented by a Mapper, creates data with key and value fields. The input data to be processed is divided into (ideally) equal parts called splits, and that many instances of mappers are instantiated to process the splits, usually on different nodes. InputFormat -> RecordReader -> deserialize into key,value pairs. TextInputFormat is the default input format; it provides the record's offset as the key and the record data as a Text value. It is also possible to use a combiner to further group the map output. The output key-values of the mapper are the intermediate results; they correspond to the input of the Reducer, and are stored on the cluster. A shuffle stage then sorts the intermediate data by key and, depending on the size and partitioning, sends it off to one or more reducers.
Each reducer is guaranteed to get sorted data for one or more keys; the partitioner controls this. Further, there is a grouping control too, to decide which values are sent to one invocation of reduce(). The reducer then produces the output specific to that key, producing a (potentially new) set of key-value pairs. The default way for map and reduce tasks to create output is context.write(key, value).
Both the map and reduce methods get the input params (key, value, context); in the case of reduce, it is multiple values against one key. They can also write directly to the file-system, for needs that do not exactly fit the key-value paradigm.
On the output side, the keys and values are serialized using OutputFormat -> RecordWriter.
It is possible to have a job with only a Map task and no reducers. If none is specified, this defaults to the IdentityReducer, and the output of the map stage becomes the final output. It is even possible in this case to specify the number of reducers, and that many output files are created; shuffle/sort will happen in this case. However, if we specify 0 reducers, no shuffle/sort will take place. It is also possible to provide a custom shuffle/sort implementation from Hadoop 2.9.2 onwards. This can be useful, e.g., when you don't need a sort, or need a different type of sort.

Can reducers start running when some but not all maps have run? No, because all values for a key are guaranteed to go to a single reducer, and the full set for a key can't be known until all maps have finished generating their key-value pairs.

There is an OutputCommitter API to handle custom pre/post actions on tasks.
There are in-built counters to track the job execution, the tasks, the file-system, and the input-output formats.
It is possible to create user-defined counters as well.

There is a distributed cache, where frequently used files and jars can be stored.

The Hadoop Streaming API allows us to use any executable, e.g. a Python or shell script, to implement MapReduce jobs.
Data is passed from/to our map and reduce code via STDIN (standard input) and STDOUT (standard output), as in the sketch below.
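As a sketch, here is the classic word-count pair in Python (mapper.py and reducer.py are hypothetical file names). The mapper emits one "word<TAB>1" line per word; Hadoop sorts the map output by key before the reduce, so the reducer can total counts by watching for the key to change:

#!/usr/bin/env python
# mapper.py : read raw lines on stdin, emit "word<TAB>1" on stdout
import sys

for line in sys.stdin:
    for word in line.split():
        print('%s\t%d' % (word, 1))

#!/usr/bin/env python
# reducer.py : input arrives sorted by key, so we can accumulate per word
import sys

current, count = None, 0
for line in sys.stdin:
    word, num = line.rstrip('\n').split('\t')
    if word == current:
        count += int(num)
    else:
        if current is not None:
            print('%s\t%d' % (current, count))
        current, count = word, int(num)
if current is not None:
    print('%s\t%d' % (current, count))

The pair can be tested without Hadoop at all: cat input.txt | ./mapper.py | sort | ./reducer.py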

Questions
Related tools :

Avro

A language-neutral data-serialization system, described using a language-independent schema, usually JSON. The spec describes the binary format all implementations must support. Similar to SequenceFile, but portable.
A data file contains a header with the schema and metadata. Sync markers make the file splittable. Different schema versions can be used for reading and writing a file, making the schema easy to evolve. This can also be useful for reading a few fields out of a large number of fields. Record classes can be generated from an Avro schema, or the GenericRecord type can be used.
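A minimal write/read sketch in Python, assuming the third-party fastavro package (pip install fastavro); the schema and records are illustrative:

from fastavro import writer, reader, parse_schema

schema = parse_schema({
    'name': 'User', 'type': 'record',
    'fields': [{'name': 'name', 'type': 'string'},
               {'name': 'age', 'type': 'int'}],
})

with open('users.avro', 'wb') as out:
    writer(out, schema, [{'name': 'Edward', 'age': 40}, {'name': 'Bob', 'age': 35}])

with open('users.avro', 'rb') as inp:
    for record in reader(inp):   # the schema is read back from the file header
        print(record)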

Parquet

A columnar storage format that can efficiently store nested data, i.e. objects within objects. It can reduce file size by compressing the data of a column better, and can improve query performance when only a small subset of the columns is read, since data is stored column-wise. Supported by a large number of tools. Uses a schema with a small set of pre-defined types. A Parquet file consists of a header with a magic number, and a footer which has the metadata along with the block boundaries; hence it is splittable. It is organized as blocks -> row groups -> column chunks -> pages, where each column chunk has data only for a single column. Compression is achieved by using encodings like delta, run-length, dictionary etc. To write object data, we need a schema, a writer and the object; for reading, it's a reader. Classes like AvroParquetReader/Writer are available to inter-operate between Avro and Parquet.
Parquet tools are available to work with these files, e.g. to dump their contents.
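A minimal write/read sketch in Python, assuming the pyarrow package (pip install pyarrow); the data is illustrative:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'city': ['Pune', 'Mumbai', 'Delhi'],
                  'pincode': ['411001', '400001', '110001']})
pq.write_table(table, 'cities.parquet', compression='snappy')

# the columnar payoff: reading a subset of columns touches only those column chunks
subset = pq.read_table('cities.parquet', columns=['city'])
print(subset.to_pydict())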

Flume
Event processing framework with sources and sinks for Hadoop.

Sqoop
Tool to import/export data from/to databases. Supports text and binary files, and formats like Avro, Parquet, SequenceFiles etc. Can work with RDBMSs, as well as others like HBase and Hive. Sqoop 2 has REST and Java APIs, a web UI etc. Sqoop uses map-reduce tasks to execute the imports/exports in parallel. The check-column option splits the data to be imported into Hadoop based on ranges of values in the specified column. Similar options are available for exporting from Hadoop into a DB, along with a last-update option for incremental updates. It can also generate Java classes from tables to hold row data. It also allows a direct mode, for databases that support it, to import/export rows faster. Supports storing LOBs in separate files.