Monday, February 18, 2019

Often used Linux commands


Editing


# diff
vimdiff gives a side-by-side graphical comparison of two files

# Edit files with windows line endings
Pass the -b option to vi, or, once vi is loaded, type :e ++ff=unix.

# change the typing language, e.g. from English to Hindi
Alt + Caps Lock


Shell

# to make the statements/exports in a.sh available in current process
source a.sh

# search recursively for pattern
grep -r --include "*.jsp" pattern

# Sort files by reverse timestamp
find . -name "*.jsp" -printf '%T@ %t %p\n' | sort -k 1 -n -r

# line count
wc -l


FileSystem

# disk usage: df for free space per file-system, du for usage per directory
df
du

Software management

# to invoke Mint's package manager
sudo mintinstall

# to update the Flash plugin for Firefox
sudo apt-get install flashplugin-installer


Services

# manage startup of services
initctl list/start/stop/restart <service>

#view/manage services
systemctl


# Remove a service from startup. Here apache
sudo update-rc.d apache2 remove

# Add a service apache2 back into the autostart
sudo update-rc.d apache2 defaults


# Enable autostart
sudo update-rc.d apache2 enable

# Disable autostart. Difference from remove is that the entry is kept.
sudo update-rc.d apache2 disable

# Can also be done with systemctl, depending on the version,
# or with the service command :
service apache2 start

Apache start/stop

sudo apachectl start/stop

# apache conf location
/etc/apache2/apache2.conf



Mail

sudo postfix start/stop

# inside the mail client, use d <range> (e.g. d 1-10) or d * to delete mail messages


Processes

# view ports, in this case for amqp service
nmap -p 1-65535 localhost | grep mq

# view processes
ps -ef

# view network connections/ports
netstat


Pdf

# pdftk: remove password from a pdf. May throw an error, as it's outdated
pdftk input_pw <pass> in.pdf output out.pdf

# or use qpdf if pdftk gives error
qpdf --password=<your-password> --decrypt /path/to/secured.pdf out.pdf

Images

# ImageMagick: join images, +append for horizontal, -append for vertical
convert a.jpg b.jpg -append c.jpg

# ImageMagick: pdf to image
convert -density 100 -colorspace rgb test.pdf -scale 200x200 test.jpg

Md5


# check the md5 sum of a file; tr converts the case before comparing
md5sum spark-2.2.1-bin-hadoop2.7.tgz | tr '[:lower:]' '[:upper:]' | grep C0081F6076070F0A6C6A607C71AC7E95


System Settings

Swap

There is a swappiness setting on Ubuntu, which can cause swap to be used even when main memory is available. A lower value discourages this:
cat /proc/sys/vm/swappiness
sudo sysctl vm.swappiness=10
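To make the lower value persist across reboots, the setting is typically added to /etc/sysctl.conf (a minimal sketch; the exact config file can vary by distro):

echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p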

# to clear swap that is already in use, turn it off and then back on
swapoff -av
swapon -av

Remote

# shutdown/restart a remote Windows PC over Remmina
shutdown /s /t 0
shutdown /r /t 0

teamviewer --daemon stop

Tools utilities

ffmpeg

Cutting a section of an mp3 file on Linux with ffmpeg. To copy a 30-second section starting at 15.5 seconds, use:
ffmpeg -t 30 -ss 00:00:15.500 -i inputfile.mp3 -acodec copy outputfile.mp3




Remote connection


Connected to my standby Windows 7 PC from my Linux Mint desktop, using Remmina. Yo! Linux rocks!

http://www.digitalcitizen.life/connecting-windows-remote-desktop-ubuntu


Binary editors


xxd is a useful tool to convert to/from hex format. It can be used to feed binary data to text/newline-based programs like sed/cut. e.g. to remove the first X bytes of each X+Y byte record in binary data (here X and Y are shell variables holding the sizes), print one record per line as plain hex (2 hex chars per byte), cut away the first 2*X characters, and convert back:

xxd -c $((X+Y)) -ps in.bin | cut -c $((2*X+1))- | xxd -r -p > out.bin

Another out-of-the-box tool that can be used for binary data is : bbe - binary block editor

Also a java tool  : https://sourceforge.net/projects/bistreameditor/


REST tools


RESTED is a Postman-like extension for Firefox, to test REST APIs : https://addons.mozilla.org/en-US/firefox/addon/rested/


Browser Styles

This global dark style from userstyles is great! Easier on the eyes. Use with the Stylish plugin for Firefox :
https://userstyles.org/styles/31267/global-dark-style-changes-everything-to-dark
UPDATE : The Stylish plugin is said to snoop on you, collecting data. There are alternative plugins like Stylus.
Another option is to use a plugin that allows you to execute JS on load, and inject your style sheets via that. e.g. the onload scripts extension for chrome : https://chrome.google.com/webstore/detail/onload-scripts/gddicpebdonjdnkhonkkfkibnjpaclok



Java Streams

Streams in Java are somewhat like iterators, in that they provide an element-at-a-time interface.

However, they can also be processed in parallel.

And operations like filter, map, reduce, collect can be applied.

Below is an example of how the code can be really succinct.
We read a text file and group by the first column, producing counts per group in a few lines:

// uses java.nio.file.Files, java.nio.file.Paths, java.util.Map and java.util.stream.Collectors
Map<String, Long> mresult = Files.readAllLines( Paths.get("/home/test/testaccess.log") ).stream()
        .map( line -> line.substring( 0, line.indexOf(" ") ) )
        .collect( Collectors.groupingBy( line -> line, Collectors.counting() ) );
System.out.println( mresult );

Friday, January 11, 2019

Hadoop quick look

Comparison with other systems :

RDBMS : Structured data; better for writes/queries of specific rows; can't scale linearly, as horizontal scaling is not easy, partly due to ACID constraints.

MapReduce : Better for batch processing of entire contents. Can handle unstructured data. Can scale linearly. Can't work iteratively on changed data, i.e. it starts from scratch every time (but Spark can do this).

HDD seek times are not improving as fast as transfer speeds, so reading a large data set via many seeks performs worse than a full sequential scan, much like index lookups versus a full table scan in an RDBMS.

SANs provide block-based network access to storage for servers. Earlier cloud computing used Message Passing Interfaces and SANs to distribute tasks to nodes, but when reading large amounts of data for processing, the SAN becomes a bandwidth bottleneck.

Hadoop shines here, as it co-locates data and processing on nodes.
*** This feature, known as data locality, is at the heart of data processing in Hadoop and is the reason for its good performance.
Also, Hadoop manages the execution of the map-reduce jobs, leaving only the business logic to the programmer.
By contrast, MPI programs have to explicitly manage their own communication, checkpointing and recovery,
which gives more control to the programmer but makes them more difficult to write.

HDFS architecture:

HDFS has a master or name-node and multiple slave or data-nodes. The data-nodes store the actual data as blocks.
The namenode holds the hdfs file-system tree and the metadata for the files and dirs on it. This includes a list of data-nodes for each entry.
Usually, the namenode writes are also copied to another NFS-mounted location as backup. Also, a secondary name-node helps in merging
edit log entries into the main entries for the name node. Since the file-system entries can be too huge for a single namenode to handle,
a Federation facility provides multiple name-nodes, each managing a portion of the name-space, e.g. /user. A High Availability configuration
is also available, which allows a pair of name-nodes in active/standby mode.
There are command line tools as well as APIs to access hdfs.

YARN architecture

YARN consists of a single cluster-level process called the resource manager, which manages the use of cluster resources,
and node managers running on individual nodes, which launch and monitor containers on the node.
A container executes an application-specific process with a constrained set of resources (memory, cpu etc).
A client contacts the resource manager, sending it the binaries to run an application master process.
Hadoop has its own application master to run map-reduce jobs, Spark has its own to run DAGs, etc.
YARN does not itself provide any means of communication between client, master and process; that is done by the application.
YARN can use different types of schedulers : FIFO, Capacity, which has buckets per job type, and Fair, which allocates resources
evenly between jobs.

Map Reduce Jobs

The Map step processes input data across the cluster. The Reduce step collects the map output to create the results.

The Map step, implemented by a Mapper, creates data with key-value fields. The input data to be processed is divided into (ideally) equal parts called splits, and that many instances of mappers are instantiated to process the splits, usually on different nodes. The flow is InputFormat -> RecordReader -> deserialize into key,value pairs. TextInputFormat is the default input format; it provides the record offset as the key and the record data as a Text value. It's also possible to use a combiner to further group the mapping output. The output key-values of the mapper are the intermediate results; these are stored on the cluster and become the input for the Reducer. A shuffle stage then sorts the intermediate data by key and, depending on the size and partitioning, sends it off to one or multiple reducers.

Each reducer is guaranteed to get sorted data for one or more keys; the partitioner controls this. There is also a grouping control, to decide which values are sent to one invocation of reduce(). The reducer then produces the output specific to that key, producing a (potentially new) set of key-value pairs. The default way for map and reduce tasks to create output is context.write( key, value ). Both map and reduce methods get the input params ( key, value, context ); in the case of reduce, it is multiple values against a key. They can also write directly to the file-system for needs that do not exactly fit the key-value paradigm. On the output side, the key and value are serialized using OutputFormat -> RecordWriter.

It's possible to have a job with only a Map task and no reducers; this defaults to the IdentityReducer, and the output of the map stage then becomes the final output. It's even possible in this case to specify the number of reducers, and that many output files are created; shuffle/sort will happen in this case. However, if we specify the number of reducers as 0, no shuffle/sort will take place. It's also possible to provide a custom shuffle/sort implementation from Hadoop 2.9.2 onwards. This can be useful when, e.g., you don't need a sort, or need a different type of sort.

Can reducers start running when some but not all maps have run? No, because all values for a key are guaranteed to go to a single reducer, and this can't be known until all maps finish generating their key-value pairs.

There is an OutputCommitter API to handle custom pre/post actions on tasks.
There are inbuilt counters to track the job execution, tasks, file-system, and input-output formats.
It's possible to create user-defined counters as well.

There is a distributed cache, where frequently used files and jars can be stored.

The Hadoop Streaming API allows us to use any executable or script, e.g. Python or shell, to implement map-reduce jobs.
Data is passed from/to our map and reduce code via STDIN (standard input) and STDOUT (standard output).
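As a minimal sketch (the streaming jar path and the HDFS paths below are assumptions that vary by installation), this counts records per first column with plain shell tools, similar to the Java streams example above. The mapper emits the first field as the key; streaming sorts by key before the reduce step, so uniq -c in the reducer produces a count per key:

cat > mapper.sh <<'EOF'
#!/bin/bash
# emit the first space-separated field of each input line as the key
cut -d' ' -f1
EOF
cat > reducer.sh <<'EOF'
#!/bin/bash
# input arrives sorted by key, so counting consecutive duplicates gives per-key totals
uniq -c
EOF

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -files mapper.sh,reducer.sh \
  -input /data/testaccess.log -output /data/access_counts \
  -mapper "bash mapper.sh" -reducer "bash reducer.sh"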

Questions
Related tools :

Avro

Language-neutral data-serialization system. Described using a language-independent schema, usually json. The spec describes the binary format all implementations must support. Similar to SequenceFile, but portable.
A data file contains a header with the schema and metadata. Sync markers make the file splittable. Different schema versions can be used for reading and writing a file, making the schema easy to evolve. This can also be useful to read a few fields out of a large number of fields. Record classes can be generated from an avro schema, or the GenericRecord type can be used.
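A quick way to peek into Avro data files is the avro-tools jar (the jar name/version below is an assumption):

# dump records as json, and print the embedded schema
java -jar avro-tools-1.8.2.jar tojson data.avro
java -jar avro-tools-1.8.2.jar getschema data.avro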

Parquet

Columnar storage format that can efficiently store nested data, i.e. objects within objects. Can reduce file size by compressing the data of a column better. Can improve query performance when only a small subset of columns is read, since data is stored in columnar fashion. Supported by a large number of tools. Uses a schema with a small set of pre-defined types. A parquet file consists of a header with a magic number and a footer which has the meta-data along with block boundaries; hence it is splittable. Organized as blocks -> row-groups -> column-chunks -> pages. Each column-chunk has data only for a single column. Compression is achieved by using encodings like delta, run-length, dictionary etc. To write an object's data, we need a schema, a writer and the object; for reading, it's a reader. Classes like AvroParquetReader/Writer are available to interoperate between Avro and Parquet.
Parquet-tools are available to work with these files, e.g. dump contents.
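For example (assuming the parquet-tools command is on the path; it can also be run as a jar):

# show the schema, the first few records, and the file/row-group metadata
parquet-tools schema data.parquet
parquet-tools head -n 5 data.parquet
parquet-tools meta data.parquet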

Flume
Event processing framework with sources and sinks for Hadoop.

Sqoop
Tool to import/export data from/to databases. Supports text and binary files, and formats like Avro, Parquet, SequenceFiles etc. Can work with RDBMSs, as well as others like HBase and Hive. Sqoop 2 adds REST and Java APIs, a Web UI etc. Sqoop uses map-reduce tasks to execute the imports/exports in parallel. Options like split-by and check-column control how the data to be imported is divided among parallel tasks and which rows are picked up incrementally. Similar options are available for exporting from hadoop into a DB. It can also generate java classes from tables to hold row data. It also allows a direct mode for databases that support it, to import/export rows faster. Supports storing LOBs in separate files.
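A sketch of an incremental import (the JDBC URL, table and column names are made up):

sqoop import \
  --connect jdbc:mysql://dbhost/sales --username report -P \
  --table orders --target-dir /data/orders \
  --incremental lastmodified --check-column updated_at --last-value '2019-01-01 00:00:00'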

Wednesday, January 31, 2018

Getting started with AWS Free Tier

Introduction

Please note that I am a beginner with AWS; this blog reflects my current understanding and may not be totally accurate.

Knowledge of AWS has become sort of mandatory these days, so I created an account and will get to try out some facilities free for a year (within usage limits), whereas others (the "Non-expiring Offers") are free for life (again within limits). See https://aws.amazon.com/free/ for details.



For example, EC2 free-tier gives us 750 hours per month on a t2.micro instance. The hours are sufficient for one instance to run continuously per month. If we want to run multiple instances, the hours will have to be divided between them.
So it's good practice to stop your instances after you have finished your practice session with them, to save on hours.

Billing

Billing with AWS has quite complex clauses, e.g. transfer of data, transfer outside a region, transfer over a public IP, non-usage of elastic IPs, number of IO reads, etc. Also, the policy is never to stop execution when you exceed the limit, but to charge you. Thus, it's better to subscribe to billing alerts at https://console.aws.amazon.com/billing/home?region=us-east-1#/preference
( region=us-east-1 will change as per your settings ) and get notified in time.

Programmatic access with APIs

In order to access your account programmatically using the Amazon APIs, or tools that use the API, like Boto, Fog, Terraform, etc., you will need to get the access keys from the Security credentials page : https://console.aws.amazon.com/iam/home?region=us-east-1#/security_credential ( region=us-east-1 will change as per your settings ). These keys are for the amazon account, and not per instance or service. It's possible to install the aws tools on your local machine, and use them to work with your aws services, e.g. aws s3/s3api can be used from your local machine to download files from s3. The s3api offers some extra options, like ranges to download part of a file.
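For example (the bucket and key names are made up), the range option fetches just the first megabyte of an object:

aws s3api get-object --bucket my-bucket --key logs/big.log --range bytes=0-1048575 part.log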

About Terraform: it's a tool to initialize/destroy instances, but not for installing/updating/running software. Other tools like Ansible, Chef or Puppet should be installed via the Terraform initialization, and later used as needed. Also, Terraform saves and reads state from files, and may not be as suitable as Boto/Fog for running on-the-fly configurations without using a file.
When we start/stop/terminate instances, we do not communicate with the instance itself, but rather with its region-level handler. This should be clear for start/create, since the instance does not exist yet. Each instance has an instance-id, which can be used to stop/terminate it. We do not need/use the DNS name or IP address.
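With the aws CLI this looks something like the following (the instance-id is a placeholder):

aws ec2 stop-instances --instance-ids i-0abcd1234ef567890
aws ec2 start-instances --instance-ids i-0abcd1234ef567890
aws ec2 terminate-instances --instance-ids i-0abcd1234ef567890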

Storage services

All of these are free only for the trial period. Charges usually apply on the amount of data stored, as well as transferred.

S3( Simple Storage Service)

Grow/shrink as needed storage. Not for heavy writes.
Not a file-system with files, inodes, permissions etc. Accessible using its API.
S3 stores objects (files) up to 5 TB; each can have 2 KB of metadata. Each object has a key, and is stored in a bucket, which itself has a name/id. So it's rather like a nested hashmap. Buckets and objects can be created, listed, and retrieved using either REST or SOAP. Objects can be indexed and queried using the metadata/tags. They can be queried with SQL using the Athena service, and downloaded using HTTP or BitTorrent. Bucket names and keys are chosen so that objects are addressable using HTTP URLs:
  • http://s3.amazonaws.com/bucket/key
  • http://bucket.s3.amazonaws.com/key
  • http://bucket/key (where bucket is a DNS CNAME record pointing to bucket.s3.amazonaws.com)
Because objects/files are accessible via HTTP, S3 can be used to host static websites. Some dynamic scripting could be provided by Lambda.
S3 can be used as a file-system for Hadoop.
Amazon Machine Images (AMIs) which are used in the Elastic Compute Cloud (EC2) can be exported to S3 as bundles.
https://aws.amazon.com/blogs/aws/amazon-athena-interactive-sql-queries-for-data-in-amazon-s3/
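A few common bucket/object operations with the aws CLI (bucket and file names are made up):

aws s3 mb s3://my-example-bucket
aws s3 cp report.pdf s3://my-example-bucket/docs/report.pdf
aws s3 ls s3://my-example-bucket/docs/
aws s3 presign s3://my-example-bucket/docs/report.pdf   # temporary shareable HTTP URL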

I just discovered that creating tags on S3 objects incurs costs, though quite small! AWS billing is really tricky. Fortunately, due to the billing alarm set up, I got notified in time.

EBS( Elastic Block Storage )

Fixed amount of block storage for high throughput (fast reads/writes), e.g. can store DB files. Multiple such volumes can be allocated. A volume needs to be attached to an EC2 instance and formatted with a file-system before use. Attached and accessible only to a single EC2 instance at a time.

EFS( Elastic File System )

Grow/shrink as needed, managed file-system, can be shared among multiple EC2 instances. Not HTTP accessible, no meta-data querying like S3.

Glacier

Cheap, long term, read-only archival storage

Serverless services( Lambda) :

Serverless means one does not have to set up servers or load-balancers. We just write a function that does the required processing, e.g. store price updates into a DB. The servers and scaling are all handled by AWS. The charging is only for the usage, not for the uptime. So if the functionality is not called frequently, one could use Lambda instead of an always-on EC2 instance, and be billed less. The function can be called from various sources, like S3 object modifications, logging events, or an API Gateway interface that accepts HTTP calls. Using these sources may incur separate charges.

Working with EC2(Elastic Computing) instances

It's quite easy to launch an EC2 instance using the management console. The options that are eligible for the free tier are marked as such, e.g. "t2.micro" instances. Sometimes, options that may incur additional charges are marked as such too.

Regions

The region is important since the console usually lists instances/services for the current region. Also, communication between different regions may incur additional charges. The APIs too usually query by region. So for testing, it's better to keep everything under a single region. In real life, distributing your application across different regions will provide better fail-safety.

Tags

It's possible to add tags when configuring an instance, e.g. type=DB. These tags can be used by the API, e.g. to filter out only the DB servers and work on them, as in the sketch below.
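A sketch with the aws CLI (the tag name and value are the hypothetical ones above):

aws ec2 describe-instances \
  --filters "Name=tag:type,Values=DB" \
  --query "Reservations[].Instances[].InstanceId" --output text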

User-data

Specify a set of commands to run when the instance starts, e.g. set env variables, start a DB daemon, etc. If using config/script files, where will the files come from? Probably from an S3 or EFS storage that will have to be set up first. This option is available under "Configure Instance Data->Advanced Details". The user-data runs only the first time the instance is started; if we want it to run on every restart, use the "#cloud-boothook" directive at the start of the script.
Here is an example of setting env-variables, and copying and running a file from S3:
--------------
#cloud-boothook
#!/bin/bash
echo export MYTYPE=APPSRV > ~/myconfiguration.sh
chmod +x ~/myconfiguration.sh
sudo cp ~/myconfiguration.sh /etc/profile.d/myconfiguration.sh

aws s3api get-object --bucket <bucketname> --key test.sh /home/ec2-user/test.sh
sh test.sh
----------------

The EC2 parameter store seems to be another way to store parameters, with better security

Addresses

When an instance is created, it is assigned public and private I.Ps, as well as a public domain name. The domain name is of the form
ec2-<public I.P>.compute-1.amazonaws.com.
If we use the domain-name from outside AWS, e.g. from our local machine, it will resolve to the public I.P of our instance. If used from within AWS, i.e. from an EC2 instance, it will resolve to the private I.P of our instance. The domain-name usually contains the public I.P address, and changes if the public I.P changes. The public I.P is not constant, as it's allocated from a pool. A reboot does not change the I.P.s. However, a stop and start will change the public I.P, though not the private I.P. One solution for a fixed public I.P is to use elastic I.P.s. However, they can incur charges in certain cases, e.g. if not used. A terminate and create-new will of course change the I.P.s.
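From inside a running instance, the instance metadata service can be queried to check these addresses (169.254.169.254 is the standard EC2 metadata endpoint):

curl http://169.254.169.254/latest/meta-data/public-ipv4
curl http://169.254.169.254/latest/meta-data/local-ipv4
curl http://169.254.169.254/latest/meta-data/public-hostname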

For better security, it should also be possible to have only a private I.P for some EC2 instances, and access them via SSH from the EC2 instances that have a public I.P. This is probably the "Auto-assign Public IP" option, which is enabled by default.

Key-Pairs

We usually work on an instance using SSH, with public/private keys. (These are different from the API access keys for the account.) These key-pairs can be generated from the console and associated with an instance. (Advanced: they can also be generated on your local machine, and the public key copied to the proper directory on your instance.) This has to be done when creating an instance. If using the launch wizard, you will be prompted to create a key-pair or use an existing one. A key-pair can be shared among multiple EC2 instances. Make sure to use a name that keeps the file unique on your local file system, e.g. include aws, the account etc. in the name.
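Connecting then looks something like this (the key file and host are placeholders; the login user depends on the AMI, e.g. ec2-user on Amazon Linux, ubuntu on Ubuntu):

chmod 400 ~/.ssh/my-aws-key.pem
ssh -i ~/.ssh/my-aws-key.pem ec2-user@ec2-54-12-34-56.compute-1.amazonaws.com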

Security Groups

Each EC2 instance is associated with a security group, which is like a firewall. It controls what protocols and ports are available for inbound and outbound calls.
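Rules can also be added from the CLI, e.g. to open SSH from a single address (the group-id and CIDR are placeholders):

aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 22 --cidr 203.0.113.10/32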

IAM Roles

For internal use within AWS services, e.g. accessing S3 from EC2 requires the account secret keys for auth. Instead, one can create an IAM Role with S3 permissions and grant it to the EC2 instance. Not very flexible though: it can be specified only when launching the instance, and a combination of roles cannot be granted.

Storage

The instance launch wizard will by default create an 8 GB EBS root volume for the instance. In addition there is an option to attach more volumes. For the free tier, only EBS seems to be supported. There is a "Delete on termination" option which, if checked, will delete the EBS volume after the instance is terminated. Stopping an instance won't affect the EBS volume though, and I checked that some files that I had added to the volume were intact after a Stop and Start.

Monday, October 3, 2016

A collection and player for Indian Classical Music Bandish

Some years ago, around 2010, I had added an "Indian Classical Music" section to my site http://milunsagle.in
I used ASCII letters and symbols for the notation, e.g. SrRgGmMPdDnN.
A full notation help is available here : http://milunsagle.in/webroot/pages/bnd_notn_help
The Java MIDI API was used to create MIDI files for the bandish (compositions) and play them.
However, with the restrictions on applets and Java Web Start in browsers for security reasons, supporting the Java player in browsers became more and more difficult.
Now, unlike reading/writing local files or executing a program, outputting music is not usually a security concern.
The relatively recent Web Audio API addressed this issue and made playing audio through JavaScript possible. There are many frameworks, like Tone.js, which allow us to play tones through JavaScript. I have moved the bandish-player on my site for Indian classical music from applets to Tone.js. Currently meend (glide) and andolan, which were covered in the Java version, are not yet available.
The "Play" link will invoke the player. Here is a link to a bandish :
http://milunsagle.in/webroot/bandishes/view/28

Monday, July 4, 2016

Converting a pdf to csv using linux shell script

A Linux script to extract data from a pdf and create a csv. The regular expressions for sed are rather different from the Perl-like ones I am used to in Java: \d is not allowed, + needs to be escaped, etc.

Below, we iterate through the pdfs, use pdftk to get the uncompressed version that has text, use strings to extract the string data, use tr to remove newlines, apply sed to extract the particular fields that we want, assign those to variables, and echo the variables to a csv file.

rm pdf.csv
for FILE in *.pdf
do
  echo $FILE
  pdftk "$FILE" output - uncompress | strings | grep ")Tj" | tr '\n' ' ' | sed -e 's/)Tj /) /g'  > temptocsv.txt
  AMOUNT=`sed -e 's/.*(Rs \:) \([0-9]\+\).*/\1/' temptocsv.txt`
  CHLDATE=`sed -e 's/.*(Date of) (challan :) (\([^)]\+\)).*/\1/' temptocsv.txt`
  SBIREFNO=`sed -e 's/.*(SBI Ref No. : ) (\([^)]\+\)).*/\1/' temptocsv.txt`
  CHLNO=`sed -e 's/.*(Challan) (No) (CIN) \(.*\) (Date of).*/\1/' temptocsv.txt`
  echo $FILE,$CHLDATE,$SBIREFNO,$CHLNO,$AMOUNT >> pdf.csv
done