linux script to extract data from pdf and create a csv. The regular expressions for sed are rather different from the Perl like ones i am used to in java. So \d is not allowed, + needs to be escaped, etc.
Below, we iterate thru pdfs, use pdftk to get the uncompressed version that has text, use strings to extract string data, use tr to remove newlines, apply sed on it to extract particular fields that we want, assign those to variables, and echo the variables to a csv file.
Below, we iterate thru pdfs, use pdftk to get the uncompressed version that has text, use strings to extract string data, use tr to remove newlines, apply sed on it to extract particular fields that we want, assign those to variables, and echo the variables to a csv file.
rm pdf.csv
for FILE in *.pdf
do
echo $FILE
pdftk "$FILE" output - uncompress | strings | grep ")Tj" | tr '\n' ' ' | sed -e 's/)Tj /) /g' > temptocsv.txt
AMOUNT=`sed -e 's/.*(Rs \:) \([0-9]\+\).*/\1/' temptocsv.txt`
CHLDATE=`sed -e 's/.*(Date of) (challan :) (\([^)]\+\)).*/\1/' temptocsv.txt`
SBIREFNO=`sed -e 's/.*(SBI Ref No. : ) (\([^)]\+\)).*/\1/' temptocsv.txt`
CHLNO=`sed -e 's/.*(Challan) (No) (CIN) \(.*\) (Date of).*/\1/' temptocsv.txt`
echo $FILE,$CHLDATE,$SBIREFNO,$CHLNO,$AMOUNT >> pdf.csv
done
No comments:
Post a Comment