I have written one of my first bash scripts. My goal is to make my office "paper-free". I have a lot of scanned documents that I want to save with the date (usually found at the top of each document) as a filename prefix. This is what the script should do:
- run OCR on the PDF
- find a date within the first 100 lines. The date is in German format and follows one of these patterns (listed in priority order): a) 01.02.2020 b) 01. Februar 2020 c) 01. Feb. 2020
- if a date is found, convert it to a string in the format 2020-02-01- and rename the file to the generated date pattern 2020-02-01-file-##.pdf (otherwise keep the original filename)
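The priority order in the list above could be sketched roughly as follows. This is only a minimal sketch and separate from the script below: `first_date` and `PATTERNS` are hypothetical names, plain `grep -E` on text from stdin stands in for `pdfgrep`, and September is added to the month lists as an assumption. Trying one pattern after the other means format a) wins even when a b) or c) date appears earlier in the text, which a single alternation regex would not guarantee.

```shell
#!/bin/bash
# Hypothetical sketch: pick the first date from stdin, honouring the priority a) > b) > c).
DAY='(0[1-9]|[12][0-9]|3[01])'
YEAR='2[0-9]{3}'
# One extended regex per format, in priority order.
# Assumption: September/Sep is included although the original lists omit it.
PATTERNS=(
  "${DAY}\.(0[1-9]|1[0-2])\.${YEAR}"
  "${DAY}\. (Januar|Februar|März|April|Mai|Juni|Juli|August|September|Oktober|November|Dezember) ${YEAR}"
  "${DAY}\. (Jan|Feb|Mär|Apr|Mai|Jun|Jul|Aug|Sep|Okt|Nov|Dez)\. ${YEAR}"
)

first_date() {
  local text pattern hit
  text=$(head -n 100)          # only look at the first 100 lines of stdin
  for pattern in "${PATTERNS[@]}"; do
    # -o prints only the matching part, -m 1 stops after the first matching line
    hit=$(grep -E -o -m 1 -- "$pattern" <<< "$text" | head -n 1)
    if [[ -n $hit ]]; then
      printf '%s\n' "$hit"
      return 0
    fi
  done
  return 1                     # no date in any of the three formats
}
```

It could be used as `datum=$(first_date < tmp.txt)`, assuming the OCR text has already been extracted to `tmp.txt`.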
This is my bash script so far. It works, but not quite as intended. My tests so far showed that it doesn't find dates in formats b) or c).
#!/bin/bash
shopt -s extglob
$datum
$twistdatum
$counter
FILES="$(find -name "*.pdf")"
for f in $FILES
do
  ocrmypdf $f $f -l deu --rotate-pages --clean --rotate-pages-threshold 5
  less $f | head -100 > "tmp.txt" # read the first 100 lines and save them to a temporary text file
  libreoffice --convert-to "pdf" "tmp.txt" # convert the temporary text file to pdf so that it can be processed with pdfgrep
  # pdfgrep to get the 3 listed types of dates by using 3 regular expressions
  datum="$(pdfgrep -o -m 1 --regexp="((0[1-9]|[12][0-9]|3[01])\.(0[1-9]|1[0-2])\.([2][0-9]{3}))|((0[1-9]|[12][0-9]|3[01])\. (Januar|Februar|März|April|Mai|Juni|Juli|August|Oktober|November|Dezember) ([2][0-9]{3}))|((0[1-9]|[12][0-9]|3[01])\. (Jan|Feb|Mär|Apr|Mai|Jun|Jul|Aug|Okt|Nov|Dez)\. ([2][0-9]{3}))" tmp.pdf)"
  case "$datum" in
    # the three cases a) b) and c) for the different conversions are listed here
    +([0][1-9]|[12][0-9]|[3][01]).+([0][1-9]|[1][0-2]).[2][0][0-4][0-9]) # this is case a); it works
      twistdatum="${datum:${#datum}-4:4}-${datum:${#datum}-7:2}-${datum:0:2}-filename.pdf"
      mv $f $twistdatum;;
    +([0][1-9]|[12][0-9]|[3][01])@(.)@( )+(Januar|Februar|M\u00e4rz|April|Mai|Juni|Juli|August|Oktober|November|Dezember)@( )[2][0][0-4][0-9]) # this is case b) which doesn't work
      firstspace="$(expr index "$datum" " ")"
      case "$datum" in
        # this is for the conversion of the German words to English
        Januar) datum="${datum/"Januar"/"January"}";;
        Februar) datum="${datum/"Februar"/"February"}";;
        # the other translations of the German months would be listed here
      esac
      langdatum="${datum:0:2} ${datum:$firstspace:3} ${datum:${#datum}-4:4}"
      twistdatum="$(date -d "$langdatum" +"%F")-filename.pdf"
      mv $f $twistdatum;;
    +([0][1-9]|[12][0-9]|[3][01])@(.)@( )+(Jan|Feb|M\u00e4r|Apr|Mai|Jun|Jul|Aug|Okt|Nov|Dez)@(.)@( )[2][0][0-4][0-9]) # this is case c) which doesn't work
      firstspace="$(expr index "$datum" " ")"
      case "$datum" in
        # this is for the conversion of the abbreviations of the German words to English
        Mär) datum="${datum/"Mär"/"Mar"}";;
        Mai) datum="${datum/"Mai"/"May"}";;
        # the other translations of the German months would be listed here
      esac
      langdatum="${datum:0:2} ${datum:$firstspace:3} ${datum:${#datum}-4:4}"
      twistdatum="$(date -d "$langdatum" +"%F")-filename.pdf"
      mv $f $twistdatum;;
  esac
done
I think the reason could be that my pattern matching within the case blocks is not quite right. I have to admit that I haven't fully understood pattern matching in bash. Regular expressions are more intuitive to me. :P Any help and code optimization is very much appreciated.
Thank you guys!
2 Answers
Writing a script with just the regexp and case patterns:
shopt -s extglob
f(){
  echo "$1" |
  egrep -o -m 1 "((0[1-9]|[12][0-9]|3[01])\.(0[1-9]|1[0-2])\.([2][0-9]{3}))|((0[1-9]|[12][0-9]|3[01])\. (Januar|Februar|März|April|Mai|Juni|Juli|August|Oktober|November|Dezember) ([2][0-9]{3}))|((0[1-9]|[12][0-9]|3[01])\. (Jan|Feb|Mär|Apr|Mai|Jun|Jul|Aug|Okt|Nov|Dez)\. ([2][0-9]{3}))"
  case "$1" in
  +([0][1-9]|[12][0-9]|[3][01]).+([0][1-9]|[1][0-2]).[2][0][0-4][0-9]) echo a;;
  +([0][1-9]|[12][0-9]|[3][01])@(.)@( )+(Januar|Februar|M\u00e4rz|April|Mai|Juni|Juli|August|Oktober|November|Dezember)@( )[2][0][0-4][0-9]) echo b;;
  +([0][1-9]|[12][0-9]|[3][01])@(.)@( )+(Jan|Feb|M\u00e4r|Apr|Mai|Jun|Jul|Aug|Okt|Nov|Dez)@(.)@( )[2][0][0-4][0-9]) echo c;;
  *) echo fail;;
  esac
}
shows that they work correctly in the following tests that match both the egrep regexp, and case a, b or c:
f 01.02.2020
f '01. Februar 2020'
f '01. Feb. 2020'
However, these match the egrep regexp but none of the case patterns:
f '01. März 2020'
f '01. Mär. 2020'
You should find that the unicode sequence works if it is enclosed in $'...', e.g. M$'\u00e4'rz.
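To make the difference concrete, here is a small sketch with two hypothetical helper functions (the names are not from the question). In an unquoted case pattern the backslash only escapes the next character, so `M\u00e4rz` is the literal text `Mu00e4rz`. Inside `$'...'` bash expands `\u00e4` to the character ä, which assumes a UTF-8 locale.

```shell
#!/bin/bash
# Pattern written as in the question: \u is just an escaped "u",
# so this only matches the literal string "Mu00e4rz".
match_plain_escape() {
  case "$1" in
    M\u00e4rz) echo match ;;
    *)         echo no-match ;;
  esac
}

# Pattern using ANSI-C quoting: $'\u00e4' expands to the character ä
# (assumption: the shell runs in a UTF-8 locale).
match_ansi_c() {
  case "$1" in
    M$'\u00e4'rz) echo match ;;
    *)            echo no-match ;;
  esac
}
```

In a UTF-8 locale, `match_ansi_c 'März'` should report a match, while `match_plain_escape 'März'` does not.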
Of course, since you have already matched with a regexp, you know the datum can only have one of 3 forms, so you are just duplicating the effort by providing such detailed case patterns. You might as well just use:
case "$1" in
*.??.*) echo A ;;
*.*.*) echo C ;;
*) echo B ;;
esac
@meuh: Thank you so much for your detailed answer. It helped a lot. And the simplicity of your case statement is just brilliant. I just realized that my sub-case statements were incorrect.
case "$datum" in
Januar) datum="${datum/"Januar"/"January"}";;
should of course be:
case "$datum" in
*Januar*) datum="${datum/"Januar"/"January"}";;
The asterisks were simply missing: without them the pattern only matches when the whole string is "Januar". Anyway, I couldn't have found this mistake without your help. The code now works perfectly. Thanks. This "case" is closed now. :)
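As a possible alternative to translating the month names for `date -d`, the conversion could also be done with a lookup table. The following is only a sketch under some assumptions: it needs bash 4 or later for associative arrays, `german_to_iso` and `MONTHS` are hypothetical names, September is included although the question's lists omit it, and the input is expected to already be one of the three matched forms.

```shell
#!/bin/bash
# Hypothetical lookup table: German month names and abbreviations -> month number.
declare -A MONTHS=(
  [Januar]=01 [Jan]=01 [Februar]=02 [Feb]=02 [März]=03 [Mär]=03
  [April]=04 [Apr]=04 [Mai]=05 [Juni]=06 [Jun]=06 [Juli]=07 [Jul]=07
  [August]=08 [Aug]=08 [September]=09 [Sep]=09 [Oktober]=10 [Okt]=10
  [November]=11 [Nov]=11 [Dezember]=12 [Dez]=12
)

# Convert "01.02.2020", "01. Februar 2020" or "01. Feb. 2020" to "2020-02-01".
german_to_iso() {
  local datum=$1 day month year
  case "$datum" in
    *.??.*)                                   # form a): split on the dots
      IFS=. read -r day month year <<< "$datum" ;;
    *)                                        # form b) or c): split on spaces
      read -r day month year <<< "$datum"
      day=${day%.}                            # "01."  -> "01"
      month=${month%.}                        # "Feb." -> "Feb"
      [[ -n $month ]] || return 1
      month=${MONTHS[$month]:-} ;;            # name -> number, empty if unknown
  esac
  [[ -n $month ]] || return 1
  printf '%s-%s-%s\n' "$year" "$month" "$day"
}
```

With this, the rename step could look like `mv "$f" "$(german_to_iso "$datum")-filename.pdf"`, with the variables quoted so that file names containing spaces survive.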