I'm currently refactoring a script that works well if executed in terminal directly but exits early due to a process check if executed from crontab. This early termination is caused by code using a ps command piped to several grep/grep -v commands. The intent is to check if a process is already running and, if so, not to execute this script again. I know the reason this code isn't working is because it's trying to catch all processes but doesn't grep -v out the /bin/sh -c <script name> process that crontab always uses to call scripts initially. In refactoring this code it just made sense to use something like pgrep over a ps piped to several greps.
Here's where my question comes in. My pgrep code works; I just don't fully understand why. When comparing the output of pgrep pgrep_test.sh | grep -v $$ to ps -ef | grep pgrep_test.sh, there are additional processes that the pgrep command seems to remove. It seems to me that pgrep is grouping several PIDs together, as if it understands and follows the PID/PPID relationship. The problem is I don't see anything about that written in the pgrep manpage.
I think to understand why pgrep is working in my code I need to better understand how pgrep groups PIDs/PPIDs. Here's the code I'm using to test this, it's executing via crontab:
* * * * * user1 /tmp/inferencing/pgrep_test.sh 2>&1 >> /tmp/inferencing/test.log
The code itself:
#!/bin/bash
# Test how grepping for PIDs works when script is called from crontab
echo "+++++$(date +"%b %H:%M:%S") Beginning pgrep script+++++"
pgrep pgrep_test.sh | grep -v "$$" > /dev/null 2>&1
# RC=1 -- No additional processes running
# RC=0 -- Additional processes running
RETCODE=$?
echo "Return code: $RETCODE"
if [ $RETCODE -eq 0 ]; then
    echo "Additional test processes exist, exiting script"
    echo "$(ps -ef | grep pgrep_test.sh)" ; sleep 1
    echo "$(pgrep -a pgrep_test.sh | grep -v \"$$\")"
    exit 1
else
    echo "No additional processes found, continuing execution"
    echo "$(ps -ef | grep pgrep_test.sh)" ; sleep 1
    echo "$(pgrep -a pgrep_test.sh | grep -v \"$$\")"
    sleep 90
fi
I've used a 90 second sleep in the code to ensure that a cronjob running every minute will fail out every other time. Here's what the logfile looks like, with some additional annotations in the form of comments.
First with no additional processes running:
"+++++Nov 17:04:01 Beginning pgrep script+++++"
# Sleeping every 90s means we should have alternating "no additional
# processes found" and "additional processes found" logs each execution
Return code: 1
No additional processes found, continuing execution
# ps -ef | grep pgrep_test.sh
# Initial /bin/sh -c call crontab executes
user1 12956 12954 0 17:04 ? 00:00:00 /bin/sh -c /tmp/inferencing/pgrep_test.sh 2>&1 >> /tmp/inferencing/test.log
# Child process spawned from 12956 (Shouldn't this be PID for $$?)
user1 12957 12956 0 17:04 ? 00:00:00 /bin/bash /tmp/inferencing/pgrep_test.sh
# What even is PID 12961? PPID 12957 is the /bin/bash call but these two commands are identical otherwise
# Can't be the pgrep or sleep as these have not executed yet
# Technically there's more than 1 process here now so pgrep should be giving a return code of 0 and exiting the script
# Why does it work correctly here? How does pgrep know to group these?
user1 12961 12957 0 17:04 ? 00:00:00 /bin/bash /tmp/inferencing/pgrep_test.sh
user1 12963 12961 0 17:04 ? 00:00:00 grep pgrep_test.sh
# pgrep -a pgrep_test.sh | grep -v $$
12957 /bin/bash /tmp/inferencing/pgrep_test.sh
12965 /bin/bash /tmp/inferencing/pgrep_test.sh
Next with a matching process already running:
"+++++Nov 17:05:01 Beginning pgrep script+++++"
# Since other process is still sleeping, we correctly get a return code of 0 and stop script execution
Return code: 0
Additional test processes exist, exiting script
# crontab process for (now sleeping) original script call
user1 12956 12954 0 17:04 ? 00:00:00 /bin/sh -c /tmp/inferencing/pgrep_test.sh 2>&1 >> /tmp/inferencing/test.log
# Sleeping process
user1 12957 12956 0 17:04 ? 00:00:00 /bin/bash /tmp/inferencing/pgrep_test.sh
# New crontab process
user1 13733 13594 0 17:05 ? 00:00:00 /bin/sh -c /tmp/inferencing/pgrep_test.sh 2>&1 >> /tmp/inferencing/test.log
# New main bash process
user1 13734 13733 0 17:05 ? 00:00:00 /bin/bash /tmp/inferencing/pgrep_test.sh
# Second main bash process again -- this happens every time
user1 13738 13734 0 17:05 ? 00:00:00 /bin/bash /tmp/inferencing/pgrep_test.sh
# No grep -v grep, this doesn't show up to pgrep anyways
user1 13740 13738 0 17:05 ? 00:00:00 grep pgrep_test.sh
# pgrep -a pgrep_test.sh | grep -v $$
12957 /bin/bash /tmp/inferencing/pgrep_test.sh
13734 /bin/bash /tmp/inferencing/pgrep_test.sh
14105 /bin/bash /tmp/inferencing/pgrep_test.sh
Why is it that when I execute the pgrep command, it seems to know how to filter out the additional child processes associated with $$ when ps -ef piped to greps is not capable?
1 Answer
This early termination is caused by code using a ps command piped to several grep/grep -v commands. The intent is to check if a process is already running and, if so, not to execute this script again.
Don't do that. It's a horrible method, more so if you don't even bother to check whether your 'grep' matches the input exactly and not just a substring. For example, if you leave vim pgrep_test.sh open, your script will think it's already running.
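To illustrate the substring problem, here is a minimal sketch that runs the same kind of grep over a made-up, two-line stand-in for ps output. The PIDs and the vim line are invented for the example, not taken from your logs, and the variable names are arbitrary.

```shell
#!/bin/bash
# Hypothetical stand-in for `ps -ef` output; the PIDs and the vim line are invented.
sample_ps='user1 100 1 0 17:04 ? 00:00:00 /bin/bash /tmp/inferencing/pgrep_test.sh
user1 200 150 0 17:04 pts/0 00:00:00 vim pgrep_test.sh'

# A plain substring match counts the editor session as a running instance.
loose_count=$(printf '%s\n' "$sample_ps" | grep -c 'pgrep_test\.sh')

# Anchoring the pattern to the interpreter and full path excludes the editor.
strict_count=$(printf '%s\n' "$sample_ps" | grep -c '/bin/bash /tmp/inferencing/pgrep_test\.sh$')

echo "loose=$loose_count strict=$strict_count"
```

The loose pattern counts both lines while the anchored one counts only the real script. Even so, an anchored grep over ps output is still a race-prone check, which is why the alternatives that follow are preferable.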
There are better ways to make a single-instance script:
run the script as a systemd .service (either by making your cronjob call 'systemctl start' or by using a systemd .timer to invoke it), as the same service cannot be started twice;
[Service]
Type=oneshot
User=user1
ExecStart=/tmp/inferencing/pgrep_test.sh
or use a lock file through flock, which uses kernel-based exclusive locking to guarantee a single instance.
* * * * * user1 flock -n /tmp/inferencing/lock /tmp/inferencing/pgrep_test.sh
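If changing the crontab entry is inconvenient, the same kind of lock can be taken from inside the script. The following is a sketch of a common pattern with flock(1) from util-linux, not something from your original script; the lock path, the acquire_lock function name, and the choice of file descriptor 9 are all assumptions made for the example.

```shell
#!/bin/bash
# Hypothetical lock path; any file the script's user can create will do.
lockfile="${TMPDIR:-/tmp}/pgrep_test.lock"

# Open the lock file on file descriptor 9 and try to take an exclusive lock
# without waiting. Returns non-zero if another process already holds it.
acquire_lock() {
    exec 9>"$lockfile" || return 1
    flock -n 9
}

if ! acquire_lock; then
    echo "Another instance holds the lock, exiting" >&2
    exit 1
fi

echo "Lock acquired, continuing execution"
# The real work would go here. The kernel drops the lock when the script
# exits and fd 9 is closed, so no cleanup step is needed.
```

Because the lock belongs to the open file descriptor rather than to the existence of the file, a crashed run cannot leave a stale lock behind.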
What even is PID 12961? PPID 12957 is the /bin/bash call but these two commands are identical otherwise
It's the "subshell" that handles the command within $( ... ). Every time you use command substitution, bash spawns a child process to handle it. If a simple command is being substituted, that subshell process may directly 'exec' the command in-place (e.g. in the case of $(ps -ef)), but if a whole pipeline is being substituted, that won't necessarily happen.
While $$ always expands to the PID of the main shell process (i.e. its value is cloned when bash spawns subshells), you can use $BASHPID to get the real process ID of the current interpreter. For example:
$ echo $$, $BASHPID; ps $$ $BASHPID
208231, 208231
    PID TTY      STAT   TIME COMMAND
 208231 pts/3    Ss     0:00 bash
$ (echo $$, $BASHPID; ps $$ $BASHPID)
208231, 208287
    PID TTY      STAT   TIME COMMAND
 208231 pts/3    Ss     0:00 bash
 208287 pts/3    S+     0:00 bash
$ { echo $$, $BASHPID; ps $$ $BASHPID; }
208231, 208231
    PID TTY      STAT   TIME COMMAND
 208231 pts/3    Ss     0:00 bash
$ var=$(echo $$, $BASHPID; ps $$ $BASHPID); echo "$var"
208231, 208294
    PID TTY      STAT   TIME COMMAND
 208231 pts/3    Ss+    0:00 bash
 208294 pts/3    R+     0:00 ps 208231 208294
The 2nd and 4th examples use subshells (another easy way to detect this is to notice that variables set within a subshell do not get propagated back into the main shell), while the 1st and 3rd don't.
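The variable-propagation check mentioned in the parenthesis can be tried directly. This is a small sketch assuming bash 4 or later (for $BASHPID); all variable names are made up for the example.

```shell
#!/bin/bash
counter=0

( counter=1 )                 # parentheses: runs in a subshell
after_subshell=$counter       # still 0, the assignment was lost

{ counter=2; }                # braces: runs in the current shell
after_group=$counter          # now 2

captured=$(counter=3; echo "$counter")   # command substitution: a subshell again
after_substitution=$counter   # still 2; only the echoed text came back

main_pid=$$
pid_seen_by_dollar=$(echo $$)         # $$ keeps the main shell's PID
pid_seen_by_bashpid=$(echo $BASHPID)  # PID of the substitution subshell itself

echo "after_subshell=$after_subshell after_group=$after_group"
echo "captured=$captured after_substitution=$after_substitution"
echo "main=$main_pid dollar=$pid_seen_by_dollar bashpid=$pid_seen_by_bashpid"
```

The assignments made inside ( ... ) and $( ... ) never reach the main shell, while the one inside { ...; } does. In the same way, $$ reports the main shell's PID everywhere, and only $BASHPID reveals the extra process.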