A performance benchmark: counting occurrences of a character
I needed to find the maximum number of occurrences of a character in any single line of a file.
Based on this SO answer, I collected the following methods:
- variable expansion:
cnt="${line//[^:]}"; echo "${#cnt}"
- tr:
cnt="$(tr -dc : <<<"${line}")"; echo "${#cnt}"
- awk:
awk -F: '{print NF-1}' <<< "${line}"
- grep:
cnt="$(grep -o : <<< ":${line}" | grep -c .)"; echo "$((cnt - 1))"
- perl:
perl -nle 'print s/://g' <<<"${line}"
I wrote the following script to benchmark these methods.
Important note (thanks to @jm666): my benchmark result is only meaningful when the processing time per invocation is small compared to program startup time.
The line-by-line reading cost is the same for every method (the "read only" pass in the script measures it), but program startup time differs from method to method, and every external tool is started once per line. That explains the difference between my final result and the result in the SO answer.
TL;DR: if you are processing a big file in one go, you should not trust the benchmark result below.
#!/usr/bin/env bash
set -e
# try many ways to count number of a character then measure performance
# https://stackoverflow.com/questions/16679369/count-occurrences-of-a-char-in-a-string-using-bash
DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" >/dev/null 2>&1 && pwd)"
input_path="$1"
if [[ -z ${input_path} ]]; then
  echo "input path is required"
  exit 1
fi
# read only
start="$(date +%s%N)"
echo "read only"
while read -r line; do
  true
done <"${input_path}"
echo "$(( ( $(date +%s%N) - $start) / 1000000 )) msec"
# variable expansion
start="$(date +%s%N)"
max_colons=0
while read -r line; do
  cnt="${line//[^:]}"
  if [[ ${#cnt} -gt ${max_colons} ]]; then
    max_colons="${#cnt}"
  fi
done <"${input_path}"
echo "variable expansion: ${max_colons}"
echo "$(( ( $(date +%s%N) - $start) / 1000000 )) msec"
# tr
start="$(date +%s%N)"
max_colons=0
while read -r line; do
  cnt="$(tr -dc : <<<"${line}")"
  if [[ ${#cnt} -gt ${max_colons} ]]; then
    max_colons="${#cnt}"
  fi
done <"${input_path}"
echo "tr: ${max_colons}"
echo "$(( ( $(date +%s%N) - $start) / 1000000 )) msec"
# awk
start="$(date +%s%N)"
max_colons=0
while read -r line; do
  cnt="$(awk -F: '{print NF-1}' <<< "${line}")"
  if [[ ${cnt} -gt ${max_colons} ]]; then
    max_colons="${cnt}"
  fi
done <"${input_path}"
echo "awk: ${max_colons}"
echo "$(( ( $(date +%s%N) - $start) / 1000000 )) msec"
# grep
start="$(date +%s%N)"
max_colons=0
while read -r line; do
  # a colon is prepended so grep always matches; it is subtracted when printing
  cnt="$(grep -o : <<< ":${line}" | grep -c .)"
  if [[ ${cnt} -gt ${max_colons} ]]; then
    max_colons="${cnt}"
  fi
done <"${input_path}"
echo "grep: $((${max_colons} - 1))"
echo "$(( ( $(date +%s%N) - $start) / 1000000 )) msec"
# perl
start="$(date +%s%N)"
max_colons=0
while read -r line; do
  cnt="$(perl -nle 'print s/://g' <<<"${line}")"
  if [[ ${cnt} -gt ${max_colons} ]]; then
    max_colons="${cnt}"
  fi
done <"${input_path}"
echo "perl: ${max_colons}"
echo "$(( ( $(date +%s%N) - $start) / 1000000 )) msec"
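The timing boilerplate repeats for every method. One possible way to factor it out is a small helper like the sketch below. `now_ns` and `time_method` are hypothetical names that are not part of the script above, and `date +%s%N` assumes GNU date, as the script already does.

```bash
#!/usr/bin/env bash
# now_ns and time_method are illustrative helpers, not part of the original script.

# Current time in nanoseconds (assumes GNU date; %N is not portable).
now_ns() { date +%s%N; }

# Run a command in the current shell and report its wall-clock time
# in milliseconds on stderr, prefixed with the given label.
time_method() {
  local label="$1"
  shift
  local start end
  start="$(now_ns)"
  "$@"
  end="$(now_ns)"
  echo "${label}: $(( (end - start) / 1000000 )) msec" >&2
}
```

Each loop would then be moved into its own function and called as, for example, `time_method "tr" run_tr_loop "${input_path}"`, where `run_tr_loop` is again a hypothetical name.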
Input file size
$ ls -alh colons.txt
-rw-rw-r-- 1 transang transang 787K Nov 12 15:06 colons.txt
Running result
$ ./max-colon-perf.sh colons.txt
read only
50 msec
variable expansion: 4
294 msec
tr: 4
10256 msec
awk: 4
16905 msec
grep: 4
14062 msec
perl: 4
17365 msec
Sorted from fastest to slowest: variable expansion > tr > grep > awk > perl. Variable expansion, the only method that starts no external process, is roughly 35 times faster than the next best (tr).
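For the big-file case that the note above warns about, the per-line startup cost can be avoided by letting a single process scan the whole file. The following is a minimal sketch with awk, assuming the goal is still the maximum number of colons per line. `max_colons_awk` is a hypothetical name, and I have not benchmarked this variant.

```bash
#!/usr/bin/env bash
# max_colons_awk is an illustrative helper, not part of the benchmark script.
# A single awk process reads the whole file, so startup cost is paid once.
max_colons_awk() {
  # NF - 1 is the colon count of the current line; keep the largest seen.
  # "max + 0" forces numeric output (0) when no line contains a colon.
  awk -F: 'NF - 1 > max { max = NF - 1 } END { print max + 0 }' "$1"
}
```

It would be called as `max_colons_awk colons.txt`.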