bash

A files intersection benchmark

Sang

Nov 14, 2019 • 4 min read

Based on this StackExchange question. I did a benchmark on intersection methods.

There are two files that contain uniquely sorted lines. Our task is to find the number of lines that are exactly the same in both these files.

Benchmark environment

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 19.10
Release:	19.10
Codename:	eoan

OS version

$ free
              total        used        free      shared  buff/cache   available
Mem:      131862924     8826584      719240      348492   122317100   121428700
Swap:       2097148     2097148           0

RAM usage

$ lscpu         
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          32
On-line CPU(s) list:             0-31
Thread(s) per core:              2
Core(s) per socket:              16
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           1
Model name:                      AMD Ryzen Threadripper 1950X 16-Core Processor
Stepping:                        1
CPU MHz:                         3797.061
BogoMIPS:                        6798.97
Virtualization:                  AMD-V
L1d cache:                       512 KiB
L1i cache:                       1 MiB
L2 cache:                        8 MiB
L3 cache:                        32 MiB
NUMA node0 CPU(s):               0-31
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full AMD retpoline, STIBP disabled, RSB filling
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonst
                                 op_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy ab
                                 m sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev vmmcall fsgsbase bmi1 avx2 smep bmi2 rds
                                 eed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter
                                  pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca

Files statistics:

$ wc a.txt b.txt
 10000000  10000403 213482749 a.txt
  1000000    999998  20353901 b.txt
 11000000  11000401 233836650 total

Note that both 2 files are uniquely sorted beforehand with LC_ALL=C sort -u.

Benchmark result

$ time (LC_ALL=C sort -u a.txt b.txt | wc -l);
10987323
( LC_ALL=C sort -u a.txt b.txt | wc -l; )  2.05s user 0.84s system 262% cpu 1.105 total
time (LC_ALL=C sort -u b.txt a.txt | wc -l);
10987323
( LC_ALL=C sort -u b.txt a.txt | wc -l; )  2.22s user 0.76s system 253% cpu 1.177 total
time (LC_ALL=C awk 'BEGIN{while( (getline k < "a.txt")>0 ){a[k]}} $0 in a' b.txt | wc -l );
12677
( LC_ALL=C awk 'BEGIN{while( (getline k < "a.txt")>0 ){a[k]}} $0 in a' b.txtw)  8.47s user 0.58s system 100% cpu 9.041 total
time (LC_ALL=C awk 'BEGIN{while( (getline k < "b.txt")>0 ){a[k]}} $0 in a' a.txt | wc -l );
12677
( LC_ALL=C awk 'BEGIN{while( (getline k < "b.txt")>0 ){a[k]}} $0 in a' a.txtw)  5.06s user 0.12s system 100% cpu 5.184 total
time (LC_ALL=C perl -ne 'BEGIN{ $h{$_}=1 while <STDIN> } print if $h{$_}' <a.txt b.txt | wc -l );
12677
( LC_ALL=C perl -ne 'BEGIN{ $h{$_}=1 while <STDIN> } print if $h{$_}' b.txt <)  5.99s user 0.48s system 100% cpu 6.474 total
time (LC_ALL=C perl -ne 'BEGIN{ $h{$_}=1 while <STDIN> } print if $h{$_}' <b.txt a.txt | wc -l );
12677
( LC_ALL=C perl -ne 'BEGIN{ $h{$_}=1 while <STDIN> } print if $h{$_}' a.txt <)  3.79s user 0.10s system 100% cpu 3.892 total
time (LC_ALL=C comm -12 a.txt b.txt --nocheck-order | wc -l);
12677
( LC_ALL=C comm -12 a.txt b.txt --nocheck-order | wc -l; )  0.82s user 0.04s system 100% cpu 0.854 total
time (LC_ALL=C comm -12 b.txt a.txt --nocheck-order | wc -l);
12677
( LC_ALL=C comm -12 b.txt a.txt --nocheck-order | wc -l; )  0.78s user 0.07s system 101% cpu 0.835 total
time (LC_ALL=C awk 'NR==FNR { lines[$0]=1; next } $0 in lines' a.txt b.txt | wc -l );
12677
( LC_ALL=C awk 'NR==FNR { lines[$0]=1; next } $0 in lines' a.txt b.txt | wc -)  8.75s user 0.58s system 100% cpu 9.326 total
time (LC_ALL=C awk 'NR==FNR { lines[$0]=1; next } $0 in lines' b.txt a.txt | wc -l );
12677
( LC_ALL=C awk 'NR==FNR { lines[$0]=1; next } $0 in lines' b.txt a.txt | wc -)  5.99s user 0.15s system 100% cpu 6.130 total

time (LC_ALL=C grep -Fxf a.txt b.txt | wc -l ): forever
time (LC_ALL=C grep -Fxf b.txt a.txt | wc -l ): never stop
time (LC_ALL=C join --nocheck-order a.txt b.txt | wc -l): do not use join with sort. They are not compatible.
time (LC_ALL=C join --nocheck-order b.txt a.txt | wc -l): same as above

List of methods and running time in the following table:

method	a.txt → b.txt	b.txt → a.txt
sort	2.05	2.22
awk getline	8.47s	5.06
perl	5.99	3.79
grep	forever	forever
comm	0.82	0.78
awk(array)	8.75	5.99
join	n/a	n/a

Conclusion:

comm leads the table as the fastest method followed by sort. perl also has reasonable performance. Note: sort performance highly depends on your locale (the LC_* environment). In my previous experiment, with LANG=en_US.UTF-8, sort is slower than perl and awk.
Upon method being used, if two files are different in size, the file order is a matter regarding performance.
grep is incredibly slow and should not be used. join should not be used due to localization-related bugs.

Sign up for more like this.