A files intersection benchmark
Based on this StackExchange question. I did a benchmark on intersection methods.
There are two files that contain uniquely sorted lines. Our task is to find the number of lines that are exactly the same in both these files.
Benchmark environment
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 1
Model name: AMD Ryzen Threadripper 1950X 16-Core Processor
Stepping: 1
CPU MHz: 3797.061
BogoMIPS: 6798.97
Virtualization: AMD-V
L1d cache: 512 KiB
L1i cache: 1 MiB
L2 cache: 8 MiB
L3 cache: 32 MiB
NUMA node0 CPU(s): 0-31
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full AMD retpoline, STIBP disabled, RSB filling
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonst
op_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy ab
m sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev vmmcall fsgsbase bmi1 avx2 smep bmi2 rds
eed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter
pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
Files statistics:
$ wc a.txt b.txt
10000000 10000403 213482749 a.txt
1000000 999998 20353901 b.txt
11000000 11000401 233836650 total
Note that both 2 files are uniquely sorted beforehand with LC_ALL=C sort -u
.
Benchmark result
$ time (LC_ALL=C sort -u a.txt b.txt | wc -l);
10987323
( LC_ALL=C sort -u a.txt b.txt | wc -l; ) 2.05s user 0.84s system 262% cpu 1.105 total
time (LC_ALL=C sort -u b.txt a.txt | wc -l);
10987323
( LC_ALL=C sort -u b.txt a.txt | wc -l; ) 2.22s user 0.76s system 253% cpu 1.177 total
time (LC_ALL=C awk 'BEGIN{while( (getline k < "a.txt")>0 ){a[k]}} $0 in a' b.txt | wc -l );
12677
( LC_ALL=C awk 'BEGIN{while( (getline k < "a.txt")>0 ){a[k]}} $0 in a' b.txtw) 8.47s user 0.58s system 100% cpu 9.041 total
time (LC_ALL=C awk 'BEGIN{while( (getline k < "b.txt")>0 ){a[k]}} $0 in a' a.txt | wc -l );
12677
( LC_ALL=C awk 'BEGIN{while( (getline k < "b.txt")>0 ){a[k]}} $0 in a' a.txtw) 5.06s user 0.12s system 100% cpu 5.184 total
time (LC_ALL=C perl -ne 'BEGIN{ $h{$_}=1 while <STDIN> } print if $h{$_}' <a.txt b.txt | wc -l );
12677
( LC_ALL=C perl -ne 'BEGIN{ $h{$_}=1 while <STDIN> } print if $h{$_}' b.txt <) 5.99s user 0.48s system 100% cpu 6.474 total
time (LC_ALL=C perl -ne 'BEGIN{ $h{$_}=1 while <STDIN> } print if $h{$_}' <b.txt a.txt | wc -l );
12677
( LC_ALL=C perl -ne 'BEGIN{ $h{$_}=1 while <STDIN> } print if $h{$_}' a.txt <) 3.79s user 0.10s system 100% cpu 3.892 total
time (LC_ALL=C comm -12 a.txt b.txt --nocheck-order | wc -l);
12677
( LC_ALL=C comm -12 a.txt b.txt --nocheck-order | wc -l; ) 0.82s user 0.04s system 100% cpu 0.854 total
time (LC_ALL=C comm -12 b.txt a.txt --nocheck-order | wc -l);
12677
( LC_ALL=C comm -12 b.txt a.txt --nocheck-order | wc -l; ) 0.78s user 0.07s system 101% cpu 0.835 total
time (LC_ALL=C awk 'NR==FNR { lines[$0]=1; next } $0 in lines' a.txt b.txt | wc -l );
12677
( LC_ALL=C awk 'NR==FNR { lines[$0]=1; next } $0 in lines' a.txt b.txt | wc -) 8.75s user 0.58s system 100% cpu 9.326 total
time (LC_ALL=C awk 'NR==FNR { lines[$0]=1; next } $0 in lines' b.txt a.txt | wc -l );
12677
( LC_ALL=C awk 'NR==FNR { lines[$0]=1; next } $0 in lines' b.txt a.txt | wc -) 5.99s user 0.15s system 100% cpu 6.130 total
time (LC_ALL=C grep -Fxf a.txt b.txt | wc -l )
: forevertime (LC_ALL=C grep -Fxf b.txt a.txt | wc -l )
: never stoptime (LC_ALL=C join --nocheck-order a.txt b.txt | wc -l)
: do not usejoin
withsort
. They are not compatible.time (LC_ALL=C join --nocheck-order b.txt a.txt | wc -l)
: same as above
List of methods and running time in the following table:
method | a.txt → b.txt | b.txt → a.txt |
---|---|---|
sort | 2.05 | 2.22 |
awk getline | 8.47s | 5.06 |
perl | 5.99 | 3.79 |
grep | forever | forever |
comm | 0.82 | 0.78 |
awk(array) | 8.75 | 5.99 |
join | n/a | n/a |
Conclusion:
comm
leads the table as the fastest method followed bysort
.perl
also has reasonable performance. Note:sort
performance highly depends on your locale (the LC_* environment). In my previous experiment, withLANG=en_US.UTF-8
,sort
is slower thanperl
andawk
.- Upon method being used, if two files are different in size, the file order is a matter regarding performance.
grep
is incredibly slow and should not be used.join
should not be used due to localization-related bugs.