Remove lines that are in another file
I had an issue this week where I needed to remove lines from one file if they existed in another file. Looking back, it was frustrating, as such a task should be simple.
I tried all sorts of things, including differencing the two files and using grep to grab the lines I wanted. Whatever I tried just did not produce the expected results. Thanks to a buddy, I found the solution, which ended up being to sort the two files before using diff.
Example:
Assume two files exist, File_1.txt and File_2.txt. File_1.txt contains the lines a, b, c and d. File_2.txt contains b and d. If we want to remove b and d from File_1.txt because they exist in File_2.txt, we could use something like this:
    owen@linuxblog:~$ cat File_1.txt
    a
    b
    c
    d
    owen@linuxblog:~$ cat File_2.txt
    b
    d
    owen@linuxblog:~$ diff File_1.txt File_2.txt | grep \< | cut -d ' ' -f 2
    a
    c
That’s all fine and dandy until File_2.txt contains the same lines in a different order. Because diff compares the files in sequence, running the same command then produces different results. See below:
    owen@linuxblog:~$ cat File_2.txt
    d
    b
    owen@linuxblog:~$ diff File_1.txt File_2.txt | grep \< | cut -d ' ' -f 2
    a
    b
    c
The solution, as noted above, is to sort both files beforehand and then difference them:
    owen@linuxblog:~$ sort File_1.txt > File_1-sorted; sort File_2.txt > File_2-sorted
    owen@linuxblog:~$ diff File_1-sorted File_2-sorted | grep \< | cut -d ' ' -f 2
    a
    c
Obviously the example has been simplified; when dealing with thousands of lines, the sort could take a while. With that said, I’m sure there are more efficient ways to achieve the same results, and I wouldn’t doubt there being a command better suited to do this. Have at it in the comments.
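For what it’s worth, here is a minimal sketch of two candidates for that better-suited command, assuming the standard grep, sort and comm utilities are available. Neither comes from the original workflow above, and the function names remove_lines_grep and remove_lines_comm are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from the post).
# Both print the lines of the first file that do not appear in the second.

remove_lines_grep() {
    # -F: treat patterns as fixed strings, not regular expressions
    # -x: a pattern must match the whole line
    # -v: print the lines that do NOT match
    # -f: read the patterns (one per line) from the second file
    grep -F -x -v -f "$2" "$1"
}

remove_lines_comm() {
    # comm needs sorted input, so sort into temporary files first
    rl_sorted_1=$(mktemp)
    rl_sorted_2=$(mktemp)
    sort "$1" > "$rl_sorted_1"
    sort "$2" > "$rl_sorted_2"
    # -2 hides lines only in the second file, -3 hides lines in both,
    # which leaves the lines found only in the first file
    comm -23 "$rl_sorted_1" "$rl_sorted_2"
    rm -f "$rl_sorted_1" "$rl_sorted_2"
}

# Demo data matching the example above, with File_2.txt out of order
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf '%s\n' a b c d > "$tmp/File_1.txt"
printf '%s\n' d b > "$tmp/File_2.txt"

remove_lines_grep "$tmp/File_1.txt" "$tmp/File_2.txt"   # prints a, then c
remove_lines_comm "$tmp/File_1.txt" "$tmp/File_2.txt"   # prints a, then c
```

The grep version needs no sorting and keeps File_1.txt in its original order, duplicates included. The comm version still sorts, so its output comes back in sorted order, but it skips the diff, grep and cut steps.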