grepcut - a combined grep+cut tool for extracting information from huge logs


I was working on a remote system with limited bandwidth. The system generated massive amounts of log data and I had to do some log analysis. The logs were huge (~30 GB/day), so I couldn't transfer them back home. The machine was already heavily loaded, so I couldn't use the usual "grep... | cut... | awk..." trick to reduce the data I needed, because that would gobble up three CPUs. A plain awk was also too slow.

The logs contained stuff like this:

20140722113118.691 10744.10762 D   48141 [Processed (success) blarg-i info=blarg.blarg.blarg.blarg.blarg;1382877319;0;Gy;, blurg=123456789, blurg= in 0.056 seconds (0.036 blurg time)]
20140722113251.991 10744.10762 D   48141 [Processed (success) blarg-t info=blarg.blarg.blarg.blarg.blarg;1382877319;0;Gy;, blurg=123456789, blurg= in 0.016 seconds (0.003 blurg time)]

And what I needed was:

20140722113118.691 blarg-u blarg.blarg.blarg.blarg.blarg;1382877319;0;Gy;

The normal way I would have extracted it would have been:

cat <a bunch of logs> | grep -e blarg-i -e blarg-u -e blarg-t | cut -c1-18,53- | awk '{print $1 " " $3 " " $4}'

but that simply used too much CPU.

So I ended up writing a combined grep+cut tool. It also has a tiny bit of awk-flavor in it because plain 'cut' doesn't work with variably-spaced fields, which my logs had.
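Conceptually, the tool does in one pass what this awk sketch does (awk itself was too slow for my purposes, but it shows how matching, column cutting, and field selection stack; the sample line and offsets below are made up):

```shell
# Made-up sample line; emulates: grep blarg-i | cut -c1-9,15- | awk '{print $1, $2, $3}'
printf '%s\n' 'stamp123 junk blarg-i AAA BBB CCC' |
awk '/blarg-i/ {
    s = substr($0, 1, 9) substr($0, 15)   # keep columns 1-9 and 15- (like cut -c1-9,15-)
    split(s, f, " ")                      # split the result on runs of whitespace
    print f[1], f[2], f[3]                # print fields 1, 2 and 3
}'
# prints: stamp123 blarg-i AAA
```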

Simple examples

grepcut works like a combination of fgrep/grep/egrep and cut, but with fewer features.

Finding text

Normal way:

    grep needle haystack.txt

Faster way:

    grepcut -e needle haystack.txt

Cutting columns

Normal way:

    cut -c40- <haystack.txt

Faster way:

    grepcut -c40- haystack.txt
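Both take everything from column 40 to the end of each line. On a made-up line with 39 characters of padding:

```shell
# 39 spaces of padding, so 'payload' starts at column 40.
printf '%39spayload\n' '' | cut -c40-
# prints: payload
```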

Cutting fields

Normal way:

    cut -d ':' -f4,7-10 <haystack.txt

Faster way:

    grepcut -d ':' -l4,7-10 haystack.txt
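The selected fields come out joined by the delimiter, in input order. On made-up sample data:

```shell
# Eleven colon-separated fields; keep field 4 and fields 7 through 10.
printf 'a:b:c:d:e:f:g:h:i:j:k\n' | cut -d ':' -f4,7-10
# prints: d:g:h:i:j
```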

Cutting variable-spaced fields

Normal way:

    awk '{print $3, $5}' <haystack.txt

Faster way:

    grepcut -d ' ' -w3,5 haystack.txt
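awk splits on runs of whitespace rather than on single delimiter characters, which is what makes it (and, presumably, grepcut's -w) work on variably-spaced columns. On a made-up sample:

```shell
# Irregular spacing between fields; awk still numbers them 1..5.
printf 'one   two three    four five\n' | awk '{print $3, $5}'
# prints: three five
```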

Complex example

Find lines containing the string 'needle' in two files (without printing the file names); keep columns 43 onward, and then extract the 8th field from that.

Normal way:

    grep -h needle haystack1.txt haystack2.txt | cut -c43- | awk '{print $8}'

Faster way:

    grepcut -e needle -c43- -d ' ' -w8 haystack1.txt haystack2.txt

As seen above, multiple operations are possible, and they stack.
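Checked end to end on a made-up line (the pattern at a known spot, padding sized so the payload starts at column 43), the pipeline and a single-pass equivalent agree:

```shell
# 'needle' plus 36 spaces = 42 characters, so the fields start at column 43.
line="$(printf 'needle%36sa b c d e f g h' '')"
# The classic pipeline:
printf '%s\n' "$line" | grep needle | cut -c43- | awk '{print $8}'
# A single-pass sketch of the same match + column cut + field pick:
printf '%s\n' "$line" | awk '/needle/ { split(substr($0, 43), f, " "); print f[8] }'
# both print: h
```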


The goal was to use less CPU than the piped commands do. So, does it?



I don't fully know why grepcut is so fast. For plain string searching grepcut uses <regex.h>, but why is that twice as fast as Linux's fgrep/grep/egrep? For plain column cutting, my only guess for why grepcut is faster than both Linux's and Solaris's cut is that it has fewer features and doesn't have to deal with multibyte character sets; but I'm not using those, so... uhmm, why is plain 'cut' so slow? For field cutting (-w) the speedup over awk is expected, since grepcut does not have a built-in language.


Only available as source: grepcut.tar.gz