Text buffer implementation

Background

These classes were originally a part of my "The Perfect Editor" project which never got anywhere (all good programmers have tried implementing an editor and failed :-) But the basic ideas for the text buffer implementation is still right and may be useful to you. Please note that text viewers have very different requirements.

It started out when I had a whopping 8MB RAM and running OS/2 and I discovered that most editors did not work that well when the file size was more the amount of physical memory. The main problem seemed to be that they frequently touched the all memory pages they had allocated casuing major swap-in/out. They also had too much memory overhead. I began wondering how I could implement an editor that efficiently dealt with huge files.

Possibility 1: a simple array of lines

  char **line;
  int lines;

Very simple. Very clean. Unfortunately this incurs too much memory overhead (usually 8 bytes memory block overhead per line).

Possibility 2: linked list

  struct Line {
    Line *prev,*next;
    char *text;
  };

This would make insertion much faster but going to a specific line means you have to traverse all the nodes.

Possibility 3: tree

  struct Line {
    Line *left;
    Line *right;
    char *text;
    //and some stuff for keeping track of number of lines left and right
  };

Not bad. But the memory overhead would be too high. And the non-big-O complexity of dealing with the structure is too high.

Possibility 4: B-tree

I also considered variations of a B-tree but did not find an easy way of numbering lines. If the key in the B-tree is the line number then it would be very efficient to find a line, but inserting a line renumbers all the lines after it and you have to modify the B-tree.

The final implementation

The implementation that I designed is as follows: The file is split into blocks of less than 4KB. The lines inside each block is keept as NUL-terminated strings. The management stuff for the block is not kept in the same memory allocation to avoid swapping, but instead in a seperate array. Implementation overview:

  struct Block {
    int bytes_used;
    int lines;
    char *chunk;  //always 4KB
  };
  struct Buffer {
    Block *block_array;
    int blocks;
  };

This means that as many lines that fit into a 4KB chunk is stored in a single block. When you have to go to a specific line you traverse the block_array to find the correct block and then go into that block and traverse the chunk to find the line (counting ascii NULs). When inserting a line you traverse the block_array to find the block. If the block has spare room you insert the line. If the block does not have spare room for the line you split the block into two blocks and the insert the line.

These operations are theoretically awful O(N) operations, but quite efficient because:

Traversing the block_array is cheap. It is a linear scan in a tight loop which processers are very good at.
Traversing inside a block is also quite cheap (linear scan, tight loop, optimized instructions for doing that)
The block_array usually fits inside a single memory page and the TLB is not thrashed. And the block_array is kept "hot" and can probably fit in the L1 cache.

The memory overhead is also quite low. Assuming you load a 1MB file where the average line length is 80 bytes (including newline):

Each block will on the average contain:
  4KB - 80/2  =  4096 - 40  =  4056 bytes of useful data.
The block_array will be
  1MB / 4056  = 258 entries
The total memory allocation will be:
  258 * ((4+4+4) + 4096)  = 1059864
The overhead will be:
  (1059864-1024*1024) / (1024*1024) * 100 = 1%

Not bad.

Tuning

I have tuned the implementation a bit. Some things that I have tried:

Changing the types used for keeping track of line numbers, byte numbers, etc.: There were no gain and it is best to stick with size_t. Yes, you could reduce the overhead of the block_array by using 16-bit intergers for per-block line coutners and byte counters but 16-bit entities are the most expensive data type on modern CPUs.
Changing the block size.: The ideal block size is about 3.7KB. 4KB is also fine. Small block sizes cause major overhead traversing the block_array and block sizes more than 6KB cause major overhead scanning inside the blocks.
Hand-tuning the block-scanning loops instead of using memchr.: memchr() is the fastest. Period.

What this implementation offers

This implemenetation offers the underlying datastruct described above, wrapped in a class Buffer which offers interfaces for inserting and removing text, getting direct pointers to text lines, etc. It has built-in support for the BufferMark class which is kind of anchors in the text. The Buffer automatically updates the associated buffer marks when the text is modified. This makes it easy to keep track of where an anchor is (for cursors, top-of-screen, start-of-marked-block, etc)

Download

textbuffer-1.0.tar.gz (9KB)