These classes were originally a part of my "The Perfect Editor" project which never got anywhere (all good programmers have tried implementing an editor and failed :-) But the basic ideas for the text buffer implementation is still right and may be useful to you. Please note that text viewers have very different requirements.
It started out when I had a whopping 8MB RAM and running OS/2 and I discovered that most editors did not work that well when the file size was more the amount of physical memory. The main problem seemed to be that they frequently touched the all memory pages they had allocated casuing major swap-in/out. They also had too much memory overhead. I began wondering how I could implement an editor that efficiently dealt with huge files.
char **line; int lines;
Very simple. Very clean. Unfortunately this incurs too much memory overhead (usually 8 bytes memory block overhead per line).
struct Line { Line *prev,*next; char *text; };
This would make insertion much faster but going to a specific line means you have to traverse all the nodes.
struct Line { Line *left; Line *right; char *text; //and some stuff for keeping track of number of lines left and right };
Not bad. But the memory overhead would be too high. And the non-big-O complexity of dealing with the structure is too high.
I also considered variations of a B-tree but did not find an easy way of numbering lines. If the key in the B-tree is the line number then it would be very efficient to find a line, but inserting a line renumbers all the lines after it and you have to modify the B-tree.
The implementation that I designed is as follows: The file is split into blocks of less than 4KB. The lines inside each block is keept as NUL-terminated strings. The management stuff for the block is not kept in the same memory allocation to avoid swapping, but instead in a seperate array. Implementation overview:
struct Block { int bytes_used; int lines; char *chunk; //always 4KB }; struct Buffer { Block *block_array; int blocks; };
This means that as many lines that fit into a 4KB chunk is stored in a single block. When you have to go to a specific line you traverse the block_array to find the correct block and then go into that block and traverse the chunk to find the line (counting ascii NULs). When inserting a line you traverse the block_array to find the block. If the block has spare room you insert the line. If the block does not have spare room for the line you split the block into two blocks and the insert the line.
These operations are theoretically awful O(N) operations, but quite efficient because:
The memory overhead is also quite low. Assuming you load a 1MB file where the average line length is 80 bytes (including newline):
Each block will on the average contain: 4KB - 80/2 = 4096 - 40 = 4056 bytes of useful data. The block_array will be 1MB / 4056 = 258 entries The total memory allocation will be: 258 * ((4+4+4) + 4096) = 1059864 The overhead will be: (1059864-1024*1024) / (1024*1024) * 100 = 1%
Not bad.
I have tuned the implementation a bit. Some things that I have tried:
This implemenetation offers the underlying datastruct described above, wrapped in a class Buffer which offers interfaces for inserting and removing text, getting direct pointers to text lines, etc. It has built-in support for the BufferMark class which is kind of anchors in the text. The Buffer automatically updates the associated buffer marks when the text is modified. This makes it easy to keep track of where an anchor is (for cursors, top-of-screen, start-of-marked-block, etc)
textbuffer-1.0.tar.gz (9KB)