About MagmaCollections
Last updated at 9:22 pm UTC on 14 November 2006
Overview
Some programs must provide fast access to very large collections of objects without consuming a lot of memory. The normal Smalltalk collections such as Bag or OrderedCollection are not suitable for this: the contiguous ByteArray records Magma uses to store and transport Smalltalk objects would be impractical for a large Smalltalk Collection, not to mention the higher potential for concurrency conflicts. Magma can nonetheless maintain and quickly search large, flat structures.
Introducing MagmaCollection
Magma provides a new class for this large, flat kind of structure, called MagmaCollection, which offers the following features:
- Can theoretically hold trillions of objects, but is currently limited by the more restrictive of the following:
  - the available storage on the device that holds the Magma files (Magma does not currently support distributed storage)
  - Squeak's maximum file-size addressability
- Provides size and absolute position access (at: anInteger) making it suitable for scrolling lists.
- Supports multiple indexes for quickly locating any object in the collection.
- Supports "between-key" positioning, finding the next higher key when an exact key is not known.
- Supports key-order enumeration from any point.
- Allows multiple, simultaneous adds and removes without concurrency conflicts.
- Includes common index types, and it's easy to define new index types based on your program's needs.
- Supports batch operations via slowlyDo: [ ... ].
A likeness to..?
MagmaCollections behave like a Bag in that they can hold multiple instances of the same object and can very quickly answer occurrencesOf: anObject. After adding at least one index (via addIndex:), a MagmaCollection can also be queried for matching sub-collections.
Heterogeneity
MagmaCollections are heterogeneous: unlike tables in a relational database, they can hold very different kinds of objects in the same collection. The only constraint is that objects in the MagmaCollection must respond to all of its index selectors. For example, if you wanted a heterogeneous collection of People and Organizations, adding an index on #name would require instances of each of those classes to answer true to respondsTo: #name.
A convenience method lets you check whether an object you might want to add can be added:
myMagmaCollection canAdd: myObject
Creating a MagmaCollection
Creating a MagmaCollection is similar to creating many other kinds of objects:
MagmaCollection new
Despite its "size" and special nature, you treat it like any other object. To make it persistent, you reference it from another persistent object and commit. The special files required to support the collection will be created automatically on the server.
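As a minimal sketch of making a new collection persistent: the snippet below assumes mySession is a connected MagmaSession whose root object is a Dictionary, and uses #books as a hypothetical key chosen only for illustration. Neither assumption comes from this page, so adapt it to however your own repository is rooted.

```smalltalk
"Assumption: mySession root answers a Dictionary that is already persistent.
 #books is a hypothetical key used only for this illustration."
mySession commit: [
    mySession root
        at: #books
        put: MagmaCollection new ]
```

Once the commit succeeds, the collection is reachable from a persistent object and can be used like the myMagmaCollection in the examples that follow.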
Persistent nature
MagmaCollections maintain only a "page" of objects at a time in your local image. To reduce concurrency conflicts, objects added to a MagmaCollection by other users become available upon the next page retrieval, which can occur many times between transaction boundaries. However, objects in the collection will not change state once you've read them until you cross a transaction boundary.
Adding and removing objects
If you've used the Smalltalk collection classes, then the API will be second-nature. For example, to add an object:
mySession commit: [ myMagmaCollection add: myObject ]
to remove it:
mySession commit: [ myMagmaCollection remove: myObject ]
Indexes
Initially, the collection is not indexed. Without indexes, a MagmaCollection is limited in its ability to access the objects it references. You can test includes: and occurrencesOf: anObject, but to actually get at the elements, you must add an index.
myMagmaCollection addIndex:
    (MaAsciiStringIndex
        attribute: #bookTitle
        keySize: 64)
Magma defines two common index types: MaAsciiStringIndex for proper names, and MaSearchStringIndex for a more forgiving, case-insensitive index.
Depending on the keySize you specify, these indexes are sensitive to the first few characters:
index type | keySize (bits) | number of significant characters |
MaAsciiStringIndex | 64 | 9 |
MaAsciiStringIndex | 128 | 18 |
MaSearchStringIndex | 64 | 10 |
MaSearchStringIndex | 128 | 21 |
These index types suit the purposes they were intended for, but if your program has special needs you will need to define other index types. See Defining a new index type for more information.
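To see why only the first few characters of a string can matter, here is a hypothetical sketch of a fixed-width string key. It is not Magma's actual key calculation: it simply assumes 7 bits per ASCII character, which happens to be consistent with the MaAsciiStringIndex rows of the table (9 x 7 = 63 bits fits in 64, and 18 x 7 = 126 bits fits in 128). The block name keyFor is invented for this illustration.

```smalltalk
"Hypothetical sketch only, NOT Magma's real algorithm.
 Packs the first charCount characters of aString into one Integer,
 7 bits per character, padding short strings with zeros."
| keyFor |
keyFor := [ :aString :charCount |
    | key code |
    key := 0.
    1 to: charCount do: [ :i |
        code := i <= aString size
            ifTrue: [ (aString at: i) asInteger bitAnd: 127 ]
            ifFalse: [ 0 ].
        key := (key bitShift: 7) + code ].
    key ].

"Integer order follows string order for the packed prefix."
(keyFor value: 'Jackson' value: 9) < (keyFor value: 'Muller' value: 9).   "true"

"Strings that differ only after the 9th character get the same key."
(keyFor value: 'Jackson123' value: 9) = (keyFor value: 'Jackson124' value: 9).   "true"
```

Under this assumption, two titles that share their first nine characters are indistinguishable to a 64-bit index, which is the practical meaning of the character limits in the table.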
Accessing elements with MagmaCollectionReader
A MagmaCollectionReader provides a "view" of the objects in a MagmaCollection. Readers are useful for quickly obtaining subsets of the collection based on query criteria.
myReader := aMagmaCollection where:
    [ :reader |
    reader
        read: #lastName
        from: 'Jackson'
        to: 'Muller' ]
This will answer a MagmaCollectionReader with all objects whose #lastName >= 'Jackson' and <= 'Muller'. It knows the size and can access by absolute integer position.
For more information, see Magma Queries.
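Because a reader knows its size and supports access by absolute position, it can back a scrolling list. The following is a minimal sketch, assuming myReader is the reader answered by the where: example above and that it responds to size and at: as described; the variable names and the window of 20 rows are arbitrary choices made for illustration.

```smalltalk
"Assumption: myReader is the MagmaCollectionReader from the where: example.
 Collect the rows a list widget would show for one visible window."
| firstRow lastRow visibleRows |
firstRow := 1.
lastRow := 20 min: myReader size.    "do not run past the end of the reader"
visibleRows := (firstRow to: lastRow) collect: [ :i | myReader at: i ].
```

Scrolling then amounts to moving firstRow and lastRow and collecting again, rather than loading every matching object into the image.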
Optimizing read performance with pageSize:
Internally, the reader maintains only a "page" of objects from the collection in memory. When your program accesses outside the range of the page, the reader automatically retrieves a new page from the server. To optimize performance, you may customize the number of objects in memory at once with the #pageSize: attribute.
myReader pageSize: 500 "fetch up to 500 objects at a time"
Unlike changes to other Smalltalk objects, changes made to MagmaCollections by other users while you are between transaction boundaries will be visible to your program.
Batch operations
At some point, it may be necessary to enumerate an entire MagmaCollection. Because of their large size, this can take a long time, so enumeration is normally part of a utility script; e.g., you probably wouldn't want to make enumeration a regular part of an end-user program.
Most batch scripts will be concerned with reaching every object in the collection, so Magma marks the collection readOnly, preventing updates during the enumeration.
You also don't start your own transactions, but instead specify how frequently you want to commit.
myMagmaCollection
    slowlyDo: [ :each | each doSomething ]
    commitEvery: 1000 "commit every 1000 objects"
Support Files
When you create a new MagmaCollection or add a new index, an additional file is automatically created on the server when you commit. The file for the collection is named with its oid, followed by the extension '.hdx'. hdx stands for "hash index," the file structure used to support these large collections. When you add an index, an additional file is created, named with the selector and the oid of its collection. Magma maintains these files; you don't need to be concerned with them, nor should you try to rename them.
How they work
The key to MagmaCollections and their indexes is a pretty robust file structure (like a Dictionary of Bags), implemented in the class MaHashIndex.
A MaHashIndex provides an interface to a file that:
- represents 0 as the lowest possible key
- represents the highest possible key according to how many bits you define the index to be (i.e., 32, 64, or 128).
- associates every key to an oid (which identifies the object)
- can search for any key, or the next-higher key, very quickly (the search narrows at an exponential rate)
- provides enumeration from any absolute position, or from any key position.
- handles insertion and deletion of keys and the associated space-organization dynamically
- always maintains key order
- allows fine-tuning record sizes to optimize for the different key-dispersions of various kinds of indexes.
Note: MagmaCollections are most suitable for collections with a wide-dispersion of key values.
There is one MaHashIndex file for the original MagmaCollection, and two for each index. The keys for the original collection are the oids of the objects; the values are not currently used. For each index file, the key value is generated by a linear calculation made by the MaIndexDefinition in the client. The associated value is the oid of the object with that key.
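The "Dictionary of Bags" description and the next-higher-key behavior can be pictured with an in-memory analogy. This sketch uses only standard Smalltalk collections and says nothing about how MaHashIndex lays out its file; the block names addKey and oidsAtOrAfter and the sample keys and oids are invented for illustration, and the lookup here is a plain linear scan rather than the fast search the real structure provides.

```smalltalk
"In-memory analogy only; MaHashIndex itself is a file structure."
| index addKey oidsAtOrAfter |
index := Dictionary new.    "integer key -> Bag of oids sharing that key"

addKey := [ :key :oid |
    (index at: key ifAbsentPut: [ Bag new ]) add: oid ].

"Answer the oids stored under probeKey or, failing that, the next-higher key."
oidsAtOrAfter := [ :probeKey |
    | candidates |
    candidates := index keys asSortedCollection select: [ :k | k >= probeKey ].
    candidates isEmpty
        ifTrue: [ Bag new ]
        ifFalse: [ index at: candidates first ] ].

addKey value: 10 value: 1001.
addKey value: 10 value: 1002.    "two objects may share one key"
addKey value: 40 value: 1003.

oidsAtOrAfter value: 10.    "exact key: the Bag holding 1001 and 1002"
oidsAtOrAfter value: 25.    "no exact key: positions at key 40, the Bag holding 1003"
```

The second lookup is the "between-key" positioning mentioned in the feature list: a probe that matches no key lands on the next-higher one.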
Committing a change to the key of an object is not a problem.
Index updating
Before changing an indexed attribute of an object in a MagmaCollection, call MagmaSession>>#noteOldKeysFor: with that object:
mySession commit: [
    | reader |
    reader := myMagmaCollection read: #date from: Date today to: Date today.
    reader do: [ :event |
        mySession noteOldKeysFor: event.
        event date: Date tomorrow ] ].