Really...
We are a bit scattered at the moment -- please give us a few months.
The
InterMix II
Voice of Humanity
Specifications

InterMix is Community Middleware for the Web, designed to promote human unity by implementing the Voice of Humanity project as detailed at the Voice of Humanity website and blog.


Table of Contents

Phases
InterMix Database Engine


Phases
 

1) build InterMix Database Engine
2) implement email interface
3) implement dialogue on top of email interface
4) implement annotated web
5) implement annotated web as a Chandler parcel
6) reimplement the old InterMix forums


InterMix Database Engine (version 0.0.7)
 

InterMix Engine Design (version 0.0.7)

Version 0.0.7 reduces phrases to two words and introduces the idea of “range by halves”, which expands on the earlier idea of tagging items as being in the top such and such percentile. Version 0.1.0 will decide between MixDex and the “range by halves” method.  Version 0.0.7 carries both methods as alternatives.  Also the idea of user keywords makes an appearance. 

Version 0.0.6 drops MixDexBase on the grounds that it is needed only to locate MixDex records for a given item, and that can be done more efficiently another way.  Also we need to be clear that we only add MixDex records for a particular SortBy on an item if that item has a value for that SortBy.  This means, for instance, that if there is no interest rating for an item, then there are no Interest SortBy MixDex records for that item.  We also introduce the distinction between "content" and "structural" keywords.

Version 0.0.5 fills in the "perspective" idea, and adds MixDexBase as the secondary index on MixItem on which MixDex is built.  So now we have MixDexBase as a secondary on Item and MixDex as a secondary on MixDexBase!  Maybe.  Please comment whether the new design seems good or problematical. 

----------------

MixKeyword records will have a fourth, fifth and sixth field called KeywordMetaType1, KeywordMetaType2 and KeywordType.  KeywordMetaType1 will be a one byte field  to indicate that the keyword is derived from the content of the item, or is structural to InterMix.  For example, KeywordMetaType1 'i' indicates an 'i'tem content derived keyword.  Structural KeywordMetaType1s will be 'h'ub, ''c'ategory, 'p'articipant, 'l'ink.  No doubt, other structural types will be needed.  See http://blog.voiceofhumanity.net/newslog2.php/_v252/__show_article/_a000252-000018.htm.
----------------

Understanding "perspectives"

The core concept underlying InterMix is that groups of users rate items.  Of course this happens by individual group members rating the items and then averaging or accumulating for the group.  Nevertheless, the idea is that the group ratings for items will be very useful in establishing group priorities and even a group "consciousness" and a group "voice".  See http://blog.voiceofhumanity.net/newslog2.php/_v252/__show_article/_a000252-000018.htm.  

InterMix "perspectives" are the groups whose collective ratings are being tracked by InterMix.  An individual may be a member of more than one perspective.  For instance one may live in Pune and be a citizen of India and be a speaker of English.  If these three perspectives are defined to an InterMix "hub", then when such a person rates an item, the average rating for the item must be recalculated for all three perspectives and if it changes for one of the perspectives, then item indexing must be redone for that perspective. 

------------

InterMix data will be kept in five Berkeley DB tables using the Berkeley DB Transactional Data Store.  See http://pybsddb.sourceforge.net/reftoc.html and
http://pybsddb.sourceforge.net/ref/intro/products.html.  Very possibly, we will end up handling both the item content keywords and the ratings updates outside the transaction system.  In this case, only the structural keywords will be maintained for an item inside the transaction system.  

1) MixItem table - items are messages, users, groups and so forth -- keyed by a 32 bit sequentially assigned numerical ItemID with an additional 8 bits of flags to distinguish up to 256 types of item by key for a 5 byte key and stored as a btree table.  All MixItem records will be wrapped in xml in a format inevitably called the MiXml format.  The item, as presented from external sources will be kept in one MiXml section.  Additional sections of interest at this point are the ItemKey, ItemUuid, ItemStructuralKeywords, ItemContentKeywords.  Structural Keywords  and Content Keywords will be kept  in separate sections as KeywordIDs pointing into the MixKeyword table.  The idea behind this distinction is that when another hub requests an item, the structural keywords will be translated back into their human readable phrase form and sent with the item, while the content keywords will not -- content keywords will be rebuilt on the foreign hub. 

2) MixKeyword btree table - links strings of any length with much shorter KeywordIDs.  The basic idea is to allow long strings to be indexed efficiently in the MixDex table.  For instance, a long http address could be reduced to just 10 bytes, say.  Importantly, the MixKeyword table will also keep a count of how many items in MixItem are keyed by the referenced string.  This count will also be used to increase lookup efficiency.  The KeywordIDs will be sequentially assigned.  The primary key will be the string.  There will also be an automatically maintained secondary index to allow lookup of the string from the KeywordID. 

MixKeyword records will have a fourth, fifth and sixth field called KeywordMetaType1, KeywordMetaType2 and KeywordType.  KeywordMetaType1 will be a one byte field  to indicate that the keyword is derived from the content of the item, or is structural to InterMix.  For example, KeywordMetaType1 'i' indicates an 'i'tem content derived keyword.  Structural KeywordMetaType1s will be 'h'ub, ''c'ategory, 'p'articipant, 'l'ink.  No doubt, other structural types will be needed.  See http://blog.voiceofhumanity.net/newslog2.php/_v252/__show_article/_a000252-000018.htm.

3) MixRef table - relations between pairs of items are kept here, such as ratings (user/item), or parent-child relationship (item/reply) and so forth -- keyed by xref-type/item1/item2 and indexed both ways, so for a particular xref-type, such as "rating", for a particular item1, we easily find all the item2's, and the other way about.  For ratings, we can easily find the ratings for all items by a particular user, and the other way about, for a particular item we can easily lookup all ratings for that item.

4) MixDex is an InterMix maintained secondary index on MixItem.  The purpose of MixDex is to enable fast selections, sorted by ratings values, on very large collections of items. 

5) MixSys table is an easily extendable grabbag for miscellaneous purposes.  It is keyed by 3 byte printable keytype and 10 byte printable keyvalue.  The data is a string of any length. 

--------------

MixItem Details

The MixItem key is not going to be printable.  We will need a function to convert the numerical portion of the key into a printable number followed by 8 Y/N flags so: 0000020181YYNNNYNY.  

InterMix will be written so Big vs Little Endian issues never come up.  The key for the first five bytes is a number, but we will convert the number into a string internally, and then reconvert the string back to number byte by byte as needed. 

The trick is to shift the number in Python to get each byte separately and then concatenate all the bytes in little endian format so items that were added near to each other in time will have keys that are near to each other.  Btree is more efficient if "locality of reference" is leveraged.  Besides, we want the keys to be in the same order that the items were added, so we can use the keys as a standin for a date sort. 

-----------

MixDex will be a manually maintained secondary index on MixItem.  See http://pybsddb.sourceforge.net/ref/am/second.html and http://pybsddb.sourceforge.net/api_c/db_associate.html. 

By controlling the MixDex table ourselves through InterMix, we can be more efficient on retrieval by defining the key to be the entire record and the data to be null.  Then when we ask for the next MixDex record, bdb won't have to read the data -- looking up the key will be sufficient.  

MixDex records have five columns: Perspective, SortBy, KeywordID, SortValue, ItemID. 

MixDex
   Perspective
   SortByType (interest, approval, etc)
   KeywordID
   SortByValue
   ItemID

MixDex needs to be considered in two ways.  First, how do we maintain it when a MixItem changes or when participants rate an item thus changing SortBy values for the item?  Secondly, how do we use it for rapid lookup of items in sorted order with selection criteria?  

---------------

Maintaining MixDex from MixItem

MixDex records for an item come from four sources.  

1) Some MixDex records derive from InterMix "structural" keywords that define the use and place of the item for InterMix handling.  For instance, it will be important that InterMix keep track of other InterMix "hubs" that it discovers on the internet.  The information about these hubs will be kept as items that have a 'h'ub structural keyword applied to it.  All these structural keywords will be kept in the item itself in an xml section for that purpose.  They will be kept as KeywordIDs pointing into the MixKeyword table. 

2) Some MixDex records come from "content" keywords derived from the content of the item.  For instance, if the word "water" is found in the item, the KeywordID from the MixKeyword record for MetaType1 'i' for "water" will be kept in the item in an xml section for that purpose. 

3) From special content defined SortBy keywords.  The MiXml format, which will be the subject of another document, will allow multiple custom SortBys.  For instance, a movie database should be able to SortBy year.  The information to create these MixDex records is just the new SortBy value for the item, which then will cause the creation of one MixDex record for every structural and content keyword for the item.  It is worth noticing that these special SortBys do not have multiple perspectives, which limits the number of MixDex records required considerably. 

4) From "user keywords".  Each InterMix participant can keyword themselves.  "female" or "renter" or "love music".  Also participants can "join" groups, thus keywording themselves as members of that group.  User keywords are applied to every item that the user writes or rates. 

Whenever a rating average changes for an item, of if the contents of the item change or the structural keywords, then the Mixdex records need to be created/deleted/changed as needed in the most efficient way possible.   

--------------

Using MixDex for sorted lookup with selection
To understand the use of MixDex, we simplify MixDex action and say there is only one rating value that we want to sort items by, say, sum of interest ratings -- call this "Interest".  And say we just want to select items by keyword, allowing just one keyword in our query.  For example, we could ask to see a list of items that have the word "water" in them, sorted by Interest. 
In this simple case, the MixDex table would have just three columns: KeywordID, Interest, ItemID.  ItemID is the key of the MixItem table.  Interest is the sum of interest ratings for the Item pointed to by ItemID.  KeywordID is the 12 byte ID that we have reduced all keywords to via the MixKeyword table.  For every different word in the Item, there will be one record in the MixDex table.  

For lookup queries, MixDex will be keyed by KeywordID and Interest.  For any lookup query, we first go to the MixKeyword table to find the KeywordID for the query word.  Then we read MixDex in Interest order for that KeywordID.  Each record points to an item record for an item that has the query word in it. 

To handle AND logic, we work as many "logical nodes" of lookup as there are ANDed query elements.  For instance, if the query is for "water" AND "ice" AND "floe", we will have three nodes of lookup being worked together.  These logical nodes will be implemented as program objects.

First, we get the KeywordIDs for each of the three lookup words and the MixKeyword.Count for each KeywordID.  The MixKeyword.Count field tells us how many items in MixItem are keyed by this particular KeywordID.  Then we sort the nodes by MixKeyword.Count.  Wherever possible, we will consult first the node that has the least MixKeyword.Count, since there will be less hits for this KeywordID.  To understand, consider that many more items will have the word "water" than will have the word "floe".  So it will be less work to check first for the next item with the word "floe" and then see ifInterMix Engine Design (version 0.0.6) 
Version 0.0.6 drops MixDexBase on the grounds that it is needed only to locate MixDex records for a given item, and that can be done more efficiently another way.  Also we need to be clear that we only add MixDex records for a particular SortBy on an item if that item has a value for that SortBy.  This means, for instance, that if there is no interest rating for an item, then there are no Interest SortBy MixDex records for that item.  We also introduce the distinction between "content" and "structural" keywords. 
Version 0.0.5 fills in the "perspective" idea, and adds MixDexBase as the secondary index on MixItem on which MixDex is built.  So now we have MixDexBase as a secondary on Item and MixDex as a secondary on MixDexBase!  Maybe.  Please comment whether the new design seems good or problematical.  
---------------- 
MixKeyword records will have a fourth, fifth and sixth field called KeywordMetaType1, KeywordMetaType2 and KeywordType.  KeywordMetaType1 will be a one byte field  to indicate that the keyword is derived from the content of the item, or is structural to InterMix.  For example, KeywordMetaType1 'i' indicates an 'i'tem content derived keyword.  Structural KeywordMetaType1s will be 'h'ub, ''c'ategory, 'p'articipant, 'l'ink.  No doubt, other structural types will be needed.  See http://blog.voiceofhumanity.net/newslog2.php/_v252/__show_article/_a000252-000018.htm. 
---------------- 
Understanding "perspectives" 
The core concept underlying InterMix is that groups of users rate items.  Of course this happens by individual group members rating the items and then averaging or accumulating for the group.  Nevertheless, the idea is that the group ratings for items will be very useful in establishing group priorities and even a group "consciousness" and a group "voice".  See http://blog.voiceofhumanity.net/newslog2.php/_v252/__show_article/_a000252-000018.htm.   
InterMix "perspectives" are the groups whose collective ratings are being tracked by InterMix.  An individual may be a member of more than one perspective.  For instance one may live in Pune and be a citizen of India and be a speaker of English.  If these three perspectives are defined to an InterMix "hub", then when such a person rates an item, the average rating for the item must be recalculated for all three perspectives and if it changes for one of the perspectives, then item indexing must be redone for that perspective.  
------------ 
InterMix data will be kept in five Berkeley DB tables using the Berkeley DB Transactional Data Store.  See http://pybsddb.sourceforge.net/reftoc.html and 
http://pybsddb.sourceforge.net/ref/intro/products.html.  Very possibly, we will end up handling both the item content keywords and the ratings updates outside the transaction system.  In this case, only the structural keywords will be maintained for an item inside the transaction system.   
1) MixItem table - items are messages, users, groups and so forth -- keyed by a 32 bit sequentially assigned numerical ItemID with an additional 8 bits of flags to distinguish up to 256 types of item by key for a 5 byte key and stored as a btree table.  All MixItem records will be wrapped in xml in a format inevitably called the MiXml format.  The item, as presented from external sources will be kept in one MiXml section.  Additional sections of interest at this point are the ItemKey, ItemUuid, ItemStructuralKeywords, ItemContentKeywords.  Structural Keywords  and Content Keywords will be kept  in separate sections as KeywordIDs pointing into the MixKeyword table.  The idea behind this distinction is that when another hub requests an item, the structural keywords will be translated back into their human readable phrase form and sent with the item, while the content keywords will not -- content keywords will be rebuilt on the foreign hub.  
2) MixKeyword btree table - links strings of any length with much shorter KeywordIDs.  The basic idea is to allow long strings to be indexed efficiently in the MixDex table.  For instance, a long http address could be reduced to just 10 bytes, say.  Importantly, the MixKeyword table will also keep a count of how many items in MixItem are keyed by the referenced string.  This count will also be used to increase lookup efficiency.  The KeywordIDs will be sequentially assigned.  The primary key will be the string.  There will also be an automatically maintained secondary index to allow lookup of the string from the KeywordID.  
MixKeyword records will have a fourth, fifth and sixth field called KeywordMetaType1, KeywordMetaType2 and KeywordType.  KeywordMetaType1 will be a one byte field  to indicate that the keyword is derived from the content of the item, or is structural to InterMix.  For example, KeywordMetaType1 'i' indicates an 'i'tem content derived keyword.  Structural KeywordMetaType1s will be 'h'ub, ''c'ategory, 'p'articipant, 'l'ink.  No doubt, other structural types will be needed.  See http://blog.voiceofhumanity.net/newslog2.php/_v252/__show_article/_a000252-000018.htm. 
3) MixRef table - relations between pairs of items are kept here, such as ratings (user/item), or parent-child relationship (item/reply) and so forth -- keyed by xref-type/item1/item2 and indexed both ways, so for a particular xref-type, such as "rating", for a particular item1, we easily find all the item2's, and the other way about.  For ratings, we can easily find the ratings for all items by a particular user, and the other way about, for a particular item we can easily lookup all ratings for that item. 
4) MixDex is an InterMix maintained secondary index on MixItem.  The purpose of MixDex is to enable fast selections, sorted by ratings values, on very large collections of items.  
5) MixSys table is an easily extendable grabbag for miscellaneous purposes.  It is keyed by 3 byte printable keytype and 10 byte printable keyvalue.  The data is a string of any length.  
-------------- 
MixItem Details 
The MixItem key is not going to be printable.  We will need a function to convert the numerical portion of the key into a printable number followed by 8 Y/N flags so: 0000020181YYNNNYNY.   
InterMix will be written so Big vs Little Endian issues never come up.  The key for the first five bytes is a number, but we will convert the number into a string internally, and then reconvert the string back to number byte by byte as needed.  
The trick is to shift the number in Python to get each byte separately and then concatenate all the bytes in little endian format so items that were added near to each other in time will have keys that are near to each other.  Btree is more efficient if "locality of reference" is leveraged.  Besides, we want the keys to be in the same order that the items were added, so we can use the keys as a standin for a date sort.  
----------- 
MixDex will be a manually maintained secondary index on MixItem.  See http://pybsddb.sourceforge.net/ref/am/second.html and http://pybsddb.sourceforge.net/api_c/db_associate.html.  
By controlling the MixDex table ourselves through InterMix, we can be more efficient on retrieval by defining the key to be the entire record and the data to be null.  Then when we ask for the next MixDex record, bdb won't have to read the data -- looking up the key will be sufficient.   
MixDex records have five columns: Perspective, SortBy, KeywordID, SortValue, ItemID.  
MixDex 
   Perspective 
   SortByType (interest, approval, etc) 
   KeywordID 
   SortByValue 
   ItemID 
MixDex needs to be considered in two ways.  First, how do we maintain it when a MixItem changes or when participants rate an item thus changing SortBy values for the item?  Secondly, how do we use it for rapid lookup of items in sorted order with selection criteria?   
--------------- 
Maintaining MixDex from MixItem 
MixDex records for an item come from three sources.   
1) Some MixDex records derive from InterMix "structural" keywords that define the use and place of the item for InterMix handling.  For instance, it will be important that InterMix keep track of other InterMix "hubs" that it discovers on the internet.  The information about these hubs will be kept as items that have a 'h'ub structural keyword applied to it.  All these structural keywords will be kept in the item itself in an xml section for that purpose.  They will be kept as KeywordIDs pointing into the MixKeyword table.  
2) Some MixDex records come from "content" keywords derived from the content of the item.  For instance, if the word "water" is found in the item, the KeywordID from the MixKeyword record for MetaType1 'i' for "water" will be kept in the item in an xml section for that purpose.  
3) From special content defined SortBy keywords.  The MiXml format, which will be the subject of another document, will allow multiple custom SortBys.  For instance, a movie database should be able to SortBy year.  The information to create these MixDex records is just the new SortBy value for the item, which then will cause the creation of one MixDex record for every structural and content keyword for the item.  It is worth noticing that these special SortBys do not have multiple perspectives, which limits the number of MixDex records required considerably.  
Whenever a rating average changes for an item, of if the contents of the item change or the structural keywords, then the Mixdex records need to be created/deleted/changed as needed in the most efficient way possible.    
-------------- 
Using MixDex for sorted lookup with selection 

To understand the use of MixDex, we simplify MixDex action and say there is only one rating value that we want to sort items by, say, sum of interest ratings -- call this "Interest".  And say we just want to select items by keyword, allowing just one keyword in our query.  For example, we could ask to see a list of items that have the word "water" in them, sorted by Interest.  
In this simple case, the MixDex table would have just three columns: KeywordID, Interest, ItemID.  ItemID is the key of the MixItem table.  Interest is the sum of interest ratings for the Item pointed to by ItemID.  KeywordID is the 12 byte ID that we have reduced all keywords to via the MixKeyword table.  For every different word in the Item, there will be one record in the MixDex table.   
For lookup queries, MixDex will be keyed by KeywordID and Interest.  For any lookup query, we first go to the MixKeyword table to find the KeywordID for the query word.  Then we read MixDex in Interest order for that KeywordID.  Each record points to an item record for an item that has the query word in it.  
To handle AND logic, we work as many "logical nodes" of lookup as there are ANDed query elements.  For instance, if the query is for "water" AND "ice" AND "floe", we will have three nodes of lookup being worked together.  These logical nodes will be implemented as program objects. 
First, we get the KeywordIDs for each of the three lookup words and the MixKeyword.Count for each KeywordID.  The MixKeyword.Count field tells us how many items in MixItem are keyed by this particular KeywordID.  Then we sort the nodes by MixKeyword.Count.  Wherever possible, we will consult first the node that has the least MixKeyword.Count, since there will be less hits for this KeywordID.  To understand, consider that many more items will have the word "water" than will have the word "floe".  So it will be less work to check first for the next item with the word "floe" and then see if it also has the word "water", than if we look first at items with the word "water" and check each one if it has the word "floe".  We assume for the example that more items have "water" than have "ice" than have "floe". 
So first we use the KeywordID_Interest index on MixDex to find the MixDex.MixItemID that has the word "floe" with the highest Interest.  Then we check for the MixItemID that has the word "ice" with the highest Interest.  There are three cases: the Interest field for the "floe" node is less than, equal to or greater than the Interest field for the "ice" node.  
If the two Interest fields are not equal, then we don't have to bother actually comparing the MixItemID's, because we know the Interest fields would have to be the same for the same MixItemID, so we already know the MixItemID's are not the same.  Therefore we can discard all MixItemID's for the node with the higher Interest, down to the Interest of the other node.  If the "floe" node shows an Interest of 12292 and the "ice" node shows an Interest of 223299, we can restart the "ice" node index at 12292 and look for the next record with an Interest of 12292 or less.  If we get a hit at 12292, then we check the "water" node, else we restart the "floe" node at the new Interest of the "ice" node and so forth.  
To handle OR logic, we sort the nodes so we look at the node with the largest instead of the smallest MixKeyword.Count first.  
Node:  
   NodeType - Simple or Compound 
   direction (sort high to low or low to high) 
   sortby - Interest, Approval, etc 
   perspective  
   KeywordID - only simple nodes have KeywordID value 
   list of SubNodes - only Compound Nodes have SubNodes  
   LogicalFunction (AND, OR, NOT) - only Compound Nodes have LogicalFunction 
   Count - for simple nodes, this is the count from the MixKeyword table 
                for compound nodes, the count is derived from the SubNode Counts 
   CurrentSortbyValue 
   CurrentMixItemID 
   function SetCursor() - uses CurrentSortbyValue 
   function GetNext() - returns next SortbyValue and MixItemID 
Some Notes:  
Compound nodes with NOT logic have only a single sub-node. 
The count for a compound AND node is the smallest count of any of the subnodes. 
The count for a compound NOT node is calculated as TotalNumberOfItems minus the count for the subnode. 
The count for a compound OR node is calculated over the list of subnodes using the ResultCount from the previous subnode to calculate the new ResultCount for the next subnode.  Beginning with a current ResultCount of zero, the next ResultCount = current ResultCount + (1 - (current ResultCount / TotalNumberOfItems)) * nextSubNodeCount. 
For example if there are 1000 items altogether and we are calculating the count for "water" OR "ice" where "water" has a count of 100 and "ice" has a count of 10, then 
a) 100 = next ResultCount = 0 + ((1 - (0 / 1000)) * 100) 
b) 109 = next ResultCount = 100 + ((1 - (100 / 1000)) * 10) 
This may seem like a lot of trouble for no effect -- except it does make a difference if we are dealing with a subnode that implements NOT logic where the count can be very large.  
For example if in the same database as above, we search for (NOT "water") OR (NOT "ice"), then the count for (NOT "water") is 900 and for (NOT "ice") is 990, so 
a) 900 = next ResultCount = 0 + ((1 - 0 / 1000)) * 900) 
b) 999 = next ResultCount = 900 + ((1 - (900 / 1000)) * 990) 
--------------------------------------------- 
The difficulty with the MixDex design is that adding a new item or updating a rating may be too slow because there are so many MixDex records to be added or updated.

To understand the difficulty, imagine a moderately sized item - say a one thousand word essay.  Assuming we will index phrases up to five words in length, that gives us maybe 4*975 for the phrases + 600 basic words = 4500 content KeywordIDs.  (There will also be perhaps 10 or 20 structural keywords, but these are so few, we can ignore them.)  Now for every SortBy we will need one MixDex record for each KeywordID.  Worse, for the rating SortBys we will need an additional set of MixDex records covering all the KeywordIDs.  We will want always to sort by 1) date and 2) thread/date, and 3) on average one special SortBy so that is three sets of MixDex records, or 13,500 of them.  Then for each rating SortBy, there will be one set for every perspective.  So that is Interest, Approval, Controversy, Value, and Usefulness - five rating type SortBys for maybe 8 perspectives gives 193,500 MixDex records!  Perhaps 8 megs of complex indexing for a single modest item entry.

To mitigate this design disaster, we begin by removing a fairly extensive list of common generic words, such as "is" and "very".  This only helps with the basic word list, not with the phrases, but still, that brings us down to maybe 4400 keywordIDs per SortBy/Perspective, or 189,200 MixDex records.

We can do even better by keeping two word phrases only (instead of 2,3,4 and 5 word phrases) and stopping phrases at periods and semi-colons.  We handle requests for longer phrases by building up the longer phrase from two-word components.  Thus a request for “the lazy brown fox” is handled as a request for “the lazy” AND “lazy brown” AND “brown fox”.  This method will give a few false positives but not so many as to make a real difference. Also, on phrase indexing, we do not need to look past periods or semi-colons.  On our 1000 word essay, this will bring us down to maybe 1450 keywords from 5000, and our total down to 62,350 from 215,000 MixDex records. 

Then we realize that "Usefulness" could be handled as "Approval" - so now we have only 4 rating SortBys, which brings us down to 46,400 records.

Moreover, we could say that we keep MixDex records only for the 20 percent of messages that have the most number of ratings by Perspective.  So we are down to 13,630 MixDex records.  Still too high, so maybe only the 10 percent of messages with the most ratings – brings us down to 8,990.  This is on the assumption of 8 perspectives.  

Finally, only the structural keyword MixDex records are written and deleted as part of the transaction that adds or updates the item.  The adding/updating of item content based MixDex records is queued for a work thread and handled outside of the transaction.  The down side to this is that if the system crashes, item content MixDex records will sometimes be out of sync with the item data.  We handle this by backing up the queue with a system file record for each item that is queued, and deleting that system file record only when the work thread has completed for that item and flushed the cache to disk.  Every time InterMix comes up, it will check the system file for unfinished business and recreate the queue as needed.

With disk space getting cheaper at a rapid rate, we are perhaps just into the realm of the possible for the home computer. 

It would be better though if every common enough user keyword could be a perspective, instead restricting the number of perspectives to a small preset number.  Thus if a sufficient number of people are tagged as “funny”, then there should be a perspective of funny people, and so forth  -  intellectuals, homeowners – perhaps hundreds of single factor perspectives.  We would limit the number of such perspectives by having a threshold number of participants who must share a perspective before the perspective is activated.  
Some of the more popular perspectives, like “woman” or “homeowner” may have sub-perspectives, such as “single woman” or “Santa Monica homeowner”.   However, the number of perspectives grows exponentially if every combination is going to be tracked, so multi-factor perspectives will have to be consciously applied as such by the users to be considered. 

Here, then, is an alternative plan for implementing selection with sort on large numbers in a way that allows for many perspectives to be tracked: Select first and then sort -- “SelectAndSort” we can call it.
In a way, this is the obvious method, except of course, a problem arises when the selection is huge, as for instance, when no selection criteria are used, so every item in the database is selected.  Clearly the selection must be limited in number if the sort is to be after selection, but in that case, how do we guarantee that the highest rated items make the cut?
A possible solution would be to have range keywords to use in the selection.  For instance we might divide the “Interest” range into hundreths, and look first for items in the top percentile, then do a fresh search for those in the second highest percentile and so forth until we have accumulated a preset number of items to be sorted by average Interest. 
If the number of items in the pool becomes very large, however, then any scheme based on a set number of ranges, such as 100, will fail, because too many items will always be selected.  We need a method whereby the number of items can increase indefinitely but in a way that is easy to handle programmatically and that will keep the number of ranges from becoming too large.  One possible such solution is what I call “range by halves”.
Here is an example of range by halves.  Let's say the limit for selection is 1000 items.  That is we will select 1000 items max and then sort them by rating.  When the number of items in the pool becomes 1001, we need to split the pool into a top half and a bottom half, assigning keywords that indicate which half an item falls into.  When the top half gets its own 1001st item, we split the pool into quarters, but this time we assign new keywords only to the top quarter and the bottom quarter.  Then when the top quarter reaches 1001, we divide the top and bottom quarters, assigning a keyword only for the top eighth and bottom eighth.  Assuming the pool is growing at a steady rate, the splits become rarer and rarer, and each split requires only the same number of new keyword assignments as the first split did – 500 items at the top and 500 items at the bottom are assigned an additional, new range keyword and have the old range keyword removed.  Whenever there is a split, the old range keywords remain accurate for all except the top and bottom 500.  
Notice that after a split, items in the new ranges no longer have their old keywords.  Thus if the top quarter is split, the items in the new top eights are keyed “top eighth”, but items in the 2nd from the top eighth are keyed “top quarter”.  To select the top quarter, one must then specify “top eighth” OR “top quarter”.  If one specifies only “top quarter for a range that is split into eighths, then one actually gets only the 2nd top eighth.  A bit confusing. 
In the range by halves method, the top and the bottom splits are coordinated.  When one splits, the other splits.  New range delimiters are calculated at split time, but existing keywords are not redone because that could be too massive.  Rather, existing range keywords are redone item by item when new ratings for the item come in. 
Interestingly, archiving can work with this scheme.  Archiving is the removal of items from the active database.  First, of course, old items that have no ratings are archived.  If rated items need to be archived, old items that are in the middle two quarters of the pool will be archived first.  The current set of range keywords is kept for a pool of items that have been partly archived, but a new set of range delimiters is calculated that will gradually reduce the numbers of items in the more extreme high and low ranges as existing range keywords are redone for each item that gets a new rating.
The advantage of this SelectAndSort with range by halves is that the number of keyword records is greatly reduced while a large number of perspectives can be accomodated.  Each perspective will have its own set of range by halves keywords that are based on the size of the pool of items for that perspective.  (An item is in the pool for a perspective if anyone from that perspective has rated the item.)  
One disadvantage of SelectAndSort with range by halves is that only the top or bottom limit number of items (1000 in the example) can be listed for a given keyword selection.  There are ways around this limitation, but they will be cumbersome.  A second disadvantage is that building the list will be slower than via MixDex, perhaps noticeably slower.  Once the list is built, then SelectAndSort with range by halves should be fast, but typically people only want to see the first page of a selection, so we build a list of 1000 to display a list of 40, and then build another list of 1000 and so forth.  This seems a serious drawback.

-------------------------------------------- 
    
Because there is more than one "SortBy" we have the possibility to add an additional column to the MixDex file for each SortBy.  One column for Approval rating, another column for Interest, and yet a third for Controversy.  Or we can add a single column SortByType in MixDex.   
Thus, option one, add a column for each SortBy would have MixDex records like this:  
MixDexRec 
   KeywordID 
   InterestSortBy 
   ApprovalSortBy 
   ControversySortBy 
   ItemID 
Option two gives us 
MixDexRec 
   KeywordID 
   SortByType (interest, approval, etc) 
   SortByValue 
   ItemID 
    
For every KeywordID for an Item, option two will have three records, while option one will have only one record.  Also, because in option two the KeywordID and ItemID has to be repeated in each record, and we also have a SortByType column, option two will take quite a bit more space.  
Nevertheless, I think we should go with option two because it is more flexible.  If we want to add a new SortBy, it will be relatively easy to do so.  
--------------- 
Rethinking the options.   
There is an option three, and that is to have a separate secondary index for each SortBy. 
    
This gives us:  
MixDexInterestRec 
   KeywordID 
   Interest 
   ItemID 
MixDexControversyRec 
   KeywordID 
   Controversy 
   ItemID 
and so forth.  This has the advantage of smaller tables.  It is simple.  However we do have to go change the code in multiple places when we want to add a new SortByType and therefore makes it difficult for us to give the user the ability to add new sorts.  So I think we still stick with option two for now.   
If the speed problem with option two is too great, we may implement both option one and option three, using a separate thread for each of the SortBy MixDexBase tables.  In any case, the work will be done in work threads so we can return immediately to the requestor.  
------------------  
A point to realize: Berkeley DB btree is optimized to insert new records in increasing sequence -- in other words, if the records are already sorted by key, then bdb btree is more efficient.  See http://peertech.org/BdbAndSmallRecords.   

Here is a quote from a Sleepycat article: "As records are inserted and pages in the B+tree fill up, they are split, with about half the keys going into a new peer page at the same level in the tree. Most B+tree implementations leave both nodes half-full after a split. This leads to poor performance in a common case, where the caller inserts keys in order. To handle this case, Berkeley DB keeps track of the insertion order, and splits pages unevenly to keep pages fuller. This reduces tree size, yielding better search performance and smaller databases." -- see http://www.sleepycat.com/docs/ref/refs/bdb_usenix.html

Last changed June 10, 2004.
page maintained by Roger Eaton

back to home page
 

InterMix Logo