== Preface ==

Tewwy: I'd like to see discussion of abstraction of data usage in general (not just Flow) between Features and their data. Perhaps eventually a data abstraction "API" so data can be used as a general service that covers 90% of the space. [For instance: ContentHandler allows many different types of data to live on the page and revision tables]

== Abstraction of Data Usage? ==

Reusable code. Objects that will be used instead of raw database access. They should be natively backed by memcached as a write-through cache. We should be able to write functionality once for a particular usage model and reuse it in multiple places. If it works in one place, do some testing and it should work in the others as well.

To accomplish this, we should build write-through indexes implemented on memcache. Some will directly mimic an index elsewhere, like a primary key index. Some will index denormalized data. Some will index rollups. Some indexes will be read/write, meaning we will save new and updated domain models through them. Some indexes will be read-only, serving up denormalized data and updating by receiving create/update/remove events from the writable indexes.

By definition these access patterns are more restricted than you would get querying MySQL directly. This is intentional for a couple of reasons:

* Later down the road we want to be able to shard the data set without making drastic changes to the code base. This setup should discourage the use of some aspects of MySQL that we can't use when sharding anyway.
** How?
* All access patterns need to be supported by denormalization to memcache. By definition memcache is a more restrictive query environment than MySQL (really no queries, just key lookup of answers).

To go along with this:

* All ids used will be UUIDs generated on the application servers.

=== How is this not an ORM with write-through memcache? ===

We discourage relations; they are not supported at the query level.

=== How is this not inventing NoSQL on top of MySQL? ===

By allowing models to implement their own storage interface which is utilized by the indexing, this is more like building materialized views stored in distributed memory on top of MySQL. The aspect this does share with NoSQL is that all data usage needs to be thought about and declared up front. At write time we push the answer to every question we want to ask the data into memcache. At read time, when we are serving pages to a user, no calculation should take place. We just take the pre-calculated answer out of memcache.

== What does that look like? ==

Indexes in memcache. All indexes will, for consistency, store an array of values in a specifically prefixed set of memcache keys. The values are sorted, and the array is trimmed to a specific maximum length.

Use Value Objects and Service Layer separation.

* Upsides:
** Value objects can be serialized directly into memcache.
* Downsides:
** Lazy loading becomes difficult, if not impossible.

How to handle transactions? You can't roll back memcache. One option is to buffer the memcache calls. If we buffer memcache calls, then anything that returns a value (like $bag->add()) will not work. For this reason I don't think our buffer should implement BagOStuff. Instead, create a minimized pass-thru class to BagOStuff with only the functions we need to implement the indexing. Clearly document the limitations, and add begin() and commit() methods to manage the buffering.
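A minimal sketch of that pass-thru, assuming we only buffer the write operations the indexing needs (the class shape and method list here are guesses, not settled API):

 class BufferedCache {
 	protected $cache;
 	protected $buffer = null;
 
 	public function __construct( BagOStuff $cache ) {
 		$this->cache = $cache;
 	}
 
 	public function begin() {
 		$this->buffer = array();
 	}
 
 	public function commit() {
 		// replay the queued writes against the real cache
 		foreach ( $this->buffer as $call ) {
 			call_user_func_array( array( $this->cache, $call[0] ), $call[1] );
 		}
 		$this->buffer = null;
 	}
 
 	// Buffered writes cannot return a meaningful value, which is exactly
 	// why this class does not implement BagOStuff itself.
 	public function set( $key, $value, $exptime = 0 ) {
 		if ( $this->buffer === null ) {
 			$this->cache->set( $key, $value, $exptime );
 		} else {
 			$this->buffer[] = array( 'set', array( $key, $value, $exptime ) );
 		}
 	}
 }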
I don't think it's necessary to implement nested transactions; they don't appear to be supported in DatabaseBase. How do we get our buffer triggered when the db enters and exits a transaction?

 $cache = new MemcacheAdapter( $wgMemc );
 $buffer = new BufferedCache( $cache );
 $buffer->begin();
 ...
 $buffer->commit();

What about the post tree? We need to be able to denormalize the tree into memcache as well. We are denormalizing. The post tree cache should be the structure, the metadata, and the content all put together.

== What do indexes need to be able to stay up to date? ==

They need to be called whenever a new object is created, updated, or removed. They need to know what changed in the case of an update.

=== Updating normalized data, trigger related updates ===

==== Limit ourselves to just user-to-user conversations; what are the actions? ====

Render user board:
* Find most recent header content attached to the user's discussion workflow
* Find all workflows the user is subscribed to, sorted by last updated date, along with the timestamp of the most recently read items of each workflow
** Somehow the user needs to be able to subscribe to a 'discussion' object, but not have the actual discussion object listed in their board. The subscription to the discussion will be used for subscribing the user to new topics on that discussion.
* Find post tree/content for all topics we are rendering to the board

Actions from user board:
* Mark All Read
* Some sort of mark specific-topics read via api?

Actions on a topic:
* Need to be able to query a 'Topic History' of date/author/update header
* (Un)subscribe to topic
* Close topic
* Delete topic
* Mark All Read
* Change topic title
** Should topic titles be stored as a header, so they have revision info and all?

Actions on a post:
* Revision history list
* Edit (new revision)
* Suppression (adjust flags on existing revision)

==== What indexes against what denormalized domain models will answer those questions? ====

Flow Object - Gets wrapped into a Topic block

==== Stuff ====

Post is created on a topic:
* topic (flow object) needs a new last updated timestamp
* discussion (parent flow object) also needs a new last updated timestamp?
* subscriptions need updating, in the background
* post needs to be cached into its revision list (one revision, just itself)
* post needs to be cached into its topic
* something about tagging?

----

 $parent = $postStorage->get( $parentPostId );
 $post = $parent->reply( $wgUser, $content );
 // probably more stuff
 $postStorage->put( $post );

----

Ideally, we want simple.

A revision list:
* To support composite keys, must always use a nested array of arrays - array( array( rev_id ), array( rev_id ) )
* A new revision gets added; the list is sorted by id (time uuid) from newest to oldest and trimmed to a maximum length.

A flow object is uniquely indexed; for consistency, still hold it as an array. Merge the new timestamp into the existing state.
* array( array( 'obj_id' => 123, 'obj_model_id' => 42, ... ) )
* metadata is denormalized into the object as 'obj_metadata' => array( ... )

Another thing to think about: not every denormalization will be saveable. Likely only a specific subset of the data should be savable; the rest is merely updated via lifecycle events from the saveable models. In the goal of simple: some indexes are strictly denormalizations, updated only through those lifecycle events.
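To make the append/sort/trim mechanic concrete, a sketch (the function, key handling, and rev_id comparison are assumptions; a real version would use merge() rather than get/set so concurrent writers don't race each other):

 // Sketch: append a row to a sorted, trimmed index list in memcache.
 function updateIndexList( $cache, $key, array $newRow, $maxLength = 20 ) {
 	$list = $cache->get( $key );
 	if ( $list === false ) {
 		$list = array(); // real code would fall back to the backing store
 	}
 	$list[] = $newRow; // always a nested array of arrays, composite-safe
 	// time uuids are monotonic, so a reversed id comparison sorts newest first
 	usort( $list, function ( $a, $b ) {
 		return strcmp( $b['rev_id'], $a['rev_id'] );
 	} );
 	$cache->set( $key, array_slice( $list, 0, $maxLength ) );
 }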
Each denormalization gets its own model class to populate, and its own object manager to manage storage interaction for it.

 // Post manager always contains the most recent denormalized post information
 $postManager = ...;
 
 // Store a post
 $postManager->put( $post );
 
 // Retrieve a single post (not very useful)
 $postManager->get( $postId );
 
 // Batch retrieve posts
 $postManager->batchGet( array( $postA, $postB, $postC ) );
 
 // List of posts by user, sorted by most recently created.
 // find() only works against indexes. Ordering is pre-set by the index.
 // It will query storage if offset/limit isn't answerable from the index.
 // The index itself will be shallow, fetching a list of ids and then
 // running $this->batchGet( $ids );
 $postManager->find(
 	array( 'user_id' => 9981234 ),
 	array( 'limit' => 100, 'offset' => 100 )
 );
 
 // Revisions of a post. Is specific to a revision, as opposed to post, which
 // is first+last revision combined.
 $postRevListManager = ...;
 $list = $postRevListManager->find( array( 'post_id' => $postId ) );
 
 // Instances of a workflow on a specific article
 $flowObjectManager = ...;
 $obj = $flowObjectManager->get( $objId );
 
 // Find all workflows of a specific definition on the same article
 $list = $flowObjectManager->find( array(
 	'wiki' => 'enwiki',
 	'ns' => NS_MAIN,
 	'title' => 'Ada_Lovelace',
 	'definition' => 12,
 ) );
 
 // Surely we need other queries? Think about it.
 
 $userSubListManager = ...;
 $list = $userSubListManager->get( $userId );
 
 $topicManager = ...;
 $topic = $topicManager->get( $topicObjId ); // same id as its backing flow object
 
 $topicListManager = ...;
 $list = $topicListManager->get( $objectId ); // same id as its backing flow object
 
 $headerManager = ...;
 $header = $headerManager->get( $objectId ); // same id as its backing flow object

How do they interact?

 post -> flow object -> update timestamp -> user sub list -> background job -> add post id, sort+unique topic id? -> topic

I mean, how do they know what to do when triggered? All object managers have an array of 'LifecycleHandler' classes. The manager, after creating a post through its storage class, will call onPostCreate on each one.

flow object:
* Needs to know the timestamp.
* Does a timestamp update write through to the db? Seems wasteful.
* How bad is the query to find the timestamp?
** SELECT MAX(id) FROM flow_post_revision JOIN flow_tree_node ON node_ancestor = :topicId WHERE fpr_post_id = :postId
** But only for topics, and maybe topic lists like the discussion workflow (another join...)
** SELECT MAX(id) FROM flow_header_revision WHERE fpr_object_id = :objectId
** But only for workflow types that define headers.
** Whatever else... basically it's completely arbitrary, depending on the blocks involved.
* Is updating the active objects with a timestamp on every update too much?
** Can only be done in cache; could use redis for some more resilience.
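If the timestamp only lives in cache, the handler could look roughly like this, following the single-argument merge() convention used in the sample index later in this page (the handler name, key scheme, and getTopicId()/getTimestamp() accessors are all assumptions):

 // Sketch: push a new last-updated timestamp into the cached flow object
 // on post creation, with no db write-through.
 class LastUpdatedHandler implements LifecycleHandler {
 	protected $cache;
 
 	public function __construct( BagOStuff $cache ) {
 		$this->cache = $cache;
 	}
 
 	public function onPostCreate( ObjectManager $om, $post ) {
 		$key = wfMemcKey( 'flow', 'object', $post->getTopicId() );
 		$this->cache->merge( $key, function ( $value ) use ( $post ) {
 			if ( $value === false ) {
 				// nothing cached; the next full load computes the timestamp
 				return false;
 			}
 			// merge the new timestamp into the existing single-element state
 			$value[0]['object_last_update_timestamp'] = $post->getTimestamp();
 			return $value;
 		} );
 	}
 
 	public function onPostUpdate( ObjectManager $om, $object ) {}
 	public function onPostLoad( ObjectManager $om, $object ) {}
 	public function onPostRemove( ObjectManager $om, $object ) {}
 }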
== Possible Interface ==

An ObjectManager is created per class and ties together a few pieces, doing no work itself:

* ObjectManager
** get( $primaryKey ) -- batch from get(), or a new batchGet() method?
** put( $object ) -- similar batching concerns
** remove( $object ) -- no remove by PK, because we need to run triggers
** find( array $attrs ) -- only works against pre-specified indexes
** possibly others working with the indexes
* Mapper - Translates back and forth between storage results and value objects
** serialize
** unserialize?
* LifecycleHandler - Interface for implementing indexes (could have pre-events as well)
** onPostCreate
** onPostUpdate
** onPostLoad
** onPostRemove
* Storage
** put( $serialized )
** get( $primaryKey ) - should also work for an array of primary keys, composite keys, and an array of composite keys
** remove( $primaryKey )

Also, in the future it may be useful to support batching between multiple ObjectManager instances, but that is more complex and should be implemented at a time when it appears useful.

== Sample Index Definition ==

A few types of basic, generic indexes:

* Unique Index -- one element per value
* Global List -- one list of all objects of the type, sorted and limited in length
* Secondary Index -- list of objects of the type, split by the first field, sorted and limited in length

Implementations should feel free to also implement type-specific indexes where appropriate.

 // read/write
 $definitionIndexes = array(
 	new UniqueIndex( 'model_id' ),
 	new UniqueIndex( 'model_name' ),
 );
 
 // read/write
 $workflowIndexes = array(
 	new UniqueIndex( 'object_id' ),
 	new SecondaryIndex( array( 'object_wiki', 'object_namespace', 'object_title_text' ) ),
 	// also need something to trigger subscription updates when last update time changes
 	new GlobalList( 'object_last_update_timestamp', array(
 		'order' => 'desc', // sort from newest to oldest
 		'limit' => 1000,   // only store the last 1000 updates
 	) ),
 );
 
 // read only?
 $postRevisionIndexes = array(
 	new UniqueIndex( 'rev_id' ),
 	new SecondaryIndex( 'descendant_id', array(
 		'sort' => array( 'PostRevision', 'sortNewestToOldest' ),
 		'limit' => 20,
 	) ),
 );
 
 // read/write
 $headerRevisionIndexes = array(
 	new UniqueIndex( 'rev_id' ),
 	new SecondaryIndex( 'fsr_object_id', array(
 		'sort' => array( 'HeaderRevision', 'sortNewestToOldest' ),
 		'limit' => 20,
 	) ),
 );

== ObjectManager ==

To manage an index we need a few different events, and certain kinds of data available. Specifically, we need to know when something is created, updated, or deleted. We also need to know the indexed values as they were when the object was loaded, and the current values that were saved.

The same object manager class will likely be used for most use cases. Keep it simple. It doesn't exist to be anything fancy; it just provides consistent lifecycle events and error handling across data storage. On top of the suggestion below, it needs to accept arrays of objects/keys for batch operations. Composite keys could be supported, but are outside the initial implementation. All exceptions inside the persistence layer should be caught and rethrown as PersistenceException from the ObjectManager for consistent error handling.

 class ObjectManager {
 	...
 
 	public function get( $id ) {
 		try {
 			$data = $this->storage->get( $id );
 			if ( !$data ) {
 				return null;
 			}
 			$object = $this->mapper->unserialize( $data );
 			foreach ( $this->lifecycleHandlers as $handler ) {
 				$handler->onPostLoad( $this, $object );
 			}
 			return $object;
 		} catch ( \MWException $e ) {
 			...
 		}
 	}
 
 	public function put( $object ) {
 		try {
 			foreach ( $this->lifecycleHandlers as $handler ) {
 				$handler->onPrePersist( $this, $object );
 			}
 			$data = $this->mapper->serialize( $object );
 			$this->storage->put( $data );
 		} catch ( \MWException $e ) {
 			...
 		}
 	}
 
 	...
 }
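Wiring one manager out of those pieces might look like this (constructor order and $dbFactory are assumptions, and whether indexes receive the cache at construction or per-operation is still open; GenericMapper is described in the next section):

 // Sketch: one ObjectManager per model, built from the pieces above.
 $postRevManager = new ObjectManager(
 	new GenericMapper(
 		array( 'PostRevision', 'toStorageRow' ),
 		array( 'PostRevision', 'loadFromRow' )
 	),
 	new PostRevisionStorage( $dbFactory ),
 	array( // the lifecycle handlers, i.e. the indexes
 		new UniqueIndex( 'rev_id' ),
 		new SecondaryIndex( 'descendant_id', array( 'limit' => 20 ) ),
 	)
 );
 
 $rev = $postRevManager->get( $revId );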
== Mapper ==

Mapper is two simple functions, serialize and unserialize. serialize() receives a domain model (a PHP object) and returns an array for the storage layer to persist. unserialize() receives an array from the storage layer and returns a domain model. Some mappers may be written specifically for their value objects. A generic mapper will also exist for use with POPOs (direct property access), but anything with getters/setters will likely need a custom mapper. To make custom mappers as easy as possible, there should be a delegating mapper that calls two callbacks.

 $mapper = new GenericMapper(
 	array( 'PostRevision', 'toStorageRow' ),
 	array( 'PostRevision', 'loadFromRow' )
 );
 
 class GenericMapper {
 	public function __construct( callable $serialize, callable $unserialize ) {
 		$this->serialize = $serialize;
 		$this->unserialize = $unserialize;
 	}
 
 	public function serialize( $obj ) {
 		return call_user_func( $this->serialize, $obj );
 	}
 
 	public function unserialize( $row ) {
 		return call_user_func( $this->unserialize, $row );
 	}
 }

== Storage ==

The storage class handles interaction with the database. This is where denormalization of joined tables should happen. For example, a flow object also has metadata in a second table. The storage layer should query both and return them as a single piece of data to be cached. It will also receive both pieces of data and must decide what tables to update with what information.

== LifecycleHandler ==

The indexes will implement the LifecycleHandler interface to receive add/update/delete events. The indexes will implement some method so the ObjectManager knows what queries each index can answer, and another method to actually answer those queries.

When the index does not have the data in memcached it must fall back to a backing store. This could be implemented such that the backing store is another index, or an SQL storage class. A one-function interface should do. The use of another index as a backing store is an optimization. Managing the relationships between indexes, such that read-only indexes stay current, falls out of the same mechanism: the writable indexes' create/update/remove events feed the read-only ones.

How does the index get access to memcache and the database? Stuffing more into the constructor doesn't seem like the best idea. Perhaps the ObjectManager should hold references to db/cache, and pass them into the index when performing operations.

A sample index:

 class SecondaryIndex implements LifecycleHandler, Index {
 	public function __construct( BagOStuff $cache, array $attributes, array $options = array() ) {
 		$this->cache = $cache;
 		$this->attributes = $attributes;
 		$this->options = $options + array(
 			// sensible defaults
 		);
 	}
 
 	public function onPostPersist( ObjectManager $om, $object, $overwrite = true ) {
 		list( $_, $newAttrs ) = $om->collectOldAndNewAttributes( $object, $this->attributes );
 		if ( $this->invalidCacheKey( $newAttrs ) ) {
 			return;
 		}
 		$this->cache->merge(
 			$this->cacheKey( $newAttrs ),
 			function ( $value ) use ( $om, $object, $overwrite ) {
 				if ( $value !== false && $overwrite === false ) {
 					return false;
 				} elseif ( $value === false ) {
 					// No value in memcache ... determine a method to decide if it's ok to create
 				}
 				$objectToAdd = $om->serialize( $object );
 				if ( in_array( $objectToAdd, $value, true ) ) {
 					return false;
 				}
 				$value[] = $objectToAdd;
 				usort( $value, new SortArrayByAttributes( $this->attributes ) );
 				return $value;
 			}
 		);
 	}
 
 	public function onPostUpdate( $object ) {
 		// doh ... need the original value
 	}
 }
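That "need the original value" problem is what collectOldAndNewAttributes() is for. One simple-but-blunt sketch of onPostUpdate, assuming the ObjectManager remembers the attribute values it saw at onPostLoad time:

 // Sketch, inside SecondaryIndex: if the indexed values moved, drop the old
 // list and let it rebuild from the backing store on the next read.
 public function onPostUpdate( ObjectManager $om, $object ) {
 	list( $oldAttrs, $newAttrs ) = $om->collectOldAndNewAttributes( $object, $this->attributes );
 	if ( $oldAttrs !== $newAttrs ) {
 		$this->cache->delete( $this->cacheKey( $oldAttrs ) );
 	}
 	// Write the current row into the list it now belongs to. If only
 	// non-indexed fields changed, replacing the stale row already sitting
 	// in that same list is still an open question.
 	$this->onPostPersist( $om, $object );
 }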
=== Dependent Models, i.e. denormalization ===

Backend storage for a post:
* text - points to content via ExternalStore
* revision - first revision holds username and creation time
* revision - most recent revision holds pointer to the text table / lock state / flags
* tree_node - contains the post's position within the topic tree

Denormalization goal (all references to update/store/etc. below are in regard to memcache):

Make new post:
* store a single Post object containing content + first rev + last rev
* update trees along the root path with the new post
* create a revision list with a single entry

New revision of a post:
* update the post object with the newest last rev
* append to the revision list, truncate to the maximum cache size

==== How to implement that in a declarative manner? ====

This isn't really declarative... but a first attempt:

 class Post {
 	public static function newFromRow( $id, $firstRev, $lastRev, $content ) { ... }
 }
 
 // Only concerned with writing to the database; indexing is external
 class PostStorage {
 	public function __construct(
 		TreeStorage $treeStorage,
 		PostRevisionStorage $revStorage,
 		TextStorage $textStorage
 	) {
 		...
 	}
 
 	public function get( $id ) {
 		$revisions = $this->revStorage->find( array( 'descendant_id' => $id ) );
 		return array(
 			'first' => reset( $revisions ),
 			'last' => end( $revisions ),
 			'content' => $this->textStorage->get( end( $revisions )->getTextId() ),
 		);
 	}
 
 	public function put( Post $post ) {
 		if ( !$post->getId() ) {
 			$id = UIDGenerator::newTimestampedUID88();
 			$post->setId( $id );
 			$this->treeStorage->insert( $id, $post->getParentId() );
 			$rev = $this->createFirstRevision( $post );
 		} else {
 			$rev = $this->createRevision( $post );
 		}
 		$this->revStorage->put( $rev );
 		$post->setLastRevisionId( $rev->getId() );
 	}
 }
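Putting it together with the reply snippet from earlier, a hypothetical wiring ($postMapper and $postIndexes stand in for pieces built as in the previous sections):

 // Sketch: the reply flow run through an ObjectManager wrapping PostStorage,
 // so the index handlers fire on put().
 $postManager = new ObjectManager(
 	$postMapper,
 	new PostStorage( $treeStorage, $revStorage, $textStorage ),
 	$postIndexes
 );
 
 $parent = $postManager->get( $parentPostId );
 $post = $parent->reply( $wgUser, $content );
 $postManager->put( $post ); // topic timestamp, revision list, and subscriptions all update via handlers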