
tsunami: embedded data types

Luke Breuer
2010-01-08 08:44 UTC

I hope that tsunami will at some point be able to manage* any kind of data. The idea is to have a single portal, from which you can access any information, launch any application, etc. In the beginning, I plan to support the following specialized data types:
  • hyperlinks
  • quotations
  • code snippets (including syntax highlighting)
  • todo items

Note that some of the discussion below is a result of collaboration with others.

* tsunami won't necessarily contain all the data it manages, except perhaps for metadata and conversations about items
major objectives
  • enable searching, grouping, sorting, and displaying items in ways unique to their data type
  • combine data types intelligently, perhaps into composite data types
    • when there is overlap between types, allow treating them the same in certain ways
  • allow linking data types (for example link an image to an article, and also to a hyperlink to the original page)
    • using triples like RDF: subject, predicate, object (but allow simple, dumb links as well)
      (relationships are items (a special data type) in their own right, in that they can be commented upon, and even related to each other (yes, this gets quite meta))
  • allow related objects to be "infected" (for lack of a better word) with data from their peers. For example, if a hyperlink is tagged, perhaps all the linked objects also become associated with that tag, although less strongly.
    • how to deal with tags is an interesting discussion in and of itself; tags are really just a simple linkage system, especially if you have one particular item that defines the tag
    • when looking at a given item, it should be easy to see what is connected and traverse those connections fluidly
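
To make the linking and "infection" ideas above a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the names Item, Relationship, and infect, the numeric tag strengths, and the decay value of 0.5. It only spreads a tag one hop, to directly linked peers.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """Any tsunami item (hypothetical shape)."""
    id: int
    kind: str
    tags: dict = field(default_factory=dict)  # tag name -> strength in [0, 1]


@dataclass
class Relationship(Item):
    """A link is itself an item, so it can be commented on or linked to."""
    subject: int = 0
    predicate: str = "related-to"  # a simple, dumb link just keeps this default
    object: int = 0


def infect(items, relationships, source_id, tag, decay=0.5):
    """Tag the source at full strength, then tag direct peers more weakly."""
    items[source_id].tags[tag] = 1.0
    for rel in relationships:
        if source_id in (rel.subject, rel.object):
            peer = rel.object if rel.subject == source_id else rel.subject
            current = items[peer].tags.get(tag, 0.0)
            # Never weaken a tag the peer already carries more strongly.
            items[peer].tags[tag] = max(current, decay)


items = {1: Item(1, "hyperlink"), 2: Item(2, "image"), 3: Item(3, "article")}
links = [Relationship(10, "relationship", subject=2, predicate="found-at", object=1)]
infect(items, links, 1, "physics")
```

After the call, the hyperlink carries "physics" at strength 1.0, the linked image at 0.5, and the unlinked article is untouched. Whether strength should be a number at all is an open question.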
The first data type would be the hyperlink. First, we want an easy way to add hyperlinks to tsunami, ideally via a Firefox plugin, with plugins for other browsers at some point. The first model will probably be something akin to the myriad bookmark managers in existence. Something like digg.com might eventually be useful, but not immediately.
We will need tsunami to be able to search only hyperlinks, maybe via type:url as a new operator for the search box (tag: is the only one for now). We want specialized search functionality for hyperlinks, probably much like what Google offers, with the ability to restrict by:
  • domain
  • title
  • url
  • any modified date that can be detected
  • etc.
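
As a rough sketch of how the search box might split operators from free text, consider the following Python. Only tag: exists today and type:url is the proposal above; the domain:, title:, and url: operator names, and the parse_query function itself, are assumptions for illustration.

```python
import re

# tag: is the only real operator so far; the rest are hypothetical.
KNOWN_OPERATORS = {"tag", "type", "domain", "title", "url"}
OPERATOR_TOKEN = re.compile(r"(\w+):(\S+)")


def parse_query(query):
    """Split a search string into {operator: [values]} and plain terms."""
    operators, terms = {}, []
    for token in query.split():
        match = OPERATOR_TOKEN.fullmatch(token)
        if match and match.group(1) in KNOWN_OPERATORS:
            operators.setdefault(match.group(1), []).append(match.group(2))
        else:
            # Unknown prefixes (such as a pasted http://... URL) stay plain terms.
            terms.append(token)
    return operators, terms


operators, terms = parse_query("type:url domain:example.com quantum foam")
```

Here operators would hold the type and domain restrictions and terms would hold the two remaining words.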
Whether or not we store hyperlinks in their own table is up for debate. We will want the ability to write blobs of text about a hyperlink. This could argue for hyperlinks simply being a regular tsunami item with something special, like:

Except XML is not a very good serialization format. However, that's irrelevant for now.

Arguments for storage in a separate DB table include:
  • performance (at least cached versions would be useful)

I suppose I could store any XML that shows up at the front of the document in a separate DB column, and then utilize XML indexes... (this would definitely tie us to MSSQL or Oracle, though)
provider model
Tsunami needs to support data types in a scalable model. The best option is probably to pursue a provider model. The provider would be responsible for:
  • detecting and parsing its data type from raw text
  • providing search functionality
    • sometimes this might be just by adding additional AND predicates to the WHERE clause
    • it will also be desirable to add expressions to the ORDER BY clause
    • the provider will likely add operators (like intitle:)
  • specialized rendering
    • syntax highlighting is an example

It will probably be desirable to have providers even for raw text. This way, people can use their own markup languages.
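
One way the provider responsibilities listed above could look in code is sketched below in Python. This is only a sketch under assumptions: the Provider interface, its method names, the title column, and the (SQL fragment, parameter) tuple shape are all made up for illustration, not a settled design.

```python
import html
import re
from abc import ABC, abstractmethod


class Provider(ABC):
    operators = ()  # search operators this provider contributes

    @abstractmethod
    def parse(self, raw):
        """Return parsed data if raw text is this provider's type, else None."""

    def where_predicates(self, operator, value):
        """Extra AND predicates as (sql_fragment, parameter) pairs."""
        return []

    def order_by(self):
        """Extra ORDER BY expressions."""
        return []

    def render(self, raw):
        return html.escape(raw)


class HyperlinkProvider(Provider):
    operators = ("intitle",)
    _URL = re.compile(r"https?://\S+")

    def parse(self, raw):
        match = self._URL.search(raw)
        return {"url": match.group(0)} if match else None

    def where_predicates(self, operator, value):
        if operator == "intitle":
            return [("title LIKE ?", f"%{value}%")]  # column name is assumed
        return []

    def render(self, raw):
        # Escape first, then wrap anything that still looks like a URL.
        link = lambda m: f'<a href="{m.group(0)}">{m.group(0)}</a>'
        return self._URL.sub(link, html.escape(raw))


class RawTextProvider(Provider):
    def parse(self, raw):
        return {"text": raw}  # raw text always matches, so list it last


def detect(raw, providers):
    """Return the first provider that recognizes the text, with its parse."""
    for provider in providers:
        parsed = provider.parse(raw)
        if parsed is not None:
            return provider, parsed
    return None, None
```

Putting the raw text provider at the end of the list gives a fallback, and swapping it out is where someone's own markup language would plug in.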
It makes sense to have quotations be their own data type because they have their own set of attributes:
  • source
  • time
  • author

The source is particularly important in the case of symbolic quoting: at a minimum, the article, revision #, and character indexes will be needed. Alternatively, if the source is a URL, we have data type composition right there!
We will need to be able to cite all sorts of different kinds of sources.
If you were reading a tsunami article and stumbled across a particularly intriguing quotation, being able to find all other such quotations could be extremely useful.
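
A quotation with these attributes might be modeled roughly as follows. This Python sketch assumes its own names (SymbolicSource, Hyperlink, Quotation, resolve, by_author) and assumes character indexes are a half-open start/end range; none of that is decided.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Union


@dataclass(frozen=True)
class SymbolicSource:
    """Points into another article instead of copying its text."""
    article: str
    revision: int
    start: int  # character index, inclusive
    end: int    # character index, exclusive


@dataclass(frozen=True)
class Hyperlink:
    url: str


@dataclass(frozen=True)
class Quotation:
    # Composition: the source is either a symbolic reference or a hyperlink item.
    source: Union[SymbolicSource, Hyperlink]
    author: Optional[str] = None
    time: Optional[datetime] = None


def resolve(source, revisions):
    """Look up the quoted text; revisions maps (article, revision) to text."""
    return revisions[(source.article, source.revision)][source.start:source.end]


def by_author(quotations, author):
    """Find all quotations attributed to one author."""
    return [q for q in quotations if q.author == author]


revisions = {("intro", 3): "hello brave new world"}
quote = Quotation(SymbolicSource("intro", 3, 6, 15), author="someone")
quoted_text = resolve(quote.source, revisions)
```

Pinning the revision number is what keeps the character indexes meaningful after the article is edited.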
type composition/inheritance
Someone could create a special "IRC Quote" data type, since many quotes share the same formatting. The type would have special parsing abilities.

Syntax highlighting is a great example of why we need inheritance.
declaring types on the fly
I've toyed with the idea of just letting you use XML* to define data types on the fly; I'm not quite sure how that would work though.

* understand that I'm saying XML for lack of a better term; I know XML sucks for serializing data a lot of the time
    <element type="string" name="author"/>
    <element type etc>

Perhaps we have "generic" providers which can parse all data types of a given form.
We want to be able to lazily make a type (think expando, which allows adding arbitrary properties) as well as more rigorously define a type — both should be options.
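
The lazy and rigorous options could coexist along these lines. In this Python sketch the expando is just a SimpleNamespace, and define_type is a hypothetical helper that takes (name, type) pairs mirroring the element declarations above; the "string" and "int" type names are assumptions.

```python
from types import SimpleNamespace

# Lazy option: an expando-style object accepts any property, no declaration.
loose_quote = SimpleNamespace()
loose_quote.author = "someone"
loose_quote.mood = "grumpy"

# Rigorous option: declare the fields up front and validate against them.
FIELD_TYPES = {"string": str, "int": int}  # assumed type names


def define_type(name, elements):
    """Build a validator from (field_name, type_name) pairs."""
    schema = {field: FIELD_TYPES[type_name] for field, type_name in elements}

    def validate(values):
        errors = []
        for field in sorted(set(values) - set(schema)):
            errors.append(f"{name}: unknown field {field!r}")
        for field, expected in schema.items():
            if field not in values:
                errors.append(f"{name}: missing field {field!r}")
            elif not isinstance(values[field], expected):
                errors.append(f"{name}: field {field!r} should be {expected.__name__}")
        return errors

    return validate


validate_quotation = define_type("quotation", [("author", "string"), ("revision", "int")])
```

A lazily built item could later be checked with vars(loose_quote) against a declared type, which is one path from the loose form to the strict one.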
multiple formats
We could have multiple representations of the same data type, like a short-form inline quotation, a full-form quotation with timestamp, etc., all referring to the same data.
Providers might be triggered based on certain tags (for example, a better guess of the programming languages being used could be gleaned by looking at tags).
to-do items
It would probably be wise to be able to embed to-do items in various pieces of text. Then, we will want a way to view these items in aggregate. In fact, we will also want ways to edit them from the aggregate screen, and have those changes propagated back to the appropriate items.
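
Here is a small Python sketch of the collect-then-propagate idea. The "[ ]" and "[x]" marker syntax is purely an assumption (no syntax has been chosen), as are the collect and set_done names and the plain dict of item id to text.

```python
import re

# Assumed marker syntax: "[ ] open task" or "[x] finished task", to end of line.
TODO = re.compile(r"\[( |x)\] ([^\n]*)")


def collect(items):
    """Gather every embedded to-do across all items for the aggregate view."""
    found = []
    for item_id, text in items.items():
        for match in TODO.finditer(text):
            found.append({
                "item": item_id,
                "offset": match.start(),  # where the marker sits in the text
                "done": match.group(1) == "x",
                "text": match.group(2),
            })
    return found


def set_done(items, todo, done):
    """Write a change made in the aggregate view back into the source item."""
    text = items[todo["item"]]
    start = todo["offset"]
    marker = "[x]" if done else "[ ]"
    # Both markers are three characters, so other offsets stay valid.
    items[todo["item"]] = text[:start] + marker + text[start + 3:]


items = {
    1: "Plan:\n[ ] write provider\n[x] pick a name",
    2: "notes [ ] fix search",
}
todos = collect(items)
set_done(items, todos[0], True)
```

Editing the to-do's wording from the aggregate screen would shift offsets, so a real version would need something sturdier than character positions, perhaps the symbolic references discussed for quotations.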