Data classification and the need for ontology…

How many organizations are struggling with data classification and with building an ILM (Information Lifecycle Management) strategy? In the storage industry today we often talk about the differences between ILM and HSM (Hierarchical Storage Management), but does taxonomy actually provide enough to realize true ILM? Rather than taxonomy, we should be addressing ontology. Ontology, as defined in computer science, is a data model that represents a domain and is used to reason about the objects in that domain and the relations between them. The categorization systems in use today were designed to optimize linear seek time, not to categorize the intellectual aspects of information. Classification and categorization techniques, while presented as organizing information, are actually categorizing the physical objects that contain the ideas or information. The industry is attempting to extend these traditional methods by using tags, which create metadata that tries to describe the ideas and information inside the containers. More intelligent categorization methods apply lexicons in an attempt to automate the generation of that metadata. I find it odd that we, as an industry obsessed with data classification and lifecycle management, do not address ontology and our approach to it on a daily basis.

An ontology would need to consider owners, users, participants, the openness of the domain, the potential for the control set to be altered, and signal loss. The storage industry has avoided true ontology because the undertaking is massive. But until an ontological method for classification and categorization is developed, can we ever really achieve true ILM?

A thesaurus of terms, words or tags is an absolute requirement for true ILM. A canonical example: imagine one user searches the web (the largest known corpus of data) for “Movie” and another user searches a repository for “Cinema”. Would the results be the same? If the search runs against a full-text index, the answer is most likely no. The reliance on tags to categorize a document using multiple words or terms makes it difficult to enforce and deliver true plug-and-play categorization and ILM. We also have to consider signal loss: imagine a search of the web for “… Politics” and “… Agenda”. While the two might appear to be synonymous, they may or may not be.
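To make the “Movie” versus “Cinema” example concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the three sample documents, the tiny THESAURUS table, and the function names are hypothetical and do not describe any real product or index. It simply contrasts a literal full-text match with a lookup that first maps the search term to a shared concept.

```python
# Hypothetical thesaurus: each known term maps to one canonical concept.
THESAURUS = {
    "movie": "film",
    "cinema": "film",
    "film": "film",
}

# Hypothetical repository contents, keyed by document id.
DOCUMENTS = {
    "doc1": "A review of a classic movie from the 1940s",
    "doc2": "The history of French cinema",
    "doc3": "Quarterly storage capacity report",
}


def tokens(text):
    """Split text into a set of lowercase words."""
    return set(text.lower().split())


def full_text_search(term, documents):
    """Return ids of documents that contain the literal term."""
    term = term.lower()
    return sorted(
        doc_id for doc_id, text in documents.items() if term in tokens(text)
    )


def concept_search(term, documents, thesaurus):
    """Return ids of documents that contain any term sharing the concept."""
    term = term.lower()
    # Unknown terms fall back to themselves as their own concept.
    concept = thesaurus.get(term, term)
    synonyms = {t for t, c in thesaurus.items() if c == concept} | {concept}
    return sorted(
        doc_id for doc_id, text in documents.items() if tokens(text) & synonyms
    )


if __name__ == "__main__":
    print(full_text_search("Movie", DOCUMENTS))
    print(full_text_search("Cinema", DOCUMENTS))
    print(concept_search("Movie", DOCUMENTS, THESAURUS))
    print(concept_search("Cinema", DOCUMENTS, THESAURUS))
```

With the literal match, the two searches return different documents; with the concept lookup, they return the same set. Note what the sketch cannot do: a flat synonym table either treats two terms as the same or it does not, so it has no way to express terms that are only sometimes synonymous. That is the signal loss problem, and it is where a thesaurus alone stops and an ontology would have to take over.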

This is a complex problem that is not easily addressed, but I believe there is a definite long-term requirement for a transition to an ontological approach.

