The phrase ‘unstructured data’ has been around for some time and is typically applied to text, images and video. The complementary phrase, ‘structured data’, has become synonymous with relational data. If we think about how much information is contained in some typical sources of data, it would be something like this:
Simple tables are where I started my career – most data for an application was stored in tables, not necessarily normalised. Where there was related data we had to hand-code the joins! Relational data needs no introduction. Graph is interesting as it seems to be a way of making some ‘structure’ from what was formerly thought of as ‘unstructured’, especially when applied to text. The remainder then contain increasing amounts of information – natural language, images and video have far more complexity than relational data but have posed a problem for computers to process. The recent explosion in AI capabilities (thanks to Moore’s law) has started to unlock the value of this harder-to-extract data.
I would argue that these phrases are misleading and that ‘unstructured’ sends the wrong message to non-technical colleagues or clients. The Wikipedia entry for unstructured data cites a study concluding that around 90% of a corporation’s data is ‘unstructured’. As IT professionals we should be encouraging more desire to exploit this data. I would like a better term for it, so how about ‘Rich Data’? And what should we then call ‘structured’ data? I did think of ‘Simple’, but that makes it sound as if it should be cheaper to deal with than it actually is. So, to conclude, how about ‘Structured’ and ‘Rich’, or is there a better term out there?
I recently completed an excellent book which examines how to deal with information presented as text. It’s called Taming Text, from Manning. The authors do a good job of introducing each topic and explaining how a number of open source tools can be applied to the problems each topic presents. I’ve not studied the tools, but the topic introductions are great.
I have summarised each topic below:
- It’s hard to get an algorithm to understand text in the way humans can. Language is complex and an area of much academic study. Text is everywhere and contains plenty of potentially useful information.
- The first step in dealing with text is to break it down into parts. The simplest aim of this step is to extract individual words, but there are a number of approaches and the more sophisticated ones will need to handle punctuation. The process of splitting text down is called tokenisation. Individual words may then be put through a stemming algorithm so that plurals and different tenses of the same word can be equated via a common stem. A stem might be a recognisable word, but not necessarily.
- In order to search content it must first be indexed, which will require tokenisation and stemming and maybe also stop-word removal and synonym expansion. It is also useful, for subsequent ranking, to use an index that allows the distance between the words of the search phrase to be calculated for each document in which they are found. There are a number of algorithmic approaches for ranking results, the simplest of which are based on the vector space model. Ranking is an evolving area and the big internet search engines are constantly refining it. Another refinement that can be applied to search is the key constituent of spell-checking: fuzzy matching.
- Fuzzy matching is another area of academic research with some established algorithms based on character overlap, edit distance and n-gram edit distance, which may all be combined with prefix matching using a trie (prefix tree). The most important aspect of fuzzy matching to understand is that different algorithms will be more or less effective depending on the sort of information being matched. For example, movie titles are best matched on Jaro-Winkler distance, but movie actors are best matched with a more exact algorithm given that their names are used like brand names.
- It can be useful to be able to extract people, places and things (including monetary amounts, dates, etc.). Again there are a number of algorithms for achieving this, including open source implementations from the OpenNLP project. Machine learning can play a part provided there are plenty of tagged (training) examples available, which is especially useful where domain-specific text needs to be ‘understood’.
- Given a large set of documents, there is often a requirement to group similar documents. This process is called clustering and can be observed in operation on news aggregation sites. Note that clustering does not assign meaning to each cluster. There are a number of established algorithms, many of which are shared with other clustering problems. Given the large volumes and the complexity of the algorithms, a real-world clustering task is quite likely to want to make use of parallel processing, and this is what the Carrot and Apache Mahout projects provide by building on top of Apache Hadoop.
- Another activity for sets of documents is classification, which is similar to clustering but starts with sets of documents that have already been assigned to a pre-determined category by a human or other mechanism, for example by asking users to tag articles. Example classification tasks are sentiment analysis or rating reviews as positive or negative. Of course there are a number of algorithms and implementations to choose from, each with trade-offs in accuracy and performance.
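
To make the tokenisation, stemming, indexing and ranking steps summarised above a little more concrete, here is a minimal Python sketch. It is not taken from the book or from any of the tools it covers. The function names, the tiny stop-word list and the crude suffix-stripping "stemmer" are all assumptions made for illustration, and the ranking is one simple TF-IDF variant of the vector space model rather than what a real search engine does.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "is", "are", "to"}


def tokenise(text):
    """Lower-case the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crude_stem(word):
    """Strip a few common English suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyse(text):
    """Tokenise, drop stop words, then stem what is left."""
    return [crude_stem(t) for t in tokenise(text) if t not in STOP_WORDS]


def build_index(docs):
    """Return per-document term counts plus an inverse document frequency table."""
    term_counts = [Counter(analyse(d)) for d in docs]
    doc_freq = Counter(term for counts in term_counts for term in counts)
    idf = {t: math.log(len(docs) / df) + 1.0 for t, df in doc_freq.items()}
    return term_counts, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors held as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def search(query, docs):
    """Return indices of matching documents, highest cosine score first."""
    term_counts, idf = build_index(docs)

    def weigh(counts):
        # Terms the index has never seen cannot contribute to the score.
        return {t: c * idf[t] for t, c in counts.items() if t in idf}

    q = weigh(Counter(analyse(query)))
    scored = [(cosine(q, weigh(c)), i) for i, c in enumerate(term_counts)]
    return [i for score, i in sorted(scored, reverse=True) if score > 0]
```

Note how the stems need not be real words: with this toy stemmer both "chasing" and "chased" reduce to "chas", which is enough for a query containing one form to match a document containing the other.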
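
Two of the fuzzy matching ideas mentioned above, edit distance and n-gram overlap, are small enough to sketch directly. The code below is an illustrative assumption rather than anything from the book: it uses plain Levenshtein distance and a Jaccard overlap of character bigrams, and the `suggest` helper is a hypothetical name for a naive spell-check style lookup. Jaro-Winkler and trie-based prefix matching are left out to keep it short.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character inserts, deletes or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute, or keep if equal
                )
            )
        previous = current
    return previous[-1]


def ngrams(text, n=2):
    """Set of character n-grams, padded so word boundaries count too."""
    padded = " " * (n - 1) + text.lower() + " " * (n - 1)
    return {padded[i : i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=2):
    """Jaccard overlap of the two n-gram sets, between 0.0 and 1.0."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0


def suggest(word, vocabulary, max_distance=2):
    """Return vocabulary entries within max_distance edits, closest first."""
    scored = [(edit_distance(word, v), v) for v in vocabulary]
    return [v for d, v in sorted(scored) if d <= max_distance]
```

Comparing every query against the whole vocabulary like this only works for small lists, which is where the prefix tree mentioned above would come in for anything larger.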
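
Clustering can be illustrated with something far simpler than the algorithms a project like Mahout provides. The sketch below is an assumption chosen for brevity: a single-pass "leader" clustering in which each document joins the first cluster whose first member shares enough words with it, measured by Jaccard similarity. The function names and the 0.3 threshold are hypothetical, and nothing here runs in parallel.

```python
import re


def word_set(text):
    """The distinct lower-cased words in a document."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Shared words divided by total distinct words across both sets."""
    return len(a & b) / len(a | b) if a | b else 0.0


def leader_cluster(docs, threshold=0.3):
    """Group document indices; the first member of each cluster acts as its leader."""
    clusters = []  # list of (leader_words, [document indices])
    for i, doc in enumerate(docs):
        words = word_set(doc)
        for leader_words, members in clusters:
            if jaccard(words, leader_words) >= threshold:
                members.append(i)
                break
        else:
            # No existing leader was similar enough, so start a new cluster.
            clusters.append((words, [i]))
    return [members for _, members in clusters]
```

The output is just groups of document indices, which reflects the point made above: the algorithm decides which documents belong together but assigns no meaning or label to each group.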
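
Classification, by contrast, needs labelled examples up front. A common baseline for tasks like rating reviews as positive or negative is a multinomial naive Bayes classifier, sketched below with add-one smoothing. This is an assumed example rather than the approach any particular tool takes, and the class and method names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenise(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of training documents
        self.vocabulary = set()

    def train(self, labelled_docs):
        """labelled_docs is an iterable of (text, label) pairs."""
        for text, label in labelled_docs:
            words = tokenise(text)
            self.word_counts[label].update(words)
            self.doc_counts[label] += 1
            self.vocabulary.update(words)

    def classify(self, text):
        """Return the label with the highest log-probability, or None if untrained."""
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)  # prior for this label
            for word in tokenise(text):
                if word in self.vocabulary:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

To use it, call `train` with a handful of tagged reviews and then `classify` on new text. With only a few training examples the results are fragile, which is the accuracy trade-off mentioned above in miniature.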