In a recent post we touched upon some of the work Orchestr8 is doing with statistical language processing.
Statistical language processing applies machine-learning algorithms and data-mining techniques to large volumes of training material (a “corpus”) in order to build probabilistic models for processing natural-language text.
We use these techniques for tasks such as text categorization, topic/keyword extraction, language identification, sentiment analysis, and named entity recognition.
The key to these methods is having huge amounts of training data; the more data, the better.
The Internet provides easy access to vast quantities of data: news-wire articles, blog posts, SEC company releases, etc.
However, this data isn’t labeled; it’s just raw text. Unsupervised learning techniques (which rely on unlabeled data) aren’t as robust as supervised methods for what we’re doing, so we needed a mechanism for turning raw information into properly labeled training data.
To do this, we’ve constructed a “Corpus Generation / Model Re-training” pipeline. Our process combines automated tagging / labeling techniques with teams of human annotators to transform raw data (news, blogs, etc.) into “gold-standard” corpus text.
Here’s a simplified process-flow diagram that illustrates our pipeline:

Here’s what we’re doing, in more detail:
1. Automated web spiders (using AlchemyGrid’s content scraping facilities) constantly monitor various Internet content sources (news websites, blogs, financial websites, etc.), grabbing new articles and posts. These are stored in a database for subsequent processing.
2. New articles and posts are automatically labeled and tagged using our most current statistical language models. This automatic labeling does the majority of the “work” associated with generating new “gold-standard” corpus data; humans need only make minor corrections to its output.
3. The automatically labeled texts are then sent to a randomly selected group of human annotators (between 2 and N individuals). These annotators manually re-check the previously tagged text, correcting any mistakes and adding missing labels.
4. The human-verified texts are then compared to determine the level of “disagreement” between the human annotators. If the level of disagreement is above our pre-defined threshold, the document is reassigned to a *new* randomly selected group of human annotators. Otherwise, the annotations are merged into a “gold-standard” document and recorded in our database of annotated texts.
5. On a periodic basis, all “gold-standard” corpus data is used to re-train our statistical language models. This iterative, bootstrapping approach enables our system to steadily improve as more training data becomes available.
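To make step 4 a bit more concrete, here’s a rough sketch of a disagreement check and merge in Python. It is only an illustration, not our production code: the token-level label lists, the pairwise disagreement measure, the majority-vote merge, and the function names and threshold values are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations
from typing import List, Optional

# Assumption: each annotator's output is a list of labels, one per token,
# and every annotator labels the same tokens in the same order.
Annotation = List[str]


def pairwise_disagreement(annotations: List[Annotation]) -> float:
    """Mean fraction of tokens on which two annotators differ, over all pairs."""
    if len(annotations) < 2:
        raise ValueError("need at least two annotators")
    rates = []
    for a, b in combinations(annotations, 2):
        if len(a) != len(b):
            raise ValueError("annotations must cover the same tokens")
        differing = sum(x != y for x, y in zip(a, b))
        rates.append(differing / len(a) if a else 0.0)
    return sum(rates) / len(rates)


def merge_by_majority(annotations: List[Annotation]) -> Annotation:
    """Pick the most common label per token (ties go to the earliest annotator)."""
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*annotations)]


def review(annotations: List[Annotation], threshold: float) -> Optional[Annotation]:
    """Return merged labels, or None if the document should be reassigned."""
    if pairwise_disagreement(annotations) > threshold:
        return None  # too much disagreement: send to a new group of annotators
    return merge_by_majority(annotations)


# Hypothetical example: three annotators labeling a three-token text.
example = [
    ["PER", "O", "ORG"],
    ["PER", "O", "ORG"],
    ["PER", "O", "O"],
]
gold = review(example, threshold=0.25)
```

Here two of the three annotator pairs differ on one token out of three, so the mean disagreement is 2/9; with the (made-up) 0.25 threshold the document is accepted and merged, while a stricter threshold such as 0.1 would send it back for re-annotation.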
This process enables us to generate significant quantities of “gold-standard” corpus data: hundreds of thousands of words each day. The system is also linearly scalable; additional human annotators can be added to the pipeline to increase corpus generation speeds to millions of words a day or more.
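As a back-of-the-envelope illustration of why throughput scales linearly with the number of annotators, here’s a small estimate in Python. The function, its parameters, and the numbers plugged in below are hypothetical and chosen only to show the arithmetic; they are not measurements from our pipeline.

```python
def daily_gold_words(annotators: int,
                     words_per_annotator_per_day: float,
                     annotators_per_document: int = 2,
                     reassignment_rate: float = 0.0) -> float:
    """Estimate gold-standard words produced per day.

    Assumptions (all hypothetical): every document is reviewed by a fixed
    number of annotators, and a fixed fraction of review rounds ends in
    reassignment instead of producing a gold-standard document.
    """
    if annotators_per_document < 2:
        raise ValueError("each document needs at least two annotators")
    if not 0.0 <= reassignment_rate < 1.0:
        raise ValueError("reassignment_rate must be in [0, 1)")
    # Total words reviewed across the whole team in one day.
    words_reviewed = annotators * words_per_annotator_per_day
    # Each document is read several times, so distinct text is a fraction of that.
    distinct_words = words_reviewed / annotators_per_document
    # Rounds that end in reassignment produce no gold-standard text.
    return distinct_words * (1.0 - reassignment_rate)


# Made-up figures: doubling the team doubles the estimate.
small_team = daily_gold_words(10, 5000)
large_team = daily_gold_words(20, 5000)
```

Under these assumptions the output is directly proportional to the number of annotators, which is the sense in which the pipeline scales linearly.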
We’ve built a web-based annotation interface for our team of human annotators, along with a workflow management system that lets us track corpus generation, monitor annotator speed and accuracy, and so on. In an upcoming post, we will show a video of this system; it’s pretty neat.
So what’s the point of all this? These language models are the core of AlchemyGrid’s relevancy capabilities; they enable our system to automatically determine the “context” of any text, extract meaning (relevant people, companies, places, things), determine sentiment or subjectivity, and more. Those of you using AlchemyGrid’s contextual widgets are already taking advantage of this functionality, as are those using our contextual analysis APIs. We have some really exciting stuff in the works that further builds upon this relevancy capability, so stay tuned!