Statistical Language Processing & Cloud Computing
Posted by: eturner on September 11th, 2008
During last week’s Boulder NewTech demo, we previewed some of the work Orchestr8 has been doing with statistical language processing. This is something we’ve been working on for a while, and it was exciting to show it publicly for the first time.
Statistical language models are a powerful tool for tasks such as text categorization, topic/keyword extraction, language identification, sentiment analysis, and so on. We’re using statistical techniques to perform these tasks and others.
By applying machine-learning algorithms and data-mining techniques to large volumes of training material (a “corpus”), we’re building probabilistic models for processing natural-language text.
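To make "probabilistic model" concrete, here is a minimal sketch of a bigram model with add-alpha smoothing, trained on a toy corpus. It is an illustration only, not our production code: the function names, the whitespace tokenizer, and the three-sentence corpus are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count how often each token follows each preceding token."""
    context_counts = Counter()
    bigram_counts = defaultdict(Counter)
    for sentence in sentences:
        # Naive whitespace tokenization with sentence boundary markers (an assumption).
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, curr in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[prev][curr] += 1
    return context_counts, bigram_counts

def bigram_probability(model, prev, curr, vocab_size, alpha=1.0):
    """Estimate P(curr | prev) with add-alpha smoothing."""
    context_counts, bigram_counts = model
    numerator = bigram_counts[prev][curr] + alpha
    denominator = context_counts[prev] + alpha * vocab_size
    return numerator / denominator

# Hypothetical toy corpus; a real corpus would hold millions of sentences.
corpus = ["the cat sat", "the dog sat", "the cat ran"]
model = train_bigram_model(corpus)

# Vocabulary of tokens that can appear in the "next token" position.
vocab = {word for followers in model[1].values() for word in followers}

p_cat = bigram_probability(model, "the", "cat", len(vocab))
p_dog = bigram_probability(model, "the", "dog", len(vocab))
print(f"P(cat | the) = {p_cat:.3f}, P(dog | the) = {p_dog:.3f}")
```

Because "cat" follows "the" more often than "dog" does in this toy corpus, the model assigns it the higher probability. Scaling the same counting idea up to a large corpus is where the compute cost comes from.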
The Internet provides a wealth of training materials for language modeling, in the form of corpora, n-gram rankings, and so on. But statistical language modeling requires more than just algorithms and training data: it requires CPU (and lots of it!).
Training on big corpora (millions of sentences or larger) takes a long time, even on today’s CPUs. We’re talking about months or more of compute time. During early development, our team trained models on their own PCs, but this quickly became unworkable. We needed more CPU power, and quickly.
Thank goodness for Amazon EC2.
Cloud computing definitely ended up being the answer for us. Amazon’s elastic compute facility enables us to dynamically provision huge amounts of CPU resources (hundreds of systems or more) within a few minutes. It’s completely changed how we do corpus training.
EC2 has truly been a blessing. Now instead of simply training a single language model, we can provision clusters of machines, each competitively training different models. By pitting different language models against one another, the best ones quickly rise to the top.
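One simple way to pit models against one another is to score each candidate on held-out text and keep the one with the lowest perplexity. The sketch below does this for smoothed unigram models that differ only in their smoothing constant. The scoring metric, the candidate values, and the toy data are assumptions chosen for illustration; they are not a description of how our own models are compared.

```python
import math
from collections import Counter

def perplexity(counts, vocab_size, alpha, heldout_tokens):
    """Perplexity of an add-alpha smoothed unigram model on held-out tokens."""
    total = sum(counts.values())
    log_prob = 0.0
    for token in heldout_tokens:
        p = (counts[token] + alpha) / (total + alpha * vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(heldout_tokens))

def pick_best(counts, vocab_size, heldout_tokens, alphas):
    """Score every candidate and return the one with the lowest perplexity."""
    scores = {a: perplexity(counts, vocab_size, a, heldout_tokens) for a in alphas}
    best = min(scores, key=scores.get)
    return best, scores

# Hypothetical toy data; each candidate could be trained on its own machine.
train_tokens = "the cat sat on the mat the dog sat on the rug".split()
heldout_tokens = "the cat sat on the rug with the dog".split()

counts = Counter(train_tokens)
vocab_size = len(set(train_tokens) | set(heldout_tokens))

best_alpha, scores = pick_best(counts, vocab_size, heldout_tokens, [0.01, 0.1, 1.0])
for alpha, score in sorted(scores.items()):
    print(f"alpha={alpha}: perplexity={score:.2f}")
print("best candidate:", best_alpha)
```

Each candidate is evaluated independently of the others, which is what makes this kind of comparison easy to spread across a cluster.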
Using Amazon’s elastic compute service also enables us to compute models that are simply infeasible to calculate on a single system (even a large multi-processor, multi-core machine), due to RAM and other limitations.
EC2 isn’t magic, however. It won’t instantly make your algorithms go faster. To take advantage of Amazon’s platform, we first needed to refactor our machine learning architecture to operate in a distributed fashion. A distributed approach enables an almost linear speed-up in training as additional CPU cores are added to the cluster.
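As a rough illustration of what "operating in a distributed fashion" can mean for count-based models, the sketch below splits a corpus into shards, counts bigrams in each shard independently, and merges the partial counts. This is a generic map-and-merge pattern, not our actual architecture; the function names and the `map_fn` parameter are hypothetical. Here the shards are processed with Python's built-in `map` on one machine, and a parallel map that hands each shard to a separate worker could be passed in instead.

```python
from collections import Counter
from functools import reduce

def split_into_shards(sentences, num_shards):
    """Deal sentences round-robin into num_shards lists."""
    return [sentences[i::num_shards] for i in range(num_shards)]

def count_shard(shard):
    """Map step: count bigrams within one shard, independent of all others."""
    counts = Counter()
    for sentence in shard:
        tokens = sentence.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts

def merge_counts(partial_counts):
    """Reduce step: sum the per-shard counters into one."""
    return reduce(lambda a, b: a + b, partial_counts, Counter())

def distributed_count(sentences, num_shards, map_fn=map):
    """Shard the corpus, count each shard, and merge the results.

    map_fn is a stand-in for whatever distributes work; the default runs serially.
    """
    shards = split_into_shards(sentences, num_shards)
    return merge_counts(map_fn(count_shard, shards))

# Hypothetical toy corpus.
corpus = ["the cat sat", "the dog sat", "the cat ran", "a dog ran"]
sharded = distributed_count(corpus, num_shards=3)
serial = count_shard(corpus)
print(sharded == serial)
```

Because bigrams never span sentence boundaries here, sharding by sentence gives exactly the same counts as a single pass. The counting step needs no communication between workers, which is why speed-up can stay close to linear until the merge step or data transfer starts to dominate.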
Re-coding some of our algorithms to be distributed has been challenging, but lots of fun. The result is a scalable machine learning facility that enables processing of absolutely huge corpora. It also enables extremely fast processing of text (millisecond time scales).
These capabilities, integrated into our Grid platform, are providing some truly exciting advances in relevancy and contextualization (far beyond simple keyword/topic extraction). We’ll post more on that later.
