Posts filed under 'Data'


Statistical Language Processing & Cloud Computing

Posted by: eturner on September 11th, 2008

During last week’s Boulder NewTech demo, one of the things we previewed was some of the work Orchestr8 has been doing with statistical language processing. This is something we’ve been working on for a while, and it was exciting to show publicly for the first time.

Statistical language models are a powerful tool for tasks such as text categorization, topic/keyword extraction, language identification, sentiment analysis, and so on. We’re using statistical techniques to perform these tasks and others.

By applying machine-learning algorithms and data-mining techniques to large volumes of training material (a “corpus”), we’re building probabilistic models for processing natural-language text.
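To make the idea concrete, here is a minimal sketch of one such probabilistic model: a bag of character trigrams with add-one smoothing, used for language identification. It is an illustration only, not Orchestr8's actual implementation. The class name `NgramModel`, the helper `identify`, and the toy training sentences are all assumptions made up for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split text into overlapping character n-grams, padding the start with spaces."""
    padded = " " * (n - 1) + text.lower()
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramModel:
    """Hypothetical model: character n-gram frequencies with add-one smoothing."""

    def __init__(self, corpus, n=3):
        self.n = n
        self.counts = Counter()
        for sentence in corpus:
            self.counts.update(char_ngrams(sentence, n))
        self.total = sum(self.counts.values())
        # One extra slot reserves probability mass for n-grams never seen in training.
        self.vocab = len(self.counts) + 1

    def log_prob(self, text):
        """Sum of smoothed log-probabilities of the text's n-grams."""
        return sum(
            math.log((self.counts[g] + 1) / (self.total + self.vocab))
            for g in char_ngrams(text, self.n)
        )


def identify(text, models):
    """Return the name of the model that assigns the text the highest score."""
    return max(models, key=lambda name: models[name].log_prob(text))


if __name__ == "__main__":
    # Toy corpora (assumed for illustration); real training uses far larger corpora.
    models = {
        "english": NgramModel(["the quick brown fox jumps over the lazy dog"]),
        "spanish": NgramModel(["el rápido zorro marrón salta sobre el perro perezoso"]),
    }
    print(identify("the dog is over there", models))
```

A real system would train on millions of sentences and use a stronger smoothing scheme, but the shape is the same: count, normalize, then score new text against each model.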

The Internet provides a wealth of training materials for language modeling, in the form of corpora, n-gram rankings, and so on. But statistical language modeling requires more than just algorithms and training data: it requires CPU (and lots of it!).

Training on big corpora (millions of sentences or larger) takes a long time, even on today’s CPUs. We’re talking about months or more of compute time. During early development, our team trained models on their own PCs, but this quickly became unworkable. We needed more CPU power, and quickly.

Thank goodness for Amazon EC2.

Cloud computing definitely ended up being the answer for us. Amazon’s elastic compute facility enables us to dynamically provision huge amounts of CPU resources (hundreds of systems or more) within a few minutes. It’s completely changed how we train on corpora.

EC2 has truly been a blessing. Now instead of simply training a single language model, we can provision clusters of machines, each competitively training different models. By pitting different language models against one another, the best ones quickly rise to the top.
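The "pit models against one another" idea can be sketched as a small tournament: train several candidate models on the same data, score each on held-out examples, and rank them. The sketch below runs the candidates one after another on a single machine as a stand-in for a cluster. The function names, the simple n-gram overlap classifier, and the toy data are assumptions for illustration, not the models described above.

```python
from collections import Counter


def train_ngram_classifier(labeled_texts, n):
    """Return a predict(text) function backed by per-label character n-gram counts."""
    counts = {}
    for label, text in labeled_texts:
        grams = [text[i:i + n] for i in range(len(text) - n + 1)]
        counts.setdefault(label, Counter()).update(grams)

    def predict(text):
        grams = [text[i:i + n] for i in range(len(text) - n + 1)]
        # Score each label by how many of the text's n-grams it saw in training.
        return max(counts, key=lambda label: sum(counts[label][g] > 0 for g in grams))

    return predict


def accuracy(predict, labeled_texts):
    """Fraction of held-out examples the classifier labels correctly."""
    hits = sum(predict(text) == label for label, text in labeled_texts)
    return hits / len(labeled_texts)


def run_tournament(candidates, train_set, heldout_set):
    """Train every candidate and rank them by held-out accuracy, best first."""
    results = []
    for name, train in candidates.items():
        # On a cluster, each candidate would be trained on its own machine.
        predict = train(train_set)
        results.append((accuracy(predict, heldout_set), name))
    return sorted(results, reverse=True)


if __name__ == "__main__":
    train_set = [("en", "the cat sat on the mat"), ("es", "el gato se sentó en la alfombra")]
    heldout_set = [("en", "the dog"), ("es", "el perro")]
    candidates = {
        "unigram": lambda data: train_ngram_classifier(data, 1),
        "trigram": lambda data: train_ngram_classifier(data, 3),
    }
    for score, name in run_tournament(candidates, train_set, heldout_set):
        print(name, score)
```

Scoring on held-out data rather than the training data matters here: it rewards models that generalize instead of ones that merely memorize the corpus.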

Using Amazon’s elastic compute service also enables us to compute models that are simply infeasible to calculate on a single system (even a large multi-processor, multi-core machine), due to RAM and other limitations.

EC2 isn’t magic, however. It won’t instantly make your algorithms go faster. To take advantage of Amazon’s platform, we first needed to refactor our machine learning architecture to operate in a distributed fashion. A distributed approach enables an almost linear speed-up in training as additional CPU cores are added to the cluster.
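One common way to structure this kind of refactoring (an assumption here, since the post does not say which approach was used) is a map/reduce split: shard the corpus, count n-grams on each shard independently, then merge the partial counts. The sketch below does this for word bigrams. The `mapper` parameter is a hypothetical hook: it defaults to the built-in `map`, and a thread pool, process pool, or cluster scheduler with a compatible `map` could be swapped in.

```python
from collections import Counter
from functools import reduce


def shard(sentences, num_shards):
    """Deal sentences round-robin into num_shards roughly equal lists."""
    return [sentences[i::num_shards] for i in range(num_shards)]


def count_shard(sentences, n=2):
    """Map step: count word n-grams within one shard, independently of the others."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts


def merge(total, partial):
    """Reduce step: fold one shard's counts into the running total."""
    total.update(partial)
    return total


def count_distributed(sentences, num_shards, mapper=map):
    """Count n-grams shard by shard, then merge the partial results."""
    partials = mapper(count_shard, shard(sentences, num_shards))
    return reduce(merge, partials, Counter())


if __name__ == "__main__":
    from concurrent.futures import ThreadPoolExecutor

    corpus = ["the cat sat", "the cat ran", "a dog sat", "the dog ran fast"]
    # Threads stand in for cluster nodes in this demo.
    with ThreadPoolExecutor(max_workers=2) as pool:
        print(count_distributed(corpus, 2, mapper=pool.map).most_common(3))
```

Because the shards share no state during the map step, adding workers shortens that step roughly in proportion. The merge step and data transfer are what keep the speed-up "almost" linear rather than perfectly so.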

Re-coding some of our algorithms to be distributed has been challenging, but lots of fun. The result is a scalable machine learning facility that enables processing of absolutely huge corpora. It also enables extremely fast processing of text (millisecond time scales).

These capabilities, integrated into our Grid platform, are providing some truly exciting advances in relevancy and contextualization (far beyond simple keyword/topic extraction). We’ll post more on that later. :)


Show Me the Data!

Posted by: Company on November 14th, 2006

One of the primary tasks of any mashup is getting at data. Data is the life-blood of a mashup; after all, without it there wouldn’t be anything to mash!

The problem is, data on the Internet is pretty much a mess. It’s in every conceivable format, spread across every conceivable location, with no built-in indexing, search, or publishing standards. Even when a data standard does exist, it’s often interpreted in wildly different fashions, or fragmented into multiple distinct “flavors” (think RSS).

The “getting at data” problem isn’t just about protocols or file formats either. Data comes in all forms; some of the most interesting types (to us, anyway!) are highly dynamic, temporal, even personal. This includes — but goes way beyond — click-stream information.

Accessing the world’s data in all its wide and varied forms is difficult. However, we feel that we’re up to the challenge. After many months of dealing with protocol specifications, data standards, and file formats, we have some technology that we’re pretty excited about. Think of AlchemyPoint as a large electronic sieve, with information of all sorts flowing through it. Inside, the magic happens: mashups poke, prod, analyze, and reformat.

Part of what excites us about AlchemyPoint isn’t simply that we’re opening the door to all sorts of data types and sources, but that we’re trying to give users more meaningful ways to understand the data that’s out there.

By “meaningful,” we mean allowing users to operate more on a level that they’re used to. Despite the W3C’s great work on specifications, people just don’t think in XPath. We wanted to implement a way of accessing data in AlchemyPoint that even a child could understand.

Children see the world as it is. Technology lovers often get bogged down in details, formats, specs. We like to think of children as operating on the “presentation-level.”

If a child wants a blue ball, they’ll ask for the blue ball. They don’t create an XPath query. More interestingly, children can intuitively determine the distinct or important aspects of a scene. What if there are two blue balls? “I want the blue ball next to the red truck,” a child might say.

At their essence, these are constraints: constraints that specify aspects of the world around us. No specification (no matter how long) could begin to describe all the complexity of the real world. Why should we expect the digital world to be any different?

AlchemyPoint doesn’t. We’re working on a new way of accessing data, one that is independent of underlying file formats, specifications, or standards. We’re employing presentation-level constraints, positioning and other visual cues.
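As a rough sketch of what a presentation-level constraint could look like in code, the example below picks "the blue ball next to the red truck" out of a scene using only visible properties (kind, color, position). This is purely illustrative and is not AlchemyPoint's API. The `Element` type, the `select` function, and the coordinates are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Hypothetical on-screen item described only by what a viewer can see."""
    kind: str
    color: str
    x: float
    y: float


def matching(scene, kind, color):
    """All elements in the scene with the given kind and color."""
    return [e for e in scene if e.kind == kind and e.color == color]


def distance(a, b):
    """Straight-line distance between two elements."""
    return math.hypot(a.x - b.x, a.y - b.y)


def select(scene, kind, color, near=None):
    """Pick one element by kind and color, optionally the one closest to a landmark.

    `near` is a (kind, color) pair naming the landmark, e.g. ("truck", "red").
    Returns None when nothing in the scene satisfies the constraints.
    """
    candidates = matching(scene, kind, color)
    if not candidates:
        return None
    if near is None:
        return candidates[0]  # no positional constraint: take the first match
    anchors = matching(scene, *near)
    if not anchors:
        return None
    return min(candidates, key=lambda e: min(distance(e, a) for a in anchors))


if __name__ == "__main__":
    scene = [
        Element("ball", "blue", 0, 0),
        Element("ball", "blue", 10, 0),
        Element("truck", "red", 11, 0),
    ]
    print(select(scene, "ball", "blue", near=("truck", "red")))
```

Note that nothing in the query mentions a file format or document structure. The same request would work whether the scene came from HTML, a feed, or anything else that can be rendered.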

So go ahead! You want the blue ball? Just ask for it…
