Posts filed under 'Beta'


Named Entity Disambiguation

Posted by: eturner on March 17th, 2009

We’re back with another big update to our AlchemyAPI content analysis / text mining service!

What’s new in this release?  Named Entity Disambiguation

Human language is not exact. Text referring to the city “Roanoke” can mean “Roanoke, Virginia” or “Roanoke, Texas”, depending on the surrounding context. Organizations and companies often have multiple nicknames, name variations, or common misspellings. Famous persons (“Michael Jackson”) often share a name with many non-famous individuals.

Named Entity Disambiguation works to solve these and other text ambiguity problems.

So how does it work?

Our disambiguation engine employs tens of millions of contextual hints describing traits of the world’s objects, individuals, and locations. We employ a variety of public and non-public data-sets.

Hints vary depending on the specific type of entity being disambiguated. For example, when disambiguating people, we utilize information on a person’s career, where they’re located, who they work for, and so on. For companies: key executives, notable products, industry, location, etc.
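To make the idea of contextual hints concrete, here is a minimal sketch in Python. It is not our actual engine: the `HINTS` table, its contents, and the function names are hypothetical, and simple word overlap stands in for whatever scoring a production system would use.

```python
import re

# Hypothetical hint table: each candidate entity maps to a set of context words.
# These entries are illustrative only, not data from the actual service.
HINTS = {
    "Roanoke, Virginia": {"virginia", "blue", "ridge", "salem", "appalachian"},
    "Roanoke, Texas": {"texas", "dallas", "denton", "fort", "worth"},
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def disambiguate(context, hints=HINTS):
    """Return the candidate whose hints overlap the context most.

    Returns None when no hint matches or when the top candidates tie.
    """
    words = tokenize(context)
    scores = {name: len(words & hint_words) for name, hint_words in hints.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_name, best_score = ranked[0]
    # Refuse to guess when nothing matches or the top two candidates tie.
    if best_score == 0 or (len(ranked) > 1 and ranked[1][1] == best_score):
        return None
    return best_name
```

Calling `disambiguate("Roanoke sits between Dallas and Fort Worth")` picks the Texas candidate, while a sentence with no matching hints returns `None` rather than a guess.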

Whenever an entity is successfully disambiguated, additional information is returned in API responses. This includes the fully resolved, disambiguated entity name, and if available, the entity’s website and geographic coordinates.

AlchemyAPI’s Named Entity Disambiguation system resolves approximately two dozen entity types, more than any other commercially available text mining system!

Disambiguation functionality is available to all API preview / beta users.  If you do not currently have an API access key, please apply for one.

Also new in this release:

  1. Source text can now optionally be returned in all named entity and keyword extraction API call results.
  2. Updates to online API documentation.
  3. New developer SDKs for Ruby, C, and C++.

New SDKs, Website Updates

Posted by: eturner on February 17th, 2009

AlchemyAPI users, rejoice!  We’ve just released SDKs for over a half-dozen programming languages (Java, .NET, Python, Perl, PHP, etc.) that enable easy integration of AlchemyAPI into your development project.  For those of you who prefer not to use an SDK, our entire API is (and will always be) available as an Internet-accessible REST web service.

In other news: We’ve made some significant updates to the Orchestr8 website, in preparation for the AlchemyAPI general availability release.

We’ll be announcing something pretty exciting in the next week or two (a definite commercial “first” in the NLP / text mining world).  Stay tuned :)

Content Analysis API Updates

Posted by: eturner on January 22nd, 2009

We’ve just deployed a significant new update to the Orchestr8 Content Analysis API!

This release contains a number of exciting enhancements, including:

  1. JSON output support (easily integrate into any Javascript web application).
  2. Document / Content uploading support (analyze private / non-web-accessible content).
  3. Updated API Documentation.
  4. Microformats output support (automatically generate rel-tag Microformat content).
  5. Enhanced named entity detection accuracy (improved detection of sports teams, radio stations, and more).
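As a rough illustration of consuming JSON output, the sketch below parses a response and groups entities by type. The payload shape and field names (`status`, `entities`, `type`, `text`) are assumptions made for this example; check the API documentation for the actual response format.

```python
import json

# Hypothetical response body; the real API's field names may differ.
SAMPLE_RESPONSE = """
{
  "status": "OK",
  "entities": [
    {"type": "City", "text": "Boulder"},
    {"type": "Company", "text": "Orchestr8"}
  ]
}
"""


def entities_by_type(raw_json):
    """Group entity text by entity type from a JSON response string."""
    payload = json.loads(raw_json)
    grouped = {}
    for entity in payload.get("entities", []):
        grouped.setdefault(entity["type"], []).append(entity["text"])
    return grouped
```

The same grouping takes only a few lines in a JavaScript web application, since the response is already a native object there.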

Beta users can take advantage of the new API release effective immediately.  If you are not currently a beta user and would like to apply for an API key, please contact Orchestr8 Support.

New Year & New Release

Posted by: eturner on January 12th, 2009

Happy 2009 from Orchestr8! We’re kicking off this new year with an exciting update to AlchemyGrid’s text analysis / entity extraction system.

We’ve blogged about text analysis and NLP (natural language processing) in the past. These capabilities are utilized within our contextual widget platform, and by our Content Analysis API.

Orchestr8 has put significant effort into building a robust natural language parsing capability: automated language identification, topic classification, sentence splitting / parsing, part-of-speech tagging, chunking, and so on. Our “language stack” is entirely statistical in its basis, operating in much the same fashion as speech recognition systems and Google’s automated translation service, Google Translate.

This new release represents a significant improvement in overall precision and recall, moving our system closer to human-level tagging performance. In practice, we’re now able to detect People, Companies, Locations, and other entity types more accurately than before. Existing API and widget users will see these improvements automatically; no changes to API calls are necessary.
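For readers unfamiliar with the terms: precision is the fraction of detected entities that are correct, and recall is the fraction of true entities that were detected. The sketch below computes both for a set of tagged entities. The entity pairs are made-up sample data, not measurements from our system.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall for predicted vs. gold entity sets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Illustrative data only: (entity text, entity type) pairs.
gold = {
    ("Orchestr8", "Company"),
    ("Boulder", "City"),
    ("Colorado", "State"),
}
predicted = {
    ("Orchestr8", "Company"),
    ("Boulder", "City"),
    ("Meetup", "Company"),
    ("Grid", "Company"),
}
precision, recall = precision_recall(predicted, gold)
```

Here two of the four predictions are correct (precision 0.5) and two of the three gold entities are found (recall of about 0.67).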

We’re able to continually improve our language capabilities over time, due to the statistical nature of our approach and the fact that we’re steadily generating larger and larger sets of training materials. To create training materials for our system, we rely on a dedicated team of human annotators (something we’ve discussed in the past). Our annotation team generates hundreds of thousands of words of training materials each day, which are then used to teach our machine-learning algorithms, improving system accuracy.

We have some really exciting announcements planned for coming weeks / months related to the AlchemyGrid platform and its natural language processing capabilities. Stay tuned for more information.

Existing Users: Enjoy the new release, and feel free to contact us with any comments, questions, or feedback!

Everyone else: We’ll be opening up public access to the AlchemyGrid Contextual Analysis API very soon! If you can’t wait and would like to apply for early access, send us an email.

Statistical Language Processing & Corpus Generation

Posted by: eturner on October 21st, 2008

In a recent post we touched upon some of the work Orchestr8 is doing with statistical language processing.

Statistical language processing relies on the application of machine-learning algorithms and data-mining techniques to large volumes of training materials (a “corpus”), in order to build probabilistic models for processing of natural language text.

We use these techniques to perform tasks including text categorization, topic/keyword extraction, language identification, sentiment analysis, named entity recognition, and so on.

The key to these methods is having huge amounts of training data; the more data, the better.

The Internet provides easy access to vast quantities of data: news-wire articles, blog posts, SEC company releases, etc.

However, this data isn’t labeled. It’s just raw data. Unsupervised learning techniques (which rely on unlabeled data) aren’t as robust for what we’re doing, compared to supervised methods. Therefore, we needed a mechanism for turning raw information into properly-labeled training data.

To do this, we’ve constructed a “Corpus Generation / Model Re-training” pipeline. Our process uses a combination of automated tagging / labeling techniques and teams of human annotators, to transform raw data (news, blogs, etc.) into “gold-standard” corpus text.

Here’s a simplified process-flow diagram that illustrates our pipeline:

[Diagram: Corpus Generation / Model Re-training pipeline]

Here’s what we’re doing, in more detail:

1. Automated web spiders (using AlchemyGrid’s content scraping facilities) constantly monitor various Internet content sources (news websites, blogs, financial websites, etc.), grabbing new articles and posts. These are stored in a database for subsequent processing.

2. New articles and posts are automatically labeled and tagged using our most-current statistical language models. This automatic labeling does the majority of the “work” associated with generating new “gold-standard” corpus data. Humans need only perform minor corrections to the output of this automated system.

3. The automatically-labeled texts are then sent to a randomly selected group of human annotators (between 2 and N individuals). These annotators manually re-check the previously tagged text, correcting any mistakes and adding missing labels.

4. The human-verified texts are then compared, to determine the level of “disagreement” between the human annotators. If the level of disagreement is above our pre-defined threshold, the document is re-assigned back to a *new* randomly selected group of human annotators. Otherwise, the annotations are merged into a “gold-standard” document and recorded into our database of annotated texts.

5. On a periodic basis, all “gold-standard” corpus data is utilized to re-train our statistical language models. This iterative, boot-strapping approach enables our system to steadily improve as more training data is made available.
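Steps 3 and 4 above can be sketched roughly as follows. This is an illustration under stated assumptions, not our production code: the threshold value, the per-token disagreement measure, and the majority-vote merge are all hypothetical choices made for the example.

```python
from collections import Counter

# Hypothetical value; the real pre-defined threshold is not stated here.
DISAGREEMENT_THRESHOLD = 0.25


def disagreement(annotations):
    """Fraction of tokens on which the annotators do not all agree.

    `annotations` holds one label sequence per annotator, all the same length.
    """
    token_labels = list(zip(*annotations))
    if not token_labels:
        return 0.0
    conflicts = sum(1 for labels in token_labels if len(set(labels)) > 1)
    return conflicts / len(token_labels)


def merge(annotations):
    """Majority-vote merge; ties go to the label seen first among annotators."""
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*annotations)]


def review(annotations, threshold=DISAGREEMENT_THRESHOLD):
    """Return ("reassign", None) above the threshold, else ("gold", merged)."""
    if disagreement(annotations) > threshold:
        return "reassign", None
    return "gold", merge(annotations)
```

With three annotators who differ on one token out of four, the disagreement is 0.25, which does not exceed the example threshold, so the labels are merged by majority vote. A document on which two annotators disagree about every token would be sent to a new group.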

This process enables us to generate significant quantities of “gold-standard” corpus data: hundreds of thousands of words each day. The system is also linearly scalable; additional human annotators can be added to the pipeline to increase corpus generation speeds to millions of words a day or more.

We’ve built a web-based annotation interface for our team of human annotators, and a work-flow management system that enables us to track corpus generation, monitor human annotator speeds and accuracy, and so on. In an upcoming post, we will show a video of this system; it’s pretty neat.

So what’s the point of all this? These language models are the core of AlchemyGrid’s relevancy capabilities; they enable our system to automatically determine the “context” of any text, extract meaning (relevant people, companies, places, things), determine sentiment or subjectivity, and more. Those of you using AlchemyGrid’s contextual widgets are already taking advantage of this functionality, as are those using our contextual analysis APIs. We have some really exciting stuff in the works that further builds upon this relevancy capability, so stay tuned!

Boulder New Technology Demo

Posted by: eturner on September 3rd, 2008

Thanks to everyone who came out to see our Grid demo at the Boulder New Technology Meetup last night!

Demoing our new service was a lot of fun, and we got some really great feedback from members of the Colorado tech community.

Boulder New Tech. Meetup Demo - Sept 2

Posted by: eturner on August 15th, 2008

Planning on being in Colorado on September 2?  Come out and join us at the Boulder New Technology Meetup, where you can see a public demo of our soon-to-be-launched AlchemyGrid service!  Orchestr8 will be presenting at this Meetup, and will be around afterwards in the demo area to answer any specific questions.

It’s Coming!

Posted by: eturner on July 18th, 2008

We’ve got something big coming, folks! Soon we’ll be taking the wraps off our new AlchemyGrid service release, containing over four hundred (400!) new features and site enhancements. You asked, and we listened!

Stay tuned for more information... :)
