Extracting geographic entities with Conditional Random Fields
Geographic Information Retrieval systems rely on identifying place names in documents to determine the region to which each document is relevant. Extracting location names from text is a common Natural Language Processing task. A simple approach is to use manually coded rules supported by dictionaries of place names, or gazetteers. Although these methods achieve good results, the rules are usually too restrictive and tied to a specific type of text.
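As a rough illustration of the dictionary-based approach, the sketch below does a longest-match lookup of token sequences against a gazetteer. The three-entry gazetteer and the names GAZETTEER, tokenize and find_places are hypothetical, chosen for this example only; they are not taken from the work described here.

```python
import re

# Hypothetical toy gazetteer: each entry is a tuple of lowercased tokens.
# A real gazetteer would hold many thousands of entries.
GAZETTEER = {
    ("lisboa",),
    ("porto",),
    ("vila", "nova", "de", "gaia"),
}
MAX_LEN = max(len(entry) for entry in GAZETTEER)


def tokenize(text):
    """Split text into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text)


def find_places(text):
    """Return gazetteer matches in reading order, preferring the longest match."""
    tokens = tokenize(text)
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shrink it.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in GAZETTEER:
                matches.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No entry starts at this token; move on.
            i += 1
    return matches


places = find_places("O comboio parte de Vila Nova de Gaia para Lisboa.")
```

A lookup like this ignores context entirely, so it would also flag "Porto" in "vinho do Porto", which hints at why hand-written rules are needed on top of the dictionary and why they tend to become specific to one kind of text.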
Another approach is to use machine learning, extracting features from texts in which the geographic entities are annotated. Features can be the surrounding words or properties of the word itself, such as capitalization or its frequency in the corpus. A probabilistic model is then built from these features to decide whether or not a given word is a geographic entity.
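The kinds of features mentioned above can be sketched as one dictionary per token. This is a minimal example under stated assumptions: the feature names, the "<BOS>"/"<EOS>" boundary markers and the functions token_features and sentence_features are hypothetical and do not reproduce the feature set used in the work described here.

```python
from collections import Counter


def token_features(tokens, i, corpus_counts):
    """Build a feature dictionary for tokens[i] (illustrative feature set)."""
    word = tokens[i]
    return {
        # Properties of the word itself.
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),   # capitalization
        "word.isupper": word.isupper(),
        "word.corpus_freq": corpus_counts[word.lower()],
        # Surrounding words, with boundary markers at sentence edges.
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def sentence_features(tokens, corpus_counts):
    """Return one feature dictionary per token in the sentence."""
    return [token_features(tokens, i, corpus_counts) for i in range(len(tokens))]


# Toy corpus counts; in practice these would come from the full corpus.
corpus_counts = Counter(["vive", "em", "em", "lisboa"])
features = sentence_features(["Vive", "em", "Lisboa"], corpus_counts)
```

Sequences of such dictionaries, paired with per-token labels from the annotated texts, are the usual kind of training input for a sequence model such as a Conditional Random Field.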
Work on training and using Conditional Random Fields to extract geographic references from a crawl of the Portuguese web will be presented, along with resources available for research, such as a geographic ontology of Portugal.