Problem: In Text Mining, extracting keywords (n-grams) alone cannot produce meaningful data or discover "unknown" themes and trends.
Objective: The aim here is to extract dependency relations from sentences, i.e., extract sets of the form {subject, predicate[modifiers], object} out of syntactically parsed sentences, using the Stanford Parser and OpenNLP.
Steps:
1) Apply sentence segmentation to determine the sentence boundaries
2) Run the Stanford Parser on each sentence to generate output in the form of dependency relations, which represent the syntactic relationships within the sentence
3) Use these dependency relations to get the syntactic relationship between each pair of words
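Step 1 can be sketched as follows. This is only a naive stand-in, assuming a simple punctuation-based rule; a real pipeline would use a trained sentence detector (e.g., OpenNLP's SentenceDetector or an equivalent NLTK tokenizer), which handles abbreviations and other edge cases far better.

```python
import re

def segment_sentences(text):
    """Naive sentence segmentation: split after '.', '!' or '?' when
    followed by whitespace and a capital letter. A trained model
    (e.g., OpenNLP's SentenceDetector) should be used in practice,
    since this rule breaks on abbreviations like 'Dr. Smith'."""
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)
    return [p.strip() for p in parts if p.strip()]

text = "The flat tire was not changed by driver. The car was towed away."
print(segment_sentences(text))
```

Each resulting sentence is then passed to the parser individually.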
How is this different from n-grams?
Dependency relations allow the similarity comparison to be based on the syntactic relations between words, instead of having to match words in their exact order as in n-gram based comparisons.
Example:
Sentence: “The flat tire was not changed by driver”
Stanford dependency relations:
root(ROOT-0, changed-6)
det(tire-3, the-1)
amod(tire-3, flat-2)
nsubjpass(changed-6, tire-3)
auxpass(changed-6, was-4)
neg(changed-6, not-5)
prep(changed-6, by-7)
pobj(by-7, driver-8)
Refer to the Stanford typed dependencies manual for the full list and more information: http://nlp.stanford.edu/software/dependencies_manual.pdf
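Each dependency line above has the fixed shape relation(governor-index, dependent-index), so it can be decomposed with a small regex before any triple logic is applied. A minimal sketch (the function name and the decision to drop word indices are my own choices, not part of the Stanford output format):

```python
import re

def parse_dependency(line):
    """Parse one Stanford-style dependency string such as
    'nsubjpass(changed-6, tire-3)' into (relation, governor, dependent),
    dropping the 1-based word indices."""
    m = re.match(r'(\w+)\((\S+)-\d+, (\S+)-\d+\)', line)
    rel, gov, dep = m.groups()
    return rel, gov, dep

print(parse_dependency("nsubjpass(changed-6, tire-3)"))
```

Keeping the indices instead would let you distinguish repeated words within a sentence; they are dropped here only to keep the example short.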
Triple output in the form (Subject : Predicate [modifier] : Object):
driver : changed [not] : tire
Extraction Logic: You can use the base logic below to build the functionality in your favorite/most comfortable language (R/Python/Java/etc.). Please note that this is only the base logic and needs enhancement.
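One possible version of that base logic, as a sketch: read the dependency lines, treat nsubj/dobj as the active-voice subject and object, treat nsubjpass as a passive clause whose "by" agent becomes the logical subject, and collect neg relations as modifiers. This is my illustrative reconstruction, not a complete implementation; conjunctions, multiple clauses, and other prepositions are among the enhancements it needs.

```python
import re

def parse_dep(line):
    """Split 'rel(gov-i, dep-j)' into (rel, gov, dep), dropping indices."""
    return re.match(r'(\w+)\((\S+)-\d+, (\S+)-\d+\)', line).groups()

def extract_triple(dep_lines):
    """Base extraction logic (a sketch that needs enhancement):
    active voice uses nsubj/dobj; passive voice (nsubjpass) swaps roles,
    taking the 'by' agent as subject; neg contributes a modifier."""
    subj = pred = obj = None
    mods = []
    for rel, gov, dep in (parse_dep(l) for l in dep_lines):
        if rel == "nsubj":                   # active-voice subject
            pred, subj = gov, dep
        elif rel == "dobj":                  # active-voice object
            obj = dep
        elif rel == "nsubjpass":             # passive: grammatical subject is the logical object
            pred, obj = gov, dep
        elif rel == "pobj" and gov == "by":  # agent of a passive verb
            subj = dep
        elif rel == "neg":                   # negation modifier on the predicate
            mods.append(dep)
    mod = f" [{','.join(mods)}]" if mods else ""
    return f"{subj} : {pred}{mod} : {obj}"

deps = [
    "root(ROOT-0, changed-6)",
    "det(tire-3, the-1)",
    "amod(tire-3, flat-2)",
    "nsubjpass(changed-6, tire-3)",
    "auxpass(changed-6, was-4)",
    "neg(changed-6, not-5)",
    "prep(changed-6, by-7)",
    "pobj(by-7, driver-8)",
]
print(extract_triple(deps))  # driver : changed [not] : tire
```

Because word indices are dropped, a sentence with two "by" prepositions would confuse this sketch; a real implementation should key on indices rather than surface forms.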
Challenges:
- The approach relies heavily on well-formed sentence-level structure
- Abbreviations and grammatical errors in sentences will mislead the analysis
Hope this article is useful!