Blank results while using Tokens Regex rules to identify Named Entities
I am struggling to write a correct rule that involves macros to identify organizations in text.
I want to identify Matrix Inc. in:
With its rising share prices Matrix Inc. has come out as a winner this quarter.
I am trying to check for words such as Inc within the entity, which I have defined as macros, using the rule below:
$ORGANIZATION_TITLES = "/pharmaceuticals?|group|corp|corporation|international|co.?|inc.?|incorporated|holdings|motors|ventures|parters|llc|limited liability corporation|pvt.? ltd.?/"

ENV.defaults["stage"] = 1
{ ruleType: "tokens", pattern: ([$ORGANIZATION_TITLES]), action: ( Annotate($0, ner, "ORGANIZATION") ) }

ENV.defaults["stage"] = 2
{ ( [{tag:NNP}]+? ($ORGANIZATION_TITLES)) => ORGANIZATION }
I also tried using bindings and then applying the rule:
env.bind("$ORGANIZATION_TITLES", TokenSequencePattern.compile(env, "/pharmaceuticals?|group|corp|corporation|international|co.?|inc.?|incorporated|holdings|motors|ventures|parters|llc|limited liability corporation|pvt.? ltd.?/"));
Nothing seems to be working. I also need to define more complex pattern rules involving macros, like:
pattern: ( [ { ner:PERSON } ]+ /,/*? ($TITLES_CORPORATE_PREFIXES)*? $TITLES_CORPORATE+? /,/*? /of|for/? /,/*? [ { ner:ORGANIZATION } ]+ )
where $TITLES_CORPORATE_PREFIXES and $TITLES_CORPORATE are macros similar to $ORGANIZATION_TITLES.
What am I doing wrong?
Edit
Here's my code:
public static void main(String[] args) {
    String rulesFile = "d:\\workspace\\resource\\nerrulesfile.txt";
    String dataFile = "d:\\workspace\\resource\\goldsetsentences.txt";

    Properties props = new Properties();
    props.put("annotators", "tokenize, ssplit, pos, lemma");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    // pipeline.addAnnotator(new TokensRegexAnnotator(rulesFile));

    String inputText = "bill edelman , ceo , chairman , paragonix commented on supply agreement essential pharmaceuticals .";
    Annotation document = new Annotation(inputText.toLowerCase());
    pipeline.annotate(document);
    List<CoreMap> sentences = document.get(SentencesAnnotation.class);

    CoreMapExpressionExtractor extractor =
        CoreMapExpressionExtractor.createExtractorFromFiles(TokenSequencePattern.getNewEnv(), rulesFile);

    /* Next we can go over the annotated sentences and extract the annotated words,
       using the CoreLabel object */
    for (CoreMap sentence : sentences) {
        List<MatchedExpression> matched = extractor.extractExpressions(sentence);
        for (MatchedExpression phrase : matched) {
            // Print out the matched text and value
            System.out.println("matched: " + phrase.getText() + " value " + phrase.getValue());

            // Print out token information
            CoreMap cm = phrase.getAnnotation();
            for (CoreLabel token : cm.get(TokensAnnotation.class)) {
                String word = token.get(TextAnnotation.class);
                String lemma = token.get(LemmaAnnotation.class);
                String pos = token.get(PartOfSpeechAnnotation.class);
                String ne = token.get(NamedEntityTagAnnotation.class);
                System.out.println("matched token: " + "word=" + word + ", lemma=" + lemma
                    + ", pos=" + pos + ", ne=" + ne);
            }
        }
    }
}
Here is a rules file that should work:
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

$ORGANIZATION_TITLES = "/inc\.|corp\./"

{ pattern: ([{pos: NNP}]+ $ORGANIZATION_TITLES), action: ( Annotate($0, ner, "RULE_FOUND_ORG") ) }
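One difference from the macro in the question is that the dots are escaped here. The sketch below uses plain java.util.regex, not CoreNLP itself, to show what that changes. It assumes a /regex/ in a rule has to match the whole text of a single token, which is how Matcher.matches() behaves; the class name TokenRegexEscapeDemo and the helper matchesToken are made up for this illustration.

```java
import java.util.regex.Pattern;

public class TokenRegexEscapeDemo {

    // Assumption: a token-level regex must cover the entire token text,
    // which is what Matcher.matches() checks (unlike find()).
    static boolean matchesToken(String regex, String token) {
        return Pattern.compile(regex).matcher(token).matches();
    }

    public static void main(String[] args) {
        String loose = "inc.?|corp.?";     // unescaped dot: any single character
        String strict = "inc\\.|corp\\.";  // escaped dot: a literal period only

        System.out.println(matchesToken(loose, "inc."));   // true
        System.out.println(matchesToken(loose, "inch"));   // true: "." matched "h"
        System.out.println(matchesToken(strict, "inc."));  // true
        System.out.println(matchesToken(strict, "inch"));  // false

        // An alternative containing a space cannot match any single token
        // under the whole-token assumption above.
        System.out.println(matchesToken("pvt.? ltd.?", "ltd."));  // false
    }
}
```

If that assumption holds, multi-word alternatives such as "limited liability corporation" would have to be written as a sequence of token patterns rather than inside one regex.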
I have made some changes to our code base to make TokensRegexAnnotator more accessible. You will need the latest version from GitHub: https://github.com/stanfordnlp/corenlp
If you run this command, or the equivalent Java API call, it should work:
java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,tokensregex -tokensregex.rules organization.rules -file samples.txt -outputFormat text -tokensregex.caseInsensitive
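For the Java API route, the command-line flags can be mirrored as pipeline properties. This is only a sketch of the settings, using java.util.Properties so that it runs without CoreNLP on the classpath. It assumes each flag maps to a property key of the same name and that a flag given without a value is read as "true"; the class name TokensRegexProps and the method build are hypothetical.

```java
import java.util.Properties;

public class TokensRegexProps {

    // Builds properties mirroring the command-line flags.
    // Assumption: flag names map one-to-one to property keys.
    static Properties build(String rulesFile) {
        Properties props = new Properties();
        // ner has to run before tokensregex so that rules can read and set NER tags
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex");
        props.setProperty("tokensregex.rules", rulesFile);
        // Assumption: a value-less flag corresponds to "true"
        props.setProperty("tokensregex.caseInsensitive", "true");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("organization.rules");
        props.list(System.out);
        // With CoreNLP available, these would be passed to the pipeline constructor:
        // StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    }
}
```

Compared with the code in the question, the annotator list adds ner and tokensregex, so the rules file is applied inside the pipeline instead of through a separate extractor.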