macros - Blank results while using Tokens Regex rules to identify Named Entities -

September 15, 2015

I am struggling to write a correct rule that involves macros to identify organizations in text.

I want to identify Matrix Inc. in:

With its rising share prices, Matrix Inc. has come out the winner this quarter.

I am trying to check for words such as inc within the entity, as defined by macros, with the rule below:

    $organization_titles = "/pharmaceuticals?|group|corp|corporation|international|co.?|inc.?|incorporated|holdings|motors|ventures|parters|llc|limited liability corporation|pvt.? ltd.?/"

    ENV.defaults["stage"] = 1
    {
      ruleType: "tokens",
      pattern: ([$organization_titles]),
      action: ( Annotate($0, ner, "ORGANIZATION") )
    }

    ENV.defaults["stage"] = 2
    {
      ( [{tag:NNP}]+? ($organization_titles)) => ORGANIZATION
    }

I also tried using bindings and then applying the rule.

    env.bind("$organization_titles", TokenSequencePattern.compile(env, "/pharmaceuticals?|group|corp|corporation|international|co.?|inc.?|incorporated|holdings|motors|ventures|parters|llc|limited liability corporation|pvt.? ltd.?/"));

Nothing seems to be working. I also need to define more complex pattern rules involving macros, like:

    pattern: ( [ { ner:PERSON } ]+ /,/*? ($titles_corporate_prefixes)*? $titles_corporate+? /,/*? /of|for/? /,/*? [ { ner:ORGANIZATION } ]+ )

where $titles_corporate_prefixes and $titles_corporate are macros similar to $organization_titles.

What am I doing wrong?

Edit:

Here's my code:

    public static void main(String[] args)
    {
        String rulesFile = "d:\\workspace\\resource\\nerrulesfile.txt";
        String dataFile = "d:\\workspace\\resource\\goldsetsentences.txt";

        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        // pipeline.addAnnotator(new TokensRegexAnnotator(rulesFile));
        String inputText = "bill edelman , ceo , chairman , paragonix commented on supply agreement essential pharmaceuticals .";

        Annotation document = new Annotation(inputText.toLowerCase());
        pipeline.annotate(document);
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        CoreMapExpressionExtractor extractor = CoreMapExpressionExtractor.createExtractorFromFiles(TokenSequencePattern.getNewEnv(), rulesFile);

        /* Next we can go over the annotated sentences and extract the annotated words,
           using the CoreLabel object */
        for (CoreMap sentence : sentences)
        {
            List<MatchedExpression> matched = extractor.extractExpressions(sentence);

            for (MatchedExpression phrase : matched) {
                // Print out the matched text and value
                System.out.println("matched: " + phrase.getText() + " value " + phrase.getValue());
                // Print out token information
                CoreMap cm = phrase.getAnnotation();
                for (CoreLabel token : cm.get(TokensAnnotation.class))
                {
                    String word = token.get(TextAnnotation.class);
                    String lemma = token.get(LemmaAnnotation.class);
                    String pos = token.get(PartOfSpeechAnnotation.class);
                    String ne = token.get(NamedEntityTagAnnotation.class);
                    System.out.println("matched token: " + "word=" + word + ", lemma=" + lemma + ", pos=" + pos + ", ne=" + ne);
                }
            }
        }
    }

Here is a rules file that should work:

    ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

    $organization_titles = "/inc\.|corp\./"

    {
      pattern: ([{pos: NNP}]+ $organization_titles),
      action: ( Annotate($0, ner, "RULE_FOUND_ORG") )
    }
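To get a feel for what the escaped macro accepts, here is a rough sketch using plain java.util.regex rather than CoreNLP itself. It assumes that a /regex/ in a token pattern has to match the whole text of a single token; the class name TokenRegexCheck and the helper matchesToken are made up for this illustration, and the unescaped pattern is a shortened version of the macro from the question.

```java
import java.util.regex.Pattern;

public class TokenRegexCheck {
    // Escaped pattern from the rules file: "\." is a literal period.
    static final Pattern ESCAPED = Pattern.compile("inc\\.|corp\\.");
    // Shortened, unescaped variant of the question's macro: "." matches any character.
    static final Pattern UNESCAPED = Pattern.compile("co.?|inc.?");

    // Assumption: the regex must match the entire token, not just part of it.
    static boolean matchesToken(Pattern pattern, String token) {
        return pattern.matcher(token).matches();
    }

    public static void main(String[] args) {
        String[] tokens = {"matrix", "inc.", "inca", "corp.", "cow"};
        for (String token : tokens) {
            System.out.println(token
                + " escaped=" + matchesToken(ESCAPED, token)
                + " unescaped=" + matchesToken(UNESCAPED, token));
        }
    }
}
```

Under that assumption the unescaped form also accepts tokens such as "inca" or "cow", and a multi-word alternative like "limited liability corporation" could never match one token, because the tokenizer has already split it into separate words.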

I have made some changes to our code base to make TokensRegexAnnotator more accessible. You will need the latest version from GitHub: https://github.com/stanfordnlp/corenlp

If you run this command, or the equivalent Java API call, it should work:

    java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,tokensregex -tokensregex.rules organization.rules -file samples.txt -outputFormat text -tokensregex.caseInsensitive
