org.gbif.ecat.parser
Class NameParser

java.lang.Object
  extended by org.gbif.ecat.parser.NameParser

public class NameParser
extends Object

Author:
markus

Field Summary
protected static String all_letters_numbers
           
protected static String AUTHOR
           
protected static String author_letters
           
protected static String AUTHOR_LETTERS
           
protected static String AUTHOR_PREFIXES
           
protected static String AUTHOR_TEAM
           
protected static Pattern AUTHOR_TEAM_PATTERN
           
static Pattern CANON_NAME_IGNORE_AUTHORS
           
protected static Pattern CULTIVAR
           
 boolean debug
           
protected static String EPHITHET
           
protected static String EPHITHET_PREFIXES
           
protected static Pattern EXTRACT_NOMSTATUS
           
static Pattern HYBRID_FORMULA_PATTERN
           
static String HYBRID_MARKER
           
protected static String INFRAGENERIC
           
static Pattern IS_VIRUS_PATTERN
           
protected static Logger log
           
protected static String MONOMIAL
           
protected static String name_letters
           
protected static String NAME_LETTERS
           
static Pattern NAME_PATTERN
           
protected static String RANK_MARKER_SPECIES
           
protected static String YEAR
           
 
Constructor Summary
NameParser()
           
 
Method Summary
 void addMonomials(Set<String> monomials)
           
protected static String cleanStrong(String name)
          Very optimistic cleaning intended for names that may be very dirty.
 Set<String> getMonomials()
           
static void main(String[] args)
           
static String normalize(String name)
          Carefully normalizes a scientific name, staying as close to the original as possible.
protected static String normalizeStrong(String name)
          Does the same as normalize(String) and additionally removes all ( ) and "und" etc. It also checks whether a name starts with a blacklisted name part such as "Undetermined" or "Uncertain" and, in that case, returns only the blacklisted word, so it is easy to catch names with blacklisted name parts.
<T> ParsedName<T>
parse(String scientificName)
          Fully parses the supplied name, also trying to extract authorships.
 String parseToCanonical(String scientificName)
          Parses the name without authorship and returns the ParsedName.canonicalName() string.
protected static String preClean(String name)
          Basic, careful cleaning that tries to preserve all parsable name parts.
 void readMonomialsRsGbifOrg()
          Reads generic and suprageneric names from the rs.gbif.org dictionaries and feeds them into the name parser as monomial references.
 void setMonomials(Set<String> monomials)
          Provides a set of case-insensitive words that indicate a true monomial, used to detect a taxonomic subrank instead of an author.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

log

protected static Logger log

debug

public boolean debug

NAME_LETTERS

protected static final String NAME_LETTERS
See Also:
Constant Field Values

name_letters

protected static final String name_letters
See Also:
Constant Field Values

AUTHOR_LETTERS

protected static final String AUTHOR_LETTERS
See Also:
Constant Field Values

author_letters

protected static final String author_letters
See Also:
Constant Field Values

all_letters_numbers

protected static final String all_letters_numbers
See Also:
Constant Field Values

AUTHOR_PREFIXES

protected static final String AUTHOR_PREFIXES
See Also:
Constant Field Values

AUTHOR

protected static final String AUTHOR
See Also:
Constant Field Values

AUTHOR_TEAM

protected static final String AUTHOR_TEAM
See Also:
Constant Field Values

AUTHOR_TEAM_PATTERN

protected static final Pattern AUTHOR_TEAM_PATTERN

YEAR

protected static final String YEAR
See Also:
Constant Field Values

RANK_MARKER_SPECIES

protected static final String RANK_MARKER_SPECIES

EPHITHET_PREFIXES

protected static final String EPHITHET_PREFIXES
See Also:
Constant Field Values

EPHITHET

protected static final String EPHITHET
See Also:
Constant Field Values

MONOMIAL

protected static final String MONOMIAL
See Also:
Constant Field Values

INFRAGENERIC

protected static final String INFRAGENERIC

HYBRID_MARKER

public static final String HYBRID_MARKER
See Also:
Constant Field Values

HYBRID_FORMULA_PATTERN

public static final Pattern HYBRID_FORMULA_PATTERN

CULTIVAR

protected static final Pattern CULTIVAR

IS_VIRUS_PATTERN

public static final Pattern IS_VIRUS_PATTERN

EXTRACT_NOMSTATUS

protected static final Pattern EXTRACT_NOMSTATUS

CANON_NAME_IGNORE_AUTHORS

public static final Pattern CANON_NAME_IGNORE_AUTHORS

NAME_PATTERN

public static final Pattern NAME_PATTERN
Constructor Detail

NameParser

public NameParser()
Method Detail

cleanStrong

protected static String cleanStrong(String name)
Very optimistic cleaning intended for names that may be very dirty.

Parameters:
name - To normalize
Returns:
The normalized name

main

public static void main(String[] args)

normalize

public static String normalize(String name)
Carefully normalizes a scientific name, staying as close to the original as possible. In particular, the string is normalized as follows:
- adds commas in front of years
- trims whitespace around hyphens
- unescapes unicode chars \\uhhhh, \\nnn, \xhh
- pads whitespace around &
- adds whitespace after dots following a genus abbreviation or rank marker
- keeps whitespace before opening and after closing brackets
- removes whitespace inside brackets
- removes whitespace before commas
- normalizes the hybrid marker to the ascii multiplication sign
- removes whitespace between the hybrid marker and the following name part if it is NOT a hybrid formula
- trims the string and replaces multiple whitespace characters with a single space
- capitalizes words written entirely in upper case (authors are often found in upper case only)

Parameters:
name - To normalize
Returns:
The normalized name
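
A few of the steps listed above can be sketched with plain java.util.regex. This is only an illustration under stated assumptions, not the NameParser implementation: the class NormalizeSketch, its patterns and the order of the steps are hypothetical, and only the whitespace, ampersand, comma, bracket and year rules are covered.

```java
import java.util.regex.Pattern;

/** Illustrative sketch only; class name, patterns and step order are assumptions. */
final class NormalizeSketch {
    private static final Pattern WS_AROUND_HYPHEN = Pattern.compile("\\s*-\\s*");
    private static final Pattern AMPERSAND = Pattern.compile("\\s*&\\s*");
    private static final Pattern WS_BEFORE_COMMA = Pattern.compile("\\s+,");
    private static final Pattern WS_AFTER_OPEN = Pattern.compile("\\(\\s+");
    private static final Pattern WS_BEFORE_CLOSE = Pattern.compile("\\s+\\)");
    // Whitespace between a letter or dot and a four-digit year starting with 1 or 2.
    private static final Pattern WS_BEFORE_YEAR =
            Pattern.compile("(?<=[\\p{L}.])\\s+(?=[12]\\d{3}\\b)");
    private static final Pattern MULTI_WS = Pattern.compile("\\s+");

    static String normalize(String name) {
        if (name == null) {
            return null;
        }
        String s = name;
        s = WS_AROUND_HYPHEN.matcher(s).replaceAll("-");   // trim whitespace around hyphens
        s = AMPERSAND.matcher(s).replaceAll(" & ");        // pad whitespace around &
        s = WS_BEFORE_COMMA.matcher(s).replaceAll(",");    // remove whitespace before commas
        s = WS_AFTER_OPEN.matcher(s).replaceAll("(");      // remove whitespace inside brackets
        s = WS_BEFORE_CLOSE.matcher(s).replaceAll(")");
        s = WS_BEFORE_YEAR.matcher(s).replaceAll(", ");    // add a comma in front of years
        return MULTI_WS.matcher(s).replaceAll(" ").trim(); // collapse and trim whitespace
    }

    public static void main(String[] args) {
        // prints "Abies alba Mill., 1768"
        System.out.println(normalize("Abies  alba  Mill.  1768"));
    }
}
```

The year rule uses a lookbehind so that a year already preceded by a comma is left untouched. The real method also handles escapes, hybrid markers and capitalization, which this sketch leaves out.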

normalizeStrong

protected static String normalizeStrong(String name)
Does the same as normalize(String) and additionally removes all ( ) and "und" etc. It also checks whether a name starts with a blacklisted name part such as "Undetermined" or "Uncertain" and, in that case, returns only the blacklisted word, so it is easy to catch names with blacklisted name parts.

Parameters:
name - To normalize
Returns:
The normalized name
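
The blacklist behaviour described above might look roughly like the following. This is a minimal sketch under assumptions, not the actual method: the class BlacklistSketch is hypothetical, the blacklist holds only the two words named in the description, the prefix check is assumed to be case-insensitive, and only the removal of parentheses is shown.

```java
/** Illustrative sketch only; the blacklist content and matching rules are assumptions. */
final class BlacklistSketch {
    // Only the two examples from the description; the real list is not documented here.
    private static final String[] BLACKLIST = {"Undetermined", "Uncertain"};

    /** Returns the blacklisted word the name starts with, or null if there is none. */
    static String blacklistedPrefix(String name) {
        String trimmed = name.trim();
        for (String word : BLACKLIST) {
            // Case-insensitive prefix comparison (an assumption of this sketch).
            if (trimmed.regionMatches(true, 0, word, 0, word.length())) {
                return word;
            }
        }
        return null;
    }

    static String strongNormalize(String name) {
        String hit = blacklistedPrefix(name);
        if (hit != null) {
            return hit; // only the blacklisted word is returned
        }
        // Otherwise drop all parentheses and tidy up whitespace.
        return name.replaceAll("[()]", "").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(strongNormalize("Undetermined Asteraceae"));
        System.out.println(strongNormalize("Puma concolor (Linnaeus, 1771)"));
    }
}
```

Returning the bare blacklisted word lets a caller detect such names with a simple equality check on the result.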

preClean

protected static String preClean(String name)
Basic, careful cleaning that tries to preserve all parsable name parts.

Parameters:
name - The name to clean
Returns:
The cleaned name

addMonomials

public void addMonomials(Set<String> monomials)

getMonomials

public Set<String> getMonomials()

parse

public <T> ParsedName<T> parse(String scientificName)
                    throws UnparsableException
Fully parses the supplied name, also trying to extract authorships. It ignores a conceptual sec reference, remarks and notes on the nomenclatural status; use parseExtended(String) if you need these. In some cases authorship parsing proves impossible and this parser will return null. If you do not need the authorship, use parseIgnoreAuthors(String) instead, which is more robust.
- canonical name
- final authorship
- sec/auct "concept" references
- nomenclatural status remarks
- other informal remarks

Parameters:
scientificName - The scientific name to parse
Returns:
The parsed name
Throws:
UnparsableException

parseToCanonical

public String parseToCanonical(String scientificName)
Parses the name without authorship and returns the ParsedName.canonicalName() string.

Parameters:
scientificName - The scientific name to parse
Returns:
The canonical name string

readMonomialsRsGbifOrg

public void readMonomialsRsGbifOrg()
Reads generic and suprageneric names from the rs.gbif.org dictionaries and feeds them into the name parser as monomial references. Used to better disambiguate subgenera/genera and authors.


setMonomials

public void setMonomials(Set<String> monomials)
Provides a set of case-insensitive words that indicate a true monomial, used to detect a taxonomic subrank instead of an author. For example, in "Chordata Vertebrata" the second word is a subrank, not an author.

Parameters:
monomials - The set of monomial words
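
One way to picture the case-insensitive lookup is a sorted set with a case-insensitive comparator. The sketch below is an assumption for illustration and does not show how NameParser stores its monomials: the class MonomialSketch and the method secondTokenIsMonomial are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

/** Illustrative sketch only; not how NameParser keeps its monomial dictionary. */
final class MonomialSketch {
    // The comparator makes contains() ignore case.
    private final Set<String> monomials = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);

    void addMonomials(Set<String> more) {
        monomials.addAll(more);
    }

    /** True if the second token is a known monomial and so should not be read as an author. */
    boolean secondTokenIsMonomial(String name) {
        String[] tokens = name.trim().split("\\s+");
        return tokens.length >= 2 && monomials.contains(tokens[1]);
    }

    public static void main(String[] args) {
        MonomialSketch sketch = new MonomialSketch();
        sketch.addMonomials(java.util.Collections.singleton("vertebrata"));
        System.out.println(sketch.secondTokenIsMonomial("Chordata Vertebrata"));
        System.out.println(sketch.secondTokenIsMonomial("Chordata Haeckel"));
    }
}
```

With "vertebrata" registered, the second token of "Chordata Vertebrata" is recognised as a monomial, while an unregistered word such as "Haeckel" is not.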


Copyright © 2012 Global Biodiversity Information Facility. All Rights Reserved.