WSD Progress

  • (11/19) SemCor: WordNet-tagged WSD test set
    • (11/19) File parser (semcor.php)
      • (11/21) Sense keys embed the WordNet POS as their first number (1 = noun, n; 2 = verb, v; 3 = adjective, a; 4 = adverb, r; 5 = adjective satellite, s)
      • (11/21) Some words have multiple possible senses (marked as #1;#2 in number/sense)
    • (11/21) Mass parser (semcor-tags.php)
      • (11/21) Brown1: 11,058 sentences; 102,901 verifiable words
      • (11/21) Brown2: 8,847 sentences; 82,368 verifiable words
    • (11/22) Query extractor (wn-semcor.php)
      • (11/22) Takes output from the parser (semcor.php/semcor-tags.php) and references a SQLite WordNet database (wnlexical.php)
      • (11/22) Validates all queries; simplifies each one down to word, part-of-speech, options (synsets with tag-count and gloss), and "correct" options (synsets)
      • (11/24) Produces MySQL/SQLite statements to facilitate data analysis
    • (11/24) Ambiguity analysis (report_wsd1.php)
    • (11/25) Part-of-Speech analysis (report_wsd0.php)
    • (11/25) Frequency analysis (report_wsd2.php)
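
The POS digit and multiple-sense notes under the file parser can be sketched as follows. This is a minimal Python illustration, not the code in semcor.php. It assumes SemCor-style attribute values such as lexsn (for example 1:03:00::) and wnsn (for example 1;2), with multiple senses joined by a semicolon in both; the function names are hypothetical.

```python
# Leading digit of a WordNet sense key -> (part of speech, WordNet letter).
SS_TYPE = {
    1: ("noun", "n"),
    2: ("verb", "v"),
    3: ("adjective", "a"),
    4: ("adverb", "r"),
    5: ("adjective satellite", "s"),
}


def pos_from_lexsn(lexsn):
    """Return (name, letter) for one sense key fragment such as '1:03:00::'."""
    first = lexsn.split(":", 1)[0]
    return SS_TYPE[int(first)]


def split_senses(wnsn, lexsn):
    """Pair sense numbers with sense keys when several are joined by ';'.

    Assumption: both attributes list the alternatives in the same order.
    """
    numbers = [int(n) for n in wnsn.split(";")]
    keys = lexsn.split(";")
    if len(numbers) != len(keys):
        raise ValueError("sense numbers and sense keys do not line up")
    return list(zip(numbers, keys))


if __name__ == "__main__":
    print(pos_from_lexsn("2:38:00::"))
    print(split_senses("1;2", "1:03:00::;1:04:00::"))
```

A word with a single sense simply comes back as a one-element list, so downstream code can treat every word the same way.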
  • (11/21) Extended WN-LEXICAL conversion script to support SQLite and MySQL (wnlexical.php)
    • (11/21) WN-LEXICAL is dirty!
      • (11/21) Duplicate sense chunks (5,580, about 2.6%; the only difference is the chunk id):
        SELECT synset_id, w_num, word, ss_type, sense_number, COUNT(*) AS ct 
        FROM wn_chunk_s
        GROUP BY synset_id, w_num, word, ss_type, sense_number
        HAVING ct>1
        ORDER BY ct DESC, word ASC
      • (11/21) Double single quotes (now fixed in conversion script)
      • (11/21) Sense chunks that differ only in w-num and case of word (ex: "sun" vs. "Sun")
      • (11/22) Word case in sense chunks is unpredictable (added separate word-lower attribute to assist searching)
    • (11/25) Sense analysis (report_wn1.php)
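
The duplicate check and the lowercase search column noted above can be tried on a toy table. This is a sketch using Python's sqlite3 module with an in-memory database, not the wnlexical.php conversion script. The column names chunk_id and word_lower and the sample rows are assumptions made for illustration; only the columns named in the query come from the notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE wn_chunk_s (
           chunk_id INTEGER PRIMARY KEY,  -- hypothetical name for the chunk id
           synset_id INTEGER, w_num INTEGER, word TEXT,
           ss_type TEXT, sense_number INTEGER)"""
)
rows = [
    (1, 100, 1, "sun", "n", 1),
    (2, 100, 1, "sun", "n", 1),   # same as chunk 1 except for chunk_id
    (3, 100, 2, "Sun", "n", 1),   # differs only in w_num and case
    (4, 200, 1, "moon", "n", 1),
]
conn.executemany("INSERT INTO wn_chunk_s VALUES (?, ?, ?, ?, ?, ?)", rows)

# The duplicate query from the notes, unchanged apart from whitespace.
DUPLICATES = """
SELECT synset_id, w_num, word, ss_type, sense_number, COUNT(*) AS ct
FROM wn_chunk_s
GROUP BY synset_id, w_num, word, ss_type, sense_number
HAVING ct > 1
ORDER BY ct DESC, word ASC
"""
duplicates = conn.execute(DUPLICATES).fetchall()

# Add a lowercased copy of the word so lookups ignore case.
conn.execute("ALTER TABLE wn_chunk_s ADD COLUMN word_lower TEXT")
conn.execute("UPDATE wn_chunk_s SET word_lower = LOWER(word)")
matches = conn.execute(
    "SELECT chunk_id FROM wn_chunk_s WHERE word_lower = ? ORDER BY chunk_id",
    ("sun",),
).fetchall()

if __name__ == "__main__":
    print(duplicates)
    print(matches)
```

Only the exact duplicate pair is reported by the query; the "sun" / "Sun" pair is a separate issue and is found only through the lowercased column.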
  • (11/25) Porter Stemmer: to use for Lesk implementation
    • (11/25) PHP implementation passes test case