WordStat 5 New Features

A new pre-processing option lets users plug in their own text pre-processing EXE or DLL (a sample English Porter stemmer and an n-grams transformation are included).
A new lemmatization monitoring dialog allows reviewing substitutions and overriding existing ones with custom substitutions.
Disambiguation rules with Boolean (AND, OR, NOT) and proximity operators (NEAR, AFTER, BEFORE) may now be added to categorization dictionaries.
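
To illustrate how a proximity rule can disambiguate a dictionary entry, here is a minimal Python sketch. It is not WordStat's rule syntax or implementation: the function names `near` and `before`, the token-window definition of proximity, and the sample sentence are all assumptions made for this example.

```python
def positions(tokens, word):
    """Return every index at which `word` occurs in the token list."""
    return [i for i, t in enumerate(tokens) if t == word]


def near(tokens, a, b, window):
    """True if `a` and `b` occur within `window` tokens of each other (assumed meaning of NEAR)."""
    return any(abs(i - j) <= window
               for i in positions(tokens, a)
               for j in positions(tokens, b))


def before(tokens, a, b, window):
    """True if `a` occurs at most `window` tokens ahead of `b` (assumed meaning of BEFORE)."""
    return any(0 < j - i <= window
               for i in positions(tokens, a)
               for j in positions(tokens, b))


tokens = "she deposited money at the bank near the river".split()

# Hypothetical rule: count "bank" as a financial term only if NEAR "money".
is_financial = near(tokens, "bank", "money", 3)
```

A rule combining operators, such as `near(...) and not near(...)`, follows the same pattern because each function returns a plain Boolean.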

"As shown" level setting allows setting the categorization level to the way the dictionary tree is displayed.
Ability to create unbreakable categories (overriding the level setting).
Categorization dictionaries may now be printed or exported to XML.
Ability to merge existing dictionaries.
Improved contextual menus for faster dictionary editing.

A new option allows one to include cases with missing values on independent variables (overriding the existing listwise exclusion).
Feature to select a variable to weight cases.
New threshold to remove items occurring in more than a specified % of cases.
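
The case-percentage threshold can be pictured with a short Python sketch. This is only an illustration under assumptions: the function name `drop_common_items`, the token-list input format, and the "greater than" comparison are choices made here, not details taken from WordStat.

```python
from collections import Counter


def drop_common_items(docs, max_case_pct):
    """Return the items that occur in at most `max_case_pct` percent of cases.

    `docs` is a list of token lists, one per case (assumed input format).
    """
    n_cases = len(docs)
    case_counts = Counter()
    for tokens in docs:
        # Count each item once per case, however often it repeats in that case.
        case_counts.update(set(tokens))
    return {item for item, count in case_counts.items()
            if 100.0 * count / n_cases <= max_case_pct}


docs = [["the", "cat"], ["the", "dog"], ["the", "cat", "sat"], ["a", "bird"]]
kept = drop_common_items(docs, 60)  # "the" occurs in 75% of cases and is dropped
```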

Ability to create files of keyword frequency norms and compare existing frequencies to previously saved norm files.
An entirely new keyword retrieval dialog allows one to extract documents, paragraphs or sentences containing user-defined combinations of keywords. Retrieved text segments may optionally be tagged using QDA Miner codes.

The full categorization process may now be stored on disk and applied to documents using a standalone utility program (WS Document Classifier) or an optional DLL and command line program.
Optional colored grid lines.
Included items may be removed temporarily from further analysis.
Added TF*IDF column (term frequency x inverse document frequency).
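
As a rough illustration of the TF*IDF statistic, the sketch below multiplies each term's total frequency by the base-10 logarithm of the inverse document frequency. The exact weighting WordStat uses is not stated here, so the formula, the function name `tf_idf`, and the sample data are assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {term: tf * log10(N / df)} for a list of token lists.

    tf = total occurrences of the term, df = number of documents containing it,
    N = number of documents. This is one common variant, assumed for illustration.
    """
    n_docs = len(docs)
    tf = Counter()
    df = Counter()
    for tokens in docs:
        tf.update(tokens)        # every occurrence counts toward tf
        df.update(set(tokens))   # each document counts once toward df
    return {term: tf[term] * math.log10(n_docs / df[term]) for term in tf}


docs = [["a", "b"], ["a", "c"], ["a", "b", "b"]]
scores = tf_idf(docs)
```

Under this variant a term that appears in every document, such as "a" above, scores zero, which is why the statistic is useful for spotting terms that discriminate between documents.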

Naive Bayes and k-nearest neighbors classification methods applied on occurrences, frequencies, percentage of words, etc.
Feature selection and feature weighting.
Cross-validation methods (leave-one-out, n-fold, split sample).
Batch experiment module and history charts for model optimization.
Document classification on single texts, lists of documents or databases.
The classification model may be stored on disk and applied to external documents using a standalone utility program (WS Document Classifier) or an optional DLL and command line program.
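
The combination of a Naive Bayes classifier with cross-validation can be sketched in a few lines of Python. This is a generic multinomial Naive Bayes with add-one smoothing, written for illustration only; it makes no claim about how WordStat implements either method, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Collect per-class document counts, per-class word counts and the vocabulary."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for t in tokens:
            if t in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][t] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def n_fold_accuracy(docs, labels, n_folds):
    """n-fold cross-validation; n_folds == len(docs) gives leave-one-out."""
    correct = 0
    for fold in range(n_folds):
        test_idx = [i for i in range(len(docs)) if i % n_folds == fold]
        train_idx = [i for i in range(len(docs)) if i % n_folds != fold]
        model = train_nb([docs[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct += sum(predict_nb(model, docs[i]) == labels[i] for i in test_idx)
    return correct / len(docs)


docs = [["goal", "match"], ["match", "team"], ["vote", "party"],
        ["party", "election"], ["goal", "team"], ["vote", "election"]]
labels = ["sport", "sport", "politics", "politics", "sport", "politics"]
loo_accuracy = n_fold_accuracy(docs, labels, len(docs))
```

Each document is held out in turn, the model is trained on the rest, and accuracy is the share of held-out documents classified correctly.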

Phrase finder page has been moved to the feature extraction page.
A new Unknown Words finder allows one to quickly identify misspelled words, acronyms, technical words, proper nouns and either replace, ignore or assign them to the categorization dictionary.

Added probabilistic versions of Jaccard and Sorensen (or Dice) coefficients.
Added second order clustering of keywords (based on the similarity of co-occurrence patterns rather than mere co-occurrences).
Ability to select a single cluster and retrieve associated documents.
New option to hide single item clusters in dendrograms and multidimensional scaling plots.
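
For readers unfamiliar with these measures, the sketch below computes the standard (non-probabilistic) Jaccard and Sorensen (Dice) coefficients, and contrasts first-order co-occurrence with a second-order similarity based on co-occurrence patterns. The probabilistic versions are not reproduced because their formulas are not given here. The cosine comparison of Jaccard profiles is one assumed way to express "similarity of co-occurrence patterns", and all names and data are hypothetical.

```python
import math


def jaccard(cases_a, cases_b):
    """Jaccard: cases containing both keywords / cases containing either."""
    either = len(cases_a | cases_b)
    return len(cases_a & cases_b) / either if either else 0.0


def sorensen_dice(cases_a, cases_b):
    """Sorensen (Dice): 2 * shared cases / (cases with a + cases with b)."""
    total = len(cases_a) + len(cases_b)
    return 2 * len(cases_a & cases_b) / total if total else 0.0


def second_order(keyword_cases, a, b):
    """Cosine similarity between the co-occurrence profiles of `a` and `b`.

    Each profile holds the keyword's Jaccard similarity with every other keyword.
    """
    others = [k for k in keyword_cases if k not in (a, b)]
    va = [jaccard(keyword_cases[a], keyword_cases[k]) for k in others]
    vb = [jaccard(keyword_cases[b], keyword_cases[k]) for k in others]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
    return dot / norm if norm else 0.0


# Each keyword maps to the set of case numbers in which it occurs.
keyword_cases = {
    "doctor": {1, 2},
    "physician": {3, 4},
    "hospital": {1, 2, 3, 4},
    "stadium": {5},
}
first = jaccard(keyword_cases["doctor"], keyword_cases["physician"])
second = second_order(keyword_cases, "doctor", "physician")
```

In this toy data "doctor" and "physician" never appear in the same case, so their first-order similarity is zero, yet both co-occur with "hospital" in the same way, so their second-order similarity is high.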

New option to retrieve documents or text segments containing two specific keywords.

The document conversion wizard can now extract text from PDF files.
Categorization models and classification rules may be saved on disk.
"Anchor to floor" lines on 3D charts (MDS and correspondence plot).
WS Document Classifier, a small standalone application to apply previously saved categorization and classification models to external documents.
Separately sold DLL and command line versions of WordStat for standalone content analysis and automatic classification of documents (not available yet).
Major speed improvements. The table below compares processing times between v4 and v5. The test was performed on a 1.2 GHz Pentium III computer.

TASK	VERSION 4.0	VERSION 5.0*	SPEED IMPROVEMENT
Word frequency of 11,314 newsgroup messages (3,249,029 words)	5m 52s	2m 59s	x2.0
- with lemmatization & stop list	6m 45s	3m 11s	x2.1
- categorized using Regressive Imagery Dictionary (RID)	10m 4s	2m 24s	x4.2
- categorized using Linguistic Inquiry and Word Count (LIWC)	10m 52s	2m 52s	x3.8