WordStat 5.0 New Features


  • A new pre-processing option allows users to create their own text pre-processing EXE or DLL (a sample English Porter stemmer and an n-gram transformation are included).
  • A new lemmatization monitoring dialog allows reviewing substitutions and overriding existing ones with custom substitutions.
  • Disambiguation rules with Boolean (AND, OR, NOT) and proximity operators (NEAR, AFTER, BEFORE) may now be added to categorization dictionaries.
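
As an illustration of what an n-gram transformation does to a text, here is a minimal Python sketch. It is not the sample shipped with WordStat and does not use WordStat's EXE/DLL interface; the function name `word_ngrams` and the underscore joiner are assumptions made for this example.

```python
def word_ngrams(text, n=2):
    """Return the word n-grams of a text, each joined with an underscore.

    Hypothetical helper for illustration only; tokenization is a plain
    lowercase whitespace split.
    """
    tokens = text.lower().split()
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(word_ngrams("the quick brown fox", n=2))
```

A text shorter than n words yields an empty list.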

  • "As shown" level setting allows setting the categorization level to the way the dictionary tree is displayed.
  • Ability to create unbreakable categories (overriding the level setting).
  • Categorization dictionaries may now be printed or exported to XML.
  • Ability to merge existing dictionaries.
  • Improved contextual menus for faster dictionary editing.


  • A new option allows one to include cases with missing values on independent variables (overriding the existing listwise exclusion).
  • Feature to select a variable to weight cases.
  • New threshold to remove items occurring in more than a specified % of cases.
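
The case-percentage threshold can be pictured with the following Python sketch. It assumes each case is a list of items (keywords) and that an item is dropped when it occurs in strictly more than the given percentage of cases; the function name, the parameter name and the strict comparison are assumptions, not WordStat's documented behavior.

```python
def drop_common_items(case_items, max_case_pct=50.0):
    """Remove items that occur in more than max_case_pct percent of cases.

    case_items: list of cases, each a list of item strings (assumed layout).
    """
    n_cases = len(case_items)
    case_counts = {}
    for items in case_items:
        for item in set(items):  # count each item once per case
            case_counts[item] = case_counts.get(item, 0) + 1
    too_common = {
        item for item, count in case_counts.items()
        if 100.0 * count / n_cases > max_case_pct
    }
    return [[item for item in items if item not in too_common]
            for items in case_items]


cases = [["a", "b"], ["a", "c"], ["a", "d"], ["b"]]
print(drop_common_items(cases, max_case_pct=50.0))
```

In this example "a" occurs in 75% of cases and is removed, while "b" occurs in exactly 50% and is kept.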


  • Ability to create files of keyword frequency norms and compare existing frequencies to previously saved norm files.
  • An entirely new keyword retrieval dialog allows one to extract documents, paragraphs or sentences matching user-defined combinations of keywords. Retrieved text segments may optionally be tagged using QDA Miner codes.


  • The full categorization process may now be stored on disk and applied to documents using a standalone utility program (WS Document Classifier) or an optional DLL and command line program.
  • Optional colored grid lines.
  • Included items may be removed temporarily from further analysis.
  • Added TF*IDF column (term frequency x inverse document frequency).
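
TF*IDF has several variants, and the page does not say which one WordStat uses. The Python sketch below shows one common form, assuming raw term counts for TF and the natural log of (number of documents / number of documents containing the term) for IDF. The function name `tf_idf` and the input layout are illustrative.

```python
import math


def tf_idf(docs):
    """Compute TF*IDF per document.

    docs: list of token lists. Returns one {term: score} dict per document.
    Assumed variant: raw count * ln(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = {}
    for tokens in docs:
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    scores = []
    for tokens in docs:
        term_freq = {}
        for term in tokens:
            term_freq[term] = term_freq.get(term, 0) + 1
        scores.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return scores


docs = [["cat", "cat", "dog"], ["dog", "bird"]]
for doc_scores in tf_idf(docs):
    print(doc_scores)
```

Under this variant a term that appears in every document (here "dog") scores zero, which is why the measure highlights terms that are frequent in a few documents but rare overall.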


  • Naive Bayes and k-nearest neighbors classification methods applied on occurrences, frequencies, percentage of words, etc.
  • Feature selection and feature weighting.
  • Cross-validation methods (leave-one-out, n-fold, split sample).
  • Batch experiment module and history charts for model optimization.
  • Document classification on single texts, list of documents or database.
  • The classification model may be stored on disk and applied to external documents using a standalone utility program (WS Document Classifier) or an optional DLL and command line program.
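
To make the cross-validation options more concrete, here is a minimal Python sketch of how n-fold splits can be generated, with leave-one-out as the special case where the number of folds equals the number of cases. It is a generic illustration rather than WordStat's procedure; the function name is hypothetical and no shuffling is done, so cases are assigned to folds in round-robin order.

```python
def n_fold_splits(n_cases, n_folds):
    """Yield (train_indices, test_indices) pairs for n-fold cross-validation.

    Each case index appears in exactly one test fold.
    """
    if not 2 <= n_folds <= n_cases:
        raise ValueError("n_folds must be between 2 and n_cases")
    indices = list(range(n_cases))
    for fold in range(n_folds):
        test = indices[fold::n_folds]  # every n_folds-th case, offset by fold
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


for train, test in n_fold_splits(6, 3):
    print("train:", train, "test:", test)
```

Calling `n_fold_splits(n, n)` gives leave-one-out splits: each case is tested once against a model trained on all the others.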


  • Optional colored grid lines.
  • Included items may be removed temporarily from further analysis.


  • The KWIC page may now be detached and displayed as a stay-on-top dialog.


  • Phrase finder page has been moved to the feature extraction page.
  • A new Unknown Words finder allows one to quickly identify misspelled words, acronyms, technical words, proper nouns and either replace, ignore or assign them to the categorization dictionary.


  • Added probabilistic versions of Jaccard and Sorensen (or Dice) coefficients.
  • Added second order clustering of keywords (based on the similarity of co-occurrence patterns rather than mere co-occurrences).
  • Ability to select a single cluster and retrieve associated documents.
  • New option to hide single item clusters in dendrograms and multidimensional scaling plots.
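
For reference, the classical (non-probabilistic) Jaccard and Sorensen (Dice) coefficients can be computed from the sets of cases in which two keywords occur, as in the Python sketch below. The probabilistic versions mentioned above are not shown, since this page does not define them; the function names and the convention of returning 0.0 for two empty sets are assumptions.

```python
def jaccard(cases_a, cases_b):
    """Jaccard coefficient: shared cases / cases containing either keyword."""
    a, b = set(cases_a), set(cases_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sorensen_dice(cases_a, cases_b):
    """Sorensen (Dice) coefficient: 2 * shared cases / sum of both counts."""
    a, b = set(cases_a), set(cases_b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


# Case IDs in which each keyword occurs (made-up example data).
keyword_x = {1, 2, 3}
keyword_y = {2, 3, 4}
print(jaccard(keyword_x, keyword_y))
print(sorensen_dice(keyword_x, keyword_y))
```

Both coefficients ignore cases where neither keyword occurs; Sorensen gives double weight to co-occurrences, so it is never smaller than Jaccard for the same pair.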


  • New option to retrieve documents or text segments containing two specific keywords.


  • The document conversion wizard can now extract text from PDF files.
  • Categorization models and classification rules may be saved on disk.
  • "Anchor to floor" lines on 3D charts (MDS and correspondence plot).
  • WS Document Classifier, a small standalone application to apply previously saved categorization and classification models to external documents.
  • Separately sold DLL and command line versions of WordStat for standalone content analysis and automatic classification of documents (not available yet).
  • Major speed improvements. The table below compares processing times in v4 and v5. The test was performed on a 1.2 GHz Pentium III computer.

    Task                                                             v4        v5       Speed gain*
    Word frequency of 11,314 newsgroup messages (3,249,029 words)    5m 52s    2m 59s   x2.0
    - with lemmatization & stop list                                 6m 45s    3m 11s   x2.1
    - categorized using Regressive Imagery Dictionary (RID)          10m 4s    2m 24s   x4.2
    - categorized using Linguistic Inquiry and Word Count (LIWC)     10m 52s   2m 52s   x3.8

* Speed improvements may differ on other computers and in the final v5.0 release.

New features in version 4 can be viewed here.
