Content Analysis and Text Mining

WordStat Features


  • Content analysis on short alphanumeric variables (up to 255 characters) and longer ANSI or RTF documents (several MB).
  • Dictionary-moderated lemmatization and stemming (English, French, Italian and Spanish; contact us for other languages).
  • Ability to call an external text pre-processing EXE or DLL (a sample English Porter stemmer and an n-gram transformation are included).
  • Optional exclusion of pronouns, conjunctions, etc., through user-defined exclusion lists (stop lists).
  • Categorization of words or phrases using existing or user-defined dictionaries.
  • Word categorization based on Boolean (AND, OR, NOT) and proximity rules (NEAR, AFTER, BEFORE).
  • Word and phrase substitution and scoring using wildcards and weighting.
  • Frequency analysis on keywords, phrases, derived categories or concepts, or user-defined codes entered manually within a text.
  • Interactive development and easy maintenance of hierarchical dictionaries, taxonomies, or categorization schema.
  • Drag-and-drop editor for easy assignment of words and phrases to categories.
  • Ability to restrict the analysis to specific portions of a text or to exclude comments and annotations.
  • Ability to perform an analysis on a random sample of cases.
  • Integrated spell-checking with support for different languages such as English, French, Spanish, etc.
  • Integrated thesaurus (English only) to assist the creation of taxonomies and comprehensive categorization schemas.
  • Powerful case filtering on any numeric or alphanumeric field and on code occurrence (with AND, OR, and NOT Boolean operators).
  • Prints presentation-quality tables.
  • Imports MS Word, WordPerfect, RTF and HTML.
  • Exports any table to Excel, ASCII, Tab separated or comma separated value files, or HTML files.
  • Flexible keyword highlighting (the text editor can display all categories using different colors).
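
As an illustration of how dictionary-based categorization with an exclusion list and wildcards can work in principle, here is a minimal Python sketch. It is not WordStat's implementation: the stop list, the dictionary entries, and the function names `tokenize` and `categorize` are all hypothetical, and the wildcard matching simply relies on Python's standard `fnmatch` module.

```python
import re
from collections import Counter
from fnmatch import fnmatchcase

# Hypothetical exclusion (stop) list and categorization dictionary.
STOP_WORDS = {"the", "a", "and", "of", "is", "was", "but"}
DICTIONARY = {
    "POSITIVE": ["good", "excel*", "happ*"],
    "NEGATIVE": ["bad", "poor*", "terribl*"],
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def categorize(text, dictionary=DICTIONARY, stop_words=STOP_WORDS):
    """Count how many tokens fall into each dictionary category."""
    counts = Counter()
    for token in tokenize(text):
        if token in stop_words:
            continue  # excluded words are never categorized
        for category, patterns in dictionary.items():
            # A trailing * matches any ending, including an empty one.
            if any(fnmatchcase(token, pattern) for pattern in patterns):
                counts[category] += 1
    return counts

counts = categorize(
    "The service was excellent and the staff happy, but the food was poor."
)
print(dict(counts))
```

In this sketch, "excellent" and "happy" are counted under POSITIVE and "poor" under NEGATIVE; Boolean and proximity rules are deliberately left out to keep the example short.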



  • Univariate word frequency analysis (word or category count and record occurrence).
  • Word x word co-occurrence matrix.
  • Word x case data matrix.
  • Integrated multidimensional scaling with 2D and 3D maps.
  • Proximity plot.
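
To make the two matrix types above concrete, the following Python sketch builds a small word x case matrix and a word x word co-occurrence count from three made-up cases. The data and variable names are assumptions for illustration only, and co-occurrence is defined here as "both words appear in the same case", which is just one of several possible criteria.

```python
from collections import Counter
from itertools import combinations

# Hypothetical cases (e.g. three short open-ended responses).
cases = [
    "price quality service",
    "price service",
    "quality design",
]

vocabulary = sorted({word for case in cases for word in case.split()})

# Word x case matrix: one row per word, one column per case, cells are counts.
word_by_case = {
    word: [case.split().count(word) for case in cases] for word in vocabulary
}

# Word x word co-occurrence: number of cases in which both words appear.
cooccurrence = Counter()
for case in cases:
    for first, second in combinations(sorted(set(case.split())), 2):
        cooccurrence[(first, second)] += 1

for word in vocabulary:
    print(f"{word:<8}", word_by_case[word])
print(dict(cooccurrence))
```

Pairs are stored in alphabetical order, so a lookup such as `cooccurrence[("price", "service")]` must follow the same ordering.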




  • Topic modeling tool automatically extracts topics by applying factor analysis to word x segment matrices.
  • Vocabulary finder extracts technical terms, product and company names as well as common misspellings.
  • Pattern-based named-entity extraction.
  • Phrase finder allows one to easily identify recurring phrases and expressions.
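
The general idea behind finding recurring phrases can be sketched as counting word n-grams and keeping those that repeat. The Python example below is a simplified illustration under that assumption, not a description of the phrase finder's actual algorithm; the function name `recurring_phrases` and its parameters are hypothetical, and sentence boundaries are ignored for brevity.

```python
import re
from collections import Counter

def recurring_phrases(text, min_words=2, max_words=3, min_count=2):
    """Return word n-grams that occur at least min_count times."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for n in range(min_words, max_words + 1):
        for start in range(len(tokens) - n + 1):
            counts[" ".join(tokens[start:start + n])] += 1
    return {phrase: count for phrase, count in counts.items() if count >= min_count}

text = (
    "Customer service was slow. I called customer service twice. "
    "The customer service desk was closed."
)
phrases = recurring_phrases(text)
print(phrases)
```

Only "customer service" repeats in this sample text, so it is the only phrase reported.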



  • Ability to create norm files based on frequency analysis of words or content categories.
  • Comparison of obtained frequencies to previously saved norm files.


  • A powerful keyword retrieval function allows identification of text units (documents, paragraphs or sentences) containing one keyword or a combination of keywords, with optional filtering of cases.
  • Ability to attach QDA Miner codes to retrieved segments.
  • Retrieved segments may be exported to disk in tabular format (Excel or delimited text files) or as text reports (Rich Text Format).



  • Integrated clustering and dendrogram display of keyword co-occurrence.
  • First- and second-order proximity analysis.
  • Proximity plot to easily identify all keywords that co-occur with a target keyword.
  • 2D and 3D multidimensional scaling on either joint frequency or co-occurrence of words or categories.
  • Flexible keyword co-occurrence criteria (within a case, a sentence, a paragraph, a window of n words, a user-defined segment) as well as clustering methods (first- and second-order proximity, choice of similarity measures).
  • Easy text retrieval from dendrogram or proximity plots.


  • Hierarchical clustering, multidimensional scaling and proximity plot may be used to explore the similarity between documents or cases.


  • Can perform univariate frequency analysis and crosstabulation on information stored in several alphanumeric fields (memo or string variables).
  • Comparison of keyword occurrence between different fields.
  • Computes inter-rater agreement measures (percentage of agreement, Cohen's Kappa, Scott's Pi, Krippendorff's R and r-bar, free marginal) based on codes manually entered in different variables.
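
For readers unfamiliar with the agreement measures listed above, the following Python sketch computes two of them, percentage of agreement and Cohen's Kappa, using the standard textbook definitions. The codes assigned by the two raters are invented for illustration, and the function names are hypothetical rather than part of any WordStat interface.

```python
from collections import Counter

def percent_agreement(codes_a, codes_b):
    """Proportion of cases on which both raters assigned the same code."""
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)

def cohens_kappa(codes_a, codes_b):
    """Kappa = (p_observed - p_expected) / (1 - p_expected).

    p_expected is the chance agreement implied by each rater's own
    marginal code frequencies. Undefined when p_expected equals 1.
    """
    n = len(codes_a)
    p_observed = percent_agreement(codes_a, codes_b)
    freq_a = Counter(codes_a)
    freq_b = Counter(codes_b)
    p_expected = sum(freq_a[code] * freq_b[code] for code in freq_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical codes entered by two raters for the same ten cases.
rater_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "pos"]
rater_2 = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "pos", "pos"]

agreement = percent_agreement(rater_1, rater_2)
kappa = cohens_kappa(rater_1, rater_2)
print(f"agreement = {agreement:.2f}, kappa = {kappa:.3f}")
```

With these sample codes the raters agree on 8 of 10 cases, and chance agreement is 0.52, which gives a Kappa of about 0.583.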


  • Bivariate comparison between any textual field and any nominal or ordinal variable (such as the sex of the respondent, specific subgroups, years of publication, etc.).
  • Choice between 11 different association measures to assess the relationship between word occurrence and nominal or ordinal variables (Chi-square, Likelihood ratio, Tau-a, Tau-b, Tau-c, symmetric Somers' D, asymmetric Somers' Dxy and Dyx, Gamma, Pearson's R, Spearman's Rho).
  • Computation of statistics on either absolute or relative frequencies.
  • Ability to sort the matrix alphabetically by word, by word frequency or word occurrence, by the obtained statistic, or by its probability.
  • Visually compare items between subgroups using bar charts and line charts.
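
As a worked example of one of the association measures named above, this Python sketch computes the Pearson Chi-square statistic for a keyword's occurrence across two subgroups. The contingency table is invented, the function name `chi_square` is hypothetical, and no continuity correction or p-value is computed.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic

# Hypothetical counts of cases.
# Rows: keyword present / keyword absent; columns: subgroup A / subgroup B.
table = [
    [30, 10],
    [20, 40],
]
stat = chi_square(table)
print(f"chi-square = {stat:.3f}")
```

Here the expected counts are 20, 20, 30 and 30, so the statistic is 5 + 5 + 3.333 + 3.333, or about 16.667 on one degree of freedom.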


  • Correspondence analysis (statistics, 2D & 3D joint plots). This feature is accessible from the crosstab page and allows one to see graphically the relationship between nominal variables and codes resulting from a content analysis.
  • Heatmap plot (with dual clustering of keywords and variables).



  • Machine learning algorithms (Naive Bayes and K-Nearest Neighbors) for document classification.
  • Flexible feature selection for automatic selection of best subsets of attributes.
  • Numerous validation methods (leave-one-out, n-fold cross-validation, split sample).
  • Experimentation module allows easy comparison of predictive models and fine-tuning of classification models.
  • Classification models may be saved to disk and applied later using a standalone document classification utility program, a command line program, or a programming library. Note: the command line program and the programming library are part of the WordStat Software Developer's Kit (SDK), which is sold separately.
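
To show what a Naive Bayes document classifier and a hold-one-out validation loop look like in outline, here is a compact Python sketch. It is a generic multinomial Naive Bayes with add-one smoothing written for illustration, not the WordStat classifier or its SDK; the training documents, labels, and function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(documents):
    """documents: list of (tokens, label) pairs. Returns the fitted counts."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in documents:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_docs, word_counts, vocabulary

def predict(model, tokens):
    """Return the label with the highest log posterior score."""
    class_docs, word_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def leave_one_out_accuracy(documents):
    """Hold out each document in turn, train on the rest, and score the hits."""
    hits = 0
    for i, (tokens, label) in enumerate(documents):
        model = train(documents[:i] + documents[i + 1:])
        hits += predict(model, tokens) == label
    return hits / len(documents)

# Hypothetical labelled documents, already tokenized.
docs = [
    ("cheap fast delivery".split(), "praise"),
    ("fast friendly service".split(), "praise"),
    ("friendly cheap fast".split(), "praise"),
    ("late broken item".split(), "complaint"),
    ("broken late refund".split(), "complaint"),
    ("refund late broken".split(), "complaint"),
]

model = train(docs)
print(predict(model, "fast cheap".split()))
print(leave_one_out_accuracy(docs))
```

The toy data is cleanly separable, so the validation score is not meaningful in itself; the point is the shape of the train, predict, and validate steps.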



  • Ability to display a KWIC table to examine the textual context of a word, word pattern, or category.
  • Ability to sort the table on any independent (numeric) variable.
  • Ability to jump from a KWIC keyword to the textual variable in order to view or edit the original text.
  • KWIC list can be saved in data files for further processing.
  • Customizable KWIC display (paragraph, sentence or user-defined segment).
  • Concordance report (displays all hits as a list of paragraphs, sentences or user-defined segments).
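
A keyword-in-context (KWIC) table lists each occurrence of a keyword together with the words surrounding it. The short Python sketch below shows the basic idea with a fixed window of words on each side; the function name `kwic`, the `width` parameter, and the sample text are assumptions made for this example.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit."""
    tokens = re.findall(r"\w+", text)
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows

text = "The price was fair. Nobody complained about the price of the upgrade."
rows = kwic(text, "price", width=2)
for left, hit, right in rows:
    print(f"{left:>12} | {hit} | {right}")
```

This sketch cuts the context at a fixed number of words and ignores punctuation; a display based on sentences or paragraphs would segment the text first and return the whole unit around each hit.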



  • Alphanumeric variables can be stored in the same file as all other numeric variables.
  • Variable selection, statistical analysis and content analysis are performed within the same application program.
  • Matrix outputs are automatically added to existing statistical outputs.
  • New variables representing occurrence of words, keywords or concepts can be added to the existing data file or exported to a new data file in order to be submitted to further statistical analysis (such as cluster analysis on words or cases, principal coordinate analysis, correspondence analysis, multiple regression, etc.).
  • Data can be imported from and exported to different file formats including dBase, Paradox, Excel, Quattro Pro, Lotus 1-2-3, SPSS for DOS, SPSS for Windows, comma or tab separated text files, etc.
  • Ability to perform numeric and alphanumeric transformations or to apply filters on records of the data file to restrict the analysis to specific subgroups.


  • Dictionary building assistant to find related words (synonyms, antonyms, holonyms, meronyms, hypernyms, hyponyms) in a WordNet-based thesaurus (English only; 100,000 synonyms, 120,000 root words).


  • WS Document Classifier, a small standalone application to apply previously saved categorization and classification models to external documents.
  • WSTOOLS - Utility program to easily import documents of any size into Simstat database files.
    • Various file formats may be directly imported such as:
      • Plain text (with optional DOS ASCII to Windows ANSI conversion)
      • HTML (with or without removal of HTML tags)
      • RTF
      • MS Word
      • WordPerfect
      • Adobe PDF
    • Optional removal of leading and trailing spaces and hard returns.
    • Extraction of numeric and alphanumeric variables from documents.
    • Extraction options may be saved on disk and later retrieved.
    • Documents may be stored as plain ANSI text or as RTF documents.

About KCS

Kovach Computing Services (KCS) was founded in 1993 by Dr. Warren Kovach. The company specializes in the development and marketing of inexpensive and easy-to-use statistical software for scientists, as well as in data analysis consulting.


Get in Touch

  • Address:
    85 Nant y Felin
    Pentraeth, Isle of Anglesey
    LL75 8UY
    United Kingdom