cat:tecdoc_wc

Differences

This shows you the differences between two versions of the page.

cat:tecdoc_wc [2016/02/03 12:55]
manuel
cat:tecdoc_wc [2016/02/25 14:06] (current)
Line 1: Line 1:
-====== Word count analysis ====== +delete this page
- +
-Word count analysis depends on the definition of "word", which, for source texts in the Latin alphabet, is taken to be a unit of text between word boundaries (e.g. spaces and punctuation marks). However, different tools use different rules to count words, so reports from different tools cannot be expected to match exactly. Tags (e.g. <b>) and escaped tags (e.g. &lt;b&gt;) complicate this further. For example: +
- +
-  * **memoQ** takes a simple approach: it counts chunks of text between spaces (\b|\s|^|$) as words and does not treat escaped tags as single units. +
-  * **OmegaT** counts each escaped tag as a single unit but treats Saxon genitives as separate words. +
-  * **Rainbow** ignores escaped tags and treats a Saxon genitive as a suffix, and therefore as part of the preceding word. +
- +
-Therefore, the following string: +
- +
-<code>wall’s&lt;br /&gt;</code> +
- +
-will yield totally different counts in the three tools considered above: +
- +
-      +
- +
-^ memoQ                ^ OmegaT               ^ Rainbow                   ^ +
-| 2 words              | 3 words              | 1 word                    | +
-| <html><span style="border:2px solid red;">wall’s&amp;lt;br</span> <span style="border:2px solid purple;">/&amp;gt;</span></html> | <html><span style="border:2px solid red;">wall</span><span style="border:2px solid blue;">’s</span><span style="border:2px solid purple;">&amp;lt;br /&amp;gt;</span></html> | <html><span style="border:2px solid red;">wall’s</span>&amp;lt;br /&amp;gt;</html> | +
- +
-Other characters may be counted differently too. The example above is not meant to be comprehensive. +
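
The three counting strategies described in the removed revision can be sketched in Python. This is only an illustration of the rules as the page states them, not the actual logic of memoQ, OmegaT or Rainbow; the function names and both regular expressions are assumptions made for this example.

```python
import re

# Assumed pattern for an escaped tag such as &lt;br /&gt; (hypothetical,
# not taken from any of the three tools).
ESCAPED_TAG = re.compile(r"&lt;.*?&gt;")
# Assumed pattern for a Saxon genitive ending ('s or ’s) at the end of a chunk.
GENITIVE = re.compile(r"(?<=\w)['’]s$")


def count_whitespace_chunks(text):
    """memoQ-like rule: every chunk between spaces is one word."""
    return len(text.split())


def count_tags_as_units(text):
    """OmegaT-like rule: each escaped tag is one unit, and a Saxon
    genitive counts as a word separate from the word it follows."""
    total = len(ESCAPED_TAG.findall(text))
    remainder = ESCAPED_TAG.sub(" ", text)
    for chunk in remainder.split():
        total += 2 if GENITIVE.search(chunk) else 1
    return total


def count_ignoring_tags(text):
    """Rainbow-like rule: escaped tags are dropped, and a Saxon
    genitive stays attached to the preceding word."""
    return len(ESCAPED_TAG.sub(" ", text).split())


sample = "wall’s&lt;br /&gt;"
print(count_whitespace_chunks(sample))
print(count_tags_as_units(sample))
print(count_ignoring_tags(sample))
```

For the sample string, the three functions return 2, 3 and 1, matching the counts given in the table for memoQ, OmegaT and Rainbow respectively.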
  • cat/tecdoc_wc.1454500540.txt.gz
  • Last modified: 2016/02/03 13:55
  • (external edit)