
# What is this?

Koch method with real words.  This software prepares lists of real
words for learning Morse code by the Koch method.

## TL;DR

If you just want to learn Morse code:

* Find a program that can play text files as Morse code.

* Choose your language. We have [German](de_DE), [British
  English](en_GB) and [American English](en_US) on offer.

* Let your program play `lesson_01_0.txt` to you at speed 12 wpm = 60
  cpm.  There are several blanks after each word, so you'll get a
  small break after each word to comprehend it.  You may need to copy
  to paper at first, but try to copy in your head only.

* Do not set up your program to insert additional gaps between the
  letters of a word in order to slow the words down below 60 cpm.

* Next, let your program play `lesson_01_1.txt`, then
  `lesson_01_2.txt` and so on.  Keep track of how many of the words you
  can copy.  If you are down to five non-copies per lesson file, you
  are done with lesson 01 and can continue to lesson 02.

* The initial lesson 01 teaches you three letters.  All other
  lessons introduce one new letter per lesson.  Which letters these
  are depends on the language.  You can consult `learning_order.json`
  to find out the details.

* Skip to the next lesson whenever you are down to five or fewer
  non-copies per lesson file.  No reason to listen to all files from 0
  to 9 just because they are there.

* It is recommended to practice for 30 minutes per day.

* As for practice frequency, anything from daily down to twice a week
  works.  This depends on how much of a hurry you are in to learn
  Morse code and how much time you're willing to spend per week.

* On each new learning session (day), start by listening again to one
  file (or more, as needed) of the lesson you stopped with at the end
  of the previous learning session.  In that sense, it is not quite
  "one lesson per day".  Proceed further when you have five or fewer
  non-copies in that file.

With this system, you can expect to learn to receive the 26 letters of
the alphabet as Morse code within some 25 to 30 half-hour learning
sessions.  As you have always practiced at 12 wpm or 60 cpm, that is
the speed you will have mastered.

This speed is already above the dreaded 50 cpm "plateau of unlearning
of thinking", so you'll never get stuck there.

## What is that "Koch method"?

Ludwig Koch was a German scientist who developed, in his research, a
method of teaching Morse code.  This was published in February 1936:

> Ludwig Koch (Braunschweig): "Arbeitspsychologische Untersuchung der
> Tätigkeit bei der Aufnahme von Morsezeichen, zugleich ein neues
> Anlernverfahren für Funker" (Dissertation der Technischen Hochschule
> Braunschweig), Zeitschrift für angewandte Psychologie und
> Charakterkunde, Band 50 Heft 1 u. 2, Februar 1936.

The main results of his research and key points of his method are
(crudely) summarized as follows (all speeds in characters per minute):

* Speeds below 50 invite people to think while listening to Morse
  code.

* As long as thinking is involved, speeds above about 50 cannot be
  reached.

* It is important to hear Morse code as characters "automatically"
  without thinking.

* Thinking doesn't work at speeds above 50.  So all Morse code
  training should be at speeds above that, from the very beginning.

* Koch's research established his method as an improvement over the
  Farnsworth method of sending individual letters at high speed and
  leaving gaps between them to reduce the overall speed to well below
  50, which again facilitates thinking.  For the record: At the time,
  the method wasn't yet called by its now popular name "Farnsworth";
  Koch calls it "Klangbildverfahren", but it was commonly used.

* A moderate speed of 60 is ideal for learning Morse code.  (He also
  experimented with higher initial speeds, but that led to slower
  learning.)

* One should start initial Morse code practice by being offered two
  different characters only, at speed 60.

* During the initial minutes of the original Koch-method course, one
  listens to those Morse characters, but does not yet know which
  letters they represent.  One simply writes a dot for each letter
  heard.

* Then the characters are disclosed and are copied on paper as heard.

* Practice is always in groups of five letters, as was common for the
  encrypted traffic of that day.

* When 90 % of the characters are copied correctly by the learning
  group as a whole, the next (third) letter is introduced.

* This continues throughout the course: Each time the group as a whole
  copies the letters learned thus far 90 % correctly, a new letter is
  added.

* Each new letter is introduced as sound only.  It is not initially
  disclosed which character the letter stands for.  At first, students
  copy a dot for each new letter heard.

* Throughout the course, it is stressed to always put a dot for a
  letter not immediately copied.  Trying to think when a letter is not
  immediately copied is discouraged: Thinking is not likely to help,
  but likely to hinder copying the next letter.

* Course lectures are given in half-hour units.  Koch gives reasons
  why this is the optimal time frame.

* Koch recommends against binge learning.  Even in the common military
  setting of the day, Koch recommends half-hour Morse code sessions
  should not be held more frequently than twice per day.  In his own
  experiments, volunteers from all walks of life were given half-hour
  sessions two or three times a week.

* When the first two half-hour sessions have been completed and the
  90 % rule has been followed, the typical course will have picked up
  4 different characters.  Some courses manage 5.

* Koch observes that people who have not picked up 4 characters after
  two half-hour sessions of learning typically do not manage to learn
  Morse code at all, even when putting in a lot of effort.
  He recommends using these first two sessions as a sort of entry
  exam.

* It should be mentioned that, in his day, the qualification target
  was speed 100 in copying cipher text and speed 125 in copying plain
  text.

* Each half-hour session starts with a repetition of the previously
  learned material.  Later in the course, in most sessions, only a
  single new character is introduced.

* The letters p x v y q z tend to be particularly difficult to pick
  up.  Some of these characters may require two sessions before the
  course copies them 90 % and the next character can be introduced.

* Koch gives no clear recommendation on the order in which to learn
  the letters.  The first letters introduced should be quite
  different-sounding from each other, that much is clear.  One of his
  courses used: h f a b g c d e ch l k i o t m p q r s n u w v y x z.
  This contradicts a recommendation he also gives: The more difficult
  letters should be introduced somewhat early in the course, after
  about a third or so, so they see some repetition later.

* Koch recommends dual pitch: Sending "dits" in a slightly lower pitch
  than "dahs", at first.  The difference in pitch should be "quite
  low" (Koch is not specific here).  The general impression is that
  the difference was not rigorously controlled and may have varied
  somewhat from course session to course session.  This dual pitch is
  kept up only for the first half or third or so of the course.
  Thereafter, the pitch difference is gradually reduced, so normal
  single pitch is reached some time before the end of the course.
  Koch does not claim dual pitch to be essential, but offers a
  comparison: A dual-pitch course finished three sessions earlier than
  a single-pitch companion course.

* How long does it take to learn Morse code by the Koch method?
  During his research, Koch took the shortcut of teaching only the 26
  letters of the alphabet.  As his courses were taught to conduct
  research, not to train radio people, digits, punctuation, or
  prosigns were not introduced.  Teaching 26 alphabet characters usually
  took 24 to 28 half-hour sessions, apparently spread over some 8 to
  14 weeks (depending on twice or thrice a week schedule).  For the
  record: The text mentions teaching "ch" (no longer a character in
  today's international Morse code as specified by the ITU), but all
  the graphs show 26 characters being taught; the publication is not
  entirely clear here.

It strikes me that in his 70-page paper, Koch never even mentions
Morse code keying training.  Sending does not seem to be an issue that
needs a lot of concern, once reception is mastered.

That rather fits my own experience and that of many other radio
amateurs: Once solid code copy is mastered, keying is reduced to a
mere mechanical task.  There is some concern about avoiding cramping
and "glass fist", yes.  But Morse code knowledge straightforwardly
transfers from copy to keying.

## Koch and today's radio amateur

The speeds are comparable.  In Koch's age, the top-notch professional
radio men on board ships or airplanes had passed exams demanding
solid copy of 100 characters per minute cipher text or 125 plain, for
five minutes.  Today's radio amateur "high speed club" demands those
same 125 characters per minute, plain text, but for half an hour,
mixed reception and sending.  In contests, many CW stations use
similar speed ranges.

Koch mentions less demanding environments of his days, where average
communication speeds of 60-90 were common.  Today, we have a lot of
Morse code communication going on in these slower 60+ speed ranges, in
particular on the lower bands.  Quite a few radio amateurs need such
relatively slow speeds in order to be able to join the party.

Unfortunately, many have learned to copy Morse code in a way that
involves some kind of thinking or mental processing, rather than
automatism.  Trying to get faster, they have thoroughly practiced, but
what they practice is exactly such mental processing. In the end, they
find themselves stuck at a speed level of maybe 50 or, with lots
more practice, 60 characters per minute.

That's the end speed that can be reached with mental processing, with
thinking.  In contrast, higher speed requires getting rid of thinking,
replacing the processing with automatism.  This is by no means
impossible, but hard to do.

Many never break that barrier.

That same barrier at about speed 50 cpm was well-known in Koch's days.
His brilliant idea was to sneak new learners around it.

But Koch's method can also be used to break the barrier in those that
already struggle with it.  I happen to know from first-hand personal
experience.

Important differences between then and now exist.  In Koch's days,
radio operators copied to paper all they heard, to be forwarded to the
intended recipients.  Today's Morse code users, we radio amateurs, are
typically ourselves the intended recipients of messages we receive.
There is no need for paper copy of every character received, if we
manage to comprehend the messages directly.

The traditional training back then involved copying five-letter groups
of random text to paper.  If comprehension independent of paper copy
is our goal, this might not be optimal.

I therefore propose doing away with meaningless five-letter groups. If
comprehension is to be trained, let us use comprehensible words.

Can we combine Koch's method with real words?  That's exactly what this
software is about.

## Koch with real words

Pretty much the same as Koch, but with real words from your mother
tongue replacing the traditional meaningless five-letter groups.

You may choose to copy them on paper.  But I recommend simply trying
to understand them in your head.  You'll know the word once you "have"
it.

Koch starts initial training with two characters.  We have to bite the
bullet and start with three characters instead.  There is hope that we
can find a few meaningful words that can be spelled with some chosen
three letters.

Which initial three letters allow at least some words to be spelled?
How to choose?

That's where software comes in.

Given our alphabet of 26 letters, there are 2600 choices of three
different letters from that alphabet.  Software can simply and
systematically try out all these 2600 choices, and find the choice
that allows the most words to be spelled with just those three letters
chosen.
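
The exhaustive search just described can be sketched in a few lines of Python.  This is a minimal illustration, not the code shipped in this repository: the function name `best_triple` and the tiny sample word list are made up for the example, and a tie between two triples is resolved in favor of whichever comes first alphabetically.

```python
from itertools import combinations
from string import ascii_lowercase


def best_triple(words):
    """Return (letters, count) for the 3-letter set that spells the most words."""
    # Reduce each word to its set of distinct letters once, up front.
    letter_sets = [frozenset(word) for word in words]
    best_letters, best_count = None, -1
    # There are C(26, 3) = 2600 ways to choose three different letters.
    for triple in combinations(ascii_lowercase, 3):
        allowed = frozenset(triple)
        # A word can be spelled if it uses no letter outside the triple.
        count = sum(1 for letters in letter_sets if letters <= allowed)
        if count > best_count:
            best_letters, best_count = triple, count
    return best_letters, best_count


# Hypothetical sample; the real input would be the lines of wordlist.txt.
sample = ["papa", "pass", "spa", "sap", "tee", "ist", "app"]
print(best_triple(sample))  # (('a', 'p', 's'), 5)
```

With only 2600 candidates, brute force is perfectly adequate for this step.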

In German, those three letters are "a", "p", and "s".  21 short German
words can be spelled with just these three letters. In alphabetic
order, these are:

> aa aas app apps ass papa papas papp pass passa passas ppp pps ps sap
> saps sas sass spa spas spass

21 different words is not a whole lot.  But for the first 20 minutes
of practice or so, such a restricted set of words may just be good
enough.  Then, a new, fourth letter can be added.

Again, software to the rescue.  The "t" is a good choice here: adding
it gives 29 new German words.  And so on.  For details, see
`learning_order.json` (of your chosen language).
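
Picking the next letter can be sketched as a greedy step: try every letter not yet learned and keep the one that unlocks the most new words.  This is only a sketch under that assumption.  I am not claiming it is exactly the criterion behind `learning_order.json`, and the helper names `spellable` and `next_letter` as well as the sample words are invented for illustration.

```python
from string import ascii_lowercase


def spellable(words, letters):
    """Return the set of words that use only the given letters."""
    allowed = frozenset(letters)
    return {word for word in words if frozenset(word) <= allowed}


def next_letter(words, known):
    """Pick the letter that unlocks the most new words when added to `known`."""
    already = spellable(words, known)
    best, best_gain = None, -1
    for candidate in ascii_lowercase:
        if candidate in known:
            continue
        gain = len(spellable(words, set(known) | {candidate}) - already)
        if gain > best_gain:
            best, best_gain = candidate, gain
    return best, best_gain


# Hypothetical sample; the real input would be the lines of wordlist.txt.
sample = ["pass", "spa", "past", "satt", "tat", "sie", "es"]
print(next_letter(sample, "aps"))  # ('t', 3)
```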

## Disadvantages

* If you follow these suggestions, you'll learn only the letters and
  have to deal with numbers and other characters separately later.

* The present implementation is based on a dictionary.  So odd, funny
  words are practiced.  We might want an improvement that takes word
  frequency into account.

# How to run this?

## Installation

Have `docker` installed.

## Preparation

Run, in this directory (that contains this `README.md` file):

```
docker build -t registry.invalid/kmrw:latest .
```

_Beware! There's a `.` at the end of that line.  It is needed._

This prepares the software and a spell checking dictionary and packs
it all into the Docker image.

If you want to develop the code and rebuild frequently, it is
convenient to have a local HTTP proxy, in which case you'd use
something like

```
docker build --build-arg=http_proxy=http://172.18.0.1:3128/ -t registry.invalid/kmrw:latest .
```

If you don't know what a local HTTP proxy is, you can safely ignore
this.

### What's this `registry.invalid`?

(Skip this section if you only want the recipe, not the explanation.)

Docker uses a registry, which is a place in the cloud where Docker
images are served (and, more likely than not, the company that
operates it spies on you).  Out of the box, the `docker` command uses
`registry-1.docker.io` when you don't mention another one.

I would like a way to say "cut out this registry stuff, this
particular Docker image lives locally on my computer only and goes
nowhere".  I would much prefer that a Docker image without a registry
be just that, a Docker image without a registry, but that's not how
the `docker` command has been designed.  Still, I can achieve the same
effect by using a bogus registry name.

The `.invalid` top level domain has been set aside officially to be
just that - invalid.  A registry ending with ".invalid" will be
nowhere.  (Technically, DNS will not yield an IP for any hostname in
the `.invalid` domain.)

## Run

This is for the people who want to play with the system and create
their own lesson files.

I run the following in `bash`.

The `docker build ...` command needs to happen in the directory that
has this `README.md` file and, more to the point, the `Dockerfile`.

In contrast, you can run everything that follows in any (one)
directory of your choosing.

Why "(one)"?  Later commands generally depend on files produced by
earlier ones.  So, for smoothest sailing, don't switch directories in
between calls.

Unless you want to switch languages. One directory per language is an
excellent idea.

Some of the reasonably fast running commands produce no output.  This
follows the time-honored "no news is good news" command line
tradition.  If you know about exit values, check those.

I run the commands in `bash` on a Linux system.  If you are in a
non-Linux environment, you can probably remove the
`--user="$(id --user):$(id --group)"` parts.  If you remove those,
there is some concern: Both the user inside the Docker container and
you on the outside want access to the files.  But then, rumor has it
that on some non-Linux operating systems, the Docker installation
handles this for you automatically.  I don't know; in my spare time,
I'm no polyglot and generally never leave Linuxland unless I have to.

### All in one swoop

If you are like me and want to run the whole thing in one swoop, for
all three languages supported (which takes some 4.5 hours on my
laptop):

```
docker run --rm -ti --mount=type=bind,src="$(pwd)",dst=/fromhost --user="$(id --user):$(id --group)" --workdir=/fromhost registry.invalid/kmrw:latest invoke all
```

Or you can do stuff for only one language: Replace the `all`
in the above by `de-de`, or `en-gb`, or `en-us`.  This cuts the
time needed down to about 1.5 hours on my laptop.

### A word on my invoke tasks

Those invoke tasks tend to be conservative.  What one invocation
constructed, the next will not knowingly overwrite.  Even `invoke
--list` will show fewer and fewer possible tasks.

If you actually do want to run stuff again, delete the file that was
produced.

### Wordlist generation

This step generates a list of words, only ASCII characters, one word
per line, from the material of spell checking dictionaries.

Those spell checking dictionaries and the accompanying software are in
the Docker image we built.  Currently it has German and the two
English flavors GB and US.

The German command line is:

```
docker run --rm -ti --mount=type=bind,src="$(pwd)",dst=/fromhost --user="$(id --user):$(id --group)" --workdir=/fromhost registry.invalid/kmrw:latest invoke de-de.mk-wordlist-de-de
```

For US-English, replace the `invoke` target `de-de.mk-wordlist-de-de`
with `en-us.mk-wordlist-en-us`, for British English, with
`en-gb.mk-wordlist-en-gb`.


This spills out a bit of output from the spell check software we use.

If all is well, you'd find a file `wordlist.txt` in the language
sub-directory that contains, one word a line, all words the spell
checker knows that are at most 6 characters long.

Reading words in your head gets much harder with increasing word
length.  Hence the restriction to at most 6 characters.
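
The filtering rule (ASCII letters only, at most 6 characters) can be sketched as follows.  This is not how the repository extracts words from the spell checking dictionaries.  It only illustrates the filter, and it assumes lowercase words.  The names `keep_word` and `filter_words` are made up, and a word with a non-ASCII letter is simply dropped here, which may well differ from what the real tool does with such words.

```python
import re

MAX_LENGTH = 6  # the word length limit described above
ONLY_A_TO_Z = re.compile(r"[a-z]+")


def keep_word(word, max_length=MAX_LENGTH):
    """True if the word is short enough and made of the letters a-z only."""
    return len(word) <= max_length and ONLY_A_TO_Z.fullmatch(word) is not None


def filter_words(candidates):
    """Return the kept words, sorted and without duplicates."""
    return sorted({word for word in candidates if keep_word(word)})


# Hypothetical candidates, as a spell checking dictionary might yield them.
print(filter_words(["spa", "papa", "spaß", "telegraph", "pass", "spa"]))
# ['papa', 'pass', 'spa']
```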

### Letter count file generation

Next, I use that to generate a count, for each letter, in how many
words that letter occurs.  The invoke targets are
`de-de.mk-lettercount-de-de`, `en-gb.mk-lettercount-en-gb`, and
`en-us.mk-lettercount-en-us`.

Each of these results in a plain text file `lettercount.txt`.  Have a
look, if you want to.
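
The counting itself is simple.  Here is a minimal sketch, assuming one lowercase word per input item.  The function name `letter_counts` is invented, and the sketch says nothing about the actual layout of `lettercount.txt`.

```python
from collections import Counter


def letter_counts(words):
    """For each letter, count the words in which it occurs at least once."""
    counts = Counter()
    for word in words:
        # set(word) ensures a letter repeated within a word counts only once.
        counts.update(set(word))
    return counts


# Hypothetical sample; the real input would be the lines of wordlist.txt.
counts = letter_counts(["papa", "pass", "spa", "tat"])
for letter, count in sorted(counts.items()):
    print(letter, count)
```

Note that this counts words, not occurrences: "papa" contributes 1 to the count of "p", not 2.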

### Database generation

Now, the fun stuff with "databases" starts.  We generate a database
`letterset2count.kmrw`.  (The abbreviation `kmrw` is intended to stand
for "Koch method - real words".)  This database contains, for each set
of letters, the number of words that can be built from (spelled with)
those letters.

That file `letterset2count.kmrw` will be 268,435,456 bytes long,
2 ** 26 * 4.

An earlier version took roughly 10 hours on my laptop. After some
optimization, that's down to about 1 hour 20 minutes.

This program tries to keep you entertained by occasionally posting
estimates for when it'll finish. The time printed is (probably) UTC.
With each estimate, it also posts a number, counting down from 2**26 =
67108864. As the hard work is done first, all estimates tend to be
(decreasingly) pessimistic.  The initial estimate is outright
ridiculous.

The invoke targets are `de-de.mk-lettercount-de-de`,
`en-gb.mk-lettercount-en-gb`, and `en-us.mk-lettercount-en-us`.

Upon completion, this process leaves a database file `letterset2count.kmrw` in
your current directory.  For any possible choice of a handful of
letters from the alphabet a...z, that database gives you the number of
words that can be spelled with only the letters from your chosen
handful.
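
To make the idea concrete, here is a sketch of how such a table can be computed, shown on a toy alphabet of 4 letters so that it has only 2**4 entries.  I am not claiming this is what the program in this repository does.  It is one standard technique (count words by their exact letter set, then sum over subsets one bit at a time), and the letter-to-bit assignment as well as the names `mask_of` and `build_table` are hypothetical.

```python
LETTERS = "apst"  # toy alphabet; the real table covers all 26 letters
BIT = {letter: 1 << i for i, letter in enumerate(LETTERS)}


def mask_of(word):
    """Bitwise-or the bits of all letters occurring in the word."""
    mask = 0
    for letter in word:
        mask |= BIT[letter]
    return mask


def build_table(words):
    """table[m] = number of words spellable with only the letters in mask m."""
    size = 1 << len(LETTERS)
    table = [0] * size
    # Step 1: count words by their exact letter set.
    for word in words:
        table[mask_of(word)] += 1
    # Step 2: for each bit, add the counts of the mask lacking that bit,
    # so every entry ends up summing over all of its subsets.
    for i in range(len(LETTERS)):
        bit = 1 << i
        for mask in range(size):
            if mask & bit:
                table[mask] += table[mask ^ bit]
    return table


table = build_table(["papa", "pass", "spa", "tat", "past"])
print(table[mask_of("aps")])   # 3: papa, pass, spa
print(table[mask_of("apst")])  # 5: all of the sample words
```

With 26 letters instead of 4, the same layout has 2**26 entries.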

To be interpreted, that database requires the `lettercount.txt` to be
available as well.  The details:

## What's the deal with these database files?

You can skip this section if you are only interested in the results,
not the details.

### Database logic

The databases can be thought of as key/value maps.

The key is a set of letters a-z.  Order of letters is ignored, as is
repetition, so "aab" and "ba" come down to the same set of letters,
hence the same key.

The value is simply an integral number between 0 (inclusive) and
`2**32` (exclusive): the number of words that can be spelled with just
the letters from the set.
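
In Python terms, the logical view looks roughly like this.  It is a sketch only: the names `key` and `value` are invented, and the real database does not store Python sets.

```python
def key(letters):
    """Reduce any string of letters to its set of distinct letters."""
    return frozenset(letters)


def value(words, letters):
    """Count the words that can be spelled with just the given letters."""
    allowed = key(letters)
    return sum(1 for word in words if key(word) <= allowed)


# Order and repetition do not matter for the key.
print(key("aab") == key("ba"))  # True

# Hypothetical sample; the real input would be the lines of wordlist.txt.
sample = ["abba", "baa", "cab", "b"]
print(value(sample, "ba"))  # 3: abba, baa, b
```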

### Database file layout

Every individual letter is associated with a power of 2.

Which power of 2 depends on the number of words containing that
letter, hence on the language.  Common letters get higher powers of
two, less common letters get lower ones.  To see for yourself, change
to the language directory that has the `lettercount.txt` file and
there, execute this:

```
docker run --rm -ti --mount=type=bind,src="$(pwd)",dst=/fromhost --user="$(id --user):$(id --group)" --workdir=/fromhost registry.invalid/kmrw:latest letter2bitmask.py --lettercount lettercount.txt
```

Given a set of these letters, I simply "bitwise or" the corresponding
powers of two.  (If that sounds like a weird or interesting idea to
you, I recommend you research "bitmasks".)  (If you don't understand
"bitwise or", you can think "add".  As long as the same number isn't
added twice, it boils down to the same thing.)

The result is a number.  That number completely determines the set of
letters, hence the key into the dictionary.
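
A stand-alone sketch of this encoding, for readers who want to play with it without the Docker image.  The functions `assign_bits`, `number`, and `chars` are my illustration, loosely named after the methods used in the session further down.  They are not the repository's `Letter2Bitmask` class.  The counts are invented, and breaking ties alphabetically is an arbitrary assumption.

```python
def assign_bits(counts):
    """Map each letter to a power of two, rarest letter first (lowest bit)."""
    # Ties are broken alphabetically here; that is an arbitrary choice.
    ordered = sorted(counts, key=lambda letter: (counts[letter], letter))
    return {letter: 1 << position for position, letter in enumerate(ordered)}


def number(bits, letters):
    """Bitwise-or the powers of two of all letters in the set."""
    result = 0
    for letter in set(letters):
        result |= bits[letter]
    return result


def chars(bits, value):
    """Recover the sorted list of letters encoded in a number."""
    return sorted(letter for letter, bit in bits.items() if value & bit)


# Hypothetical counts for a four-letter alphabet.
bits = assign_bits({"e": 900, "t": 500, "k": 40, "q": 7})
print(bits)                    # {'q': 1, 'k': 2, 't': 4, 'e': 8}
print(number(bits, "teek"))    # 14, same as number(bits, "kte")
print(chars(bits, 14))         # ['e', 'k', 't']
```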

If you have a `lettercount.txt` file in your current directory, you can
explore this mapping from sets of letters to numbers and reverse at
the interactive `python3` prompt.  Here is one sample session.

Gory details:

* The stuff following ">>>" or "..." is what I typed at the Python
  prompt.  I typed two spaces before `map = ...`, and the next line
  was completely empty.  (If you know Python, this will not surprise
  you.)

* You may get different numbers if you use a `lettercount.txt` file
  different from mine.  (This particular example was done with the
  en-US wordlist.)

```
andreas@meise:~/amateurfunk/floss/python/koch-method-real-words/en_US
$ docker run --rm -ti --mount=type=bind,src=$(pwd),dst=/fromhost -e HOME=/tmp --user="$(id --user):$(id --group)" --workdir=/fromhost  registry.invalid/kmrw:latest python3
Python 3.7.3 (default, Jul 25 2020, 13:03:44)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from letters_rare_first import from_lettercount_file
>>> from letter2bitmask import Letter2Bitmask
>>> with open('lettercount.txt', 'r') as file:
...   map = Letter2Bitmask(from_lettercount_file(file))
...
>>> map.number("kochmethod")
34826368
>>> map.chars(34826368)
['c', 'd', 'e', 'h', 'k', 'm', 'o', 't']
>>> map.number(['c', 'd', 'e', 'h', 'k', 'm', 'o', 't'])
34826368
>>> [map.number(ch) for ch in ['c', 'd', 'e', 'h', 'k', 'm', 'o', 't']]
[16384, 65536, 33554432, 2048, 128, 8192, 1048576, 131072]
>>> 16384 | 65536 | 33554432 | 2048 | 128 | 8192 | 1048576 | 131072
34826368
>>> 16384 + 65536 + 33554432 + 2048 + 128 + 8192 + 1048576 + 131072
34826368
>>> exit(0)
```

Letters common in the language are coded to larger numbers.  This
tremendously helps speed up the database generation and also makes
`kmrw` files more easily compressible (with `xz`, `zip`, and the
like).

So, what we now have is a key that still _represents_  a set of
letters, but actually _is_ a mere integral number.

The database format is rather simple.  Here is how it works:

* Multiply that number by 4 to get at a position.  E.g., the number
  34826368 would yield a position of 139305472.

* Go to that position in the file, and read the next four bytes.

* Interpret those four bytes as a 32-bit integer, and voilà, that is
  the value.

In my example, that value happened to be 270:

```
$ docker run --rm -ti --mount=type=bind,src=$(pwd),dst=/fromhost --user="$(id --user):$(id --group)" --workdir=/fromhost  registry.invalid/kmrw:latest /bin/bash -c "dd if=letterset2count.kmrw bs=1 skip=139305472 count=4 | od -td"
4+0 records in
4+0 records out
0000000         270
0000004
4 bytes copied, 0.0213296 s, 0.2 kB/s
```

That can be verified by straightforwardly counting all words in the
`wordlist.txt` file that consist of those letters:

```
$ docker run --rm -ti --mount=type=bind,src=$(pwd),dst=/fromhost --user="$(id --user):$(id --group)" --workdir=/fromhost  registry.invalid/kmrw:latest /bin/bash -c "grep -P '^[cdehkmot]+$' wordlist.txt | wc -l"
270
```

So that's what a `.kmrw` file is.  (If you know what "memory mapped"
means: Internally in my Python program, it's a memory mapped array of
integers. That's all.)
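
The three lookup steps can also be done in plain Python without memory mapping.  This is a sketch, not code from this repository: `read_count` is a made-up name, and the little-endian byte order is an assumption on my part (it matches the native order of common PC hardware, but the format description above does not spell it out).  The demo builds a tiny stand-in file instead of touching a real `.kmrw` file.

```python
import os
import struct
import tempfile

ENTRY_SIZE = 4  # bytes per value, as described above


def read_count(path, key_number):
    """Read the 32-bit value stored for the given key number."""
    with open(path, "rb") as file:
        file.seek(key_number * ENTRY_SIZE)
        raw = file.read(ENTRY_SIZE)
    # "<I" = little-endian unsigned 32-bit; the byte order is an assumption.
    return struct.unpack("<I", raw)[0]


# Build a tiny stand-in file with 8 entries instead of 2**26.
values = [0, 1, 1, 3, 0, 2, 1, 270]
with tempfile.NamedTemporaryFile(suffix=".kmrw", delete=False) as demo:
    demo.write(struct.pack("<8I", *values))

print(read_count(demo.name, 7))  # 270
os.remove(demo.name)
```

Under the same byte-order assumption, `read_count("letterset2count.kmrw", 34826368)` on the real en-US file should correspond to the `dd` call shown above.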

Don't open a `.kmrw` file with standard tools (unless, for you, a hex
viewer or similar qualifies as a "standard tool").