Mirror of https://gitlab.com/4ham/koch-method-real-words
# What is this?

Koch method with real words. Prepare a list of words to learn Morse
code by the Koch method, using real words.

## TL;DR

If you just want to learn Morse code:

* Find a program that can play text files as Morse code.

* Choose your language. We have [German](de_DE), [British
  English](en_GB) and [American English](en_US) on offer.

* Let your program play `lesson_01_0.txt` to you at speed 12 wpm = 60
  cpm. There are several blanks after each word, so you'll get a
  small break after each word to comprehend it. You may need to copy
  to paper at first, but try to copy in your head only.

* Do not set up your program to introduce additional gaps between
  letters of the same word, as that would slow the words down to
  below 60 cpm.

* Next, let your program play `lesson_01_1.txt`, then
  `lesson_01_2.txt` and so on. Keep track of how many of the words you
  can copy. Once you are down to five non-copies per lesson file, you
  are done with lesson 01 and can continue to lesson 02.

* The initial lesson 01 teaches you three letters. All other
  lessons introduce one new letter per lesson. Which letters these
  are depends on the language. Read `learning_order.json`
  to find out the details.

* Skip to the next lesson whenever you are down to five or fewer
  non-copies per lesson file. No reason to listen to all files from 0
  to 9 just because they are there.

* It is recommended to practice for 30 minutes per session.

* As for the practice frequency, anything from daily down to twice a
  week is recommended, depending on how much of a hurry you are in
  to learn Morse code and how much time you're willing to spend per
  week.

* In each new learning session (day), start by listening again to one
  file (or more, as needed) of the lesson you stopped with at the end
  of the previous learning session. In that sense, it is not quite
  "one lesson per day". Proceed further when you have five or fewer
  non-copies in that file.

With this system, you can expect to learn to receive the 26 letters of
the alphabet as Morse code within some 25 to 30 half-hour learning
sessions. As you have always practiced at 12 wpm or 60 cpm, that is
the speed you will have mastered.

This speed is already above the dreaded 50 cpm "plateau of unlearning
of thinking", so you'll never get stuck there.

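
If your Morse player asks for a speed in one unit and you only know
the other, the conversion can be sketched as follows. Two assumptions
that are not stated in this README: one "word" is counted as five
characters (which is what 12 wpm = 60 cpm implies), and the dit length
follows the common "PARIS" convention of 50 dit units per word. The
function names are made up for illustration.

```python
CHARS_PER_WORD = 5   # assumption: implied by "12 wpm = 60 cpm"
UNITS_PER_WORD = 50  # assumption: "PARIS" convention, 50 dit units per word


def wpm_to_cpm(wpm):
    """Convert words per minute to characters per minute."""
    return wpm * CHARS_PER_WORD


def cpm_to_wpm(cpm):
    """Convert characters per minute to words per minute."""
    return cpm / CHARS_PER_WORD


def dit_seconds(wpm):
    """Length of a single dit in seconds at the given speed."""
    return 60.0 / (UNITS_PER_WORD * wpm)


print(wpm_to_cpm(12))   # 60
print(cpm_to_wpm(50))   # 10.0, the "plateau" speed expressed in wpm
print(dit_seconds(12))  # 0.1
```
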
## What is that "Koch method"?

Ludwig Koch was a German scientist who developed, in his research, a
method of teaching Morse code. This was published in February 1936:

> Ludwig Koch (Braunschweig): "Arbeitspsychologische Untersuchung der
> Tätigkeit bei der Aufnahme von Morsezeichen, zugleich ein neues
> Anlernverfahren für Funker" (Dissertation der Technischen Hochschule
> Braunschweig), Zeitschrift für angewandte Psychologie und
> Charakterkunde, Band 50 Heft 1 u. 2, Februar 1936.

The main results of his research and key points of his method are
(crudely) summarized as follows (all speeds in characters per minute):

* Speeds below 50 invite people to think while listening to Morse
  code.

* As long as thinking is involved, speeds above about 50 cannot be
  reached.

* It is important to hear Morse code as characters "automatically"
  without thinking.

* Thinking doesn't work at speeds above 50. So all Morse code
  training should be at speeds above that, from the very beginning.

* Koch's research established his method as an improvement over the
  Farnsworth method of sending individual letters at high speed and
  leaving gaps between them to reduce speed much below 50, which again
  facilitates thinking. For the record: At the time, the method wasn't
  yet called by its now popular name "Farnsworth"; Koch calls it
  "Klangbildverfahren". But it was commonly used.

* A moderate speed of 60 is ideal for learning Morse code. (He also
  experimented with higher initial speeds, but that led to slower
  learning.)

* One should start initial Morse code practice by being offered two
  different characters only, at speed 60.

* During the initial minutes of the original Koch-method course, one
  listens to those Morse characters, but does not yet know which
  letters they represent. One simply writes a dot for each letter
  heard.

* Then the characters are disclosed and are copied on paper as heard.

* Practice is always in groups of five letters, as was common for the
  encrypted traffic of that day.

* When 90 % of the characters are copied correctly by the learning
  group as a whole, the next (third) letter is introduced.

* This continues throughout the course: Each time the group as a whole
  copies the letters learned thus far 90 % correctly, a new letter is
  added.

* Each new letter is introduced as sound only. It is not initially
  disclosed which character the letter stands for. At first, students
  copy a dot for each new letter heard.

* Throughout the course, it is stressed to always put a dot for a
  letter not immediately copied. Trying to think when a letter is not
  immediately copied is discouraged: Thinking is not likely to help,
  but likely to hinder copying the next letter.

* Course lectures are given in half-hour units. Koch gives reasons
  why this is the optimal time frame.

* Koch recommends against binge learning. Even in the common military
  setting of the day, Koch recommends that half-hour Morse code
  sessions should not be held more frequently than twice per day. In
  his own experiments, volunteers from all walks of life were given
  half-hour sessions two or three times a week.

* When the first two half-hour sessions have been completed and the
  90 % rule has been followed, the typical course will have picked up
  4 different characters. Some courses manage 5.

* Koch observes that people who have not picked up 4 characters after
  two half-hour sessions of learning fairly typically do not manage
  to learn Morse code at all, even when putting in a lot of effort.
  He recommends using these first two sessions as a sort of entry
  exam.

* It should be mentioned that, in his day, the qualification target
  was speed 100 in copying cipher text and speed 125 in copying plain
  text.

* Each half-hour session starts with a repetition of the previously
  learned material. Later in the course, in most sessions, only a
  single new character is introduced.

* The letters p x v y q z tend to be particularly difficult to pick
  up. Some of these characters may require two sessions before the
  course copies them 90 % and the next character can be introduced.

* Koch gives no clear recommendation in which order to learn the
  letters. The first letters introduced should be quite
  different-sounding from each other, that much is clear. One of his
  courses used: h f a b g c d e ch l k i o t m p q r s n u w v y x z.
  This contradicts a recommendation he also gives: The more difficult
  letters should be introduced somewhat early in the course, after
  about a third or so, so they see some repetition later.

* Koch recommends dual pitch: Sending "dits" in a slightly lower pitch
  than "dahs", at first. The difference in pitch should be "quite low"
  (Koch is not specific here). The general impression is that the
  difference was not rigorously controlled and may have varied
  somewhat from course session to course session. This dual pitch is
  held up only for the first half or third or so of the course.
  Thereafter, the pitch difference is gradually reduced, so normal
  single pitch is reached some time before the end of the course.
  Koch does not claim dual pitch to be essential, but offers a
  comparison: A dual-pitch course finished three sessions earlier than
  a single-pitch companion course.

* How long does it take to learn Morse code by the Koch method?
  During his research, Koch took the shortcut of teaching only the 26
  letters of the alphabet. As his courses were taught to conduct
  research, not to train radio people, digits, punctuation, or pro
  signs were not introduced. Teaching 26 alphabet characters usually
  took 24 to 28 half-hour sessions, apparently spread over some 8 to
  14 weeks (depending on a twice or thrice a week schedule). For the
  record: The text mentions teaching "ch" (no longer a character in
  today's international Morse code as specified by the ITU), but all
  the graphs show 26 characters being taught; the publication is not
  entirely clear here.

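
The 90 % rule and the "put a dot for anything not immediately copied"
habit from the list above can be sketched in a few lines. This is only
an illustration, not part of this project: the function names are made
up, and it scores a single copy, whereas Koch applied the 90 % to the
learning group as a whole. It relies on the dots to keep the copy
aligned with what was sent.

```python
def copy_accuracy(sent, copied):
    """Fraction of sent characters that were copied correctly.

    Blanks between groups are ignored. A '.' in the copy marks a
    character that was heard but not recognized.
    """
    sent_chars = sent.replace(" ", "")
    copied_chars = copied.replace(" ", "")
    if not sent_chars:
        raise ValueError("nothing was sent")
    if len(sent_chars) != len(copied_chars):
        raise ValueError("copy and sent text differ in length")
    hits = sum(s == c for s, c in zip(sent_chars, copied_chars))
    return hits / len(sent_chars)


def ready_for_next_letter(sent, copied, threshold=0.9):
    """True if the copy reaches the threshold (90 % by default)."""
    return copy_accuracy(sent, copied) >= threshold


# Two five-letter groups, one character missed and marked with a dot.
print(copy_accuracy("hfahf fahhf", "hfa.f fahhf"))          # 0.9
print(ready_for_next_letter("hfahf fahhf", "hfa.f fahhf"))  # True
```
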
It strikes me that in his 70-page paper, Koch never even mentions
Morse code keying training. Sending does not seem to be an issue that
needs a lot of concern, once reception is mastered.

That rather fits my own experience and that of many other radio
amateurs: Once solid code copy is mastered, keying is reduced to a
mere mechanical task. There is some concern about avoiding cramping
and "glass fist", yes. But Morse code knowledge transfers
straightforwardly from copy to keying.

## Koch and today's radio amateur

The speeds are comparable. In Koch's age, the top-notch professional
radio men on board ships or airplanes had passed exams demanding
solid copy of 100 characters per minute cipher text or 125 plain, for
five minutes. Today's radio amateur "high speed club" demands those
same 125 characters per minute, plain text, but for half an hour,
mixed reception and sending. In contests, many CW stations use
similar speed ranges.

Koch mentions less demanding environments of his days, where average
communication speeds of 60-90 were common. Today, we have a lot of
Morse code communication going on in these slower 60+ speed ranges, in
particular on the lower bands. Quite a few radio amateurs need such
relatively slow speeds in order to be able to join the party.

Unfortunately, many have learned to copy Morse code in a way that
involves some kind of thinking or mental processing, rather than
automatism. Trying to get faster, they have practiced thoroughly, but
what they practice is exactly such mental processing. In the end, they
find themselves stuck at a speed level of maybe 50 or, with a lot
more practice, 60 characters per minute.

That's the end speed that can be reached with mental processing, with
thinking. In contrast, higher speed requires getting rid of thinking,
replacing the processing with automatism. This is by no means
impossible, but hard to do.

Many never break that barrier.

That same barrier at about speed 50 cpm was well-known in Koch's days.
His brilliant idea was to sneak new learners around it.

But Koch's method can also be used to break the barrier in those that
already struggle with it. I happen to know from first-hand personal
experience.

Important differences between then and now exist. In Koch's days,
radio operators copied to paper all they heard, to be forwarded to the
intended recipients. Today's Morse code users, we radio amateurs, are
typically ourselves the intended recipients of messages we receive.
There is no need for paper copy of every character received, if we
manage to comprehend the messages directly.

The traditional training back then involved copying five-letter groups
of random text to paper. If comprehension independent of paper copy
is our goal, this might not be optimal.

I therefore propose doing away with meaningless five-letter groups. If
comprehension is to be trained, let us use comprehensible words.

Can we combine Koch's method with real words? That's exactly what this
software is about.

## Koch with real words

Pretty much the same as Koch, but with real words from your mother
tongue replacing traditional meaningless five-letter groups.

You may choose to copy them on paper. But I recommend simply trying
to understand them in your head. You'll know the word once you "have"
it.

Koch starts initial training with two characters. We have to bite the
bullet and start with three characters instead. There is hope that we
can find a few meaningful words that can be spelled with some chosen
three letters.

Which initial three letters allow at least some words to be spelled?
How to choose?

That's where software comes in.

Given our alphabet of 26 letters, there are 2600 choices of three
different letters from that alphabet. Software can simply and
systematically try out all these 2600 choices, and find the choice
that allows the most words to be spelled with just those three letters
chosen.

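
The exhaustive search over all 2600 triples can be sketched like this.
It is a minimal illustration on a made-up sample word list, not this
project's actual code; `best_letter_triple` and `SAMPLE_WORDS` are
hypothetical names.

```python
from itertools import combinations
from string import ascii_lowercase

# Tiny made-up sample; the real input would be a full word list.
SAMPLE_WORDS = ["aa", "aas", "app", "papa", "pass", "spa", "spass",
                "tee", "tat", "see", "esse"]


def best_letter_triple(words):
    """Try every choice of three letters, return the most productive one.

    Returns (triple, number of spellable words, number of triples tried).
    """
    letter_sets = [frozenset(word) for word in words]
    best, best_count, tried = None, -1, 0
    for triple in combinations(ascii_lowercase, 3):
        tried += 1
        allowed = set(triple)
        # A word is spellable if it uses no letter outside the triple.
        count = sum(1 for letters in letter_sets if letters <= allowed)
        if count > best_count:
            best, best_count = triple, count
    return best, best_count, tried


best, count, tried = best_letter_triple(SAMPLE_WORDS)
print(best, count, tried)  # ('a', 'p', 's') 7 2600
```

The number of triples tried, 2600, is the binomial coefficient
"26 choose 3" = 26 * 25 * 24 / 6.
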
In German, those three letters are "a", "p", and "s". 21 short German
words can be spelled with just these three letters. In alphabetic
order, these are:

> aa aas app apps ass papa papas papp pass passa passas ppp pps ps sap
> saps sas sass spa spas spass

21 different words is not a whole lot. But for the first 20 minutes
of practice or so, such a restricted set of words may just be good
enough. Then, a new, fourth letter can be added.

Again, software to the rescue. The "t" is a good choice here: adding
it gives 29 new German words. And so on. For details, see
`learning_order.json` (of your chosen language).

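
Choosing the next letter can be sketched the same way. The selection
rule used here, "take the letter that makes the most new words
spellable", is an assumption made for illustration; this README does
not spell out the exact rule behind `learning_order.json`. The function
name and the sample words are made up as well.

```python
from string import ascii_lowercase


def best_next_letter(words, known):
    """Return (letter, gain): the letter adding the most new words.

    A word counts as new if it contains the candidate letter and
    otherwise only letters that are already known.
    """
    known = set(known)
    gains = {}
    for letter in ascii_lowercase:
        if letter in known:
            continue
        allowed = known | {letter}
        gains[letter] = sum(
            1 for word in words if letter in word and set(word) <= allowed
        )
    # On a tie, max() keeps the alphabetically first letter.
    best = max(gains, key=gains.get)
    return best, gains[best]


sample = ["aa", "pass", "spa", "tat", "pasta", "satt", "see", "esse"]
print(best_next_letter(sample, "aps"))  # ('t', 3)
```
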
## Disadvantages

* If you follow these suggestions, you'll learn only the letters and
  have to deal with numbers and other characters separately later.

* The present implementation is based on a dictionary. So odd, funny
  words are practiced. We might want an improvement that takes word
  frequency into account.

# How to run this?

## Installation

Have `docker` installed.

## Preparation

Run, in this directory (that contains this `README.md` file):

```
docker build -t registry.invalid/kmrw:latest .
```

_Beware! There's a `.` at the end of that line. It is needed._

This prepares the software and a spell checking dictionary and packs
it all into the Docker image.

If you want to develop the code and rebuild frequently, it is
convenient to have a local HTTP proxy, in which case you'd use
something like

```
docker build --build-arg=http_proxy=http://172.18.0.1:3128/ -t registry.invalid/kmrw:latest .
```

If you don't know what a local HTTP proxy is, you can safely ignore
this.

### What's this `registry.invalid`?

(Skip this section if you only want the recipe, not the explanation.)

Docker uses a registry, which is a place in the cloud where Docker
images are served (and, more likely than not, the company that
operates it spies on you). Out of the box, the `docker` command uses
`registry-1.docker.io` when you don't mention another one.

I would like a way to say "cut out this registry stuff, this
particular Docker image lives locally on my computer only and goes
nowhere". I would much prefer that a Docker image without a registry
be just that, a Docker image without a registry, but that's not how
the `docker` command has been designed. But I can achieve the same
effect by using a bogus registry name.

The `.invalid` top level domain has been set aside officially to be
just that - invalid. A registry name ending in `.invalid` leads
nowhere. (Technically, DNS will not yield an IP for any hostname in
the `.invalid` domain.)

## Run

This is for the people who want to play with the system and create
their own lesson files.

I run the following in `bash`.

The `docker build ...` command needs to happen in the directory that
has this `README.md` file and, more to the point, the `Dockerfile`.

In contrast, you can run everything that follows in any (one)
directory of your choosing.

Why "(one)"? Later commands generally depend on files produced by
earlier ones. So, for smoothest sailing, don't switch directories in
between calls.

Unless you want to switch languages. One directory per language is an
excellent idea.

Some of the reasonably fast running commands produce no output. This
follows the time-honored "no news is good news" command line
tradition. If you know about exit values, check those.

I run the commands in `bash` on a Linux system. If you are in a
non-Linux environment, you can probably remove the `--user="$(id
--user):$(id --group)"` options. If you remove those, there is some
concern: Both the user inside the Docker container and you on the
outside want access to the files. But then, rumor has it that on
some non-Linux operating systems, the Docker installation handles this
for you automatically. I don't know; I'm no polyglot in my spare time
and generally never leave Linuxland unless I have to.

### All in one swoop

If you are like me and want to run the whole thing in one swoop, for
all three languages supported (which takes some 4.5 hours on my
laptop):

```
docker run --rm -ti --mount=type=bind,src="$(pwd)",dst=/fromhost --user="$(id --user):$(id --group)" --workdir=/fromhost registry.invalid/kmrw:latest invoke all
```

Or you can do stuff for only one language: Replace the `all`
in the above by `de-de`, or `en-gb`, or `en-us`. This cuts the
time needed down to about 1.5 hours on my laptop.

### A word on my invoke tasks

Those invoke tasks tend to be conservative. What one invocation
constructed, the next will not knowingly overwrite. Even `invoke
--list` will show fewer and fewer possible tasks.

If you actually do want to run stuff again, delete the file that was
produced.

### Wordlist generation

This step generates a list of words, only ASCII characters, one word
per line, from the material of spell checking dictionaries.

Those spell checking dictionaries and accompanying software are in the
Docker container we built. Currently it has German, and the two
English flavors GB and US.

The German command line is:

```
docker run --rm -ti --mount=type=bind,src="$(pwd)",dst=/fromhost --user="$(id --user):$(id --group)" --workdir=/fromhost registry.invalid/kmrw:latest invoke de-de.mk-wordlist-de-de
```

For US English, replace the `invoke` target `de-de.mk-wordlist-de-de`
with `en-us.mk-wordlist-en-us`, for British English, with
`en-gb.mk-wordlist-en-gb`.

This spills out a bit of output from the spell check software we use.

If all is well, you'll find a file `wordlist.txt` in the language
sub-directory that contains, one word per line, all words the spell
checker knows that are at most 6 characters long.

Reading words in your head gets much harder with increasing word
length. Hence the restriction to at most 6 characters.

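
The filtering described here (ASCII letters only, at most 6
characters) can be sketched as follows. Several things are assumptions
for the sake of the example: the name `filter_wordlist`, the
lower-casing, the removal of duplicates, and in particular the choice
to simply drop words containing non-ASCII letters. How the real task
treats umlauts and the like is not stated in this section.

```python
import re

_ASCII_WORD = re.compile(r"[a-z]+")


def filter_wordlist(candidates, max_len=6):
    """Keep lower-cased a-z words of at most max_len characters.

    Duplicates are dropped; the order of first appearance is kept.
    """
    seen = set()
    result = []
    for raw in candidates:
        word = raw.strip().lower()
        if len(word) > max_len:
            continue
        if not _ASCII_WORD.fullmatch(word):
            continue  # empty, or contains something other than a-z
        if word not in seen:
            seen.add(word)
            result.append(word)
    return result


raw_words = ["Papa", "spass", "Straße", "funken", "funkgeraet", "papa", ""]
print(filter_wordlist(raw_words))  # ['papa', 'spass', 'funken']
```
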
### Letter count file generation

Next, I use that word list to generate a count, for each letter, of
how many words that letter occurs in. The invoke targets are
`de-de.mk-lettercount-de-de`, `en-gb.mk-lettercount-en-gb`, and
`en-us.mk-lettercount-en-us`.

Each of these results in a plain text file `lettercount.txt`. Have a
look, if you want to.

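
What is being counted can be sketched as below: for each letter, the
number of words it occurs in, so a letter appearing twice in one word
still counts once for that word. The function name is made up, and the
sketch says nothing about the actual layout of `lettercount.txt`.

```python
from collections import Counter


def letter_word_counts(words):
    """For each letter, count the words in which it occurs at least once."""
    counts = Counter()
    for word in words:
        counts.update(set(word))  # set(): one count per word and letter
    return counts


counts = letter_word_counts(["aa", "spa", "pass", "tat"])
print(counts["a"], counts["s"], counts["p"], counts["t"])  # 4 2 2 1

# Rarest letters first, ties broken alphabetically.
rare_first = sorted(counts, key=lambda ch: (counts[ch], ch))
print(rare_first)  # ['t', 'p', 's', 'a']
```
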
### Database generation

Now, the fun stuff with "databases" starts. We generate a database
`letterset2count.kmrw`. (The abbreviation `kmrw` is intended to stand
for "Koch method - real words".) This database contains, for each set
of letters, the number of words that can be built from (spelled with)
those letters.

That file `letterset2count.kmrw` will be 268,435,456 bytes long,
2 ** 26 * 4.

An earlier version took roughly 10 hours on my laptop. After some
optimization, that's down to about 1 hour 20 minutes.

This program tries to keep you entertained by occasionally posting
estimates for when it'll finish. The time printed is (probably) UTC.
With each estimate, it also posts a number, counting down from 2**26 =
67108864. As the hard work is done first, all estimates tend to be
(decreasingly) pessimistic. The initial estimate is outright
ridiculous.

The invoke targets are `de-de.mk-lettercount-de-de`,
`en-gb.mk-lettercount-en-gb`, and `en-us.mk-lettercount-en-us`.

Upon completion, this process leaves a database file
`letterset2count.kmrw` in your current directory. For any possible
choice of a handful of letters from the alphabet a...z, that database
gives you the number of words that can be spelled with only the
letters from your chosen handful.

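
To make the idea of "a count for every possible set of letters"
concrete, here is a sketch on a toy alphabet of four letters, so the
table has 2 ** 4 = 16 entries instead of 2 ** 26. The technique shown
(count the words per exact letter set, then sum over subsets) is one
workable approach and an assumption on my part; it is not a
description of how this project's generator works internally. All
names are made up.

```python
def spellable_counts(words, alphabet):
    """Return (bit, counts).

    bit maps each letter of the alphabet to a power of two.
    counts[mask] is the number of words that can be spelled using
    only the letters whose bits are set in mask.
    """
    bit = {ch: 1 << i for i, ch in enumerate(alphabet)}
    n = len(alphabet)
    counts = [0] * (1 << n)

    # Step 1: count words by the exact set of letters they use.
    for word in words:
        if all(ch in bit for ch in word):
            mask = 0
            for ch in set(word):
                mask |= bit[ch]
            counts[mask] += 1

    # Step 2: sum over subsets, one letter at a time. Afterwards,
    # counts[mask] includes every word whose letter set is contained
    # in mask.
    for i in range(n):
        b = 1 << i
        for mask in range(1 << n):
            if mask & b:
                counts[mask] += counts[mask ^ b]
    return bit, counts


bit, counts = spellable_counts(["aa", "spa", "pass", "tat", "tee"], "apst")
aps = bit["a"] | bit["p"] | bit["s"]
print(counts[aps])                    # 3: aa, spa, pass
print(counts[bit["a"] | bit["t"]])    # 2: aa, tat
print(len(counts))                    # 16
```
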
To be interpreted, that database requires the `lettercount.txt` to be
available as well. The details:

## What's the deal with these database files?

You can skip this section if you are only interested in the results,
not the details.

### Database logic

The databases can be thought of as key/value maps.

The key is a set of letters a-z. Order of letters is ignored, as is
repetition, so "aab" and "ba" come down to the same set of letters,
hence the same key.

The value is simply an integer between 0 (inclusive) and
`2**32` (exclusive): The number of words that can be spelled with just
the letters from the set.

### Database file layout

Every individual letter is associated with a power of 2.

Which power of 2 depends on the number of words containing that
letter, hence on the language. Common letters get higher powers of
two, less common letters get lower ones. To see for yourself, change
to the language directory that has the `lettercount.txt` file and
there, execute this:

```
docker run --rm -ti --mount=type=bind,src="$(pwd)",dst=/fromhost --user="$(id --user):$(id --group)" --workdir=/fromhost registry.invalid/kmrw:latest letter2bitmask.py --lettercount lettercount.txt
```

Given a set of these letters, I simply "bitwise or" the corresponding
powers of two. (If that sounds like a weird or interesting idea to
you, I recommend you research "bitmasks".) (If you don't understand
"bitwise or", you can think "add". As long as the same number isn't
added twice, it boils down to the same thing.)

The result is a number. That number completely determines the set of
letters, hence it serves as the key into the key/value map.

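
The letter-to-bit idea can be sketched in plain Python. The functions
below are hypothetical stand-ins written for this explanation, not the
project's `Letter2Bitmask` class, and the letter counts are invented.
The only properties taken from the text are that rarer letters get
lower powers of two and that a set of letters is encoded by "bitwise
or".

```python
def make_bit_for_letter(letter_counts):
    """Assign powers of two: the rarest letter gets 1, the next 2, ..."""
    rare_first = sorted(letter_counts, key=lambda ch: (letter_counts[ch], ch))
    return {ch: 1 << i for i, ch in enumerate(rare_first)}


def letters_to_number(letters, bit_for_letter):
    """Encode a set of letters (order and repetition ignored) as a number."""
    number = 0
    for ch in set(letters):
        number |= bit_for_letter[ch]
    return number


def number_to_letters(number, bit_for_letter):
    """Decode a number back into the sorted list of its letters."""
    return sorted(ch for ch, b in bit_for_letter.items() if number & b)


# Invented counts: "e" is the most common letter, "k" the rarest.
bits = make_bit_for_letter({"e": 50, "t": 30, "a": 20, "b": 10, "k": 5})
print(bits["k"], bits["b"], bits["a"], bits["t"], bits["e"])  # 1 2 4 8 16
print(letters_to_number("aab", bits))   # 6
print(letters_to_number("ba", bits))    # 6, same set of letters
print(number_to_letters(24, bits))      # ['e', 't']
```
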
If you have a `lettercount.txt` file in your current directory, you
can explore this mapping from sets of letters to numbers and back at
the interactive `python3` prompt. Here is one sample session.

Gory details:

* The stuff following ">>>" or "..." is what I typed at the Python
  prompt. I typed two spaces before `map = ...`, and the next line
  was completely empty. (If you know Python, this will not surprise
  you.)

* You may get different numbers if you use a `lettercount.txt` file
  different from mine. (This particular example was done with the
  en-US wordlist.)

```
andreas@meise:~/amateurfunk/floss/python/koch-method-real-words/en_US
$ docker run --rm -ti --mount=type=bind,src=$(pwd),dst=/fromhost -e HOME=/tmp --user="$(id --user):$(id --group)" --workdir=/fromhost registry.invalid/kmrw:latest python3
Python 3.7.3 (default, Jul 25 2020, 13:03:44)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from letters_rare_first import from_lettercount_file
from letters_rare_first import from_lettercount_file
>>> from letter2bitmask import Letter2Bitmask
from letter2bitmask import Letter2Bitmask
>>> with open('lettercount.txt', 'r') as file:
with open('lettercount.txt', 'r') as file:
...   map = Letter2Bitmask(from_lettercount_file(file))
  map = Letter2Bitmask(from_lettercount_file(file))
...

>>> map.number("kochmethod")
map.number("kochmethod")
34826368
>>> map.chars(34826368)
map.chars(34826368)
['c', 'd', 'e', 'h', 'k', 'm', 'o', 't']
>>> map.number(['c', 'd', 'e', 'h', 'k', 'm', 'o', 't'])
map.number(['c', 'd', 'e', 'h', 'k', 'm', 'o', 't'])
34826368
>>> [map.number(ch) for ch in ['c', 'd', 'e', 'h', 'k', 'm', 'o', 't']]
[map.number(ch) for ch in ['c', 'd', 'e', 'h', 'k', 'm', 'o', 't']]
[16384, 65536, 33554432, 2048, 128, 8192, 1048576, 131072]
>>> 16384 | 65536 | 33554432 | 2048 | 128 | 8192 | 1048576 | 131072
16384 | 65536 | 33554432 | 2048 | 128 | 8192 | 1048576 | 131072
34826368
>>> 16384 + 65536 + 33554432 + 2048 + 128 + 8192 + 1048576 + 131072
16384 + 65536 + 33554432 + 2048 + 128 + 8192 + 1048576 + 131072
34826368
>>> exit(0)
exit(0)
```

Letters common in the language are coded to larger numbers. This
tremendously helps speed up the database generation and also makes
`kmrw` files more easily compressible (with `xz`, `zip`, and the
like).

So, what we now have is a key that still _represents_ a set of
letters, but actually _is_ a mere integer.

The database format is rather simple. Here is how it works:

* Multiply that number by 4 to get a position. E.g., the number
  34826368 would yield a position of 139305472.

* Go to that position in the file, and read the next four bytes.

* Interpret those four bytes as a 32 bit integer - and voilà, that is
  the value.

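
The three steps above can be sketched in Python as follows. The byte
order is an assumption: little-endian is used here, which is what a
typical PC would produce natively, but this README does not state it.
The function name and the toy file are made up for illustration; the
toy file has 8 entries rather than 2 ** 26.

```python
import os
import struct
import tempfile

ENTRY_SIZE = 4  # bytes per value, as described above


def read_count(path, key):
    """Read the 32 bit value stored for the given numeric key."""
    with open(path, "rb") as f:
        f.seek(key * ENTRY_SIZE)      # multiply the key by 4, go there
        data = f.read(ENTRY_SIZE)     # read the next four bytes
    if len(data) != ENTRY_SIZE:
        raise ValueError(f"key {key} is beyond the end of {path}")
    # "<I" means little-endian unsigned 32 bit; byte order is assumed.
    (value,) = struct.unpack("<I", data)
    return value


# Build a toy file holding the values 100..107 and look up key 5.
with tempfile.TemporaryDirectory() as tmp:
    toy = os.path.join(tmp, "toy.kmrw")
    with open(toy, "wb") as f:
        f.write(struct.pack("<8I", 100, 101, 102, 103, 104, 105, 106, 107))
    demo_value = read_count(toy, 5)

print(demo_value)  # 105
```
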
In my example, that value happened to be 270:

```
$ docker run --rm -ti --mount=type=bind,src=$(pwd),dst=/fromhost --user="$(id --user):$(id --group)" --workdir=/fromhost registry.invalid/kmrw:latest /bin/bash -c "dd if=letterset2count.kmrw bs=1 skip=139305472 count=4 | od -td"
4+0 records in
4+0 records out
0000000 270
0000004
4 bytes copied, 0.0213296 s, 0.2 kB/s
```

That can be verified by straightforwardly counting all words in the
`wordlist.txt` file that consist of those letters:

```
$ docker run --rm -ti --mount=type=bind,src=$(pwd),dst=/fromhost --user="$(id --user):$(id --group)" --workdir=/fromhost registry.invalid/kmrw:latest /bin/bash -c "grep -P '^[cdehkmot]+$' wordlist.txt | wc -l"
270
```

So that's what a `.kmrw` file is. (If you know what "memory mapped"
means: Internally in my Python program, it's a memory mapped array of
integers. That's all.)

Don't open a `.kmrw` file with standard tools (unless for you, a hex
viewer or similar qualifies as "standard tool").