From dec88a368b3592514a4cb9f70245c585ac9cf843 Mon Sep 17 00:00:00 2001
From: geeksville <kevinh@geeksville.com>
Date: Tue, 11 Aug 2020 17:34:49 -0700
Subject: [PATCH] First attempt at better protocol docs.  Bug #308

@cyclomies thank you for the prodding and help.  I'm happy to add more
detail, can you insert a few questions inline?  Then I'll answer and
hopefully that will be enough to be useful for others.
---
 docs/software/mesh-alg.md | 90 ++++++++++++++++++++++++++++++---------
 1 file changed, 70 insertions(+), 20 deletions(-)

diff --git a/docs/software/mesh-alg.md b/docs/software/mesh-alg.md
index f9427cd2..2020dc53 100644
--- a/docs/software/mesh-alg.md
+++ b/docs/software/mesh-alg.md
@@ -1,5 +1,75 @@
 # Mesh broadcast algorithm
 
+## Current algorithm
+
+The routing protocol for Meshtastic is really quite simple (and suboptimal). It is heavily influenced by the mesh routing algorithm used in [Radiohead](https://www.airspayce.com/mikem/arduino/RadioHead/) (which was used in very early versions of this project). It has four conceptual layers.
+
+### A note about protocol buffers
+
+Because we want our devices to work across various vendors and implementations, we use [Protocol Buffers](https://github.com/meshtastic/Meshtastic-protobufs) pervasively. For information on how the protocol buffers are used with respect to API clients, see [sw-design](sw-design.md). For the purposes of this document you mostly only
+need to consider the MeshPacket and SubPacket message types.
+
+### Layer 1: Non reliable zero hop messaging
+
+This layer is conventional non-reliable LoRa packet transmission. The transmitted packet has the following representation on the ether:
+
+- A 32 bit LoRa preamble (to allow receiving radios to synchronize clocks and start framing). We use a longer than minimum (8 bit) preamble to maximize the amount of time the LoRa receivers can stay asleep, which dramatically lowers power consumption.
+
+After the preamble the 16 byte packet header is transmitted. This header is described directly by the PacketHeader class in the C++ source code, and indirectly it matches the first portion of the "MeshPacket" protobuf definition. Notably, this portion of the packet is sent directly as the following 16 bytes (rather than using the protobuf encoding). We do this both to save airtime and to give receiving radio hardware the option of filtering packets before even waking the main CPU.
+
+- to (4 bytes): the unique NodeId of the destination (or 0xffffffff for NodeNum_BROADCAST)
+- from (4 bytes): the unique NodeId of the sender
+- id (4 bytes): the unique (with respect to the sending node only) packet ID number for this packet. We use a large (32 bit) packet ID to ensure there is enough unique state to protect any encrypted payload from attack.
+- flags (4 bytes): Only a few bits are currently used - 3 bits for the "HopLimit" (see below) and 1 bit for "WantAck"
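+
+As a rough illustration of the layout listed above, the 16 byte header can be packed and unpacked as sketched below in Python. This is not the project's PacketHeader class: the little-endian byte order and the positions of the HopLimit and WantAck bits inside flags are assumptions made for this example, and the function names are hypothetical.
+
+```python
+import struct
+
+# Four unsigned 32-bit fields: to, from, id, flags (16 bytes in total).
+# ASSUMPTION: little-endian byte order; the text above does not specify it.
+HEADER = struct.Struct("<IIII")
+
+NODENUM_BROADCAST = 0xFFFFFFFF
+HOP_LIMIT_MASK = 0x07  # ASSUMPTION: HopLimit occupies the low 3 bits of flags
+WANT_ACK_MASK = 0x08   # ASSUMPTION: WantAck is the next bit up
+
+
+def pack_header(to, sender, packet_id, hop_limit, want_ack):
+    """Return the 16 raw header bytes for the given field values."""
+    if not 0 <= hop_limit <= HOP_LIMIT_MASK:
+        raise ValueError("hop_limit must fit in 3 bits")
+    flags = hop_limit | (WANT_ACK_MASK if want_ack else 0)
+    return HEADER.pack(to, sender, packet_id, flags)
+
+
+def unpack_header(raw):
+    """Split the first 16 bytes of a received frame into header fields."""
+    to, sender, packet_id, flags = HEADER.unpack_from(raw)
+    return {
+        "to": to,
+        "from": sender,
+        "id": packet_id,
+        "hop_limit": flags & HOP_LIMIT_MASK,
+        "want_ack": bool(flags & WANT_ACK_MASK),
+    }
+```
+
+Because the fields sit at fixed offsets, a receiver can read `to` from the first four bytes without decoding anything else, which is what makes early filtering possible.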
+
+After the packet header the actual packet is placed onto the wire. These bytes are merely the encrypted packed protobuf encoding of the SubPacket protobuf. A full description of our encryption is available in [crypto](crypto.md). It is worth noting that only this SubPacket is encrypted; headers are not. This leaves open the option of eventually allowing nodes to route packets without knowing the keys used to encrypt.
+
+NodeIds are constructed from the bottom four bytes of the Bluetooth MAC address. Because the OUI is assigned by the IEEE and we currently only support a few CPU manufacturers, the upper byte is de facto guaranteed unique for each vendor. The bottom 3 bytes are guaranteed unique by that vendor.
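+
+A minimal sketch of that derivation in Python follows. The function name and the example address are hypothetical, and reading the four bytes most-significant byte first is an assumption; the firmware may order them differently.
+
+```python
+def node_id_from_mac(mac):
+    """Derive a 32-bit NodeId from a 6-byte MAC address."""
+    if len(mac) != 6:
+        raise ValueError("expected a 6-byte MAC address")
+    # Drop the top two bytes of the OUI and keep the remaining four.
+    # ASSUMPTION: the four bytes are read most-significant byte first.
+    return int.from_bytes(mac[2:], "big")
+
+
+# Hypothetical address, used only for illustration.
+example = bytes.fromhex("aabbccddeeff")
+print(hex(node_id_from_mac(example)))  # prints 0xccddeeff
+```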
+
+To prevent collisions all transmitters will listen before attempting to send. If they hear some other node transmitting, they will reattempt transmission in x milliseconds. This retransmission delay is random between FIXME and FIXME (these two numbers are currently hardwired, but really should be scaled based on expected packet transmission time at current channel settings).
+
+### Layer 2: Reliable zero hop messaging
+
+This layer adds reliable messaging between the node and its immediate neighbors (only).
+
+The default messaging provided by layer-1 is extended by setting the "want-ack" flag in the MeshPacket protobuf. If want-ack is set the following documentation from mesh.proto applies:
+
+"""This packet is being sent as a reliable message, we would prefer it to arrive
+at the destination. We would like to receive a ack packet in response.
+
+Broadcasts messages treat this flag specially: Since acks for broadcasts would
+rapidly flood the channel, the normal ack behavior is suppressed. Instead,
+the original sender listens to see if at least one node is rebroadcasting this
+packet (because naive flooding algorithm). If it hears that the odds (given
+typical LoRa topologies) the odds are very high that every node should
+eventually receive the message. So FloodingRouter.cpp generates an implicit
+ack which is delivered to the original sender. If after some time we don't
+hear anyone rebroadcast our packet, we will timeout and retransmit, using the
+regular resend logic."""
+
+If a transmitting node does not receive an ACK (or a NAK) packet within FIXME milliseconds, it will use layer-1 to attempt a retransmission of the sent packet. A reliable packet (at this 'zero hop' level) will be resent a maximum of three times. If no ACK or NAK has been received by then, the local node will internally generate a NAK (either for local consumption or for use by higher layers of the protocol).
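+
+The resend rule above can be sketched in Python as follows. The `send` and `wait_for_reply` callbacks are hypothetical stand-ins for layer 1, the timeout is left as a parameter because its value is not yet documented, and the sketch reads "resent a maximum of three times" as one initial transmission plus up to three retransmissions.
+
+```python
+MAX_RESENDS = 3  # "resent a maximum of three times"
+
+
+def send_reliable(packet, send, wait_for_reply, timeout_ms):
+    """Send packet, resending until an ack/nak arrives or resends run out.
+
+    send(packet) transmits via layer 1; wait_for_reply(timeout_ms) returns
+    "ack", "nak", or None on timeout. Both are hypothetical callbacks.
+    """
+    # One initial transmission plus up to MAX_RESENDS retransmissions.
+    for _ in range(1 + MAX_RESENDS):
+        send(packet)
+        reply = wait_for_reply(timeout_ms)
+        if reply is not None:
+            return reply
+    # Nothing heard: generate a nak locally for higher layers.
+    return "nak"
+```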
+
+### Layer 3: (Naive) flooding for multi-hop messaging
+
+Given our use-case for the initial release, most of our protocol is built around [flooding](<https://en.wikipedia.org/wiki/Flooding_(computer_networking)>). The implementation is currently 'naive' - i.e. it doesn't try to optimize flooding other than abandoning retransmission once we've seen that a nearby receiver has acked the packet. Therefore, for each source packet up to N retransmissions might occur (if there are N nodes in the mesh).
+
+If any node in the mesh sees a packet on the ether with HopLimit set to a value other than zero, it will decrement that HopLimit and attempt retransmission on behalf of the original sending node.
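+
+A minimal Python sketch of this forwarding rule is shown below. Representing a packet as a dict with a `hop_limit` key is an assumption made for the example, and the function name is hypothetical; it is not how FloodingRouter.cpp is structured.
+
+```python
+def rebroadcast_copy(packet):
+    """Return the packet to retransmit, or None if it must not be forwarded.
+
+    packet is a dict with a "hop_limit" key (a hypothetical representation).
+    """
+    if packet["hop_limit"] == 0:
+        return None
+    forwarded = dict(packet)  # leave the received packet untouched
+    forwarded["hop_limit"] -= 1
+    return forwarded
+```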
+
+### Layer 4: DSR for multi-hop unicast messaging
+
+This layer is not yet fully implemented (and not yet used). But eventually (if we stay with our own transport rather than switching to QMesh or Reticulum)
+we will use conventional DSR (Dynamic Source Routing) for unicast messaging. Currently (even when not requiring 'broadcasts') we send any multi-hop unicasts as 'broadcasts' so that we can
+leverage our (functional) flooding implementation. This is suboptimal, but multi-hop unicast is a very rare use-case, because the odds are high that most nodes (given our small networks and 'hiking' use case) are within a very small number of hops. When any node witnesses an ack for a packet, it will realize that it can abandon its own
+broadcast attempt for that packet.
+
+## Misc notes on remaining tasks
+
+This section is currently poorly formatted; it is mostly a set of todo lists and notes for @geeksville during his initial development. After release 1.0, ideas for future optimization include:
+
+- Make flood-routing less naive (because we have GPS and radio signal strength as heuristics to avoid redundant retransmissions)
+- If nodes have been user marked as 'routers', preferentially do flooding via those nodes
+- Fully implement DSR to improve unicast efficiency (or switch to QMesh/Reticulum as these projects mature)
+
 great source of papers and class notes: http://www.cs.jhu.edu/~cs647/
 
 flood routing improvements
@@ -146,23 +216,3 @@ look into the literature for this idea specifically.
   build the most recent version of reality, and if some nodes are too far, then nodes closer in will eventually forward their changes to the distributed db.
 - construct non ambigious rules for who broadcasts to request db updates. ideally the algorithm should nicely realize node X can see most other nodes, so they should just listen to all those nodes and minimize the # of broadcasts. the distributed picture of nodes rssi could be useful here?
 - possibly view the BLE protocol to the radio the same way - just a process of reconverging the node/msgdb database.
-
-# Old notes
-
-FIXME, merge into the above:
-
-good description of batman protocol: https://www.open-mesh.org/projects/open-mesh/wiki/BATMANConcept
-
-interesting paper on lora mesh: https://portal.research.lu.se/portal/files/45735775/paper.pdf
-It seems like DSR might be the algorithm used by RadioheadMesh. DSR is described in https://tools.ietf.org/html/rfc4728
-https://en.wikipedia.org/wiki/Dynamic_Source_Routing
-
-broadcast solution:
-Use naive flooding at first (FIXME - do some math for a 20 node, 3 hop mesh. A single flood will require a max of 20 messages sent)
-Then move to MPR later (http://www.olsr.org/docs/report_html/node28.html). Use altitude and location as heursitics in selecting the MPR set
-
-compare to db sync algorithm?
-
-what about never flooding gps broadcasts. instead only have them go one hop in the common case, but if any node X is looking at the position of Y on their gui, then send a unicast to Y asking for position update. Y replies.
-
-If Y were to die, at least the neighbor nodes of Y would have their last known position of Y.