Switch to BeautifulSoup for content processing. Outbound rendered content is now provided by the client app. Mark inbound AP HTML content hashtags and mentions. Fix missing href attribute crashing process_text_links.

ap-processing-improvements
Alain St-Denis 2023-07-08 07:34:44 -04:00
parent e94533b222
commit e0993a7f7f
6 changed files with 177 additions and 229 deletions

View file

@@ -22,7 +22,7 @@
* For inbound payload, a cached dict of all the defined AP extensions is merged with each incoming LD context. * For inbound payload, a cached dict of all the defined AP extensions is merged with each incoming LD context.
* Better handle conflicting property defaults by having `get_base_attributes` return only attributes that * Better handle conflicting property defaults by having `get_base_attributes` return only attributes that
are not empty (or bool). This helps distinguishing between `marshmallow.missing` and empty values. are not empty (or bool). This helps distinguish between `marshmallow.missing` and empty values.
* JsonLD document caching now set in `activitypub/__init__.py`. * JsonLD document caching now set in `activitypub/__init__.py`.
@@ -45,6 +45,8 @@
* In fetch_document: if response.encoding is not set, default to utf-8. * In fetch_document: if response.encoding is not set, default to utf-8.
* Fix process_text_links that would crash on `a` tags with no `href` attribute.
## [0.24.1] - 2023-03-18 ## [0.24.1] - 2023-03-18
### Fixed ### Fixed

View file

@@ -4,9 +4,8 @@ Protocols
Currently three protocols are being focused on. Currently three protocols are being focused on.
* Diaspora is considered to be stable with most of the protocol implemented. * Diaspora is considered to be stable with most of the protocol implemented.
* ActivityPub support should be considered as alpha - all the basic * ActivityPub support should be considered as beta - all the basic
things work but there are likely to be a lot of compatibility issues with other ActivityPub things work and we are fixing incompatibilities as they are identified.
implementations.
* Matrix support cannot be considered usable as of yet. * Matrix support cannot be considered usable as of yet.
For example implementations in real life projects check :ref:`example-projects`. For example implementations in real life projects check :ref:`example-projects`.
@@ -69,20 +68,21 @@ Content media type
The following keys will be set on the entity based on the ``source`` property existing: The following keys will be set on the entity based on the ``source`` property existing:
* if the object has an ``object.source`` property: * if the object has an ``object.source`` property:
* ``_media_type`` will be the source media type * ``_media_type`` will be the source media type (only text/markdown is supported).
* ``_rendered_content`` will be the object ``content`` * ``rendered_content`` will be the object ``content``
* ``raw_content`` will be the source ``content`` * ``raw_content`` will be the source ``content``
* if the object has no ``object.source`` property: * if the object has no ``object.source`` property:
* ``_media_type`` will be ``text/html`` * ``_media_type`` will be ``text/html``
* ``_rendered_content`` will be the object ``content`` * ``rendered_content`` will be the object ``content``
* ``raw_content`` will be the object ``content`` run through an HTML2Markdown renderer * ``raw_content`` will be empty
The ``contentMap`` property is processed but content language selection is not implemented yet. The ``contentMap`` property is processed but content language selection is not implemented yet.
For outbound entities, ``raw_content`` is expected to be in ``text/markdown``, For outbound entities, ``raw_content`` is expected to be in ``text/markdown``,
specifically CommonMark. When sending payloads, ``raw_content`` will be rendered via specifically CommonMark. The client applications are expected to provide the
the ``commonmark`` library into ``object.content``. The original ``raw_content`` rendered content for protocols that require it (e.g. ActivityPub).
will be added to the ``object.source`` property. When sending payloads, ``object.contentMap`` will be set to ``rendered_content``
and ``raw_content`` will be added to the ``object.source`` property.
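The inbound side of the mapping above can be sketched as a small helper. The function name and return shape are illustrative only, not the library's internal API; the field names follow this section of the docs:

```python
# Sketch of the inbound content mapping described above. The helper and
# its return shape are illustrative, not the library's actual internals.
def map_inbound_content(obj: dict) -> dict:
    source = obj.get("source") or {}
    if source.get("mediaType") == "text/markdown":
        # object.source present with markdown: use it for raw_content
        return {
            "_media_type": "text/markdown",
            "rendered_content": obj.get("content", ""),
            "raw_content": source.get("content", "").strip(),
        }
    # No markdown source: keep the HTML as-is, leave raw_content empty
    return {
        "_media_type": "text/html",
        "rendered_content": obj.get("content", ""),
        "raw_content": "",
    }
```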
Medias Medias
...... ......
@@ -98,6 +98,19 @@ support from client applications.
For inbound entities we do this automatically by not including received image attachments in For inbound entities we do this automatically by not including received image attachments in
the entity ``_children`` attribute. Audio and video are passed through the client application. the entity ``_children`` attribute. Audio and video are passed through the client application.
Hashtags and mentions
.....................
For outbound payloads, client applications must set the ``class`` attribute of
linkified hashtags/mentions in the rendered content to ``hashtag``/``mention``. These
values are used to build the corresponding ``Hashtag`` and ``Mention`` objects.
For inbound payloads, if a markdown source is provided, hashtags/mentions will be extracted
through the same method used for Diaspora. If only HTML content is provided, the ``a`` tags
will be marked with a ``data-[hashtag|mention]`` attribute (based on the provided Hashtag/Mention
objects) to facilitate the ``href`` attribute modifications client applications might
wish to make. This should ensure links can be replaced regardless of how the HTML is structured.
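The inbound marking step can be illustrated with a minimal sketch, assuming BeautifulSoup and a set of hashtag hrefs collected from the payload's ``Hashtag`` objects (the helper name is hypothetical):

```python
# Minimal sketch of the inbound marking described above. `mark_hashtags`
# is an illustrative name; the hrefs are assumed to come from the
# payload's Hashtag objects.
from bs4 import BeautifulSoup

def mark_hashtags(html: str, hashtag_hrefs: set) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a", href=True):
        if link["href"].lower() in hashtag_hrefs:
            # The data- attribute survives any href rewriting a client
            # application might later perform
            link["data-hashtag"] = link.text.lstrip("#").lower()
    return str(soup)
```

Because the attribute is attached to the ``a`` tag itself, link replacement works regardless of how the surrounding HTML is structured.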
.. _matrix: .. _matrix:
Matrix Matrix

View file

@@ -1,6 +1,7 @@
import copy import copy
import json import json
import logging import logging
import re
import traceback import traceback
import uuid import uuid
from datetime import timedelta from datetime import timedelta
@@ -8,6 +9,7 @@ from typing import List, Dict, Union
from urllib.parse import urlparse from urllib.parse import urlparse
import bleach import bleach
from bs4 import BeautifulSoup
from calamus import fields from calamus import fields
from calamus.schema import JsonLDAnnotation, JsonLDSchema, JsonLDSchemaOpts from calamus.schema import JsonLDAnnotation, JsonLDSchema, JsonLDSchemaOpts
from calamus.utils import normalize_value from calamus.utils import normalize_value
@@ -731,15 +733,19 @@ class Note(Object, RawContentMixin):
_cached_raw_content = '' _cached_raw_content = ''
_cached_children = [] _cached_children = []
_soup = None
signable = True signable = True
def __init__(self, *args, **kwargs): def __init__(self, *args, **kwargs):
self.tag_objects = [] # mutable objects... self.tag_objects = [] # mutable objects...
super().__init__(*args, **kwargs) super().__init__(*args, **kwargs)
self.raw_content # must be "primed" with source property for inbound payloads
self.rendered_content # must be "primed" with content_map property for inbound payloads
self._allowed_children += (base.Audio, base.Video, Link) self._allowed_children += (base.Audio, base.Video, Link)
self._required.remove('raw_content')
self._required += ['rendered_content']
def to_as2(self): def to_as2(self):
self.sensitive = 'nsfw' in self.tags
self.url = self.id self.url = self.id
edited = False edited = False
@@ -767,8 +773,8 @@
def to_base(self): def to_base(self):
kwargs = get_base_attributes(self, keep=( kwargs = get_base_attributes(self, keep=(
'_mentions', '_media_type', '_rendered_content', '_source_object', '_mentions', '_media_type', '_source_object',
'_cached_children', '_cached_raw_content')) '_cached_children', '_cached_raw_content', '_soup'))
entity = Comment(**kwargs) if getattr(self, 'target_id') else Post(**kwargs) entity = Comment(**kwargs) if getattr(self, 'target_id') else Post(**kwargs)
# Plume (and maybe other platforms) send the attributedTo field as an array # Plume (and maybe other platforms) send the attributedTo field as an array
if isinstance(entity.actor_id, list): entity.actor_id = entity.actor_id[0] if isinstance(entity.actor_id, list): entity.actor_id = entity.actor_id[0]
@@ -779,6 +785,7 @@ class Note(Object, RawContentMixin):
def pre_send(self) -> None: def pre_send(self) -> None:
""" """
Attach any embedded images from raw_content. Attach any embedded images from raw_content.
Add Hashtag and Mention objects (the client app must define the class tag/mention property)
""" """
super().pre_send() super().pre_send()
self._children = [ self._children = [
@@ -789,135 +796,128 @@
) for image in self.embedded_images ) for image in self.embedded_images
] ]
# Add other AP objects # Add Hashtag objects
self.extract_mentions() for el in self._soup('a', attrs={'class':'hashtag'}):
self.content_map = {'orig': self.rendered_content} self.tag_objects.append(Hashtag(
self.add_mention_objects() href = el.attrs['href'],
self.add_tag_objects() name = el.text.lstrip('#')
))
if el.text == '#nsfw': self.sensitive = True
# Add Mention objects
mentions = []
for el in self._soup('a', attrs={'class':'mention'}):
mentions.append(el.text.lstrip('@'))
mentions.sort()
for mention in mentions:
if validate_handle(mention):
profile = get_profile(finger=mention)
# only add AP profiles mentions
if getattr(profile, 'id', None):
self.tag_objects.append(Mention(href=profile.id, name='@'+mention))
# some platforms only render diaspora style markdown if it is available
self.source['content'] = self.source['content'].replace(mention, '{' + mention + '}')
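The outbound collection step added in ``pre_send`` can be restated in isolation: the client app marks linkified hashtags/mentions with ``class`` values, and the library reads them back out. This standalone sketch (the helper name is illustrative) shows only the parsing side, without the profile lookups:

```python
# Standalone sketch of the outbound parsing in pre_send above: collect
# hashtag and mention text from class-marked links. Profile resolution
# and tag-object construction are omitted for brevity.
from bs4 import BeautifulSoup

def collect_tags(rendered: str):
    soup = BeautifulSoup(rendered, "html.parser")
    # Calling the soup object is shorthand for find_all
    hashtags = [a.text.lstrip("#") for a in soup("a", attrs={"class": "hashtag"})]
    mentions = sorted(a.text.lstrip("@") for a in soup("a", attrs={"class": "mention"}))
    return hashtags, mentions
```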
def post_receive(self) -> None: def post_receive(self) -> None:
""" """
Make linkified tags normal tags. Mark linkified tags and mentions with a data-{mention, tag} attribute.
""" """
super().post_receive() super().post_receive()
if not self.raw_content or self._media_type == "text/markdown": if self._media_type == "text/markdown":
# Skip when markdown # Skip when markdown
return return
hrefs = [] self._find_and_mark_hashtags()
for tag in self.tag_objects: self._find_and_mark_mentions()
if isinstance(tag, Hashtag):
if tag.href is not missing:
hrefs.append(tag.href.lower())
elif tag.id is not missing:
hrefs.append(tag.id.lower())
# noinspection PyUnusedLocal
def remove_tag_links(attrs, new=False):
# Hashtag object hrefs
href = (None, "href")
url = attrs.get(href, "").lower()
if url in hrefs:
return
# one more time without the query (for pixelfed)
parsed = urlparse(url)
url = f'{parsed.scheme}://{parsed.netloc}{parsed.path}'
if url in hrefs:
return
# Mastodon
rel = (None, "rel")
if attrs.get(rel) == "tag":
return
# Friendica
if attrs.get(href, "").endswith(f'tag={attrs.get("_text")}'):
return
return attrs
self.raw_content = bleach.linkify(
self.raw_content,
callbacks=[remove_tag_links],
parse_email=False,
skip_tags=["code", "pre"],
)
if getattr(self, 'target_id'): self.entity_type = 'Comment' if getattr(self, 'target_id'): self.entity_type = 'Comment'
def add_tag_objects(self) -> None: def _find_and_mark_hashtags(self):
""" hrefs = set()
Populate tags to the object.tag list. for tag in self.tag_objects:
""" if isinstance(tag, Hashtag):
try: if tag.href is not missing:
from federation.utils.django import get_configuration hrefs.add(tag.href.lower())
config = get_configuration() # Some platforms use id instead of href...
except ImportError: elif tag.id is not missing:
tags_path = None hrefs.add(tag.id.lower())
else:
if config["tags_path"]:
tags_path = f"{config['base_url']}{config['tags_path']}"
else:
tags_path = None
for tag in self.tags:
_tag = Hashtag(name=f'#{tag}')
if tags_path:
_tag.href = tags_path.replace(":tag:", tag)
self.tag_objects.append(_tag)
def add_mention_objects(self) -> None: for link in self._soup.find_all('a', href=True):
""" parsed = urlparse(link['href'].lower())
Populate mentions to the object.tag list. # remove the query part, if any
""" url = f'{parsed.scheme}://{parsed.netloc}{parsed.path}'
if len(self._mentions): links = {link['href'].lower(), url}
mentions = list(self._mentions) if links.intersection(hrefs):
mentions.sort() link['data-hashtag'] = link.text.lstrip('#').lower()
for mention in mentions:
if validate_handle(mention): def _find_and_mark_mentions(self):
profile = get_profile(finger=mention) mentions = [mention for mention in self.tag_objects if isinstance(mention, Mention)]
# only add AP profiles mentions hrefs = [mention.href for mention in mentions]
if getattr(profile, 'id', None): # add Mastodon's form
self.tag_objects.append(Mention(href=profile.id, name='@'+mention)) hrefs.extend([re.sub(r'/(users/)([\w]+)$', r'/@\2', href) for href in hrefs])
# some platforms only render diaspora style markdown if it is available for href in hrefs:
self.source['content'] = self.source['content'].replace(mention, '{'+mention+'}') links = self._soup.find_all(href=href)
for link in links:
profile = get_profile_or_entity(fid=link['href'])
if profile:
link['data-mention'] = profile.finger
self._mentions.add(profile.finger)
def extract_mentions(self): def extract_mentions(self):
""" """
Extract mentions from the source object. Extract mentions from the inbound Mention objects.
"""
super().extract_mentions()
if getattr(self, 'tag_objects', None): Also attempt to extract from raw_content if available
#tag_objects = self.tag_objects if isinstance(self.tag_objects, list) else [self.tag_objects] """
for tag in self.tag_objects:
if isinstance(tag, Mention): if self.raw_content:
profile = get_profile_or_entity(fid=tag.href) super().extract_mentions()
handle = getattr(profile, 'finger', None) return
if handle: self._mentions.add(handle)
for mention in self.tag_objects:
if isinstance(mention, Mention):
profile = get_profile_or_entity(fid=mention.href)
handle = getattr(profile, 'finger', None)
if handle: self._mentions.add(handle)
@property @property
def raw_content(self): def rendered_content(self):
if self._soup: return str(self._soup)
if self._cached_raw_content: return self._cached_raw_content content = ''
if self.content_map: if self.content_map:
orig = self.content_map.pop('orig') orig = self.content_map.pop('orig')
if len(self.content_map.keys()) > 1: if len(self.content_map.keys()) > 1:
logger.warning('Language selection not implemented, falling back to default') logger.warning('Language selection not implemented, falling back to default')
self._rendered_content = orig.strip() content = orig.strip()
else: else:
self._rendered_content = orig.strip() if len(self.content_map.keys()) == 0 else next(iter(self.content_map.values())).strip() content = orig.strip() if len(self.content_map.keys()) == 0 else next(iter(self.content_map.values())).strip()
self.content_map['orig'] = orig self.content_map['orig'] = orig
# to allow for posts/replies with medias only.
if not content: content = "<div></div>"
self._soup = BeautifulSoup(content, 'html.parser')
return str(self._soup)
@rendered_content.setter
def rendered_content(self, value):
if not value: return
self._soup = BeautifulSoup(value, 'html.parser')
self.content_map = {'orig': value}
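The ``contentMap`` selection logic in the ``rendered_content`` property above can be summarized in a standalone sketch: the ``orig`` key (set from the payload's plain ``content``) is the fallback whenever zero or several language entries exist, since language selection is not implemented yet. The helper name is illustrative:

```python
# How rendered_content picks a value from contentMap, per the diff above.
# Standalone sketch; `select_rendered` is not a library function.
def select_rendered(content_map: dict) -> str:
    orig = content_map.get("orig", "")
    languages = {k: v for k, v in content_map.items() if k != "orig"}
    if len(languages) == 1:
        # exactly one language entry: use it
        return next(iter(languages.values())).strip()
    # zero or multiple languages: selection not implemented, fall back
    return orig.strip()
```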
@property
def raw_content(self):
if self._cached_raw_content: return self._cached_raw_content
if isinstance(self.source, dict) and self.source.get('mediaType') == 'text/markdown':
self._media_type = self.source['mediaType']
self._cached_raw_content = self.source.get('content').strip()
else:
self._media_type = 'text/html'
self._cached_raw_content = ""
return self._cached_raw_content
if isinstance(self.source, dict) and self.source.get('mediaType') == 'text/markdown':
self._media_type = self.source['mediaType']
self._cached_raw_content = self.source.get('content').strip()
else:
self._media_type = 'text/html'
self._cached_raw_content = self._rendered_content
# to allow for posts/replies with medias only.
if not self._cached_raw_content: self._cached_raw_content = "<div></div>"
return self._cached_raw_content
@raw_content.setter @raw_content.setter
def raw_content(self, value): def raw_content(self, value):
if not value: return if not value: return
@@ -1026,7 +1026,7 @@ class Video(Document, base.Video):
self.actor_id = new_act[0] self.actor_id = new_act[0]
entity = Post(**get_base_attributes(self, entity = Post(**get_base_attributes(self,
keep=('_mentions', '_media_type', '_rendered_content', keep=('_mentions', '_media_type', '_soup',
'_cached_children', '_cached_raw_content', '_source_object'))) '_cached_children', '_cached_raw_content', '_source_object')))
set_public(entity) set_public(entity)
return entity return entity
@@ -1330,14 +1330,16 @@ def extract_and_validate(entity):
entity._source_protocol = "activitypub" entity._source_protocol = "activitypub"
# Extract receivers # Extract receivers
entity._receivers = extract_receivers(entity) entity._receivers = extract_receivers(entity)
# Extract mentions
if hasattr(entity, "extract_mentions"):
entity.extract_mentions()
if hasattr(entity, "post_receive"): if hasattr(entity, "post_receive"):
entity.post_receive() entity.post_receive()
if hasattr(entity, 'validate'): entity.validate() if hasattr(entity, 'validate'): entity.validate()
# Extract mentions
if hasattr(entity, "extract_mentions"):
entity.extract_mentions()
def extract_replies(replies): def extract_replies(replies):

View file

@@ -4,12 +4,13 @@ import re
import warnings import warnings
from typing import List, Set, Union, Dict, Tuple from typing import List, Set, Union, Dict, Tuple
from bs4 import BeautifulSoup
from commonmark import commonmark from commonmark import commonmark
from marshmallow import missing from marshmallow import missing
from federation.entities.activitypub.enums import ActivityType from federation.entities.activitypub.enums import ActivityType
from federation.entities.utils import get_name_for_profile, get_profile from federation.entities.utils import get_name_for_profile, get_profile
from federation.utils.text import process_text_links, find_tags from federation.utils.text import process_text_links, find_elements, find_tags, MENTION_PATTERN
class BaseEntity: class BaseEntity:
@@ -22,6 +23,7 @@ class BaseEntity:
_source_object: Union[str, Dict] = None _source_object: Union[str, Dict] = None
_sender: str = "" _sender: str = ""
_sender_key: str = "" _sender_key: str = ""
_tags: Set = None
# ActivityType # ActivityType
activity: ActivityType = None activity: ActivityType = None
activity_id: str = "" activity_id: str = ""
@@ -205,7 +207,7 @@ class CreatedAtMixin(BaseEntity):
class RawContentMixin(BaseEntity): class RawContentMixin(BaseEntity):
_media_type: str = "text/markdown" _media_type: str = "text/markdown"
_mentions: Set = None _mentions: Set = None
_rendered_content: str = "" rendered_content: str = ""
raw_content: str = "" raw_content: str = ""
def __init__(self, *args, **kwargs): def __init__(self, *args, **kwargs):
@@ -231,59 +233,22 @@ class RawContentMixin(BaseEntity):
images.append((groups[1], groups[0] or "")) images.append((groups[1], groups[0] or ""))
return images return images
@property # Legacy. Keep this until tests are reworked
def rendered_content(self) -> str:
"""Returns the rendered version of raw_content, or just raw_content."""
try:
from federation.utils.django import get_configuration
config = get_configuration()
if config["tags_path"]:
def linkifier(tag: str) -> str:
return f'<a class="mention hashtag" ' \
f' href="{config["base_url"]}{config["tags_path"].replace(":tag:", tag.lower())}" ' \
f'rel="noopener noreferrer">' \
f'#<span>{tag}</span></a>'
else:
linkifier = None
except ImportError:
linkifier = None
if self._rendered_content:
return self._rendered_content
elif self._media_type == "text/markdown" and self.raw_content:
# Do tags
_tags, rendered = find_tags(self.raw_content, replacer=linkifier)
# Render markdown to HTML
rendered = commonmark(rendered).strip()
# Do mentions
if self._mentions:
for mention in self._mentions:
# Diaspora mentions are linkified as mailto
profile = get_profile(finger=mention)
href = 'mailto:'+mention if not getattr(profile, 'id', None) else profile.id
rendered = rendered.replace(
"@%s" % mention,
f'@<a class="h-card" href="{href}"><span>{mention}</span></a>',
)
# Finally linkify remaining URL's that are not links
rendered = process_text_links(rendered)
return rendered
return self.raw_content
@property @property
def tags(self) -> List[str]: def tags(self) -> List[str]:
"""Returns a `list` of unique tags contained in `raw_content`."""
if not self.raw_content: if not self.raw_content:
return [] return
tags, _text = find_tags(self.raw_content) return find_tags(self.raw_content)
return sorted(tags)
def extract_mentions(self): def extract_mentions(self):
if self._media_type != 'text/markdown': return if not self.raw_content:
matches = re.findall(r'@{?[\S ]?[^{}@]+[@;]?\s*[\w\-./@]+[\w/]+}?', self.raw_content)
if not matches:
return return
for mention in matches: mentions = find_elements(
BeautifulSoup(
commonmark(self.raw_content, ignore_html_blocks=True), 'html.parser'),
MENTION_PATTERN)
for ns in mentions:
mention = ns.text
handle = None handle = None
splits = mention.split(";") splits = mention.split(";")
if len(splits) == 1: if len(splits) == 1:
@@ -292,11 +257,12 @@ class RawContentMixin(BaseEntity):
handle = splits[1].strip(' }') handle = splits[1].strip(' }')
if handle: if handle:
self._mentions.add(handle) self._mentions.add(handle)
self.raw_content = self.raw_content.replace(mention, '@'+handle) self.raw_content = self.raw_content.replace(mention, '@' + handle)
class OptionalRawContentMixin(RawContentMixin): class OptionalRawContentMixin(RawContentMixin):
"""A version of the RawContentMixin where `raw_content` is not required.""" """A version of the RawContentMixin where `raw_content` is not required."""
def __init__(self, *args, **kwargs): def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs) super().__init__(*args, **kwargs)
self._required.remove("raw_content") self._required.remove("raw_content")

View file

@@ -123,6 +123,7 @@ class TestShareEntity:
class TestRawContentMixin: class TestRawContentMixin:
@pytest.mark.skip
def test_rendered_content(self, post): def test_rendered_content(self, post):
assert post.rendered_content == """<p>One more test before sleep 😅 This time with an image.</p> assert post.rendered_content == """<p>One more test before sleep 😅 This time with an image.</p>
<p><img src="https://jasonrobinson.me/media/uploads/2020/12/27/1b2326c6-554c-4448-9da3-bdacddf2bb77.jpeg" alt=""></p>""" <p><img src="https://jasonrobinson.me/media/uploads/2020/12/27/1b2326c6-554c-4448-9da3-bdacddf2bb77.jpeg" alt=""></p>"""

View file

@@ -1,11 +1,16 @@
import re import re
from typing import Set, Tuple from typing import Set, List
from urllib.parse import urlparse from urllib.parse import urlparse
import bleach import bleach
from bleach import callbacks from bleach import callbacks
from bs4 import BeautifulSoup
from bs4.element import NavigableString
from commonmark import commonmark
ILLEGAL_TAG_CHARS = "!#$%^&*+.,@£/()=?`'\\{[]}~;:\"’”—\xa0" ILLEGAL_TAG_CHARS = "!#$%^&*+.,@£/()=?`'\\{[]}~;:\"’”—\xa0"
TAG_PATTERN = re.compile(r'(#[\w]+)', re.UNICODE)
MENTION_PATTERN = re.compile(r'(@{?[\S ]?[^{}@]+[@;]?\s*[\w\-./@]+[\w/]+}?)', re.UNICODE)
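For reference, the two new module-level patterns drive tag and mention extraction. The mention pattern accepts both plain fediverse handles and the legacy Diaspora-style curly form; a quick check (patterns copied verbatim from the diff):

```python
import re

# Patterns as introduced in the diff above
TAG_PATTERN = re.compile(r'(#[\w]+)', re.UNICODE)
MENTION_PATTERN = re.compile(r'(@{?[\S ]?[^{}@]+[@;]?\s*[\w\-./@]+[\w/]+}?)', re.UNICODE)

# Inputs the mention pattern is meant to cover:
#   plain handle:   @alice@example.com
#   Diaspora style: @{Alice; alice@example.com}
```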
def decode_if_bytes(text): def decode_if_bytes(text):
@@ -22,67 +27,26 @@ def encode_if_text(text):
return text return text
def find_tags(text: str, replacer: callable = None) -> Tuple[Set, str]: def find_tags(text: str) -> List[str]:
"""Find tags in text. """Find tags in text.
Tries to ignore tags inside code blocks. Ignore tags inside code blocks.
Optionally, if passed a "replacer", will also replace the tag word with the result Returns a set of tags.
of the replacer function called with the tag word.
Returns a set of tags and the original or replaced text.
""" """
found_tags = set() tags = find_elements(BeautifulSoup(commonmark(text, ignore_html_blocks=True), 'html.parser'),
# <br> and <p> tags cause issues in us finding words - add some spacing around them TAG_PATTERN)
new_text = text.replace("<br>", " <br> ").replace("<p>", " <p> ").replace("</p>", " </p> ") return sorted([tag.text.lstrip('#').lower() for tag in tags])
lines = new_text.splitlines(keepends=True)
final_lines = []
code_block = False def find_elements(soup: BeautifulSoup, pattern: re.Pattern) -> List[NavigableString]:
final_text = None for candidate in soup.find_all(string=True):
# Check each line separately if candidate.parent.name == 'code': continue
for line in lines: ns = [NavigableString(r) for r in re.split(pattern, candidate.text)]
final_words = [] candidate.replace_with(*ns)
if line[0:3] == "```": return list(soup.find_all(string=pattern))
code_block = not code_block
if line.find("#") == -1 or line[0:4] == " " or code_block:
# Just add the whole line
final_lines.append(line)
continue
# Check each word separately
words = line.split(" ")
for word in words:
if word.find('#') > -1:
candidate = word.strip().strip("([]),.!?:*_%/")
if candidate.find('<') > -1 or candidate.find('>') > -1:
# Strip html
candidate = bleach.clean(word, strip=True)
# Now split with slashes
candidates = candidate.split("/")
to_replace = []
for candidate in candidates:
if candidate.startswith("#"):
candidate = candidate.strip("#")
if test_tag(candidate.lower()):
found_tags.add(candidate.lower())
to_replace.append(candidate)
if replacer:
tag_word = word
try:
for counter, replacee in enumerate(to_replace, 1):
tag_word = tag_word.replace("#%s" % replacee, replacer(replacee))
except Exception:
pass
final_words.append(tag_word)
else:
final_words.append(word)
else:
final_words.append(word)
final_lines.append(" ".join(final_words))
if replacer:
final_text = "".join(final_lines)
if final_text:
final_text = final_text.replace(" <br> ", "<br>").replace(" <p> ", "<p>").replace(" </p> ", "</p>")
return found_tags, final_text or text
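The new ``find_elements`` helper replaces the hand-rolled line scanner above: it splits every text node on the pattern so each tag becomes its own ``NavigableString``, then collects the matching nodes. Restated in isolation (with an extra guard that also skips ``code`` blocks in the final collection pass, which the diffed version leaves to the caller):

```python
import re
from bs4 import BeautifulSoup
from bs4.element import NavigableString

TAG_PATTERN = re.compile(r'(#[\w]+)', re.UNICODE)

def find_elements(soup: BeautifulSoup, pattern: re.Pattern):
    # Split each text node on the pattern so every match becomes its own
    # NavigableString node, then gather the nodes that match.
    for candidate in soup.find_all(string=True):
        if candidate.parent.name == "code":
            continue  # leave code blocks untouched
        parts = [NavigableString(p) for p in pattern.split(candidate.text)]
        candidate.replace_with(*parts)
    # Extra guard: exclude matches still sitting inside code blocks
    return [s for s in soup.find_all(string=pattern)
            if s.parent.name != "code"]
```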
def get_path_from_url(url: str) -> str: def get_path_from_url(url: str) -> str:
@@ -100,7 +64,7 @@ def process_text_links(text):
def link_attributes(attrs, new=False): def link_attributes(attrs, new=False):
"""Run standard callbacks except for internal links.""" """Run standard callbacks except for internal links."""
href_key = (None, "href") href_key = (None, "href")
if attrs.get(href_key).startswith("/"): if attrs.get(href_key, "").startswith("/"):
return attrs return attrs
# Run the standard callbacks # Run the standard callbacks
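The one-character fix above deserves a note: bleach's linkify callbacks receive an ``attrs`` dict keyed by ``(namespace, name)`` tuples, so an ``a`` tag with no ``href`` makes ``attrs.get((None, "href"))`` return ``None``, and ``None.startswith(...)`` raises ``AttributeError``. A minimal illustration (function names are hypothetical, for demonstration only):

```python
# Illustration of the crash fixed in process_text_links. The function
# names are hypothetical; only the attrs.get default differs.
HREF_KEY = (None, "href")

def is_internal_link_old(attrs: dict) -> bool:
    # Crashes with AttributeError when the tag has no href
    return attrs.get(HREF_KEY).startswith("/")

def is_internal_link_new(attrs: dict) -> bool:
    # Safe: missing href falls back to the empty string
    return attrs.get(HREF_KEY, "").startswith("/")
```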