Lieu an alternative search engine / webring


Lieu

an alternative search engine

Created in response to the environs of apathy concerning the use of hypertext search and discovery. In Lieu, the internet is not what is made searchable, but instead one's own neighbourhood. Put differently, Lieu is a neighbourhood search engine, a way for personal webrings to increase serendipitous connexions.

Goals

Enable serendipitous discovery
Support personal communities
Be reusable, easily

Usage

How to search

For the full search syntax (including how to use site: and -site:), see the search syntax and API documentation. For more tips, read the appendix.
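As a rough illustration of how site: and -site: filters could be separated from free-text terms, here is a minimal Go sketch. The Query type and the parseQuery function are hypothetical names chosen for this example, not Lieu's actual implementation; the search syntax documentation remains the authority on how queries really behave.

```go
package main

import (
	"fmt"
	"strings"
)

// Query is a hypothetical container for a parsed search string.
type Query struct {
	Terms        []string // free-text words
	IncludeSites []string // domains given as site:example.org
	ExcludeSites []string // domains given as -site:example.com
}

// parseQuery splits the input on whitespace and sorts each token into one
// of the three buckets. Illustrative only; Lieu's real parser may differ.
func parseQuery(input string) Query {
	var q Query
	for _, tok := range strings.Fields(input) {
		switch {
		case strings.HasPrefix(tok, "-site:"):
			if d := strings.TrimPrefix(tok, "-site:"); d != "" {
				q.ExcludeSites = append(q.ExcludeSites, d)
			}
		case strings.HasPrefix(tok, "site:"):
			if d := strings.TrimPrefix(tok, "site:"); d != "" {
				q.IncludeSites = append(q.IncludeSites, d)
			}
		default:
			q.Terms = append(q.Terms, tok)
		}
	}
	return q
}

func main() {
	q := parseQuery("gardening site:example.org -site:example.com")
	fmt.Printf("%+v\n", q)
}
```

The -site: case is checked before site: so that an exclusion is never mistaken for a plain term starting with a dash.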

Getting Lieu running

$ lieu help
Lieu: neighbourhood search engine

Commands
- precrawl  (scrapes config's general.url for a list of links: <li> elements containing an anchor <a> tag)
- crawl     (start crawler, crawls all urls in config's crawler.webring file)
- ingest    (ingest crawled data, generates database)
- search    (interactive cli for searching the database)
- host      (hosts search engine over http)

Example:
    lieu precrawl > data/webring.txt
    lieu crawl > data/crawled.txt
    lieu ingest
    lieu host

Lieu's crawl & precrawl commands output to standard output, for easy inspection of the data. You typically want to redirect their output to the files Lieu reads from, as defined in the config file. See below for a typical workflow.

Workflow

Edit the config
Add domains to crawl in config.crawler.webring
  - If you have a webpage with links you want to crawl:
  - Set the config's url field to that page
  - Populate the list of domains to crawl with precrawl: lieu precrawl > data/webring.txt
Crawl: lieu crawl > data/crawled.txt
Create database: lieu ingest
Host engine: lieu host
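The workflow above can also be scripted. The following Go sketch runs the same commands in order and redirects standard output to a file where the shell examples use `>`. The step type and the workflow and run functions are hypothetical helpers written for this example; the file paths are taken from the example config, and the sketch assumes a lieu binary is on your PATH (if it is not, it only prints the plan).

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// step describes one lieu invocation; outFile is empty when the command's
// standard output does not need to be saved.
type step struct {
	args    []string
	outFile string
}

// workflow mirrors the steps listed above. precrawl is only needed when the
// webring list is generated from a page of links rather than curated by hand.
func workflow() []step {
	return []step{
		{args: []string{"precrawl"}, outFile: "data/webring.txt"},
		{args: []string{"crawl"}, outFile: "data/crawled.txt"},
		{args: []string{"ingest"}},
		{args: []string{"host"}}, // serves over http until interrupted
	}
}

// run executes one step, sending stdout to outFile when it is set.
func run(bin string, s step) error {
	cmd := exec.Command(bin, s.args...)
	cmd.Stderr = os.Stderr
	cmd.Stdout = os.Stdout
	if s.outFile != "" {
		f, err := os.Create(s.outFile)
		if err != nil {
			return err
		}
		defer f.Close()
		cmd.Stdout = f
	}
	return cmd.Run()
}

func main() {
	bin, err := exec.LookPath("lieu")
	if err != nil {
		// lieu is not installed here: print the plan instead of running it.
		for _, s := range workflow() {
			line := "lieu " + strings.Join(s.args, " ")
			if s.outFile != "" {
				line += " > " + s.outFile
			}
			fmt.Println(line)
		}
		return
	}
	for _, s := range workflow() {
		if err := run(bin, s); err != nil {
			fmt.Fprintln(os.Stderr, "step failed:", s.args, err)
			os.Exit(1)
		}
	}
}
```

Stopping at the first failed step avoids ingesting a half-written crawl file.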

After ingesting the data with lieu ingest, you can also use lieu to search the corpus in the terminal with lieu search.

Theming

Tweak the theme values of the config, specified below.

Config

The config file is written in TOML.

[general]
name = "Merveilles Webring"
# used by the precrawl command and linked to in /about route
url = "https://webring.xxiivv.com"
# used by the precrawl command to populate the Crawler.Webring file;
# takes simple html selectors. might be a bit wonky :)
webringSelector = "li > a[href]:first-of-type"
port = 10001

[theme]
# colors specified in hex (or valid css names) which determine the theme of the lieu instance
# NOTE: If (and only if) all three values are set lieu uses those to generate the file html/assets/theme.css at startup.
# You can also write directly to that file instead of adding this section to your configuration file
foreground = "#ffffff"
background = "#000000"
links = "#ffffff"

[data]
# the source file should contain the crawl command's output 
source = "data/crawled.txt"
# location & name of the sqlite database
database = "data/searchengine.db"
# contains words and phrases disqualifying scraped paragraphs from being presented in search results
heuristics = "data/heuristics.txt"
# aka stopwords, in the search engine biz: https://en.wikipedia.org/wiki/Stop_word
wordlist = "data/wordlist.txt"

[crawler]
# manually curated list of domains, or the output of the precrawl command
webring = "data/webring.txt"
# domains that are banned from being crawled but might originally be part of the webring
bannedDomains = "data/banned-domains.txt"
# file suffixes that are banned from being crawled
bannedSuffixes = "data/banned-suffixes.txt"
# phrases and words which won't be scraped (e.g. if contained in a link)
boringWords = "data/boring-words.txt"
# domains that won't be output as outgoing links
boringDomains = "data/boring-domains.txt"
# queries to search for finding preview text
previewQueryList = "data/preview-query-list.txt"
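The note in the [theme] section says the stylesheet is generated only when all three colors are set. A minimal Go sketch of that all-or-nothing rule follows. The Theme struct, the themeCSS function, and the CSS variable names are assumptions made for this example; the stylesheet Lieu actually writes to html/assets/theme.css may look different.

```go
package main

import "fmt"

// Theme holds the three values from the [theme] section of the config.
type Theme struct {
	Foreground string
	Background string
	Links      string
}

// themeCSS returns a stylesheet and true only if all three values are set.
// If any value is missing it returns false, and the caller would leave an
// existing theme.css untouched. The variable names are hypothetical.
func themeCSS(t Theme) (string, bool) {
	if t.Foreground == "" || t.Background == "" || t.Links == "" {
		return "", false
	}
	css := fmt.Sprintf(
		":root {\n  --foreground: %s;\n  --background: %s;\n  --links: %s;\n}\n",
		t.Foreground, t.Background, t.Links,
	)
	return css, true
}

func main() {
	full := Theme{Foreground: "#ffffff", Background: "#000000", Links: "#ffffff"}
	if css, ok := themeCSS(full); ok {
		fmt.Print(css)
	}

	partial := Theme{Foreground: "#ffffff"}
	if _, ok := themeCSS(partial); !ok {
		fmt.Println("theme incomplete, nothing generated")
	}
}
```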

For your own use, the following config fields should be customized:

name
url
port
source
webring
bannedDomains

The following config-defined files can stay as-is unless you have specific requirements:

database
heuristics
wordlist
bannedSuffixes
previewQueryList

For a full rundown of the files and their various jobs, see the files description.
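As a small illustration of the roles the config comments give to heuristics and wordlist, the Go sketch below drops paragraphs that contain a disqualifying phrase and removes stopwords from a string. Both function names are hypothetical, the sample entries are invented, and applying the wordlist to a query (rather than at ingest time) is an assumption made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// disqualified reports whether a scraped paragraph contains any of the
// heuristic phrases, compared case-insensitively.
func disqualified(paragraph string, heuristics []string) bool {
	p := strings.ToLower(paragraph)
	for _, h := range heuristics {
		h = strings.ToLower(strings.TrimSpace(h))
		if h != "" && strings.Contains(p, h) {
			return true
		}
	}
	return false
}

// dropStopwords lowercases the input, splits it on whitespace, and keeps
// only the words that are not in the stopword set.
func dropStopwords(text string, stopwords map[string]bool) []string {
	var kept []string
	for _, w := range strings.Fields(strings.ToLower(text)) {
		if !stopwords[w] {
			kept = append(kept, w)
		}
	}
	return kept
}

func main() {
	heuristics := []string{"all rights reserved", "cookie"}
	fmt.Println(disqualified("Copyright 2021. All rights reserved.", heuristics)) // true
	fmt.Println(disqualified("Notes on growing tomatoes.", heuristics))           // false

	stop := map[string]bool{"the": true, "of": true}
	fmt.Println(dropStopwords("The history of webrings", stop)) // [history webrings]
}
```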

Developing

Build a binary:

# this project has an experimental fulltext-search feature, so we need to include sqlite's fts engine (fts5)
go build --tags fts5
# or using go run
go run --tags fts5 .

Create new release binaries:

./release.sh

License

The source code is licensed under AGPL-3.0-or-later. Inter is available under the SIL Open Font License, Version 1.1. Noto Serif is licensed under the Apache License, Version 2.0.