September 01, 2016

URL: https://bitbucket.org/wolfpld/usenetarchive

The Usenet Archive Toolkit (UAT) project aims to provide a set of tools to process various sources of usenet messages into a coherent, searchable archive.

Motivation

Usenet is dead. You may believe it's not, but it really is.

People went away to various forums, facebooks and twitters and seem fine there. Meanwhile, the old discussions slowly rot away. Google Groups is a sad, unusable joke. The Archive.org dataset, at least with regard to Polish usenet archives, is vastly incomplete. There is no easy way to get the data, browse it, or search it. So, maybe something needs to be done. How hard can it be anyway? (Not very: one month for a working prototype, another one for polish and bugfixing.)

Advantages

Why use UAT? Why not use existing solutions, like Google Groups, archives from archive.org or NNTP servers with long history?
  • UAT is designed for offline work. You don't need a network connection to access data in "the cloud". You don't need to wait for a reply to your query, or, god forbid, endure "web 2.0" interfaces.
  • UAT archives won't suddenly disappear. You have them on your disk. Google Groups deteriorates with each new iteration of the interface. Also, Google is known for shutting down services it no longer considers viable: Google Reader, Google Code Search, Google Code, etc. Other, smaller services are one disk crash away from completely disappearing from the network.
  • UAT archive format is designed for fast access and efficient search. Each message is individually compressed, to facilitate instant access, but uses a whole-archive dictionary for better compression. Search is achieved through a database similar in design to the one described in Google's original paper. Total archive size is smaller than the uncompressed collection of messages.
  • Multiple message sources may be merged into a single UAT archive, without message duplication. This way you can fill the blanks in source A (eg. NNTP archive server) with messages from source B (eg. much smaller archive.org dump). Archives created this way are the most complete collection of messages available.
  • UAT archives do not contain duplicate messages (which is common even on NNTP servers), nor stray messages from other groups (archive.org collections contain many bogus messages).
  • Other usenet archives are littered with spam messages. UAT can filter out spam, making previously unreadable newsgroups a breeze to read. A properly trained spam database has very low false positive and false negative rates.
  • All messages are transcoded to UTF-8, so that dumb clients may be used for display. UAT tries very hard to properly decode broken and/or completely invalid headers, messages without specified encoding or with bad encoding. HTML parts of messages are removed. You also don't need to worry about parsing quoted-printable content (most likely malformed). And don't forget about search. Have fun grepping that base64 encoded message without UAT.
  • UAT archives contain a precalculated message connectivity graph, which removes the need to parse "references" headers (often broken), sort messages by date, etc. UAT can also "restore" missing connectivity that is not indicated in message headers, by searching for quoted text in other messages.
  • Access to archives is available through a trivial libuat interface.
  • UAT archives are mapped to memory and 100% disk backed. In high memory pressure situations archive pages may just be purged away and later reloaded on demand. No memory allocations are required during normal libuat operation, other than:
    • A small, static, growing buffer into which a single message is decompressed.
    • std::vectors used during search operation.
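To make the "individually compressed, but with a whole-archive dictionary" point concrete, here is a minimal sketch. It is not UAT's code: UAT uses zstd with a trained dictionary, while this example swaps in the preset-dictionary feature of zlib from the Python standard library, and the dictionary contents, `pack`, `unpack` and `make_message` are made up for illustration. The idea is the same: every message can be decompressed on its own, yet text shared between messages is paid for only once.

```python
import zlib

# Hypothetical shared dictionary: bytes that recur in many messages.
# UAT trains a zstd dictionary; this sketch hand-picks one for zlib instead.
SHARED_DICT = (
    b"Path: news.example.invalid!not-for-mail\n"
    b"Newsgroups: pl.comp.lang.c\n"
    b"Content-Type: text/plain; charset=ISO-8859-2\n"
    b"Content-Transfer-Encoding: 8bit\n"
)

def pack(message: bytes, dictionary: bytes = SHARED_DICT) -> bytes:
    """Compress one message on its own, primed with the shared dictionary."""
    compressor = zlib.compressobj(level=9, zdict=dictionary)
    return compressor.compress(message) + compressor.flush()

def unpack(blob: bytes, dictionary: bytes = SHARED_DICT) -> bytes:
    """Decompress one message without touching any other message."""
    decompressor = zlib.decompressobj(zdict=dictionary)
    return decompressor.decompress(blob) + decompressor.flush()

def make_message(n: int) -> bytes:
    """Build a toy message that repeats the common header lines."""
    return SHARED_DICT + b"Subject: test %d\n\nBody of message %d.\n" % (n, n)

messages = [make_message(n) for n in range(3)]
archive = [pack(m) for m in messages]  # one independent blob per message
```

Fetching message 1 is just `unpack(archive[1])`; nothing before or after it has to be read. For messages that share a lot of boilerplate, the dictionary-primed blobs come out smaller than plain per-message compression.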

Toolkit description

UAT provides a multitude of utilities, each specialized for its own task. You can find a brief description of each one below.

Import Formats

Usenet messages may be retrieved from a number of different sources. Currently we support:
  • import-source-slrnpull --- Import from a directory where each file is a separate message (slrnpull was chosen because of the extra-simple setup required to get it working).
  • import-source-slrnpull-7z --- Import from a slrnpull directory packed into a single 7z file.
  • import-source-mbox --- Import from an mbox file, in which all posts are merged into a single file. Archive.org keeps its collection of usenet messages in this format.
Imported messages are stored in a per-message LZ4 compressed meta+payload database.
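As an illustration of what an mbox import has to do, the sketch below splits an mbox file into separate messages with the `mailbox` module from the Python standard library. It only shows the shape of the input; it is not how import-source-mbox is implemented. The sample data, `read_mbox` and `demo` are hypothetical.

```python
import mailbox
import os
import tempfile

# Two toy posts in mbox format: each starts with a "From " separator line.
SAMPLE_MBOX = (
    b"From alice@example.invalid Thu Sep  1 10:00:00 2016\n"
    b"Message-ID: <1@example.invalid>\n"
    b"Subject: first\n"
    b"\n"
    b"body one\n"
    b"\n"
    b"From bob@example.invalid Thu Sep  1 11:00:00 2016\n"
    b"Message-ID: <2@example.invalid>\n"
    b"Subject: second\n"
    b"\n"
    b"body two\n"
)

def read_mbox(path):
    """Return (message id, payload) pairs for every post in an mbox file."""
    box = mailbox.mbox(path, create=False)
    try:
        return [(msg["Message-ID"], msg.get_payload()) for msg in box]
    finally:
        box.close()

def demo():
    """Write the sample to a temporary file and split it into messages."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.mbox")
        with open(path, "wb") as handle:
            handle.write(SAMPLE_MBOX)
        return read_mbox(path)
```

Calling `demo()` yields one entry per post, which is the per-message granularity the rest of the toolkit works with.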

Data Processing

Raw imported messages have to be processed to be of any use. We provide the following utilities:
  • extract-msgid --- Extracts the unique identifier of each message and builds a reference table for fast access to any message through its ID.
  • extract-msgmeta --- Extracts "From" and "Subject" fields, as a quick reference for archive browsers.
  • merge-raw --- Merges two imported data sets into one. Does not duplicate messages.
  • utf8ize --- Converts messages to a common character encoding, UTF-8.
  • connectivity --- Calculates the connectivity graph of messages. Also parses the "Date" field, as it's required for chronological sorting.
  • threadify --- Some messages do not have connectivity data embedded in headers, which is a common artifact of using news-email gateways. This tool parses top-level messages, looking for quotations, then it searches other messages for these quotes and creates (not restores! it was never there!) missing connectivity between children and parents.
  • repack-zstd --- Builds a common dictionary for all messages and recompresses them to a zstd meta+payload+dict database.
  • repack-lz4 --- Converts zstd database to LZ4 database.
  • package --- Packages all databases into a single file. Supports unpacking.
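The connectivity step can be pictured with a small sketch. It assumes each message has already been reduced to a message ID, a list of IDs from its "References" header (oldest first) and a parsed date; the dictionary layout and the function name `build_connectivity` are invented here, and the actual on-disk format of the conn database is not described in this document.

```python
def build_connectivity(messages):
    """Compute parent, children and top-level lists from References data.

    messages: list of dicts with keys 'msgid', 'references' and 'date'.
    Returns (parent, children, toplevel), all expressed as list indices.
    """
    # Message ID to record number, the kind of table extract-msgid provides.
    index = {m["msgid"]: i for i, m in enumerate(messages)}
    parent = [-1] * len(messages)
    children = [[] for _ in messages]

    for i, msg in enumerate(messages):
        # The last reference is the direct parent; fall back to older
        # ancestors when the direct parent is missing from the archive.
        for ref in reversed(msg["references"]):
            j = index.get(ref)
            if j is not None and j != i:
                parent[i] = j
                children[j].append(i)
                break

    # Chronological order is why the "Date" field has to be parsed.
    for kids in children:
        kids.sort(key=lambda k: messages[k]["date"])
    toplevel = sorted(
        (i for i, p in enumerate(parent) if p == -1),
        key=lambda k: messages[k]["date"],
    )
    return parent, children, toplevel


sample = [
    {"msgid": "<a>", "references": [], "date": 1},
    {"msgid": "<b>", "references": ["<a>"], "date": 3},
    {"msgid": "<c>", "references": ["<a>", "<missing>"], "date": 2},
    {"msgid": "<d>", "references": ["<a>", "<b>"], "date": 4},
]
parent, children, toplevel = build_connectivity(sample)
```

In the sample, `<c>` names a parent that is not in the archive, so it is attached to the nearest known ancestor `<a>`. Messages with no usable references at all stay top-level, which is the case threadify tries to improve by matching quoted text.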

Data Filtering

Raw data right after import is highly unfit for direct use. Messages are duplicated, there's spam. These utilities help clean it up:
  • kill-duplicates --- Removes duplicate messages. It is relatively rare, but data sets from even a single NNTP server may contain the same message twice.
  • filter-newsgroups --- Some data sources (eg. Archive.org's giganews collection) contain messages that were not sent to the collection's newsgroup. This utility will remove such bogus messages.
  • filter-spam --- Learns which messages look like spam and removes them.
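A rough sketch of the first two filters, assuming messages are already parsed into dictionaries with `msgid` and `newsgroups` fields (hypothetical names, as is the function `clean`). The real tools operate on the LZ4 database, and spam filtering needs a trained model, so it is left out here.

```python
def clean(messages, group):
    """Drop messages not posted to `group` and repeated message IDs."""
    seen = set()
    kept = []
    for msg in messages:
        # The "Newsgroups" header is a comma-separated list of group names.
        posted_to = [name.strip() for name in msg["newsgroups"].split(",")]
        if group not in posted_to:
            continue  # bogus message from some other group
        if msg["msgid"] in seen:
            continue  # same message stored twice
        seen.add(msg["msgid"])
        kept.append(msg)
    return kept


raw = [
    {"msgid": "<1>", "newsgroups": "pl.comp.lang.c"},
    {"msgid": "<1>", "newsgroups": "pl.comp.lang.c"},
    {"msgid": "<2>", "newsgroups": "pl.rec.rowery"},
    {"msgid": "<3>", "newsgroups": "pl.comp.os.linux, pl.comp.lang.c"},
]
filtered = clean(raw, "pl.comp.lang.c")
```

The same message-ID check is what lets two sources be merged without duplication: concatenate both inputs and keep the first copy of each ID.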
Search in the archive is performed with the help of a word lexicon. The following tools are used for its preparation:
  • lexicon --- Build a list of words and hit-tables for each word.
  • lexopt --- Optimize lexicon string database.
  • lexstats --- Display lexicon statistics.
  • lexdist --- Calculate distances between words (unused).
  • lexhash --- Prepare lexicon hash table.
  • lexsort --- Sort lexicon data.
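The lexicon idea, a word list with a hit table per word, can be sketched as a plain inverted index. This is a simplified illustration under assumptions: the tokenizer, the function names `build_lexicon` and `search`, and the hit format (bare message numbers) are invented, and UAT's actual hit tables may record more than this.

```python
import re
from collections import defaultdict

WORD = re.compile(r"\w+")

def build_lexicon(messages):
    """Map each lowercased word to the sorted list of messages containing it."""
    lexicon = defaultdict(list)
    for number, text in enumerate(messages):
        for word in set(w.lower() for w in WORD.findall(text)):
            lexicon[word].append(number)  # numbers arrive in ascending order
    return dict(lexicon)

def search(lexicon, query):
    """Return numbers of messages that contain every word of the query."""
    words = [w.lower() for w in WORD.findall(query)]
    if not words:
        return []
    hits = set(lexicon.get(words[0], []))
    for word in words[1:]:
        hits &= set(lexicon.get(word, []))
    return sorted(hits)


posts = [
    "Usenet is dead",
    "Long live usenet archives",
    "dead links everywhere",
]
lexicon = build_lexicon(posts)
```

A query never scans message bodies; it only intersects the hit lists of its words, which is why the lexicon has to be rebuilt whenever message numbering changes.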

Data Access

These tools provide access to archive data:
  • query-raw --- Implements queries on LZ4 database. Requires results of extract-msgid utility. Supports:
    • Message count.
    • Listing of message identifiers.
    • Query message by identifier.
    • Query message by database record number.
  • libuat --- Archive access library. Operates on zstd database.
  • query --- Testbed for libuat. Exposes all provided functionality.

End-user Utilities

  • browser --- Graphical browser of archives.


Future work ideas

Here are some viable ideas that I'm not really planning to do any time soon, but which would be nice to have:
  • Implement a message extractor, for example to mbox format. Would need to properly encode headers and add content encoding information (UTF-8 everywhere).
  • Implement a read-only NNTP server. Would need to properly encode headers and add content encoding information. 7-bit cleanness probably would be nice, so also encode as quoted-printable. Some headers may need to be rewritten (eg. "Lines", which most probably won't be true, due to MIME processing). Message sorting by date may be necessary to put some sense into internal message numbers, which currently have no meaning at all.
  • Implement a pan-group search mechanism.
  • Query Google Groups for missing messages present in "references" headers.

Workflow

Usenet Archive Toolkit operates on a couple of distinct databases. Each utility requires a specific set of these databases and either produces its own database, or creates a completely new database indexing schema, which invalidates the rest of the databases.

slrnpull directory → import-source-slrnpull → produces: LZ4
slrnpull compressed → import-source-slrnpull-7z → produces: LZ4
mbox file → import-source-mbox → produces: LZ4
LZ4 → kill-duplicates → produces: LZ4
LZ4 → extract-msgid → adds: msgid
LZ4, msgid → connectivity → adds: conn
LZ4, conn → filter-newsgroups → produces: LZ4
LZ4, msgid, conn, str → filter-spam → produces: LZ4
LZ4 → extract-msgmeta → adds: str
(LZ4, msgid) + (LZ4, msgid) → merge-raw → produces: LZ4
LZ4 → utf8ize → produces: LZ4
LZ4 → repack-zstd → adds: zstd
zstd → repack-lz4 → adds: LZ4
LZ4, conn → lexicon → adds: lex
lex → lexopt → modifies: lex
lex → lexhash → adds: lexhash
lex → lexsort → modifies: lex
lex → lexdist → adds: lexdist (unused)
lex → lexstats → user interaction
LZ4, msgid → query-raw → user interaction
zstd, msgid, conn, str, lex, lexhash → libuat → user interaction
everything but LZ4 → package → one file archive
everything but LZ4 → threadify → modifies: conn, invalidates: lex, lexhash
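Read as a dependency table, the list above can be checked mechanically. The sketch below is a hypothetical helper, not part of UAT: it encodes only the steps that add a database (as read from the table) and verifies that a chosen order of utilities never runs one before its inputs exist. Steps that produce a fresh LZ4 database and thereby invalidate the others are deliberately not modelled.

```python
# utility name: (databases required, databases added)
STEPS = {
    "extract-msgid": ({"LZ4"}, {"msgid"}),
    "connectivity": ({"LZ4", "msgid"}, {"conn"}),
    "extract-msgmeta": ({"LZ4"}, {"str"}),
    "lexicon": ({"LZ4", "conn"}, {"lex"}),
    "lexhash": ({"lex"}, {"lexhash"}),
    "repack-zstd": ({"LZ4"}, {"zstd"}),
}

# What libuat needs before an archive can be queried.
LIBUAT_NEEDS = {"zstd", "msgid", "conn", "str", "lex", "lexhash"}

def check_pipeline(start, tools):
    """Run through `tools` in order and return the databases available after.

    Raises ValueError if a utility is scheduled before its inputs exist.
    """
    available = set(start)
    for tool in tools:
        requires, adds = STEPS[tool]
        missing = requires - available
        if missing:
            raise ValueError(f"{tool} is missing: {sorted(missing)}")
        available |= adds
    return available


order = [
    "extract-msgid",
    "connectivity",
    "extract-msgmeta",
    "lexicon",
    "lexhash",
    "repack-zstd",
]
databases = check_pipeline({"LZ4"}, order)
```

With this order `databases` covers everything in `LIBUAT_NEEDS`; moving connectivity ahead of extract-msgid raises an error instead.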

Additional, optional information files, not created by any of the above utilities, but used in user-facing programs:
  • name --- Group name.
  • desc_short --- A short description about the purpose of the group (per 7.6.6 in RFC 3977).
  • desc_long --- Group charter. (Some newsgroups regularly post a description to the group that describes its intention. These descriptions are posted by the people involved with the newsgroup creation and/or administration. If the group has such a description, it almost always includes the word "charter", so you can quickly find it by searching the newsgroup for that word. A charter is the "set of rules and guidelines" which supposedly govern the users of that group.)

Notes

Be advised that some utilities (repack-zstd, lexicon) require enormous amounts of memory. Processing large groups (eg. 2 million messages, 3 GB of data) will swap heavily on a 16 GB machine.

utf8ize doesn't compile on MSVC. Either compile it on cygwin, or have fun banging glib and gmime into submission. Your choice.

UAT only works on 64-bit machines.

License

GNU AGPL.