Blog

IMULA – a filesystem for IMmUtable LArge files

Five years ago, I led the design of a local filesystem tuned for large objects. Today a sketch of it was published under the name IMULA.

The design came to life when I worked for 9LivesData, a company developing HydraSTOR, a crazy-efficient distributed storage for backups. Back then we needed a filesystem for keeping data locally, so that we could build a distributed filesystem on top of it. Initially, ext3 was used, but its performance was not good enough for us: ext3 fsync()s were too costly, it got fragmented over time and controlling it was a nightmare. So we thought “how hard can it be?” and decided to build our own. It turned out not to be that hard, and it worked really nicely!

I’m happy that it is now partially public and I totally think the article is worth a read.

fetchmail, IMAP IDLE & systemd generators

TL;DR

If you want to spin up a fetchmail process per /etc/fetchmailrc.d/*.config entry, build and install the fetchmail-conf-d package.

The problem

The problem which I wanted to solve was fetching my email from multiple servers with relatively low latency. You might ask why fetch your email in the first place, but that’s out of scope of this post – I just want to use it as an excuse for expressing my admiration for systemd anyway. For years the go-to solution for fetching email was fetchmail.

Design decisions

fetchmail allows you to list multiple accounts and run as a daemon to poll them every once in a while. This polling-style solution is not a pretty one. Fortunately, IMAP has a feature called IDLE, which lets an email client keep a connection open while the server notifies it as soon as a new message arrives. fetchmail supports it, but only for a single account. The obvious solution is to spin up a separate fetchmail for every account, and this is what I decided to use systemd generators for.

Overview

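The solution combines two systemd features: template units and generators. A template is specified just like any other systemd unit file, except that its name ends with a ‘@’ right before the .service suffix. Every instantiation of the template supplies a name which goes after the ‘@’. Generators are scripts or programs which you provide; systemd runs them at boot and on every reload, and here a generator will create the instantiations.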

We’ll just put per-account configuration in /etc/fetchmailrc.d/some_config.config and the script will instantiate the unit file as fetchmail@some_config, and analogously for every other file in /etc/fetchmailrc.d with the suffix .config.

Service template

On Ubuntu 18.04 fetchmail comes with an old-school init.d script, so I had to start from scratch. This is what I came up with:

[Unit]
Description=Fetchmail for %i
After=network-online.target

[Service]
User=root
ExecStart=/usr/bin/fetchmail \
    -d180 \
    -l104857699 \
    -f /etc/fetchmailrc.d/%i.config \
    -N \
    --syslog \
    --sslcertck \
    --pidfile /var/run/fetchmail/%i.pid
Restart=always
RuntimeDirectory=fetchmail
RuntimeDirectoryMode=0750

[Install]
WantedBy=multi-user.target

Notice the use of %i. This is the instance name – it will be substituted by whatever our script generates. Based on this, /etc/fetchmailrc.d/%i.config will be used for configuration and /var/run/fetchmail/%i.pid will be used to store the PID. Other than this, it’s a standard unit file.

Putting this file in a proper place is already enough to run systemctl start fetchmail@myconfig and it will create a service which will try to use /etc/fetchmailrc.d/myconfig.config.

Instantiating the template

Our script has to make sure that multi-user.target depends on fetchmail@something for every file we have in /etc/fetchmailrc.d. That way the fetchmails will be started at boot time. The systemd generator infrastructure allows you to write scripts for exactly that. Here is what I came up with:

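#!/bin/bash
# systemd generator: wants one fetchmail@<name>.service per
# /etc/fetchmailrc.d/<name>.config file.
set -e

# systemd passes three output directories; only the first one is used here.
normal=${1?}
early=${2?}
late=${3?}

log() {
  echo "$@" >&2
}

# Units linked here are pulled in by multi-user.target.
wantdir="$normal/multi-user.target.wants"
mkdir -p "$wantdir"

for config in /etc/fetchmailrc.d/*.config ; do
  # Skip the unexpanded glob when no *.config file exists.
  [ -f "$config" ] || continue
  basename="$(basename "$config")"
  conf_name="${basename%.config}"
  ln -s "/lib/systemd/system/fetchmail@.service" \
      "${wantdir}/fetchmail@${conf_name}.service"
done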

The script basically creates symlinks to the fetchmail@.service unit file in the multi-user.target.wants subdirectory of the directory passed as the first argument. That’s how you fit into systemd’s generator infrastructure.

Putting it all together

All you have to do is put the unit file and the script where they belong. I went for /lib/systemd/system/fetchmail@.service for the unit file because I have put it in a Debian package. If you just want to copy the file, then /etc/systemd/system/ might be a better choice (this is a system service, so the user unit directories won’t do), but remember to update the path in the generator script to match.

Similarly for the generator script: I have put it in /lib/systemd/system-generators/system-fetchmail-generator, but you might find it more convenient to put it in the /etc/systemd/system-generators/ directory.

In order to regenerate the symlinks just run systemctl daemon-reload.

You can even run systemctl status 'fetchmail@*' (quoted, so that the shell doesn’t expand the glob) to see the status of all the running instances.

If you want a ready-made solution, you can get the Debian package source which I prepared, build it by running debuild -us -uc -F and install it by running dpkg -i.

 

C++ structured bindings’ power

I’ve recently been polishing my pet-project (dupa – duplicate analyzer). It uses SQLite3 under the hood, so I figured I’d come up with a really simple C++ wrapper around the official SQLite3 C library. Certainly, it is not a full-fledged product, but I think this is what a modern C++ interface to a small database should look like, because you can unleash the expressiveness of C++ structured bindings.

This is an example snippet from dupa showing what I mean:

for (const auto &[path, cksum, size, mtime] :
     db.Query<std::string, Cksum, off_t, time_t>(
         "SELECT path, cksum, size, mtime FROM FileList")) {
  // do stuff with path, cksum, size, mtime
}

The most important thing to me was to make it possible to specify the types of the columns next to the query string and not repeat those type specifications further. This is still not LINQ-smart, but I think it’s as good as it gets in C++.

Here are examples of different approaches (from SQLiteCpp and sqlite3cc, respectively):

SQLite::Statement query(db,
    "SELECT path, cksum, size, mtime FROM FileList");
while (query.executeStep())
{
  const std::string &path = query.getColumn(0);
  const Cksum &cksum = query.getColumn(1);
  const off_t &size = query.getColumn(2);
  const time_t &mtime = query.getColumn(3);
  // do stuff with path, cksum, size, mtime
}

for(const auto &i : sqlite::query(conn,
    "SELECT path, cksum, size, mtime FROM FileList")) {
  std::string path;
  Cksum cksum;
  off_t size;
  time_t mtime;
  i >> path >> cksum >> size >> mtime;
  // do stuff with path, cksum, size, mtime
}

They respectively overload the cast operators and the stream operators. While this allows you to achieve what I wanted, it doesn’t force you to – you can still accidentally access the same column of a query result as two different types and, at the very least, it makes the code more verbose.

Writing to a database with my library is also C++ish and type-safe (simplified snippet from dupa):

auto out = db.Prepare<uintptr_t, size_t, double>(
    "INSERT INTO EqClass(id, nodes, weight, interesting) "
    "VALUES(?, ?, ?, 0)");
for (const auto &eq_class : classes) {
  out->Write(reinterpret_cast<uintptr_t>(eq_class.get()),
                   eq_class->GetNumNodes(), eq_class->GetWeight());
}

Alternatively, if you’re a fan of functional programming, you can write it that way too:

auto out = db.Prepare<uintptr_t, size_t, double>(
    "INSERT INTO EqClass(id, nodes, weight, interesting) "
    "VALUES(?, ?, ?, 0)");
std::transform(classes.begin(), classes.end(), out->begin(),
               [](const std::unique_ptr<EqClass> &eq_class) {
                 return std::make_tuple(
                     reinterpret_cast<uintptr_t>(eq_class.get()),
                     eq_class->GetNumNodes(), eq_class->GetWeight());
               });
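To give an idea of how the std::transform variant can work, here is a minimal sketch of an output iterator which accepts tuples. It is not the code from dupa: OutStreamSketch, its Iterator and ExampleRows are names made up for this illustration, and the rows are collected in memory instead of being bound to an SQLite statement.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical stand-in for a prepared INSERT statement; it collects rows in
// memory instead of talking to SQLite.
template <typename... Cols>
class OutStreamSketch {
 public:
  using Row = std::tuple<Cols...>;

  // Typed write: one argument per column, checked at compile time.
  void Write(const Cols &...cols) { rows_.emplace_back(cols...); }

  // Output iterator: assigning a tuple through it unpacks it into Write().
  class Iterator {
   public:
    using iterator_category = std::output_iterator_tag;
    using value_type = void;
    using difference_type = std::ptrdiff_t;
    using pointer = void;
    using reference = void;

    explicit Iterator(OutStreamSketch *out) : out_(out) {}
    Iterator &operator*() { return *this; }
    Iterator &operator++() { return *this; }
    Iterator operator++(int) { return *this; }
    Iterator &operator=(const Row &row) {
      std::apply([this](const Cols &...cols) { out_->Write(cols...); }, row);
      return *this;
    }

   private:
    OutStreamSketch *out_;
  };

  Iterator begin() { return Iterator(this); }
  const std::vector<Row> &rows() const { return rows_; }

 private:
  std::vector<Row> rows_;
};

// Usage in both styles, with ints and strings instead of EqClass.
inline std::vector<std::tuple<int, std::string>> ExampleRows() {
  OutStreamSketch<int, std::string> out;
  out.Write(1, "one");
  const std::vector<int> ids = {2, 3};
  std::transform(ids.begin(), ids.end(), out.begin(), [](int id) {
    return std::make_tuple(id, std::to_string(id));
  });
  return out.rows();
}
```

The point of the sketch is that Write() and the iterator share one list of column types, so a tuple of the wrong shape fails to compile in either style.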

How is it achieved? The cornerstone of this SQLite wrapper is the decision to bind the column types to the input and output streams of the database. It is achieved by DBInStream and DBOutStream being variadically templated by the column types. These streams are created on Query and Prepare invocations. A straightforward implication of this decision is that an iterator over DBInStream has to return a tuple typed the same way as DBInStream. The rest is just C++17 awesomeness. There are some gory details, obviously. Take a look at src/db_lib* in dupa and if you have questions, I’ll happily reply.
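The reading side can be sketched in a similar, heavily simplified way. Again, this is an illustration under assumptions rather than dupa’s implementation: InStreamSketch, FakeDb, FromCell and TotalSize are hypothetical names, and the "database" is a table of text cells held in memory, so that the example runs without SQLite.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>
#include <type_traits>
#include <utility>
#include <vector>

using RawRow = std::vector<std::string>;  // one untyped result row

// Converts one raw cell to the requested column type (hypothetical helper;
// a real wrapper would call the sqlite3_column_* function for that type).
template <typename T>
T FromCell(const std::string &cell) {
  if constexpr (std::is_same_v<T, std::string>) {
    return cell;
  } else if constexpr (std::is_floating_point_v<T>) {
    return static_cast<T>(std::stod(cell));
  } else {
    return static_cast<T>(std::stoll(cell));
  }
}

template <typename... Cols>
class InStreamSketch {
 public:
  using Row = std::tuple<Cols...>;

  class Iterator {
   public:
    Iterator(const std::vector<RawRow> *raw, std::size_t pos)
        : raw_(raw), pos_(pos) {}
    // Dereferencing yields a tuple typed exactly like the stream.
    Row operator*() const {
      return Convert((*raw_)[pos_], std::index_sequence_for<Cols...>{});
    }
    Iterator &operator++() { ++pos_; return *this; }
    bool operator!=(const Iterator &o) const { return pos_ != o.pos_; }

   private:
    // Expands the column types and the column indices in lockstep.
    template <std::size_t... I>
    static Row Convert(const RawRow &raw, std::index_sequence<I...>) {
      return Row(FromCell<Cols>(raw.at(I))...);
    }
    const std::vector<RawRow> *raw_;
    std::size_t pos_;
  };

  explicit InStreamSketch(const std::vector<RawRow> *raw) : raw_(raw) {}
  Iterator begin() const { return Iterator(raw_, 0); }
  Iterator end() const { return Iterator(raw_, raw_->size()); }

 private:
  const std::vector<RawRow> *raw_;
};

// Hypothetical database holding a single table; the SQL text is ignored.
class FakeDb {
 public:
  explicit FakeDb(std::vector<RawRow> table) : table_(std::move(table)) {}
  template <typename... Cols>
  InStreamSketch<Cols...> Query(const std::string & /*sql*/) const {
    return InStreamSketch<Cols...>(&table_);
  }

 private:
  std::vector<RawRow> table_;
};

// Usage: the column types are named once, next to the query string.
inline long long TotalSize() {
  std::vector<RawRow> table = {{"a.jpg", "10"}, {"b.jpg", "32"}};
  const FakeDb db(std::move(table));
  long long total = 0;
  for (const auto &[path, size] :
       db.Query<std::string, long long>("SELECT path, size FROM FileList")) {
    if (!path.empty()) total += size;
  }
  return total;
}
```

Because operator* returns std::tuple<Cols...>, the range-based for loop can unpack each row with structured bindings and no column is ever read as two different types.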

If you’re interested in moving this database library forward, please let me know and I’ll rip it out and start a new project with it.

dupa – duplicate analyzer

I’ve finally managed to polish and publish my pet-project for cleaning my home directory of duplicates. It’s here: https://github.com/dopiera/dupa. It has actually proven useful to me multiple times, so I thought I’d share it more broadly.

I bet that I am not the only person who has repeatedly downloaded photos from a camera or phone without removing them from the device, and so downloaded all of them again every time I wanted only the newest ones. I also bet I’m not the only person in the world who has copied data between computers and ended up with two mostly similar data sets. This tool helped me get out of this situation.

It works by computing hashes from files and then uses some heuristics to find similar directories or directories which contain mostly duplicates of files scattered elsewhere (think of a big dump of photos, most of which are in other directories sorted by your trips).
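As a rough illustration of the "mostly duplicates of files scattered elsewhere" idea, here is a small sketch. It is an assumption about what such a heuristic could look like, not the one dupa implements: FileEntry, DirOf, DuplicatedFraction and SampleFiles are made-up names, the checksums are taken as already computed, and only files placed directly in a directory are counted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

using Cksum = std::uint64_t;  // assumption: a checksum fits in 64 bits

struct FileEntry {
  std::string path;
  Cksum cksum;
};

// Directory part of a '/'-separated path ("" if there is none).
inline std::string DirOf(const std::string &path) {
  const auto slash = path.rfind('/');
  return slash == std::string::npos ? std::string() : path.substr(0, slash);
}

// Fraction of the files directly in `dir` whose content (by checksum) also
// occurs in some other directory. Returns 0 for a directory with no files.
inline double DuplicatedFraction(const std::vector<FileEntry> &files,
                                 const std::string &dir) {
  // For every checksum, remember the set of directories holding it.
  std::map<Cksum, std::set<std::string>> dirs_by_cksum;
  for (const auto &f : files) dirs_by_cksum[f.cksum].insert(DirOf(f.path));

  std::size_t in_dir = 0;
  std::size_t duplicated = 0;
  for (const auto &f : files) {
    if (DirOf(f.path) != dir) continue;
    ++in_dir;
    // The set always contains `dir` itself, so a second entry means that a
    // copy lives in another directory.
    if (dirs_by_cksum[f.cksum].size() > 1) ++duplicated;
  }
  return in_dir == 0 ? 0.0 : static_cast<double>(duplicated) / in_dir;
}

// A photo dump, most of which is also sorted into per-trip directories.
inline std::vector<FileEntry> SampleFiles() {
  return {{"dump/1.jpg", 1},       {"dump/2.jpg", 2},
          {"dump/3.jpg", 3},       {"dump/4.jpg", 4},
          {"trips/rome/a.jpg", 1}, {"trips/oslo/b.jpg", 2},
          {"trips/oslo/c.jpg", 3}};
}
```

With the sample above, three of the four files in dump have a copy in one of the trip directories, so dump would be a good candidate for a closer look.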

The code is available on GitHub: https://github.com/dopiera/dupa. Help yourselves. It actually has a man page, so you can read there how to use it and how it works.