Pseudonymization in 21.05: The new feature you'll want to learn to spell and pronounce

by Andrew Fuerste-Henry on Sep 1, 2021

This new feature was actually added in Koha version 20.11, but most ByWater partner libraries will see it for the first time when they get upgraded to 21.05. You can see the bug for this feature here.

Privacy and Statistics

Within libraries, we perpetually balance our patrons' right to privacy against our organizational need for usage data. The best way to ensure patron data is not compromised is to not keep it in the first place, but we need some level of data to ensure we know where our collection is and who's responsible for it. Further, we want to maintain some broader statistical data about which types of items get checked out when so we can make larger planning decisions.

Koha manages that tension with its use of the issues/old_issues and statistics tables. As items get checked out, that transaction is recorded in the issues table:

This records the itemnumber of the item and the borrowernumber of the patron, thereby linking the item and patron. If this patron has their reading history retention set to Never, as soon as they return this item we move this checkout from the issues table to the old_issues table and, in doing so, we replace the actual patron's borrowernumber with the borrowernumber for the "anonymous patron." That's literally just a patron in your system that's named "Anonymous" and gets credit for all the checkouts done for patrons who don't want their history remembered. Here's that same checkout after the item was returned:
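To make that move concrete, here is a minimal sketch of the return-time anonymization, using an in-memory SQLite database as a stand-in for Koha's database. This is not Koha's actual code: the trimmed-down columns, the return_item helper, the keep_history flag, and the anonymous borrowernumber of 999 are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical borrowernumber of the "Anonymous" patron in this toy system.
ANONYMOUS_BORROWERNUMBER = 999

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Heavily simplified stand-ins for the issues and old_issues tables.
    CREATE TABLE issues (issue_id INTEGER PRIMARY KEY, borrowernumber INTEGER,
                         itemnumber INTEGER, issuedate TEXT, returndate TEXT);
    CREATE TABLE old_issues (issue_id INTEGER PRIMARY KEY, borrowernumber INTEGER,
                             itemnumber INTEGER, issuedate TEXT, returndate TEXT);
    INSERT INTO issues VALUES (1, 51, 101, '2021-08-01 10:00:00', NULL);
""")


def return_item(conn, itemnumber, returndate, keep_history):
    """Move a checkout from issues to old_issues, anonymizing it if requested."""
    row = conn.execute(
        "SELECT issue_id, borrowernumber, issuedate FROM issues WHERE itemnumber = ?",
        (itemnumber,),
    ).fetchone()
    if row is None:
        return None  # the item was not checked out
    issue_id, borrowernumber, issuedate = row
    # Patrons who never keep reading history get swapped for the anonymous patron.
    stored = borrowernumber if keep_history else ANONYMOUS_BORROWERNUMBER
    with conn:  # both statements commit together
        conn.execute(
            "INSERT INTO old_issues VALUES (?, ?, ?, ?, ?)",
            (issue_id, stored, itemnumber, issuedate, returndate),
        )
        conn.execute("DELETE FROM issues WHERE issue_id = ?", (issue_id,))
    return stored


stored_borrower = return_item(conn, 101, "2021-08-15 14:30:00", keep_history=False)
remaining = conn.execute("SELECT COUNT(*) FROM issues").fetchone()[0]
archived = conn.execute("SELECT borrowernumber, itemnumber FROM old_issues").fetchall()
```

After the return, the toy issues table is empty and the old_issues row points at the anonymous patron rather than borrower 51.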

At this point, the issues and old_issues tables know that this checkout happened and when it happened, but they can't accurately tell me anything about the patron who was involved.

Meanwhile, those same transactions were recorded in the statistics table:

Statistics is repeating some of the data from issues and old_issues -- we've got the itemnumber, borrowernumber, and date and time of checkout again. Notice this borrowernumber is retained, even though we're anonymizing reading history for this patron. Technically speaking, we're maintaining a link in our data between this patron and item here even as we work to remove that same link in the issues and old_issues tables.

We keep that link in order to have some statistical borrower data to work with. Want to see how many patrons of a certain category checked items out in a date range? The statistics table doesn't record that category directly, but you can join in the borrowers table using that borrowernumber to find the patrons' categories. The same is generally true for item data. We don't record everything about your items here in statistics. If you wanted to count checkouts of items in a specific call number range, you'd need to use the itemnumber to join in the items table to find those values.
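As a rough illustration of that join, the sketch below counts checkouts per patron category in a date range. It runs against an in-memory SQLite database with drastically simplified versions of the statistics and borrowers tables; the sample rows and the checkouts_by_category helper are made up for the example. In a real Koha report you would write just the SQL, against your actual tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-ins; the real tables have many more columns.
    CREATE TABLE borrowers (borrowernumber INTEGER PRIMARY KEY, categorycode TEXT);
    CREATE TABLE statistics (datetime TEXT, type TEXT,
                             itemnumber INTEGER, borrowernumber INTEGER);
    INSERT INTO borrowers VALUES (1, 'ADULT'), (2, 'CHILD');
    INSERT INTO statistics VALUES
        ('2021-06-01 10:00:00', 'issue', 101, 1),
        ('2021-06-02 11:00:00', 'issue', 102, 1),
        ('2021-06-03 12:00:00', 'issue', 103, 2),
        ('2021-06-03 12:30:00', 'return', 101, 1),
        ('2021-08-01 09:00:00', 'issue', 104, 2);
""")


def checkouts_by_category(conn, start, end):
    """Count checkouts per *current* patron category for start <= datetime < end."""
    rows = conn.execute(
        """
        SELECT b.categorycode, COUNT(*)
        FROM statistics s
        JOIN borrowers b ON b.borrowernumber = s.borrowernumber
        WHERE s.type = 'issue' AND s.datetime >= ? AND s.datetime < ?
        GROUP BY b.categorycode
        ORDER BY b.categorycode
        """,
        (start, end),
    ).fetchall()
    return dict(rows)


june_counts = checkouts_by_category(conn, "2021-06-01", "2021-07-01")
print(june_counts)
```

Note that the category comes from the borrowers table, so a deleted patron drops out of this inner join entirely and never gets counted.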

For both borrowers and items, those joins get more complicated (or even flat-out impossible) if the items or patrons in question have been deleted. The deleteditems table should have all of our deleted item records, but a report that looks at both items and deleteditems is inherently going to be fairly unwieldy and slow. The same would be true of a report that looks at both borrowers and deletedborrowers, though in that case we also have the complication that it's entirely possible to delete borrowers such that they bypass the deletedborrowers table entirely.

One final complication here: if we're joining in the borrowers or items tables, we should also be aware that we're reporting on these borrowers and items as they exist *now* rather than as they existed when the checkout happened. For example, I may be looking at checkouts from the previous calendar year for patrons in an Adult category. As the data exists now, I would surely end up counting checkouts to patrons who were Children at the time of their checkout and have since turned 18 and changed to Adult. Likewise, in my call number example above, I'd be getting those items' call numbers as they are now, not as they were at checkout.

The exceptions to this are the few item values we do write into the statistics table: itemtype, location, and ccode. Those values are what the item had at point of checkout. We don't record any patron data in this fashion, because Koha generally avoids saving any more identifiable patron data than it needs to.

So, overall, the data we have available is a bit too sparse, a bit too identifiable, and a bit too subject to loss due to patron and item deletion.

More and better data

The pseudonymization feature gives us a way to tell Koha to store a lot more data about our transactions and to do so in a way that cannot be connected back to a specific patron. Once you're upgraded, look for the system preference "Pseudonymization" under Patrons and Security. This will be turned off by default. There's an enable/disable switch and then two dropdowns in which to specify which data you want to keep. For now, I'm just selecting all data points for retention.

Turning on pseudonymization will not change anything about the data processes in the issues, old_issues, and statistics tables I detailed above. However, for each transaction you perform Koha will also record data in the new pseudonymized_transactions table. Here are the values in that table for a checkout and return to the same example patron I used above:

Here we've recorded all of the transaction, borrower, and item values that we told the system preferences we wanted to retain. But instead of the actual borrowernumber, we have hashed_borrowernumber. That's a bunch of gibberish that we cannot connect back to an actual patron. The hashed value is unique and consistent, so we can use it to count how many distinct borrowers we had in a date range or how many transactions this specific borrower had, but we can't figure out who they actually are.
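If "unique and consistent" feels abstract, the sketch below shows the general idea. It does not reproduce Koha's own hashing. It swaps in a keyed HMAC-SHA-256 from Python's standard library purely to illustrate the property, and the secret key, the pseudonymize helper, and the borrowernumbers are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical site secret; without it, the hashes cannot be recomputed.
SECRET_KEY = b"hypothetical-site-secret"


def pseudonymize(borrowernumber: int) -> str:
    """Return a stable, opaque stand-in for a borrowernumber."""
    message = str(borrowernumber).encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


# Three toy transactions: two by borrower 51, one by borrower 73.
transactions = [
    (pseudonymize(51), "issue"),
    (pseudonymize(51), "return"),
    (pseudonymize(73), "issue"),
]

# The same patron always yields the same hash, so distinct counts still work.
distinct_borrowers = len({hashed for hashed, _ in transactions})
print(distinct_borrowers)
```

Because the same input always produces the same output, you can group and count by the hashed value, but the value itself tells you nothing about which patron it came from.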

We can also record extended patron attributes in this manner. To do so, we need to indicate in the attribute setup that it should be retained for pseudonymization. If the attribute is set for retention and the patron has a value for it, that value will get recorded in pseudonymized_borrower_attributes at the point of transaction. Note that this table joins back to pseudonymized_transactions on transaction_id.
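Here is a hedged sketch of that join, again using in-memory SQLite with trimmed-down versions of the two tables. The SCHOOL attribute code, the sample rows, and the checkouts_by_attribute helper are invented for the example; only the join on transaction_id comes from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-ins for the two pseudonymized tables.
    CREATE TABLE pseudonymized_transactions (
        id INTEGER PRIMARY KEY, hashed_borrowernumber TEXT, categorycode TEXT,
        datetime TEXT, transaction_type TEXT, itemtype TEXT);
    CREATE TABLE pseudonymized_borrower_attributes (
        id INTEGER PRIMARY KEY, transaction_id INTEGER, code TEXT, attribute TEXT);
    INSERT INTO pseudonymized_transactions VALUES
        (1, 'aaa', 'ADULT', '2021-09-01 10:00:00', 'issue', 'BK'),
        (2, 'aaa', 'ADULT', '2021-09-02 10:00:00', 'return', 'BK'),
        (3, 'bbb', 'CHILD', '2021-09-03 10:00:00', 'issue', 'DVD'),
        (4, 'ccc', 'CHILD', '2021-09-04 10:00:00', 'issue', 'BK');
    -- Transaction 4 has no attribute row: that patron had no SCHOOL value.
    INSERT INTO pseudonymized_borrower_attributes VALUES
        (1, 1, 'SCHOOL', 'North'),
        (2, 2, 'SCHOOL', 'North'),
        (3, 3, 'SCHOOL', 'South');
""")


def checkouts_by_attribute(conn, code):
    """Count checkouts per attribute value, joining on transaction_id."""
    return conn.execute(
        """
        SELECT a.attribute, COUNT(*)
        FROM pseudonymized_transactions t
        JOIN pseudonymized_borrower_attributes a ON a.transaction_id = t.id
        WHERE t.transaction_type = 'issue' AND a.code = ?
        GROUP BY a.attribute
        ORDER BY a.attribute
        """,
        (code,),
    ).fetchall()


by_school = checkouts_by_attribute(conn, "SCHOOL")
print(by_school)
```

With an inner join like this, transactions for patrons who had no value for the attribute simply don't appear; a LEFT JOIN would keep them with an empty attribute instead.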

All of these pseudonymized fields are recording values as they existed at point of checkout. If the item later moves to a new collection code or the patron later changes categories, the values in these pseudonymized tables will not change. Further, these tables are not impacted by deletions of items or patrons.

Altogether, if you turn on pseudonymization and set up some new reports to use this new data, you will have a more robust and stable collection of statistical data than you could achieve without this feature.

More deletion (and thereby more privacy)

Once you've started saving all this exciting new pseudonymized data, then you have the option of letting go of old patron-identifying data that you may have been holding onto for reporting purposes. That might mean your library can start allowing patrons to opt out of reading history retention or that you can feel free to delete old patron records. We expect a lot of partners will elect to start deleting old entries in the statistics table, since that data can link a patron to an item. Koha's known how to automate that deletion using the cleanup_database cron since 20.05. With this new pseudonymization feature we can put it to use!
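For a sense of what that cleanup amounts to, here is a small sketch of an age-based purge of the statistics table. It illustrates the effect only and is not the cron script itself: the purge_statistics helper, the simplified table, and the 365-day retention period are assumptions for the example.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-in for the statistics table.
    CREATE TABLE statistics (datetime TEXT, type TEXT,
                             itemnumber INTEGER, borrowernumber INTEGER);
    INSERT INTO statistics VALUES
        ('2019-01-15 10:00:00', 'issue', 101, 51),
        ('2020-03-01 11:00:00', 'issue', 102, 51),
        ('2021-08-20 12:00:00', 'issue', 103, 73);
""")


def purge_statistics(conn, now, days):
    """Delete statistics rows older than `days` days and return how many went."""
    cutoff = (now - timedelta(days=days)).strftime("%Y-%m-%d %H:%M:%S")
    with conn:
        cursor = conn.execute("DELETE FROM statistics WHERE datetime < ?", (cutoff,))
    return cursor.rowcount


# A fixed "now" keeps the example repeatable.
deleted = purge_statistics(conn, datetime(2021, 9, 1), days=365)
kept = conn.execute("SELECT COUNT(*) FROM statistics").fetchone()[0]
print(deleted, kept)
```

The rows that linked a specific borrowernumber to a specific itemnumber are gone after the purge, while anything already written to the pseudonymized tables is untouched.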

