-
Task
-
Resolution: Won't Fix
-
Low
-
None
The "Anonymise" plugin (https://github.com/moodlehq/moodle-local_anonymise) processes a Moodle site and hashes most potentially identifying fields, replacing varchar text with hashed strings. However, numerical fields such as grades remain untouched, so someone with access to the original site could potentially identify a user. For example, a student from the donor institution could identify themselves from their grades, which would then give them a fair chance of identifying their instructors and possibly other students.
Additionally, the resulting anonymised site is awkward to use for testing, because the hashed strings are long and distracting.
This request is in two parts:
1 - Replace hashed varchar values (as identified in the Anonymise plugin) with serialized, readable values concatenated with the context type, e.g. "Alpha Bravo Course", "Delta Foxtrot Section" or "Zebra Charlie Assignment".
2 - For a set of numerical fields (to be identified), introduce a small amount of variance, approximately 1/10 of the standard deviation of the existing values. In other words: aggregate across all the values in that context, calculate the standard deviation (SD), divide it by 10, multiply the result by a random value between -1 and 1, and add that to the original value. This will maintain the general structure of the data, allowing analytics to be replicated, but make it extremely difficult to identify individuals.
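As a rough illustration of part 2, the perturbation could look something like the sketch below. It is written in Python purely for readability and is not code from the Anonymise plugin. The function name `perturb`, the `fraction` parameter and the use of the population standard deviation are assumptions; the request does not say whether population or sample SD is intended.

```python
import random
import statistics


def perturb(values, fraction=0.1, rng=None):
    """Return a copy of values with small random noise added.

    Hypothetical helper: each value is shifted by
    (SD * fraction) * r, where r is uniform in [-1, 1].
    """
    rng = rng or random.Random()
    if len(values) < 2:
        # With fewer than two values there is no spread to measure.
        return list(values)
    # Assumption: population SD over all values in the same context.
    sd = statistics.pstdev(values)
    scale = sd * fraction  # fraction=0.1 gives the 1/10 SD from the request
    return [v + scale * rng.uniform(-1.0, 1.0) for v in values]


# Example: grades for one grade item (made-up numbers).
grades = [55.0, 62.5, 70.0, 88.0, 91.5]
noisy = perturb(grades, rng=random.Random(0))
```

Note that this sketch does not clamp the results to a valid grade range, which a real implementation would probably need to consider.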
Once this is done, our donated data sets can be used to generate meaningful sample data for QA, as well as being safe to publish for external research purposes.
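For part 1, one possible way to generate the readable names is sketched below, again in Python for illustration only. The word list (the NATO phonetic alphabet), the function name `readable_name` and the idea of deriving the name from a sequential record index are all assumptions; the request only gives example outputs such as "Alpha Bravo Course".

```python
# Assumption: the NATO phonetic alphabet as the word list.
WORDS = [
    "Alpha", "Bravo", "Charlie", "Delta", "Echo", "Foxtrot", "Golf",
    "Hotel", "India", "Juliett", "Kilo", "Lima", "Mike", "November",
    "Oscar", "Papa", "Quebec", "Romeo", "Sierra", "Tango", "Uniform",
    "Victor", "Whiskey", "Xray", "Yankee", "Zulu",
]


def readable_name(index, context_type, min_words=2):
    """Map a non-negative index to a unique readable name.

    Hypothetical helper: the index is written in base len(WORDS),
    one word per digit, padded to at least min_words words, and the
    context type (e.g. "Course") is appended.
    """
    if index < 0:
        raise ValueError("index must be non-negative")
    digits = []
    n = index
    while True:
        n, rem = divmod(n, len(WORDS))
        digits.append(WORDS[rem])
        if n == 0:
            break
    while len(digits) < min_words:
        digits.append(WORDS[0])  # pad with the "zero" word
    return " ".join(reversed(digits)) + " " + context_type


print(readable_name(1, "Course"))       # Alpha Bravo Course
print(readable_name(26, "Assignment"))  # Bravo Alpha Assignment
```

Because the name is just the index written with words as digits, names stay unique within a context type, and a third word appears automatically once more than 676 records of one type exist.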