Basic perturbation requires two main components to work:
- R Keys in the unit records. These help to make the results of the perturbation consistent and repeatable.
- The p-Table. This controls the actual cell adjustments, and is key to ensuring that no bias is introduced into the results.
R Keys in Unit Records
To use perturbation, each unit record in your dataset must have a random number added to it. This value is referred to as an R Key.
The R Keys are what make perturbation repeatable and consistent. They are used as part of the calculation for perturbing any given cell value, ensuring that any given combination of unit records will always be perturbed the same way.
The p-Table
The p-Table is a lookup table that contains all the possible cell adjustments.
As the adjustments are calculated in advance, the p-Table provides full control over how values get adjusted. This allows you to ensure that no bias is introduced into the results.
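One way to picture "no bias" is to check that the adjustments in each column of a p-Table average out to zero. The following is only a minimal sketch: the tiny table, the function name column_bias, and the assumption that every row is equally likely to be selected are all illustrative and are not taken from SuperSTAR itself.

```python
from statistics import mean


def column_bias(p_table):
    """Return the mean adjustment for each column of a p-Table.

    ASSUMPTION: every row is equally likely to be selected, so a column
    mean of zero means that column adds no bias on average.
    """
    n_cols = len(p_table[0])
    return [mean(row[c] for row in p_table) for c in range(n_cols)]


# Hypothetical 4-row, 3-column p-Table, for illustration only.
toy_p_table = [
    [0, +1, -2],
    [0, -1, +2],
    [0, +1, -1],
    [0, -1, +1],
]

print(column_bias(toy_p_table))
```

A column whose mean is not zero would systematically push cell values of that size up or down.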
Calculating the Cell Adjustment
To calculate how to perturb a cell, SuperSTAR uses a combination of:
- The R Keys in the unit records that contributed to that cell value.
- The unperturbed cell value.
- The p-Table.
The following diagram shows an example of how the perturbation for a given cell value is calculated:
- SuperSERVER finds all the records that contributed to the value and obtains their R Keys from the unit records. In the example below the highlighted cell value is 3, meaning that there are 3 unit records contributing to the count, so SuperSERVER finds the R Keys for these 3 unit records.
- SuperSERVER runs a calculation that combines these R Key values and results in a value between 1 and 256. This identifies a row in the p-Table that will be used to select the cell adjustment. In the example below, the calculation has resulted in a value of 6.
- SuperSERVER uses the unperturbed cell value to select a column in the p-Table. In the example below, the unperturbed value is 3, so it will use the third column in the p-Table.
- SuperSERVER finds the cell in the p-Table at the intersection of this row and column. This is the cell adjustment that will be applied. In the example below, the unperturbed value of 3 would be adjusted by -2, giving a perturbed cell value of 1.
Although this example shows the calculation for just one cell in the table, the perturbation algorithm runs on all of the cells in the table (including large values). This is very important to help prevent differencing attacks on the table values.
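The lookup steps described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the SuperSERVER implementation: the way the R Keys are combined is not described here, so a sum followed by a modulo stands in for it, and the p-Table is a placeholder of zeros with a single entry set to match the worked example. The names p_table_row and perturb, and the R Key values, are hypothetical.

```python
def p_table_row(r_keys, n_rows=256):
    """Combine R Keys into a row number between 1 and n_rows.

    ASSUMPTION: sum-then-modulo is a stand-in for the real combination.
    It is order independent, so the same set of records always gives
    the same row.
    """
    return sum(r_keys) % n_rows + 1


def perturb(cell_value, r_keys, p_table):
    """Look up the adjustment for a cell and return the perturbed value."""
    row = p_table_row(r_keys, n_rows=len(p_table))
    col = cell_value  # the unperturbed value selects the column (1-based)
    if not 1 <= col <= len(p_table[0]):
        # This sketch only covers values that have their own column.
        raise ValueError("no column for this cell value in the toy p-Table")
    adjustment = p_table[row - 1][col - 1]
    return cell_value + adjustment


# Placeholder p-Table: 256 rows by 4 columns of zero adjustments, with the
# entry at row 6, column 3 set to -2 to mirror the example.
p_table = [[0] * 4 for _ in range(256)]
p_table[6 - 1][3 - 1] = -2

r_keys = [100, 200, 217]  # hypothetical R Keys of the 3 contributing records

print(p_table_row(r_keys))          # 6
print(perturb(3, r_keys, p_table))  # 1
```

Because the stand-in combination does not depend on record order, the same three records lead to row 6 however the table is laid out.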
Automatically Generate R Keys
As described above, the R Keys are crucial to ensuring that perturbation is consistent and repeatable. Provided the R Keys on the individual unit records do not change, any given combination of unit records will always lead to the same row in the p-Table, and hence the same cell value adjustment, no matter how a user structures the table.
You have two choices for generating the R Keys:
- Manually add a column called <fact_table>_Rkey to your fact table in the source data and populate each record with a random integer. Include this column in your channelling project with the Usage type set to Measure.
- Use the option in SuperCHANNEL to have the R Keys generated automatically during the channelling process.
If you are likely to update your SXV4 in future with modified data, it is recommended that you add the R Keys to the source data manually. This is because the R Key for any given record needs to stay the same to ensure that perturbation is consistent. Adding the R Keys yourself ensures that each unit record will always keep the same R Key.
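As a rough sketch of the manual approach, the snippet below assigns an R Key only to records that do not have one yet, so existing records keep their keys when the source data is updated. The fact table name Person, the id field, the key range, and the function add_missing_rkeys are all assumptions made for illustration. In practice the keys would be stored in the source data itself.

```python
import random

RKEY_COLUMN = "Person_Rkey"  # hypothetical fact table named "Person"


def add_missing_rkeys(records, rng, max_key=2**31 - 1):
    """Give each record a random integer R Key unless it already has one.

    ASSUMPTION: the 0..max_key range is illustrative only.
    """
    for record in records:
        if record.get(RKEY_COLUMN) is None:
            record[RKEY_COLUMN] = rng.randint(0, max_key)
    return records


rng = random.Random()

# Initial load: two records, both receive a new R Key.
records = add_missing_rkeys([{"id": 1}, {"id": 2}], rng)
keys_before = {r["id"]: r[RKEY_COLUMN] for r in records}

# Later update: record 1 is removed and record 3 is added.
records = [r for r in records if r["id"] != 1] + [{"id": 3}]
add_missing_rkeys(records, rng)

# Record 2 still has the R Key it was given at the initial load.
print(records[0][RKEY_COLUMN] == keys_before[2])  # True
```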
If you choose the automatic method, then there are some situations where SuperCHANNEL will assign the same R Keys when you rebuild the SXV4, but in other cases this will not happen, meaning that the perturbed results would not be consistent before and after the SXV4 update.
Given the same seed value, SuperCHANNEL will always generate R Keys in the same sequence, so if the row order does not change (i.e., your only change to the data is to append new records to the end) then you can use the automatic method and the results will be consistent. However, if you are likely to be removing records, or making other changes that alter the order of the records in the source, then SuperCHANNEL will assign different R Keys when you rebuild the SXV4.
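The effect of row order on seeded generation can be illustrated with any seeded generator that hands out keys in sequence. The sketch below uses Python's random module as a stand-in. It does not reproduce SuperCHANNEL's generator, and the function assign_rkeys, the record identifiers, and the key range are hypothetical.

```python
import random


def assign_rkeys(record_ids, seed, max_key=2**31 - 1):
    """Assign keys in row order from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return {rid: rng.randint(0, max_key) for rid in record_ids}


original = assign_rkeys(["a", "b", "c"], seed=42)
appended = assign_rkeys(["a", "b", "c", "d"], seed=42)  # new record at the end
removed = assign_rkeys(["b", "c"], seed=42)             # record "a" deleted

# Appending leaves every existing record with the same key.
print(all(appended[rid] == original[rid] for rid in original))  # True

# After the removal, "b" sits in the first row and receives the key
# that "a" used to have, so its perturbation would change.
print(removed["b"] == original["a"])  # True
```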