-
Notifications
You must be signed in to change notification settings - Fork 0
/
index.Rmd
306 lines (206 loc) · 20.8 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
---
title: "Statistical Disclosure Control using sdcTable"
author: "[Home](https://brendanjodowd.github.io)"
output:
html_document:
css: style.css
toc: true
toc_float: true
toc_collapsed: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{=html}
<style>
div.blue { background-color:#e6f0ff; border-radius: 5px; padding: 20px;}
</style>
```
## What this guide is about
This is a guide for implementing Statistical Disclosure Controls (SDC) on tabular data using R with the [sdcTable package](https://cran.r-project.org/web/packages/sdcTable/vignettes/sdcTable.html). It will include a range of different circumstances that I have encountered.
This guide does not include an introduction into what SDC is all about. The CSO have produced a very good document on [Best Practice for Statistical Disclosure Control of Tabular Data](https://www.cso.ie/en/media/csoie/aboutus-new/dataforresearchers/CSO_Guidance_on_Tabular_SDC.docx) (docx, 365KB) which includes an overview of the subject.
Statistical Disclosure Controls are measures that protect against the publication of information relating to an individual person, company or organisation. Implementing robust SDC provides assurance to data providers that their data won't be inadvertently made public. As mentioned already, this guide covers cell suppression in tabular outputs, which is generally required where certain cross-tabulations are sparsely populated. 'Unsafe cells' are identified for *primary suppression*. Usually other cells need to be suppressed too, so that the value of an unsafe cell cannot be deduced using the aggregate totals and remaining data, and this is called *secondary suppression*.
## Setup
As well as the [sdcTable package](https://cran.r-project.org/web/packages/sdcTable/vignettes/sdcTable.html) for R, we will be using the [tidyverse](https://tidyverse.tidyverse.org/) collection of packages throughout.
```{r packages , message=FALSE, warning=FALSE}
library(tidyverse)
library(sdcTable)
```
The guide uses a couple of datasets of fictional companies which can be downloaded from [here](https://github.com/brendanjodowd/sdc_guide/tree/main/input) in csv format.
We will start with a dataset called `company_sales_1` which can be imported into R as a csv. This dataset contains sales information for 60 fictional companies located in Connaught. There are five columns, the ID and Name of each company, then the County in which it is located, a Y/N indicator to indicate if the company received some kind of Aid, and the value of Sales for the year.
```{r import_1 , message=FALSE, warning=FALSE}
company_sales_1 <- read_csv("input/company_sales_1.csv")
head(company_sales_1)
```
## Simple run-through
Here we will produce aggregate sales data broken down across the two dimensions, County and Aid. We want aggregates for all Counties ("Connaught") and all companies including those that did and did not receive Aid.
The key steps are as follows:
1. **Create hierarchies using `hier_create`**
2. **Create sdcProblem object using `makeProblem`**
3. **Primary suppression using `primarySuppression`**
4. **Secondary suppression using `protectTable`**
5. **Review outputs using `getInfo`**
Later sections will show more complex cases which will be presented as variations of this simple example.
The objects returned by the functions `makeProblem`, `primarySuppression` and `protectTable` have a class and structure which is specific to the sdcTable package. Briefly, these objects contain information on the raw data, the dimensions and the suppression process. We will see how to extract information from them as the process goes on.
### 1. Defining hierachies
We need to define a hierarchy for each of our two dimensions `County` and `Aid` which we shall call `dimCounty` and `dimAid`. These are created using the `hier_create` function from the `sdcTable` package as shown below. The total category is identified as the 'root' of the hierarchy. Note that this total name/classification does not exist in the microdata. The underlying classifications are then created using the `nodes` argument to the function. There are other ways of creating hierarchies, either as a matrix or by creating them in a csv and importing that, which could be useful if you have a hierarchy with a large number of classifications.
```{r hier_1 , message=FALSE, warning=FALSE}
dimCounty <- hier_create(root = "Connaught",
nodes = c("Galway",
"Mayo",
"Roscommon",
"Sligo",
"Leitrim"))
dimAid <- hier_create(root = "All companies",
nodes = c("Y", "N"))
```
We can view our hierarchies using the function `hier_display`.
```{r display_1 , message=FALSE, warning=FALSE}
hier_display(dimCounty)
hier_display(dimAid)
```
### 2. Setting up the SDC problem
We use the function `makeProblem` to create an object of class `sdcProblem` as shown below. We are making use of three arguments:
- `data`, which is the dataset containing the microdata;
- `dimList`, which is a list object containing the hierarchies that we have created, each identified using a name which matches the name of the variable as it appears in the microdata; and
- `numVarInd`, the name of the variable containing the numerical data that we want to aggregate.
```{r prob_1 , message=FALSE, warning=FALSE}
sales_sdc_problem <- makeProblem(data = company_sales_1,
dimList = list(County = dimCounty, Aid = dimAid),
numVarInd = "Sales")
```
The object `sales_sdc_problem` has a specific class called 'sdcProblem'. It contains information about the microdata, the hierarchies, and protection parameters generated by the function `makeProblem`. We can see the structure if we click on the object in Rstudio, but there are two useful functions for extracting information from these objects that you should know about. The first is the base function `print`, which outputs a summary of the suppression process so far.
```{r show_1 , collapse=T}
print(sales_sdc_problem)
```
The second useful function for examining sdcProblem objects is `sdcProb2df`, which comes with the sdcTable package. As the name suggests, it converts the sdcProblem object into a dataframe, which is convenient. You can assign the output to a new object if you like, here I am just showing the output. You will see that there is a column called 'sdcStatus'. Each value is equal to 's' which indicates that for now each cell is considered 'safe'. We will see later that some cells will be marked with 'u' for 'unsafe' by the primary suppression. Later there will be cells marked 'x' which are deemed unsafe by the secondary suppression.
```{r sdcProb2df_1 , message=FALSE, warning=FALSE}
sdcProb2df(sales_sdc_problem)
```
There is an additional argument to `sdcProb2df` that you might find useful which is `addNumVars`. If it is set to `TRUE` (it is `FALSE` by default) will include the aggregated numerical variable as an extra column in the output.
### 3. Primary suppression
Primary suppression is carried out using the function `primarySuppression`. The first argument is the sdcProblem object created by the function `makeProblem` (in our case this is `sales_sdc_problem`). The second argument is `type`, and this can take one of four values:
- `"freq"` for frequency rule with additional parameter `maxN` for the maximum number of entities in an unsafe cell (default value is 3)
- `"nk"` for nk-dominance rule with additional parameters `n` and `k` (defaults of 2 and 85 respectively, meaning that in a safe cell two entities cannot contribute more than 85% of the aggregate value)
- `"p"` for p-percent rule with additional parameter `p` (default value of 80, meaning that in a safe cell, if the two largest contributors were removed then the remaining value has to be greater than 80% of the largest contributor)
- `"pq"` for the pq-rule with additional parameter `pq` which is a vector of length two. (The default value of pq is `c(25,50)`, meaning that in a safe cell, if the two largest contributors are removed then 50% of the remaining value should be greater than 25% of the largest contributor.)
For the `"nk"`, `"p"` and `"pq"` options there is one further mandatory argument called `numVarName` which is required to define the name of numerical variable.
Here we'll use the nk-dominance rule with default parameters:
```{r prim_1 , message=FALSE, warning=FALSE}
sales_sdc_primary <- primarySuppression(sales_sdc_problem,
type = "nk",
numVarName = "Sales")
```
The output returned by the function `primarySuppression` is also of a class sdcProblem, so we can use the functions `print` and `sdcProb2df` on `sales_sdc_primary`. Let's look at the output from `print`, which reveals that one cell is unsafe and will require primary suppression. If you look at the dataframe returned by `sdcProb2df(sales_sdc_primary)` you can see which cell is marked 'u' for 'unsafe'.
```{r show_2 , collapse=T}
print(sales_sdc_primary)
```
### 4. Secondary suppression
Secondary suppression is carried out using the function `protectTable` which has two main arguments. The first argument is the sdcProblem object to which primary suppression has already been applied (`sales_sdc_primary` in our case). The second main argument is `method` which allows us to choose an algorithm for secondary suppression. There are four options, and here I am copying the descriptions from the [Help page for this function](https://www.rdocumentation.org/packages/sdcTable/versions/0.32.2/topics/protectTable):
- "OPT": protect the complete problem at once using a cut and branch algorithm. The optimal algorithm should be used for small problem-instances only.
- "HITAS": split the overall problem in smaller problems. These problems are protected using a top-down approach.
- "HYPERCUBE": protect the complete problem by protecting sub-tables with a fast heuristic that is based on finding and suppressing geometric structures (n-dimensional cubes) that are required to protect primary sensitive table cells.
- "SIMPLEHEURISTIC" and "SIMPLEHEURISTIC_OLD": heuristic procedures which might be applied to large(r) problem instances.
There are some other arguments for parameters to these algorithms that can be varied but I have not used them. You can investigate them yourself on the [Help page for `protectTable`](https://www.rdocumentation.org/packages/sdcTable/versions/0.32.2/topics/protectTable).
Here we will use the HITAS method. The resulting object has both primary and secondary suppression applied. The next section shows how to view the results.
```{r sec_1 , message=FALSE, warning=FALSE}
sales_sdc_secondary <- protectTable(sales_sdc_primary, method = "HITAS")
```
### 5. Reviewing outputs
As already described, the functions `makeProblem` and `primarySuppression` returned objects of type 'sdcProblem', and these were analysed using the functions `print` and `sdcProb2df`. The class of object returned by `protectTable` depends on which version of the sdcTable package you are using. In the current version (0.32) `protectTable` also returns a sdcProblem object, but in earlier versions (up to 0.31) it returns a slightly different class of object called 'safeObj', and this cannot be examined using `print` and `sdcProb2df`. However, when it comes to the output from the secondary suppression I find that the best tool to analyse the output is the `getInfo` function, and this can be used in the same way regardless of which version of the package you are using.
I am going to create the output in a dataframe called `sdc_output` using `getInfo` applied to the output from the secondary suppression (`sales_sdc_secondary`) and an additional argument `type = "finalData"` to request the final data from the suppression.
In the dataframe `sdc_output` you will see that there are five columns. The first two are our two dimensions, County and Aid. Note that these classifications appear in the order that they were defined when the hierarchies were created. Then there is `Freq` for the frequency or number of companies in each cross-tabulation. Next is `Sales`, which is the aggregate value of Sales for each cross-tabulation. Finally there is `sdcStatus` which shows the suppression status. We have already seen that it can take values of `s` for 'safe' cells and `u` for cells deemed 'unsafe' by the primary suppression. Now we see a third value, `x`, for cells that require secondary suppression. There is one 'unsafe' cell for companies in Leitrim that received Aid, and a related secondary suppression for Leitrim companies that did not receive Aid. Secondary suppression of one other county is also required so that the Leitrim cells cannot be calculated from the total (Connaught) line. In this case the algorithm has chosen Sligo (rows 14 and 15).
```{r getInfo_final , message=FALSE, warning=FALSE}
sdc_output <- getInfo(sales_sdc_secondary , type = "finalData")
print(sdc_output)
```
The variable `sdcStatus` can also take a value `z`, and we will see later that this can come about when we override the secondary suppression to publish certain cross-tabulations.
## Multi-level hierarchies
Let's examine a very similar situation to the first example, except now instead of having 20 companies from 'Galway' we have some which are from 'Galway City' and others which are from 'County Galway'. The remaining counties of Mayo, Roscommon, Sligo and Leitrim are the same.
```{r import_2 , message=FALSE, warning=FALSE}
company_sales_2 <- read_csv("input/company_sales_2.csv")
head(company_sales_2, 12)
```
In our output tables we want to have a subtotal line for Galway as a whole as well as the total line for Connaught. This is accomplished by creating the hierarchy 'dimCounty' as before (we'll refer to Galway as 'Galway Total' this time), but then breaking down the node 'Galway Total' into two further nodes using the function `hier_add`. The function `hier_add` is very similar to `hier_create` except that it requires the name of the hierarchy being edited as its first argument. Note that 'Galway Total' will appear in the output table but, as with 'Connaught', it does not appear in the microdata.
You can add additional breakdowns through repeated use of `hier_add`, each time choosing an existing node to use as a root with lower nodes attached.
```{r hier_2 , message=FALSE, warning=FALSE}
dimCounty_GalwayBreakdown <- hier_create(root = "Connaught",
nodes = c("Galway Total",
"Mayo",
"Roscommon",
"Sligo",
"Leitrim"))
dimCounty_GalwayBreakdown <- hier_add(dimCounty_GalwayBreakdown,
root = "Galway Total",
nodes = c("Galway City",
"Galway County"))
hier_display(dimCounty_GalwayBreakdown)
```
Now let's run the same suppression as before. We'll do it in a single step this time by piping the various functions together:
```{r sec_2 , message=FALSE, warning=FALSE}
sales_sdc_secondary <- makeProblem(data = company_sales_2,
dimList = list(County = dimCounty_GalwayBreakdown,
Aid = dimAid),
numVarInd = "Sales") %>%
primarySuppression(type = "nk" , numVarName = "Sales") %>%
protectTable(method = "HITAS")
print(getInfo(sales_sdc_secondary , type = "finalData"))
```
## Statistics that are also hierarchical
Let's examine a situation where we have statistics which we want to publish, but are inter-related. In the file `company_data_1` we have 'Sales', 'Input Costs' and 'Value Added', where 'Value Added' is equal to 'Sales' minus 'Input Costs'.
```{r import_3 , message=FALSE, warning=FALSE}
company_data_1 <- read_csv("input/company_data_1.csv")
head(company_data_1)
```
Because of this relationship, an unsafe cell in one statistic will require that secondary suppression be applied to at least one of the other two statistics, otherwise the unsafe cell would be easily calculated using the remaining data. We need to consider 'Statistic' as a dimension and create a hierarchy for it. But first we need to pivot the data into a longer format:
```{r pivot_1 , message=FALSE, warning=FALSE}
company_data_1 <- company_data_1 %>%
pivot_longer(cols = 4:6 , names_to = "Statistic")
head(company_data_1)
```
Now we can make our hierarchy for Statistic. Note that 'Sales' is the root since it is the sum of the other two.
```{r hier_3 , message=FALSE, warning=FALSE}
dimStatistic <- hier_create(root = "Sales",
nodes = c("Input Costs",
"Value Added"))
hier_display(dimStatistic)
```
Ok, let's use this and carry out the primary and secondary suppressions. For the County dimension we will use the same hierarchy as in the first example, `dimCounty`.
```{r sec_3 , message=FALSE, warning=FALSE}
company_data_suppressed <- makeProblem(data = company_data_1 ,
dimList = list(County = dimCounty,
Statistic = dimStatistic),
numVarInd = "value") %>%
primarySuppression(type = "nk" , numVarName = "value") %>%
protectTable(method = "HITAS")
print(getInfo(company_data_suppressed , type = "finalData"))
```
Recall that the root of the hierarchy `dimCounty`, which is 'Connaught', does not appear in the microdata, and that these total values are calculated in the course of the data suppression process. However here we can see that the root of the hierarchy `dimStatistic`, 'Sales', does appear in the microdata, we have not filtered it out. The sdcTable package is able to handle both circumstances. If we had filtered out the Statistic 'Sales' from the dataframe `company_data_1` before starting the suppression process then the results would have been the same.
## Dimensions without a sum
Sometimes we want to provide a breakdown for a dimension but we don't want to publish a total figure. This could be the case with time series data, where we need tabular data for each day or month or year, but we don't plan on publishing some total like 'All years'. In principle, we could create a hierarchy for a variable like 'Year' with nodes representing each individual year and a root called 'All years', and then we could just discard the data for 'All years' and keep the data for the individual years. The problem with this approach is that it would lead to unnecessary secondary suppressions, because the algorithm would assume that there are more routes for calculating unsafe cells than there actually are. Instead, the data for each year needs to be suppressed separately. This could be done in a loop, as the following example shows.
Let's look at `company_sales_3` which includes sales data for 60 companies across five years from 2010 to 2014.
```{r import_4 , message=FALSE, warning=FALSE}
company_data_3 <- read_csv("input/company_sales_3.csv")
head(company_data_3)
```
We want to create tabular data with dimensions of County and Year, but with totals for County only. The way I would do this is to use a hierarchy for County (`dimCounty` as defined earlier) and loop over Year. As we loop over the years we can store all the suppression data in a list. So we start by making an empty list, which we call `suppression_list`. The suppression data that I am storing is the `"finalData"` output from the `getInfo` function. Before I store it in the list I need to add the Year using the `mutate` function because it is lost through the suppression process. At the very end the list is converted into a single dataframe called `suppressed_data` using the function `bind_rows`.
```{r years_loop , message=FALSE, warning=FALSE}
suppression_list <- list() # make an empty list to store the suppression data
years_vector <- 2010:2014 # create a vector of the years
number_of_years <- length(years_vector) # this is just equal to 5
for (i in 1:number_of_years){ # loop over the years
temp_microdata <- company_data_3 %>%
filter(Year == years_vector[i]) # filter out the data for this year
suppression_list[[i]] <- temp_microdata %>% # carry out an "nk" type suppression
makeProblem(dimList = list(County = dimCounty), numVarInd = "value") %>%
primarySuppression(type = "nk" , numVarName = "value") %>%
protectTable(method = "HITAS") %>%
getInfo(type = "finalData") %>% # extract the final data
mutate(Year = years_vector[i]) # create the variable Year again
}
# convert the list into a single dataframe
suppressed_data <- bind_rows(suppression_list)
head(suppressed_data)
```
Note that I am looping over 1-5 rather than 2010-2014 so that I can use the index of the loop as the index of the list that I am creating (`suppression_list`).
## Protecting certain cross-tabluations from secondary suppression